2023-04-13 06:29:19

by Li Feng

Subject: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity

The default worker affinity policy uses all online cpus, i.e. from 0
to N-1. However, if some cpus are busy with other jobs, nvme-tcp will
have bad performance.

This patch adds a module parameter to set the cpu affinity for the nvme-tcp
socket worker threads. The parameter is a comma separated list of CPU
numbers. The list is parsed and the resulting cpumask is used to set the
affinity of the socket worker threads. If the list is empty or the
parsing fails, the default affinity is used.

Signed-off-by: Li Feng <[email protected]>
---
drivers/nvme/host/tcp.c | 54 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 49c9e7bc9116..a82c50adb12b 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -31,6 +31,18 @@ static int so_priority;
module_param(so_priority, int, 0644);
MODULE_PARM_DESC(so_priority, "nvme tcp socket optimize priority");

+/* Support for specifying the CPU affinity for the nvme-tcp socket worker
+ * threads. This is a comma separated list of CPU numbers. The list is
+ * parsed and the resulting cpumask is used to set the affinity of the
+ * socket worker threads. If the list is empty or the parsing fails, the
+ * default affinity is used.
+ */
+static char *cpu_affinity_list;
+module_param(cpu_affinity_list, charp, 0644);
+MODULE_PARM_DESC(cpu_affinity_list, "nvme tcp socket worker cpu affinity list");
+
+struct cpumask cpu_affinity_mask;
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
/* lockdep can detect a circular dependency of the form
* sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
@@ -1483,6 +1495,41 @@ static bool nvme_tcp_poll_queue(struct nvme_tcp_queue *queue)
ctrl->io_queues[HCTX_TYPE_POLL];
}

+static ssize_t update_cpu_affinity(const char *buf)
+{
+ cpumask_var_t new_value;
+ cpumask_var_t dst_value;
+ int err = 0;
+
+ if (!zalloc_cpumask_var(&new_value, GFP_KERNEL))
+ return -ENOMEM;
+
+ err = bitmap_parselist(buf, cpumask_bits(new_value), nr_cpumask_bits);
+ if (err)
+ goto free_new_cpumask;
+
+ if (!zalloc_cpumask_var(&dst_value, GFP_KERNEL)) {
+ err = -ENOMEM;
+ goto free_new_cpumask;
+ }
+
+ /*
+ * If the new_value does not have any intersection with the cpu_online_mask,
+ * the dst_value will be empty, then keep the cpu_affinity_mask as cpu_online_mask.
+ */
+ if (cpumask_and(dst_value, new_value, cpu_online_mask))
+ cpu_affinity_mask = *dst_value;
+
+ free_cpumask_var(dst_value);
+
+free_new_cpumask:
+ free_cpumask_var(new_value);
+ if (err)
+ pr_err("failed to update cpu affinity mask, bad affinity list [%s], err %d\n",
+ buf, err);
+ return err;
+}
+
static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
{
struct nvme_tcp_ctrl *ctrl = queue->ctrl;
@@ -1496,7 +1543,12 @@ static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
else if (nvme_tcp_poll_queue(queue))
n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] -
ctrl->io_queues[HCTX_TYPE_READ] - 1;
- queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
+
+ if (!cpu_affinity_list || update_cpu_affinity(cpu_affinity_list) != 0) {
+ // Set the default cpu_affinity_mask to cpu_online_mask
+ cpu_affinity_mask = *cpu_online_mask;
+ }
+ queue->io_cpu = cpumask_next_wrap(n - 1, &cpu_affinity_mask, -1, false);
}

static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
--
2.40.0
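
A note on the CPU selection in the last hunk above: cpumask_next_wrap(n - 1, mask, -1, false)
effectively returns the first CPU of the mask at or after queue index n, wrapping back to the
lowest CPU in the mask when it runs off the end. A minimal userspace C sketch of that selection
(a plain bitmask stands in for the kernel cpumask, and the helper name is made up for
illustration):

#include <stdio.h>

/*
 * Userspace model of the cpumask_next_wrap(n - 1, mask, -1, false) call in
 * nvme_tcp_set_queue_io_cpu(): first set bit at or after n, else wrap to
 * the lowest set bit.  Illustration only, not kernel code.
 */
static int pick_io_cpu(unsigned long mask, int nr_bits, int n)
{
	int cpu;

	for (cpu = n; cpu < nr_bits; cpu++)
		if (mask & (1UL << cpu))
			return cpu;
	for (cpu = 0; cpu < nr_bits; cpu++)	/* wrap around once */
		if (mask & (1UL << cpu))
			return cpu;
	return -1;				/* empty mask */
}

int main(void)
{
	unsigned long online = 0xffUL;		/* default: CPUs 0-7 online */
	unsigned long pinned = 0xfUL << 4;	/* e.g. cpu_affinity_list=4-7 */
	int n;

	for (n = 0; n < 10; n++)
		printf("queue index %d -> default cpu %d, pinned cpu %d\n",
		       n, pick_io_cpu(online, 64, n), pick_io_cpu(pinned, 64, n));
	return 0;
}

Since the parameter is registered with 0644 permissions, it should also be writable after load
through /sys/module/nvme_tcp/parameters/cpu_affinity_list; that path is an inference from the
module_param() flags above, not something stated in the thread.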


2023-04-13 12:59:00

by kernel test robot

Subject: Re: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity

Hi Li,

kernel test robot noticed the following build warnings:

[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on linus/master v6.3-rc6 next-20230412]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Li-Feng/nvme-tcp-Add-support-to-set-the-tcp-worker-cpu-affinity/20230413-143611
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20230413063317.2455680-1-fengli%40smartx.com
patch subject: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity
config: x86_64-randconfig-s021 (https://download.01.org/0day-ci/archive/20230413/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.4-39-gce1a6720-dirty
# https://github.com/intel-lab-lkp/linux/commit/e5a036c113d5ce43375a6aafedcf705ef8c3acb1
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Li-Feng/nvme-tcp-Add-support-to-set-the-tcp-worker-cpu-affinity/20230413-143611
git checkout e5a036c113d5ce43375a6aafedcf705ef8c3acb1
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 olddefconfig
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/nvme/host/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

sparse warnings: (new ones prefixed by >>)
>> drivers/nvme/host/tcp.c:44:16: sparse: sparse: symbol 'cpu_affinity_mask' was not declared. Should it be static?

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-04-13 13:31:12

by Li Feng

Subject: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity

The default worker affinity policy uses all online cpus, i.e. from 0
to N-1. However, if some cpus are busy with other jobs, nvme-tcp will
have bad performance.

This patch adds a module parameter to set the cpu affinity for the nvme-tcp
socket worker threads. The parameter is a comma separated list of CPU
numbers. The list is parsed and the resulting cpumask is used to set the
affinity of the socket worker threads. If the list is empty or the
parsing fails, the default affinity is used.

Signed-off-by: Li Feng <[email protected]>
---

V2 - Fix missing static reported by lkp

drivers/nvme/host/tcp.c | 54 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 49c9e7bc9116..47748de5159b 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -31,6 +31,18 @@ static int so_priority;
module_param(so_priority, int, 0644);
MODULE_PARM_DESC(so_priority, "nvme tcp socket optimize priority");

+/* Support for specifying the CPU affinity for the nvme-tcp socket worker
+ * threads. This is a comma separated list of CPU numbers. The list is
+ * parsed and the resulting cpumask is used to set the affinity of the
+ * socket worker threads. If the list is empty or the parsing fails, the
+ * default affinity is used.
+ */
+static char *cpu_affinity_list;
+module_param(cpu_affinity_list, charp, 0644);
+MODULE_PARM_DESC(cpu_affinity_list, "nvme tcp socket worker cpu affinity list");
+
+static struct cpumask cpu_affinity_mask;
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
/* lockdep can detect a circular dependency of the form
* sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
@@ -1483,6 +1495,41 @@ static bool nvme_tcp_poll_queue(struct nvme_tcp_queue *queue)
ctrl->io_queues[HCTX_TYPE_POLL];
}

+static ssize_t update_cpu_affinity(const char *buf)
+{
+ cpumask_var_t new_value;
+ cpumask_var_t dst_value;
+ int err = 0;
+
+ if (!zalloc_cpumask_var(&new_value, GFP_KERNEL))
+ return -ENOMEM;
+
+ err = bitmap_parselist(buf, cpumask_bits(new_value), nr_cpumask_bits);
+ if (err)
+ goto free_new_cpumask;
+
+ if (!zalloc_cpumask_var(&dst_value, GFP_KERNEL)) {
+ err = -ENOMEM;
+ goto free_new_cpumask;
+ }
+
+ /*
+ * If the new_value does not have any intersection with the cpu_online_mask,
+ * the dst_value will be empty, then keep the cpu_affinity_mask as cpu_online_mask.
+ */
+ if (cpumask_and(dst_value, new_value, cpu_online_mask))
+ cpu_affinity_mask = *dst_value;
+
+ free_cpumask_var(dst_value);
+
+free_new_cpumask:
+ free_cpumask_var(new_value);
+ if (err)
+ pr_err("failed to update cpu affinity mask, bad affinity list [%s], err %d\n",
+ buf, err);
+ return err;
+}
+
static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
{
struct nvme_tcp_ctrl *ctrl = queue->ctrl;
@@ -1496,7 +1543,12 @@ static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
else if (nvme_tcp_poll_queue(queue))
n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] -
ctrl->io_queues[HCTX_TYPE_READ] - 1;
- queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
+
+ if (!cpu_affinity_list || update_cpu_affinity(cpu_affinity_list) != 0) {
+ // Set the default cpu_affinity_mask to cpu_online_mask
+ cpu_affinity_mask = *cpu_online_mask;
+ }
+ queue->io_cpu = cpumask_next_wrap(n - 1, &cpu_affinity_mask, -1, false);
}

static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
--
2.40.0

2023-04-14 08:42:21

by Hannes Reinecke

Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity

On 4/13/23 15:29, Li Feng wrote:
> The default worker affinity policy is using all online cpus, e.g. from 0
> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
> have a bad performance.
>
> This patch adds a module parameter to set the cpu affinity for the nvme-tcp
> socket worker threads. The parameter is a comma separated list of CPU
> numbers. The list is parsed and the resulting cpumask is used to set the
> affinity of the socket worker threads. If the list is empty or the
> parsing fails, the default affinity is used.
>
> Signed-off-by: Li Feng <[email protected]>
> ---
>
> V2 - Fix missing static reported by lkp
>
> drivers/nvme/host/tcp.c | 54 ++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 53 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 49c9e7bc9116..47748de5159b 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -31,6 +31,18 @@ static int so_priority;
> module_param(so_priority, int, 0644);
> MODULE_PARM_DESC(so_priority, "nvme tcp socket optimize priority");
>
> +/* Support for specifying the CPU affinity for the nvme-tcp socket worker
> + * threads. This is a comma separated list of CPU numbers. The list is
> + * parsed and the resulting cpumask is used to set the affinity of the
> + * socket worker threads. If the list is empty or the parsing fails, the
> + * default affinity is used.
> + */
> +static char *cpu_affinity_list;
> +module_param(cpu_affinity_list, charp, 0644);
> +MODULE_PARM_DESC(cpu_affinity_list, "nvme tcp socket worker cpu affinity list");
> +
> +static struct cpumask cpu_affinity_mask;
> +
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> /* lockdep can detect a circular dependency of the form
> * sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
> @@ -1483,6 +1495,41 @@ static bool nvme_tcp_poll_queue(struct nvme_tcp_queue *queue)
> ctrl->io_queues[HCTX_TYPE_POLL];
> }
>
> +static ssize_t update_cpu_affinity(const char *buf)
> +{
> + cpumask_var_t new_value;
> + cpumask_var_t dst_value;
> + int err = 0;
> +
> + if (!zalloc_cpumask_var(&new_value, GFP_KERNEL))
> + return -ENOMEM;
> +
> + err = bitmap_parselist(buf, cpumask_bits(new_value), nr_cpumask_bits);
> + if (err)
> + goto free_new_cpumask;
> +
> + if (!zalloc_cpumask_var(&dst_value, GFP_KERNEL)) {
> + err = -ENOMEM;
> + goto free_new_cpumask;
> + }
> +
> + /*
> + * If the new_value does not have any intersection with the cpu_online_mask,
> + * the dst_value will be empty, then keep the cpu_affinity_mask as cpu_online_mask.
> + */
> + if (cpumask_and(dst_value, new_value, cpu_online_mask))
> + cpu_affinity_mask = *dst_value;
> +
> + free_cpumask_var(dst_value);
> +
> +free_new_cpumask:
> + free_cpumask_var(new_value);
> + if (err)
> + pr_err("failed to update cpu affinity mask, bad affinity list [%s], err %d\n",
> + buf, err);
> + return err;
> +}
> +
> static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
> {
> struct nvme_tcp_ctrl *ctrl = queue->ctrl;
> @@ -1496,7 +1543,12 @@ static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
> else if (nvme_tcp_poll_queue(queue))
> n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] -
> ctrl->io_queues[HCTX_TYPE_READ] - 1;
> - queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
> +
> + if (!cpu_affinity_list || update_cpu_affinity(cpu_affinity_list) != 0) {
> + // Set the default cpu_affinity_mask to cpu_online_mask
> + cpu_affinity_mask = *cpu_online_mask;
> + }
> + queue->io_cpu = cpumask_next_wrap(n - 1, &cpu_affinity_mask, -1, false);
> }
>
> static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)

I am not in favour of this.
NVMe-over-Fabrics has _virtual_ queues, which really have no
relationship to the underlying hardware.
So trying to be clever here by tacking queues to CPUs sort of works if
you have one subsystem to talk to, but if you have several where each
exposes a _different_ number of queues you end up with a quite
suboptimal setting (ie you rely on the resulting cpu sets to overlap,
but there is no guarantee that they do).
Rather leave it to the hardware to sort things out, and rely on the
blk-mq CPU mapping to get I/O aligned to CPUs.

Cheers,

Hannes

2023-04-14 09:35:03

by Li Feng

Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity

On Fri, Apr 14, 2023 at 4:36 PM Hannes Reinecke <[email protected]> wrote:
>
> On 4/13/23 15:29, Li Feng wrote:
> > The default worker affinity policy is using all online cpus, e.g. from 0
> > to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
> > have a bad performance.
> >
> > This patch adds a module parameter to set the cpu affinity for the nvme-tcp
> > socket worker threads. The parameter is a comma separated list of CPU
> > numbers. The list is parsed and the resulting cpumask is used to set the
> > affinity of the socket worker threads. If the list is empty or the
> > parsing fails, the default affinity is used.
> >
> > Signed-off-by: Li Feng <[email protected]>
> > ---
> >
> > V2 - Fix missing static reported by lkp
> >
> > drivers/nvme/host/tcp.c | 54 ++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 53 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index 49c9e7bc9116..47748de5159b 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -31,6 +31,18 @@ static int so_priority;
> > module_param(so_priority, int, 0644);
> > MODULE_PARM_DESC(so_priority, "nvme tcp socket optimize priority");
> >
> > +/* Support for specifying the CPU affinity for the nvme-tcp socket worker
> > + * threads. This is a comma separated list of CPU numbers. The list is
> > + * parsed and the resulting cpumask is used to set the affinity of the
> > + * socket worker threads. If the list is empty or the parsing fails, the
> > + * default affinity is used.
> > + */
> > +static char *cpu_affinity_list;
> > +module_param(cpu_affinity_list, charp, 0644);
> > +MODULE_PARM_DESC(cpu_affinity_list, "nvme tcp socket worker cpu affinity list");
> > +
> > +static struct cpumask cpu_affinity_mask;
> > +
> > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > /* lockdep can detect a circular dependency of the form
> > * sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
> > @@ -1483,6 +1495,41 @@ static bool nvme_tcp_poll_queue(struct nvme_tcp_queue *queue)
> > ctrl->io_queues[HCTX_TYPE_POLL];
> > }
> >
> > +static ssize_t update_cpu_affinity(const char *buf)
> > +{
> > + cpumask_var_t new_value;
> > + cpumask_var_t dst_value;
> > + int err = 0;
> > +
> > + if (!zalloc_cpumask_var(&new_value, GFP_KERNEL))
> > + return -ENOMEM;
> > +
> > + err = bitmap_parselist(buf, cpumask_bits(new_value), nr_cpumask_bits);
> > + if (err)
> > + goto free_new_cpumask;
> > +
> > + if (!zalloc_cpumask_var(&dst_value, GFP_KERNEL)) {
> > + err = -ENOMEM;
> > + goto free_new_cpumask;
> > + }
> > +
> > + /*
> > + * If the new_value does not have any intersection with the cpu_online_mask,
> > + * the dst_value will be empty, then keep the cpu_affinity_mask as cpu_online_mask.
> > + */
> > + if (cpumask_and(dst_value, new_value, cpu_online_mask))
> > + cpu_affinity_mask = *dst_value;
> > +
> > + free_cpumask_var(dst_value);
> > +
> > +free_new_cpumask:
> > + free_cpumask_var(new_value);
> > + if (err)
> > + pr_err("failed to update cpu affinity mask, bad affinity list [%s], err %d\n",
> > + buf, err);
> > + return err;
> > +}
> > +
> > static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
> > {
> > struct nvme_tcp_ctrl *ctrl = queue->ctrl;
> > @@ -1496,7 +1543,12 @@ static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
> > else if (nvme_tcp_poll_queue(queue))
> > n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] -
> > ctrl->io_queues[HCTX_TYPE_READ] - 1;
> > - queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
> > +
> > + if (!cpu_affinity_list || update_cpu_affinity(cpu_affinity_list) != 0) {
> > + // Set the default cpu_affinity_mask to cpu_online_mask
> > + cpu_affinity_mask = *cpu_online_mask;
> > + }
> > + queue->io_cpu = cpumask_next_wrap(n - 1, &cpu_affinity_mask, -1, false);
> > }
> >
> > static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
>
> I am not in favour of this.
> NVMe-over-Fabrics has _virtual_ queues, which really have no
> relationship to the underlying hardware.
> So trying to be clever here by tacking queues to CPUs sort of works if
> you have one subsystem to talk to, but if you have several where each
> exposes a _different_ number of queues you end up with a quite
> suboptimal setting (ie you rely on the resulting cpu sets to overlap,
> but there is no guarantee that they do).

Thanks for your comment.
The current io-queue/cpu mapping method is not optimal.
It is simplistic: it just walks from CPU 0 to the last CPU, and it is not configurable.

> Rather leave it to the hardware to sort things out, and rely on the
> blk-mq CPU mapping to get I/O aligned to CPUs.
>
> Cheers,
>
> Hannes
>
nvme-tcp currently has no *clever* method to bind an io-queue to a cpu;
the binding is decided at queue allocation time and there is no way to
change it afterwards.
E.g. one subsystem's first io queue binds to CPU0 and the next io queue
binds to CPU1. If the NIC is located on the other NUMA node 2 (CPU24 -
CPU36) and fio is bound to NUMA node 2, nvme-tcp will still have poor
performance, so how should I tune it?
I would have to change the NIC irq affinity, but that would hurt the
performance of other applications on NUMA node 2.
We should maximize single-subsystem performance first, then maximize
multiple-subsystem performance.

This patch gives a chance to adapt to the NIC and cpu load.
With it, on my aarch64 server with 4 NUMA nodes, the 256k read
throughput goes up from 1 GB/s to 1.4 GB/s.

Thanks.

2023-04-15 20:22:22

by Chaitanya Kulkarni

Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity

On 4/14/23 02:35, Li Feng wrote:
> On Fri, Apr 14, 2023 at 4:36 PM Hannes Reinecke <[email protected]> wrote:
>> On 4/13/23 15:29, Li Feng wrote:
>>> The default worker affinity policy is using all online cpus, e.g. from 0
>>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
>>> have a bad performance.
>>>
>>> This patch adds a module parameter to set the cpu affinity for the nvme-tcp
>>> socket worker threads. The parameter is a comma separated list of CPU
>>> numbers. The list is parsed and the resulting cpumask is used to set the
>>> affinity of the socket worker threads. If the list is empty or the
>>> parsing fails, the default affinity is used.
>>>
>>> Signed-off-by: Li Feng <[email protected]>
>>> ---
>>>
>>> V2 - Fix missing static reported by lkp
>>>
>>> drivers/nvme/host/tcp.c | 54 ++++++++++++++++++++++++++++++++++++++++-
>>> 1 file changed, 53 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 49c9e7bc9116..47748de5159b 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -31,6 +31,18 @@ static int so_priority;
>>> module_param(so_priority, int, 0644);
>>> MODULE_PARM_DESC(so_priority, "nvme tcp socket optimize priority");
>>>
>>> +/* Support for specifying the CPU affinity for the nvme-tcp socket worker
>>> + * threads. This is a comma separated list of CPU numbers. The list is
>>> + * parsed and the resulting cpumask is used to set the affinity of the
>>> + * socket worker threads. If the list is empty or the parsing fails, the
>>> + * default affinity is used.
>>> + */
>>> +static char *cpu_affinity_list;
>>> +module_param(cpu_affinity_list, charp, 0644);
>>> +MODULE_PARM_DESC(cpu_affinity_list, "nvme tcp socket worker cpu affinity list");
>>> +
>>> +static struct cpumask cpu_affinity_mask;
>>> +
>>> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>> /* lockdep can detect a circular dependency of the form
>>> * sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
>>> @@ -1483,6 +1495,41 @@ static bool nvme_tcp_poll_queue(struct nvme_tcp_queue *queue)
>>> ctrl->io_queues[HCTX_TYPE_POLL];
>>> }
>>>
>>> +static ssize_t update_cpu_affinity(const char *buf)
>>> +{
>>> + cpumask_var_t new_value;
>>> + cpumask_var_t dst_value;
>>> + int err = 0;
>>> +
>>> + if (!zalloc_cpumask_var(&new_value, GFP_KERNEL))
>>> + return -ENOMEM;
>>> +
>>> + err = bitmap_parselist(buf, cpumask_bits(new_value), nr_cpumask_bits);
>>> + if (err)
>>> + goto free_new_cpumask;
>>> +
>>> + if (!zalloc_cpumask_var(&dst_value, GFP_KERNEL)) {
>>> + err = -ENOMEM;
>>> + goto free_new_cpumask;
>>> + }
>>> +
>>> + /*
>>> + * If the new_value does not have any intersection with the cpu_online_mask,
>>> + * the dst_value will be empty, then keep the cpu_affinity_mask as cpu_online_mask.
>>> + */
>>> + if (cpumask_and(dst_value, new_value, cpu_online_mask))
>>> + cpu_affinity_mask = *dst_value;
>>> +
>>> + free_cpumask_var(dst_value);
>>> +
>>> +free_new_cpumask:
>>> + free_cpumask_var(new_value);
>>> + if (err)
>>> + pr_err("failed to update cpu affinity mask, bad affinity list [%s], err %d\n",
>>> + buf, err);
>>> + return err;
>>> +}
>>> +
>>> static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
>>> {
>>> struct nvme_tcp_ctrl *ctrl = queue->ctrl;
>>> @@ -1496,7 +1543,12 @@ static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
>>> else if (nvme_tcp_poll_queue(queue))
>>> n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] -
>>> ctrl->io_queues[HCTX_TYPE_READ] - 1;
>>> - queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
>>> +
>>> + if (!cpu_affinity_list || update_cpu_affinity(cpu_affinity_list) != 0) {
>>> + // Set the default cpu_affinity_mask to cpu_online_mask
>>> + cpu_affinity_mask = *cpu_online_mask;
>>> + }
>>> + queue->io_cpu = cpumask_next_wrap(n - 1, &cpu_affinity_mask, -1, false);
>>> }
>>>
>>> static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
>> I am not in favour of this.
>> NVMe-over-Fabrics has _virtual_ queues, which really have no
>> relationship to the underlying hardware.
>> So trying to be clever here by tacking queues to CPUs sort of works if
>> you have one subsystem to talk to, but if you have several where each
>> exposes a _different_ number of queues you end up with a quite
>> suboptimal setting (ie you rely on the resulting cpu sets to overlap,
>> but there is no guarantee that they do).
> Thanks for your comment.
> The current io-queues/cpu map method is not optimal.
> It is stupid, and just starts from 0 to the last CPU, which is not configurable.
>
>> Rather leave it to the hardware to sort things out, and rely on the
>> blk-mq CPU mapping to get I/O aligned to CPUs.
>>
>> Cheers,
>>
>> Hannes
>>
> The nvme-tcp currently doesn't support a *clever* method to bind the
> io-queue and cpu,
> it's decided at the allocation and doesn't have a method to change it.
> E.g. One subsystem's first io queue binds to CPU0, the next io queue
> binds to CPU1, and
> if the NIC is located on the other NUMA node 2(CPU24 - CPU36), and
> binds the fio to NUMA node 2,
> the nvme-tcp will still have poor performance,so how should I tune the
> performance?
> I have to change the nic irq affinity, but it will hurt other numa
> node2 application's performance.
> We should maximize one subsystem performance, then maximize multiple
> subsystems performance.
>
> This patch gives a chance to adapt the nic and cpu load.
> Before and after patch, on my aarch64 server with 4 numa nodes, the
> 256k read throughput ups
> from 1GB/s to 1.4GB/s.
>
> Thanks.
>


A patch like this needs to be backed by a detailed performance analysis
to start with, both with and without the patch, to prove its usefulness
with quantitative data; that will avoid any further questions and allow
reviewers to come to a conclusion faster ...

Also, please make sure to cover all the relevant general use cases to
avoid further questions, such as single-subsystem vs. multi-subsystem
performance ..

-ck


2023-04-15 21:09:03

by David Laight

Subject: RE: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity

From: Li Feng
> Sent: 14 April 2023 10:35
> >
> > On 4/13/23 15:29, Li Feng wrote:
> > > The default worker affinity policy is using all online cpus, e.g. from 0
> > > to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
> > > have a bad performance.
> > >
> > > This patch adds a module parameter to set the cpu affinity for the nvme-tcp
> > > socket worker threads. The parameter is a comma separated list of CPU
> > > numbers. The list is parsed and the resulting cpumask is used to set the
> > > affinity of the socket worker threads. If the list is empty or the
> > > parsing fails, the default affinity is used.
> > >
...
> > I am not in favour of this.
> > NVMe-over-Fabrics has _virtual_ queues, which really have no
> > relationship to the underlying hardware.
> > So trying to be clever here by tacking queues to CPUs sort of works if
> > you have one subsystem to talk to, but if you have several where each
> > exposes a _different_ number of queues you end up with a quite
> > suboptimal setting (ie you rely on the resulting cpu sets to overlap,
> > but there is no guarantee that they do).
>
> Thanks for your comment.
> The current io-queues/cpu map method is not optimal.
> It is stupid, and just starts from 0 to the last CPU, which is not configurable.

Module parameters suck, and passing the buck to the user
when you can't decide how to do something isn't a good idea either.

If the system is busy, pinning threads to cpus is very hard to
get right.

It can be better to set the threads to run at the lowest RT
priority - so they have priority over all 'normal' threads
and also have a very sticky (but not fixed) cpu affinity so
that all such threads tends to get spread out by the scheduler.
This all works best if the number of RT threads isn't greater
than the number of physical cpu.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-04-17 03:30:46

by Li Feng

Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity



> On Apr 16, 2023, at 5:06 AM, David Laight <[email protected]> wrote:
>
> From: Li Feng
>> Sent: 14 April 2023 10:35
>>>
>>> On 4/13/23 15:29, Li Feng wrote:
>>>> The default worker affinity policy is using all online cpus, e.g. from 0
>>>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
>>>> have a bad performance.
>>>>
>>>> This patch adds a module parameter to set the cpu affinity for the nvme-tcp
>>>> socket worker threads. The parameter is a comma separated list of CPU
>>>> numbers. The list is parsed and the resulting cpumask is used to set the
>>>> affinity of the socket worker threads. If the list is empty or the
>>>> parsing fails, the default affinity is used.
>>>>
> ...
>>> I am not in favour of this.
>>> NVMe-over-Fabrics has _virtual_ queues, which really have no
>>> relationship to the underlying hardware.
>>> So trying to be clever here by tacking queues to CPUs sort of works if
>>> you have one subsystem to talk to, but if you have several where each
>>> exposes a _different_ number of queues you end up with a quite
>>> suboptimal setting (ie you rely on the resulting cpu sets to overlap,
>>> but there is no guarantee that they do).
>>
>> Thanks for your comment.
>> The current io-queues/cpu map method is not optimal.
>> It is stupid, and just starts from 0 to the last CPU, which is not configurable.
>
> Module parameters suck, and passing the buck to the user
> when you can't decide how to do something isn't a good idea either.
>
> If the system is busy pinning threads to cpus is very hard to
> get right.
>
> It can be better to set the threads to run at the lowest RT
> priority - so they have priority over all 'normal' threads
> and also have a very sticky (but not fixed) cpu affinity so
> that all such threads tends to get spread out by the scheduler.
> This all works best if the number of RT threads isn't greater
> than the number of physical cpu.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

Hi David,

RT priority can’t solve the cross-NUMA access issue.
If the user doesn’t know how to configure this affinity, they can just keep the default.

Cross-NUMA access is not an obvious issue on x86_64, but it’s a significant issue
on aarch64 with multiple NUMA nodes.

Thanks.

2023-04-17 06:40:52

by Hannes Reinecke

Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity

On 4/15/23 23:06, David Laight wrote:
> From: Li Feng
>> Sent: 14 April 2023 10:35
>>>
>>> On 4/13/23 15:29, Li Feng wrote:
>>>> The default worker affinity policy is using all online cpus, e.g. from 0
>>>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
>>>> have a bad performance.
>>>>
>>>> This patch adds a module parameter to set the cpu affinity for the nvme-tcp
>>>> socket worker threads. The parameter is a comma separated list of CPU
>>>> numbers. The list is parsed and the resulting cpumask is used to set the
>>>> affinity of the socket worker threads. If the list is empty or the
>>>> parsing fails, the default affinity is used.
>>>>
> ...
>>> I am not in favour of this.
>>> NVMe-over-Fabrics has _virtual_ queues, which really have no
>>> relationship to the underlying hardware.
>>> So trying to be clever here by tacking queues to CPUs sort of works if
>>> you have one subsystem to talk to, but if you have several where each
>>> exposes a _different_ number of queues you end up with a quite
>>> suboptimal setting (ie you rely on the resulting cpu sets to overlap,
>>> but there is no guarantee that they do).
>>
>> Thanks for your comment.
>> The current io-queues/cpu map method is not optimal.
>> It is stupid, and just starts from 0 to the last CPU, which is not configurable.
>
> Module parameters suck, and passing the buck to the user
> when you can't decide how to do something isn't a good idea either.
>
> If the system is busy pinning threads to cpus is very hard to
> get right.
>
> It can be better to set the threads to run at the lowest RT
> priority - so they have priority over all 'normal' threads
> and also have a very sticky (but not fixed) cpu affinity so
> that all such threads tends to get spread out by the scheduler.
> This all works best if the number of RT threads isn't greater
> than the number of physical cpu.
>
And the problem is that you cannot give an 'optimal' performance metric
here. With NVMe-over-Fabrics the number of queues is negotiated during
the initial 'connect' call, and the resulting number of queues strongly
depends on target preferences (eg a NetApp array will expose only 4
queues, while with Dell/EMC you end up with up to 128 queues).
And these queues need to be mapped onto the underlying hardware, which
has its own issues wrt NUMA affinity.

To give you an example:
Given a setup with a 4-node NUMA machine, one NIC attached to one NUMA
node, each socket having 24 threads, the NIC exposing up to 32
interrupts, and connections to a NetApp _and_ an EMC, how exactly should
the 'best' layout look?
And, what _is_ the 'best' layout?
You cannot satisfy the queue requirements from NetApp _and_ EMC, as you
only have one NIC, and you cannot change the interrupt affinity for each
I/O.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman

2023-04-17 07:57:24

by Ming Lei

Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity

On Thu, Apr 13, 2023 at 09:29:41PM +0800, Li Feng wrote:
> The default worker affinity policy is using all online cpus, e.g. from 0
> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
> have a bad performance.

Can you explain in detail how nvme-tcp performs worse in this situation?

If some of the CPUs are known to be busy, you can submit the nvme-tcp io jobs
on other non-busy CPUs via taskset, or the scheduler is supposed to choose
proper CPUs for you. And usually an nvme-tcp device should be saturated
with limited io depth or jobs/cpus.


Thanks,
Ming

2023-04-17 07:57:52

by Li Feng

Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity



> On Apr 17, 2023, at 3:37 PM, Ming Lei <[email protected]> wrote:
>
> On Thu, Apr 13, 2023 at 09:29:41PM +0800, Li Feng wrote:
>> The default worker affinity policy is using all online cpus, e.g. from 0
>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
>> have a bad performance.
>
> Can you explain in detail how nvme-tcp performs worse in this situation?
>
> If some of CPUs are knows as busy, you can submit the nvme-tcp io jobs
> on other non-busy CPUs via taskset, or scheduler is supposed to choose
> proper CPUs for you. And usually nvme-tcp device should be saturated
> with limited io depth or jobs/cpus.
>
>
> Thanks,
> Ming
>

Taskset can’t help with nvme-tcp io-queues, because the worker cpu is decided at the nvme-tcp ‘connect’ stage,
not at the io submission stage. Assume there is only one io-queue: its bound cpu is CPU0, no matter which cpus
the io jobs run on.


2023-04-17 08:06:43

by Ming Lei

Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity

On Mon, Apr 17, 2023 at 03:50:46PM +0800, Li Feng wrote:
>
>
> > On Apr 17, 2023, at 3:37 PM, Ming Lei <[email protected]> wrote:
> >
> > On Thu, Apr 13, 2023 at 09:29:41PM +0800, Li Feng wrote:
> >> The default worker affinity policy is using all online cpus, e.g. from 0
> >> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
> >> have a bad performance.
> >
> > Can you explain in detail how nvme-tcp performs worse in this situation?
> >
> > If some of CPUs are knows as busy, you can submit the nvme-tcp io jobs
> > on other non-busy CPUs via taskset, or scheduler is supposed to choose
> > proper CPUs for you. And usually nvme-tcp device should be saturated
> > with limited io depth or jobs/cpus.
> >
> >
> > Thanks,
> > Ming
> >
>
> Taskset can’t work on nvme-tcp io-queues, because the worker cpu has decided at the nvme-tcp ‘connect’ stage,
> not the sending io stage. Assume there is only one io-queue, the binding cpu is CPU0, no matter io jobs
> run other cpus.

OK, it looks like the problem is on queue->io_cpu, see nvme_tcp_queue_request().

But I am wondering why nvme-tcp doesn't queue the io work on the current
cpu? And why was queue->io_cpu introduced? Given that blk-mq defines cpu
affinities for each hw queue, the driver is supposed to submit IO requests
to hardware on the local CPU.

Sagi and Guys, any ideas about introducing queue->io_cpu?


Thanks,
Ming
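
To make the two alternatives Ming contrasts here concrete, a minimal sketch (the nvme_tcp_wq
workqueue and per-queue io_work item are taken from the driver; this is an illustration, not a
proposed change):

	/* Current behaviour: the io work always runs on the fixed queue->io_cpu
	 * chosen at queue setup time. */
	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);

	/* What is being asked about: queue the work without naming a CPU; a
	 * bound workqueue would then normally run it on the submitting CPU,
	 * matching the caller's blk-mq hw-queue affinity. */
	queue_work(nvme_tcp_wq, &queue->io_work);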

2023-04-17 08:38:18

by Li Feng

Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity

On Mon, Apr 17, 2023 at 2:27 PM Hannes Reinecke <[email protected]> wrote:
>
> On 4/15/23 23:06, David Laight wrote:
> > From: Li Feng
> >> Sent: 14 April 2023 10:35
> >>>
> >>> On 4/13/23 15:29, Li Feng wrote:
> >>>> The default worker affinity policy is using all online cpus, e.g. from 0
> >>>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
> >>>> have a bad performance.
> >>>>
> >>>> This patch adds a module parameter to set the cpu affinity for the nvme-tcp
> >>>> socket worker threads. The parameter is a comma separated list of CPU
> >>>> numbers. The list is parsed and the resulting cpumask is used to set the
> >>>> affinity of the socket worker threads. If the list is empty or the
> >>>> parsing fails, the default affinity is used.
> >>>>
> > ...
> >>> I am not in favour of this.
> >>> NVMe-over-Fabrics has _virtual_ queues, which really have no
> >>> relationship to the underlying hardware.
> >>> So trying to be clever here by tacking queues to CPUs sort of works if
> >>> you have one subsystem to talk to, but if you have several where each
> >>> exposes a _different_ number of queues you end up with a quite
> >>> suboptimal setting (ie you rely on the resulting cpu sets to overlap,
> >>> but there is no guarantee that they do).
> >>
> >> Thanks for your comment.
> >> The current io-queues/cpu map method is not optimal.
> >> It is stupid, and just starts from 0 to the last CPU, which is not configurable.
> >
> > Module parameters suck, and passing the buck to the user
> > when you can't decide how to do something isn't a good idea either.
> >
> > If the system is busy pinning threads to cpus is very hard to
> > get right.
> >
> > It can be better to set the threads to run at the lowest RT
> > priority - so they have priority over all 'normal' threads
> > and also have a very sticky (but not fixed) cpu affinity so
> > that all such threads tends to get spread out by the scheduler.
> > This all works best if the number of RT threads isn't greater
> > than the number of physical cpu.
> >
> And the problem is that you cannot give an 'optimal' performance metric
> here. With NVMe-over-Fabrics the number of queues is negotiated during
> the initial 'connect' call, and the resulting number of queues strongly
> depends on target preferences (eg a NetApp array will expose only 4
> queues, with Dell/EMC you end up with up max 128 queues).
> And these queues need to be mapped on the underlying hardware, which has
> its own issues wrt to NUMA affinity.
>
> To give you an example:
> Given a setup with a 4 node NUMA machine, one NIC connected to
> one NUMA core, each socket having 24 threads, the NIC exposing up to 32
> interrupts, and connections to a NetApp _and_ a EMC, how exactly should
> the 'best' layout look like?
> And, what _is_ the 'best' layout?
> You cannot satisfy the queue requirements from NetApp _and_ EMC, as you
> only have one NIC, and you cannot change the interrupt affinity for each
> I/O.
>
Not all users have enough NICs to place one NIC per NUMA node.
Having only one NIC is a quite common scenario.

There is no ‘best' layout for all cases,
so add this parameter to let users select what they want.

> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke Kernel Storage Architect
> [email protected] +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
> Myers, Andrew McDonald, Martje Boudien Moerman
>

2023-04-17 13:37:20

by Sagi Grimberg

Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity


>>> On Thu, Apr 13, 2023 at 09:29:41PM +0800, Li Feng wrote:
>>>> The default worker affinity policy is using all online cpus, e.g. from 0
>>>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
>>>> have a bad performance.
>>>
>>> Can you explain in detail how nvme-tcp performs worse in this situation?
>>>
>>> If some of CPUs are knows as busy, you can submit the nvme-tcp io jobs
>>> on other non-busy CPUs via taskset, or scheduler is supposed to choose
>>> proper CPUs for you. And usually nvme-tcp device should be saturated
>>> with limited io depth or jobs/cpus.
>>>
>>>
>>> Thanks,
>>> Ming
>>>
>>
>> Taskset can’t work on nvme-tcp io-queues, because the worker cpu has decided at the nvme-tcp ‘connect’ stage,
>> not the sending io stage. Assume there is only one io-queue, the binding cpu is CPU0, no matter io jobs
>> run other cpus.
>
> OK, looks the problem is on queue->io_cpu, see nvme_tcp_queue_request().
>
> But I am wondering why nvme-tcp doesn't queue the io work on the current
> cpu? And why is queue->io_cpu introduced? Given blk-mq defines cpu
> affinities for each hw queue, driver is supposed to submit IO request
> to hardware on the local CPU.
>
> Sagi and Guys, any ideas about introducing queue->io_cpu?

Hey Ming,

I have some vague memories wrt this, but from what I recall:

- The number of queues is dependent on both the controller and
the user (Not a reason/motivation on its own, just clarifying).

- It simply matches what pci does (to some extent, outside of rx side
entropy that may exist), it just happens to take more cpu cycles due to
the network stack overhead.

- I didn't want io threads to change CPUs because of RFS/aRFS
optimizations that people use, which allows the NIC to steer interrupts
(and napi context) to where the io thread is running, and thus minimize
latency due to improved locality. That on its own was shown to be worth
over a 30% reduction.

- At some point nvme-tcp rx context used to run in softirq, and having
to synchronize different cores (on different numa nodes maybe, depends
on what RSS decided) when processing the socket resulted in high
latency as well. This is not the case today (due to some nics back then
that surfaced various issues with this) but it may come back in
the future (if shown to provide value).

- Also today when there is a sync network send from .queue_rq path,
it is only executed when the running cpu == queue->io_cpu, to avoid high
contention. My concern is that if the io context is not bound to
a specific cpu, it may create heavier contention on queue serialization.
Today there are at most 2 contexts that compete, io context (triggered
from network rx or scheduled in the submission path) and .queue_rq sync
network send path. I'd prefer to not have to introduce more contention
with increasing number of threads accessing an nvme controller.

Having said that, I don't think there is a fundamental issue with
using queue_work, or queue_work_node(cur_cpu) or
queue_work_node(netdev_home_cpu), if that does not introduce
additional latency in the common cases. Although having io threads
bounce around is going to regress users that use RFS/aRFS...
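
For reference, the submission-path behaviour described a few lines up (inline send only when the
running cpu == queue->io_cpu, otherwise deferral to the per-queue io_work) looks roughly like the
sketch below. It is paraphrased from memory of the driver of that era, so names and details should
be checked against the actual source rather than taken as exact:

static void nvme_tcp_queue_request_sketch(struct nvme_tcp_request *req,
					  bool sync, bool last)
{
	struct nvme_tcp_queue *queue = req->queue;
	bool empty;

	/* the request goes onto the queue's lockless submission list */
	empty = llist_add(&req->lentry, &queue->req_list) &&
		list_empty(&queue->send_list) && !queue->request;

	/*
	 * Send inline only when we are already on the queue's io_cpu and the
	 * send path is uncontended; this is the "at most 2 contexts compete"
	 * property mentioned above.
	 */
	if (queue->io_cpu == raw_smp_processor_id() &&
	    sync && empty && mutex_trylock(&queue->send_mutex)) {
		nvme_tcp_send_all(queue);
		mutex_unlock(&queue->send_mutex);
	} else if (last) {
		/* otherwise defer to the worker pinned to queue->io_cpu */
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
	}
}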

2023-04-17 13:56:20

by Sagi Grimberg

Subject: Re: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity

Hey Li,

> The default worker affinity policy is using all online cpus, e.g. from 0
> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
> have a bad performance.
>
> This patch adds a module parameter to set the cpu affinity for the nvme-tcp
> socket worker threads. The parameter is a comma separated list of CPU
> numbers. The list is parsed and the resulting cpumask is used to set the
> affinity of the socket worker threads. If the list is empty or the
> parsing fails, the default affinity is used.

I can see how this may benefit a specific set of workloads, but I have a
few issues with this.

- This is exposing a user interface for something that is really
internal to the driver.

- This is something that can be misleading and could be tricky to get
right; my concern is that this would only benefit a very niche case.

- If the setting should exist, it should not be global.

- I prefer not to introduce new modparams.

- I'd prefer to find a way to support your use-case without introducing
a config knob for it.

- It is not backed by performance improvements, but more importantly
does not cover any potential regressions in key metrics (bw/iops/lat)
or lack thereof.

2023-04-18 03:42:17

by Li Feng

Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity



> On Apr 17, 2023, at 9:33 PM, Sagi Grimberg <[email protected]> wrote:
>
>
>>>> On Thu, Apr 13, 2023 at 09:29:41PM +0800, Li Feng wrote:
>>>>> The default worker affinity policy is using all online cpus, e.g. from 0
>>>>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
>>>>> have a bad performance.
>>>>
>>>> Can you explain in detail how nvme-tcp performs worse in this situation?
>>>>
>>>> If some of CPUs are knows as busy, you can submit the nvme-tcp io jobs
>>>> on other non-busy CPUs via taskset, or scheduler is supposed to choose
>>>> proper CPUs for you. And usually nvme-tcp device should be saturated
>>>> with limited io depth or jobs/cpus.
>>>>
>>>>
>>>> Thanks,
>>>> Ming
>>>>
>>>
>>> Taskset can’t work on nvme-tcp io-queues, because the worker cpu has decided at the nvme-tcp ‘connect’ stage,
>>> not the sending io stage. Assume there is only one io-queue, the binding cpu is CPU0, no matter io jobs
>>> run other cpus.
>> OK, looks the problem is on queue->io_cpu, see nvme_tcp_queue_request().
>> But I am wondering why nvme-tcp doesn't queue the io work on the current
>> cpu? And why is queue->io_cpu introduced? Given blk-mq defines cpu
>> affinities for each hw queue, driver is supposed to submit IO request
>> to hardware on the local CPU.
>> Sagi and Guys, any ideas about introducing queue->io_cpu?
>
> Hey Ming,
>
> I have some vague memories wrt to this, but from what I recall:
>
> - The number of queues is dependent on both the controller and
> the user (Not a reason/motivation on its own, just clarifying).
>
> - It simply matches what pci does (to some extent, outside of rx side
> entropy that may exist), it just happens to take more cpu cycles due to
> the network stack overhead.
>
> - I didn't want io threads to change CPUs because of RFS/aRFS
> optimizations that people use, which allows the NIC to steer interrupts
> (and napi context) to where the io thread is running, and thus minimize
> latency due to improved locality. that on its own was shown to be worth
> over 30% reduction.
>
RFS does not work well here. On my aarch64 machine, the NIC irq is handled on a NUMA node 2 CPU,
while the nvme-tcp io-queue is busy on CPU0.

> - At some point nvme-tcp rx context used to run in softirq, and having
> to synchronize different cores (on different numa nodes maybe, depends
> on what RSS decided) when processing the socket resulted in high
> latency as well. This is not the case today (due to some nics back then
> that surfaced various issues with this) but it may be come back in
> the future again (if shown to provide value).
>
> - Also today when there is a sync network send from .queue_rq path,
> it is only executed when the running cpu == queue->io_cpu, to avoid high
> contention. My concern is that if the io context is not bound to
> a specific cpu, it may create heavier contention on queue serialization.
> Today there are at most 2 contexts that compete, io context (triggered from network rx or scheduled in the submission path) and .queue_rq sync
> network send path. I'd prefer to not have to introduce more contention with increasing number of threads accessing an nvme controller.
>
> Having said that, I don't think there is a fundamental issue with
> using queue_work, or queue_work_node(cur_cpu) or
> queue_work_node(netdev_home_cpu), if that does not introduce
> additional latency in the common cases. Although having io threads
> bounce around is going to regress users that use RFS/aRFS...

2023-04-18 03:42:34

by Li Feng

Subject: Re: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity

Hi Sagi,

> On Apr 17, 2023, at 9:45 PM, Sagi Grimberg <[email protected]> wrote:
>
> Hey Li,
>
>> The default worker affinity policy is using all online cpus, e.g. from 0
>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
>> have a bad performance.
>> This patch adds a module parameter to set the cpu affinity for the nvme-tcp
>> socket worker threads. The parameter is a comma separated list of CPU
>> numbers. The list is parsed and the resulting cpumask is used to set the
>> affinity of the socket worker threads. If the list is empty or the
>> parsing fails, the default affinity is used.
>
> I can see how this may benefit a specific set of workloads, but I have a
> few issues with this.
>
> - This is exposing a user interface for something that is really
> internal to the driver.
>
> - This is something that can be misleading and could be tricky to get
> right, my concern is that this would only benefit a very niche case.
Our storage products need this feature.
If users don’t know what this is, they can keep the default, so I think this is
not unacceptable.
>
> - If the setting should exist, it should not be global.
V2 has fixed it.
>
> - I prefer not to introduce new modparams.
>
> - I'd prefer to find a way to support your use-case without introducing
> a config knob for it.
>
I’m looking forward to it.

> - It is not backed by performance improvements, but more importantly
> does not cover any potential regressions in key metrics (bw/iops/lat)
> or lack there of.

I can do more tests if needed.

Thanks,
Feng Li

2023-04-18 04:11:01

by Chaitanya Kulkarni

Subject: Re: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity


> - It is not backed by performance improvements, but more importantly
> does not cover any potential regressions in key metrics (bw/iops/lat)
> or lack there of.
>

I've already asked for this; without seeing performance numbers
and regression coverage for the general use case it is hard to justify this.

-ck



2023-04-18 04:24:28

by Li Feng

Subject: Re: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity



> On Apr 18, 2023, at 11:58 AM, Chaitanya Kulkarni <[email protected]> wrote:
>
>
>> - It is not backed by performance improvements, but more importantly
>> does not cover any potential regressions in key metrics (bw/iops/lat)
>> or lack there of.
>>
>
> I've already asked for this, without seeing performance numbers
> and any regression for general usecase it is hard to justify this.
>
> -ck
>
>
Hi ck,
Thanks for your comment.

In the previous mail I only pasted the 256k read result (1 GB/s to 1.4 GB/s); I will add
more io pattern results as soon as the machine environment is available.

2023-04-18 04:42:03

by Ming Lei

Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity

Hi Sagi,

On Mon, Apr 17, 2023 at 04:33:40PM +0300, Sagi Grimberg wrote:
>
> > > > On Thu, Apr 13, 2023 at 09:29:41PM +0800, Li Feng wrote:
> > > > > The default worker affinity policy is using all online cpus, e.g. from 0
> > > > > to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
> > > > > have a bad performance.
> > > >
> > > > Can you explain in detail how nvme-tcp performs worse in this situation?
> > > >
> > > > If some of CPUs are knows as busy, you can submit the nvme-tcp io jobs
> > > > on other non-busy CPUs via taskset, or scheduler is supposed to choose
> > > > proper CPUs for you. And usually nvme-tcp device should be saturated
> > > > with limited io depth or jobs/cpus.
> > > >
> > > >
> > > > Thanks,
> > > > Ming
> > > >
> > >
> > > Taskset can’t work on nvme-tcp io-queues, because the worker cpu has decided at the nvme-tcp ‘connect’ stage,
> > > not the sending io stage. Assume there is only one io-queue, the binding cpu is CPU0, no matter io jobs
> > > run other cpus.
> >
> > OK, looks the problem is on queue->io_cpu, see nvme_tcp_queue_request().
> >
> > But I am wondering why nvme-tcp doesn't queue the io work on the current
> > cpu? And why is queue->io_cpu introduced? Given blk-mq defines cpu
> > affinities for each hw queue, driver is supposed to submit IO request
> > to hardware on the local CPU.
> >
> > Sagi and Guys, any ideas about introducing queue->io_cpu?
>
> Hey Ming,
>
> I have some vague memories wrt to this, but from what I recall:
>
> - The number of queues is dependent on both the controller and
> the user (Not a reason/motivation on its own, just clarifying).
>
> - It simply matches what pci does (to some extent, outside of rx side
> entropy that may exist), it just happens to take more cpu cycles due to
> the network stack overhead.
>
> - I didn't want io threads to change CPUs because of RFS/aRFS
> optimizations that people use, which allows the NIC to steer interrupts
> (and napi context) to where the io thread is running, and thus minimize
> latency due to improved locality. that on its own was shown to be worth
> over 30% reduction.

OK, it sounds like a per-queue kthread model may work for this case: the
kthread's cpu affinity can be wired to the hw queue's cpu affinity, the
scheduler can figure out the right time to migrate the kthread to proper
CPUs, and task migration should happen much less frequently.
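
(A rough sketch of that wiring, purely illustrative; nvme_tcp_io_thread()
and queue->io_thread are made-up names here, and teardown/error handling
is omitted:)

static int nvme_tcp_start_io_kthread(struct nvme_tcp_queue *queue,
				     struct blk_mq_hw_ctx *hctx)
{
	struct task_struct *t;

	t = kthread_create(nvme_tcp_io_thread, queue, "nvme_tcp/%d",
			   nvme_tcp_queue_id(queue));
	if (IS_ERR(t))
		return PTR_ERR(t);

	/*
	 * Let the scheduler place the thread anywhere in the hw queue's
	 * CPU set instead of pinning it to a single io_cpu.
	 */
	kthread_bind_mask(t, hctx->cpumask);

	queue->io_thread = t;
	wake_up_process(t);
	return 0;
}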

>
> - At some point nvme-tcp rx context used to run in softirq, and having
> to synchronize different cores (on different numa nodes maybe, depends
> on what RSS decided) when processing the socket resulted in high
> latency as well. This is not the case today (due to some nics back then
> that surfaced various issues with this) but it may be come back in
> the future again (if shown to provide value).
>
> - Also today when there is a sync network send from .queue_rq path,
> it is only executed when the running cpu == queue->io_cpu, to avoid high
> contention. My concern is that if the io context is not bound to

This looks like an optimization for the send path? But this way actually causes
contention with nvme_tcp_io_work().

> a specific cpu, it may create heavier contention on queue serialization.
> Today there are at most 2 contexts that compete, io context (triggered from
> network rx or scheduled in the submission path) and .queue_rq sync
> network send path. I'd prefer to not have to introduce more contention with
> increasing number of threads accessing an nvme controller.

I understand that avoiding contention doesn't require binding the wq or
kthread to one fixed cpu. Using a single per-queue work item or kthread can
avoid contention completely.

Here one tcp queue needs to handle both send & recv, and I guess all tcp sends
need to be serialized, and the same for tcp recvs. Maybe two wq/kthreads, one
for send and the other for recv? Or a single wq/kthread is fine too if
the two can be done in an async style.

Then the send_mutex can be saved, maybe nvme/tcp blocking can be removed.

>
> Having said that, I don't think there is a fundamental issue with
> using queue_work, or queue_work_node(cur_cpu) or
> queue_work_node(netdev_home_cpu), if that does not introduce
> additional latency in the common cases. Although having io threads
> bounce around is going to regress users that use RFS/aRFS...

IMO, the fundamental issue here is that one fixed cpu (queue->io_cpu) is selected
for handling all IO submission aimed at the same queue, which can't scale, because
we can't expect userspace or the scheduler to reserve this fixed cpu for the
nvme_tcp queue.


Thanks,
Ming

2023-04-18 09:34:05

by Li Feng

[permalink] [raw]
Subject: Re: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity



> On Apr 18, 2023, at 11:58 AM, Chaitanya Kulkarni <[email protected]> wrote:
>
>
>> - It is not backed by performance improvements, but more importantly
>> does not cover any potential regressions in key metrics (bw/iops/lat)
>> or lack there of.
>>
>
> I've already asked for this, without seeing performance numbers
> and any regression for general usecase it is hard to justify this.
>
> -ck
>
>
>

Hi ck & sagi,

Here are results for more io patterns.

# ENV
[root@Node81 vhost]# uname -a
Linux Node81 5.10.0-136.28.0.104.oe2203sp1.aarch64 #1 SMP Thu Apr 13 10:50:10 CST 2023 aarch64 aarch64 aarch64 GNU/Linux

[root@Node81 vhost]# lscpu
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: HUAWEI Kunpeng 920 5251K
Model: 0
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 2
Stepping: 0x1
Frequency boost: disabled
CPU max MHz: 2600.0000
CPU min MHz: 200.0000
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
Caches (sum of all):
L1d: 6 MiB (96 instances)
L1i: 6 MiB (96 instances)
L2: 48 MiB (96 instances)
L3: 96 MiB (4 instances)
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Not affected
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Not affected
Srbds: Not affected
Tsx async abort: Not affected

[root@Node81 host3]# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 31619 MB
node 1 free: 30598 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 3 size: 31219 MB
node 3 free: 29854 MB
node distances:
node 0 1 2 3
0: 10 12 20 22
1: 12 10 22 24
2: 20 22 10 12
3: 22 24 12 10

[root@Node81 vhost]# lshw -short -c network
H/W path Device Class Description
===========================================================
/0/106/0 enp129s0f0np0 network MT27800 Family [ConnectX-5]
/0/106/0.1 enp129s0f1np1 network MT27800 Family [ConnectX-5]
[root@Node81 vhost]# lspci | grep MT
81:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
81:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

[root@Node81 vhost]# ethtool -i enp129s0f0np0
driver: mlx5_core
version: 5.0-0
firmware-version: 16.35.2000 (MT_0000000425)
expansion-rom-version:
bus-info: 0000:81:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

[root@Node81 vhost]# lspci -s 81:00.0 -vv
81:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
Subsystem: Mellanox Technologies Device 0121
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 27
NUMA node: 2
Region 0: Memory at 280002000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at b0200000 [disabled] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 128 bytes, Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn+
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [48] Vital Product Data
Product Name: CX512A - ConnectX-5 SFP28
Read-only fields:
[PN] Part number: MCX512A-ACUT
[EC] Engineering changes: B7
[V2] Vendor specific: MCX512A-ACUT
[SN] Serial number: MT2211K02268
[V3] Vendor specific: fe87176f019fec1180001070fd62e0e0
[VA] Vendor specific: MLX:MODL=CX512A:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0
[V0] Vendor specific: PCIeGen3 x8
[RV] Reserved: checksum good, 0 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 08, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration- 10BitTagReq- Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+ 10BitTagReq-
IOVSta: Migration-
Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 00
VF offset: 2, stride: 1, Device ID: 1018
Supported Page Size: 000007ff, System Page Size: 00000001
Region 0: Memory at 0000280004800000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [1c0 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [230 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core


# Before patch:
2293 root 0 -20 0 0 0 R 68.6 0.0 0:23.85 kworker/0:1H+nvme_tcp_wq
12504 root 20 0 476960 26280 20140 S 41.3 0.0 0:03.38 fio

pattern | bandwidth(MiB/s) | iops | latency(us)
:- | -: | -: | -:
4k-randread-q128-j1 | 308 | 79086 | 1617.33
256k-randread-q128-j1 | 1072 | 4288 | 29822.7
4k-randwrite-q128-j1 | 327 | 83782 | 1526.77
256k-randwrite-q128-j1 | 1807 | 7228 | 17700.8

# After patch:

[root@Node81 host3]# cat /sys/module/nvme_tcp/parameters/cpu_affinity_list
66-68

1862 root 0 -20 0 0 0 R 56.6 0.0 0:59.37 kworker/66:1H+nvme_tcp_wq
12348 root 20 0 476960 25892 19804 S 45.4 0.0 0:02.00 fio

pattern | bandwidth(MiB/s) | iops | latency(us)
:- | -: | -: | -:
4k-randread-q128-j1 | 451 | 115530 | 1107.1
256k-randread-q128-j1 | 1410 | 5641 | 22671.7
4k-randwrite-q128-j1 | 432 | 110738 | 1155.12
256k-randwrite-q128-j1 | 1818 | 7274 | 17591.4

4k-randread-q128-j1 means 4k randread, iodepth = 128, jobs = 1.

All tests use a single nvme-tcp io queue.

fio is bound to cpus 69-70 on numa node 2 like this:
taskset -c 69-70 fio --ioengine=libaio --numjobs=1 …...




2023-04-18 09:38:44

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity

> Hi Sagi,
>
> On Mon, Apr 17, 2023 at 04:33:40PM +0300, Sagi Grimberg wrote:
>>
>>>>> On Thu, Apr 13, 2023 at 09:29:41PM +0800, Li Feng wrote:
>>>>>> The default worker affinity policy is using all online cpus, e.g. from 0
>>>>>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
>>>>>> have a bad performance.
>>>>>
>>>>> Can you explain in detail how nvme-tcp performs worse in this situation?
>>>>>
>>>>> If some of CPUs are knows as busy, you can submit the nvme-tcp io jobs
>>>>> on other non-busy CPUs via taskset, or scheduler is supposed to choose
>>>>> proper CPUs for you. And usually nvme-tcp device should be saturated
>>>>> with limited io depth or jobs/cpus.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Ming
>>>>>
>>>>
>>>> Taskset can’t work on nvme-tcp io-queues, because the worker cpu has decided at the nvme-tcp ‘connect’ stage,
>>>> not the sending io stage. Assume there is only one io-queue, the binding cpu is CPU0, no matter io jobs
>>>> run other cpus.
>>>
>>> OK, looks the problem is on queue->io_cpu, see nvme_tcp_queue_request().
>>>
>>> But I am wondering why nvme-tcp doesn't queue the io work on the current
>>> cpu? And why is queue->io_cpu introduced? Given blk-mq defines cpu
>>> affinities for each hw queue, driver is supposed to submit IO request
>>> to hardware on the local CPU.
>>>
>>> Sagi and Guys, any ideas about introducing queue->io_cpu?
>>
>> Hey Ming,
>>
>> I have some vague memories wrt to this, but from what I recall:
>>
>> - The number of queues is dependent on both the controller and
>> the user (Not a reason/motivation on its own, just clarifying).
>>
>> - It simply matches what pci does (to some extent, outside of rx side
>> entropy that may exist), it just happens to take more cpu cycles due to
>> the network stack overhead.
>>
>> - I didn't want io threads to change CPUs because of RFS/aRFS
>> optimizations that people use, which allows the NIC to steer interrupts
>> (and napi context) to where the io thread is running, and thus minimize
>> latency due to improved locality. that on its own was shown to be worth
>> over 30% reduction.
>
> OK, sounds like one per-queue kthread model may work for this case, and
> the kthread cpu affinity can be wired with hw queue's cpu affinity, and
> scheduler may figure out perfect time to migrate kthread to proper CPUs,
> and task migration is supposed to happen much less frequently.

That is not my experience with the task scheduler.

>> - At some point nvme-tcp rx context used to run in softirq, and having
>> to synchronize different cores (on different numa nodes maybe, depends
>> on what RSS decided) when processing the socket resulted in high
>> latency as well. This is not the case today (due to some nics back then
>> that surfaced various issues with this) but it may be come back in
>> the future again (if shown to provide value).
>>
>> - Also today when there is a sync network send from .queue_rq path,
>> it is only executed when the running cpu == queue->io_cpu, to avoid high
>> contention. My concern is that if the io context is not bound to
>
> This one looks one optimization for send?

Yes.

> But this way actually causes contention with nvme_tcp_io_work().

Correct, but it is only taken if the current cpu matches the queue
io_cpu, so it is either contended locally or not at all with voluntary
preemption.

>> a specific cpu, it may create heavier contention on queue serialization.
>> Today there are at most 2 contexts that compete, io context (triggered from
>> network rx or scheduled in the submission path) and .queue_rq sync
>> network send path. I'd prefer to not have to introduce more contention with
>> increasing number of threads accessing an nvme controller.
>
> I understand contention doesn't have to require one fixed cpu bound with wq or
> kthread. Using single per-queue work or kthread can avoid contention completely.

I'd like to keep the optimization for the send path; always context-switching
to a kthread for I/O is expensive. So some contention will happen due to
serialization.

I opted to only take the send_mutex if the io context is scheduled on
the same CPU in order to not unconditionally take a mutex. If we don't
control where the queue io thread is scheduled, we will always attempt
to take send_mutex, which will increase the contention on it.
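
(For reference, the logic in question in nvme_tcp_queue_request() is
roughly the following; paraphrased from memory, not a verbatim copy:)

	empty = llist_add(&req->lentry, &queue->req_list) &&
		list_empty(&queue->send_list) && !queue->request;

	/*
	 * Try a direct send only if we already run on queue->io_cpu, so
	 * send_mutex is not taken unconditionally and any contention
	 * stays local to that cpu.
	 */
	if (queue->io_cpu == raw_smp_processor_id() &&
	    sync && empty && mutex_trylock(&queue->send_mutex)) {
		nvme_tcp_send_all(queue);
		mutex_unlock(&queue->send_mutex);
	}

	if (last && nvme_tcp_queue_more(queue))
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);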

> Here one tcp queue needs to handle both send & recv, and I guess all tcp sends
> need to be serialized, same with tcp recvs. Maybe two wq/kthreads, one
> is for send, the other is for recv? Or single wq/kthread is fine too if
> the two can be done in async style.

That is what is done today.

>
> Then the send_mutex can be saved, maybe nvme/tcp blocking can be removed.
>
>>
>> Having said that, I don't think there is a fundamental issue with
>> using queue_work, or queue_work_node(cur_cpu) or
>> queue_work_node(netdev_home_cpu), if that does not introduce
>> additional latency in the common cases. Although having io threads
>> bounce around is going to regress users that use RFS/aRFS...
>
> IMO, here the fundamental issue is that one fixed cpu(queue->io_cpu) is selected
> for handling IO submission aiming at same queue, which can't scale,

But a single queue is not designed to scale; we scale with multiple
queues.

> because we can't expect userspace or scheduler to reserve this fixed cpu for
> nvme_tcp queue.

Of course not, the queue io context is just another thread, which shares
the cpu with all other system threads.

The design in nvme-tcp is that we keep the io context local to the cpu
where the user thread is running. Naturally there are cases where one
may desire to place it on some reserved cpu, or on a sibling, or based
on other factors (like the nic).

So on the one hand, I'd like to preserve the queue <-> cpu locality and
keep io threads from bouncing around; on the other hand, I don't
want to expose all this for users to mangle with low-level stuff.

And I'm not particularly keen on troubleshooting users'
performance issues that end up requiring tweaks to finicky scheduler
knobs that are system-wide.

But I'm not opposed to changing this if it proves to improve the driver.

2023-04-19 09:39:17

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity


>> Hey Li,
>>
>>> The default worker affinity policy is using all online cpus, e.g. from 0
>>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
>>> have a bad performance.
>>> This patch adds a module parameter to set the cpu affinity for the nvme-tcp
>>> socket worker threads. The parameter is a comma separated list of CPU
>>> numbers. The list is parsed and the resulting cpumask is used to set the
>>> affinity of the socket worker threads. If the list is empty or the
>>> parsing fails, the default affinity is used.
>>
>> I can see how this may benefit a specific set of workloads, but I have a
>> few issues with this.
>>
>> - This is exposing a user interface for something that is really
>> internal to the driver.
>>
>> - This is something that can be misleading and could be tricky to get
>> right, my concern is that this would only benefit a very niche case.
> Our storage products needs this feature~
> If the user doesn’t know what this is, they can keep it default, so I thinks this is
> not unacceptable.

It doesn't work like that. A user interface is not something exposed to
a specific consumer.

>> - If the setting should exist, it should not be global.
> V2 has fixed it.
>>
>> - I prefer not to introduce new modparams.
>>
>> - I'd prefer to find a way to support your use-case without introducing
>> a config knob for it.
>>
> I’m looking forward to it.

If you change queue_work_on to queue_work, ignoring the io_cpu, does it
address your problem?
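
(Concretely that would be something like the below, e.g. at the tail of
nvme_tcp_queue_request(); just a sketch of the experiment, not a patch:)

-	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+	/* let the workqueue place io_work on any allowed cpu */
+	queue_work(nvme_tcp_wq, &queue->io_work);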

Not saying that this should be a solution though.

How many queues does your controller support that you happen to use
queue 0 ?

Also, what happens if you don't pin your process to a specific cpu, does
that change anything?

2023-04-25 08:33:07

by Li Feng

[permalink] [raw]
Subject: Re: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity

Hi Sagi,

On Wed, Apr 19, 2023 at 5:32 PM Sagi Grimberg <[email protected]> wrote:
>
>
> >> Hey Li,
> >>
> >>> The default worker affinity policy is using all online cpus, e.g. from 0
> >>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
> >>> have a bad performance.
> >>> This patch adds a module parameter to set the cpu affinity for the nvme-tcp
> >>> socket worker threads. The parameter is a comma separated list of CPU
> >>> numbers. The list is parsed and the resulting cpumask is used to set the
> >>> affinity of the socket worker threads. If the list is empty or the
> >>> parsing fails, the default affinity is used.
> >>
> >> I can see how this may benefit a specific set of workloads, but I have a
> >> few issues with this.
> >>
> >> - This is exposing a user interface for something that is really
> >> internal to the driver.
> >>
> >> - This is something that can be misleading and could be tricky to get
> >> right, my concern is that this would only benefit a very niche case.
> > Our storage products needs this feature~
> > If the user doesn’t know what this is, they can keep it default, so I thinks this is
> > not unacceptable.
>
> It doesn't work like that. A user interface is not something exposed to
> a specific consumer.
>
> >> - If the setting should exist, it should not be global.
> > V2 has fixed it.
> >>
> >> - I prefer not to introduce new modparams.
> >>
> >> - I'd prefer to find a way to support your use-case without introducing
> >> a config knob for it.
> >>
> > I’m looking forward to it.
>
> If you change queue_work_on to queue_work, ignoring the io_cpu, does it
> address your problem?
Sorry for the late response, I just got my machine back.
Replacing queue_work_on with queue_work looks like it gives a small
performance improvement.
The busy worker is `kworker/56:1H+nvme_tcp_wq`, while fio is bound to
cpu 90 ('cpus_allowed=90');
I don't know why worker 56 is selected.
The 256k read performance goes up from 1.15GB/s to 1.35GB/s.

>
> Not saying that this should be a solution though.
>
> How many queues does your controller support that you happen to use
> queue 0 ?
Our controller only supports one io queue currently.
>
> Also, what happens if you don't pin your process to a specific cpu, does
> that change anything?
If I don't pin the cpu, there is no effect on performance.

Thanks,
Li

2023-04-26 11:38:21

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity

On 4/25/23 10:32, Li Feng wrote:
> Hi Sagi,
>
> On Wed, Apr 19, 2023 at 5:32 PM Sagi Grimberg <[email protected]> wrote:
>>
>>
>>>> Hey Li,
>>>>
>>>>> The default worker affinity policy is using all online cpus, e.g. from 0
>>>>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
>>>>> have a bad performance.
>>>>> This patch adds a module parameter to set the cpu affinity for the nvme-tcp
>>>>> socket worker threads. The parameter is a comma separated list of CPU
>>>>> numbers. The list is parsed and the resulting cpumask is used to set the
>>>>> affinity of the socket worker threads. If the list is empty or the
>>>>> parsing fails, the default affinity is used.
>>>>
>>>> I can see how this may benefit a specific set of workloads, but I have a
>>>> few issues with this.
>>>>
>>>> - This is exposing a user interface for something that is really
>>>> internal to the driver.
>>>>
>>>> - This is something that can be misleading and could be tricky to get
>>>> right, my concern is that this would only benefit a very niche case.
>>> Our storage products needs this feature~
>>> If the user doesn’t know what this is, they can keep it default, so I thinks this is
>>> not unacceptable.
>>
>> It doesn't work like that. A user interface is not something exposed to
>> a specific consumer.
>>
>>>> - If the setting should exist, it should not be global.
>>> V2 has fixed it.
>>>>
>>>> - I prefer not to introduce new modparams.
>>>>
>>>> - I'd prefer to find a way to support your use-case without introducing
>>>> a config knob for it.
>>>>
>>> I’m looking forward to it.
>>
>> If you change queue_work_on to queue_work, ignoring the io_cpu, does it
>> address your problem?
> Sorry for the late response, I just got my machine back.
> Replace the queue_work_on to queue_work, looks like it has a little
> good performance.
> The busy worker is `kworker/56:1H+nvme_tcp_wq`, and fio binds to
> 90('cpus_allowed=90'),
> I don't know why the worker 56 is selected.
> The performance of 256k read up from 1.15GB/s to 1.35GB/s.
>
>>
>> Not saying that this should be a solution though.
>>
>> How many queues does your controller support that you happen to use
>> queue 0 ?
> Our controller only support one io queue currently.

Ouch.
Remember, NVMe gets most of its performance improvements from using
several queues and being able to bind the queues to cpu sets.
Exposing just one queue invalidates any assumptions we make,
and trying to improve interrupt steering won't work anyway.

I sincerely doubt we should try to 'optimize' for this rather peculiar
setup.

Cheers,

Hannes

2023-04-27 12:13:04

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity


> Hi Sagi,
>
> On Wed, Apr 19, 2023 at 5:32 PM Sagi Grimberg <[email protected]> wrote:
>>
>>
>>>> Hey Li,
>>>>
>>>>> The default worker affinity policy is using all online cpus, e.g. from 0
>>>>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
>>>>> have a bad performance.
>>>>> This patch adds a module parameter to set the cpu affinity for the nvme-tcp
>>>>> socket worker threads. The parameter is a comma separated list of CPU
>>>>> numbers. The list is parsed and the resulting cpumask is used to set the
>>>>> affinity of the socket worker threads. If the list is empty or the
>>>>> parsing fails, the default affinity is used.
>>>>
>>>> I can see how this may benefit a specific set of workloads, but I have a
>>>> few issues with this.
>>>>
>>>> - This is exposing a user interface for something that is really
>>>> internal to the driver.
>>>>
>>>> - This is something that can be misleading and could be tricky to get
>>>> right, my concern is that this would only benefit a very niche case.
>>> Our storage products needs this feature~
>>> If the user doesn’t know what this is, they can keep it default, so I thinks this is
>>> not unacceptable.
>>
>> It doesn't work like that. A user interface is not something exposed to
>> a specific consumer.
>>
>>>> - If the setting should exist, it should not be global.
>>> V2 has fixed it.
>>>>
>>>> - I prefer not to introduce new modparams.
>>>>
>>>> - I'd prefer to find a way to support your use-case without introducing
>>>> a config knob for it.
>>>>
>>> I’m looking forward to it.
>>
>> If you change queue_work_on to queue_work, ignoring the io_cpu, does it
>> address your problem?
> Sorry for the late response, I just got my machine back.
> Replace the queue_work_on to queue_work, looks like it has a little
> good performance.
> The busy worker is `kworker/56:1H+nvme_tcp_wq`, and fio binds to
> 90('cpus_allowed=90'),
> I don't know why the worker 56 is selected.
> The performance of 256k read up from 1.15GB/s to 1.35GB/s.

The question becomes what would be the impact for multi-threaded
workloads and different NIC/CPU/App placements... This is the
tricky part of touching this stuff.

>> Not saying that this should be a solution though.
>>
>> How many queues does your controller support that you happen to use
>> queue 0 ?
> Our controller only support one io queue currently.

I don't think I ever heard of a fabrics controller that supports
a single io queue.

>>
>> Also, what happens if you don't pin your process to a specific cpu, does
>> that change anything?
> If I don't pin the cpu, the performance has no effect.

Which again, makes this optimization point a niche.

2023-04-27 12:26:35

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity


>>> Not saying that this should be a solution though.
>>>
>>> How many queues does your controller support that you happen to use
>>> queue 0 ?
>> Our controller only support one io queue currently.
>
> Ouch.
> Remember, NVMe gets most of the performance improvements by using
> several queues, and be able to bind the queues to cpu sets.
> Exposing just one queue will be invalidating any assumptions we do,
> and trying to improve interrupt steering won't work anyway.
>
> I sincerely doubt we should try to 'optimize' for this rather peculiar
> setup.

I tend to agree. This is not a common setup, and I'm not particularly
interested in exporting something dedicated in the driver for fiddling
with it...

2023-04-27 14:39:59

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity

On Wed, Apr 26, 2023 at 7:34 PM Hannes Reinecke <[email protected]> wrote:
>
> On 4/25/23 10:32, Li Feng wrote:
> > Hi Sagi,
> >
> > On Wed, Apr 19, 2023 at 5:32 PM Sagi Grimberg <[email protected]> wrote:
> >>
> >>
> >>>> Hey Li,
> >>>>
> >>>>> The default worker affinity policy is using all online cpus, e.g. from 0
> >>>>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
> >>>>> have a bad performance.
> >>>>> This patch adds a module parameter to set the cpu affinity for the nvme-tcp
> >>>>> socket worker threads. The parameter is a comma separated list of CPU
> >>>>> numbers. The list is parsed and the resulting cpumask is used to set the
> >>>>> affinity of the socket worker threads. If the list is empty or the
> >>>>> parsing fails, the default affinity is used.
> >>>>
> >>>> I can see how this may benefit a specific set of workloads, but I have a
> >>>> few issues with this.
> >>>>
> >>>> - This is exposing a user interface for something that is really
> >>>> internal to the driver.
> >>>>
> >>>> - This is something that can be misleading and could be tricky to get
> >>>> right, my concern is that this would only benefit a very niche case.
> >>> Our storage products needs this feature~
> >>> If the user doesn’t know what this is, they can keep it default, so I thinks this is
> >>> not unacceptable.
> >>
> >> It doesn't work like that. A user interface is not something exposed to
> >> a specific consumer.
> >>
> >>>> - If the setting should exist, it should not be global.
> >>> V2 has fixed it.
> >>>>
> >>>> - I prefer not to introduce new modparams.
> >>>>
> >>>> - I'd prefer to find a way to support your use-case without introducing
> >>>> a config knob for it.
> >>>>
> >>> I’m looking forward to it.
> >>
> >> If you change queue_work_on to queue_work, ignoring the io_cpu, does it
> >> address your problem?
> > Sorry for the late response, I just got my machine back.
> > Replace the queue_work_on to queue_work, looks like it has a little
> > good performance.
> > The busy worker is `kworker/56:1H+nvme_tcp_wq`, and fio binds to
> > 90('cpus_allowed=90'),
> > I don't know why the worker 56 is selected.
> > The performance of 256k read up from 1.15GB/s to 1.35GB/s.
> >
> >>
> >> Not saying that this should be a solution though.
> >>
> >> How many queues does your controller support that you happen to use
> >> queue 0 ?
> > Our controller only support one io queue currently.
>
> Ouch.
> Remember, NVMe gets most of the performance improvements by using
> several queues, and be able to bind the queues to cpu sets.
> Exposing just one queue will be invalidating any assumptions we do,
> and trying to improve interrupt steering won't work anyway.
>
> I sincerely doubt we should try to 'optimize' for this rather peculiar
> setup.

Queues aren't free, and they consume both host and device resources,
especially since blk-mq uses a static mapping.

Also, it may depend on how the application uses nvme/tcp; for example,
io_uring may saturate one device easily with very few tasks/queues
(one or two, or a little more, depending on the device and driver implementation).

Thanks,
Ming