2022-04-22 20:06:26

by 许春光

Subject: [RFC PATCH] nvme-pci: allowed to modify IRQ affinity in latency sensitive scenarios

From: Chunguang Xu <[email protected]>

In most cases, setting the affinity through managed IRQs is the better
choice. But in some scenarios that use isolcpus, such as DPDK, managed
IRQs do not distinguish between housekeeping CPUs and isolated CPUs
when selecting a target CPU. As a result, IO interrupts triggered by a
housekeeping CPU can be routed to an isolated CPU and disturb the tasks
running there. Commit 11ea68f553e2 ("genirq, sched/isolation: Isolate
from handling managed interrupts") tries to fix this in a best-effort
way. However, in a real production environment, latency-sensitive
workloads need a more deterministic result. So, similar to the mpt3sas
driver, we could add a module parameter smp_affinity_enable to the NVMe
driver.

By default, we use managed IRQs. When smp_affinity_enable is set to 0,
we allocate normal interrupts for NVMe. Users can then modify the
interrupt affinity according to their actual needs when managed IRQs
cannot satisfy them. This method is not a good choice in most scenarios,
but for users who know what they are doing, it may be better than not
being able to do anything.

Signed-off-by: Chunguang Xu <[email protected]>
---
drivers/nvme/host/pci.c | 30 +++++++++++++++++++++++++++---
1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3aacf1c0d5a5..f8fd591b1839 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -74,6 +74,10 @@ static unsigned int io_queue_depth = 1024;
 module_param_cb(io_queue_depth, &io_queue_depth_ops, &io_queue_depth, 0644);
 MODULE_PARM_DESC(io_queue_depth, "set io queue depth, should >= 2 and < 4096");

+static unsigned int smp_affinity_enable = 1;
+module_param(smp_affinity_enable, uint, 0644);
+MODULE_PARM_DESC(smp_affinity_enable, "SMP affinity feature enable/disable Default: enable(1)");
+
 static int io_queue_count_set(const char *val, const struct kernel_param *kp)
 {
 	unsigned int n;
@@ -471,7 +475,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 		 * affinity), so use the regular blk-mq cpu mapping
 		 */
 		map->queue_offset = qoff;
-		if (i != HCTX_TYPE_POLL && offset)
+		if (i != HCTX_TYPE_POLL && offset && smp_affinity_enable)
 			blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
 		else
 			blk_mq_map_queues(map);
@@ -2293,7 +2297,11 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 		.calc_sets	= nvme_calc_irq_sets,
 		.priv		= dev,
 	};
+	struct irq_affinity *p_affd = &affd;
 	unsigned int irq_queues, poll_queues;
+	unsigned int flags = PCI_IRQ_ALL_TYPES;
+	unsigned int affvecs;
+	int nr_irqs;

 	/*
 	 * Poll queues don't need interrupts, but we need at least one I/O queue
@@ -2317,8 +2325,24 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 	irq_queues = 1;
 	if (!(dev->ctrl.quirks & NVME_QUIRK_SINGLE_VECTOR))
 		irq_queues += (nr_io_queues - poll_queues);
-	return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues,
-	      PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
+
+	if (smp_affinity_enable)
+		flags |= PCI_IRQ_AFFINITY;
+	else
+		p_affd = NULL;
+
+	nr_irqs = pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues, flags, p_affd);
+
+	if (nr_irqs > 0 && !smp_affinity_enable) {
+		if (nr_irqs > affd.pre_vectors)
+			affvecs = nr_irqs - affd.pre_vectors;
+		else
+			affvecs = 0;
+
+		nvme_calc_irq_sets(&affd, affvecs);
+	}
+
+	return nr_irqs;
 }

 static void nvme_disable_io_queues(struct nvme_dev *dev)
--
2.30.0
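
For readers wondering what "modify the interrupt affinity" looks like from userspace once the vectors are no longer managed: the usual interface is the per-IRQ file /proc/irq/<N>/smp_affinity_list. The following is only a minimal userspace sketch under that assumption, not part of the patch. The helper names irq_affinity_path() and set_irq_affinity() are hypothetical, and the IRQ number has to be looked up separately (for example in /proc/interrupts).

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "/proc/irq/<irq>/smp_affinity_list" into buf.
 * Returns 0 on success, -1 if the path does not fit into len bytes.
 */
static int irq_affinity_path(char *buf, size_t len, unsigned int irq)
{
	int n = snprintf(buf, len, "/proc/irq/%u/smp_affinity_list", irq);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write a CPU list such as "0-3" to the affinity
 * file of the given IRQ. Needs root. The kernel is expected to reject
 * the write for managed interrupts, which is the limitation the patch
 * description is about.
 */
static int set_irq_affinity(unsigned int irq, const char *cpulist)
{
	char path[64];
	FILE *f;
	int ret = 0;

	if (irq_affinity_path(path, sizeof(path), irq))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fputs(cpulist, f) == EOF)
		ret = -1;
	/* The write is flushed on close, so a failure may only show up here. */
	if (fclose(f) == EOF)
		ret = -1;

	return ret;
}
```

With smp_affinity_enable=0, a call such as set_irq_affinity(irq, "0-3") would pin one NVMe vector to housekeeping CPUs 0-3; the CPU list here is only an example value.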


2022-04-23 07:13:03

by Christoph Hellwig

Subject: Re: [RFC PATCH] nvme-pci: allowed to modify IRQ affinity in latency sensitive scenarios

On Fri, Apr 22, 2022 at 06:58:26PM +0800, brookxu.cn wrote:
> From: Chunguang Xu <[email protected]>
>
> In most cases, setting the affinity through managed IRQs is the better
> choice. But in some scenarios that use isolcpus, such as DPDK, managed
> IRQs do not distinguish between housekeeping CPUs and isolated CPUs
> when selecting a target CPU. As a result, IO interrupts triggered by a
> housekeeping CPU can be routed to an isolated CPU and disturb the tasks
> running there. Commit 11ea68f553e2 ("genirq, sched/isolation: Isolate
> from handling managed interrupts") tries to fix this in a best-effort
> way. However, in a real production environment, latency-sensitive
> workloads need a more deterministic result. So, similar to the mpt3sas
> driver, we could add a module parameter smp_affinity_enable to the NVMe
> driver.

This kind of boilerplate code in random drivers is not sustainable.

I really think we need to handle this whole housekeeping CPU case in
common code. That is, designate CPUs as housekeeping vs non-housekeeping
and let the generic affinity assignment code deal with it and solve
it for all drivers using the proper affinity masks, instead of having
random slight overrides in all drivers anyone ever wants to use in
such a system.
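
To make the "proper affinity masks" idea concrete, here is a small userspace sketch of a best-effort mask filter of the kind the thread attributes to commit 11ea68f553e2: prefer the housekeeping subset of an interrupt's affinity mask, and fall back to the full mask only when no housekeeping CPU in it is online. This is an illustration under assumptions, not kernel code. cpu_mask_t and pick_irq_targets() are hypothetical names, and a 64-bit word stands in for the kernel's cpumask.

```c
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: bit N set means CPU N is in the set. */
typedef uint64_t cpu_mask_t;

/*
 * Hypothetical helper: choose the CPUs an interrupt may target.
 *
 * affinity     - CPUs assigned to this vector by the affinity spreading
 * housekeeping - CPUs that are not isolated
 * online       - CPUs currently online
 *
 * Prefer the housekeeping part of the affinity mask. If none of those
 * CPUs is online, fall back to the unfiltered affinity mask so that the
 * interrupt still has somewhere to go (best effort, not a guarantee).
 */
static cpu_mask_t pick_irq_targets(cpu_mask_t affinity,
				   cpu_mask_t housekeeping,
				   cpu_mask_t online)
{
	cpu_mask_t preferred = affinity & housekeeping;

	if (preferred & online)
		return preferred;

	return affinity;
}
```

The fallback branch is where the determinism concern comes from: a vector whose affinity mask contains only isolated CPUs still ends up on an isolated CPU. Handling that in common code would mean choosing the per-vector masks with the housekeeping set in mind in the first place, rather than filtering afterwards.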