2018-02-28 15:50:09

by jianchao.wang

Subject: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

Currently, adminq and ioq0 share the same irq vector. This is
unfair to both adminq and ioq0.
- For adminq, its completion irq has to be bound to cpu0 even
though it has only one hw queue, which is unreasonable.
- For ioq0, when the irq fires for an io completion, the adminq
irq action on this irq vector introduces at least an uncached
access to the adminq cqe, and it gets even worse when the adminq
is busy.

To improve this, allocate separate irq vectors for adminq and
ioq0, and do not set irq affinity for the adminq one. If there is
just one irq vector, set up adminq + 1 ioq and let them share it.
In addition, add a new helper interface nvme_ioq_vector to get the
ioq irq vector.

V1->V2
- add case to handle the scenario where there is only one irq
vector
- add nvme_ioq_vector to map ioq vector and qid

Signed-off-by: Jianchao Wang <[email protected]>
---
drivers/nvme/host/pci.c | 30 ++++++++++++++++++++++--------
1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 73036d2..273b157 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -84,6 +84,7 @@ struct nvme_dev {
 	struct dma_pool *prp_small_pool;
 	unsigned online_queues;
 	unsigned max_qid;
+	unsigned int num_vecs;
 	int q_depth;
 	u32 db_stride;
 	void __iomem *bar;
@@ -139,6 +140,17 @@ static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl)
 	return container_of(ctrl, struct nvme_dev, ctrl);
 }
 
+static inline unsigned int nvme_ioq_vector(struct nvme_dev *dev,
+		unsigned int qid)
+{
+	/*
+	 * If controller has only legacy or single-message MSI, there will
+	 * be only 1 irq vector. At the moment, we setup adminq + 1 ioq
+	 * and let them share irq vector.
+	 */
+	return (dev->num_vecs == 1) ? 0 : qid;
+}
+
 /*
  * An NVM Express queue. Each device has at least two (one for admin
  * commands and one for I/O commands).
@@ -1456,7 +1468,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 		nvmeq->sq_cmds_io = dev->cmb + offset;
 	}
 
-	nvmeq->cq_vector = qid - 1;
+	nvmeq->cq_vector = nvme_ioq_vector(dev, qid);
 	result = adapter_alloc_cq(dev, qid, nvmeq);
 	if (result < 0)
 		return result;
@@ -1626,9 +1638,9 @@ static int nvme_create_io_queues(struct nvme_dev *dev)
 	int ret = 0;
 
 	for (i = dev->ctrl.queue_count; i <= dev->max_qid; i++) {
-		/* vector == qid - 1, match nvme_create_queue */
 		if (nvme_alloc_queue(dev, i, dev->q_depth,
-				pci_irq_get_node(to_pci_dev(dev->dev), i - 1))) {
+				pci_irq_get_node(to_pci_dev(dev->dev),
+						nvme_ioq_vector(dev, i)))) {
 			ret = -ENOMEM;
 			break;
 		}
@@ -1909,6 +1921,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	int result, nr_io_queues;
 	unsigned long size;
+	struct irq_affinity affd = {.pre_vectors = 1};
+	int ret;
 
 	nr_io_queues = num_present_cpus();
 	result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
@@ -1945,12 +1959,12 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * setting up the full range we need.
 	 */
 	pci_free_irq_vectors(pdev);
-	nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
-			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
-	if (nr_io_queues <= 0)
+	ret = pci_alloc_irq_vectors_affinity(pdev, 1, (nr_io_queues + 1),
+			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
+	if (ret <= 0)
 		return -EIO;
-	dev->max_qid = nr_io_queues;
-
+	dev->num_vecs = ret;
+	dev->max_qid = (ret > 1) ? (ret - 1) : 1;
 	/*
 	 * Should investigate if there's a performance win from allocating
 	 * more queues than interrupt vectors; it might allow the submission
--
2.7.4



2018-02-28 16:01:16

by Andy Shevchenko

Subject: Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

On Wed, Feb 28, 2018 at 5:48 PM, Jianchao Wang
<[email protected]> wrote:
> Currently, adminq and ioq0 share the same irq vector. This is
> unfair to both adminq and ioq0.
> - For adminq, its completion irq has to be bound to cpu0 even
> though it has only one hw queue, which is unreasonable.
> - For ioq0, when the irq fires for an io completion, the adminq
> irq action on this irq vector introduces at least an uncached
> access to the adminq cqe, and it gets even worse when the adminq
> is busy.
>
> To improve this, allocate separate irq vectors for adminq and
> ioq0, and do not set irq affinity for the adminq one. If there is
> just one irq vector, set up adminq + 1 ioq and let them share it.
> In addition, add a new helper interface nvme_ioq_vector to get the
> ioq irq vector.

> +static inline unsigned int nvme_ioq_vector(struct nvme_dev *dev,
> + unsigned int qid)
> +{
> + /*
> + * If controller has only legacy or single-message MSI, there will
> + * be only 1 irq vector. At the moment, we setup adminq + 1 ioq
> + * and let them share irq vector.
> + */
> + return (dev->num_vecs == 1) ? 0 : qid;

Redundant parens.

> +}

>
> for (i = dev->ctrl.queue_count; i <= dev->max_qid; i++) {
> - /* vector == qid - 1, match nvme_create_queue */

> if (nvme_alloc_queue(dev, i, dev->q_depth,
> - pci_irq_get_node(to_pci_dev(dev->dev), i - 1))) {
> + pci_irq_get_node(to_pci_dev(dev->dev),
> + nvme_ioq_vector(dev, i)))) {

Perhaps better to introduce a temporary variable to make it readable?

> ret = -ENOMEM;
> break;
> }

> + ret = pci_alloc_irq_vectors_affinity(pdev, 1, (nr_io_queues + 1),
> + PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
> + if (ret <= 0)
> return -EIO;
> - dev->max_qid = nr_io_queues;
> -
> + dev->num_vecs = ret;
> + dev->max_qid = (ret > 1) ? (ret - 1) : 1;

I do not see how ret can possibly be < 1 here.

Thus, the logic looks like this:
if ret >= 2 => return ret - 1; // Possible variants [1..ret - 1]
if ret == 1 => return 1;

So, for ret == 1 or ret == 2 we still use 1.

Is it by design?

Can it be written like

dev->max_qid = max(ret - 1, 1);

--
With Best Regards,
Andy Shevchenko

2018-02-28 16:48:47

by Christoph Hellwig

Subject: Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

Note that we originally allocated irqs this way, and Keith changed
it a while ago for good reasons. So I'd really like to see good
reasons for moving away from this, and some heuristics to figure
out which way to use. E.g. if the device supports more irqs than
I/O queues your scheme might always be fine.

2018-03-01 09:30:22

by Sagi Grimberg

Subject: Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0


> Note that we originally allocated irqs this way, and Keith changed
> it a while ago for good reasons. So I'd really like to see good
> reasons for moving away from this, and some heuristics to figure
> out which way to use. E.g. if the device supports more irqs than
> I/O queues your scheme might always be fine.

I still don't understand what this buys us in practice. Seems redundant
to allocate another vector without any (even marginal) difference.

2018-03-01 10:08:26

by jianchao.wang

Subject: Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

Hi Sagi

Thanks for your kind response.

On 03/01/2018 05:28 PM, Sagi Grimberg wrote:
>
>> Note that we originally allocated irqs this way, and Keith changed
>> it a while ago for good reasons.  So I'd really like to see good
>> reasons for moving away from this, and some heuristics to figure
>> out which way to use.  E.g. if the device supports more irqs than
>> I/O queues your scheme might always be fine.
>
> I still don't understand what this buys us in practice. Seems redundant
> to allocate another vector without any (even marginal) difference.
>

Even when the adminq is idle, the ioq0 irq completion path has to invoke nvme_irq twice: once for
itself and once for the adminq completion irq action.
We are trying to save every cpu cycle across the nvme host path, so why waste nvme_irq cycles here?
If we have enough vectors, we could allocate another irq vector for the adminq to avoid this.
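
Just to illustrate, a simplified sketch (not the actual driver code; adminq
and ioq1 here stand in for the two nvme_queue pointers). Today both queues
register an action on the same vector:

	pci_request_irq(pdev, 0, nvme_irq, NULL, adminq, "nvme%dq%d", 0, 0);
	pci_request_irq(pdev, 0, nvme_irq, NULL, ioq1, "nvme%dq%d", 0, 1);

	/*
	 * When vector 0 fires for an I/O completion, the irq core walks
	 * both actions:
	 *   nvme_irq(irq, adminq) - reads the adminq CQ and finds nothing
	 *   nvme_irq(irq, ioq1)   - handles the real completion
	 * The first call is the wasted (and uncached) adminq cqe access.
	 */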

Sincerely
Jianchao

2018-03-01 15:05:28

by Ming Lei

Subject: Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

On Wed, Feb 28, 2018 at 05:47:26PM +0100, Christoph Hellwig wrote:
> Note that we originally allocated irqs this way, and Keith changed
> it a while ago for good reasons. So I'd really like to see good
> reasons for moving away from this, and some heuristics to figure
> out which way to use. E.g. if the device supports more irqs than
> I/O queues your scheme might always be fine.

If all CPUs for the 1st IRQ vector of the admin queue are offline, then I
guess NVMe can't work any more.

So it looks like a good idea to make the admin queue's IRQ vector a
non-managed IRQ.

Thanks,
Ming

2018-03-01 15:16:15

by Keith Busch

Subject: Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

On Thu, Mar 01, 2018 at 06:05:53PM +0800, jianchao.wang wrote:
> Even when the adminq is idle, the ioq0 irq completion path has to invoke nvme_irq twice:
> once for itself and once for the adminq completion irq action.

Let's be a little more careful on the terminology when referring to spec
defined features: there is no such thing as "ioq0". The IO queues start
at 1. The admin queue is the '0' index queue.

> We are trying to save every cpu cycle across the nvme host path, so why waste nvme_irq cycles here?
> If we have enough vectors, we could allocate another irq vector for the adminq to avoid this.

Please understand the _overwhelming_ majority of time spent for IRQ
handling is the context switches. There's a reason you're not able to
measure a perf difference between IOQ1 and IOQ2: the number of CPU cycles
to chain a second action is negligible.

2018-03-01 16:09:46

by Keith Busch

Subject: Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

On Thu, Mar 01, 2018 at 11:03:30PM +0800, Ming Lei wrote:
> If all CPUs for the 1st IRQ vector of the admin queue are offline, then I
> guess NVMe can't work any more.

Yikes, with respect to admin commands, it appears you're right if your
system allows offlining CPU0.

> So it looks like a good idea to make the admin queue's IRQ vector a
> non-managed IRQ.

It'd still be considered managed even if it's a 'pre_vector', though
it would get the default mask with all possible CPUs.
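
For reference, the setup in question looks roughly like this (a sketch of
what the proposed patch does, not the verbatim hunk):

	struct irq_affinity affd = {
		.pre_vectors = 1,	/* vector 0: admin queue */
	};

	ret = pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_queues + 1,
			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);

	/*
	 * Vector 0 is skipped by the affinity spreading and just gets the
	 * default mask, so it is still a managed vector but is not pinned
	 * to CPU0.
	 */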

2018-03-02 03:54:43

by jianchao.wang

Subject: Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

Hi Christoph

Thanks for your kind response and guidance.

On 03/01/2018 12:47 AM, Christoph Hellwig wrote:
> Note that we originally allocated irqs this way, and Keith changed
> it a while ago for good reasons. So I'd really like to see good
> reasons for moving away from this, and some heuristics to figure
> out which way to use. E.g. if the device supports more irqs than
> I/O queues your scheme might always be fine.

Maybe we could add logic so that when we get enough irq vectors we assign a
separate irq vector to the adminq, and otherwise fall back to sharing.

Sincerely
Jianchao


2018-03-02 03:55:45

by jianchao.wang

Subject: Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

Hi Keith

Thanks for your kind guidance and for your precious time on this.

On 03/01/2018 11:15 PM, Keith Busch wrote:
> On Thu, Mar 01, 2018 at 06:05:53PM +0800, jianchao.wang wrote:
>> Even when the adminq is idle, the ioq0 irq completion path has to invoke nvme_irq twice:
>> once for itself and once for the adminq completion irq action.
>
> Let's be a little more careful on the terminology when referring to spec
> defined features: there is no such thing as "ioq0". The IO queues start
> at 1. The admin queue is the '0' index queue.

Yes, indeed, sorry for my bad description.

>> We are trying to save every cpu cycle across the nvme host path, so why waste nvme_irq cycles here?
>> If we have enough vectors, we could allocate another irq vector for the adminq to avoid this.
>
> Please understand the _overwhelming_ majority of time spent for IRQ
> handling is the context switches. There's a reason you're not able to
> measure a perf difference between IOQ1 and IOQ2: the number of CPU cycles
> to chain a second action is negligible.
>

Yes, indeed

Sincerely
Jianchao

2018-03-02 04:28:58

by jianchao.wang

Subject: Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

Hi Andy

Thanks for your precious time on this and for the kind reminder.

On 02/28/2018 11:59 PM, Andy Shevchenko wrote:
> On Wed, Feb 28, 2018 at 5:48 PM, Jianchao Wang
> <[email protected]> wrote:
>> Currently, adminq and ioq0 share the same irq vector. This is
>> unfair to both adminq and ioq0.
>> - For adminq, its completion irq has to be bound to cpu0 even
>> though it has only one hw queue, which is unreasonable.
>> - For ioq0, when the irq fires for an io completion, the adminq
>> irq action on this irq vector introduces at least an uncached
>> access to the adminq cqe, and it gets even worse when the adminq
>> is busy.
>>
>> To improve this, allocate separate irq vectors for adminq and
>> ioq0, and do not set irq affinity for the adminq one. If there is
>> just one irq vector, set up adminq + 1 ioq and let them share it.
>> In addition, add a new helper interface nvme_ioq_vector to get the
>> ioq irq vector.
>
>> +static inline unsigned int nvme_ioq_vector(struct nvme_dev *dev,
>> + unsigned int qid)
>> +{
>> + /*
>> + * If controller has only legacy or single-message MSI, there will
>> + * be only 1 irq vector. At the moment, we setup adminq + 1 ioq
>> + * and let them share irq vector.
>> + */
>> + return (dev->num_vecs == 1) ? 0 : qid;
>
> Redundant parens.

Yes, but the parens make it clearer.

>
>> +}
>
>>
>> for (i = dev->ctrl.queue_count; i <= dev->max_qid; i++) {
>> - /* vector == qid - 1, match nvme_create_queue */
>
>> if (nvme_alloc_queue(dev, i, dev->q_depth,
>> - pci_irq_get_node(to_pci_dev(dev->dev), i - 1))) {
>> + pci_irq_get_node(to_pci_dev(dev->dev),
>> + nvme_ioq_vector(dev, i)))) {
>
> Perhaps better to introduce a temporary variable to make it readable?

Yes, indeed.

>
>> ret = -ENOMEM;
>> break;
>> }
>
>> + ret = pci_alloc_irq_vectors_affinity(pdev, 1, (nr_io_queues + 1),
>> + PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
>> + if (ret <= 0)
>> return -EIO;
>> - dev->max_qid = nr_io_queues;
>> -
>> + dev->num_vecs = ret;
>> + dev->max_qid = (ret > 1) ? (ret - 1) : 1;
>
> I do not see how ret can possibly be < 1 here.
>
> Thus, the logic looks like this:
> if ret >= 2 => return ret - 1; // Possible variants [1..ret - 1]
> if ret == 1 => return 1;
>
> So, for ret == 1 or ret == 2 we still use 1.
>
> Is it by design?
>
> Can it be written like
>
> dev->max_qid = max(ret - 1, 1);
>

Yes, that reads more clearly.
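
For V3 I am thinking of something along these lines (a rough sketch,
untested), with a local variable for the vector and the max() form:

	for (i = dev->ctrl.queue_count; i <= dev->max_qid; i++) {
		unsigned int vector = nvme_ioq_vector(dev, i);

		if (nvme_alloc_queue(dev, i, dev->q_depth,
				pci_irq_get_node(to_pci_dev(dev->dev), vector))) {
			ret = -ENOMEM;
			break;
		}
	}

and in nvme_setup_io_queues():

	dev->num_vecs = ret;
	dev->max_qid = max(ret - 1, 1);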

Thanks
Jianchao

2018-03-08 07:43:49

by Christoph Hellwig

Subject: Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

On Thu, Mar 01, 2018 at 09:10:42AM -0700, Keith Busch wrote:
> On Thu, Mar 01, 2018 at 11:03:30PM +0800, Ming Lei wrote:
> > If all CPUs for the 1st IRQ vector of the admin queue are offline, then I
> > guess NVMe can't work any more.
>
> Yikes, with respect to admin commands, it appears you're right if your
> system allows offlining CPU0.
>
> > So it looks like a good idea to make the admin queue's IRQ vector a
> > non-managed IRQ.
>
> It'd still be considered managed even if it's a 'pre_vector', though
> it would get the default mask with all possible CPUs.

Which basically does the right thing. So I suspect we'll need to
go with a patch like this, just with a way better changelog.

2018-03-09 17:24:12

by Keith Busch

Subject: Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

On Thu, Mar 08, 2018 at 08:42:20AM +0100, Christoph Hellwig wrote:
>
> So I suspect we'll need to go with a patch like this, just with a way
> better changelog.

I have to agree this is required for that use case. I'll run some
quick tests and propose an alternate changelog.

Longer term, the current way we're including offline present cpus either
(a) has the driver allocate resources it can't use or (b) spreads the
ones it can use thinner than they need to be. Why don't we rerun the
irq spread under a hot cpu notifier for only online CPUs?

2018-03-12 09:10:45

by Ming Lei

Subject: Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

On Fri, Mar 09, 2018 at 10:24:45AM -0700, Keith Busch wrote:
> On Thu, Mar 08, 2018 at 08:42:20AM +0100, Christoph Hellwig wrote:
> >
> > So I suspect we'll need to go with a patch like this, just with a way
> > better changelog.
>
> I have to agree this is required for that use case. I'll run some
> quick tests and propose an alternate changelog.
>
> Longer term, the current way we're including offline present cpus either
> (a) has the driver allocate resources it can't use or (b) spreads the
> ones it can use thinner than they need to be. Why don't we rerun the
> irq spread under a hot cpu notifier for only online CPUs?

Commit 4b855ad371 ("blk-mq: Create hctx for each present CPU") removed the
handling of mapping changes via the hot cpu notifier. Not only was the
code cleaned up, it also fixed a very complicated queue dependency issue:

- a loop/dm-rq queue depends on its underlying queue
- for NVMe, the IO queues depend on the admin queue

If freezing the queue can be avoided in the CPU notifier, rerunning the
spread there should be fine; otherwise it needs to be avoided.

Thanks,
Ming

2018-03-12 18:59:09

by Keith Busch

Subject: Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

Hi Jianchao,

The patch tests fine on all hardware I had. I'd like to queue this up
for the next 4.16-rc. Could you send a v3 with the cleanup changes Andy
suggested and a changelog aligned with Ming's insights?

Thanks,
Keith

2018-03-13 01:49:04

by jianchao.wang

Subject: Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

Hi Keith

Thanks for your precious time spent testing and reviewing.
I will send out V3 next.

Sincerely
Jianchao

On 03/13/2018 02:59 AM, Keith Busch wrote:
> Hi Jianchao,
>
> The patch tests fine on all hardware I had. I'd like to queue this up
> for the next 4.16-rc. Could you send a v3 with the cleanup changes Andy
> suggested and a changelog aligned with Ming's insights?
>
> Thanks,
> Keith
>