LinuxLists.cc - [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0

2018-02-27 08:56:31

Subject: [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0

Currently, adminq and ioq0 share the same irq vector. This is
unfair for both adminq and ioq0.
- For adminq, its completion irq has to be bound on cpu0.
- For ioq0, when the irq fires for io completion, the adminq irq
action has to be checked also.

To improve this, allocate separate irq vectors for adminq and
ioq0, and do not set irq affinity for the adminq one.

Signed-off-by: Jianchao Wang <[email protected]>
---
drivers/nvme/host/pci.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 73036d2..7f421b7 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1456,7 +1456,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 		nvmeq->sq_cmds_io = dev->cmb + offset;
 	}

-	nvmeq->cq_vector = qid - 1;
+	nvmeq->cq_vector = qid;
 	result = adapter_alloc_cq(dev, qid, nvmeq);
 	if (result < 0)
 		return result;
@@ -1909,6 +1909,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	int result, nr_io_queues;
 	unsigned long size;
+	struct irq_affinity affd = {.pre_vectors = 1};
+	int ret;

 	nr_io_queues = num_present_cpus();
 	result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
@@ -1945,11 +1947,11 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * setting up the full range we need.
 	 */
 	pci_free_irq_vectors(pdev);
-	nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
-			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
-	if (nr_io_queues <= 0)
+	ret = pci_alloc_irq_vectors_affinity(pdev, 1, (nr_io_queues + 1),
+			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
+	if (ret <= 0)
 		return -EIO;
-	dev->max_qid = nr_io_queues;
+	dev->max_qid = ret - 1;

 	/*
 	 * Should investigate if there's a performance win from allocating
--
2.7.4
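
For readers following the first hunk, the queue-to-vector mapping before
and after the patch can be sketched as below. This is a user-space
illustration only: the helper names are hypothetical and not part of the
driver, and it assumes the admin queue (qid 0) uses vector 0, which is
what the change log implies. Note that the thread calls the first IO
queue (qid 1) "ioq0".

```c
/* Hypothetical helpers illustrating the cq_vector assignment in the
 * first hunk of the patch. qid 0 is the admin queue; qid >= 1 are IO
 * queues. Assumption: the admin queue always uses vector 0. */

/* Before the patch: IO queue qid uses vector qid - 1, so the first IO
 * queue (qid 1) lands on vector 0 together with the admin queue. */
static int cq_vector_before(int qid)
{
	return qid == 0 ? 0 : qid - 1;
}

/* After the patch: every queue uses the vector equal to its qid, so
 * vector 0 is left to the admin queue alone. */
static int cq_vector_after(int qid)
{
	return qid;
}
```

With these helpers, cq_vector_before(0) and cq_vector_before(1) are both
0, while cq_vector_after(0) and cq_vector_after(1) differ.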

2018-02-27 16:38:12

by Keith Busch

Subject: Re: [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0

On Tue, Feb 27, 2018 at 04:46:17PM +0800, Jianchao Wang wrote:
> Currently, adminq and ioq0 share the same irq vector. This is
> unfair for both adminq and ioq0.
> - For adminq, its completion irq has to be bound on cpu0.
> - For ioq0, when the irq fires for io completion, the adminq irq
> action has to be checked also.

This change log could use some improvements. Why is it bad if admin
interrupts affinity is with cpu0?

Are you able to measure _any_ performance difference on IO queue 1 vs IO
queue 2 that you can attribute to IO queue 1's sharing vector 0?

> @@ -1945,11 +1947,11 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>  	 * setting up the full range we need.
>  	 */
>  	pci_free_irq_vectors(pdev);
> -	nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
> -			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
> -	if (nr_io_queues <= 0)
> +	ret = pci_alloc_irq_vectors_affinity(pdev, 1, (nr_io_queues + 1),
> +			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
> +	if (ret <= 0)
>  		return -EIO;
> -	dev->max_qid = nr_io_queues;
> +	dev->max_qid = ret - 1;

So controllers that have only legacy or single-message MSI don't get any
IO queues?
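
The arithmetic behind this question can be sketched as follows. It is a
user-space illustration of the two max_qid assignments in the diff, not
driver code; the function names are hypothetical, and nr_vectors stands
for the positive return value of the vector allocation call.

```c
/* Hypothetical model of the max_qid computation in the quoted hunk.
 * nr_vectors is the number of irq vectors actually allocated (>= 1). */

/* Before the patch: max_qid = number of vectors, because the admin
 * queue shares vector 0 with the first IO queue. */
static int max_qid_shared(int nr_vectors)
{
	return nr_vectors;
}

/* After the patch: max_qid = ret - 1, because vector 0 is set aside
 * for the admin queue. */
static int max_qid_separate(int nr_vectors)
{
	return nr_vectors - 1;
}
```

For a device that can only provide one vector (legacy INTx or
single-message MSI), max_qid_shared(1) is 1 but max_qid_separate(1) is 0,
i.e. no IO queue is left.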

2018-02-28 02:55:38

by jianchao.wang

Subject: Re: [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0

Hi Keith

Thanks for taking the time to review this.

On 02/27/2018 11:13 PM, Keith Busch wrote:
> On Tue, Feb 27, 2018 at 04:46:17PM +0800, Jianchao Wang wrote:
>> Currently, adminq and ioq0 share the same irq vector. This is
>> unfair for both adminq and ioq0.
>> - For adminq, its completion irq has to be bound on cpu0.
>> - For ioq0, when the irq fires for io completion, the adminq irq
>> action has to be checked also.
>
> This change log could use some improvements. Why is it bad if admin
> interrupts affinity is with cpu0?

The adminq interrupt should be able to fire on any CPU.
Do we have any reason to bind it to cpu0?

>
> Are you able to measure _any_ performance difference on IO queue 1 vs IO
> queue 2 that you can attribute to IO queue 1's sharing vector 0?

Actually, I did not see any performance improvement on my own NVMe card.
But it may matter on some enterprise cards, especially when the media is persistent memory.
nvme_irq will be invoked twice when the ioq0 irq fires, which introduces another unnecessary DMA
access to the cq entry.

>
>> @@ -1945,11 +1947,11 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>>  	 * setting up the full range we need.
>>  	 */
>>  	pci_free_irq_vectors(pdev);
>> -	nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
>> -			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
>> -	if (nr_io_queues <= 0)
>> +	ret = pci_alloc_irq_vectors_affinity(pdev, 1, (nr_io_queues + 1),
>> +			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
>> +	if (ret <= 0)
>>  		return -EIO;
>> -	dev->max_qid = nr_io_queues;
>> +	dev->max_qid = ret - 1;
>
> So controllers that have only legacy or single-message MSI don't get any
> IO queues?
>

Yes. In that case, we have to share the single irq vector.

Thanks for your guidance. :)
Jianchao

2018-02-28 15:28:35

by Keith Busch

Subject: Re: [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0

On Wed, Feb 28, 2018 at 10:53:31AM +0800, jianchao.wang wrote:
> On 02/27/2018 11:13 PM, Keith Busch wrote:
> > On Tue, Feb 27, 2018 at 04:46:17PM +0800, Jianchao Wang wrote:
> >> Currently, adminq and ioq0 share the same irq vector. This is
> >> unfair for both adminq and ioq0.
> >> - For adminq, its completion irq has to be bound on cpu0.
> >> - For ioq0, when the irq fires for io completion, the adminq irq
> >> action has to be checked also.
> >
> > This change log could use some improvements. Why is it bad if admin
> > interrupts affinity is with cpu0?
>
> The adminq interrupt should be able to fire on any CPU.
> Do we have any reason to bind it to cpu0?

Your patch will have the admin vector CPU affinity mask set to
0xff..ff. The first set bit for an online CPU is the one the IRQ handler
will run on, so the admin queue will still only run on CPU 0.
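
The selection rule described above can be modelled in user space as
below. This is only a simplified sketch of "first set bit for an online
CPU": the function name and the 64-bit mask representation are
assumptions made for illustration, and the real choice is made by the
kernel's irq code and may differ by architecture.

```c
#include <stdint.h>

/* Hypothetical model: return the lowest-numbered CPU that is set in
 * both the irq affinity mask and the online mask, or -1 if none. */
static int first_target_cpu(uint64_t affinity_mask, uint64_t online_mask)
{
	uint64_t candidates = affinity_mask & online_mask;

	for (int cpu = 0; cpu < 64; cpu++)
		if (candidates & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}
```

With an all-ones affinity mask and CPUs 0-7 online,
first_target_cpu(UINT64_MAX, 0xff) yields 0, which matches the point
that an unrestricted admin vector still ends up on CPU 0.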

> > Are you able to measure _any_ performance difference on IO queue 1 vs IO
> > queue 2 that you can attribute to IO queue 1's sharing vector 0?
>
> Actually, I did not see any performance improvement on my own NVMe card.
> But it may matter on some enterprise cards, especially when the media is persistent memory.
> nvme_irq will be invoked twice when the ioq0 irq fires, which introduces another unnecessary DMA
> access to the cq entry.

A CPU reading its own memory isn't a DMA. It's just a cheap memory read.

2018-02-28 15:45:38

by jianchao.wang

Subject: Re: [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0

Hi Keith

Thanks for your kind response and guidance.

On 02/28/2018 11:27 PM, Keith Busch wrote:
> On Wed, Feb 28, 2018 at 10:53:31AM +0800, jianchao.wang wrote:
>> On 02/27/2018 11:13 PM, Keith Busch wrote:
>>> On Tue, Feb 27, 2018 at 04:46:17PM +0800, Jianchao Wang wrote:
>>>> Currently, adminq and ioq0 share the same irq vector. This is
>>>> unfair for both adminq and ioq0.
>>>> - For adminq, its completion irq has to be bound on cpu0.
>>>> - For ioq0, when the irq fires for io completion, the adminq irq
>>>> action has to be checked also.
>>>
>>> This change log could use some improvements. Why is it bad if admin
>>> interrupts affinity is with cpu0?
>>
>> The adminq interrupt should be able to fire on any CPU.
>> Do we have any reason to bind it to cpu0?
>
> Your patch will have the admin vector CPU affinity mask set to
> 0xff..ff. The first set bit for an online CPU is the one the IRQ handler
> will run on, so the admin queue will still only run on CPU 0.

hmmm...yes.
When I test with only one irq vector, I get the following result:
124: 0 0 253541 0 0 0 0 0 IR-PCI-MSI 1048576-edge nvme0q0, nvme0q1

>
>>> Are you able to measure _any_ performance difference on IO queue 1 vs IO
>>> queue 2 that you can attribute to IO queue 1's sharing vector 0?
>>
>> Actually, I did not see any performance improvement on my own NVMe card.
>> But it may matter on some enterprise cards, especially when the media is persistent memory.
>> nvme_irq will be invoked twice when the ioq0 irq fires, which introduces another unnecessary DMA
>> access to the cq entry.
>
> A CPU reading its own memory isn't a DMA. It's just a cheap memory read.

Oh sorry, my bad. I mean it is an operation on a DMA address, which is uncached.
nvme_irq
  -> nvme_process_cq
    -> nvme_read_cqe
      -> nvme_cqe_valid

static inline bool nvme_cqe_valid(struct nvme_queue *nvmeq, u16 head,
		u16 phase)
{
	return (le16_to_cpu(nvmeq->cqes[head].status) & 1) == phase;
}
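
The phase-bit check can be tried out in user space with the sketch
below. It is a simplified model, not the driver's code: struct cq_model,
Q_DEPTH and the function names are hypothetical, the status words are
kept in host byte order so no le16_to_cpu() is needed, and only bit 0
(the phase tag) is modelled.

```c
#include <stdbool.h>
#include <stdint.h>

#define Q_DEPTH 4 /* hypothetical queue depth for this sketch */

struct cq_model {
	uint16_t status[Q_DEPTH]; /* bit 0 of each entry is the phase tag */
	uint16_t head;            /* next entry the consumer will look at */
	uint16_t phase;           /* phase value that marks a new entry */
};

/* Same test as nvme_cqe_valid(): is the entry at head a new one? */
static bool cqe_valid(const struct cq_model *cq)
{
	return (cq->status[cq->head] & 1) == cq->phase;
}

/* Consume every new entry and return how many were found. */
static int process_cq(struct cq_model *cq)
{
	int found = 0;

	while (cqe_valid(cq)) {
		found++;
		if (++cq->head == Q_DEPTH) {
			cq->head = 0;
			cq->phase = !cq->phase; /* expected phase flips on wrap */
		}
	}
	return found;
}
```

In this model, calling process_cq() on a queue with no new entries does
a single read of the status word at head and returns 0. That is the
extra work at issue when both queue handlers run for one shared
interrupt.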

Sincerely
Jianchao

2018-02-28 15:55:39

by jianchao.wang

Subject: Re: [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0

On 02/28/2018 11:42 PM, jianchao.wang wrote:
> Hi Keith
>
> Thanks for your kind response and guidance.
>
> On 02/28/2018 11:27 PM, Keith Busch wrote:
>> On Wed, Feb 28, 2018 at 10:53:31AM +0800, jianchao.wang wrote:
>>> On 02/27/2018 11:13 PM, Keith Busch wrote:
>>>> On Tue, Feb 27, 2018 at 04:46:17PM +0800, Jianchao Wang wrote:
>>>>> Currently, adminq and ioq0 share the same irq vector. This is
>>>>> unfair for both adminq and ioq0.
>>>>> - For adminq, its completion irq has to be bound on cpu0.
>>>>> - For ioq0, when the irq fires for io completion, the adminq irq
>>>>> action has to be checked also.
>>>>
>>>> This change log could use some improvements. Why is it bad if admin
>>>> interrupts affinity is with cpu0?
>>>
>>> The adminq interrupt should be able to fire on any CPU.
>>> Do we have any reason to bind it to cpu0?
>>
>> Your patch will have the admin vector CPU affinity mask set to
>> 0xff..ff. The first set bit for an online CPU is the one the IRQ handler
>> will run on, so the admin queue will still only run on CPU 0.
>
> hmmm...yes.
> When I test with only one irq vector, I get the following result:
> 124: 0 0 253541 0 0 0 0 0 IR-PCI-MSI 1048576-edge nvme0q0, nvme0q1
>

irqbalance may migrate the adminq irq away from cpu0.

>>
>>>> Are you able to measure _any_ performance difference on IO queue 1 vs IO
>>>> queue 2 that you can attribute to IO queue 1's sharing vector 0?
>>>
>>> Actually, I did not see any performance improvement on my own NVMe card.
>>> But it may matter on some enterprise cards, especially when the media is persistent memory.
>>> nvme_irq will be invoked twice when the ioq0 irq fires, which introduces another unnecessary DMA
>>> access to the cq entry.
>>
>> A CPU reading its own memory isn't a DMA. It's just a cheap memory read.
>
> Oh sorry, my bad. I mean it is an operation on a DMA address, which is uncached.
> nvme_irq
>   -> nvme_process_cq
>     -> nvme_read_cqe
>       -> nvme_cqe_valid
>
> static inline bool nvme_cqe_valid(struct nvme_queue *nvmeq, u16 head,
> 		u16 phase)
> {
> 	return (le16_to_cpu(nvmeq->cqes[head].status) & 1) == phase;
> }
>
> Sincerely
> Jianchao
>

2018-02-28 15:56:15

by Keith Busch

Subject: Re: [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0

On Wed, Feb 28, 2018 at 11:46:20PM +0800, jianchao.wang wrote:
>
> irqbalance may migrate the adminq irq away from cpu0.

No, irqbalance can't touch managed IRQs. See irq_can_set_affinity_usr().