2017-07-17 22:36:33

by Sinan Kaya

Subject: [PATCH] nvme: Acknowledge completion queue on each iteration

The code currently rings the completion queue doorbell only after processing
all completed events and sending callbacks to the block layer on each iteration.

This causes a performance drop when a lot of jobs are queued towards the HW.
Ring the completion queue doorbell on each loop iteration instead, allowing
new jobs to be queued by the HW.

Signed-off-by: Sinan Kaya <[email protected]>
---
drivers/nvme/host/pci.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d10d2f2..33d9b5b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -810,13 +810,12 @@ static void nvme_process_cq(struct nvme_queue *nvmeq)

 	while (nvme_read_cqe(nvmeq, &cqe)) {
 		nvme_handle_cqe(nvmeq, &cqe);
+		nvme_ring_cq_doorbell(nvmeq);
 		consumed++;
 	}

-	if (consumed) {
-		nvme_ring_cq_doorbell(nvmeq);
+	if (consumed)
 		nvmeq->cqe_seen = 1;
-	}
 }

static irqreturn_t nvme_irq(int irq, void *data)
--
1.9.1


2017-07-17 22:39:52

by Keith Busch

Subject: Re: [PATCH] nvme: Acknowledge completion queue on each iteration

On Mon, Jul 17, 2017 at 06:36:23PM -0400, Sinan Kaya wrote:
> Code is moving the completion queue doorbell after processing all completed
> events and sending callbacks to the block layer on each iteration.
>
> This is causing a performance drop when a lot of jobs are queued towards
> the HW. Move the completion queue doorbell on each loop instead and allow new
> jobs to be queued by the HW.

That doesn't make sense. Aggregating doorbell writes should be much more
efficient for high depth workloads.

2017-07-17 22:46:15

by Sinan Kaya

Subject: Re: [PATCH] nvme: Acknowledge completion queue on each iteration

Hi Keith,

On 7/17/2017 6:45 PM, Keith Busch wrote:
> On Mon, Jul 17, 2017 at 06:36:23PM -0400, Sinan Kaya wrote:
>> Code is moving the completion queue doorbell after processing all completed
>> events and sending callbacks to the block layer on each iteration.
>>
>> This is causing a performance drop when a lot of jobs are queued towards
>> the HW. Move the completion queue doorbell on each loop instead and allow new
>> jobs to be queued by the HW.
>
> That doesn't make sense. Aggregating doorbell writes should be much more
> efficient for high depth workloads.
>

The problem is that the code is throttling the HW, as the HW cannot queue more
completions until SW gets a chance to clear them.

As an example:

	for each completion in N
		blk_layer()
	ring doorbell

The HW cannot queue a new job until the N blk_layer() operations are processed
and queue element ownership is passed back to the HW after the loop. The HW
just sits idle there if no queue entries are available.
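
To make the claim concrete (an illustrative model, not driver code, and
assuming a queue pair whose SQ and CQ have the same depth): the device may
only reuse a CQ slot after the host releases it by writing a new head index
to the CQ doorbell, so roughly

	cq_slots_free_from_hw_view =
		cq_depth - (cqes_posted_by_hw - head_last_written_by_host)

While the host runs N block-layer callbacks before ringing the doorbell, the
head the device sees lags by up to N entries; the argument here is that the
device can be starved of apparently free CQ slots during that window.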

Sinan

--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2017-07-17 22:50:17

by Keith Busch

Subject: Re: [PATCH] nvme: Acknowledge completion queue on each iteration

On Mon, Jul 17, 2017 at 06:46:11PM -0400, Sinan Kaya wrote:
> Hi Keith,
>
> On 7/17/2017 6:45 PM, Keith Busch wrote:
> > On Mon, Jul 17, 2017 at 06:36:23PM -0400, Sinan Kaya wrote:
> >> Code is moving the completion queue doorbell after processing all completed
> >> events and sending callbacks to the block layer on each iteration.
> >>
> >> This is causing a performance drop when a lot of jobs are queued towards
> >> the HW. Move the completion queue doorbell on each loop instead and allow new
> >> jobs to be queued by the HW.
> >
> > That doesn't make sense. Aggregating doorbell writes should be much more
> > efficient for high depth workloads.
> >
>
> Problem is that code is throttling the HW as HW cannot queue more completions until
> SW get a chance to clear it.
>
> As an example:
>
> for each in N
> (
> blk_layer()
> )
> ring door bell
>
> HW cannot queue new job until N x blk_layer operations are processed and queue
> element ownership is passed to the HW after the loop. HW is just sitting idle
> there if no queue entries are available.

If no completion queue entries are available, then there can't possibly
be any submission queue entries for the HW to work on either.

2017-07-17 23:07:03

by Sinan Kaya

Subject: Re: [PATCH] nvme: Acknowledge completion queue on each iteration

On 2017-07-17 18:56, Keith Busch wrote:
> On Mon, Jul 17, 2017 at 06:46:11PM -0400, Sinan Kaya wrote:
>> Hi Keith,
>>
>> On 7/17/2017 6:45 PM, Keith Busch wrote:
>> > On Mon, Jul 17, 2017 at 06:36:23PM -0400, Sinan Kaya wrote:
>> >> Code is moving the completion queue doorbell after processing all completed
>> >> events and sending callbacks to the block layer on each iteration.
>> >>
>> >> This is causing a performance drop when a lot of jobs are queued towards
>> >> the HW. Move the completion queue doorbell on each loop instead and allow new
>> >> jobs to be queued by the HW.
>> >
>> > That doesn't make sense. Aggregating doorbell writes should be much more
>> > efficient for high depth workloads.
>> >
>>
>> Problem is that code is throttling the HW as HW cannot queue more
>> completions until
>> SW get a chance to clear it.
>>
>> As an example:
>>
>> for each in N
>> (
>> blk_layer()
>> )
>> ring door bell
>>
>> HW cannot queue new job until N x blk_layer operations are processed
>> and queue
>> element ownership is passed to the HW after the loop. HW is just
>> sitting idle
>> there if no queue entries are available.
>
> If no completion queue entries are available, then there can't possibly
> be any submission queue entries for the HW to work on either.

Maybe I need to understand the design better. I was curious why the
completion and submission queues were protected by a single lock, causing
lock contention.

I was treating each queue independently. I have seen slightly better
performance with an early doorbell ring; that was my explanation.

2017-07-18 14:30:18

by Keith Busch

Subject: Re: [PATCH] nvme: Acknowledge completion queue on each iteration

On Mon, Jul 17, 2017 at 07:07:00PM -0400, [email protected] wrote:
> Maybe, I need to understand the design better. I was curious why completion
> and submission queues were protected by a single lock causing lock
> contention.

Ideally the queues are tied to CPUs, so you couldn't have one thread
submitting to a particular queue-pair while another thread is reaping
completions from it. Such a setup wouldn't get lock contention.

Some machines have so many CPUs, though, that sharing hardware queues
is required. We've experimented with separate submission and completion
locks for such cases, but I've never seen improved performance as a
result.
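
For readers following along, the experiment being described would look roughly
like the sketch below. This is hypothetical code for illustration only: the
in-tree driver at this point uses a single per-queue lock, and the structure
and function names here are made up.

#include <linux/spinlock.h>

/* Hypothetical split of the per-queue lock -- not the in-tree layout. */
struct nvme_queue_split {
	spinlock_t sq_lock;	/* serializes SQ tail updates and the SQ doorbell */
	spinlock_t cq_lock;	/* serializes CQ head/phase and the CQ doorbell */
};

/* Submission path: submitters only contend with other submitters. */
static void split_submit(struct nvme_queue_split *q)
{
	spin_lock_irq(&q->sq_lock);
	/* copy the command into the SQ, bump the tail, write the SQ doorbell */
	spin_unlock_irq(&q->sq_lock);
}

/* Completion path: the interrupt handler takes only cq_lock. */
static void split_process_cq(struct nvme_queue_split *q)
{
	spin_lock(&q->cq_lock);
	/* walk new CQEs, complete requests, write the CQ doorbell */
	spin_unlock(&q->cq_lock);
}

With one queue pair per CPU the single lock is rarely contended in the first
place, which is presumably why the split shows no measurable gain.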

2017-07-18 18:52:31

by Sinan Kaya

Subject: Re: [PATCH] nvme: Acknowledge completion queue on each iteration

On 7/18/2017 10:36 AM, Keith Busch wrote:
> On Mon, Jul 17, 2017 at 07:07:00PM -0400, [email protected] wrote:
>> Maybe, I need to understand the design better. I was curious why completion
>> and submission queues were protected by a single lock causing lock
>> contention.
> Ideally the queues are tied to CPUs, so you couldn't have one thread
> submitting to a particular queue-pair while another thread is reaping
> completions from it. Such a setup wouldn't get lock contention.

I do see that the NVMe driver sets up a completion interrupt on each CPU core
for the completions. No problem with that.

However, I don't think you can guarantee that there will always be a single
CPU core targeting one submission queue, especially with asynchronous IO.

Lock contention counters from CONFIG_LOCK_STAT are pointing to nvmeq->lock
in my FIO tests.

Did I miss something?
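
(For anyone reproducing this: with CONFIG_LOCK_STAT=y the counters are enabled
with 'echo 1 > /proc/sys/kernel/lock_stat' and read back from /proc/lock_stat,
where the contended per-queue lock shows up by name.)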

>
> Some machines have so many CPUs, though, that sharing hardware queues
> is required. We've experimented with separate submission and completion
> locks for such cases, but I've never seen an improved performance as a
> result.
>

I have also experimented with multiple locks with no significant gains.
However, I was curious if somebody else had a better implementation than mine.

--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2017-07-18 21:20:35

by Keith Busch

Subject: Re: [PATCH] nvme: Acknowledge completion queue on each iteration

On Tue, Jul 18, 2017 at 02:52:26PM -0400, Sinan Kaya wrote:
> On 7/18/2017 10:36 AM, Keith Busch wrote:
>
> I do see that the NVMe driver is creating a completion interrupt on
> each CPU core for the completions. No problems with that.
>
> However, I don't think you can guarantee that there will always be a single
> CPU core targeting one submission queue especially with asynchronous IO.
>
> Lock contention counters from CONFIG_LOCK_STAT are pointing to nvmeq->lock
> in my FIO tests.
>
> Did I miss something?

I think that must mean your machine has many more CPUs than your nvme
controller has IO queues.
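
(As a made-up example of that situation: 128 logical CPUs in front of a
controller that only exposes 32 IO queue pairs leaves 4 CPUs sharing each
queue, so submissions and completions from different CPUs serialize on that
queue's lock. With a queue pair per CPU the lock is effectively uncontended.)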

2017-07-19 09:21:04

by Sagi Grimberg

Subject: Re: [PATCH] nvme: Acknowledge completion queue on each iteration

> Code is moving the completion queue doorbell after processing all completed
> events and sending callbacks to the block layer on each iteration.
>
> This is causing a performance drop when a lot of jobs are queued towards
> the HW. Move the completion queue doorbell on each loop instead and allow new
> jobs to be queued by the HW.
>
> Signed-off-by: Sinan Kaya <[email protected]>
> ---
> drivers/nvme/host/pci.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index d10d2f2..33d9b5b 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -810,13 +810,12 @@ static void nvme_process_cq(struct nvme_queue *nvmeq)
>
>  	while (nvme_read_cqe(nvmeq, &cqe)) {
>  		nvme_handle_cqe(nvmeq, &cqe);
> +		nvme_ring_cq_doorbell(nvmeq);
>  		consumed++;
>  	}
>
> -	if (consumed) {
> -		nvme_ring_cq_doorbell(nvmeq);
> +	if (consumed)
>  		nvmeq->cqe_seen = 1;
> -	}
>  }

Agree with Keith that this is definitely not the way to go; it
adds MMIO operations in the hot path with very little gain (if
any).
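
To put a rough number on the hot-path cost (a simplified sketch, not the exact
helper in pci.c): the CQ doorbell update is ultimately a posted MMIO write of
the new head index to the queue's doorbell register, something like

	writel(nvmeq->cq_head, nvmeq->q_db + nvmeq->dev->db_stride);

so draining, say, 256 completions in one interrupt goes from a single uncached
register write to 256 of them with this patch.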

2017-07-19 10:37:30

by Sinan Kaya

Subject: Re: [PATCH] nvme: Acknowledge completion queue on each iteration

On 7/19/2017 5:20 AM, Sagi Grimberg wrote:
>> Code is moving the completion queue doorbell after processing all completed
>> events and sending callbacks to the block layer on each iteration.
>>
>> This is causing a performance drop when a lot of jobs are queued towards
>> the HW. Move the completion queue doorbell on each loop instead and allow new
>> jobs to be queued by the HW.
>>
>> Signed-off-by: Sinan Kaya <[email protected]>
>> ---
>> drivers/nvme/host/pci.c | 5 ++---
>> 1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index d10d2f2..33d9b5b 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -810,13 +810,12 @@ static void nvme_process_cq(struct nvme_queue *nvmeq)
>>  	while (nvme_read_cqe(nvmeq, &cqe)) {
>>  		nvme_handle_cqe(nvmeq, &cqe);
>> +		nvme_ring_cq_doorbell(nvmeq);
>>  		consumed++;
>>  	}
>> -	if (consumed) {
>> -		nvme_ring_cq_doorbell(nvmeq);
>> +	if (consumed)
>>  		nvmeq->cqe_seen = 1;
>> -	}
>>  }
>
> Agree with Keith that this is definitely not the way to go, it
> adds mmio operations in the hot path with very little gain (if
> at all).
>

Understood. Different architectures might have different latencies when
accessing HW registers. As you indicated, it might be expensive on some
platforms, and this change would make that worse.

I'm doing a self-NACK as well.

--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.