2015-05-21 18:36:41

by Parav Pandit

[permalink] [raw]
Subject: [PATCH] NVMe: Avoid interrupt disable during queue init.

Avoid disabling interrupts and holding q_lock for a queue
that is just getting initialized.

With this change, online_queues is also incremented without
the lock during the queue setup stage.
If power management's nvme_suspend() kicks in during queue setup,
the per-nvmeq q_lock spinlock cannot protect the device-wide
online_queues variable anyway.

Signed-off-by: Parav Pandit <[email protected]>
---
drivers/block/nvme-core.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 58041c7..7f09e5e 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -1437,14 +1437,12 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
{
struct nvme_dev *dev = nvmeq->dev;

- spin_lock_irq(&nvmeq->q_lock);
nvmeq->sq_tail = 0;
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
dev->online_queues++;
- spin_unlock_irq(&nvmeq->q_lock);
}

static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
--
1.8.3.1


2015-05-21 18:39:24

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] NVMe: Avoid interrupt disable during queue init.

On 05/21/2015 06:12 PM, Parav Pandit wrote:
> Avoid disabling interrupts and holding q_lock for a queue
> that is just getting initialized.
>
> With this change, online_queues is also incremented without
> the lock during the queue setup stage.
> If power management's nvme_suspend() kicks in during queue setup,
> the per-nvmeq q_lock spinlock cannot protect the device-wide
> online_queues variable anyway.

Seems fairly pointless, it's not like it's a hot path...

--
Jens Axboe

2015-05-21 19:14:55

by Parav Pandit

[permalink] [raw]
Subject: Re: [PATCH] NVMe: Avoid interrupt disable during queue init.

On Fri, May 22, 2015 at 12:09 AM, Jens Axboe <[email protected]> wrote:
> On 05/21/2015 06:12 PM, Parav Pandit wrote:
>>
>> Avoid disabling interrupts and holding q_lock for a queue
>> that is just getting initialized.
>>
>> With this change, online_queues is also incremented without
>> the lock during the queue setup stage.
>> If power management's nvme_suspend() kicks in during queue setup,
>> the per-nvmeq q_lock spinlock cannot protect the device-wide
>> online_queues variable anyway.
>
>
> Seems fairly pointless, it's not like it's a hot path...
>
I didn't follow your comments.
Do you mean we should still hold the lock, even if it's not needed?

I meant to say in the above patch/comment that holding q_lock is not
necessary in this path, so I removed it.

> --
> Jens Axboe
>

2015-05-21 19:34:32

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH] NVMe: Avoid interrupt disable during queue init.

On Thu, 21 May 2015, Parav Pandit wrote:
> Avoid disabling interrupts and holding q_lock for a queue
> that is just getting initialized.
>
> With this change, online_queues is also incremented without
> the lock during the queue setup stage.
> If power management's nvme_suspend() kicks in during queue setup,
> the per-nvmeq q_lock spinlock cannot protect the device-wide
> online_queues variable anyway.

The q_lock is held to protect polling from reading inconsistent data.

2015-05-22 04:15:18

by Parav Pandit

[permalink] [raw]
Subject: Re: [PATCH] NVMe: Avoid interrupt disable during queue init.

On Fri, May 22, 2015 at 1:04 AM, Keith Busch <[email protected]> wrote:
> On Thu, 21 May 2015, Parav Pandit wrote:
>>
>> Avoid disabling interrupts and holding q_lock for a queue
>> that is just getting initialized.
>>
>> With this change, online_queues is also incremented without
>> the lock during the queue setup stage.
>> If power management's nvme_suspend() kicks in during queue setup,
>> the per-nvmeq q_lock spinlock cannot protect the device-wide
>> online_queues variable anyway.
>
>
> The q_lock is held to protect polling from reading inconsistent data.

Ah, yes. I can see that nvme_kthread can poll the CQ while it's getting
created through nvme_resume().
I think this opens up another issue.

nvme_kthread() should,

Instead of,
struct nvme_queue *nvmeq = dev->queues[i];

it should do,
struct nvme_queue *nvmeq = rcu_dereference(dev->queues[i]);

And,
nvme_alloc_queue()
dev->queues[qid] = nvmeq;

should be,
rcu_assign_pointer(dev->queues[qid], nvmeq);

Otherwise nvme_kthread could get stale values for the elements of nvmeq.
I will send a patch for the fix.

2015-05-22 14:48:16

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH] NVMe: Avoid interrupt disable during queue init.

On Thu, 21 May 2015, Parav Pandit wrote:
> On Fri, May 22, 2015 at 1:04 AM, Keith Busch <[email protected]> wrote:
>> The q_lock is held to protect polling from reading inconsistent data.
>
> Ah, yes. I can see that nvme_kthread can poll the CQ while it's getting
> created through nvme_resume().
> I think this opens up another issue.
>
> nvme_kthread() should,
>
> Instead of,
> struct nvme_queue *nvmeq = dev->queues[i];
>
> it should do,
> struct nvme_queue *nvmeq = rcu_dereference(dev->queues[i]);

The rcu protection on nvme queues was removed with the blk-mq conversion
as we rely on that layer for h/w access.

2015-05-22 14:51:32

by Parav Pandit

[permalink] [raw]
Subject: Re: [PATCH] NVMe: Avoid interrupt disable during queue init.

On Fri, May 22, 2015 at 8:18 PM, Keith Busch <[email protected]> wrote:
> On Thu, 21 May 2015, Parav Pandit wrote:
>>
>> On Fri, May 22, 2015 at 1:04 AM, Keith Busch <[email protected]>
>> wrote:
>>>
>>> The q_lock is held to protect polling from reading inconsistent data.
>>
>>
>> Ah, yes. I can see that nvme_kthread can poll the CQ while it's getting
>> created through nvme_resume().
>> I think this opens up another issue.
>>
>> nvme_kthread() should,
>>
>> Instead of,
>> struct nvme_queue *nvmeq = dev->queues[i];
>>
>> it should do,
>> struct nvme_queue *nvmeq = rcu_dereference(dev->queues[i]);
>
>
> The rcu protection on nvme queues was removed with the blk-mq conversion
> as we rely on that layer for h/w access.

O.k. But the above is at a level where data I/Os are not even active. It's
between nvme_kthread and nvme_resume() from the power management
subsystem.
I must be missing something.

2015-05-22 15:11:47

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH] NVMe: Avoid interrupt disable during queue init.

On Fri, 22 May 2015, Parav Pandit wrote:
> On Fri, May 22, 2015 at 8:18 PM, Keith Busch <[email protected]> wrote:
>> The rcu protection on nvme queues was removed with the blk-mq conversion
>> as we rely on that layer for h/w access.
>
> O.k. But the above is at a level where data I/Os are not even active. It's
> between nvme_kthread and nvme_resume() from the power management
> subsystem.
> I must be missing something.

On resume, everything is already reaped from the queues, so there should
be no harm letting the kthread poll an inactive queue. The proposal to
remove the q_lock during queue init makes it possible for the thread to
see the wrong cq phase bit and mess up the completion queue's head from
reaping non-existent entries.

But beyond nvme_resume, it appears a race condition is possible in any
scenario where a device is reinitialized and cannot create the same
number of IO queues as it had originally. Part of the problem is there
doesn't seem to be a way to change a tagset's nr_hw_queues after it was
created. The conditions that lead to this scenario should be uncommon,
so I haven't given it much thought; I need to untangle dynamic namespaces
first. :)

2015-05-22 16:03:30

by Parav Pandit

[permalink] [raw]
Subject: Re: [PATCH] NVMe: Avoid interrupt disable during queue init.

On Fri, May 22, 2015 at 8:41 PM, Keith Busch <[email protected]> wrote:
> On Fri, 22 May 2015, Parav Pandit wrote:
>>
>> On Fri, May 22, 2015 at 8:18 PM, Keith Busch <[email protected]>
>> wrote:
>>>
>>> The rcu protection on nvme queues was removed with the blk-mq conversion
>>> as we rely on that layer for h/w access.
>>
>>
>> O.k. But the above is at a level where data I/Os are not even active. It's
>> between nvme_kthread and nvme_resume() from the power management
>> subsystem.
>> I must be missing something.
>
>
> On resume, everything is already reaped from the queues, so there should
> be no harm letting the kthread poll an inactive queue. The proposal to
> remove the q_lock during queue init makes it possible for the thread to
> see the wrong cq phase bit and mess up the completion queue's head from
> reaping non-existent entries.
>
I agree. q_lock is necessary.

> But beyond nvme_resume, it appears a race condition is possible in any
> scenario where a device is reinitialized and cannot create the same
> number of IO queues as it had originally. Part of the problem is there
> doesn't seem to be a way to change a tagset's nr_hw_queues after it was
> created. The conditions that lead to this scenario should be uncommon,
> so I haven't given it much thought; I need to untangle dynamic namespaces
> first. :)

It probably doesn't matter what the number of nr_hw_queues is.
O.k., so we agree on the existence of the race condition.

Let me point to another, more generic race condition, which I believe
my RCU changes can fix.

During the normal positive probe path:
(a) the device is added to dev_list in nvme_dev_start()
(b) nvme_kthread gets created, which will eventually refer to
dev->queues[qid] to check for NULL.
(c) the dev_start() worker thread has started probing the device and
creating the queue using nvme_alloc_queue.
This assigns dev->queues[qid] a new pointer.
If this is done out of order, nvme_kthread will pick up an uninitialized
q_lock, cq_phase, q_db.

So to protect this as well, I need RCU.
The RCU fix I am proposing addresses both problems: the one I described
previously, and this one. (Both are the same race condition, occurring in
the normal probe process and in the resume path as well.)
A more long-term fix would be to have a device flag for the state to
maintain; I haven't thought this one through fully yet.
Another thought is to not create nvme_kthread until all the queues are active.
I will defer this design-change discussion for now and keep the focus on
fixing this primary race condition using RCU in the next patch.

2015-05-22 16:23:33

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH] NVMe: Avoid interrupt disable during queue init.

On Fri, 22 May 2015, Parav Pandit wrote:
> During the normal positive probe path:
> (a) the device is added to dev_list in nvme_dev_start()
> (b) nvme_kthread gets created, which will eventually refer to
> dev->queues[qid] to check for NULL.
> (c) the dev_start() worker thread has started probing the device and
> creating the queue using nvme_alloc_queue.
> This assigns dev->queues[qid] a new pointer.
> If this is done out of order, nvme_kthread will pick up an uninitialized
> q_lock, cq_phase, q_db.

A memory barrier before incrementing the dev->queue_count (and assigning
the pointer in the array before that) should address this concern.

> Another thought is to not create nvme_kthread until all the queues are active.

No good, we want to poll during queue creation to detect controller
errors and broken interrupts.

2015-05-22 16:48:47

by Parav Pandit

[permalink] [raw]
Subject: Re: [PATCH] NVMe: Avoid interrupt disable during queue init.

On Fri, May 22, 2015 at 9:53 PM, Keith Busch <[email protected]> wrote:
> On Fri, 22 May 2015, Parav Pandit wrote:
>>
>> During the normal positive probe path:
>> (a) the device is added to dev_list in nvme_dev_start()
>> (b) nvme_kthread gets created, which will eventually refer to
>> dev->queues[qid] to check for NULL.
>> (c) the dev_start() worker thread has started probing the device and
>> creating the queue using nvme_alloc_queue.
>> This assigns dev->queues[qid] a new pointer.
>> If this is done out of order, nvme_kthread will pick up an uninitialized
>> q_lock, cq_phase, q_db.
>
>
> A memory barrier before incrementing the dev->queue_count (and assigning
> the pointer in the array before that) should address this concern.
>

Sure. mb() will solve the publisher-side problem; RCU is a wrapper around mb().
However, mb() doesn't solve the issue of the q_lock variable being
fetched before the if (!nvmeq) condition is executed, due to compiler
value optimizations in nvme_kthread().
So I was inclined towards the more preferred method of RCU.


>> Another thought is to not create nvme_kthread until all the queues are active.
>
>
> No good, we want to poll during queue creation to detect controller
> errors and broken interrupts.

2015-05-22 17:07:29

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH] NVMe: Avoid interrupt disable during queue init.

On Fri, 22 May 2015, Parav Pandit wrote:
> On Fri, May 22, 2015 at 9:53 PM, Keith Busch <[email protected]> wrote:
>> A memory barrier before incrementing the dev->queue_count (and assigning
>> the pointer in the array before that) should address this concern.
>
> Sure. mb() will solve the publisher-side problem; RCU is a wrapper around mb().
> However, mb() doesn't solve the issue of the q_lock variable being
> fetched before the if (!nvmeq) condition is executed, due to compiler
> value optimizations in nvme_kthread().

Eh? The value of dev->queue_count prevents the thread's for-loop from
iterating that nvmeq before the q_lock is initialized.

2015-05-22 17:34:00

by Parav Pandit

[permalink] [raw]
Subject: Re: [PATCH] NVMe: Avoid interrupt disable during queue init.

On Fri, May 22, 2015 at 10:37 PM, Keith Busch <[email protected]> wrote:
> On Fri, 22 May 2015, Parav Pandit wrote:
>>
>> On Fri, May 22, 2015 at 9:53 PM, Keith Busch <[email protected]>
>> wrote:
>>>
>>> A memory barrier before incrementing the dev->queue_count (and assigning
>>> the pointer in the array before that) should address this concern.
>>
>>
>> Sure. mb() will solve the publisher-side problem; RCU is a wrapper
>> around mb().
>> However, mb() doesn't solve the issue of the q_lock variable being
>> fetched before the if (!nvmeq) condition is executed, due to compiler
>> value optimizations in nvme_kthread().
>
>
> Eh? The value of dev->queue_count prevents the thread's for-loop from
> iterating that nvmeq before the q_lock is initialized.

I agree that nvmeq won't be NULL after mb(); that alone is not sufficient.

What I proposed in the previous email is,

Converting,

struct nvme_queue *nvmeq = dev->queues[i];
if (!nvmeq)
continue;
spin_lock_irq(nvmeq->q_lock);

to replace with,

struct nvme_queue *nvmeq = rcu_dereference(dev->queues[i]);
if (!nvmeq)
continue;
spin_lock_irq(nvmeq->q_lock);

This will prevent fetching the contents of q_lock before checking for the
NULL condition. Classic usage of RCU.

2015-05-22 17:47:57

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH] NVMe: Avoid interrupt disable during queue init.

On Fri, 22 May 2015, Parav Pandit wrote:
> I agree that nvmeq won't be NULL after mb(); that alone is not sufficient.
>
> What I proposed in the previous email is,
>
> Converting,
>
> struct nvme_queue *nvmeq = dev->queues[i];
> if (!nvmeq)
> continue;
> spin_lock_irq(nvmeq->q_lock);
>
> to replace with,
>
> struct nvme_queue *nvmeq = rcu_dereference(dev->queues[i]);
> if (!nvmeq)
> continue;
> spin_lock_irq(nvmeq->q_lock);
>
> This will prevent fetching the contents of q_lock before checking for the
> NULL condition. Classic usage of RCU.

What the heck are you talking about? The value of dev->queue_count won't
even let the thread iterate an nvmeq before q_lock is initialized.

We used to rcu protect queue access, but that was to make nvme's
make_request_fn safe to surprise removal, not for the polling thread.

2015-05-22 18:18:43

by Parav Pandit

[permalink] [raw]
Subject: Re: [PATCH] NVMe: Avoid interrupt disable during queue init.

On Fri, May 22, 2015 at 11:17 PM, Keith Busch <[email protected]> wrote:
> On Fri, 22 May 2015, Parav Pandit wrote:
>>
>> I agree that nvmeq won't be NULL after mb(); that alone is not
>> sufficient.
>>
>> What I proposed in the previous email is,
>>
>> Converting,
>>
>> struct nvme_queue *nvmeq = dev->queues[i];
>> if (!nvmeq)
>> continue;
>> spin_lock_irq(nvmeq->q_lock);
>>
>> to replace with,
>>
>> struct nvme_queue *nvmeq = rcu_dereference(dev->queues[i]);
>> if (!nvmeq)
>> continue;
>> spin_lock_irq(nvmeq->q_lock);
>>
>> This will prevent fetching the contents of q_lock before checking for the
>> NULL condition. Classic usage of RCU.
>
>
> What the heck are you talking about? The value of dev->queue_count won't
> even let the thread iterate an nvmeq before q_lock is initialized.
>
Oh, right. My bad. I was getting confused with the possible reordering
in the reset path within,
nvme_free_queues()
1320 dev->queue_count--;
1321 dev->queues[i] = NULL;

while nvme_kthread is working on it.
But that's an unlikely case, because the device is removed from the list beforehand.

Yes, so mb() is good enough.
I will send you the patch.
Thanks.


> We used to rcu protect queue access, but that was to make nvme's
> make_request_fn safe to surprise removal, not for the polling thread.