2018-08-14 22:19:45

by Michal Wnukowski

Subject: [PATCH] Bugfix for handling of shadow doorbell buffer.

This patch adds a full memory barrier to the nvme_dbbuf_update_and_check_event
function to ensure that the shadow doorbell is written before EventIdx is
read from memory. This is a critical bugfix for the initial patch that
added shadow doorbell support to the NVMe driver
(f9f38e33389c019ec880f6825119c94867c1fde0).

This memory barrier is required because "Loads may be reordered with
older stores to different locations" (quote from the Intel 64 Architecture
Memory Ordering White Paper). The following two operations can be
reordered:
- Write the shadow doorbell (dbbuf_db) into memory.
- Read EventIdx (dbbuf_ei) from memory.
This can result in a race condition between the driver and the VM host
processing requests (if the given virtual NVMe controller supports the
shadow doorbell). If that occurs, the virtual NVMe controller may
decide to wait for an MMIO doorbell from the guest operating system,
while the guest driver may decide not to issue an MMIO doorbell on any
subsequent commands.

With the memory barrier in place, the volatile keyword around *dbbuf_ei is
redundant.

This issue is a purely timing-dependent one, so there is no easy way to
reproduce it. Currently the easiest known approach is to run "ORacle IO
Numbers" (orion), which is shipped with Oracle DB:

orion -run advanced -num_large 0 -size_small 8 -type rand -simulate
concat -write 40 -duration 120 -matrix row -testname nvme_test

where nvme_test is a .lun file that contains a list of NVMe block
devices to run the test against. Limiting the number of vCPUs assigned to
a given VM instance seems to increase the chances of this bug occurring.
On a test environment with a VM that had 4 NVMe drives and 1 vCPU
assigned, the virtual NVMe controller hang could be observed within 10-20
minutes. That corresponds to about 400-500k IO operations processed (or
about 100GB of IO reads/writes).

The Orion tool was used for validation, set to run in a loop for 36
hours (equivalent to pushing 550M IO operations). No issues were
observed, which suggests that the patch fixes the issue.

Fixes: f9f38e33389c ("nvme: improve performance for virtual NVMe devices")
Signed-off-by: Michal Wnukowski <[email protected]>
---
drivers/nvme/host/pci.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 17a0190bd88f..091c2441b6fa 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -292,7 +292,7 @@ static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)

/* Update dbbuf and return true if an MMIO is required */
static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
- volatile u32 *dbbuf_ei)
+ u32 *dbbuf_ei)
{
if (dbbuf_db) {
u16 old_value;
@@ -306,6 +306,12 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
old_value = *dbbuf_db;
*dbbuf_db = value;

+ /*
+ * Ensure that the doorbell is updated before reading
+ * the EventIdx from memory
+ */
+ mb();
+
if (!nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))
return false;
}
--
2.18.0.865.gffc8e1a3cd6-goog



2018-08-14 22:57:36

by Keith Busch

Subject: Re: [PATCH] Bugfix for handling of shadow doorbell buffer.

On Tue, Aug 14, 2018 at 03:17:35PM -0700, Michal Wnukowski wrote:
> /* Update dbbuf and return true if an MMIO is required */
> static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
> - volatile u32 *dbbuf_ei)
> + u32 *dbbuf_ei)
> {
> if (dbbuf_db) {
> u16 old_value;
> @@ -306,6 +306,12 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
> old_value = *dbbuf_db;
> *dbbuf_db = value;
>
> + /*
> + * Ensure that the doorbell is updated before reading
> + * the EventIdx from memory
> + */
> + mb();
> +
> if (!nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))
> return false;
> }

You just want to ensure the '*dbbuf_db = value' isn't reordered, right?
The order dependency might be more obvious if done as:

WRITE_ONCE(*dbbuf_db, value);

if (!nvme_dbbuf_need_event(READ_ONCE(*dbbuf_ei), value, old_value))
return false;

And 'volatile' is again redundant.

2018-08-14 23:17:58

by Linus Torvalds

Subject: Re: [PATCH] Bugfix for handling of shadow doorbell buffer.

Guys, you're both wrong.

On Tue, Aug 14, 2018 at 03:17:35PM -0700, Michal Wnukowski wrote:
>
> With memory barrier in place, the volatile keyword around *dbbuf_ei is
> redundant.

No. The memory barrier enforces _ordering_, but it doesn't enforce
that the accesses are only done once. So when you do

> *dbbuf_db = value;

to write to dbbuf_db, and

> *dbbuf_ei

to read from dbbuf_ei, without the volatile the write (or the read)
could be done multiple times, which can cause serious confusion.

So the "mb()" enforces ordering, and the volatile means that the
accesses will each be done as one single access.

Two different issues entirely.

However, there's a more serious problem with your patch:

> + /*
> + * Ensure that the doorbell is updated before reading
> + * the EventIdx from memory
> + */
> + mb();

Good comment. Except what about the other side?

When you use memory ordering rules, as opposed to locking, there's
always *two* sides to any access order. There's this "write dbbuf_db"
vs "read dbbuf_ei" ordering.

But there's the other side: what about the side that writes dbbuf_ei,
and reads dbbuf_db?

I'm assuming that's the actual controller hardware, but it needs a
comment about *that* access being ordered too, because if it isn't,
then ordering this side is pointless.

On Tue, Aug 14, 2018 at 3:56 PM Keith Busch <[email protected]> wrote:
>
> You just want to ensure the '*dbbuf_db = value' isn't reordered, right?
> The order dependency might be more obvious if done as:
>
> WRITE_ONCE(*dbbuf_db, value);
>
> if (!nvme_dbbuf_need_event(READ_ONCE(*dbbuf_ei), value, old_value))
> return false;
>
> And 'volatile' is again redundant.

Yes, using READ_ONCE/WRITE_ONCE obviates the need for volatile, but it
does *not* impose a memory ordering.

It imposes an ordering on the compiler, but not on the CPU, so you
still want the "mb()" there (or the accesses need to be to uncached
memory or something, but then you should be using "readl()/writel()",
so that's not the case here).

Linus

2018-08-14 23:50:52

by Keith Busch

Subject: Re: [PATCH] Bugfix for handling of shadow doorbell buffer.

On Tue, Aug 14, 2018 at 04:16:41PM -0700, Linus Torvalds wrote:
> On Tue, Aug 14, 2018 at 3:56 PM Keith Busch <[email protected]> wrote:
> >
> > You just want to ensure the '*dbbuf_db = value' isn't reordered, right?
> > The order dependency might be more obvious if done as:
> >
> > WRITE_ONCE(*dbbuf_db, value);
> >
> > if (!nvme_dbbuf_need_event(READ_ONCE(*dbbuf_ei), value, old_value))
> > return false;
> >
> > And 'volatile' is again redundant.
>
> Yes, using READ_ONCE/WRITE_ONCE obviates the need for volatile, but it
> does *not* impose a memory ordering.
>
> It imposes an ordering on the compiler, but not on the CPU, so you
> still want the "mb()" there

I mistakenly recalled memory-barriers.txt mentioned order was enforced
on the CPU, but that's true only for overlapping memory, which this is
not. Thanks for the correction.

2018-08-15 01:36:40

by Michal Wnukowski

Subject: Re: [PATCH] Bugfix for handling of shadow doorbell buffer.



On 08/14/2018 04:16 PM, Linus Torvalds wrote:
> On Tue, Aug 14, 2018 at 03:17:35PM -0700, Michal Wnukowski wrote:
>>
>> With memory barrier in place, the volatile keyword around *dbbuf_ei is
>> redundant.
>
> No. The memory barrier enforces _ordering_, but it doesn't enforce
> that the accesses are only done once. So when you do
>
>> *dbbuf_db = value;
>
> to write to dbbuf_db, and
>
>> *dbbuf_ei
>
> to read from dbbuf_ei, without the volatile the write (or the read)
> could be done multiple times, which can cause serious confusion.
>

I got confused after comparing the disassembly of this code with and
without the volatile keyword. Thanks for the correction.

>
> However, there's a more serious problem with your patch:
>
>> + /*
>> + * Ensure that the doorbell is updated before reading
>> + * the EventIdx from memory
>> + */
>> + mb();
>
> Good comment. Except what about the other side?
>
> When you use memory ordering rules, as opposed to locking, there's
> always *two* sides to any access order. There's this "write dbbuf_db"
> vs "read dbbuf_ei" ordering.
>
> But there's the other side: what about the side that writes dbbuf_ei,
> and reads dbbuf_db?
>
> I'm assuming that's the actual controller hardware, but it needs a
> comment about *that* access being ordered too, because if it isn't,
> then ordering this side is pointless.
>

The other side in this case is not actual controller hardware, but a
virtual one (regular hardware should rely on normal MMIO
doorbells). I spent some time going through the code of the internal
hypervisor and double-checking all guarantees around memory access
before asking the same question: "what about the other side?". This
execution ordering is mentioned in the NVMe spec under "Controller
Architecture", and it turned out that the NVMe driver itself had a
missing memory barrier.


Thanks,
Michal

2018-08-15 02:03:47

by Linus Torvalds

Subject: Re: [PATCH] Bugfix for handling of shadow doorbell buffer.

On Tue, Aug 14, 2018 at 6:35 PM Michal Wnukowski <[email protected]> wrote:
>
> I got confused after comparing the disassembly of this code with and
> without the volatile keyword. Thanks for the correction.

Note that _usually_, the "volatile" has absolutely no impact. When
there is one read in the source code, it's almost always one access in
the generated code too.

That's particularly true if a read like this access to dbbuf_ei:

if (!nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))

is only used as an argument to a real function call.

But if that function is an inline function (which it is), and the
argument ends up getting used multiple times (which in this case it is
not), then it is possible in theory that gcc ends up saying "instead
of reading the value once and keeping it in a register, I'll just read
it again for the second time".

That happens rarely, but it _can_ happen without the volatile (or the
READ_ONCE()).

The advantage of READ_ONCE() over using "volatile" in a data structure
tends to be that you explicitly mark the memory accesses that you care
about. That's nice documentation for whoever reads the code (and it's
not unheard of that the _same_ data structure is sometimes volatile,
and sometimes not - for example, the data structure might be protected
by a lock - not volatile - but people might use an optimistic lockless
access to the value too - when it ends up being volatile). So then it's
really good to explicitly use READ_ONCE() for the volatile cases, where
you show that you know that you're now doing something special that
really depends on memory ordering or other magic "access exactly once"
behavior.


> > I'm assuming that's the actual controller hardware, but it needs a
> > comment about *that* access being ordered too, because if it isn't,
> > then ordering this side is pointless.
>
> The other side in this case is not actual controller hardware, but a
> virtual one (regular hardware should rely on normal MMIO
> doorbells).

Ok, maybe it's worth adding a one-line note about the ordering guarantees
made by the virtual controller accesses.

Linus

2018-08-15 23:45:19

by Michal Wnukowski

Subject: [PATCH v2] Bugfix for handling of shadow doorbell buffer.

This patch adds a full memory barrier to the nvme_dbbuf_update_and_check_event
function to ensure that the shadow doorbell is written before EventIdx is
read from memory. This is a critical bugfix for the initial patch that
added shadow doorbell support to the NVMe driver
(f9f38e33389c019ec880f6825119c94867c1fde0).

This memory barrier is required because "Loads may be reordered with
older stores to different locations" (quote from the Intel 64 Architecture
Memory Ordering White Paper). The following two operations can be
reordered:
- Write the shadow doorbell (dbbuf_db) into memory.
- Read EventIdx (dbbuf_ei) from memory.
This can result in a race condition between the driver and the VM host
processing requests (if the given virtual NVMe controller supports the
shadow doorbell). If that occurs, the virtual NVMe controller may
decide to wait for an MMIO doorbell from the guest operating system,
while the guest driver may decide not to issue an MMIO doorbell on any
subsequent commands.

Note that the NVMe controller should have similar ordering guarantees
around writing EventIdx and reading the shadow doorbell. Otherwise, an
analogous race condition may occur.

This issue is a purely timing-dependent one, so there is no easy way to
reproduce it. Currently the easiest known approach is to run "ORacle IO
Numbers" (orion), which is shipped with Oracle DB:

orion -run advanced -num_large 0 -size_small 8 -type rand -simulate
concat -write 40 -duration 120 -matrix row -testname nvme_test

where nvme_test is a .lun file that contains a list of NVMe block
devices to run the test against. Limiting the number of vCPUs assigned to
a given VM instance seems to increase the chances of this bug occurring.
On a test environment with a VM that had 4 NVMe drives and 1 vCPU
assigned, the virtual NVMe controller hang could be observed within 10-20
minutes. That corresponds to about 400-500k IO operations processed (or
about 100GB of IO reads/writes).

The Orion tool was used for validation, set to run in a loop for 36
hours (equivalent to pushing 550M IO operations). No issues were
observed, which suggests that the patch fixes the issue.

Fixes: f9f38e33389c ("nvme: improve performance for virtual NVMe devices")
Signed-off-by: Michal Wnukowski <[email protected]>

Changes since v1:
- Additional note about NVMe controller behavior.
- The removal of the volatile keyword has been reverted.

---
drivers/nvme/host/pci.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 17a0190bd88f..4452f8553301 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -306,6 +306,14 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
old_value = *dbbuf_db;
*dbbuf_db = value;

+ /*
+ * Ensure that the doorbell is updated before reading
+ * the EventIdx from memory. NVMe controller should have
+ * similar ordering guarantees - update EventIdx before
+ * reading doorbell.
+ */
+ mb();
+
if (!nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))
return false;
}
--
2.18.0.865.gffc8e1a3cd6-goog


2018-08-16 18:34:28

by Keith Busch

Subject: Re: [PATCH v2] Bugfix for handling of shadow doorbell buffer.

On Wed, Aug 15, 2018 at 03:51:57PM -0700, Michal Wnukowski wrote:
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 17a0190bd88f..4452f8553301 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -306,6 +306,14 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
> old_value = *dbbuf_db;
> *dbbuf_db = value;
>
> + /*
> + * Ensure that the doorbell is updated before reading
> + * the EventIdx from memory. NVMe controller should have
> + * similar ordering guarantees - update EventIdx before
> + * reading doorbell.
> + */
> + mb();
> +
> if (!nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))
> return false;
> }

Looks good to me. This should also be a stable candidate.

Reviewed-by: Keith Busch <[email protected]>

2018-08-16 21:21:32

by Sagi Grimberg

Subject: Re: [PATCH v2] Bugfix for handling of shadow doorbell buffer.


> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 17a0190bd88f..4452f8553301 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -306,6 +306,14 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
> old_value = *dbbuf_db;
> *dbbuf_db = value;
>
> + /*
> + * Ensure that the doorbell is updated before reading
> + * the EventIdx from memory. NVMe controller should have
> + * similar ordering guarantees - update EventIdx before
> + * reading doorbell.
> + */
> + mb();
> +
> if (!nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))
> return false;
> }

Reviewed-by: Sagi Grimberg <[email protected]>

2018-08-17 07:07:30

by Christoph Hellwig

Subject: Re: [PATCH v2] Bugfix for handling of shadow doorbell buffer.

I've applied this with some major updates to the subject, changelog
and the comment in the code. Please double check it all still makes
sense.

2018-08-17 07:11:42

by Christoph Hellwig

Subject: Re: [PATCH] Bugfix for handling of shadow doorbell buffer.

On Tue, Aug 14, 2018 at 06:35:16PM -0700, Michal Wnukowski wrote:
>
> The other side in this case is not actual controller hardware, but a
> virtual one (regular hardware should rely on normal MMIO
> doorbells).

There could very much be real hardware there. We've made it clear
in the spec that while a typical use case is a virtualized controller,
there is nothing preventing hardware implementations. There have
been previous hardware prototypes that used very similar tricks, e.g.:

https://www.usenix.org/node/179878

2018-08-20 20:10:41

by Michal Wnukowski

Subject: Re: [PATCH v2] Bugfix for handling of shadow doorbell buffer.

Thanks for applying and improving the comments. It all still makes sense.
I will forward this patch to [email protected] once it gets merged
into Linus' tree.

On 08/17/2018 12:07 AM, Christoph Hellwig wrote:
> I've applied this with some major updates to the subject, changelog
> and the comment in the code. Please double check it all still makes
> sense.
>