Recent commit 52fde2c07da6 ("nvme: set dma alignment to dword") has caused
a regression on our platform. It turned out that the nvme_get_log() method
invocation caused corruption of the nvme_hwmon_data structure instance. In
particular, the nvme_hwmon_data.ctrl pointer was overwritten either with
zeros or with garbage. After some investigation we discovered that the
corruption happened even before the actual NVME DMA execution, during the
buffer mapping. Since our platform is DMA-noncoherent, the mapping implies
cache-line invalidations or write-backs depending on the DMA direction.
Getting the NVME SMART log entails a device-to-memory DMA, so cache
invalidation was performed during the buffer mapping. Since the log buffer
wasn't cache-line-aligned, the cache invalidation discarded the
neighbouring data, which turned out to be the data surrounding the buffer
within the nvme_hwmon_data structure.
In order to fix that we need to make sure the whole log buffer is defined
within a cache-line-aligned memory region so that the cache-invalidation
procedure won't touch the adjacent data. By doing so we not only get rid
of the denoted problem but also fulfill the requirement explicitly
described in [1].
After deeper investigation we found out that the denoted commit wasn't the
root cause of the problem. It merely exposed the defect by activating the
DMA-based NVME SMART log retrieval performed in the NVME hwmon driver.
The problem has been there since the initial commit of the driver.
[1] Documentation/core-api/dma-api.rst
Fixes: 400b6a7b13a3 ("nvme: Add hardware monitoring support")
Signed-off-by: Serge Semin <[email protected]>
---
Folks, I've thoroughly studied the whole NVME subsystem looking for
similar problems. It turned out there is one more place which may cause
the same issue. It's connected with the opal_dev.{cmd,req} buffers passed
to the nvme_sec_submit() method. The rest of the buffers involved in NVME
DMA are either allocated by kmalloc (and thus cache-line-aligned by
design) or bounce-buffered if allocated on the stack (see the
blk_rq_map_kern() method implementation).
---
drivers/nvme/host/hwmon.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/hwmon.c b/drivers/nvme/host/hwmon.c
index 0a586d712920..94192ab7a02d 100644
--- a/drivers/nvme/host/hwmon.c
+++ b/drivers/nvme/host/hwmon.c
@@ -10,9 +10,10 @@
#include "nvme.h"
+/* DMA-noncoherent platforms require the cache-aligned buffers */
struct nvme_hwmon_data {
+ struct nvme_smart_log log ____cacheline_aligned;
struct nvme_ctrl *ctrl;
- struct nvme_smart_log log;
struct mutex read_lock;
};
--
2.37.2
On Fri, Sep 09, 2022 at 10:19:15PM +0300, Serge Semin wrote:
> Recent commit 52fde2c07da6 ("nvme: set dma alignment to dword") has caused
>
> Folks, I've thoroughly studied the whole NVME subsystem looking for
> similar problems. Turned out there is one more place which may cause the
> same issue. It's connected with the opal_dev.{cmd,req} buffers passed to
> the nvme_sec_submit() method. The rest of the buffers involved in the NVME
> DMA are either allocated by kmalloc (must be cache-line-aligned by design)
> or bounced-buffered if allocated on the stack (see the blk_rq_map_kern()
> method implementation).
What about user space addresses? We can map those with cacheline offsets.
> ---
> drivers/nvme/host/hwmon.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/hwmon.c b/drivers/nvme/host/hwmon.c
> index 0a586d712920..94192ab7a02d 100644
> --- a/drivers/nvme/host/hwmon.c
> +++ b/drivers/nvme/host/hwmon.c
> @@ -10,9 +10,10 @@
>
> #include "nvme.h"
>
> +/* DMA-noncoherent platforms require the cache-aligned buffers */
> struct nvme_hwmon_data {
> + struct nvme_smart_log log ____cacheline_aligned;
> struct nvme_ctrl *ctrl;
> - struct nvme_smart_log log;
So this by chance happened to work before 52fde2c07da6 because the field
started at a 4-byte offset on your arch?
The change looks good.
Reviewed-by: Keith Busch <[email protected]>
On 9/9/22 12:19, Serge Semin wrote:
> Recent commit 52fde2c07da6 ("nvme: set dma alignment to dword") has caused
> a regression on our platform. It turned out that the nvme_get_log() method
> invocation caused the nvme_hwmon_data structure instance corruption. In
> particular the nvme_hwmon_data.ctrl pointer was overwritten either with
> zeros or with garbage. After some researches we discovered that the
> problem happened even before the actual NVME DMA execution, but during the
> buffer mapping. Since our platform was DMA-noncoherent the mapping implied
> the cache-lines invalidations or write-backs depending on the
> DMA-direction parameter. In case of the NVME SMART log getting the DMA
> was performed from-device-to-memory, thus the cache-invalidation was
> activated during the buffer mapping. Since the log-buffer wasn't
> cache-line aligned the cache-invalidation caused the neighbour data
> discard. The neighbouring data turned to be the data surrounding the
> buffer in the framework of the nvme_hwmon_data structure.
>
> In order to fix that we need to make sure that the whole log-buffer is
> defined within the cache-line-aligned memory region so the
> cache-invalidation procedure wouldn't involve the adjacent data. By doing
> so we not only get rid from the denoted problem but also fulfill the
> requirement explicitly described in [1].
>
> After a deeper researches we found out that the denoted commit wasn't a
> root cause of the problem. It just revealed the invalidity by activating
> the DMA-based NVME SMART log getting performed in the framework of the
> NVME hwmon driver. The problem was here since the initial commit of the
> driver.
>
> [1] Documentation/core-api/dma-api.rst
>
> Fixes: 400b6a7b13a3 ("nvme: Add hardware monitoring support")
> Signed-off-by: Serge Semin <[email protected]>
Thanks for tracking this down and for the fix.
Reviewed-by: Guenter Roeck <[email protected]>
Guenter
On Fri, Sep 09, 2022 at 01:42:34PM -0600, Keith Busch wrote:
> On Fri, Sep 09, 2022 at 10:19:15PM +0300, Serge Semin wrote:
> > Recent commit 52fde2c07da6 ("nvme: set dma alignment to dword") has caused
> >
> > Folks, I've thoroughly studied the whole NVME subsystem looking for
> > similar problems. Turned out there is one more place which may cause the
> > same issue. It's connected with the opal_dev.{cmd,req} buffers passed to
> > the nvme_sec_submit() method. The rest of the buffers involved in the NVME
> > DMA are either allocated by kmalloc (must be cache-line-aligned by design)
> > or bounced-buffered if allocated on the stack (see the blk_rq_map_kern()
> > method implementation).
>
> What about user space addresses?
Reasonable question. Alas, I haven't researched the user-space part as
thoroughly. What I can say for sure is that we haven't detected any
unaligned buffers passed to the DMA-mapping procedure other than the ones
denoted in this patch and in the next one. So far none of the user-space
buffers involved in NVME DMA have had an unaligned offset in the physical
address space. I have merged the following patch into our local kernel
tree:
https://patchwork.linux-mips.org/project/linux-mips/patch/[email protected]/
So if an unaligned buffer were passed, we would have detected it
immediately.
> We can map those with cacheline offsets.
If we could do that easily it would be great, but I don't see an easy way
out. AFAICS we would need to fix the blk_rq_map_user_iov() method so that
it copies the unaligned part of the buffer via the CPU and performs the
DMA-required operations on the rest of it. Do you have any better
suggestion in mind?
>
> > ---
> > drivers/nvme/host/hwmon.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvme/host/hwmon.c b/drivers/nvme/host/hwmon.c
> > index 0a586d712920..94192ab7a02d 100644
> > --- a/drivers/nvme/host/hwmon.c
> > +++ b/drivers/nvme/host/hwmon.c
> > @@ -10,9 +10,10 @@
> >
> > #include "nvme.h"
> >
> > +/* DMA-noncoherent platforms require the cache-aligned buffers */
> > struct nvme_hwmon_data {
> > + struct nvme_smart_log log ____cacheline_aligned;
> > struct nvme_ctrl *ctrl;
> > - struct nvme_smart_log log;
>
> So this by chance happened to work before 52fde2c07da6 because the field
> started at a 4-byte offset on your arch?
Correct. The offset is indeed 4 bytes, so the log field base address is
4-byte aligned. Because of that, a bounce buffer used to be used for
getting the NVME SMART log. Since the denoted commit the log buffer has
been used for DMA directly, which revealed the problem caused by the
cache invalidation on the buffer mapping.
>
> The change looks good.
>
> Reviewed-by: Keith Busch <[email protected]>
Thanks.
-Sergey
I think this will work, but unless we have to I'd generally prefer
to just split data that is DMAed into a separate allocation.
That is, do a separate kmalloc for the nvme_smart_log structure.
Guenter, is this ok with you?
On Sat, Sep 10, 2022 at 07:30:45AM +0200, Christoph Hellwig wrote:
> I think this will work, but unless we have to I'd generally prefer
> to just split dta that is DMAed into into a separate allocation.
> That is, do a separate kmalloc for the nvme_smart_log structure.
Well, both approaches will solve the denoted problem. I am just wondering
why you think the kmalloc-ed buffer is preferable? IMO it is a bit less
suitable since it increases the memory granularity - two kmalloc's
instead of one. Moreover it makes the code a bit more complex for the
same reason of having two allocations and two frees. Meanwhile using the
____cacheline_aligned qualifier to prevent the noncoherent-DMA problem is
a standard approach.
Perhaps the best solution would be a qualifier like this:
#ifdef CONFIG_DMA_NONCOHERENT
#define ____dma_buffer ____cacheline_aligned
#else
#define ____dma_buffer
#endif
and use it instead of ____cacheline_aligned directly.
-Sergey
On 9/9/22 22:30, Christoph Hellwig wrote:
> I think this will work, but unless we have to I'd generally prefer
> to just split dta that is DMAed into into a separate allocation.
> That is, do a separate kmalloc for the nvme_smart_log structure.
>
> Guenter, is this ok with you?
Seems to be a bit overkill, but sure.
Guenter
On Sat, Sep 10, 2022 at 03:35:45PM +0300, Serge Semin wrote:
> On Sat, Sep 10, 2022 at 07:30:45AM +0200, Christoph Hellwig wrote:
> > I think this will work, but unless we have to I'd generally prefer
> > to just split dta that is DMAed into into a separate allocation.
> > That is, do a separate kmalloc for the nvme_smart_log structure.
>
> Well, both approaches will solve the denoted problem. I am just
> wondering why do you think that the kmalloc-ed buffer is more
> preferable? IMO it is a bit less suitable since increases the memory
> granularity - two kmalloc's instead of one. Moreover it makes the code
^
`-- I meant fragmentation of course...
> a bit more complex for the same reason of having two mallocs and two
> frees. Meanwhile using the ____cacheline_aligned qualifier to prevent
> the noncoherent DMA problem is a standard approach.
On Sat, Sep 10, 2022 at 03:35:42PM +0300, Serge Semin wrote:
> Well, both approaches will solve the denoted problem. I am just
> wondering why do you think that the kmalloc-ed buffer is more
> preferable?
Because it clearly documents the intent. Here is one buffer that is
just a data buffer, and here is one with kernel internal structure.
The concept of embedding on-disk / on-the-wire structures into internal
structures always seemed rather weird and unexpected to me, as we now
need to ensure that the alignment works right on both sides. With
the right annotations (as done in this series) this will work, but
it feels a little fragile to me.
> What would be the best solution if we had a qualifier like this:
> #ifdef CONFIG_DMA_NONCOHERENT
> #define ____dma_buffer ____cacheline_aligned
> #else
> #define ____dma_buffer
> #endif
> and used it instead of the direct ____cacheline_aligned utilization.
So independent of my preference for separate allocations, this suggested
addition would still be very useful for the places where we need
the alignment for performance or other reasons. I'd use
something like __dma_aligned or similar, though.
Hello Christoph,
Sorry for the delay with response.
On Mon, Sep 12, 2022 at 10:29:10AM +0200, Christoph Hellwig wrote:
> On Sat, Sep 10, 2022 at 03:35:42PM +0300, Serge Semin wrote:
> > Well, both approaches will solve the denoted problem. I am just
> > wondering why do you think that the kmalloc-ed buffer is more
> > preferable?
>
> Because it clearly documents the intent. Here is one buffer that is
> just a data buffer, and here is one with kernel internal structure.
> The concept of embedding on-disk / on-the-wire structures into internal
> stuctures always seemed rather weird and unexpected to me, as we now
> need to ensure that the alignment works right on both sides. With
> the right annotations (as done in this series) this will work, but
> it feels a little fragile to me.
IMO both approaches seem unclear if the reader doesn't know what they
were introduced for. Anyway, do you insist on using the kmalloc-ed buffer
here instead? If so, I'll resubmit the series with this patch updated
accordingly.
-Sergey
On Mon, Sep 26, 2022 at 01:23:25AM +0300, Serge Semin wrote:
> IMO both the approaches seem unclear if a reader doesn't know what
> they have been introduced for. Anyway do you insist on using the
> kmalloc-ed buffer here instead? If so I'll resubmit the series with
> this patch updated accordingly.
I don't like the __aligned version too much, but I can live with it if
you strongly prefer it.
On Mon, Sep 26, 2022 at 04:39:59PM +0200, Christoph Hellwig wrote:
> On Mon, Sep 26, 2022 at 01:23:25AM +0300, Serge Semin wrote:
> > IMO both the approaches seem unclear if a reader doesn't know what
> > they have been introduced for. Anyway do you insist on using the
> > kmalloc-ed buffer here instead? If so I'll resubmit the series with
> > this patch updated accordingly.
>
> I don't like the __aligend version too much, but I can live with it if
> you strongly prefer it.
What about converting the NVME hwmon driver to use a kmalloc'ed buffer
(seeing that the rest of the NVME core driver prefers kmalloc'ed
buffers) and adding the ____cacheline_aligned modifier to the opal_dev
buffers?
-Sergey