2023-11-27 16:02:30

by Alan Stern

Subject: Bug in add_dma_entry()'s debugging code

Among other things, add_dma_entry() in kernel/dma/debug.c prints an
error message when it sees two overlapping FROM_DEVICE DMA mappings.
The actual overlap detection is performed by a separate routine,
active_cacheline_insert(). But the criterion this routine uses is
wrong.

All it looks for is mappings that start on the same cache line. However
on architectures that have cache-coherent DMA (such as x86), touching
the same cache line does not mean that two DMA mappings will interfere
with each other. To truly overlap, they would have to touch the same
_bytes_.

The routine does not check for this, and consequently we get error
messages about overlapping mappings when in fact there is no overlap.
This bug has been reported in

https://bugzilla.kernel.org/show_bug.cgi?id=215740

How should this be fixed? Since the check done in add_dma_entry() is
completely invalid for x86 and similar architectures, should it simply
be removed for them? Or should the check be enhanced to look for
byte-granularity overlap?

Alan Stern

PS: As a separate issue, active_cacheline_insert() fails to detect
overlap in situations where a mapping occupies more than one cache line.
For example, if mapping A uses lines N and N+1 and mapping B uses line
N+1, no overlap will be detected because the radix-tree keys for A and B
will be different (N vs. N+1).


2023-11-27 16:08:32

by Christoph Hellwig

Subject: Re: Bug in add_dma_entry()'s debugging code

On Mon, Nov 27, 2023 at 11:02:20AM -0500, Alan Stern wrote:
> All it looks for is mappings that start on the same cache line. However
> on architectures that have cache-coherent DMA (such as x86), touching
> the same cache line does not mean that two DMA mappings will interfere
> with each other. To truly overlap, they would have to touch the same
> _bytes_.

But that is a special case that does not matter. Linux drivers need
to be written in a portable way, and that means we have to care about
platforms that are not DMA coherent.

> How should this be fixed? Since the check done in add_dma_entry() is
> completely invalid for x86 and similar architectures, should it simply
> be removed for them? Or should the check be enhanced to look for
> byte-granularity overlap?

The check is anything but "completely invalid". It points out that you
violate the Linux DMA API requirements. This might not have an
effect on the particular platform you are currently running on, but it
is still wrong. Note that there have been various mumblings about
using nosnoop DMA on x86, in which case you'll have the issue there
as well.

> PS: As a separate issue, active_cacheline_insert() fails to detect
> overlap in situations where a mapping occupies more than one cache line.
> For example, if mapping A uses lines N and N+1 and mapping B uses line
> N+1, no overlap will be detected because the radix-tree keys for A and B
> will be different (N vs. N+1).

Fixes welcome.

2023-11-27 16:51:36

by Alan Stern

Subject: Re: Bug in add_dma_entry()'s debugging code

On Mon, Nov 27, 2023 at 05:07:59PM +0100, Christoph Hellwig wrote:
> On Mon, Nov 27, 2023 at 11:02:20AM -0500, Alan Stern wrote:
> > All it looks for is mappings that start on the same cache line. However
> > on architectures that have cache-coherent DMA (such as x86), touching
> > the same cache line does not mean that two DMA mappings will interfere
> > with each other. To truly overlap, they would have to touch the same
> > _bytes_.
>
> But that is a special case that does not matter. Linux drivers need
> to be written in a portable way, and that means we have to care about
> platforms that are not DMA coherent.

The buffers in the bug report were allocated by kmalloc(). Doesn't
kmalloc() guarantee that on architectures with non-cache-coherent DMA,
allocations will be aligned on cache-line boundaries (or larger)? Isn't
this what ARCH_DMA_MINALIGN and ARCH_KMALLOC_MINALIGN are supposed to
take care of in include/linux/slab.h?

> > How should this be fixed? Since the check done in add_dma_entry() is
> > completely invalid for x86 and similar architectures, should it simply
> > be removed for them? Or should the check be enhanced to look for
> > byte-granularity overlap?
>
> The check is anything but "completely invalid". It points out that you
> violate the Linux DMA API requirements.

Since when does the DMA API require that mappings on x86 must be to
separate cache lines? Is this documented anywhere?

For that matter, Documentation/core-api/dma-api-howto.rst explicitly
says:

If you acquired your memory via the page allocator
(i.e. __get_free_page*()) or the generic memory allocators
(i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from
that memory using the addresses returned from those routines.

It also says:

Architectures must ensure that kmalloc'ed buffer is
DMA-safe. Drivers and subsystems depend on it. If an architecture
isn't fully DMA-coherent (i.e. hardware doesn't ensure that data in
the CPU cache is identical to data in main memory),
ARCH_DMA_MINALIGN must be set so that the memory allocator
makes sure that kmalloc'ed buffer doesn't share a cache line with
the others. See arch/arm/include/asm/cache.h as an example.

It says nothing about avoiding more than one DMA operation at a time to
prevent overlap. Is the documentation wrong?

> This might not have an
> effect on the particular platform you are currently running on, but it
> is still wrong.

Who decides what is right and what is wrong in this area? Clearly you
have a different opinion from David S. Miller, Richard Henderson, and
Jakub Jelinek (the authors of that documentation file).

> Note that there have been various mumblings about
> using nosnoop dma on x86, in which case you'll have the issue there
> as well.

Unless the people who implement nosnoop DMA also modify kmalloc() or
ARCH_DMA_MINALIGN.

I guess the real question boils down to where the responsibility should
lie. Should kmalloc() guarantee that the memory regions it provides
will always be suitable for DMA, with no overlap issues? Or should all
the innumerable users of kmalloc() be responsible for jumping through
hoops to request arch-specific minimum alignment for their DMA buffers?

Alan Stern

2023-11-28 13:37:37

by Christoph Hellwig

Subject: Re: Bug in add_dma_entry()'s debugging code

On Mon, Nov 27, 2023 at 11:51:21AM -0500, Alan Stern wrote:
> The buffers in the bug report were allocated by kmalloc(). Doesn't
> kmalloc() guarantee that on architectures with non-cache-coherent DMA,
> allocations will be aligned on cache-line boundaries (or larger)? Isn't
> this what ARCH_DMA_MINALIGN and ARCH_KMALLOC_MINALIGN are supposed to
> take care of in include/linux/slab.h?

Oh. Yes, the variable alignment from kmalloc makes things complicated.

> Architectures must ensure that kmalloc'ed buffer is
> DMA-safe. Drivers and subsystems depend on it. If an architecture
> isn't fully DMA-coherent (i.e. hardware doesn't ensure that data in
> the CPU cache is identical to data in main memory),
> ARCH_DMA_MINALIGN must be set so that the memory allocator
> makes sure that kmalloc'ed buffer doesn't share a cache line with
> the others. See arch/arm/include/asm/cache.h as an example.
>
> It says nothing about avoiding more than one DMA operation at a time to
> prevent overlap. Is the documentation wrong?

The documentation is a bit lacking, unfortunately. Again, assuming
you actually have non-coherent mappings, you'd easily break the fragile
cache line ownership if you did. That doesn't apply to x86 as-is, but
we really try to keep drivers portable.

> > This might not have an
> > effect on the particular platform you are currently running on, but it
> > is still wrong.
>
> Who decides what is right and what is wrong in this area? Clearly you
> have a different opinion from David S. Miller, Richard Henderson, and
> Jakub Jelinek (the authors of that documentation file).

I don't think this is about anyone's opinion. It's a fundamental property of
how to manage caches in the face of non-coherent DMA.

>
> > Note that there have been various mumblings about
> > using nosnoop dma on x86, in which case you'll have the issue there
> > as well.
>
> Unless the people who implement nosnoop DMA also modify kmalloc() or
> ARCH_DMA_MINALIGN.

Or don't do it on kmalloc buffers.

> I guess the real question boils down to where the responsibility should
> lie. Should kmalloc() guarantee that the memory regions it provides
> will always be suitable for DMA, with no overlap issues? Or should all
> the innumerable users of kmalloc() be responsible for jumping through
> hoops to request arch-specific minimum alignment for their DMA buffers?

I'd actually go one step back:

1) for not cache coherent DMA you can't do overlapping operations inside
a cache line
2) dma-debug is intended to find DMA API misuses, even if they don't
have bad effects on your particular system
3) the fact that the kmalloc implementation returns differently aligned
memory depending on the platform breaks how dma-debug works currently

The logical conclusion from that would be that IFF dma-debug is enabled on
any platform we need to set ARCH_DMA_MINALIGN to the cache line size.

BUT: we're actually reducing our dependency on ARCH_DMA_MINALIGN by
moving to bounce buffering unaligned memory for non-coherent
architectures, which makes this even more complicated. Right now I
don't have a good idea how to actually deal with having the cacheline
sanity checks with that, but I'm Ccing some of the usual suspects in
case they have a better idea.

2023-11-28 15:18:37

by Alan Stern

Subject: Re: Bug in add_dma_entry()'s debugging code

On Tue, Nov 28, 2023 at 02:37:02PM +0100, Christoph Hellwig wrote:
> I'd actually go one step back:
>
> 1) for not cache coherent DMA you can't do overlapping operations inside
> a cache line

Rephrasing slightly: You mustn't perform multiple non-cache-coherent DMA
operations that touch the same cache line concurrently. (The word
"overlapping" is a little ambiguous in this context.)

(Right now dma-debug considers only DMA-IN operations. In theory this
restriction should apply even when some of the concurrent operations are
DMA-OUT, provided that at least one of them is DMA-IN. Minor point...)

> 2) dma-debug is intended to find DMA API misuses, even if they don't
> have bad effects on your particular system
> 3) the fact that the kmalloc implementation returns differently aligned
> memory depending on the platform breaks how dma-debug works currently

Exactly. That's essentially what Bugzilla #215740 is about.

> The logical conclusion from that would be that IFF dma-debug is enabled on
> any platform we need to set ARCH_DMA_MINALIGN to the cache line size.

(IF, not IFF.) And tell distributions that CONFIG_DMA_API_DEBUG is not
meant for production systems but rather for kernel testing, right?

> BUT: we're actually reducing our dependency on ARCH_DMA_MINALIGN by
> moving to bounce buffering unaligned memory for non-coherent
> architectures,

What's the reason for this? To allow the minimum allocation size to be
smaller than the cache line size? Does the savings in memory make up
for the extra overhead of bounce buffering?

Or is this just to allow people to be more careless about how they
allocate their DMA buffers (which doesn't seem to make sense)?

> which makes this even more complicated. Right now I
> don't have a good idea how to actually deal with having the cacheline
> sanity checks with that, but I'm Ccing some of the usual suspects if
> they have a better idea.

I get the impression that you would really like to have two different
versions of kmalloc() and friends: one for buffers that will be used in
DMA (and hence require cache-line alignment) and one for buffers that
won't be.

Alan Stern

2023-11-28 16:31:49

by Robin Murphy

Subject: Re: Bug in add_dma_entry()'s debugging code

On 28/11/2023 3:18 pm, Alan Stern wrote:
> On Tue, Nov 28, 2023 at 02:37:02PM +0100, Christoph Hellwig wrote:
>> I'd actually go one step back:
>>
>> 1) for not cache coherent DMA you can't do overlapping operations inside
>> a cache line
>
> Rephrasing slightly: You mustn't perform multiple non-cache-coherent DMA
> operations that touch the same cache line concurrently. (The word
> "overlapping" is a little ambiguous in this context.)
>
> (Right now dma-debug considers only DMA-IN operations. In theory this
> restriction should apply even when some of the concurrent operations are
> DMA-OUT, provided that at least one of them is DMA-IN. Minor point...)
>
>> 2) dma-debug is intended to find DMA API misuses, even if they don't
>> have bad effects on your particular system
>> 3) the fact that the kmalloc implementation returns differently aligned
>> memory depending on the platform breaks how dma-debug works currently
>
> Exactly. That's essentially what Bugzilla #215740 is about.
>
>> The logical conclusion from that would be that IFF dma-debug is enabled on
>> any platform we need to set ARCH_DMA_MINALIGN to the cache line size.
>
> (IF, not IFF.) And tell distributions that CONFIG_DMA_API_DEBUG is not
> meant for production systems but rather for kernel testing, right?

Yikes, I'd hope that distros are heeding the warning that the Kconfig
calls out already. It's perhaps somewhat understated, as I'd describe
the performance impact to large modern systems with high-bandwidth I/O
as somewhere between "severe" and "crippling".

>> BUT: we're actually reducing our dependency on ARCH_DMA_MINALIGN by
>> moving to bounce buffering unaligned memory for non-coherent
>> architectures,
>
> What's the reason for this? To allow the minimum allocation size to be
> smaller than the cache line size? Does the savings in memory make up
> for the extra overhead of bounce buffering?

Yes, on systems where non-coherent streaming DMA is expected to be rare,
or at least not performance-critical, not having to allocate 128 bytes
every time we want just 8 or so soon adds up.

> Or is this just to allow people to be more careless about how they
> allocate their DMA buffers (which doesn't seem to make sense)?
>
>> which makes this even more complicated. Right now I
> don't have a good idea how to actually deal with having the cacheline
>> sanity checks with that, but I'm Ccing some of the usual suspects if
>> they have a better idea.
>
> I get the impression that you would really like to have two different
> versions of kmalloc() and friends: one for buffers that will be used in
> DMA (and hence require cache-line alignment) and one for buffers that
> won't be.

That approach was mooted, but still has potentially-fiddly corner cases
like when the DMA buffer is a member of a larger struct, so we settled
on the bounce-buffering solution as the most robust compromise.

Thanks,
Robin.

2023-11-28 16:35:22

by Andy Shevchenko

Subject: Re: Bug in add_dma_entry()'s debugging code

On Tue, Nov 28, 2023 at 6:31 PM Robin Murphy <[email protected]> wrote:
> On 28/11/2023 3:18 pm, Alan Stern wrote:
> > On Tue, Nov 28, 2023 at 02:37:02PM +0100, Christoph Hellwig wrote:

...

> >> The logical conclusion from that would be that IFF dma-debug is enabled on
> >> any platform we need to set ARCH_DMA_MINALIGN to the cache line size.
> >
> > (IF, not IFF.) And tell distributions that CONFIG_DMA_API_DEBUG is not
> > meant for production systems but rather for kernel testing, right?
>
> Yikes, I'd hope that distros are heeding the warning that the Kconfig
> calls out already. It's perhaps somewhat understated, as I'd describe
> the performance impact to large modern systems with high-bandwidth I/O
> as somewhere between "severe" and "crippling".

Independently of the distro configurations, the (false positive)
message(s) will make it difficult to debug the actual problems; shouldn't
we have our code robust in any case?

--
With Best Regards,
Andy Shevchenko

2023-11-28 16:54:40

by Robin Murphy

Subject: Re: Bug in add_dma_entry()'s debugging code

On 28/11/2023 4:34 pm, Andy Shevchenko wrote:
> On Tue, Nov 28, 2023 at 6:31 PM Robin Murphy <[email protected]> wrote:
>> On 28/11/2023 3:18 pm, Alan Stern wrote:
>>> On Tue, Nov 28, 2023 at 02:37:02PM +0100, Christoph Hellwig wrote:
>
> ...
>
>>>> The logical conclusion from that would be that IFF dma-debug is enabled on
>>>> any platform we need to set ARCH_DMA_MINALIGN to the cache line size.
>>>
>>> (IF, not IFF.) And tell distributions that CONFIG_DMA_API_DEBUG is not
>>> meant for production systems but rather for kernel testing, right?
>>
>> Yikes, I'd hope that distros are heeding the warning that the Kconfig
>> calls out already. It's perhaps somewhat understated, as I'd describe
>> the performance impact to large modern systems with high-bandwidth I/O
>> as somewhere between "severe" and "crippling".
>
> Independently of the distro configurations, the (false positive)
> message(s) will make it difficult to debug the actual problems; shouldn't
> we have our code robust in any case?

Sure, I have no objection to making dma-debug more robust and useful for
its intended purpose, I was just commenting on the fact that any
potential change in behaviour from this should be of less concern to
distros than the significant change in behaviour that enabling it
*already* poses (i.e. globally serialising DMA operations and doing
inherently slow stuff in what are normally expected to be fast paths).

Thanks,
Robin.

2023-11-28 17:45:07

by Catalin Marinas

Subject: Re: Bug in add_dma_entry()'s debugging code

On Tue, Nov 28, 2023 at 10:18:19AM -0500, Alan Stern wrote:
> On Tue, Nov 28, 2023 at 02:37:02PM +0100, Christoph Hellwig wrote:
> > I'd actually go one step back:
> >
> > 1) for not cache coherent DMA you can't do overlapping operations inside
> > a cache line
>
> Rephrasing slightly: You mustn't perform multiple non-cache-coherent DMA
> operations that touch the same cache line concurrently. (The word
> "overlapping" is a little ambiguous in this context.)

The problem is worse. I'd say you should not perform even a single
non-cache-coherent DMA (usually from-device or bidirectional) operation
if the cache line is shared with anything else modifying it. It doesn't
need to be another DMA operation. But that's more difficult to add to
the DMA API debug code (maybe something like the bouncing logic in
dma_kmalloc_needs_bounce()).

> > The logical conclusion from that would be that IFF dma-debug is enabled on
> > any platform we need to set ARCH_DMA_MINALIGN to the cache line size.

Or just force the kmalloc() min align to cache_line_size(), something
like:

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 4a658de44ee9..3ece20367636 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -543,6 +543,8 @@ static inline int dma_get_cache_alignment(void)
#ifdef ARCH_HAS_DMA_MINALIGN
return ARCH_DMA_MINALIGN;
#endif
+ if (IS_ENABLED(CONFIG_DMA_API_DEBUG))
+ return cache_line_size();
return 1;
}
#endif
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8d431193c273..d0b21d6e9328 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -879,7 +879,7 @@ static unsigned int __kmalloc_minalign(void)
unsigned int minalign = dma_get_cache_alignment();

if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) &&
- is_swiotlb_allocated())
+ is_swiotlb_allocated() && !IS_ENABLED(CONFIG_DMA_API_DEBUG))
minalign = ARCH_KMALLOC_MINALIGN;

return max(minalign, arch_slab_minalign());

Also note that to_cacheline_number() in kernel/dma/debug.c only takes
into account the L1_CACHE_SHIFT. On arm64 for example, cache_line_size()
returns the maximum line of all the cache levels (and we've seen
hardware where the L1 is 64-byte, L2 is 128).

> > BUT: we're actually reducing our dependency on ARCH_DMA_MINALIGN by
> > moving to bounce buffering unaligned memory for non-coherent
> > architectures,
>
> What's the reason for this? To allow the minimum allocation size to be
> smaller than the cache line size? Does the savings in memory make up
> for the extra overhead of bounce buffering?
>
> Or is this just to allow people to be more careless about how they
> allocate their DMA buffers (which doesn't seem to make sense)?

It's the former and it does make a difference with lots of small
structure or string allocations.

[...]
> I get the impression that you would really like to have two different
> versions of kmalloc() and friends: one for buffers that will be used in
> DMA (and hence require cache-line alignment) and one for buffers that
> won't be.

We've been there for the past 2-3 years (and a few other options). It's
hard to guess in a generic way because the allocation place may not
necessarily know how the device is going to access that memory (PIO,
DMA). The conclusion was that for those cases where the buffer is small,
we just do a bounce. If it's performance critical, the driver can use a
kmem_cache_create(SLAB_HWCACHE_ALIGN) and avoid the bouncing.

--
Catalin

2023-11-30 20:08:50

by Ferry Toth

Subject: Re: Bug in add_dma_entry()'s debugging code

Hi,

Op 28-11-2023 om 18:44 schreef Catalin Marinas:
> On Tue, Nov 28, 2023 at 10:18:19AM -0500, Alan Stern wrote:
>> On Tue, Nov 28, 2023 at 02:37:02PM +0100, Christoph Hellwig wrote:
>>> I'd actually go one step back:
>>>
>>> 1) for not cache coherent DMA you can't do overlapping operations inside
>>> a cache line
>>
>> Rephrasing slightly: You mustn't perform multiple non-cache-coherent DMA
>> operations that touch the same cache line concurrently. (The word
>> "overlapping" is a little ambiguous in this context.)
>
> The problem is worse. I'd say you should not perform even a single
> non-cache-coherent DMA (usually from-device or bidirectional) operation
> if the cache line is shared with anything else modifying it. It doesn't
> need to be another DMA operation. But that's more difficult to add to
> the DMA API debug code (maybe something like the bouncing logic in
> dma_kmalloc_needs_bounce()).
>
>>> The logical conclusion from that would be that IFF dma-debug is enabled on
>>> any platform we need to set ARCH_DMA_MINALIGN to the cache line size.
>
> Or just force the kmalloc() min align to cache_line_size(), something
> like:
>
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 4a658de44ee9..3ece20367636 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -543,6 +543,8 @@ static inline int dma_get_cache_alignment(void)
> #ifdef ARCH_HAS_DMA_MINALIGN
> return ARCH_DMA_MINALIGN;
> #endif
> + if (IS_ENABLED(CONFIG_DMA_API_DEBUG))
> + return cache_line_size();
> return 1;
> }
> #endif
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 8d431193c273..d0b21d6e9328 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -879,7 +879,7 @@ static unsigned int __kmalloc_minalign(void)
> unsigned int minalign = dma_get_cache_alignment();
>
> if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) &&
> - is_swiotlb_allocated())
> + is_swiotlb_allocated() && !IS_ENABLED(CONFIG_DMA_API_DEBUG))
> minalign = ARCH_KMALLOC_MINALIGN;
>
> return max(minalign, arch_slab_minalign());

With above suggestion "force the kmalloc() min align to
cache_line_size()" + Alan's debug code:

root@yuna:~# journalctl -k | grep hub
kernel: usbcore: registered new interface driver hub
kernel: hub 1-0:1.0: USB hub found
kernel: usb usb1: hub buffer at 71c7180, status at 71c71c0
kernel: hub 1-0:1.0: 1 port detected
kernel: hub 2-0:1.0: USB hub found
kernel: usb usb2: hub buffer at 71c79c0, status at 71c7a00
kernel: hub 2-0:1.0: 1 port detected
kernel: hub 1-1:1.0: USB hub found
kernel: usb 1-1: hub buffer at 65b36c0, status at 6639340
kernel: hub 1-1:1.0: 7 ports detected

and the stack trace indeed goes away.

IOW the 2 root hub kmalloc() buffers are now also aligned to the cache
line size, even though these never triggered the stack trace. Strange: hub
status is aligned far away from hub buffer; kmalloc mysteries.

This still hasn't landed for me: are we detecting a false alarm here,
since the 2 DMA operations can never happen on the same cache line on
non-cache-coherent platforms? If so, shouldn't we fix up the dma-debug
code to not raise a false alarm, instead of changing the alignment?
Or, is this a bona fide warning (for non-cache-coherent platforms)? Then
we should not silence it by force-aligning, but issue a WARN (on a
cache-coherent platform) that is more useful (i.e. here we have not an
overlap but a shared cache line). On a non-cache-coherent platform
something stronger than a WARN might be appropriate?

> Also note that to_cacheline_number() in kernel/dma/debug.c only takes
> into account the L1_CACHE_SHIFT. On arm64 for example, cache_line_size()
> returns the maximum line of all the cache levels (and we've seen
> hardware where the L1 is 64-byte, L2 is 128).
>
>>> BUT: we're actually reducing our dependency on ARCH_DMA_MINALIGN by
>>> moving to bounce buffering unaligned memory for non-coherent
>>> architectures,
>>
>> What's the reason for this? To allow the minimum allocation size to be
>> smaller than the cache line size? Does the savings in memory make up
>> for the extra overhead of bounce buffering?
>>
>> Or is this just to allow people to be more careless about how they
>> allocate their DMA buffers (which doesn't seem to make sense)?
>
> It's the former and it does make a difference with lots of small
> structure or string allocations.
>
> [...]
>> I get the impression that you would really like to have two different
>> versions of kmalloc() and friends: one for buffers that will be used in
>> DMA (and hence require cache-line alignment) and one for buffers that
>> won't be.
>
> We've been there for the past 2-3 years (and a few other options). It's
> hard to guess in a generic way because the allocation place may not
> necessarily know how the device is going to access that memory (PIO,
> DMA). The conclusion was that for those cases where the buffer is small,
> we just do a bounce. If it's performance critical, the driver can use a
> kmem_cache_create(SLAB_HWCACHE_ALIGN) and avoid the bouncing.
>

2023-12-01 11:09:13

by Catalin Marinas

Subject: Re: Bug in add_dma_entry()'s debugging code

On Thu, Nov 30, 2023 at 09:08:23PM +0100, Ferry Toth wrote:
> Op 28-11-2023 om 18:44 schreef Catalin Marinas:
> > Or just force the kmalloc() min align to cache_line_size(), something
> > like:
> >
> > diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> > index 4a658de44ee9..3ece20367636 100644
> > --- a/include/linux/dma-mapping.h
> > +++ b/include/linux/dma-mapping.h
> > @@ -543,6 +543,8 @@ static inline int dma_get_cache_alignment(void)
> > #ifdef ARCH_HAS_DMA_MINALIGN
> > return ARCH_DMA_MINALIGN;
> > #endif
> > + if (IS_ENABLED(CONFIG_DMA_API_DEBUG))
> > + return cache_line_size();
> > return 1;
> > }
> > #endif
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index 8d431193c273..d0b21d6e9328 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -879,7 +879,7 @@ static unsigned int __kmalloc_minalign(void)
> > unsigned int minalign = dma_get_cache_alignment();
> > if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) &&
> > - is_swiotlb_allocated())
> > + is_swiotlb_allocated() && !IS_ENABLED(CONFIG_DMA_API_DEBUG))
> > minalign = ARCH_KMALLOC_MINALIGN;
> > return max(minalign, arch_slab_minalign());
>
> With above suggestion "force the kmalloc() min align to cache_line_size()" +
> Alan's debug code:
>
> root@yuna:~# journalctl -k | grep hub
> kernel: usbcore: registered new interface driver hub
> kernel: hub 1-0:1.0: USB hub found
> kernel: usb usb1: hub buffer at 71c7180, status at 71c71c0
> kernel: hub 1-0:1.0: 1 port detected
> kernel: hub 2-0:1.0: USB hub found
> kernel: usb usb2: hub buffer at 71c79c0, status at 71c7a00
> kernel: hub 2-0:1.0: 1 port detected
> kernel: hub 1-1:1.0: USB hub found
> kernel: usb 1-1: hub buffer at 65b36c0, status at 6639340
> kernel: hub 1-1:1.0: 7 ports detected
>
> and the stack trace indeed goes away.
>
> IOW also the 2 root hub kmalloc() are now also aligned to the cache line
> size, even though these never triggered the stack trace. Strange: hub status
> is aligned far away from hub buffer, kmalloc mysteries.

They are 64 bytes apart in most cases, other than the last one, where I
guess the status had to go to a different slab page.

> This still did not land for me: are we detecting a false alarm here as the 2
> DMA operations can never happen on the same cache line on non-cache-coherent
> platforms? If so, shouldn't we fix up the dma debug code to not detect a
> false alarm? Instead of changing the alignment?

It's a false alarm indeed on this hardware since the DMA is
cache-coherent. I think Christoph mentioned earlier in this thread that
he'd like the DMA API debug to report potential problems even if the
hardware it is running on is safe.

> Or, is this a bona fide warning (for non-cache-coherent platforms)? Then we
> should not silence it by force aligning it, but issue a WARN (on a cache
> coherent platform) that is more useful (i.e. here we have not an overlap but
> a shared cache line). On a non-cache coherent platform something stronger
> than a WARN might be appropriate?

A non-cache coherent platform would either have the kmalloc()
allocations aligned or it would bounce those small, unaligned buffers.
So it doesn't seem right to issue a warning on x86, where the kmalloc()
minimum alignment is 8, and not get the warning on a non-coherent
platform which forces the kmalloc() alignment.

If we consider the kmalloc() aspect already covered by bouncing or force
alignment, the DMA API debug code can still detect other cache line
sharing situations.

--
Catalin

2023-12-01 12:19:17

by Ferry Toth

Subject: Re: Bug in add_dma_entry()'s debugging code

Hi

On 01-12-2023 12:08, Catalin Marinas wrote:
> On Thu, Nov 30, 2023 at 09:08:23PM +0100, Ferry Toth wrote:
>> Op 28-11-2023 om 18:44 schreef Catalin Marinas:
>>> Or just force the kmalloc() min align to cache_line_size(), something
>>> like:
>>>
>>> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
>>> index 4a658de44ee9..3ece20367636 100644
>>> --- a/include/linux/dma-mapping.h
>>> +++ b/include/linux/dma-mapping.h
>>> @@ -543,6 +543,8 @@ static inline int dma_get_cache_alignment(void)
>>> #ifdef ARCH_HAS_DMA_MINALIGN
>>> return ARCH_DMA_MINALIGN;
>>> #endif
>>> + if (IS_ENABLED(CONFIG_DMA_API_DEBUG))
>>> + return cache_line_size();
>>> return 1;
>>> }
>>> #endif
>>> diff --git a/mm/slab_common.c b/mm/slab_common.c
>>> index 8d431193c273..d0b21d6e9328 100644
>>> --- a/mm/slab_common.c
>>> +++ b/mm/slab_common.c
>>> @@ -879,7 +879,7 @@ static unsigned int __kmalloc_minalign(void)
>>> unsigned int minalign = dma_get_cache_alignment();
>>> if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) &&
>>> - is_swiotlb_allocated())
>>> + is_swiotlb_allocated() && !IS_ENABLED(CONFIG_DMA_API_DEBUG))
>>> minalign = ARCH_KMALLOC_MINALIGN;
>>> return max(minalign, arch_slab_minalign());
>> With above suggestion "force the kmalloc() min align to cache_line_size()" +
>> Alan's debug code:
>>
>> root@yuna:~# journalctl -k | grep hub
>> kernel: usbcore: registered new interface driver hub
>> kernel: hub 1-0:1.0: USB hub found
>> kernel: usb usb1: hub buffer at 71c7180, status at 71c71c0
>> kernel: hub 1-0:1.0: 1 port detected
>> kernel: hub 2-0:1.0: USB hub found
>> kernel: usb usb2: hub buffer at 71c79c0, status at 71c7a00
>> kernel: hub 2-0:1.0: 1 port detected
>> kernel: hub 1-1:1.0: USB hub found
>> kernel: usb 1-1: hub buffer at 65b36c0, status at 6639340
>> kernel: hub 1-1:1.0: 7 ports detected
>>
>> and the stack trace indeed goes away.
>>
>> IOW also the 2 root hub kmalloc() are now also aligned to the cache line
>> size, even though these never triggered the stack trace. Strange: hub status
>> is aligned far away from hub buffer, kmalloc mysteries.
> They are 64 bytes apart in most cases other than the last one which I
> guess the status had to go to a different slab page.
>
>> This still did not land for me: are we detecting a false alarm here as the 2
>> DMA operations can never happen on the same cache line on non-cache-coherent
>> platforms? If so, shouldn't we fix up the dma debug code to not detect a
>> false alarm? Instead of changing the alignment?
> It's a false alarm indeed on this hardware since the DMA is
> cache-coherent. I think Christoph mentioned earlier in this thread that
> he'd like the DMA API debug to report potential problems even if the
> hardware it is running on is safe.

I agree with Christoph. So my question is whether we are detecting a
false alarm in general, not just "a false alarm on this hardware".

>> Or, is this a bonafide warning (for non-cache-coherent platforms)? Then we
>> should not silence it by force aligning it, but issue a WARN (on a cache
>> coherent platform) that is more useful (i.e. here we have not an overlap but
>> a shared cache line). On a non-cache coherent platform something stronger
>> than a WARN might be appropriate?
> A non-cache coherent platform would either have the kmalloc()
> allocations aligned or it would bounce those small, unaligned buffers.
It would? Or it always has?
> So it doesn't seem right to issue a warning on x86 where kmalloc()
> minimum alignment is 8 and not getting the warning on a non-coherent
> platform which forces the kmalloc() alignment.

If *all* non-coherent platforms implement either correct alignment or a
bounce buffer, and on (coherent) x86 we get a WARN, then it is a false
alarm, right?

That is exactly my question (because I have no idea which platforms have
non-coherent caches).

> If we consider the kmalloc() aspect already covered by bouncing or force
> alignment, the DMA API debug code can still detect other cache line
> sharing situations.

Consider? Or know?

If we know, and we cannot easily prevent false WARNs on x86, it could
be sufficient to change the message to "possible cacheline sharing
detected" and add a line saying that "DMA_API_DEBUG" should be disabled
on production systems.

And while at it, we might be able to fix the missed real cacheline
sharing mentioned by Alan:

"As a separate issue, active_cacheline_insert() fails to detect
overlap in situations where a mapping occupies more than one cache line.
For example, if mapping A uses lines N and N+1 and mapping B uses line
N+1, no overlap will be detected because the radix-tree keys for A and B
will be different (N vs. N+1)."

2023-12-01 16:22:02

by Alan Stern

[permalink] [raw]
Subject: Re: Bug in add_dma_entry()'s debugging code

On Fri, Dec 01, 2023 at 01:17:43PM +0100, Ferry Toth wrote:
> > A non-cache coherent platform would either have the kmalloc()
> > allocations aligned or it would bounce those small, unaligned buffers.
> It would? Or it always has?
> > So it doesn't seem right to issue a warning on x86 where kmalloc()
> > minimum alignment is 8 and not getting the warning on a non-coherent
> > platform which forces the kmalloc() alignment.
>
> If *all*non-coherent platforms implement either correct alignment or bounce
> buffer, and on (coherent) x86 we get an WARN, then it is a false alarm
> right?
>
> That is exactly my question (because I have no idea which platforms have
> non-coherent caches).

Don't forget, not all DMA buffers are allocated by kmalloc(). A buffer
allocated by some other means might not be aligned properly and might
share a cache line with another buffer.

Or you might have a single data structure that was allocated by
kmalloc() and then create separate DMA mappings for two members of that
structure. If the two members are in the same cache line, that would be
an error.

Alan Stern

> > If we consider the kmalloc() aspect already covered by bouncing or force
> > alignment, the DMA API debug code can still detect other cache line
> > sharing situations.
>
> Consider? Or know?
>
> If we know, and we can not easily prevent false WARN's on x86 it could be
> sufficient to change the message to "possible cacheline sharing detected"
> and add a line that "DMA_API_DEBUG" should be disabled on production
> systems.
>
> And while at it, we might be able to fix the missed real cacheline sharing
> mentioned by Alan:
>
> "As a separate issue, active_cacheline_insert() fails to detect
> overlap in situations where a mapping occupies more than one cache line.
> For example, if mapping A uses lines N and N+1 and mapping B uses line
> N+1, no overlap will be detected because the radix-tree keys for A and B
> will be different (N vs. N+1)."
>

2023-12-01 17:43:12

by Catalin Marinas

[permalink] [raw]
Subject: Re: Bug in add_dma_entry()'s debugging code

Replying to both here.

On Fri, Dec 01, 2023 at 11:21:40AM -0500, Alan Stern wrote:
> On Fri, Dec 01, 2023 at 01:17:43PM +0100, Ferry Toth wrote:
> > > A non-cache coherent platform would either have the kmalloc()
> > > allocations aligned or it would bounce those small, unaligned buffers.
> >
> > It would? Or it always has?

It depends on the configuration. arm64 does the bounce as it opted in to
CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC.

> > > So it doesn't seem right to issue a warning on x86 where kmalloc()
> > > minimum alignment is 8 and not getting the warning on a non-coherent
> > > platform which forces the kmalloc() alignment.
> >
> > If *all*non-coherent platforms implement either correct alignment or bounce
> > buffer, and on (coherent) x86 we get an WARN, then it is a false alarm
> > right?
> >
> > That is exactly my question (because I have no idea which platforms have
> > non-coherent caches).
>
> Don't forget, not all DMA buffers are allocated by kmalloc(). A buffer
> allocated by some other means might not be aligned properly and might
> share a cache line with another buffer.
>
> Or you might have a single data structure that was allocated by
> kmalloc() and then create separate DMA mappings for two members of that
> structure. If the two members are in the same cache line, that would be
> an error.

Indeed, so to be sure we don't trip over other false alarms, we'd also
need to force ARCH_DMA_MINALIGN to be at least a cache-line size. That's
used in some structures to force a static alignment of various members
that take DMA transfers. After that, anything reported might actually be
a potential issue, not a false alarm.

However, I wonder whether we'd actually hide some genuine problems.
Let's say x86 gets some DMA corruption when it tries to DMA 8 bytes
into two adjacent buffers because of some DMA buffer overflow, nothing
to do with cache lines. You enable the DMA API debugging to see if it
reports anything and it suddenly starts working because of the forced
minimum alignment.

--
Catalin

2023-12-05 18:28:38

by Alan Stern

[permalink] [raw]
Subject: Re: Bug in add_dma_entry()'s debugging code

On Fri, Dec 01, 2023 at 05:42:52PM +0000, Catalin Marinas wrote:
> Indeed, so to be sure we don't trip over other false alarms, we'd also
> need to force ARCH_DMA_MINALIGN to be at least a cache-line size. That's
> used in some structures to force a static alignment of various members
> that take DMA transfers. After that, anything reported might actually be
> a potential issue, not a false alarm.
>
> However, I wonder whether we'd actually hide some genuine problems.
> Let's say x86 gets some DMA corruption when it tries to DMA 8 bytes
> into two adjacent buffers because of some DMA buffer overflow, nothing
> to do with cache lines. You enable the DMA API debugging to see if it
> reports anything and it suddenly starts working because of the forced
> minimum alignment.

In the long run, it may turn out that trying to detect memory usage
patterns that could cause problems for architectures other than the one
currently running is not workable. Certainly it is a bad idea to have a
debugging infrastructure that changes the behavior of other parts of the
system -- particularly when those other parts may be the ones you're
trying to debug.

You may end up needing to make a choice here. Which evil is lesser?

Alan Stern

2023-12-06 16:22:08

by Alan Stern

[permalink] [raw]
Subject: Re: Bug in add_dma_entry()'s debugging code

On Wed, Dec 06, 2023 at 05:12:40PM +0100, Ferry Toth wrote:
> Hi
>
> On 05-12-2023 19:28, Alan Stern wrote:
> > In the long run, it may turn out that trying to detect memory usage
> > patterns that could cause problems for architectures other than the one
> > currently running is not workable. Certainly it is a bad idea to have a
>
> Maybe. But while debugging code on your platform it is a good thing to get
> warnings for potential issues on other platforms.

Oh, absolutely! It's definitely a good thing. I'm just saying that
doing it may not be practical -- there may be too many false positives
(as in this bug report) and false negatives.

> In this case we (I) got misled by the warning message itself. That should be
> easy enough to improve.

I don't think so. Issuing incorrect warnings that should be ignored is
a very bad idea, no matter what the wording is. After seeing a few
messages like that, people learn to expect them -- and then they ignore
the valid warnings along with the incorrect ones.

Alan Stern