2023-11-03 15:14:03

by Niklas Schnelle

Subject: Memory corruption with CONFIG_SWIOTLB_DYNAMIC=y

Hi Swiotlb Maintainers,

With s390 slated to use dma-iommu.c I was experimenting with
CONFIG_SWIOTLB_DYNAMIC but was getting errors all over the place.
Debugging this I've managed to narrow things down to what I believe is
a memory corruption issue caused by overrunning the entire transient
memory pool allocated by swiotlb. In my test this happens on the
iommu_dma_map_page(), swiotlb_tbl_map_single() path when handling
untrusted PCI devices.

I've seen this happen only for transient pools when:
* the allocation size is >= PAGE_SIZE, and
* the original address of the mapping is not page aligned.

The overrun can be seen clearly by adding a simple "tlb_addr +
alloc_size > pool->end" overrun check to swiotlb_tbl_map_single() and
forcing PCI test devices to be untrusted. With that in place, an NVMe
fio workload runs fine, but TCP/IP tests on a Mellanox ConnectX-4 VF
reliably trigger the overrun check and, with a debug print, produce
output like the following:

software IO TLB:
transient:1
index:1
dma_get_min_align_mask(dev):0
orig_addr:ac0caebe
tlb_addr=a4d0f800
start:a4d0f000
end:a4d10000
pool_size:4096
alloc_size:4096
offset:0
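
For reference, the check itself is just a couple of lines at the end of
swiotlb_tbl_map_single(), once the bounce buffer address has been
computed (a sketch of our debug addition, not upstream code; the
warning text is ours):

    /* Debug only: catch bounce buffers that run past the chosen pool. */
    if (tlb_addr + alloc_size > pool->end)
            dev_WARN(dev, "swiotlb: buffer overruns pool (%pa + %zu > %pa)\n",
                     &tlb_addr, alloc_size, &pool->end);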

The complete code used for this test is available on my
git.kernel.org[0] repository but it's basically v6.6 + iommu/next (for
s390 DMA API) + 2 trivial debug commits.

For further analysis I've worked closely with Halil Pasic.

The reason why we think this is happening is twofold. Under a
certain set of circumstances in the function swiotlb_find_slots():
1) the allocated transient pool cannot fit the required bounce buffer,
and
2) the invocation of swiotlb_pool_find_slots() finds "a suitable
slot" even though it should fail.

The reason for 2), i.e. swiotlb_pool_find_slots() thinking there is a
suitable bounce buffer in the pool, is that for transient pools
pool->slots[i].list ends up nonsensical, because swiotlb_init_io_tlb_pool(),
among other places in swiotlb, assumes that the nslabs of the pool is a
multiple of IO_TLB_SEGSIZE.
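
To illustrate, here is a simplified stand-alone sketch of the relevant
loop in swiotlb_init_io_tlb_pool() (our simplification, not the actual
kernel code):

    #include <stdio.h>

    #define IO_TLB_SEGSIZE 128

    /* Simplified from swiotlb: offset of slot i within its segment. */
    static unsigned int io_tlb_offset(unsigned int i)
    {
            return i & (IO_TLB_SEGSIZE - 1);
    }

    int main(void)
    {
            /* A transient pool sized for a single 4K mapping has just 2
             * slots, far fewer than the IO_TLB_SEGSIZE slots per segment
             * that the init code assumes. */
            unsigned int nslabs = 2;
            unsigned int i;

            for (i = 0; i < nslabs; i++) {
                    /* The free-list count is computed as if every segment
                     * were fully populated... */
                    unsigned int list = IO_TLB_SEGSIZE - io_tlb_offset(i);

                    /* ...so slot 0 claims 128 contiguous free slots even
                     * though the pool only has 2. */
                    printf("slots[%u].list = %u (pool has %u slots)\n",
                           i, list, nslabs);
            }
            return 0;
    }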

The reason for 1) is a bit more convoluted and not entirely understood
by us. We are certain though that the function swiotlb_find_slots()
allocates a pool with nr_slots(alloc_size) slots, where this alloc_size
is the alloc_size from swiotlb_tbl_map_single() + swiotlb_align_offset(),
but for alignment reasons some slots may get "skipped over" in
swiotlb_area_find_slots(), causing the picked slots to overrun the
allocation.
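
Plugging the numbers from the debug output above into this picture
gives exactly the observed overrun (a stand-alone back-of-the-envelope
check, not kernel code):

    #include <stdio.h>

    #define IO_TLB_SHIFT 11
    #define IO_TLB_SIZE (1UL << IO_TLB_SHIFT)       /* 2048-byte slots */

    int main(void)
    {
            /* Values from the debug output above. */
            unsigned long start = 0xa4d0f000, end = 0xa4d10000;
            unsigned long alloc_size = 4096;  /* pool holds exactly 2 slots */
            unsigned int index = 1;           /* slot 0 skipped for alignment */

            unsigned long tlb_addr =
                    start + ((unsigned long)index << IO_TLB_SHIFT);

            printf("tlb_addr = 0x%lx\n", tlb_addr);         /* 0xa4d0f800 */
            printf("buffer end 0x%lx > pool end 0x%lx\n",
                   tlb_addr + alloc_size, end);  /* 0xa4d10800 > 0xa4d10000 */
            return 0;
    }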

Not sure how to properly fix this as the different alignment
requirements get pretty complex quickly. So would appreciate your
input.

Thanks,
Niklas

[0] bounce-min branch of my git.kernel.org repository at
https://git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git



2023-11-03 16:16:29

by Halil Pasic

Subject: Re: Memory corruption with CONFIG_SWIOTLB_DYNAMIC=y

On Fri, 03 Nov 2023 16:13:03 +0100
Niklas Schnelle <[email protected]> wrote:

> The reason for 1) is a bit more convoluted and not entirely understood
> by us. We are certain though that the function swiotlb_find_slots()
> allocates a pool with nr_slots(alloc_size) slots, where this alloc_size
> is the alloc_size from swiotlb_tbl_map_single() + swiotlb_align_offset(),
> but for alignment reasons some slots may get "skipped over" in
> swiotlb_area_find_slots(), causing the picked slots to overrun the
> allocation.
>
> Not sure how to properly fix this as the different alignment
> requirements get pretty complex quickly. So would appreciate your
> input.

Let me post a more detailed analysis of why we observe
swiotlb_area_find_slots() considering the slot with
index 0 invalid in our particular case, and how that
relates to the whole "alignment" complex.

Currently there are three distinct mechanisms that dictate the "alignment":
a) min_align_mask (introduced by commit 36950f2da1ea ("driver core: add a
min_align_mask field to struct device_dma_parameters")),
b) alloc_size >= PAGE_SIZE, which requires "page alignment", and
c) alloc_align_mask.
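
For reference, this is roughly how the three inputs are combined in
swiotlb_area_find_slots() as of v6.6 (condensed; the a/b/c annotations
are ours):

    unsigned int iotlb_align_mask =
            dma_get_min_align_mask(dev) | alloc_align_mask;   /* a) and c) */

    /* ... */

    /*
     * For allocations of PAGE_SIZE or larger only look for page aligned
     * allocations.
     */
    if (alloc_size >= PAGE_SIZE)
            iotlb_align_mask |= ~PAGE_MASK;                   /* b) */
    iotlb_align_mask &= ~(IO_TLB_SIZE - 1);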

In our case min_align_mask == 0 and a) is thus not applicable; because of
b) and c) we end up with iotlb_align_mask == 0x800. And because
orig_addr & 0x800 == 0x800 but pool->start & 0x800 == 0, the slot at
index 0 is skipped over. Slot 0 is skipped over because it is page
aligned, when !!((1UL << PAGE_SHIFT) & orig_addr).

Let us note that with the current implementation the min_align_mask, that
is mechanism a), also controls the tlb_addr within the first slot so that:
tlb_addr & min_align_mask == orig_addr & min_align_mask. In that sense a) is
very unlike b) and c).

For example, if !min_align_mask, then tlb_addr & (IO_TLB_SIZE - 1) is always 0,
even if alloc_size >= PAGE_SIZE or alloc_align_mask is nonzero.

If with b) and c) the goal is that the swiotlb buffer shall not stretch over
more pages or address-space blocks of a size dictated by alloc_align_mask,
then that goal is accomplished. If however the goal is to preserve the
offsets modulo some power of 2 dictated either by PAGE_SHIFT or by
alloc_align_mask, then that goal is not reached.

But there is more to it! In the beginning there was b), or more precisely, in
the olden days, for mapping_size >= PAGE_SIZE we used to allocate properly
page-aligned bounce buffers. That is, tlb_addr & (~PAGE_MASK) == 0 regardless
of what orig_addr & (~PAGE_MASK) & (IO_TLB_SIZE - 1) is. That first got borked
by commit 1f221a0d0dbf ("swiotlb: respect min_align_mask") and then it did not
get fixed by commit 0eee5ae10256 ("swiotlb: fix slot alignment checks").

Let us also note that if more than one of the above mechanisms is applicable,
then for the slot selection the idea is apparently to go with the strictest
"alignment requirement", while for the offset within the slot only a) matters
(if applicable, i.e. min_align_mask != 0), which may appear strange if
not thoroughly documented.

In our opinion the first step towards getting this right is to figure out what
the different kinds of alignment are really supposed to mean. For each of the
mechanisms we need to understand and document whether it is about making sure
that the bounce buffer does not stretch over more than a certain unit of memory
(pages, IOVA granules, whatever), or about preserving the offset within a
certain unit of memory, and if so to what extent (the least significant n bits
of orig_addr dictated by the respective mask, or something different).

Thank you for your help in advance!

Regards,
Halil and Niklas

2023-11-03 19:04:06

by Petr Tesařík

Subject: Re: Memory corruption with CONFIG_SWIOTLB_DYNAMIC=y

Hello Niklas,

thank you for your report. This is some great analysis!

On Fri, 03 Nov 2023 16:13:03 +0100
Niklas Schnelle <[email protected]> wrote:

> Hi Swiotlb Maintainers,
>
> With s390 slated to use dma-iommu.c I was experimenting with
> CONFIG_SWIOTLB_DYNAMIC but was getting errors all over the place.
> Debugging this I've managed to narrow things down to what I believe is
> a memory corruption issue caused by overrunning the entire transient
> memory pool allocated by swiotlb. In my test this happens on the
> iommu_dma_map_page(), swiotlb_tbl_map_single() path when handling
> untrusted PCI devices.
>
> I've seen this happen only for transient pools when:
> * the allocation size is >= PAGE_SIZE, and
> * the original address of the mapping is not page aligned.
>
> The overrun can be seen clearly by adding a simple "tlb_addr +
> alloc_size > pool->end" overrun check to swiotlb_tbl_map_single() and
> forcing PCI test devices to be untrusted. With that in place, an NVMe
> fio workload runs fine, but TCP/IP tests on a Mellanox ConnectX-4 VF
> reliably trigger the overrun check and, with a debug print, produce
> output like the following:
>
> software IO TLB:
> transient:1
> index:1
> dma_get_min_align_mask(dev):0
> orig_addr:ac0caebe
> tlb_addr=a4d0f800
> start:a4d0f000
> end:a4d10000
> pool_size:4096
> alloc_size:4096
> offset:0
>
> The complete code used for this test is available on my
> git.kernel.org[0] repository but it's basically v6.6 + iommu/next (for
> s390 DMA API) + 2 trivial debug commits.
>
> For further analysis I've worked closely with Halil Pasic.
>
> The reason why we think this is happening is twofold. Under a
> certain set of circumstances in the function swiotlb_find_slots():
> 1) the allocated transient pool cannot fit the required bounce buffer,

I am aware that this can happen. It's in fact why the index returned by
swiotlb_search_pool_area() is checked in swiotlb_find_slots().

> and
> 2) the invocation of swiotlb_pool_find_slots() finds "a suitable
> slot" even though it should fail.

This needs fixing.

> The reason for 2), i.e. swiotlb_pool_find_slots() thinking there is a
> suitable bounce buffer in the pool, is that for transient pools
> pool->slots[i].list ends up nonsensical, because swiotlb_init_io_tlb_pool(),
> among other places in swiotlb, assumes that the nslabs of the pool is a
> multiple of IO_TLB_SEGSIZE.

Ah... Yes, the transient pool size must also be a multiple of segment
size, but it is not enforced.
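Enforcing it should be straightforward: round the transient pool size
up to a whole segment when it is sized in swiotlb_find_slots(),
something along these lines (an untested sketch, not a patch):

    /* Sketch: size transient pools in whole IO_TLB_SEGSIZE segments so
     * that the per-slot free-list counts stay consistent. */
    nslabs = ALIGN(nr_slots(alloc_size), IO_TLB_SEGSIZE);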

This reminds me of my debugging session that resulted in commit
aabd12609f91 ("swiotlb: always set the number of areas before
allocating the pool") and this conversation:

https://lore.kernel.org/linux-iommu/[email protected]/

> The reason for 1) is a bit more convoluted and not entirely understood
> by us. We are certain though that the function swiotlb_find_slots()
> allocates a pool with nr_slots(alloc_size) slots, where this alloc_size
> is the alloc_size from swiotlb_tbl_map_single() + swiotlb_align_offset(),
> but for alignment reasons some slots may get "skipped over" in
> swiotlb_area_find_slots(), causing the picked slots to overrun the
> allocation.

Yes, this can happen.

> Not sure how to properly fix this as the different alignment
> requirements get pretty complex quickly. So would appreciate your
> input.

I don't think it's possible to improve the allocation logic without
modifying the page allocator and/or the DMA atomic pool allocator to
take additional constraints into account.

I had a wild idea back in March, but it would require some intrusive
changes in the mm subsystem. Among other things, it would make memory
zones obsolete. I mean, people may actually like to get rid of DMA,
DMA32 and NORMAL, but you see how many nasty bugs were introduced even
by a relatively small change in SWIOTLB. Replacing memory zones with a
system based on generic physical allocation constraints would probably
blow up the universe. ;-)

Petr T

2023-11-03 20:51:45

by Petr Tesařík

Subject: Re: Memory corruption with CONFIG_SWIOTLB_DYNAMIC=y

On Fri, 3 Nov 2023 17:14:47 +0100
Halil Pasic <[email protected]> wrote:

>[...]
> In our case min_align_mask == 0 and a) is thus not applicable; because of
> b) and c) we end up with iotlb_align_mask == 0x800. And because
> orig_addr & 0x800 == 0x800 but pool->start & 0x800 == 0, the slot at
> index 0 is skipped over. Slot 0 is skipped over because it is page
> aligned, when !!((1UL << PAGE_SHIFT) & orig_addr).

Wait. These mask values can quickly become confusing. Do you mean
iotlb_align_mask == 0xfff?

> Let us note that with the current implementation the min_align_mask, that
> is mechanism a), also controls the tlb_addr within the first slot so that:
> tlb_addr & min_align_mask == orig_addr & min_align_mask. In that sense a) is
> very unlike b) and c).

It is silently assumed that PAGE_SIZE >= IO_TLB_SIZE, so if the buffer
is page-aligned, the lower bits of the alignment inside the io tlb slot
must be zero.

If the same assumption is made about alloc_align_mask, it should be
documented, but it is not.

>[...]
> In our opinion the first step towards getting this right is to figure out what
> the different kinds of alignment are really supposed to mean. For each of the
> mechanisms we need to understand and document whether it is about making sure
> that the bounce buffer does not stretch over more than a certain unit of memory
> (pages, IOVA granules, whatever), or about preserving the offset within a
> certain unit of memory, and if so to what extent (the least significant n bits
> of orig_addr dictated by the respective mask, or something different).


Seconded. I have also been struggling with the various alignment
constraints. I have even written (but not yet submitted) a patch to
calculate the combined alignment mask in swiotlb_tbl_map_single() and
pass it down to all other functions, just to make it clear what
alignment mask is used.

My understanding is that buffer alignment may be required by:

1. hardware which cannot handle an unaligned base address (presumably
because the chip performs a simple OR operation to get the addresses
of individual fields);

2. isolation of untrusted devices, where no two bounce buffers should
end up in the same iova granule;

3. allocation size; I could not find an explanation, so this might be
merely an attempt at reducing SWIOTLB internal fragmentation.

I hope other people on the Cc list can shed more light on the intended
behaviour.

Petr T

2023-11-06 07:43:02

by Christoph Hellwig

Subject: Re: Memory corruption with CONFIG_SWIOTLB_DYNAMIC=y

On Fri, Nov 03, 2023 at 09:50:53PM +0100, Petr Tesařík wrote:
> Seconded. I have also been struggling with the various alignment
> constraints. I have even written (but not yet submitted) a patch to
> calculate the combined alignment mask in swiotlb_tbl_map_single() and
> pass it down to all other functions, just to make it clear what
> alignment mask is used.

That does sound like a good idea.

> My understanding is that buffer alignment may be required by:
>
> 1. hardware which cannot handle an unaligned base address (presumably
> because the chip performs a simple OR operation to get the addresses
> of individual fields);

There are all kinds of weird encodings that discard the low bits.
For NVMe it's the PRPs (that is actually documented in the NVMe
spec, so it might be easiest to grasp), but except for a Mellanox
vendor extension this is also how all RDMA memory registrations
work.
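
That low-bit requirement is what min_align_mask expresses; the NVMe PCI
driver, for instance, declares it with something like this (from
memory, see drivers/nvme/host/pci.c):

    /* PRP entries carry only the offset within the first controller
     * page, so bounced buffers must preserve the low address bits. */
    dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);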

2023-11-06 07:45:06

by Christoph Hellwig

Subject: Re: Memory corruption with CONFIG_SWIOTLB_DYNAMIC=y

On Fri, Nov 03, 2023 at 07:59:49PM +0100, Petr Tesařík wrote:
> I don't think it's possible to improve the allocation logic without
> modifying the page allocator and/or the DMA atomic pool allocator to
> take additional constraints into account.
>
> I had a wild idea back in March, but it would require some intrusive
> changes in the mm subsystem. Among other things, it would make memory
> zones obsolete. I mean, people may actually like to get rid of DMA,
> DMA32 and NORMAL, but you see how many nasty bugs were introduced even
> by a relatively small change in SWIOTLB. Replacing memory zones with a
> system based on generic physical allocation constraints would probably
> blow up the universe. ;-)

It would be very nice, at least for DMA32 or the 30/31-bit DMA pools
used on some architectures. For the x86-style 16MB zone DMA I suspect
just having a small pool on the side that's not even exposed to the
memory allocator would probably work better.

I think a lot of the MM folks would love to be able to kill off the
extra zones.

2023-11-06 10:09:30

by Halil Pasic

Subject: Re: Memory corruption with CONFIG_SWIOTLB_DYNAMIC=y

On Fri, 3 Nov 2023 21:50:53 +0100
Petr Tesařík <[email protected]> wrote:

> >[...]
> > In our case min_align_mask == 0 and a) is thus not applicable; because of
> > b) and c) we end up with iotlb_align_mask == 0x800. And because
> > orig_addr & 0x800 == 0x800 but pool->start & 0x800 == 0, the slot at
> > index 0 is skipped over. Slot 0 is skipped over because it is page
> > aligned, when !!((1UL << PAGE_SHIFT) & orig_addr).
>
> Wait. These mask values can quickly become confusing. Do you mean
> iotlb_align_mask == 0xfff?

I mean iotlb_align_mask == 0x800, because the statement
iotlb_align_mask &= ~(IO_TLB_SIZE - 1);
at line 994
(https://github.com/torvalds/linux/blob/d2f51b3516dade79269ff45eae2a7668ae711b25/kernel/dma/swiotlb.c#L994C2-L994C41)
masks away the 0x7ff part of 0xfff (when PAGE_SHIFT == 12), and 0x800 is
what remains. My idea was to write 0x800 differently, with a reference
to PAGE_SHIFT, because for a larger PAGE_SHIFT we end up with a
different pattern and thus a different requirement, but I didn't really
think it through properly, because even (1UL << (PAGE_SHIFT - 1))
(which for PAGE_SHIFT == 12 is 0x800) does not tell the full story: all
bits in the interval [IO_TLB_SHIFT, PAGE_SHIFT) matter, not just the
most significant one (for PAGE_SHIFT == 12 and IO_TLB_SHIFT == 11 there
is just one).
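
In other words, for PAGE_SHIFT == 12 and IO_TLB_SHIFT == 11 the
arithmetic is simply:

    iotlb_align_mask = 0xfff;               /* b): ~PAGE_MASK for 4K pages */
    iotlb_align_mask &= ~(IO_TLB_SIZE - 1); /* & ~0x7ff, leaving 0x800 */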

Shame on me! Sorry for the confusion!

Regards,
Halil

2023-11-06 12:48:09

by Petr Tesarik

Subject: Re: Memory corruption with CONFIG_SWIOTLB_DYNAMIC=y

Hi Christoph,

On 11/6/2023 8:44 AM, Christoph Hellwig wrote:
> On Fri, Nov 03, 2023 at 07:59:49PM +0100, Petr Tesařík wrote:
>> I don't think it's possible to improve the allocation logic without
>> modifying the page allocator and/or the DMA atomic pool allocator to
>> take additional constraints into account.
>>
>> I had a wild idea back in March, but it would require some intrusive
>> changes in the mm subsystem. Among other things, it would make memory
>> zones obsolete. I mean, people may actually like to get rid of DMA,
>> DMA32 and NORMAL, but you see how many nasty bugs were introduced even
>> by a relatively small change in SWIOTLB. Replacing memory zones with a
>> system based on generic physical allocation constraints would probably
>> blow up the universe. ;-)
>
> It would be very nice, at least for DMA32 or the 30/31-bit DMA pools
> used on some architectures. For the x86-style 16MB zone DMA I suspect
> just having a small pool on the side that's not even exposed to the
> memory allocator would probably work better.
>
> I think a lot of the MM folks would love to be able to kill off the
> extra zones.

There's more to it. If you look at DMA buffer allocations, they need
memory which is contiguous in the DMA address space of the requesting
device, but we allocate buffers that are contiguous in the physical
address space of the CPU. This difference is in fact responsible for
some of the odd DMA address limits.

All hell breaks loose when you try to fix this properly. Instead, we get
away with the observation that physically contiguous memory regions
coincide with DMA contiguous regions on real-world systems. But if
anyone feels like starting from scratch, they could also take the extra
time to look at this part. ;-)

FWIW I'm not volunteering, or at least not this year.

Petr T

2023-11-23 10:18:59

by Petr Tesařík

Subject: Re: Memory corruption with CONFIG_SWIOTLB_DYNAMIC=y

Hello everybody,

I don't think I have ever seen an answer to my question regarding
alignment constraints on swiotlb bounce buffers:

On Wed, 8 Nov 2023 10:13:47 +0100
Petr Tesařík <[email protected]> wrote:

>[...]
> To sum it up, there are two types of alignment:
>
> 1. specified by a device's min_align_mask; this says how many low
> bits of a buffer's physical address must be preserved,
>
> 2. specified by allocation size and/or the alignment parameter;
> this says how many low bits in the first IO TLB slot's physical
> address must be zero.
>
> I hope somebody can confirm or correct this summary before I go
> and break something. You know, it's not like cleanups in SWIOTLB
> have never broken anything. ;-)

If no answer means that nobody knows, then based on my understanding of
the existing code (both implementation and users), I can assume that
this is the correct interpretation.

I'm giving it a few more days. If there's still no reaction, expect a
beautiful documentation patch and a less beautiful cleanup patch in the
next week.
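
For concreteness, this is roughly what I intend to write down (my
wording, not yet a patch):

    /*
     * Bounce buffer alignment has two independent aspects:
     *
     * 1. min_align_mask: the bits of orig_addr covered by the mask are
     *    preserved in the returned tlb_addr:
     *        tlb_addr & min_align_mask == orig_addr & min_align_mask
     *
     * 2. alloc_align_mask (and alloc_size >= PAGE_SIZE): the bits of the
     *    first IO TLB slot's physical address covered by the mask must
     *    be zero:
     *        slot_phys_addr & alloc_align_mask == 0
     */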

Petr T

2023-11-27 15:59:29

by Christoph Hellwig

Subject: Re: Memory corruption with CONFIG_SWIOTLB_DYNAMIC=y

On Thu, Nov 23, 2023 at 11:16:08AM +0100, Petr Tesařík wrote:
> > To sum it up, there are two types of alignment:
> >
> > 1. specified by a device's min_align_mask; this says how many low
> > bits of a buffer's physical address must be preserved,
> >
> > 2. specified by allocation size and/or the alignment parameter;
> > this says how many low bits in the first IO TLB slot's physical
> > address must be zero.

Both are correct.

2023-11-28 07:16:35

by Petr Tesařík

Subject: Re: Memory corruption with CONFIG_SWIOTLB_DYNAMIC=y

On Mon, 27 Nov 2023 16:59:13 +0100
Christoph Hellwig <[email protected]> wrote:

> On Thu, Nov 23, 2023 at 11:16:08AM +0100, Petr Tesařík wrote:
> > > To sum it up, there are two types of alignment:
> > >
> > > 1. specified by a device's min_align_mask; this says how many low
> > > bits of a buffer's physical address must be preserved,
> > >
> > > 2. specified by allocation size and/or the alignment parameter;
> > > this says how many low bits in the first IO TLB slot's physical
> > > address must be zero.
>
> Both are correct.

Great! Thank you for the confirmation. Unfortunately, that's not quite how
the code works now.

I'm on it to fix things.

Petr T