2024-02-21 11:35:19

by Will Deacon

Subject: [PATCH v4 0/5] Fix double allocation in swiotlb_alloc()

Hi again, folks,

This is version four of the patches which I previously posted at:

v1: https://lore.kernel.org/r/[email protected]
v2: https://lore.kernel.org/r/[email protected]
v3: https://lore.kernel.org/r/[email protected]

Thanks to Petr for his Reviewed-by tag on the first three.

Changes since v3 include:

- Use umax() instead of max() to fix a build warning if the first
patch is applied to older kernels which warn on signedness
mismatches.

- Add two new patches to the end of the series to resolve some
additional issues with NVME and 64KiB pages, reported by Nicolin.
I've added them to this series, as the first three patches make it
easier to fix this problem in the SWIOTLB code.

- Add Reviewed-by tags from Petr

Cheers,

Will

Cc: [email protected]
Cc: Christoph Hellwig <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Petr Tesarik <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Nicolin Chen <[email protected]>
Cc: Michael Kelley <[email protected]>

--->8

Nicolin Chen (1):
iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

Will Deacon (4):
swiotlb: Fix double-allocation of slots due to broken alignment
handling
swiotlb: Enforce page alignment in swiotlb_alloc()
swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()
swiotlb: Fix alignment checks when both allocation and DMA masks are
present

drivers/iommu/dma-iommu.c | 8 ++++++++
kernel/dma/swiotlb.c | 43 +++++++++++++++++++++++++++------------
2 files changed, 38 insertions(+), 13 deletions(-)

--
2.44.0.rc0.258.g7320e95886-goog



2024-02-21 11:35:33

by Will Deacon

Subject: [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling

Commit bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix"),
which was a fix for commit 0eee5ae10256 ("swiotlb: fix slot alignment
checks"), causes a functional regression with vsock in a virtual machine
using bouncing via a restricted DMA SWIOTLB pool.

When virtio allocates the virtqueues for the vsock device using
dma_alloc_coherent(), the SWIOTLB search can return page-unaligned
allocations if 'area->index' was left unaligned by a previous allocation
from the buffer:

# Final address in brackets is the SWIOTLB address returned to the caller
| virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800)
| virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800)
| virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800)

This ends badly (typically buffer corruption and/or a hang) because
swiotlb_alloc() is expecting a page-aligned allocation and so blindly
returns a pointer to the 'struct page' corresponding to the allocation,
therefore double-allocating the first half (2KiB slot) of the 4KiB page.

Fix the problem by treating the allocation alignment separately to any
additional alignment requirements from the device, using the maximum
of the two as the stride to search the buffer slots and taking care
to ensure a minimum of page-alignment for buffers larger than a page.
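
As an illustration only (not part of the patch), the first slot range in
the log above already shows the failure mode, since swiotlb_alloc()
converts the returned address straight into a 'struct page'. A minimal
sketch using the logged values:

/* Sketch, not kernel code: IO_TLB_SHIFT is 11, so slots are 2KiB. */
unsigned long tlb_addr = 0x98326800UL;                /* start of slots 1645-1649 */
unsigned long page_pa  = tlb_addr & ~(PAGE_SIZE - 1); /* 0x98326000 */

/*
 * swiotlb_alloc() effectively returns pfn_to_page(PFN_DOWN(tlb_addr)),
 * i.e. the page at 0x98326000, but the 2KiB slot covering
 * 0x98326000-0x98326800 was never reserved for this caller and can be
 * handed out again by a later allocation.
 */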

Fixes: bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix")
Fixes: 0eee5ae10256 ("swiotlb: fix slot alignment checks")
Cc: Christoph Hellwig <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Dexuan Cui <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
kernel/dma/swiotlb.c | 28 +++++++++++++++-------------
1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index b079a9a8e087..2ec2cc81f1a2 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -982,7 +982,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
unsigned long max_slots = get_max_slots(boundary_mask);
unsigned int iotlb_align_mask =
- dma_get_min_align_mask(dev) | alloc_align_mask;
+ dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
unsigned int nslots = nr_slots(alloc_size), stride;
unsigned int offset = swiotlb_align_offset(dev, orig_addr);
unsigned int index, slots_checked, count = 0, i;
@@ -993,19 +993,18 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
BUG_ON(!nslots);
BUG_ON(area_index >= pool->nareas);

+ /*
+ * For mappings with an alignment requirement don't bother looping to
+ * unaligned slots once we found an aligned one.
+ */
+ stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
+
/*
* For allocations of PAGE_SIZE or larger only look for page aligned
* allocations.
*/
if (alloc_size >= PAGE_SIZE)
- iotlb_align_mask |= ~PAGE_MASK;
- iotlb_align_mask &= ~(IO_TLB_SIZE - 1);
-
- /*
- * For mappings with an alignment requirement don't bother looping to
- * unaligned slots once we found an aligned one.
- */
- stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1;
+ stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);

spin_lock_irqsave(&area->lock, flags);
if (unlikely(nslots > pool->area_nslabs - area->used))
@@ -1015,11 +1014,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
index = area->index;

for (slots_checked = 0; slots_checked < pool->area_nslabs; ) {
- slot_index = slot_base + index;
+ phys_addr_t tlb_addr;

- if (orig_addr &&
- (slot_addr(tbl_dma_addr, slot_index) &
- iotlb_align_mask) != (orig_addr & iotlb_align_mask)) {
+ slot_index = slot_base + index;
+ tlb_addr = slot_addr(tbl_dma_addr, slot_index);
+
+ if ((tlb_addr & alloc_align_mask) ||
+ (orig_addr && (tlb_addr & iotlb_align_mask) !=
+ (orig_addr & iotlb_align_mask))) {
index = wrap_area_index(pool, index + 1);
slots_checked++;
continue;
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:35:46

by Will Deacon

Subject: [PATCH v4 2/5] swiotlb: Enforce page alignment in swiotlb_alloc()

When allocating pages from a restricted DMA pool in swiotlb_alloc(),
the buffer address is blindly converted to a 'struct page *' that is
returned to the caller. In the unlikely event of an allocation bug,
page-unaligned addresses are not detected and slots can silently be
double-allocated.

Add a simple check of the buffer alignment in swiotlb_alloc() to make
debugging a little easier if something has gone wonky.

Cc: Christoph Hellwig <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Dexuan Cui <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
kernel/dma/swiotlb.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 2ec2cc81f1a2..ab7fbb40bc55 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -1643,6 +1643,12 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
return NULL;

tlb_addr = slot_addr(pool->start, index);
+ if (unlikely(!PAGE_ALIGNED(tlb_addr))) {
+ dev_WARN_ONCE(dev, 1, "Cannot allocate pages from non page-aligned swiotlb addr 0x%pa.\n",
+ &tlb_addr);
+ swiotlb_release_slots(dev, tlb_addr);
+ return NULL;
+ }

return pfn_to_page(PFN_DOWN(tlb_addr));
}
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:36:11

by Will Deacon

Subject: [PATCH v4 4/5] swiotlb: Fix alignment checks when both allocation and DMA masks are present

Nicolin reports that swiotlb buffer allocations fail for an NVME device
behind an IOMMU using 64KiB pages. This is because we end up with a
minimum allocation alignment of 64KiB (for the IOMMU to map the buffer
safely) but a minimum DMA alignment mask corresponding to a 4KiB NVME
page (i.e. preserving the 4KiB page offset from the original allocation).
If the original address is not 4KiB-aligned, the allocation will fail
because swiotlb_search_pool_area() erroneously compares these unmasked
bits with the 64KiB-aligned candidate allocation.

Tweak swiotlb_search_pool_area() so that the DMA alignment mask is
reduced based on the required alignment of the allocation.
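
As an illustration only (not part of the patch), the mask arithmetic for
the reported configuration is roughly:

/* Sketch, not kernel code: values from Nicolin's report. */
unsigned int alloc_align_mask = 0xFFFF;  /* 64KiB IOMMU granule - 1 */
unsigned int iotlb_align_mask = 0xFFF;   /* NVME min_align_mask (4KiB page) */

/*
 * Previously only the slot-offset bits were stripped, leaving
 * iotlb_align_mask == 0x800. Any candidate tlb_addr satisfying
 * 'tlb_addr & 0xFFFF == 0' has bit 0x800 clear, so an orig_addr with
 * bit 0x800 set could never match and the search always failed.
 */
alloc_align_mask |= (IO_TLB_SIZE - 1);   /* still 0xFFFF */
iotlb_align_mask &= ~alloc_align_mask;   /* 0x0: redundant bits ignored */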

Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
Reported-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
---
kernel/dma/swiotlb.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c20324fba814..c381a7ed718f 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -981,8 +981,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
dma_addr_t tbl_dma_addr =
phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
unsigned long max_slots = get_max_slots(boundary_mask);
- unsigned int iotlb_align_mask =
- dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
+ unsigned int iotlb_align_mask = dma_get_min_align_mask(dev);
unsigned int nslots = nr_slots(alloc_size), stride;
unsigned int offset = swiotlb_align_offset(dev, orig_addr);
unsigned int index, slots_checked, count = 0, i;
@@ -993,6 +992,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
BUG_ON(!nslots);
BUG_ON(area_index >= pool->nareas);

+ /*
+ * Ensure that the allocation is at least slot-aligned and update
+ * 'iotlb_align_mask' to ignore bits that will be preserved when
+ * offsetting into the allocation.
+ */
+ alloc_align_mask |= (IO_TLB_SIZE - 1);
+ iotlb_align_mask &= ~alloc_align_mask;
+
/*
* For mappings with an alignment requirement don't bother looping to
* unaligned slots once we found an aligned one.
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:36:14

by Will Deacon

Subject: [PATCH v4 3/5] swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()

core-api/dma-api-howto.rst states the following properties of
dma_alloc_coherent():

| The CPU virtual address and the DMA address are both guaranteed to
| be aligned to the smallest PAGE_SIZE order which is greater than or
| equal to the requested size.

However, swiotlb_alloc() passes zero for the 'alloc_align_mask'
parameter of swiotlb_find_slots() and so this property is not upheld.
Instead, allocations larger than a page are aligned only to PAGE_SIZE.

Calculate the mask corresponding to the page order suitable for holding
the allocation and pass that to swiotlb_find_slots().
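
As an illustration only (not part of the patch), with a 4KiB PAGE_SIZE
the computed mask behaves as follows:

/* Sketch, not kernel code: the 'align' value from the hunk below. */
size_t size = 12288;                                    /* 12KiB request */
unsigned int align = (1 << (get_order(size) + PAGE_SHIFT)) - 1;

/*
 * get_order(12288) is 2, so align == 0x3fff and swiotlb_find_slots() is
 * asked for a 16KiB-aligned buffer: the smallest PAGE_SIZE order greater
 * than or equal to the requested size, as the documentation requires.
 */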

Cc: Christoph Hellwig <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Dexuan Cui <[email protected]>
Fixes: e81e99bacc9f ("swiotlb: Support aligned swiotlb buffers")
Reviewed-by: Petr Tesarik <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
kernel/dma/swiotlb.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index ab7fbb40bc55..c20324fba814 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -1633,12 +1633,14 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
struct io_tlb_pool *pool;
phys_addr_t tlb_addr;
+ unsigned int align;
int index;

if (!mem)
return NULL;

- index = swiotlb_find_slots(dev, 0, size, 0, &pool);
+ align = (1 << (get_order(size) + PAGE_SHIFT)) - 1;
+ index = swiotlb_find_slots(dev, 0, size, align, &pool);
if (index == -1)
return NULL;

--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:36:24

by Will Deacon

Subject: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

From: Nicolin Chen <[email protected]>

The swiotlb does not support a mapping size > swiotlb_max_mapping_size().
However, with a 64KB PAGE_SIZE configuration, an NVME device is observed
to request mappings of 300KB~512KB, which exceed that limit and so fail
in swiotlb even though the default swiotlb pool has plenty of free slots:
systemd[1]: Started Journal Service.
=> nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
note: journal-offline[392] exited with irqs disabled
note: journal-offline[392] exited with preempt_count 1

Call trace:
[ 3.099918] swiotlb_tbl_map_single+0x214/0x240
[ 3.099921] iommu_dma_map_page+0x218/0x328
[ 3.099928] dma_map_page_attrs+0x2e8/0x3a0
[ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme]
[ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme]
[ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600
[ 3.102321] blk_add_rq_to_plug+0x180/0x2a0
[ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8
[ 3.103463] __submit_bio+0x44/0x220
[ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360
[ 3.103470] submit_bio_noacct+0x180/0x6c8
[ 3.103471] submit_bio+0x34/0x130
[ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8
[ 3.104766] mpage_submit_folio+0xa0/0x100
[ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400
[ 3.104771] ext4_do_writepages+0x6a0/0xd78
[ 3.105615] ext4_writepages+0x80/0x118
[ 3.105616] do_writepages+0x90/0x1e8
[ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0
[ 3.105622] __filemap_fdatawrite_range+0x68/0xb8
[ 3.106656] file_write_and_wait_range+0x84/0x120
[ 3.106658] ext4_sync_file+0x7c/0x4c0
[ 3.106660] vfs_fsync_range+0x3c/0xa8
[ 3.106663] do_fsync+0x44/0xc0

Since untrusted devices might go down the swiotlb pathway with dma-iommu,
these devices should not map a size larger than swiotlb_max_mapping_size.

To fix this bug, add iommu_dma_max_mapping_size() so that untrusted
devices are capped at swiotlb_max_mapping_size() rather than only the
iova_rcache_range() reported by iommu_dma_opt_mapping_size().
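
A rough sketch of the effect (not part of the patch): the DMA core's
dma_max_mapping_size() consults the new ->max_mapping_size callback, so
drivers that size their requests from it, as the NVME driver does, now
stay within the swiotlb limit on untrusted devices:

size_t limit = dma_max_mapping_size(dev);

/*
 * Trusted device, or swiotlb inactive:   limit == SIZE_MAX (unchanged).
 * Untrusted device bouncing via swiotlb: limit == swiotlb_max_mapping_size(dev),
 * so oversized requests such as the 327680-byte mapping in the log above
 * are no longer submitted.
 */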

Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
Signed-off-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/ee51a3a5c32cf885b18f6416171802669f4a718a.1707851466.git.nicolinc@nvidia.com
Signed-off-by: Will Deacon <[email protected]>
---
drivers/iommu/dma-iommu.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 50ccc4f1ef81..7d1a20da6d94 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1706,6 +1706,13 @@ static size_t iommu_dma_opt_mapping_size(void)
return iova_rcache_range();
}

+static size_t iommu_dma_max_mapping_size(struct device *dev)
+{
+ if (is_swiotlb_active(dev) && dev_is_untrusted(dev))
+ return swiotlb_max_mapping_size(dev);
+ return SIZE_MAX;
+}
+
static const struct dma_map_ops iommu_dma_ops = {
.flags = DMA_F_PCI_P2PDMA_SUPPORTED,
.alloc = iommu_dma_alloc,
@@ -1728,6 +1735,7 @@ static const struct dma_map_ops iommu_dma_ops = {
.unmap_resource = iommu_dma_unmap_resource,
.get_merge_boundary = iommu_dma_get_merge_boundary,
.opt_mapping_size = iommu_dma_opt_mapping_size,
+ .max_mapping_size = iommu_dma_max_mapping_size,
};

/*
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 23:35:56

by Michael Kelley

Subject: RE: [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling

From: Will Deacon <[email protected]> Sent: Wednesday, February 21, 2024 3:35 AM
>
> Commit bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix"),
> which was a fix for commit 0eee5ae10256 ("swiotlb: fix slot alignment
> checks"), causes a functional regression with vsock in a virtual machine
> using bouncing via a restricted DMA SWIOTLB pool.
>
> When virtio allocates the virtqueues for the vsock device using
> dma_alloc_coherent(), the SWIOTLB search can return page-unaligned
> allocations if 'area->index' was left unaligned by a previous allocation
> from the buffer:
>
> # Final address in brackets is the SWIOTLB address returned to the caller
> | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask
> 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800)
> | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask
> 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800)
> | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask
> 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800)
>
> This ends badly (typically buffer corruption and/or a hang) because
> swiotlb_alloc() is expecting a page-aligned allocation and so blindly
> returns a pointer to the 'struct page' corresponding to the allocation,
> therefore double-allocating the first half (2KiB slot) of the 4KiB page.
>
> Fix the problem by treating the allocation alignment separately to any
> additional alignment requirements from the device, using the maximum
> of the two as the stride to search the buffer slots and taking care
> to ensure a minimum of page-alignment for buffers larger than a page.

Could you also add some text that this patch fixes the scenario I
described in the other email thread? Something like:

The changes to page alignment handling also fix a problem when
the alloc_align_mask is zero. The page alignment handling added
in the two mentioned commits could force alignment to more bits
in orig_addr than specified by the device's DMA min_align_mask,
resulting in a larger offset. Since swiotlb_max_mapping_size()
is based only on the DMA min_align_mask, that larger offset
plus the requested size could exceed IO_TLB_SEGSIZE slots, and
the mapping could fail when it shouldn't.

>
> Fixes: bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix")
> Fixes: 0eee5ae10256 ("swiotlb: fix slot alignment checks")
> Cc: Christoph Hellwig <[email protected]>
> Cc: Marek Szyprowski <[email protected]>
> Cc: Robin Murphy <[email protected]>
> Cc: Dexuan Cui <[email protected]>
> Reviewed-by: Petr Tesarik <[email protected]>
> Signed-off-by: Will Deacon <[email protected]>
> ---
> kernel/dma/swiotlb.c | 28 +++++++++++++++-------------
> 1 file changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index b079a9a8e087..2ec2cc81f1a2 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -982,7 +982,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
> unsigned long max_slots = get_max_slots(boundary_mask);
> unsigned int iotlb_align_mask =
> - dma_get_min_align_mask(dev) | alloc_align_mask;
> + dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
> unsigned int nslots = nr_slots(alloc_size), stride;
> unsigned int offset = swiotlb_align_offset(dev, orig_addr);
> unsigned int index, slots_checked, count = 0, i;
> @@ -993,19 +993,18 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> BUG_ON(!nslots);
> BUG_ON(area_index >= pool->nareas);
>
> + /*
> + * For mappings with an alignment requirement don't bother looping to
> + * unaligned slots once we found an aligned one.
> + */
> + stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
> +
> /*
> * For allocations of PAGE_SIZE or larger only look for page aligned
> * allocations.
> */
> if (alloc_size >= PAGE_SIZE)
> - iotlb_align_mask |= ~PAGE_MASK;
> - iotlb_align_mask &= ~(IO_TLB_SIZE - 1);
> -
> - /*
> - * For mappings with an alignment requirement don't bother looping to
> - * unaligned slots once we found an aligned one.
> - */
> - stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1;
> + stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);

Is this special handling of alloc_size >= PAGE_SIZE really needed?
I think the comment is somewhat inaccurate. If orig_addr is non-zero, and
alloc_align_mask is zero, the requirement is for the alignment to match
the DMA min_align_mask bits in orig_addr, even if the allocation is
larger than a page. And with Patch 3 of this series, the swiotlb_alloc()
case passes in alloc_align_mask to handle page size and larger requests.
So it seems like this doesn't do anything useful unless orig_addr and
alloc_align_mask are both zero, and there aren't any cases of that
after this patch series. If the caller wants alignment, specify
it with alloc_align_mask.

>
> spin_lock_irqsave(&area->lock, flags);
> if (unlikely(nslots > pool->area_nslabs - area->used))
> @@ -1015,11 +1014,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> index = area->index;
>
> for (slots_checked = 0; slots_checked < pool->area_nslabs; ) {
> - slot_index = slot_base + index;
> + phys_addr_t tlb_addr;
>
> - if (orig_addr &&
> - (slot_addr(tbl_dma_addr, slot_index) &
> - iotlb_align_mask) != (orig_addr & iotlb_align_mask)) {
> + slot_index = slot_base + index;
> + tlb_addr = slot_addr(tbl_dma_addr, slot_index);
> +
> + if ((tlb_addr & alloc_align_mask) ||
> + (orig_addr && (tlb_addr & iotlb_align_mask) !=
> + (orig_addr & iotlb_align_mask))) {

It looks like these changes will cause a mapping failure in some
iommu_dma_map_page() cases that previously didn't fail.
Everything is made right by Patch 4 of your series, but from a
bisect standpoint, there will be a gap where things are worse.
In [1], I think Nicolin reported a crash with just this patch applied.

While the iommu_dma_map_page() case can already fail due to
"too large" requests because of not setting a max mapping size,
this patch can cause smaller requests to fail as well until Patch 4
gets applied. That might be a problem to avoid, perhaps by
merging the Patch 4 changes into this patch.

Michael

[1] https://lore.kernel.org/linux-iommu/SN6PR02MB415727E61B5295C259CCB268D4512@SN6PR02MB4157.namprd02.prod.outlook.com/T/#m0ec36324b17947adefc18b3ac715e1952150f89d

> index = wrap_area_index(pool, index + 1);
> slots_checked++;
> continue;
> --
> 2.44.0.rc0.258.g7320e95886-goog


2024-02-21 23:36:16

by Michael Kelley

Subject: RE: [PATCH v4 2/5] swiotlb: Enforce page alignment in swiotlb_alloc()

From: Will Deacon <[email protected]> Sent: Wednesday, February 21, 2024 3:35 AM
>
> When allocating pages from a restricted DMA pool in swiotlb_alloc(),
> the buffer address is blindly converted to a 'struct page *' that is
> returned to the caller. In the unlikely event of an allocation bug,
> page-unaligned addresses are not detected and slots can silently be
> double-allocated.
>
> Add a simple check of the buffer alignment in swiotlb_alloc() to make
> debugging a little easier if something has gone wonky.
>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Marek Szyprowski <[email protected]>
> Cc: Robin Murphy <[email protected]>
> Cc: Dexuan Cui <[email protected]>
> Reviewed-by: Petr Tesarik <[email protected]>
> Signed-off-by: Will Deacon <[email protected]>
> ---
> kernel/dma/swiotlb.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 2ec2cc81f1a2..ab7fbb40bc55 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -1643,6 +1643,12 @@ struct page *swiotlb_alloc(struct device *dev,
> size_t size)
> return NULL;
>
> tlb_addr = slot_addr(pool->start, index);
> + if (unlikely(!PAGE_ALIGNED(tlb_addr))) {
> + dev_WARN_ONCE(dev, 1, "Cannot allocate pages from non page-aligned swiotlb addr 0x%pa.\n",
> + &tlb_addr);
> + swiotlb_release_slots(dev, tlb_addr);
> + return NULL;
> + }
>
> return pfn_to_page(PFN_DOWN(tlb_addr));
> }
> --
> 2.44.0.rc0.258.g7320e95886-goog

Reviewed-by: Michael Kelley <[email protected]>


2024-02-21 23:36:51

by Michael Kelley

Subject: RE: [PATCH v4 3/5] swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()

From: Will Deacon <[email protected]> Sent: Wednesday, February 21, 2024 3:35 AM
>
> core-api/dma-api-howto.rst states the following properties of
> dma_alloc_coherent():
>
> | The CPU virtual address and the DMA address are both guaranteed to
> | be aligned to the smallest PAGE_SIZE order which is greater than or
> | equal to the requested size.
>
> However, swiotlb_alloc() passes zero for the 'alloc_align_mask'
> parameter of swiotlb_find_slots() and so this property is not upheld.
> Instead, allocations larger than a page are aligned to PAGE_SIZE,
>
> Calculate the mask corresponding to the page order suitable for holding
> the allocation and pass that to swiotlb_find_slots().
>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Marek Szyprowski <[email protected]>
> Cc: Robin Murphy <[email protected]>
> Cc: Dexuan Cui <[email protected]>
> Fixes: e81e99bacc9f ("swiotlb: Support aligned swiotlb buffers")
> Reviewed-by: Petr Tesarik <[email protected]>
> Signed-off-by: Will Deacon <[email protected]>
> ---
> kernel/dma/swiotlb.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index ab7fbb40bc55..c20324fba814 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -1633,12 +1633,14 @@ struct page *swiotlb_alloc(struct device *dev,
> size_t size)
> struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> struct io_tlb_pool *pool;
> phys_addr_t tlb_addr;
> + unsigned int align;
> int index;
>
> if (!mem)
> return NULL;
>
> - index = swiotlb_find_slots(dev, 0, size, 0, &pool);
> + align = (1 << (get_order(size) + PAGE_SHIFT)) - 1;
> + index = swiotlb_find_slots(dev, 0, size, align, &pool);
> if (index == -1)
> return NULL;
>
> --
> 2.44.0.rc0.258.g7320e95886-goog

Reviewed-by: Michael Kelley <[email protected]>


2024-02-21 23:37:16

by Michael Kelley

Subject: RE: [PATCH v4 4/5] swiotlb: Fix alignment checks when both allocation and DMA masks are present

From: Will Deacon <[email protected]> Sent: Wednesday, February 21, 2024 3:35 AM
>
> Nicolin reports that swiotlb buffer allocations fail for an NVME device
> behind an IOMMU using 64KiB pages. This is because we end up with a
> minimum allocation alignment of 64KiB (for the IOMMU to map the buffer
> safely) but a minimum DMA alignment mask corresponding to a 4KiB NVME
> page (i.e. preserving the 4KiB page offset from the original allocation).
> If the original address is not 4KiB-aligned, the allocation will fail
> because swiotlb_search_pool_area() erroneously compares these unmasked
> bits with the 64KiB-aligned candidate allocation.
>
> Tweak swiotlb_search_pool_area() so that the DMA alignment mask is
> reduced based on the required alignment of the allocation.
>
> Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
> Reported-by: Nicolin Chen <[email protected]>
> Link: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Will Deacon <[email protected]>
> ---
> kernel/dma/swiotlb.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index c20324fba814..c381a7ed718f 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -981,8 +981,7 @@ static int swiotlb_search_pool_area(struct device
> *dev, struct io_tlb_pool *pool
> dma_addr_t tbl_dma_addr =
> phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
> unsigned long max_slots = get_max_slots(boundary_mask);
> - unsigned int iotlb_align_mask =
> - dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
> + unsigned int iotlb_align_mask = dma_get_min_align_mask(dev);
> unsigned int nslots = nr_slots(alloc_size), stride;
> unsigned int offset = swiotlb_align_offset(dev, orig_addr);
> unsigned int index, slots_checked, count = 0, i;
> @@ -993,6 +992,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> BUG_ON(!nslots);
> BUG_ON(area_index >= pool->nareas);
>
> + /*
> + * Ensure that the allocation is at least slot-aligned and update
> + * 'iotlb_align_mask' to ignore bits that will be preserved when
> + * offsetting into the allocation.
> + */
> + alloc_align_mask |= (IO_TLB_SIZE - 1);
> + iotlb_align_mask &= ~alloc_align_mask;
> +
> /*
> * For mappings with an alignment requirement don't bother looping to
> * unaligned slots once we found an aligned one.
> --
> 2.44.0.rc0.258.g7320e95886-goog

Reviewed-by: Michael Kelley <[email protected]>

But see my comments in Patch 1 of the series about whether this
should be folded into Patch 1.


2024-02-21 23:41:03

by Michael Kelley

Subject: RE: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

From: Will Deacon <[email protected]> Sent: Wednesday, February 21, 2024 3:35 AM
>
> The swiotlb does not support a mapping size > swiotlb_max_mapping_size().
> On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that
> an NVME device can map a size between 300KB~512KB, which certainly failed
> the swiotlb mappings, though the default pool of swiotlb has many slots:
> systemd[1]: Started Journal Service.
> => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768
> (slots), used 32 (slots)
> note: journal-offline[392] exited with irqs disabled
> note: journal-offline[392] exited with preempt_count 1
>
> Call trace:
> [ 3.099918] swiotlb_tbl_map_single+0x214/0x240
> [ 3.099921] iommu_dma_map_page+0x218/0x328
> [ 3.099928] dma_map_page_attrs+0x2e8/0x3a0
> [ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme]
> [ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme]
> [ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600
> [ 3.102321] blk_add_rq_to_plug+0x180/0x2a0
> [ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8
> [ 3.103463] __submit_bio+0x44/0x220
> [ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360
> [ 3.103470] submit_bio_noacct+0x180/0x6c8
> [ 3.103471] submit_bio+0x34/0x130
> [ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8
> [ 3.104766] mpage_submit_folio+0xa0/0x100
> [ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400
> [ 3.104771] ext4_do_writepages+0x6a0/0xd78
> [ 3.105615] ext4_writepages+0x80/0x118
> [ 3.105616] do_writepages+0x90/0x1e8
> [ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0
> [ 3.105622] __filemap_fdatawrite_range+0x68/0xb8
> [ 3.106656] file_write_and_wait_range+0x84/0x120
> [ 3.106658] ext4_sync_file+0x7c/0x4c0
> [ 3.106660] vfs_fsync_range+0x3c/0xa8
> [ 3.106663] do_fsync+0x44/0xc0
>
> Since untrusted devices might go down the swiotlb pathway with dma-iommu,
> these devices should not map a size larger than swiotlb_max_mapping_size.
>
> To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to
> take into account swiotlb_max_mapping_size() v.s. iova_rcache_range() from
> the iommu_dma_opt_mapping_size().
>
> Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
> Signed-off-by: Nicolin Chen <[email protected]>
> Link: https://lore.kernel.org/all/ee51a3a5c32cf885b18f6416171802669f4a718a.1707851466.git.nicolinc@nvidia.com/
> Signed-off-by: Will Deacon <[email protected]>
> ---
> drivers/iommu/dma-iommu.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 50ccc4f1ef81..7d1a20da6d94 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1706,6 +1706,13 @@ static size_t
> iommu_dma_opt_mapping_size(void)
> return iova_rcache_range();
> }
>
> +static size_t iommu_dma_max_mapping_size(struct device *dev)
> +{
> + if (is_swiotlb_active(dev) && dev_is_untrusted(dev))
> + return swiotlb_max_mapping_size(dev);
> + return SIZE_MAX;
> +}
> +

In this [1] email, Nicolin had a version of this function that incorporated
the IOMMU granule. For granules bigger than 4K, I think that's needed
so that when IOMMU code sets the swiotlb alloc_align_mask to the
IOMMU granule - 1, the larger offset plus the size won't exceed the
max number of slots. swiotlb_max_mapping_size() by itself may
return a value that's too big when alloc_align_mask is used.

Michael

[1] https://lore.kernel.org/linux-iommu/SN6PR02MB415727E61B5295C259CCB268D4512@SN6PR02MB4157.namprd02.prod.outlook.com/T/#m14dd359c5a4dd13e6cb0c35cf94f8d746257ae48


> static const struct dma_map_ops iommu_dma_ops = {
> .flags = DMA_F_PCI_P2PDMA_SUPPORTED,
> .alloc = iommu_dma_alloc,
> @@ -1728,6 +1735,7 @@ static const struct dma_map_ops iommu_dma_ops = {
> .unmap_resource = iommu_dma_unmap_resource,
> .get_merge_boundary = iommu_dma_get_merge_boundary,
> .opt_mapping_size = iommu_dma_opt_mapping_size,
> + .max_mapping_size = iommu_dma_max_mapping_size,
> };
>
> /*
> --
> 2.44.0.rc0.258.g7320e95886-goog


2024-02-23 11:35:29

by Nicolin Chen

Subject: Re: [PATCH v4 0/5] Fix double allocation in swiotlb_alloc()

On Wed, Feb 21, 2024 at 11:34:59AM +0000, Will Deacon wrote:
> Hi again, folks,
>
> This is version four of the patches which I previously posted at:
>
> v1: https://lore.kernel.org/r/[email protected]
> v2: https://lore.kernel.org/r/[email protected]
> v3: https://lore.kernel.org/r/[email protected]
>
> Thanks to Petr for his Reviewed-by tag on the first three.
>
> Changes since v3 include:
>
> - Use umax() instead of max() to fix a build warning if the first
> patch is applied to older kernels which warn on signedness
> mismatches.
>
> - Add two new patches to the end of the series to resolve some
> additional issues with NVME and 64KiB pages, reported by Nicolin.
> I've added them to this series, as the first three patches make it
> easier to fix this problem in the SWIOTLB code.
>
> - Add Reviewed-by tags from Petr
>
> Cheers,
>
> Will
>
> Cc: [email protected]
> Cc: Christoph Hellwig <[email protected]>
> Cc: Marek Szyprowski <[email protected]>
> Cc: Robin Murphy <[email protected]>
> Cc: Petr Tesarik <[email protected]>
> Cc: Dexuan Cui <[email protected]>
> Cc: Nicolin Chen <[email protected]>

This fixes the bug with NVME on arm64/SMMU when PAGE_SIZE=64KiB.

Tested-by: Nicolin Chen <[email protected]>

Thanks!

2024-02-23 12:26:05

by Will Deacon

Subject: Re: [PATCH v4 0/5] Fix double allocation in swiotlb_alloc()

On Fri, Feb 23, 2024 at 03:34:56AM -0800, Nicolin Chen wrote:
> On Wed, Feb 21, 2024 at 11:34:59AM +0000, Will Deacon wrote:
> > This is version four of the patches which I previously posted at:
> >
> > v1: https://lore.kernel.org/r/[email protected]
> > v2: https://lore.kernel.org/r/[email protected]
> > v3: https://lore.kernel.org/r/[email protected]
> >
> > Thanks to Petr for his Reviewed-by tag on the first three.
> >
> > Changes since v3 include:
> >
> > - Use umax() instead of max() to fix a build warning if the first
> > patch is applied to older kernels which warn on signedness
> > mismatches.
> >
> > - Add two new patches to the end of the series to resolve some
> > additional issues with NVME and 64KiB pages, reported by Nicolin.
> > I've added them to this series, as the first three patches make it
> > easier to fix this problem in the SWIOTLB code.
> >
> > - Add Reviewed-by tags from Petr
> >
> > Cheers,
> >
> > Will
> >
> > Cc: [email protected]
> > Cc: Christoph Hellwig <[email protected]>
> > Cc: Marek Szyprowski <[email protected]>
> > Cc: Robin Murphy <[email protected]>
> > Cc: Petr Tesarik <[email protected]>
> > Cc: Dexuan Cui <[email protected]>
> > Cc: Nicolin Chen <[email protected]>
>
> This fixes the bug with NVME on arm64/SMMU when PAGE_SIZE=64KiB.
>
> Tested-by: Nicolin Chen <[email protected]>

Thanks, Nicolin! Please can you also respond to Michael's observation on
your patch (5/5)? I didn't think we needed anything extra there, but since
it's your patch I'd prefer to hear your opinion.

https://lore.kernel.org/lkml/SN6PR02MB4157828120FB7D3408CEC991D4572@SN6PR02MB4157.namprd02.prod.outlook.com/

Cheers,

Will

2024-02-23 12:47:54

by Will Deacon

Subject: Re: [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling

On Wed, Feb 21, 2024 at 11:35:44PM +0000, Michael Kelley wrote:
> From: Will Deacon <[email protected]> Sent: Wednesday, February 21, 2024 3:35 AM
> >
> > Commit bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix"),
> > which was a fix for commit 0eee5ae10256 ("swiotlb: fix slot alignment
> > checks"), causes a functional regression with vsock in a virtual machine
> > using bouncing via a restricted DMA SWIOTLB pool.
> >
> > When virtio allocates the virtqueues for the vsock device using
> > dma_alloc_coherent(), the SWIOTLB search can return page-unaligned
> > allocations if 'area->index' was left unaligned by a previous allocation
> > from the buffer:
> >
> > # Final address in brackets is the SWIOTLB address returned to the caller
> > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask
> > 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800)
> > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask
> > 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800)
> > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask
> > 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800)
> >
> > This ends badly (typically buffer corruption and/or a hang) because
> > swiotlb_alloc() is expecting a page-aligned allocation and so blindly
> > returns a pointer to the 'struct page' corresponding to the allocation,
> > therefore double-allocating the first half (2KiB slot) of the 4KiB page.
> >
> > Fix the problem by treating the allocation alignment separately to any
> > additional alignment requirements from the device, using the maximum
> > of the two as the stride to search the buffer slots and taking care
> > to ensure a minimum of page-alignment for buffers larger than a page.
>
> Could you also add some text that this patch fixes the scenario I
> described in the other email thread? Something like:
>
> The changes to page alignment handling also fix a problem when
> the alloc_align_mask is zero. The page alignment handling added
> in the two mentioned commits could force alignment to more bits
> in orig_addr than specified by the device's DMA min_align_mask,
> resulting in a larger offset. Since swiotlb_max_mapping_size()
> is based only on the DMA min_align_mask, that larger offset
> plus the requested size could exceed IO_TLB_SEGSIZE slots, and
> the mapping could fail when it shouldn't.

Thanks, Michael. I can add that in.

> > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > index b079a9a8e087..2ec2cc81f1a2 100644
> > --- a/kernel/dma/swiotlb.c
> > +++ b/kernel/dma/swiotlb.c
> > @@ -982,7 +982,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
> > unsigned long max_slots = get_max_slots(boundary_mask);
> > unsigned int iotlb_align_mask =
> > - dma_get_min_align_mask(dev) | alloc_align_mask;
> > + dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
> > unsigned int nslots = nr_slots(alloc_size), stride;
> > unsigned int offset = swiotlb_align_offset(dev, orig_addr);
> > unsigned int index, slots_checked, count = 0, i;
> > @@ -993,19 +993,18 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > BUG_ON(!nslots);
> > BUG_ON(area_index >= pool->nareas);
> >
> > + /*
> > + * For mappings with an alignment requirement don't bother looping to
> > + * unaligned slots once we found an aligned one.
> > + */
> > + stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
> > +
> > /*
> > * For allocations of PAGE_SIZE or larger only look for page aligned
> > * allocations.
> > */
> > if (alloc_size >= PAGE_SIZE)
> > - iotlb_align_mask |= ~PAGE_MASK;
> > - iotlb_align_mask &= ~(IO_TLB_SIZE - 1);
> > -
> > - /*
> > - * For mappings with an alignment requirement don't bother looping to
> > - * unaligned slots once we found an aligned one.
> > - */
> > - stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1;
> > + stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
>
> Is this special handling of alloc_size >= PAGE_SIZE really needed?

I've been wondering that as well, but please note that this code (and the
comment) are in the upstream code, so I was erring in favour of keeping
that while fixing the bugs. We could have an extra patch dropping it if
we can convince ourselves that it's not adding anything, though.

> I think the comment is somewhat inaccurate. If orig_addr is non-zero, and
> alloc_align_mask is zero, the requirement is for the alignment to match
> the DMA min_align_mask bits in orig_addr, even if the allocation is
> larger than a page. And with Patch 3 of this series, the swiotlb_alloc()
> case passes in alloc_align_mask to handle page size and larger requests.
> So it seems like this doesn't do anything useful unless orig_addr and
> alloc_align_mask are both zero, and there aren't any cases of that
> after this patch series. If the caller wants alignment, specify
> it with alloc_align_mask.

It's an interesting observation. Presumably the intention here is to
reduce the cost of the linear search, but the code originates from a
time when we didn't have iotlb_align_mask or alloc_align_mask and so I
tend to agree that it should probably just be dropped. I'm also not even
convinced that it works properly if the initial search index ends up
being 2KiB (i.e. slot) aligned -- we'll end up jumping over the
page-aligned addresses!

I'll add another patch to v5 which removes this check (and you've basically
written the commit message for me, so thanks).

> > spin_lock_irqsave(&area->lock, flags);
> > if (unlikely(nslots > pool->area_nslabs - area->used))
> > @@ -1015,11 +1014,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > index = area->index;
> >
> > for (slots_checked = 0; slots_checked < pool->area_nslabs; ) {
> > - slot_index = slot_base + index;
> > + phys_addr_t tlb_addr;
> >
> > - if (orig_addr &&
> > - (slot_addr(tbl_dma_addr, slot_index) &
> > - iotlb_align_mask) != (orig_addr & iotlb_align_mask)) {
> > + slot_index = slot_base + index;
> > + tlb_addr = slot_addr(tbl_dma_addr, slot_index);
> > +
> > + if ((tlb_addr & alloc_align_mask) ||
> > + (orig_addr && (tlb_addr & iotlb_align_mask) !=
> > + (orig_addr & iotlb_align_mask))) {
>
> It looks like these changes will cause a mapping failure in some
> iommu_dma_map_page() cases that previously didn't fail.

Hmm, it's really hard to tell. This code has been quite badly broken for
some time, so I'm not sure how far back you have to go to find a kernel
that would work properly (e.g. for Nicolin's case with 64KiB pages).

> Everything is made right by Patch 4 of your series, but from a
> bisect standpoint, there will be a gap where things are worse.
> In [1], I think Nicolin reported a crash with just this patch applied.

In Nicolin's case, I think it didn't work without the patch either; this
just triggered the failure earlier.

> While the iommu_dma_map_page() case can already fail due to
> "too large" requests because of not setting a max mapping size,
> this patch can cause smaller requests to fail as well until Patch 4
> gets applied. That might be problem to avoid, perhaps by
> merging the Patch 4 changes into this patch.

I'll leave this up to Christoph. Personally, I'm keen to avoid having
a giant patch trying to fix all the SWIOTLB allocation issues in one go,
as it will inevitably get reverted due to a corner case that we weren't
able to test properly, breaking the common cases at the same time.

Will

2024-02-23 13:36:34

by Petr Tesařík

Subject: Re: [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling

On Fri, 23 Feb 2024 12:47:43 +0000
Will Deacon <[email protected]> wrote:

> On Wed, Feb 21, 2024 at 11:35:44PM +0000, Michael Kelley wrote:
> > From: Will Deacon <[email protected]> Sent: Wednesday, February 21, 2024 3:35 AM
> > >
> > > Commit bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix"),
> > > which was a fix for commit 0eee5ae10256 ("swiotlb: fix slot alignment
> > > checks"), causes a functional regression with vsock in a virtual machine
> > > using bouncing via a restricted DMA SWIOTLB pool.
> > >
> > > When virtio allocates the virtqueues for the vsock device using
> > > dma_alloc_coherent(), the SWIOTLB search can return page-unaligned
> > > allocations if 'area->index' was left unaligned by a previous allocation
> > > from the buffer:
> > >
> > > # Final address in brackets is the SWIOTLB address returned to the caller
> > > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask
> > > 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800)
> > > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask
> > > 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800)
> > > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask
> > > 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800)
> > >
> > > This ends badly (typically buffer corruption and/or a hang) because
> > > swiotlb_alloc() is expecting a page-aligned allocation and so blindly
> > > returns a pointer to the 'struct page' corresponding to the allocation,
> > > therefore double-allocating the first half (2KiB slot) of the 4KiB page.
> > >
> > > Fix the problem by treating the allocation alignment separately to any
> > > additional alignment requirements from the device, using the maximum
> > > of the two as the stride to search the buffer slots and taking care
> > > to ensure a minimum of page-alignment for buffers larger than a page.
> >
> > Could you also add some text that this patch fixes the scenario I
> > described in the other email thread? Something like:
> >
> > The changes to page alignment handling also fix a problem when
> > the alloc_align_mask is zero. The page alignment handling added
> > in the two mentioned commits could force alignment to more bits
> > in orig_addr than specified by the device's DMA min_align_mask,
> > resulting in a larger offset. Since swiotlb_max_mapping_size()
> > is based only on the DMA min_align_mask, that larger offset
> > plus the requested size could exceed IO_TLB_SEGSIZE slots, and
> > the mapping could fail when it shouldn't.
>
> Thanks, Michael. I can add that in.
>
> > > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > > index b079a9a8e087..2ec2cc81f1a2 100644
> > > --- a/kernel/dma/swiotlb.c
> > > +++ b/kernel/dma/swiotlb.c
> > > @@ -982,7 +982,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > > phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
> > > unsigned long max_slots = get_max_slots(boundary_mask);
> > > unsigned int iotlb_align_mask =
> > > - dma_get_min_align_mask(dev) | alloc_align_mask;
> > > + dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
> > > unsigned int nslots = nr_slots(alloc_size), stride;
> > > unsigned int offset = swiotlb_align_offset(dev, orig_addr);
> > > unsigned int index, slots_checked, count = 0, i;
> > > @@ -993,19 +993,18 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > > BUG_ON(!nslots);
> > > BUG_ON(area_index >= pool->nareas);
> > >
> > > + /*
> > > + * For mappings with an alignment requirement don't bother looping to
> > > + * unaligned slots once we found an aligned one.
> > > + */
> > > + stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
> > > +
> > > /*
> > > * For allocations of PAGE_SIZE or larger only look for page aligned
> > > * allocations.
> > > */
> > > if (alloc_size >= PAGE_SIZE)
> > > - iotlb_align_mask |= ~PAGE_MASK;
> > > - iotlb_align_mask &= ~(IO_TLB_SIZE - 1);
> > > -
> > > - /*
> > > - * For mappings with an alignment requirement don't bother looping to
> > > - * unaligned slots once we found an aligned one.
> > > - */
> > > - stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1;
> > > + stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
> >
> > Is this special handling of alloc_size >= PAGE_SIZE really needed?
>
> I've been wondering that as well, but please note that this code (and the
> comment) are in the upstream code, so I was erring in favour of keeping
> that while fixing the bugs. We could have an extra patch dropping it if
> we can convince ourselves that it's not adding anything, though.
>
> > I think the comment is somewhat inaccurate. If orig_addr is non-zero, and
> > alloc_align_mask is zero, the requirement is for the alignment to match
> > the DMA min_align_mask bits in orig_addr, even if the allocation is
> > larger than a page. And with Patch 3 of this series, the swiotlb_alloc()
> > case passes in alloc_align_mask to handle page size and larger requests.
> > So it seems like this doesn't do anything useful unless orig_addr and
> > alloc_align_mask are both zero, and there aren't any cases of that
> > after this patch series. If the caller wants alignment, specify
> > it with alloc_align_mask.
>
> It's an interesting observation. Presumably the intention here is to
> reduce the cost of the linear search, but the code originates from a
> time when we didn't have iotlb_align_mask or alloc_align_mask and so I
> tend to agree that it should probably just be dropped. I'm also not even
> convinced that it works properly if the initial search index ends up
> being 2KiB (i.e. slot) aligned -- we'll end up jumping over the
> page-aligned addresses!

Originally, SWIOTLB was not used for allocations, so orig_addr was
never zero. The assumption was that if the bounce buffer should be
page-aligned, then the original buffer was also page-aligned, and the
check against iotlb_align_mask was sufficient.

> I'll add another patch to v5 which removes this check (and you've basically
> written the commit message for me, so thanks).
>
> > > spin_lock_irqsave(&area->lock, flags);
> > > if (unlikely(nslots > pool->area_nslabs - area->used))
> > > @@ -1015,11 +1014,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > > index = area->index;
> > >
> > > for (slots_checked = 0; slots_checked < pool->area_nslabs; ) {
> > > - slot_index = slot_base + index;
> > > + phys_addr_t tlb_addr;
> > >
> > > - if (orig_addr &&
> > > - (slot_addr(tbl_dma_addr, slot_index) &
> > > - iotlb_align_mask) != (orig_addr & iotlb_align_mask)) {
> > > + slot_index = slot_base + index;
> > > + tlb_addr = slot_addr(tbl_dma_addr, slot_index);
> > > +
> > > + if ((tlb_addr & alloc_align_mask) ||
> > > + (orig_addr && (tlb_addr & iotlb_align_mask) !=
> > > + (orig_addr & iotlb_align_mask))) {
> >
> > It looks like these changes will cause a mapping failure in some
> > iommu_dma_map_page() cases that previously didn't fail.
>
> Hmm, it's really hard to tell. This code has been quite badly broken for
> some time, so I'm not sure how far back you have to go to find a kernel
> that would work properly (e.g. for Nicolin's case with 64KiB pages).

I believe it fails exactly in the cases that previously found an
incorrectly aligned bounce buffer.

In any case, the "middle" bits (low bits but ignoring offset inside a
slot) of tlb_addr should indeed correspond to the middle bits of
orig_addr.

>
> > Everything is made right by Patch 4 of your series, but from a
> > bisect standpoint, there will be a gap where things are worse.
> > In [1], I think Nicolin reported a crash with just this patch applied.
>
> In Nicolin's case, I think it didn't work without the patch either, this
> just triggered the failure earlier.
>
> > While the iommu_dma_map_page() case can already fail due to
> > "too large" requests because of not setting a max mapping size,
> > this patch can cause smaller requests to fail as well until Patch 4
> > gets applied. That might be problem to avoid, perhaps by
> > merging the Patch 4 changes into this patch.
>
> I'll leave this up to Christoph. Personally, I'm keen to avoid having
> a giant patch trying to fix all the SWIOTLB allocation issues in one go,
> as it will inevitably get reverted due to a corner case that we weren't
> able to test properly, breaking the common cases at the same time.

I tend to think that more patches are better, even though this patch
alone does introduce some regressions.

Petr T

2024-02-23 17:05:18

by Michael Kelley

Subject: RE: [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling

From: Will Deacon <[email protected]> Sent: Friday, February 23, 2024 4:48 AM
> On Wed, Feb 21, 2024 at 11:35:44PM +0000, Michael Kelley wrote:
> > From: Will Deacon <[email protected]> Sent: Wednesday, February 21, 2024 3:35 AM

[snip]

> > > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > > index b079a9a8e087..2ec2cc81f1a2 100644
> > > --- a/kernel/dma/swiotlb.c
> > > +++ b/kernel/dma/swiotlb.c
> > > @@ -982,7 +982,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > > phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
> > > unsigned long max_slots = get_max_slots(boundary_mask);
> > > unsigned int iotlb_align_mask =
> > > - dma_get_min_align_mask(dev) | alloc_align_mask;
> > > + dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
> > > unsigned int nslots = nr_slots(alloc_size), stride;
> > > unsigned int offset = swiotlb_align_offset(dev, orig_addr);
> > > unsigned int index, slots_checked, count = 0, i;
> > > @@ -993,19 +993,18 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > > BUG_ON(!nslots);
> > > BUG_ON(area_index >= pool->nareas);
> > >
> > > + /*
> > > + * For mappings with an alignment requirement don't bother looping to
> > > + * unaligned slots once we found an aligned one.
> > > + */
> > > + stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
> > > +
> > > /*
> > > * For allocations of PAGE_SIZE or larger only look for page aligned
> > > * allocations.
> > > */
> > > if (alloc_size >= PAGE_SIZE)
> > > - iotlb_align_mask |= ~PAGE_MASK;
> > > - iotlb_align_mask &= ~(IO_TLB_SIZE - 1);
> > > -
> > > - /*
> > > - * For mappings with an alignment requirement don't bother looping to
> > > - * unaligned slots once we found an aligned one.
> > > - */
> > > - stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1;
> > > + stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
> >
> > Is this special handling of alloc_size >= PAGE_SIZE really needed?
>
> I've been wondering that as well, but please note that this code (and the
> comment) are in the upstream code, so I was erring in favour of keeping
> that while fixing the bugs. We could have an extra patch dropping it if
> we can convince ourselves that it's not adding anything, though.
>
> > I think the comment is somewhat inaccurate. If orig_addr is non-zero, and
> > alloc_align_mask is zero, the requirement is for the alignment to match
> > the DMA min_align_mask bits in orig_addr, even if the allocation is
> > larger than a page. And with Patch 3 of this series, the swiotlb_alloc()
> > case passes in alloc_align_mask to handle page size and larger requests.
> > So it seems like this doesn't do anything useful unless orig_addr and
> > alloc_align_mask are both zero, and there aren't any cases of that
> > after this patch series. If the caller wants alignment, specify
> > it with alloc_align_mask.
>
> It's an interesting observation. Presumably the intention here is to
> reduce the cost of the linear search, but the code originates from a
> time when we didn't have iotlb_align_mask or alloc_align_mask and so I
> tend to agree that it should probably just be dropped. I'm also not even
> convinced that it works properly if the initial search index ends up
> being 2KiB (i.e. slot) aligned -- we'll end up jumping over the
> page-aligned addresses!
>
> I'll add another patch to v5 which removes this check (and you've basically
> written the commit message for me, so thanks).

Works for me.

>
> > > spin_lock_irqsave(&area->lock, flags);
> > > if (unlikely(nslots > pool->area_nslabs - area->used))
> > > @@ -1015,11 +1014,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > > index = area->index;
> > >
> > > for (slots_checked = 0; slots_checked < pool->area_nslabs; ) {
> > > - slot_index = slot_base + index;
> > > + phys_addr_t tlb_addr;
> > >
> > > - if (orig_addr &&
> > > - (slot_addr(tbl_dma_addr, slot_index) &
> > > - iotlb_align_mask) != (orig_addr & iotlb_align_mask)) {
> > > + slot_index = slot_base + index;
> > > + tlb_addr = slot_addr(tbl_dma_addr, slot_index);
> > > +
> > > + if ((tlb_addr & alloc_align_mask) ||
> > > + (orig_addr && (tlb_addr & iotlb_align_mask) !=
> > > + (orig_addr & iotlb_align_mask))) {
> >
> > It looks like these changes will cause a mapping failure in some
> > iommu_dma_map_page() cases that previously didn't fail.
>
> Hmm, it's really hard to tell. This code has been quite badly broken for
> some time, so I'm not sure how far back you have to go to find a kernel
> that would work properly (e.g. for Nicolin's case with 64KiB pages).
>
> > Everything is made right by Patch 4 of your series, but from a
> > bisect standpoint, there will be a gap where things are worse.
> > In [1], I think Nicolin reported a crash with just this patch applied.
>
> In Nicolin's case, I think it didn't work without the patch either; this
> just triggered the failure earlier.
>
> > While the iommu_dma_map_page() case can already fail due to
> > "too large" requests because of not setting a max mapping size,
> > this patch can cause smaller requests to fail as well until Patch 4
> > gets applied. That might be a problem to avoid, perhaps by
> > merging the Patch 4 changes into this patch.
>
> I'll leave this up to Christoph. Personally, I'm keen to avoid having
> a giant patch trying to fix all the SWIOTLB allocation issues in one go,
> as it will inevitably get reverted due to a corner case that we weren't
> able to test properly, breaking the common cases at the same time.
>

Yes, I agree there's a tradeoff against cramming all the changes into
one big patch, so I'm OK with whichever approach is taken.

FWIW, here is the case I'm concerned about being broken after this
patch, but before Patch 4 of the series:

* alloc_align_mask is 0xFFFF (e.g., due to 64K IOMMU granule)
* iotlb_align_mask is 0x800 (DMA min_align_mask is 4K - 1, as for NVMe)
* orig_addr is non-zero and has bit 0x800 set

In the new "if" statement, any tlb_addr that produces "false" for
the left half of the "||" operator produces "true" for the right half.
So the entire "if" statement always evaluates to true and the
"for" loop never finds any slots that can be used. In other words,
for this case there's no way for the returned swiotlb memory to be
aligned to alloc_align_mask and to orig_addr (modulo DMA
min_align_mask) at the same time, and the mapping fails.
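
A small compile-and-run sketch of that argument (the masks are the ones
above; the orig_addr value is made up and only its 0x800 bit matters, and
the loop is just a stand-in for the slot search, not swiotlb code):

#include <stdbool.h>
#include <stdio.h>

#define ALLOC_ALIGN_MASK        0xFFFFul        /* 64K IOMMU granule - 1 */
#define IOTLB_ALIGN_MASK        0x800ul         /* (4K - 1) & ~(IO_TLB_SIZE - 1) */
#define IO_TLB_SIZE             0x800ul         /* 2KiB slots */

int main(void)
{
        unsigned long orig_addr = 0x123800;     /* hypothetical, bit 0x800 set */
        bool found = false;

        /* Walk slot-aligned addresses the way the search loop would. */
        for (unsigned long tlb_addr = 0; tlb_addr < 0x40000; tlb_addr += IO_TLB_SIZE) {
                bool reject = (tlb_addr & ALLOC_ALIGN_MASK) ||
                              ((tlb_addr & IOTLB_ALIGN_MASK) !=
                               (orig_addr & IOTLB_ALIGN_MASK));

                if (!reject) {
                        printf("slot at 0x%lx satisfies both checks\n", tlb_addr);
                        found = true;
                }
        }
        if (!found)
                printf("no slot satisfies both alignment constraints\n");
        return 0;
}

Every 64KiB-aligned address fails the orig_addr check and every other
address fails the alloc_align_mask check, so the search always comes up
empty.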

Michael

2024-02-23 19:58:40

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

On Wed, Feb 21, 2024 at 11:39:29PM +0000, Michael Kelley wrote:
> From: Will Deacon <[email protected]> Sent: Wednesday, February 21, 2024 3:35 AM
> > +static size_t iommu_dma_max_mapping_size(struct device *dev)
> > +{
> > + if (is_swiotlb_active(dev) && dev_is_untrusted(dev))
> > + return swiotlb_max_mapping_size(dev);
> > + return SIZE_MAX;
> > +}
> > +
>
> In this [1] email, Nicolin had a version of this function that incorporated
> the IOMMU granule. For granules bigger than 4K, I think that's needed
> so that when IOMMU code sets the swiotlb alloc_align_mask to the
> IOMMU granule - 1, the larger offset plus the size won't exceed the
> max number of slots. swiotlb_max_mapping_size() by itself may
> return a value that's too big when alloc_align_mask is used.
>
> Michael
>
> [1] https://lore.kernel.org/linux-iommu/SN6PR02MB415727E61B5295C259CCB268D4512@SN6PR02MB4157.namprd02.prod.outlook.com/T/#m14dd359c5a4dd13e6cb0c35cf94f8d746257ae48

Yea, I just confirmed that with 64KB PAGE_SIZE the alloc_size
can be over swiotlb_max_mapping_size, i.e. 256KB > 252KB. Yet,
the max size corresponding to the max number of slots should
be 256KB. So, I feel that this is marginally safe?
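
For reference, the arithmetic can be checked with a small stand-alone
program (it only assumes swiotlb's usual 2KiB slots and 128-slot segment
limit, plus NVMe's 4K - 1 min_align_mask; the names are local to the
sketch, not kernel symbols):

#include <stdio.h>

#define KB(x)           ((unsigned long)(x) << 10)
#define IO_TLB_SIZE     KB(2)           /* one swiotlb slot */
#define IO_TLB_SEGSIZE  128             /* max contiguous slots */
#define ALIGN_UP(x, a)  (((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
        unsigned long granule = KB(64);         /* 64KB IOVA granule */
        unsigned long min_align = KB(4);        /* NVMe min_align_mask rounded to a slot */
        unsigned long slot_capacity = IO_TLB_SIZE * IO_TLB_SEGSIZE;
        unsigned long max_mapping = slot_capacity - min_align; /* swiotlb_max_mapping_size() */
        unsigned long alloc_size = ALIGN_UP(max_mapping, granule); /* after iova_align() */

        /* 252KB rounds up to 256KB, which is exactly the 128-slot capacity. */
        printf("max mapping %luKB, granule-rounded alloc %luKB, slot capacity %luKB\n",
               max_mapping >> 10, alloc_size >> 10, slot_capacity >> 10);
        return 0;
}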

With that being said, there seems to be a 4KB size waste, due
to aligning with the iommu_domain granule, in this particular
alloc_size=256KB case?

On the other hand, if swiotlb_max_mapping_size had the 64KB
granule subtracted from it, i.e. alloc_size=192KB, wouldn't that
generate more swiotlb fragments?

Thanks
Nicolin

2024-02-23 21:11:22

by Michael Kelley

[permalink] [raw]
Subject: RE: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

From: Nicolin Chen <[email protected]> Sent: Friday, February 23, 2024 11:58 AM
>
> On Wed, Feb 21, 2024 at 11:39:29PM +0000, Michael Kelley wrote:
> > From: Will Deacon <[email protected]> Sent: Wednesday, February 21, 2024 3:35 AM
> > > +static size_t iommu_dma_max_mapping_size(struct device *dev)
> > > +{
> > > + if (is_swiotlb_active(dev) && dev_is_untrusted(dev))
> > > + return swiotlb_max_mapping_size(dev);
> > > + return SIZE_MAX;
> > > +}
> > > +
> >
> > In this [1] email, Nicolin had a version of this function that incorporated
> > the IOMMU granule. For granules bigger than 4K, I think that's needed
> > so that when IOMMU code sets the swiotlb alloc_align_mask to the
> > IOMMU granule - 1, the larger offset plus the size won't exceed the
> > max number of slots. swiotlb_max_mapping_size() by itself may
> > return a value that's too big when alloc_align_mask is used.
> >
> > Michael
> >
> > [1] https://lore.kernel.org/linux-iommu/SN6PR02MB415727E61B5295C259CCB268D4512@SN6PR02MB4157.namprd02.prod.outlook.com/T/#m14dd359c5a4dd13e6cb0c35cf94f8d746257ae48
>
> Yea, I just confirmed that with 64KB PAGE_SIZE the alloc_size
> can be over swiotlb_max_mapping_size, i.e. 256KB > 252KB. Yet,
> the max size corresponding to the max number of slots should
> be 256KB. So, I feel that this is marginally safe?

Yes, in the specific case you tested. But I don't think the general
case is safe. In your specific case, the "size" argument to
iommu_dma_map_page() is presumably 252 Kbytes because that's
what your new iommu_dma_max_mapping_size() returns.
iommu_dma_map_page() calls iova_align() to round up the 252K
to 256K. Also in your specific case, the "offset" argument to
iommu_dma_map_page() is 0, so the "phys" variable (which embeds
the offset) passed to swiotlb_tbl_map_single() is aligned on a
64 Kbyte page boundary. swiotlb_tbl_map_single() then
computes an offset in the orig_addr (i.e., "phys") based on the
DMA min_align_mask (4 Kbytes), and that value is 0 in your specific
case. swiotlb_tbl_map_single() then calls swiotlb_find_slots()
specifying a size that is 256K bytes plus an offset of 0, so everything
works.

But what if the original offset passed to iommu_dma_map_page()
is 3 Kbytes (for example)? swiotlb_tbl_map_single() will have an
orig_addr that is offset from a page boundary by 3 Kbytes. The
value passed to swiotlb_find_slots() will be 256 Kbytes plus an
offset of 3 Kbytes, which is too big.

>
> With that being said, there seems to be a 4KB size waste, due
> to aligning with the iommu_domain granule, in this particular
> alloc_size=256KB case?
>
> On the other hand, if swiotlb_max_mapping_size had the 64KB
> granule subtracted from it, i.e. alloc_size=192KB, wouldn't that
> generate more swiotlb fragments?

dma_max_mapping_size() returns a fixed value for a particular
device, so the fixed value must be conservative enough to handle
dma_map_page() calls with a non-zero offset (or the similar
dma_map_sg() where the scatter/gather list has non-zero
offsets). So yes, the higher layers must limit I/O size, which
can produce more separate I/Os and swiotlb fragments to
fulfill an original request that is 256 Kbytes or larger. The
same thing happens with 4K page size in that a 256K I/O
gets broken into a 252K I/O and a 4K I/O if the swiotlb is
being used.

I'm trying to think through if there's a way to make things
work with iommu_dma_max_mapping_size() being less
conservative than subtracting the full granule size. I'm
doubtful that there is.

Michael

2024-02-25 21:17:48

by Michael Kelley

[permalink] [raw]
Subject: RE: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

From: Michael Kelley <[email protected]> Sent: Friday, February 23, 2024 1:11 PM
>
> From: Nicolin Chen <[email protected]> Sent: Friday, February 23, 2024 11:58 AM
> >
> > On Wed, Feb 21, 2024 at 11:39:29PM +0000, Michael Kelley wrote:
> > > From: Will Deacon <[email protected]> Sent: Wednesday, February 21, 2024 3:35 AM
> > > > +static size_t iommu_dma_max_mapping_size(struct device *dev)
> > > > +{
> > > > + if (is_swiotlb_active(dev) && dev_is_untrusted(dev))
> > > > + return swiotlb_max_mapping_size(dev);
> > > > + return SIZE_MAX;
> > > > +}
> > > > +
> > >
> > > In this [1] email, Nicolin had a version of this function that incorporated
> > > the IOMMU granule. For granules bigger than 4K, I think that's needed
> > > so that when IOMMU code sets the swiotlb alloc_align_mask to the
> > > IOMMU granule - 1, the larger offset plus the size won't exceed the
> > > max number of slots. swiotlb_max_mapping_size() by itself may
> > > return a value that's too big when alloc_align_mask is used.
> > >
> > > Michael
> > >
> > > [1] https://lore.kernel.org/linux-iommu/SN6PR02MB415727E61B5295C259CCB268D4512@SN6PR02MB4157.namprd02.prod.outlook.com/T/#m14dd359c5a4dd13e6cb0c35cf94f8d746257ae48
> >
> > Yea, I just confirmed that with 64KB PAGE_SIZE the alloc_size
> > can be over swiotlb_max_mapping_size, i.e. 256KB > 252KB. Yet,
> > the max size corresponding to the max number of slots should
> > be 256KB. So, I feel that this is marginally safe?
>
> Yes, in the specific case you tested. But I don't think the general
> case is safe. In your specific case, the "size" argument to
> iommu_dma_map_page() is presumably 252 Kbytes because that's
> what your new iommu_dma_max_mapping_size() returns.
> iommu_dma_map_page() calls iova_align() to round up the 252K
> to 256K. Also in your specific case, the "offset" argument to
> iommu_dma_map_page() is 0, so the "phys" variable (which embeds
> the offset) passed to swiotlb_tbl_map_single() is aligned on a
> 64 Kbyte page boundary. swiotlb_tbl_map_single() then
> computes an offset in the orig_addr (i.e., "phys") based on the
> DMA min_align_mask (4 Kbytes), and that value is 0 in your specific
> case. swiotlb_tbl_map_single() then calls swiotlb_find_slots()
> specifying a size that is 256K bytes plus an offset of 0, so everything
> works.
>
> But what if the original offset passed to iommu_dma_map_page()
> is 3 Kbytes (for example)? swiotlb_tbl_map_single() will have an
> orig_addr that is offset from a page boundary by 3 Kbytes. The
> value passed to swiotlb_find_slots() will be 256 Kbytes plus an
> offset of 3 Kbytes, which is too big.
>
> >
> > With that being said, there seems to be a 4KB size waste, due
> > to aligning with the iommu_domain granule, in this particular
> > alloc_size=256KB case?
> >
> > On the other hand, if swiotlb_max_mapping_size was subtracted
> > by 64KB (granule=64KB), i.e. alloc_size=192KB, which seems to
> > generate more swiotlb fragments?
>
> dma_max_mapping_size() returns a fixed value for a particular
> device, so the fixed value must be conservative enough to handle
> dma_map_page() calls with a non-zero offset (or the similar
> dma_map_sg() where the scatter/gather list has non-zero
> offsets). So yes, the higher layers must limit I/O size, which
> can produce more separate I/Os and swiotlb fragments to
> fulfill an original request that is 256 Kbytes or larger. The
> same thing happens with 4K page size in that a 256K I/O
> gets broken into a 252K I/O and a 4K I/O if the swiotlb is
> being used.
>
> I'm trying to think through if there's a way to make things
> work with iommu_dma_max_mapping_size() being less
> conservative than subtracting the full granule size. I'm
> doubtful that there is.
>

With the current interface between iommu_dma_map_page()
and swiotlb_tbl_map_single(), iommu_dma_max_mapping_size()
really has no choice but to be more conservative and subtract
the full granule size.

The underlying problem is how iommu_dma_map_page() and
swiotlb_tbl_map_single() interact. iommu_dma_map_page()
rounds up the size to a granule boundary, and then
swiotlb_tbl_map_single() adds the offset modulo DMA
min_align_mask. These steps should be done in the opposite
order -- first add the offset then do the rounding up. With
that opposite order, iommu_dma_max_mapping_size() could be
less conservative and be based solely on min_align_mask.
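
A tiny before/after sketch of the two orderings, reusing the 252K size
and the 3K offset from the example above (illustrative arithmetic only,
not kernel code):

#include <stdio.h>

#define KB(x)           ((unsigned long)(x) << 10)
#define ALIGN_UP(x, a)  (((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
        unsigned long granule = KB(64);         /* alloc_align_mask + 1 */
        unsigned long size = KB(252);           /* capped by max_mapping_size */
        unsigned long offset = KB(3);           /* orig_addr offset within min_align_mask */
        unsigned long capacity = KB(256);       /* 128 slots * 2KB */

        /* Current order: round up to the granule first, then add the offset. */
        unsigned long current = ALIGN_UP(size, granule) + offset;
        /* Proposed order: add the offset first, then round up once. */
        unsigned long proposed = ALIGN_UP(size + offset, granule);

        printf("round-then-add: %luKB (%s)\n", current >> 10,
               current <= capacity ? "fits" : "exceeds the slot limit");
        printf("add-then-round: %luKB (%s)\n", proposed >> 10,
               proposed <= capacity ? "fits" : "exceeds the slot limit");
        return 0;
}

With the offset folded in before the granule round-up, a limit based
solely on min_align_mask is enough for this case.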

I could see using the following approach, where alignment
and rounding up are both done in swiotlb_tbl_map_single(),
based on the alloc_align_mask parameter. Conceptually it
makes sense to align the start and the end of the allocated
buffer in the same function rather than splitting them.

1. Remove the alloc_size parameter from
swiotlb_tbl_map_single(). Fix up the two other call sites, where
alloc_size is redundant anyway.

2. swiotlb_tbl_map_single() calculates its own local
alloc_size as map_size + offset, rounded up based on
alloc_align_mask. If alloc_align_mask is zero, this
rounding up is a no-op, and alloc_size equals map_size
+ offset. Pass this local alloc_size to swiotlb_find_slots.

3. iommu_dma_map_page() removes the iova_align() call
just prior to invoking swiotlb_tbl_map_single().

Looking at the code, there are also problems in how
iommu_dma_map_page() zeroes out portions of the
allocated swiotlb buffer. Evidently the zeroing is done
to avoid giving an untrusted device DMA access to kernel
memory areas that could contain sensitive information.
But I see these issues in the code:

1. The zeroing doesn't cover any "padding" at the
beginning of the swiotlb buffer due to min_align_mask
offsetting. The cover letter for the last set of changes in
this area explicitly says that such padding isn't zeroed [1]
but I can't reconcile why. The swiotlb *does* allocate
padding slots at the beginning, even in the 5.16 kernel
when those commits first went in. Maybe I'm missing
something.

2. With commit 901c7280ca0d, swiotlb_tbl_map_single()
always calls swiotlb_bounce(), which initializes the
mapped portion regardless of the DMA xfer direction. But
then in the case of DMA_FROM_DEVICE (i.e., read I/Os)
iommu_dma_map_page() will zero what swiotlb_bounce()
already initialized. The discussion that led to commit
901c7280ca0d concluded that the buffer must be
initialized to the ultimate target buffer contents so that
partial reads don't change data that wasn't read. This
zeroing violates that and should be removed (unless
there's some unique handling of untrusted devices I'm
not aware of).

3. If the swiotlb path is taken for reasons other than an
untrusted device, iommu_dma_map_page() still does the
zeroing. That doesn't break anything, but it seems wasteful
as there could be a lot of padding area that is unnecessarily
zeroed in the dma_kmalloc_needs_bounce() case, especially
if the IOMMU granule is 64K.

Net, it seems like some additional work beyond Patch 5
of Will's series is needed to get Nicolin's path working
robustly. I could help with coding, but I can't test IOMMU
paths as I only have access to VMs (x64 and arm64).

Michael

[1] https://lore.kernel.org/all/[email protected]/

2024-02-26 19:36:12

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

On 21/02/2024 11:35 am, Will Deacon wrote:
> From: Nicolin Chen <[email protected]>
>
> The swiotlb does not support a mapping size > swiotlb_max_mapping_size().
> On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that
> an NVME device can map a size between 300KB~512KB, which certainly failed
> the swiotlb mappings, though the default pool of swiotlb has many slots:
> systemd[1]: Started Journal Service.
> => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
> note: journal-offline[392] exited with irqs disabled
> note: journal-offline[392] exited with preempt_count 1
>
> Call trace:
> [ 3.099918] swiotlb_tbl_map_single+0x214/0x240
> [ 3.099921] iommu_dma_map_page+0x218/0x328
> [ 3.099928] dma_map_page_attrs+0x2e8/0x3a0
> [ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme]
> [ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme]
> [ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600
> [ 3.102321] blk_add_rq_to_plug+0x180/0x2a0
> [ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8
> [ 3.103463] __submit_bio+0x44/0x220
> [ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360
> [ 3.103470] submit_bio_noacct+0x180/0x6c8
> [ 3.103471] submit_bio+0x34/0x130
> [ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8
> [ 3.104766] mpage_submit_folio+0xa0/0x100
> [ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400
> [ 3.104771] ext4_do_writepages+0x6a0/0xd78
> [ 3.105615] ext4_writepages+0x80/0x118
> [ 3.105616] do_writepages+0x90/0x1e8
> [ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0
> [ 3.105622] __filemap_fdatawrite_range+0x68/0xb8
> [ 3.106656] file_write_and_wait_range+0x84/0x120
> [ 3.106658] ext4_sync_file+0x7c/0x4c0
> [ 3.106660] vfs_fsync_range+0x3c/0xa8
> [ 3.106663] do_fsync+0x44/0xc0
>
> Since untrusted devices might go down the swiotlb pathway with dma-iommu,
> these devices should not map a size larger than swiotlb_max_mapping_size.
>
> To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to
> take into account swiotlb_max_mapping_size() v.s. iova_rcache_range() from
> the iommu_dma_opt_mapping_size().

On the basis that this is at least far closer to correct than doing nothing,

Acked-by: Robin Murphy <[email protected]>

TBH I'm scared to think about theoretical correctness for all the
interactions between the IOVA granule and min_align_mask, since just the
SWIOTLB stuff is bad enough, even before you realise the ways that the
IOVA allocation isn't necessarily right either. However I reckon as long
as we don't ever see a granule smaller than IO_TLB_SIZE, and/or a
min_align_mask larger than a granule, then this should probably work
well enough as-is.

Cheers,
Robin.

> Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
> Signed-off-by: Nicolin Chen <[email protected]>
> Link: https://lore.kernel.org/r/ee51a3a5c32cf885b18f6416171802669f4a718a.1707851466.git.nicolinc@nvidia.com
> Signed-off-by: Will Deacon <[email protected]>
> ---
> drivers/iommu/dma-iommu.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 50ccc4f1ef81..7d1a20da6d94 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1706,6 +1706,13 @@ static size_t iommu_dma_opt_mapping_size(void)
> return iova_rcache_range();
> }
>
> +static size_t iommu_dma_max_mapping_size(struct device *dev)
> +{
> + if (is_swiotlb_active(dev) && dev_is_untrusted(dev))
> + return swiotlb_max_mapping_size(dev);
> + return SIZE_MAX;
> +}
> +
> static const struct dma_map_ops iommu_dma_ops = {
> .flags = DMA_F_PCI_P2PDMA_SUPPORTED,
> .alloc = iommu_dma_alloc,
> @@ -1728,6 +1735,7 @@ static const struct dma_map_ops iommu_dma_ops = {
> .unmap_resource = iommu_dma_unmap_resource,
> .get_merge_boundary = iommu_dma_get_merge_boundary,
> .opt_mapping_size = iommu_dma_opt_mapping_size,
> + .max_mapping_size = iommu_dma_max_mapping_size,
> };
>
> /*

2024-02-26 21:11:36

by Michael Kelley

[permalink] [raw]
Subject: RE: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

From: Robin Murphy <[email protected]> Sent: Monday, February 26, 2024 11:36 AM
>
> On 21/02/2024 11:35 am, Will Deacon wrote:
> > From: Nicolin Chen <[email protected]>
> >
> > The swiotlb does not support a mapping size > swiotlb_max_mapping_size().
> > On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that
> > an NVME device can map a size between 300KB~512KB, which certainly failed
> > the swiotlb mappings, though the default pool of swiotlb has many slots:
> > systemd[1]: Started Journal Service.
> > => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
> > note: journal-offline[392] exited with irqs disabled
> > note: journal-offline[392] exited with preempt_count 1
> >
> > Call trace:
> > [ 3.099918] swiotlb_tbl_map_single+0x214/0x240
> > [ 3.099921] iommu_dma_map_page+0x218/0x328
> > [ 3.099928] dma_map_page_attrs+0x2e8/0x3a0
> > [ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme]
> > [ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme]
> > [ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600
> > [ 3.102321] blk_add_rq_to_plug+0x180/0x2a0
> > [ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8
> > [ 3.103463] __submit_bio+0x44/0x220
> > [ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360
> > [ 3.103470] submit_bio_noacct+0x180/0x6c8
> > [ 3.103471] submit_bio+0x34/0x130
> > [ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8
> > [ 3.104766] mpage_submit_folio+0xa0/0x100
> > [ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400
> > [ 3.104771] ext4_do_writepages+0x6a0/0xd78
> > [ 3.105615] ext4_writepages+0x80/0x118
> > [ 3.105616] do_writepages+0x90/0x1e8
> > [ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0
> > [ 3.105622] __filemap_fdatawrite_range+0x68/0xb8
> > [ 3.106656] file_write_and_wait_range+0x84/0x120
> > [ 3.106658] ext4_sync_file+0x7c/0x4c0
> > [ 3.106660] vfs_fsync_range+0x3c/0xa8
> > [ 3.106663] do_fsync+0x44/0xc0
> >
> > Since untrusted devices might go down the swiotlb pathway with dma-iommu,
> > these devices should not map a size larger than swiotlb_max_mapping_size.
> >
> > To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to
> > take into account swiotlb_max_mapping_size() v.s. iova_rcache_range() from
> > the iommu_dma_opt_mapping_size().
>
> On the basis that this is at least far closer to correct than doing nothing,
>
> Acked-by: Robin Murphy <[email protected]>
>
> TBH I'm scared to think about theoretical correctness for all the
> interactions between the IOVA granule and min_align_mask, since just the
> SWIOTLB stuff is bad enough, even before you realise the ways that the
> IOVA allocation isn't necessarily right either. However I reckon as long
> as we don't ever see a granule smaller than IO_TLB_SIZE, and/or a
> min_align_mask larger than a granule, then this should probably work
> well enough as-is.
>

I'm not convinced. The conditions you describe are reasonable
and reflect upstream code today. But there can still be a failure
due to attempting to allocate a "too large" swiotlb buffer. It
happens with a granule of 64K and min_align_mask of 4K - 1 (the
NVMe case) when the offset passed to iommu_dma_map_page()
is non-zero modulo 4K. With this patch, the size passed into
iommu_dma_map_page() is limited to 252K, but it gets rounded
up to 256K. Then swiotlb_tbl_map_single() adds the offset
modulo 4K. The result exceeds the 256K limit in swiotlb and
the mapping fails.

The case I describe is a reasonable case that happens in the real
world. As is, this patch will work most of the time because for
large xfers the offset is typically at least 4K aligned. But not always.

Michael

2024-02-27 13:54:35

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

On 26/02/2024 9:11 pm, Michael Kelley wrote:
> From: Robin Murphy <[email protected]> Sent: Monday, February 26, 2024 11:36 AM
>>
>> On 21/02/2024 11:35 am, Will Deacon wrote:
>>> From: Nicolin Chen <[email protected]>
>>>
>>> The swiotlb does not support a mapping size > swiotlb_max_mapping_size().
>>> On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that
>>> an NVME device can map a size between 300KB~512KB, which certainly failed
>>> the swiotlb mappings, though the default pool of swiotlb has many slots:
>>> systemd[1]: Started Journal Service.
>>> => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
>>> note: journal-offline[392] exited with irqs disabled
>>> note: journal-offline[392] exited with preempt_count 1
>>>
>>> Call trace:
>>> [ 3.099918] swiotlb_tbl_map_single+0x214/0x240
>>> [ 3.099921] iommu_dma_map_page+0x218/0x328
>>> [ 3.099928] dma_map_page_attrs+0x2e8/0x3a0
>>> [ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme]
>>> [ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme]
>>> [ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600
>>> [ 3.102321] blk_add_rq_to_plug+0x180/0x2a0
>>> [ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8
>>> [ 3.103463] __submit_bio+0x44/0x220
>>> [ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360
>>> [ 3.103470] submit_bio_noacct+0x180/0x6c8
>>> [ 3.103471] submit_bio+0x34/0x130
>>> [ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8
>>> [ 3.104766] mpage_submit_folio+0xa0/0x100
>>> [ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400
>>> [ 3.104771] ext4_do_writepages+0x6a0/0xd78
>>> [ 3.105615] ext4_writepages+0x80/0x118
>>> [ 3.105616] do_writepages+0x90/0x1e8
>>> [ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0
>>> [ 3.105622] __filemap_fdatawrite_range+0x68/0xb8
>>> [ 3.106656] file_write_and_wait_range+0x84/0x120
>>> [ 3.106658] ext4_sync_file+0x7c/0x4c0
>>> [ 3.106660] vfs_fsync_range+0x3c/0xa8
>>> [ 3.106663] do_fsync+0x44/0xc0
>>>
>>> Since untrusted devices might go down the swiotlb pathway with dma-iommu,
>>> these devices should not map a size larger than swiotlb_max_mapping_size.
>>>
>>> To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to
>>> take into account swiotlb_max_mapping_size() v.s. iova_rcache_range() from
>>> the iommu_dma_opt_mapping_size().
>>
>> On the basis that this is at least far closer to correct than doing nothing,
>>
>> Acked-by: Robin Murphy <[email protected]>
>>
>> TBH I'm scared to think about theoretical correctness for all the
>> interactions between the IOVA granule and min_align_mask, since just the
>> SWIOTLB stuff is bad enough, even before you realise the ways that the
>> IOVA allocation isn't necessarily right either. However I reckon as long
>> as we don't ever see a granule smaller than IO_TLB_SIZE, and/or a
>> min_align_mask larger than a granule, then this should probably work
>> well enough as-is.
>>
>
> I'm not convinced. The conditions you describe are reasonable
> and reflect upstream code today. But there can still be a failure
> due to attempting to allocate a "too large" swiotlb buffer. It
> happens with a granule of 64K and min_align_mask of 4K - 1 (the
> NVMe case) when the offset passed to iommu_dma_map_page()
> is non-zero modulo 4K. With this patch, the size passed into
> iommu_dma_map_page() is limited to 252K, but it gets rounded
> up to 256K. Then swiotlb_tbl_map_single() adds the offset
> modulo 4K. The result exceeds the 256K limit in swiotlb and
> the mapping fails.
>
> The case I describe is a reasonable case that happens in the real
> world. As is, this patch will work most of the time because for
> large xfers the offset is typically at least 4K aligned. But not always.

Indeed that's what my "probably [...] well enough" meant to imply.

I think there's proving to be sufficient complexity here that if it
turns out we do manage to still hit real-world issues with the coarse
approximation, that will be the point when any further effort would be
better spent on finally tackling the thing that's always been on the
to-do list, where we'd only bounce the unaligned start and/or end
granules while mapping the rest of the buffer in place.

Cheers,
Robin.

2024-02-27 14:57:25

by Michael Kelley

[permalink] [raw]
Subject: RE: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

From: Robin Murphy <[email protected]> Sent: Tuesday, February 27, 2024 5:23 AM
>
> On 26/02/2024 9:11 pm, Michael Kelley wrote:
> > From: Robin Murphy <[email protected]> Sent: Monday, February 26, 2024 11:36 AM
> >>
> >> On 21/02/2024 11:35 am, Will Deacon wrote:
> >>> From: Nicolin Chen <[email protected]>
> >>>
> >>> The swiotlb does not support a mapping size > swiotlb_max_mapping_size().
> >>> On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that
> >>> an NVME device can map a size between 300KB~512KB, which certainly failed
> >>> the swiotlb mappings, though the default pool of swiotlb has many slots:
> >>> systemd[1]: Started Journal Service.
> >>> => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
> >>> note: journal-offline[392] exited with irqs disabled
> >>> note: journal-offline[392] exited with preempt_count 1
> >>>
> >>> Call trace:
> >>> [ 3.099918] swiotlb_tbl_map_single+0x214/0x240
> >>> [ 3.099921] iommu_dma_map_page+0x218/0x328
> >>> [ 3.099928] dma_map_page_attrs+0x2e8/0x3a0
> >>> [ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme]
> >>> [ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme]
> >>> [ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600
> >>> [ 3.102321] blk_add_rq_to_plug+0x180/0x2a0
> >>> [ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8
> >>> [ 3.103463] __submit_bio+0x44/0x220
> >>> [ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360
> >>> [ 3.103470] submit_bio_noacct+0x180/0x6c8
> >>> [ 3.103471] submit_bio+0x34/0x130
> >>> [ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8
> >>> [ 3.104766] mpage_submit_folio+0xa0/0x100
> >>> [ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400
> >>> [ 3.104771] ext4_do_writepages+0x6a0/0xd78
> >>> [ 3.105615] ext4_writepages+0x80/0x118
> >>> [ 3.105616] do_writepages+0x90/0x1e8
> >>> [ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0
> >>> [ 3.105622] __filemap_fdatawrite_range+0x68/0xb8
> >>> [ 3.106656] file_write_and_wait_range+0x84/0x120
> >>> [ 3.106658] ext4_sync_file+0x7c/0x4c0
> >>> [ 3.106660] vfs_fsync_range+0x3c/0xa8
> >>> [ 3.106663] do_fsync+0x44/0xc0
> >>>
> >>> Since untrusted devices might go down the swiotlb pathway with dma-iommu,
> >>> these devices should not map a size larger than swiotlb_max_mapping_size.
> >>>
> >>> To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to
> >>> take into account swiotlb_max_mapping_size() v.s. iova_rcache_range() from
> >>> the iommu_dma_opt_mapping_size().
> >>
> >> On the basis that this is at least far closer to correct than doing nothing,
> >>
> >> Acked-by: Robin Murphy <[email protected]>
> >>
> >> TBH I'm scared to think about theoretical correctness for all the
> >> interactions between the IOVA granule and min_align_mask, since just the
> >> SWIOTLB stuff is bad enough, even before you realise the ways that the
> >> IOVA allocation isn't necessarily right either. However I reckon as long
> >> as we don't ever see a granule smaller than IO_TLB_SIZE, and/or a
> >> min_align_mask larger than a granule, then this should probably work
> >> well enough as-is.
> >>
> >
> > I'm not convinced. The conditions you describe are reasonable
> > and reflect upstream code today. But there can still be a failure
> > due to attempting to allocate a "too large" swiotlb buffer. It
> > happens with a granule of 64K and min_align_mask of 4K - 1 (the
> > NVMe case) when the offset passed to iommu_dma_map_page()
> > is non-zero modulo 4K. With this patch, the size passed into
> > iommu_dma_map_page() is limited to 252K, but it gets rounded
> > up to 256K. Then swiotlb_tbl_map_single() adds the offset
> > modulo 4K. The result exceeds the 256K limit in swiotlb and
> > the mapping fails.
> >
> > The case I describe is a reasonable case that happens in the real
> > world. As is, this patch will work most of the time because for
> > large xfers the offset is typically at least 4K aligned. But not always.
>
> Indeed that's what my "probably [...] well enough" meant to imply.
>
> I think there's proving to be sufficient complexity here that if it
> turns out we do manage to still hit real-world issues with the coarse
> approximation, that will be the point when any further effort would be
> better spent on finally tackling the thing that's always been on the
> to-do list, where we'd only bounce the unaligned start and/or end
> granules while mapping the rest of the buffer in place.
>

Fair enough. It's your call to make. :-)

Michael

2024-02-27 15:38:54

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling

On Fri, Feb 23, 2024 at 12:47:43PM +0000, Will Deacon wrote:
> > > /*
> > > * For allocations of PAGE_SIZE or larger only look for page aligned
> > > * allocations.
> > > */
> > > if (alloc_size >= PAGE_SIZE)
> > > - iotlb_align_mask |= ~PAGE_MASK;
> > > - iotlb_align_mask &= ~(IO_TLB_SIZE - 1);
> > > -
> > > - /*
> > > - * For mappings with an alignment requirement don't bother looping to
> > > - * unaligned slots once we found an aligned one.
> > > - */
> > > - stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1;
> > > + stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
> >
> > Is this special handling of alloc_size >= PAGE_SIZE really needed?
>
> I've been wondering that as well, but please note that this code (and the
> comment) are in the upstream code, so I was erring in favour of keeping
> that while fixing the bugs. We could have an extra patch dropping it if
> we can convince ourselves that it's not adding anything, though.

This special casing goes back to before git history. It obviously is not
needed, but it might have made sense back then. If people come up with
a good argument I'm totally fine with removing it, but I also think we
need to get the fixes here in ASAP, so things that are just cleanups
probably aren't a priority right now.

> > While the iommu_dma_map_page() case can already fail due to
> > "too large" requests because of not setting a max mapping size,
> > this patch can cause smaller requests to fail as well until Patch 4
> > gets applied. That might be a problem to avoid, perhaps by
> > merging the Patch 4 changes into this patch.
>
> I'll leave this up to Christoph. Personally, I'm keen to avoid having
> a giant patch trying to fix all the SWIOTLB allocation issues in one go,
> as it will inevitably get reverted due to a corner case that we weren't
> able to test properly, breaking the common cases at the same time.

Let's keep it split.


2024-02-27 15:40:43

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

On Wed, Feb 21, 2024 at 11:35:04AM +0000, Will Deacon wrote:
> +static size_t iommu_dma_max_mapping_size(struct device *dev)
> +{
> + if (is_swiotlb_active(dev) && dev_is_untrusted(dev))
> + return swiotlb_max_mapping_size(dev);

Curious: do we really need both checks here? If swiotlb is active
for a device (for whatever reason), aren't we then always bound
by the max size? If not please add a comment explaining it.


2024-02-27 15:53:17

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

On 27/02/2024 3:40 pm, Christoph Hellwig wrote:
> On Wed, Feb 21, 2024 at 11:35:04AM +0000, Will Deacon wrote:
>> +static size_t iommu_dma_max_mapping_size(struct device *dev)
>> +{
>> + if (is_swiotlb_active(dev) && dev_is_untrusted(dev))
>> + return swiotlb_max_mapping_size(dev);
>
> Curious: do we really need both checks here? If swiotlb is active
> for a device (for whatever reason), aren't we then always bound
> by the max size? If not please add a comment explaining it.
>

Oh, good point - if we have an untrusted device but SWIOTLB isn't
initialised for whatever reason, then it doesn't matter what
max_mapping_size returns because iommu_dma_map_page() is going to bail
out regardless.

Thanks,
Robin.

2024-02-28 12:23:38

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

On Tue, Feb 27, 2024 at 03:53:05PM +0000, Robin Murphy wrote:
> On 27/02/2024 3:40 pm, Christoph Hellwig wrote:
> > On Wed, Feb 21, 2024 at 11:35:04AM +0000, Will Deacon wrote:
> > > +static size_t iommu_dma_max_mapping_size(struct device *dev)
> > > +{
> > > + if (is_swiotlb_active(dev) && dev_is_untrusted(dev))
> > > + return swiotlb_max_mapping_size(dev);
> >
> > Curious: do we really need both checks here? If swiotlb is active
> > for a device (for whatever reason), aren't we then always bound
> > by the max size? If not please add a comment explaining it.
> >
>
> Oh, good point - if we have an untrusted device but SWIOTLB isn't
> initialised for whatever reason, then it doesn't matter what
> max_mapping_size returns because iommu_dma_map_page() is going to bail out
> regardless.

Makes sense. Since this is all internal to the IOMMU DMA code, I can just
drop the first part of the check.

I'll get a v5 out shortly.

Will