Hi all,
This is version five of the patches which I previously posted at:
v1: https://lore.kernel.org/r/[email protected]
v2: https://lore.kernel.org/r/[email protected]
v3: https://lore.kernel.org/r/[email protected]
v4: https://lore.kernel.org/r/[email protected]
Cheers again to everybody who helped to review and test the last
version.
Changes since v4 include:
- Remove unnecessary 'is_swiotlb_active()' from patch 5.
Nicolin: I didn't add your Tested-by because of this, so if you can
take it for another spin, that would be fantastic.
- New patch removing redundant stride adjustment for allocations
of PAGE_SIZE or more.
- Commit message tweaks and addition of tags from reviewers and
testers.
The final patch is a cleanup, so I'm happy to post it again after the
merge window if it doesn't make it this time around. The rest are fixes
and, even though patch five doesn't solve the general problem, it's
sufficient to fix NVME for Nicolin and is definitely an improvement over
what we currently have.
Thanks,
Will
Cc: [email protected]
Cc: Christoph Hellwig <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Petr Tesarik <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Nicolin Chen <[email protected]>
Cc: Michael Kelley <[email protected]>
--->8
Nicolin Chen (1):
iommu/dma: Force swiotlb_max_mapping_size on an untrusted device
Will Deacon (5):
swiotlb: Fix double-allocation of slots due to broken alignment
handling
swiotlb: Enforce page alignment in swiotlb_alloc()
swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()
swiotlb: Fix alignment checks when both allocation and DMA masks are
present
swiotlb: Remove pointless stride adjustment for allocations >=
PAGE_SIZE
drivers/iommu/dma-iommu.c | 9 +++++++++
kernel/dma/swiotlb.c | 36 +++++++++++++++++++++++-------------
2 files changed, 32 insertions(+), 13 deletions(-)
--
2.44.0.rc1.240.g4c46232300-goog
Commit bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix"),
which was a fix for commit 0eee5ae10256 ("swiotlb: fix slot alignment
checks"), causes a functional regression with vsock in a virtual machine
using bouncing via a restricted DMA SWIOTLB pool.
When virtio allocates the virtqueues for the vsock device using
dma_alloc_coherent(), the SWIOTLB search can return page-unaligned
allocations if 'area->index' was left unaligned by a previous allocation
from the buffer:
# Final address in brackets is the SWIOTLB address returned to the caller
| virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800)
| virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800)
| virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800)
This ends badly (typically buffer corruption and/or a hang) because
swiotlb_alloc() is expecting a page-aligned allocation and so blindly
returns a pointer to the 'struct page' corresponding to the allocation,
therefore double-allocating the first half (2KiB slot) of the 4KiB page.
Fix the problem by treating the allocation alignment separately to any
additional alignment requirements from the device, using the maximum
of the two as the stride to search the buffer slots and taking care
to ensure a minimum of page-alignment for buffers larger than a page.
This also resolves swiotlb allocation failures occurring due to the
inclusion of ~PAGE_MASK in 'iotlb_align_mask' for large allocations and
resulting in alignment requirements exceeding swiotlb_max_mapping_size().
Fixes: bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix")
Fixes: 0eee5ae10256 ("swiotlb: fix slot alignment checks")
Cc: Christoph Hellwig <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Dexuan Cui <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
kernel/dma/swiotlb.c | 28 +++++++++++++++-------------
1 file changed, 15 insertions(+), 13 deletions(-)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index b079a9a8e087..2ec2cc81f1a2 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -982,7 +982,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
unsigned long max_slots = get_max_slots(boundary_mask);
unsigned int iotlb_align_mask =
- dma_get_min_align_mask(dev) | alloc_align_mask;
+ dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
unsigned int nslots = nr_slots(alloc_size), stride;
unsigned int offset = swiotlb_align_offset(dev, orig_addr);
unsigned int index, slots_checked, count = 0, i;
@@ -993,19 +993,18 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
BUG_ON(!nslots);
BUG_ON(area_index >= pool->nareas);
+ /*
+ * For mappings with an alignment requirement don't bother looping to
+ * unaligned slots once we found an aligned one.
+ */
+ stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
+
/*
* For allocations of PAGE_SIZE or larger only look for page aligned
* allocations.
*/
if (alloc_size >= PAGE_SIZE)
- iotlb_align_mask |= ~PAGE_MASK;
- iotlb_align_mask &= ~(IO_TLB_SIZE - 1);
-
- /*
- * For mappings with an alignment requirement don't bother looping to
- * unaligned slots once we found an aligned one.
- */
- stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1;
+ stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
spin_lock_irqsave(&area->lock, flags);
if (unlikely(nslots > pool->area_nslabs - area->used))
@@ -1015,11 +1014,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
index = area->index;
for (slots_checked = 0; slots_checked < pool->area_nslabs; ) {
- slot_index = slot_base + index;
+ phys_addr_t tlb_addr;
- if (orig_addr &&
- (slot_addr(tbl_dma_addr, slot_index) &
- iotlb_align_mask) != (orig_addr & iotlb_align_mask)) {
+ slot_index = slot_base + index;
+ tlb_addr = slot_addr(tbl_dma_addr, slot_index);
+
+ if ((tlb_addr & alloc_align_mask) ||
+ (orig_addr && (tlb_addr & iotlb_align_mask) !=
+ (orig_addr & iotlb_align_mask))) {
index = wrap_area_index(pool, index + 1);
slots_checked++;
continue;
--
2.44.0.rc1.240.g4c46232300-goog
Nicolin reports that swiotlb buffer allocations fail for an NVME device
behind an IOMMU using 64KiB pages. This is because we end up with a
minimum allocation alignment of 64KiB (for the IOMMU to map the buffer
safely) but a minimum DMA alignment mask corresponding to a 4KiB NVME
page (i.e. preserving the 4KiB page offset from the original allocation).
If the original address is not 4KiB-aligned, the allocation will fail
because swiotlb_search_pool_area() erroneously compares these unmasked
bits with the 64KiB-aligned candidate allocation.
Tweak swiotlb_search_pool_area() so that the DMA alignment mask is
reduced based on the required alignment of the allocation.
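To illustrate, here is a standalone model of the failing check (the masks are
assumed for the 64KiB-granule/4KiB-NVME-page case described above; this is a
sketch, not the kernel code itself):
#define IO_TLB_SIZE	0x800UL			/* 2KiB swiotlb slots */
/* Returns non-zero when a candidate slot address would be skipped. */
static int slot_rejected(unsigned long tlb_addr, unsigned long orig_addr)
{
	unsigned long alloc_align_mask = 0xffffUL;	/* 64KiB IOMMU granule */
	unsigned long min_align_mask   = 0xfffUL;	/* 4KiB NVME page */
	unsigned long iotlb_align_mask = min_align_mask & ~(IO_TLB_SIZE - 1);
	/*
	 * Before this change: a candidate that passes the 64KiB alignment
	 * check has its low 16 bits clear, so it can never reproduce bit 11
	 * of orig_addr when that bit is set, and every slot is rejected.
	 */
	return (tlb_addr & alloc_align_mask) ||
	       ((tlb_addr & iotlb_align_mask) != (orig_addr & iotlb_align_mask));
}
With the tweak, alloc_align_mask picks up the slot mask and iotlb_align_mask is
reduced by '&= ~alloc_align_mask', so the spurious comparison of those bits
disappears.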
Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
Reported-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
kernel/dma/swiotlb.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c20324fba814..c381a7ed718f 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -981,8 +981,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
dma_addr_t tbl_dma_addr =
phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
unsigned long max_slots = get_max_slots(boundary_mask);
- unsigned int iotlb_align_mask =
- dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
+ unsigned int iotlb_align_mask = dma_get_min_align_mask(dev);
unsigned int nslots = nr_slots(alloc_size), stride;
unsigned int offset = swiotlb_align_offset(dev, orig_addr);
unsigned int index, slots_checked, count = 0, i;
@@ -993,6 +992,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
BUG_ON(!nslots);
BUG_ON(area_index >= pool->nareas);
+ /*
+ * Ensure that the allocation is at least slot-aligned and update
+ * 'iotlb_align_mask' to ignore bits that will be preserved when
+ * offsetting into the allocation.
+ */
+ alloc_align_mask |= (IO_TLB_SIZE - 1);
+ iotlb_align_mask &= ~alloc_align_mask;
+
/*
* For mappings with an alignment requirement don't bother looping to
* unaligned slots once we found an aligned one.
--
2.44.0.rc1.240.g4c46232300-goog
From: Nicolin Chen <[email protected]>
The swiotlb does not support a mapping size > swiotlb_max_mapping_size().
On the other hand, with a 64KB PAGE_SIZE configuration, an NVME device is
observed to map sizes between 300KB and 512KB, which fail the swiotlb mapping
even though the default swiotlb pool has plenty of slots:
systemd[1]: Started Journal Service.
=> nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
note: journal-offline[392] exited with irqs disabled
note: journal-offline[392] exited with preempt_count 1
Call trace:
[ 3.099918] swiotlb_tbl_map_single+0x214/0x240
[ 3.099921] iommu_dma_map_page+0x218/0x328
[ 3.099928] dma_map_page_attrs+0x2e8/0x3a0
[ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme]
[ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme]
[ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600
[ 3.102321] blk_add_rq_to_plug+0x180/0x2a0
[ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8
[ 3.103463] __submit_bio+0x44/0x220
[ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360
[ 3.103470] submit_bio_noacct+0x180/0x6c8
[ 3.103471] submit_bio+0x34/0x130
[ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8
[ 3.104766] mpage_submit_folio+0xa0/0x100
[ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400
[ 3.104771] ext4_do_writepages+0x6a0/0xd78
[ 3.105615] ext4_writepages+0x80/0x118
[ 3.105616] do_writepages+0x90/0x1e8
[ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0
[ 3.105622] __filemap_fdatawrite_range+0x68/0xb8
[ 3.106656] file_write_and_wait_range+0x84/0x120
[ 3.106658] ext4_sync_file+0x7c/0x4c0
[ 3.106660] vfs_fsync_range+0x3c/0xa8
[ 3.106663] do_fsync+0x44/0xc0
Since untrusted devices might go down the swiotlb pathway with dma-iommu,
these devices should not map a size larger than swiotlb_max_mapping_size.
To fix this bug, add iommu_dma_max_mapping_size() so that untrusted devices
are capped at swiotlb_max_mapping_size() rather than relying only on the
iova_rcache_range() value returned by iommu_dma_opt_mapping_size().
Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
Signed-off-by: Nicolin Chen <[email protected]>
Link: https://lore.kernel.org/r/ee51a3a5c32cf885b18f6416171802669f4a718a.1707851466.git.nicolinc@nvidia.com
Acked-by: Robin Murphy <[email protected]>
[will: Drop redundant is_swiotlb_active(dev) check]
Signed-off-by: Will Deacon <[email protected]>
---
drivers/iommu/dma-iommu.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 50ccc4f1ef81..639efa0c4072 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1706,6 +1706,14 @@ static size_t iommu_dma_opt_mapping_size(void)
return iova_rcache_range();
}
+static size_t iommu_dma_max_mapping_size(struct device *dev)
+{
+ if (dev_is_untrusted(dev))
+ return swiotlb_max_mapping_size(dev);
+
+ return SIZE_MAX;
+}
+
static const struct dma_map_ops iommu_dma_ops = {
.flags = DMA_F_PCI_P2PDMA_SUPPORTED,
.alloc = iommu_dma_alloc,
@@ -1728,6 +1736,7 @@ static const struct dma_map_ops iommu_dma_ops = {
.unmap_resource = iommu_dma_unmap_resource,
.get_merge_boundary = iommu_dma_get_merge_boundary,
.opt_mapping_size = iommu_dma_opt_mapping_size,
+ .max_mapping_size = iommu_dma_max_mapping_size,
};
/*
--
2.44.0.rc1.240.g4c46232300-goog
When allocating pages from a restricted DMA pool in swiotlb_alloc(),
the buffer address is blindly converted to a 'struct page *' that is
returned to the caller. In the unlikely event of an allocation bug,
page-unaligned addresses are not detected and slots can silently be
double-allocated.
Add a simple check of the buffer alignment in swiotlb_alloc() to make
debugging a little easier if something has gone wonky.
Cc: Christoph Hellwig <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Dexuan Cui <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
kernel/dma/swiotlb.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 2ec2cc81f1a2..ab7fbb40bc55 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -1643,6 +1643,12 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
return NULL;
tlb_addr = slot_addr(pool->start, index);
+ if (unlikely(!PAGE_ALIGNED(tlb_addr))) {
+ dev_WARN_ONCE(dev, 1, "Cannot allocate pages from non page-aligned swiotlb addr 0x%pa.\n",
+ &tlb_addr);
+ swiotlb_release_slots(dev, tlb_addr);
+ return NULL;
+ }
return pfn_to_page(PFN_DOWN(tlb_addr));
}
--
2.44.0.rc1.240.g4c46232300-goog
core-api/dma-api-howto.rst states the following properties of
dma_alloc_coherent():
| The CPU virtual address and the DMA address are both guaranteed to
| be aligned to the smallest PAGE_SIZE order which is greater than or
| equal to the requested size.
However, swiotlb_alloc() passes zero for the 'alloc_align_mask'
parameter of swiotlb_find_slots() and so this property is not upheld.
Instead, allocations larger than a page are aligned only to PAGE_SIZE.
Calculate the mask corresponding to the page order suitable for holding
the allocation and pass that to swiotlb_find_slots().
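For example, here is a standalone model of the computation (assuming 4KiB
pages; an illustration, not code taken from the patch):
#define PAGE_SHIFT	12
/* Mimics get_order(): smallest order of pages that can hold 'size'. */
static int order_for(unsigned long size)
{
	unsigned long pages = (size + (1UL << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
	int order = 0;
	while ((1UL << order) < pages)
		order++;
	return order;
}
/* A 20KiB request gives order 3, so the mask is 0x7fff (32KiB alignment). */
static unsigned int align_for(unsigned long size)
{
	return (1U << (order_for(size) + PAGE_SHIFT)) - 1;
}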
Cc: Christoph Hellwig <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Dexuan Cui <[email protected]>
Fixes: e81e99bacc9f ("swiotlb: Support aligned swiotlb buffers")
Reviewed-by: Petr Tesarik <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
kernel/dma/swiotlb.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index ab7fbb40bc55..c20324fba814 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -1633,12 +1633,14 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
struct io_tlb_pool *pool;
phys_addr_t tlb_addr;
+ unsigned int align;
int index;
if (!mem)
return NULL;
- index = swiotlb_find_slots(dev, 0, size, 0, &pool);
+ align = (1 << (get_order(size) + PAGE_SHIFT)) - 1;
+ index = swiotlb_find_slots(dev, 0, size, align, &pool);
if (index == -1)
return NULL;
--
2.44.0.rc1.240.g4c46232300-goog
For swiotlb allocations >= PAGE_SIZE, the slab search historically
adjusted the stride to avoid checking unaligned slots. However, this is
no longer needed now that the code around it has evolved and the
stride is calculated from the required alignment.
Either 'alloc_align_mask' is used to specify the allocation alignment or
the DMA 'min_align_mask' is used to align the allocation with 'orig_addr'.
At least one of these masks is always non-zero.
In light of that, remove the redundant (and slightly confusing) check.
Link: https://lore.kernel.org/r/SN6PR02MB4157089980E6FC58D5557BCED4572@SN6PR02MB4157.namprd02.prod.outlook.com
Reported-by: Michael Kelley <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
kernel/dma/swiotlb.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c381a7ed718f..0d8805569f5e 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -1006,13 +1006,6 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
*/
stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
- /*
- * For allocations of PAGE_SIZE or larger only look for page aligned
- * allocations.
- */
- if (alloc_size >= PAGE_SIZE)
- stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
-
spin_lock_irqsave(&area->lock, flags);
if (unlikely(nslots > pool->area_nslabs - area->used))
goto not_found;
--
2.44.0.rc1.240.g4c46232300-goog
From: Will Deacon <[email protected]> Sent: Wednesday, February 28, 2024 5:39 AM
>
> Commit bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix"),
> which was a fix for commit 0eee5ae10256 ("swiotlb: fix slot alignment
> checks"), causes a functional regression with vsock in a virtual machine
> using bouncing via a restricted DMA SWIOTLB pool.
>
> When virtio allocates the virtqueues for the vsock device using
> dma_alloc_coherent(), the SWIOTLB search can return page-unaligned
> allocations if 'area->index' was left unaligned by a previous allocation
> from the buffer:
>
> # Final address in brackets is the SWIOTLB address returned to the caller
> | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask
> 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800)
> | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask
> 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800)
> | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask
> 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800)
>
> This ends badly (typically buffer corruption and/or a hang) because
> swiotlb_alloc() is expecting a page-aligned allocation and so blindly
> returns a pointer to the 'struct page' corresponding to the allocation,
> therefore double-allocating the first half (2KiB slot) of the 4KiB page.
>
> Fix the problem by treating the allocation alignment separately to any
> additional alignment requirements from the device, using the maximum
> of the two as the stride to search the buffer slots and taking care
> to ensure a minimum of page-alignment for buffers larger than a page.
>
> This also resolves swiotlb allocation failures occurring due to the
> inclusion of ~PAGE_MASK in 'iotlb_align_mask' for large allocations and
> resulting in alignment requirements exceeding swiotlb_max_mapping_size().
>
> Fixes: bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix")
> Fixes: 0eee5ae10256 ("swiotlb: fix slot alignment checks")
> Cc: Christoph Hellwig <[email protected]>
> Cc: Marek Szyprowski <[email protected]>
> Cc: Robin Murphy <[email protected]>
> Cc: Dexuan Cui <[email protected]>
> Reviewed-by: Petr Tesarik <[email protected]>
> Tested-by: Nicolin Chen <[email protected]>
> Signed-off-by: Will Deacon <[email protected]>
> ---
> kernel/dma/swiotlb.c | 28 +++++++++++++++-------------
> 1 file changed, 15 insertions(+), 13 deletions(-)
>
Reviewed-by: Michael Kelley <[email protected]>
From: Will Deacon <[email protected]> Sent: Wednesday, February 28, 2024 5:39 AM
>
> From: Nicolin Chen <[email protected]>
>
> The swiotlb does not support a mapping size > swiotlb_max_mapping_size().
> On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that
> an NVME device can map a size between 300KB~512KB, which certainly failed
> the swiotlb mappings, though the default pool of swiotlb has many slots:
> systemd[1]: Started Journal Service.
> => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
> note: journal-offline[392] exited with irqs disabled
> note: journal-offline[392] exited with preempt_count 1
>
> Call trace:
> [ 3.099918] swiotlb_tbl_map_single+0x214/0x240
> [ 3.099921] iommu_dma_map_page+0x218/0x328
> [ 3.099928] dma_map_page_attrs+0x2e8/0x3a0
> [ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme]
> [ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme]
> [ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600
> [ 3.102321] blk_add_rq_to_plug+0x180/0x2a0
> [ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8
> [ 3.103463] __submit_bio+0x44/0x220
> [ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360
> [ 3.103470] submit_bio_noacct+0x180/0x6c8
> [ 3.103471] submit_bio+0x34/0x130
> [ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8
> [ 3.104766] mpage_submit_folio+0xa0/0x100
> [ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400
> [ 3.104771] ext4_do_writepages+0x6a0/0xd78
> [ 3.105615] ext4_writepages+0x80/0x118
> [ 3.105616] do_writepages+0x90/0x1e8
> [ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0
> [ 3.105622] __filemap_fdatawrite_range+0x68/0xb8
> [ 3.106656] file_write_and_wait_range+0x84/0x120
> [ 3.106658] ext4_sync_file+0x7c/0x4c0
> [ 3.106660] vfs_fsync_range+0x3c/0xa8
> [ 3.106663] do_fsync+0x44/0xc0
>
> Since untrusted devices might go down the swiotlb pathway with dma-iommu,
> these devices should not map a size larger than swiotlb_max_mapping_size.
>
> To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to
> take into account swiotlb_max_mapping_size() v.s. iova_rcache_range() from
> the iommu_dma_opt_mapping_size().
>
> Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
> Signed-off-by: Nicolin Chen <[email protected]>
> Link: https://lore.kernel.org/all/ee51a3a5c32cf885b18f6416171802669f4a718a.1707851466.git.nicolinc@nvidia.com/
> Acked-by: Robin Murphy <[email protected]>
> [will: Drop redundant is_swiotlb_active(dev) check]
> Signed-off-by: Will Deacon <[email protected]>
> ---
> drivers/iommu/dma-iommu.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 50ccc4f1ef81..639efa0c4072 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1706,6 +1706,14 @@ static size_t iommu_dma_opt_mapping_size(void)
> return iova_rcache_range();
> }
>
> +static size_t iommu_dma_max_mapping_size(struct device *dev)
> +{
> + if (dev_is_untrusted(dev))
> + return swiotlb_max_mapping_size(dev);
> +
> + return SIZE_MAX;
> +}
> +
> static const struct dma_map_ops iommu_dma_ops = {
> .flags = DMA_F_PCI_P2PDMA_SUPPORTED,
> .alloc = iommu_dma_alloc,
> @@ -1728,6 +1736,7 @@ static const struct dma_map_ops iommu_dma_ops = {
> .unmap_resource = iommu_dma_unmap_resource,
> .get_merge_boundary = iommu_dma_get_merge_boundary,
> .opt_mapping_size = iommu_dma_opt_mapping_size,
> + .max_mapping_size = iommu_dma_max_mapping_size,
> };
>
> /*
> --
> 2.44.0.rc1.240.g4c46232300-goog
Reviewed-by: Michael Kelley <[email protected]>
From: Will Deacon <[email protected]> Sent: Wednesday, February 28, 2024 5:40 AM
>
> For swiotlb allocations >= PAGE_SIZE, the slab search historically
> adjusted the stride to avoid checking unaligned slots. However, this is
> no longer needed now that the code around it has evolved and the
> stride is calculated from the required alignment.
>
> Either 'alloc_align_mask' is used to specify the allocation alignment or
> the DMA 'min_align_mask' is used to align the allocation with 'orig_addr'.
> At least one of these masks is always non-zero.
I think the patch is correct, but this justification is not. alloc_align_mask
and the DMA min_align_mask are often both zero. While the NVMe
PCI driver sets min_align_mask, SCSI disk drivers do not (except for the
Hyper-V synthetic SCSI driver). When both are zero, presumably
there are no alignment requirements, so a stride of 1 is appropriate.
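As a standalone illustration of the resulting computation (modelled on the
code above, not the kernel source itself):
#define IO_TLB_SHIFT	11
/* get_max_slots(mask) reduces to (mask >> IO_TLB_SHIFT) + 1 for the
 * power-of-two-minus-one masks used here. */
static unsigned int stride_model(unsigned long alloc_align_mask,
				 unsigned long iotlb_align_mask)
{
	unsigned long mask = alloc_align_mask > iotlb_align_mask ?
			     alloc_align_mask : iotlb_align_mask;
	return (mask >> IO_TLB_SHIFT) + 1;	/* both masks zero => stride 1 */
}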
Michael
>
> In light of that, remove the redundant (and slightly confusing) check.
>
> Link: https://lore.kernel.org/all/SN6PR02MB4157089980E6FC58D5557BCED4572@SN6PR02MB4157.namprd02.prod.outlook.com/
> Reported-by: Michael Kelley <[email protected]>
> Signed-off-by: Will Deacon <[email protected]>
> ---
> kernel/dma/swiotlb.c | 7 -------
> 1 file changed, 7 deletions(-)
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index c381a7ed718f..0d8805569f5e 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -1006,13 +1006,6 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> */
> stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
>
> - /*
> - * For allocations of PAGE_SIZE or larger only look for page aligned
> - * allocations.
> - */
> - if (alloc_size >= PAGE_SIZE)
> - stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
> -
> spin_lock_irqsave(&area->lock, flags);
> if (unlikely(nslots > pool->area_nslabs - area->used))
> goto not_found;
> --
> 2.44.0.rc1.240.g4c46232300-goog
> From: Will Deacon <[email protected]> Sent: Wednesday, February 28, 2024 5:40 AM
> >
> > For swiotlb allocations >= PAGE_SIZE, the slab search historically
> > adjusted the stride to avoid checking unaligned slots. However, this is
> > no longer needed now that the code around it has evolved and the
> > stride is calculated from the required alignment.
> >
> > Either 'alloc_align_mask' is used to specify the allocation alignment or
> > the DMA 'min_align_mask' is used to align the allocation with 'orig_addr'.
> > At least one of these masks is always non-zero.
>
> I think the patch is correct, but this justification is not. alloc_align_mask
> and the DMA min_align_mask are often both zero. While the NVMe
> PCI driver sets min_align_mask, SCSI disk drivers do not (except for the
> Hyper-V synthetic SCSI driver). When both are zero, presumably
> there are no alignment requirements, so a stride of 1 is appropriate.
>
I did additional checking into the history of page alignment for alloc
sizes of a page or greater. The swiotlb code probably made sense
prior to commit 0eee5ae10256. This commit has this change:
slot_base = area_index * mem->area_nslabs;
- index = wrap_area_index(mem, ALIGN(area->index, stride));
+ index = area->index;
for (slots_checked = 0; slots_checked < mem->area_nslabs; ) {
slot_index = slot_base + index;
In the old code, once the PAGE_SIZE was factored into the stride, the
stride was used to align the starting index, so that the first slot checked
would be page aligned. But commit 0eee5ae10256 drops that use
of the stride and puts the page alignment in iotlb_align_mask, for
reasons explained in the commit message. But in Will's Patch 1
when the page alignment is no longer incorporated into
iotlb_align_mask (for good reason), page aligning the stride then
doesn't do anything useful, resulting in this patch.
If there *is* a requirement for page alignment of page-size-or-greater
requests, even when alloc_align_mask and min_align_mask are zero,
we need to think about how to do that correctly, as that requirement
is no longer met after Patch 1 of this series.
Michael
On Thu, Feb 29, 2024 at 07:36:05AM +0000, Michael Kelley wrote:
> If there *is* a requirement for page alignment of page-size-or-greater
> requests, even when alloc_align_mask and min_align_mask are zero,
> we need to think about how to do that correctly, as that requirement
> is no longer met after Patch 1 of this series.
It has been historical behavior that all dma allocations are page
aligned (including in fact smaller than page sized ones that get
rounded up to a page size). The documentation actually (incorrectly)
states an even stronger guarantee:
"The CPU virtual address and the DMA address are both
guaranteed to be aligned to the smallest PAGE_SIZE order which
is greater than or equal to the requested size. This invariant
exists (for example) to guarantee that if you allocate a chunk
which is smaller than or equal to 64 kilobytes, the extent of the
buffer you receive will not cross a 64K boundary."
From: Christoph Hellwig <[email protected]> Sent: Thursday, February 29, 2024 5:34 AM
>
> On Thu, Feb 29, 2024 at 07:36:05AM +0000, Michael Kelley wrote:
> > If there *is* a requirement for page alignment of page-size-or-greater
> > requests, even when alloc_align_mask and min_align_mask are zero,
> > we need to think about how to do that correctly, as that requirement
> > is no longer met after Patch 1 of this series.
>
> It has been historical behavior that all dma allocations are page
> aligned (including in fact smaller than page sized ones that get
> rounded up to a page size). The documentation actually (incorrectly)
> states an even stronger guarantee:
>
> "The CPU virtual address and the DMA address are both
> guaranteed to be aligned to the smallest PAGE_SIZE order which
> is greater than or equal to the requested size. This invariant
> exists (for example) to guarantee that if you allocate a chunk
> which is smaller than or equal to 64 kilobytes, the extent of the
> buffer you receive will not cross a 64K boundary."
Any thoughts on how that historical behavior should apply if
the DMA min_align_mask is non-zero, or the alloc_align_mask
parameter to swiotlb_tbl_map_single() is non-zero? As currently
used, alloc_align_mask is page aligned if the IOMMU granule is
>= PAGE_SIZE. But a non-zero min_align_mask could mandate
returning a buffer that is not page aligned. Perhaps do the
historical behavior only if alloc_align_mask and min_align_mask
are both zero?
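In code, that might look something like this (a sketch of the suggestion, not
a tested patch):
	if (alloc_size >= PAGE_SIZE && !alloc_align_mask &&
	    !dma_get_min_align_mask(dev))
		stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);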
Michael
On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote:
> Any thoughts on how that historical behavior should apply if
> the DMA min_align_mask is non-zero, or the alloc_align_mask
> parameter to swiotbl_tbl_map_single() is non-zero? As currently
> used, alloc_align_mask is page aligned if the IOMMU granule is
> >= PAGE_SIZE. But a non-zero min_align_mask could mandate
> returning a buffer that is not page aligned. Perhaps do the
> historical behavior only if alloc_align_mask and min_align_mask
> are both zero?
I think the driver setting min_align_mask is a clear indicator
that the driver requested a specific alignment and the defaults
don't apply. For swiotlb_tbl_map_single as used by dma-iommu
I'd have to take a closer look at how it is used.
On Thu, 29 Feb 2024 16:47:56 +0100
Christoph Hellwig <[email protected]> wrote:
> On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote:
> > Any thoughts on how that historical behavior should apply if
> > the DMA min_align_mask is non-zero, or the alloc_align_mask
> > parameter to swiotbl_tbl_map_single() is non-zero? As currently
> > used, alloc_align_mask is page aligned if the IOMMU granule is
> > >= PAGE_SIZE. But a non-zero min_align_mask could mandate
> > returning a buffer that is not page aligned. Perhaps do the
> > historical behavior only if alloc_align_mask and min_align_mask
> > are both zero?
>
> I think the driver setting min_align_mask is a clear indicator
> that the driver requested a specific alignment and the defaults
> don't apply. For swiotbl_tbl_map_single as used by dma-iommu
> I'd have to tak a closer look at how it is used.
I'm not sure it helps in this discussion, but let me dive into a bit
of ancient history to understand how we ended up here.
IIRC this behaviour was originally motivated by limitations of PC AT
hardware. Intel 8237 is a 16-bit DMA controller. To make it somehow
usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM
added a page register, but it was on a separate chip and it did not
increment when the 8237 address rolled over back to zero. Effectively,
the page register selected a 64K-aligned window of 64K buffers.
Consequently, DMA buffers could not cross a 64K physical boundary.
Thanks to how the buddy allocator works, the 64K-boundary constraint
was satisfied by allocation size, and drivers took advantage of it when
allocating device buffers. IMO software bounce buffers simply followed
the same logic that worked for buffers allocated by the buddy allocator.
OTOH min_align_mask was motivated by NVME which prescribes the value of
a certain number of low bits in the DMA address (for simplicity assumed
to be identical with the same bits in the physical address).
The only pre-existing user of alloc_align_mask is x86 IOMMU code, and
IIUC it is used to guarantee that unaligned transactions do not share
the IOMMU granule with another device. This whole thing is weird,
because swiotlb_tbl_map_single() is called like this:
aligned_size = iova_align(iovad, size);
phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
iova_mask(iovad), dir, attrs);
Here:
* alloc_size = iova_align(iovad, size)
* alloc_align_mask = iova_mask(iovad)
Now, iova_align() rounds up its argument to a multiple of iova granule
and iova_mask() is simply "granule - 1". This works, because granule
size must be a power of 2, and I assume it must also be >= PAGE_SIZE.
In that case, the alloc_align_mask argument is not even needed if you
adjust the code to match documentation---the resulting buffer will be
aligned to a granule boundary by virtue of having a size that is a
multiple of the granule size.
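For reference, the helpers boil down to roughly this (a simplified sketch of
the iova.h definitions, not a verbatim copy):
static inline unsigned long iova_mask(struct iova_domain *iovad)
{
	return iovad->granule - 1;
}
static inline size_t iova_align(struct iova_domain *iovad, size_t size)
{
	return ALIGN(size, iovad->granule);	/* round up to a granule multiple */
}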
To sum it up:
1. min_align_mask is by far the most important constraint. Devices will
simply stop working if it is not met.
2. Alignment to the smallest PAGE_SIZE order which is greater than or
equal to the requested size has been documented, and some drivers
may rely on it.
3. alloc_align_mask is a misguided fix for a bug in the above.
Correct me if anything of the above is wrong.
HTH
Petr T
On 2024-03-01 3:39 pm, Petr Tesařík wrote:
> On Thu, 29 Feb 2024 16:47:56 +0100
> Christoph Hellwig <[email protected]> wrote:
>
>> On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote:
>>> Any thoughts on how that historical behavior should apply if
>>> the DMA min_align_mask is non-zero, or the alloc_align_mask
>>> parameter to swiotbl_tbl_map_single() is non-zero? As currently
>>> used, alloc_align_mask is page aligned if the IOMMU granule is
>>> >= PAGE_SIZE. But a non-zero min_align_mask could mandate
>>> returning a buffer that is not page aligned. Perhaps do the
>>> historical behavior only if alloc_align_mask and min_align_mask
>>> are both zero?
>>
>> I think the driver setting min_align_mask is a clear indicator
>> that the driver requested a specific alignment and the defaults
>> don't apply. For swiotbl_tbl_map_single as used by dma-iommu
>> I'd have to tak a closer look at how it is used.
>
> I'm not sure it helps in this discussion, but let me dive into a bit
> of ancient history to understand how we ended up here.
>
> IIRC this behaviour was originally motivated by limitations of PC AT
> hardware. Intel 8237 is a 16-bit DMA controller. To make it somehow
> usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM
> added a page register, but it was on a separate chip and it did not
> increment when the 8237 address rolled over back to zero. Effectively,
> the page register selected a 64K-aligned window of 64K buffers.
> Consequently, DMA buffers could not cross a 64K physical boundary.
>
> Thanks to how the buddy allocator works, the 64K-boundary constraint
> was satisfied by allocation size, and drivers took advantage of it when
> allocating device buffers. IMO software bounce buffers simply followed
> the same logic that worked for buffers allocated by the buddy allocator.
>
> OTOH min_align_mask was motivated by NVME which prescribes the value of
> a certain number of low bits in the DMA address (for simplicity assumed
> to be identical with the same bits in the physical address).
>
> The only pre-existing user of alloc_align_mask is x86 IOMMU code, and
> IIUC it is used to guarantee that unaligned transactions do not share
> the IOMMU granule with another device. This whole thing is weird,
> because swiotlb_tbl_map_single() is called like this:
>
> aligned_size = iova_align(iovad, size);
> phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
> iova_mask(iovad), dir, attrs);
>
> Here:
>
> * alloc_size = iova_align(iovad, size)
> * alloc_align_mask = iova_mask(iovad)
>
> Now, iova_align() rounds up its argument to a multiple of iova granule
> and iova_mask() is simply "granule - 1". This works, because granule
> size must be a power of 2, and I assume it must also be >= PAGE_SIZE.
Not quite, the granule must *not* be larger than PAGE_SIZE (but can be
smaller).
> In that case, the alloc_align_mask argument is not even needed if you
> adjust the code to match documentation---the resulting buffer will be
> aligned to a granule boundary by virtue of having a size that is a
> multiple of the granule size.
I think we've conflated two different notions of "allocation" here. The
page-alignment which Christoph quoted applies for dma_alloc_coherent(),
which nowadays *should* only be relevant to SWIOTLB code in the
swiotlb_alloc() path for stealing coherent pages out of restricted DMA
pools. Historically there was never any official alignment requirement
for the DMA addresses returned by dma_map_{page,sg}(), until
min_align_mask was introduced.
> To sum it up:
>
> 1. min_align_mask is by far the most important constraint. Devices will
> simply stop working if it is not met.
> 2. Alignment to the smallest PAGE_SIZE order which is greater than or
> equal to the requested size has been documented, and some drivers
> may rely on it.
Strictly it stopped being necessary since fafadcd16595 when the
historical swiotlb_alloc() was removed, but 602d9858f07c implies that
indeed the assumption of it for streaming mappings had already set in.
Of course NVMe is now using min_align_mask, but I'm not sure if it's
worth taking the risk to find out who else should also be.
> 3. alloc_align_mask is a misguided fix for a bug in the above.
No, alloc_align_mask is about describing the explicit requirement rather
than relying on any implicit behaviour, and thus being able to do the
optimal thing for, say, a 9KB mapping given a 4KB IOVA granule and 64KB
PAGE_SIZE.
Thanks,
Robin.
>
> Correct me if anything of the above is wrong.
>
> HTH
> Petr T
On Fri, 1 Mar 2024 16:39:27 +0100
Petr Tesařík <[email protected]> wrote:
> On Thu, 29 Feb 2024 16:47:56 +0100
> Christoph Hellwig <[email protected]> wrote:
>
> > On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote:
> > > Any thoughts on how that historical behavior should apply if
> > > the DMA min_align_mask is non-zero, or the alloc_align_mask
> > > parameter to swiotbl_tbl_map_single() is non-zero? As currently
> > > used, alloc_align_mask is page aligned if the IOMMU granule is
> > > >= PAGE_SIZE. But a non-zero min_align_mask could mandate
> > > returning a buffer that is not page aligned. Perhaps do the
> > > historical behavior only if alloc_align_mask and min_align_mask
> > > are both zero?
> >
> > I think the driver setting min_align_mask is a clear indicator
> > that the driver requested a specific alignment and the defaults
> > don't apply. For swiotbl_tbl_map_single as used by dma-iommu
> > I'd have to tak a closer look at how it is used.
>
> I'm not sure it helps in this discussion, but let me dive into a bit
> of ancient history to understand how we ended up here.
>
> IIRC this behaviour was originally motivated by limitations of PC AT
> hardware. Intel 8237 is a 16-bit DMA controller. To make it somehow
> usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM
> added a page register, but it was on a separate chip and it did not
> increment when the 8237 address rolled over back to zero. Effectively,
> the page register selected a 64K-aligned window of 64K buffers.
> Consequently, DMA buffers could not cross a 64K physical boundary.
>
> Thanks to how the buddy allocator works, the 64K-boundary constraint
> was satisfied by allocation size, and drivers took advantage of it when
> allocating device buffers. IMO software bounce buffers simply followed
> the same logic that worked for buffers allocated by the buddy allocator.
>
> OTOH min_align_mask was motivated by NVME which prescribes the value of
> a certain number of low bits in the DMA address (for simplicity assumed
> to be identical with the same bits in the physical address).
>
> The only pre-existing user of alloc_align_mask is x86 IOMMU code, and
> IIUC it is used to guarantee that unaligned transactions do not share
> the IOMMU granule with another device. This whole thing is weird,
> because swiotlb_tbl_map_single() is called like this:
>
> aligned_size = iova_align(iovad, size);
> phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
> iova_mask(iovad), dir, attrs);
>
> Here:
>
> * alloc_size = iova_align(iovad, size)
> * alloc_align_mask = iova_mask(iovad)
>
> Now, iova_align() rounds up its argument to a multiple of iova granule
> and iova_mask() is simply "granule - 1". This works, because granule
> size must be a power of 2, and I assume it must also be >= PAGE_SIZE.
>
> In that case, the alloc_align_mask argument is not even needed if you
> adjust the code to match documentation---the resulting buffer will be
> aligned to a granule boundary by virtue of having a size that is a
> multiple of the granule size.
>
> To sum it up:
>
> 1. min_align_mask is by far the most important constraint. Devices will
> simply stop working if it is not met.
> 2. Alignment to the smallest PAGE_SIZE order which is greater than or
> equal to the requested size has been documented, and some drivers
> may rely on it.
> 3. alloc_align_mask is a misguided fix for a bug in the above.
>
> Correct me if anything of the above is wrong.
I thought about it some more, and I believe I know what should happen
if the first two constraints appear to be mutually exclusive.
First, the alignment based on size does not guarantee that the resulting
physical address is aligned. In fact, the lowest IO_TLB_SHIFT bits must
be always identical to the original buffer address.
Let's take an example request like this:
TLB_SIZE = 0x00000800
min_align_mask = 0x0000ffff
orig_addr = 0x....1234
alloc_size = 0x00002800
Minimum alignment mask requires to keep the 1234 at the end. Allocation
size requires a buffer that is aligned to 16K. Of course, there is no
16K-aligned slot with slot_address & 0x7ff == 0x200, but if IO_TLB_SHIFT
was 14, it would be slot_address & 0x3fff == 0 (low IO_TLB_SHIFT are
masked off). Since the SWIOTLB API does not guarantee any specific
value of IO_TLB_SHIFT, callers cannot rely on it. That means 0x1234 is a
perfectly valid bounce buffer address for this example.
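As a standalone calculation of the same example (the slot address is an
assumption for illustration; this is not kernel code):
static unsigned long bounce_addr_model(void)
{
	unsigned long io_tlb_size    = 0x800UL;		/* 2KiB slots */
	unsigned long min_align_mask = 0xffffUL;
	unsigned long orig_addr      = 0x98761234UL;	/* ends in ...1234 */
	/* The low slot bits are kept as the in-slot offset... */
	unsigned long offset     = orig_addr & (io_tlb_size - 1);	/* 0x234 */
	/* ...and only the min_align_mask bits above them must match. */
	unsigned long match_bits = min_align_mask & ~(io_tlb_size - 1);	/* 0xf800 */
	unsigned long slot_addr  = 0x12341000UL; /* assumed slot; & 0xf800 == 0x1000 */
	if ((slot_addr & match_bits) == (orig_addr & match_bits))
		return slot_addr + offset;	/* 0x12341234, ends in ...1234 */
	return 0;
}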
The caller may rightfully expect that the 16K granule containing the
bounce buffer is not shared with any other user. For the above case I
suggest to increase the allocation size to 0x4000 already in
swiotlb_tbl_map_single() and treat 0x1234 as the offset from the slot
address.
Will, do you want to incorporate it into your patch series, or should I
send the corresponding patch?
Petr T
On Fri, 1 Mar 2024 16:38:33 +0000
Robin Murphy <[email protected]> wrote:
> On 2024-03-01 3:39 pm, Petr Tesařík wrote:
> > On Thu, 29 Feb 2024 16:47:56 +0100
> > Christoph Hellwig <[email protected]> wrote:
> >
> >> On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote:
> >>> Any thoughts on how that historical behavior should apply if
> >>> the DMA min_align_mask is non-zero, or the alloc_align_mask
> >>> parameter to swiotbl_tbl_map_single() is non-zero? As currently
> >>> used, alloc_align_mask is page aligned if the IOMMU granule is
> >>> >= PAGE_SIZE. But a non-zero min_align_mask could mandate
> >>> returning a buffer that is not page aligned. Perhaps do the
> >>> historical behavior only if alloc_align_mask and min_align_mask
> >>> are both zero?
> >>
> >> I think the driver setting min_align_mask is a clear indicator
> >> that the driver requested a specific alignment and the defaults
> >> don't apply. For swiotbl_tbl_map_single as used by dma-iommu
> >> I'd have to tak a closer look at how it is used.
> >
> > I'm not sure it helps in this discussion, but let me dive into a bit
> > of ancient history to understand how we ended up here.
> >
> > IIRC this behaviour was originally motivated by limitations of PC AT
> > hardware. Intel 8237 is a 16-bit DMA controller. To make it somehow
> > usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM
> > added a page register, but it was on a separate chip and it did not
> > increment when the 8237 address rolled over back to zero. Effectively,
> > the page register selected a 64K-aligned window of 64K buffers.
> > Consequently, DMA buffers could not cross a 64K physical boundary.
> >
> > Thanks to how the buddy allocator works, the 64K-boundary constraint
> > was satisfied by allocation size, and drivers took advantage of it when
> > allocating device buffers. IMO software bounce buffers simply followed
> > the same logic that worked for buffers allocated by the buddy allocator.
> >
> > OTOH min_align_mask was motivated by NVME which prescribes the value of
> > a certain number of low bits in the DMA address (for simplicity assumed
> > to be identical with the same bits in the physical address).
> >
> > The only pre-existing user of alloc_align_mask is x86 IOMMU code, and
> > IIUC it is used to guarantee that unaligned transactions do not share
> > the IOMMU granule with another device. This whole thing is weird,
> > because swiotlb_tbl_map_single() is called like this:
> >
> > aligned_size = iova_align(iovad, size);
> > phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
> > iova_mask(iovad), dir, attrs);
> >
> > Here:
> >
> > * alloc_size = iova_align(iovad, size)
> > * alloc_align_mask = iova_mask(iovad)
> >
> > Now, iova_align() rounds up its argument to a multiple of iova granule
> > and iova_mask() is simply "granule - 1". This works, because granule
> > size must be a power of 2, and I assume it must also be >= PAGE_SIZE.
>
> Not quite, the granule must *not* be larger than PAGE_SIZE (but can be
> smaller).
OK... I must rethink what is achieved with the alignment then.
> > In that case, the alloc_align_mask argument is not even needed if you
> > adjust the code to match documentation---the resulting buffer will be
> > aligned to a granule boundary by virtue of having a size that is a
> > multiple of the granule size.
>
> I think we've conflated two different notions of "allocation" here. The
> page-alignment which Christoph quoted applies for dma_alloc_coherent(),
> which nowadays *should* only be relevant to SWIOTLB code in the
> swiotlb_alloc() path for stealing coherent pages out of restricted DMA
> pools. Historically there was never any official alignment requirement
> for the DMA addresses returned by dma_map_{page,sg}(), until
> min_align_mask was introduced.
>
> > To sum it up:
> >
> > 1. min_align_mask is by far the most important constraint. Devices will
> > simply stop working if it is not met.
> > 2. Alignment to the smallest PAGE_SIZE order which is greater than or
> > equal to the requested size has been documented, and some drivers
> > may rely on it.
>
> Strictly it stopped being necessary since fafadcd16595 when the
> historical swiotlb_alloc() was removed, but 602d9858f07c implies that
> indeed the assumption of it for streaming mappings had already set in.
> Of course NVMe is now using min_align_mask, but I'm not sure if it's
> worth taking the risk to find out who else should also be.
>
> > 3. alloc_align_mask is a misguided fix for a bug in the above.
>
> No, alloc_align_mask is about describing the explicit requirement rather
> than relying on any implicit behaviour, and thus being able to do the
> optimal thing for, say, a 9KB mapping given a 4KB IOVA granule and 64KB
> PAGE_SIZE.
It's getting confusing. Can we stay with the IOMMU use case for a
moment? Since you don't use IO_TLB_SIZE or IO_TLB_SHIFT, I assume the
code should work with any SWIOTLB slot size.
For granule sizes up to SWIOTLB slot size, you always get a buffer that
is not shared with any other granule.
For granule sizes between SWIOTLB slot size and page size, you want to
get as many slots as needed to cover the whole granule. Indeed, that
would not be achieved with size alone.
What if we change the size-based alignment constraint to:
Aligned to the smallest IO_TLB_SIZE order which is greater than or
equal to the requested size.
AFAICS this is stricter, i.e. it covers the already documented
PAGE_SIZE alignment, but it would also cover IOMMU needs.
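One way to express that as a mask (a sketch assuming IO_TLB_SIZE is a power of
two; not a proposed patch):
/* Mask for the smallest power-of-two multiple of IO_TLB_SIZE that can hold
 * the allocation, e.g. 0x3fff (16KiB - 1) for a 9KB request with 2KiB slots. */
static unsigned long size_align_mask(unsigned long alloc_size)
{
	unsigned long granule = IO_TLB_SIZE;
	while (granule < alloc_size)
		granule <<= 1;
	return granule - 1;
}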
I have the feeling that the slot search has too many parameters. This
alloc_align_mask has only one in-tree user, so it's a logical candidate
for reduction.
Petr T
>
> Thanks,
> Robin.
>
> >
> > Correct me if anything of the above is wrong.
> >
> > HTH
> > Petr T
>
On 2024-03-01 5:08 pm, Petr Tesařík wrote:
> On Fri, 1 Mar 2024 16:39:27 +0100
> Petr Tesařík <[email protected]> wrote:
>
>> On Thu, 29 Feb 2024 16:47:56 +0100
>> Christoph Hellwig <[email protected]> wrote:
>>
>>> On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote:
>>>> Any thoughts on how that historical behavior should apply if
>>>> the DMA min_align_mask is non-zero, or the alloc_align_mask
>>>> parameter to swiotbl_tbl_map_single() is non-zero? As currently
>>>> used, alloc_align_mask is page aligned if the IOMMU granule is
>>>> >= PAGE_SIZE. But a non-zero min_align_mask could mandate
>>>> returning a buffer that is not page aligned. Perhaps do the
>>>> historical behavior only if alloc_align_mask and min_align_mask
>>>> are both zero?
>>>
>>> I think the driver setting min_align_mask is a clear indicator
>>> that the driver requested a specific alignment and the defaults
>>> don't apply. For swiotbl_tbl_map_single as used by dma-iommu
>>> I'd have to tak a closer look at how it is used.
>>
>> I'm not sure it helps in this discussion, but let me dive into a bit
>> of ancient history to understand how we ended up here.
>>
>> IIRC this behaviour was originally motivated by limitations of PC AT
>> hardware. Intel 8237 is a 16-bit DMA controller. To make it somehow
>> usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM
>> added a page register, but it was on a separate chip and it did not
>> increment when the 8237 address rolled over back to zero. Effectively,
>> the page register selected a 64K-aligned window of 64K buffers.
>> Consequently, DMA buffers could not cross a 64K physical boundary.
>>
>> Thanks to how the buddy allocator works, the 64K-boundary constraint
>> was satisfied by allocation size, and drivers took advantage of it when
>> allocating device buffers. IMO software bounce buffers simply followed
>> the same logic that worked for buffers allocated by the buddy allocator.
>>
>> OTOH min_align_mask was motivated by NVME which prescribes the value of
>> a certain number of low bits in the DMA address (for simplicity assumed
>> to be identical with the same bits in the physical address).
>>
>> The only pre-existing user of alloc_align_mask is x86 IOMMU code, and
>> IIUC it is used to guarantee that unaligned transactions do not share
>> the IOMMU granule with another device. This whole thing is weird,
>> because swiotlb_tbl_map_single() is called like this:
>>
>> aligned_size = iova_align(iovad, size);
>> phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
>> iova_mask(iovad), dir, attrs);
>>
>> Here:
>>
>> * alloc_size = iova_align(iovad, size)
>> * alloc_align_mask = iova_mask(iovad)
>>
>> Now, iova_align() rounds up its argument to a multiple of iova granule
>> and iova_mask() is simply "granule - 1". This works, because granule
>> size must be a power of 2, and I assume it must also be >= PAGE_SIZE.
>>
>> In that case, the alloc_align_mask argument is not even needed if you
>> adjust the code to match documentation---the resulting buffer will be
>> aligned to a granule boundary by virtue of having a size that is a
>> multiple of the granule size.
>>
>> To sum it up:
>>
>> 1. min_align_mask is by far the most important constraint. Devices will
>> simply stop working if it is not met.
>> 2. Alignment to the smallest PAGE_SIZE order which is greater than or
>> equal to the requested size has been documented, and some drivers
>> may rely on it.
>> 3. alloc_align_mask is a misguided fix for a bug in the above.
>>
>> Correct me if anything of the above is wrong.
>
> I thought about it some more, and I believe I know what should happen
> if the first two constraints appear to be mutually exclusive.
>
> First, the alignment based on size does not guarantee that the resulting
> physical address is aligned. In fact, the lowest IO_TLB_SHIFT bits must
> be always identical to the original buffer address.
>
> Let's take an example request like this:
>
> TLB_SIZE = 0x00000800
> min_align_mask = 0x0000ffff
> orig_addr = 0x....1234
> alloc_size = 0x00002800
>
> Minimum alignment mask requires to keep the 1234 at the end. Allocation
> size requires a buffer that is aligned to 16K. Of course, there is no
> 16K-aligned slot with slot_address & 0x7ff == 0x200, but if IO_TLB_SHIFT
> was 14, it would be slot_address & 0x3fff == 0 (low IO_TLB_SHIFT are
> masked off). Since the SWIOTLB API does not guarantee any specific
> value of IO_TLB_SHIFT, callers cannot rely on it. That means 0x1234 is a
> perfectly valid bounce buffer address for this example.
>
> The caller may rightfully expect that the 16K granule containing the
> bounce buffer is not shared with any other user. For the above case I
> suggest to increase the allocation size to 0x4000 already in
> swiotlb_tbl_map_single() and treat 0x1234 as the offset from the slot
> address.
That doesn't make sense - a caller asks to map some range of kernel
addresses and they get back a corresponding range of DMA addresses; they
cannot make any reasonable assumptions about DMA addresses *outside*
that range. In the example above, the caller has explicitly chosen not
to map the range xxx0000-xxx1234; if they expect the device to actually
access bytes in the DMA range yyy0000-yyy1234, then they should have
mapped the whole range starting from xxx0000 and it is their own error.
SWIOTLB does not and cannot provide any memory protection itself, so
there is no functional benefit to automatically over-allocating, all it
will do is waste slots. iommu-dma *can* provide memory protection
between individual mappings via additional layers that SWIOTLB doesn't
know about, so in that case it's iommu-dma's responsibility to
explicitly manage whatever over-allocation is necessary at the SWIOTLB
level to match the IOMMU level.
Thanks,
Robin.
On Fri, 1 Mar 2024 17:54:06 +0000
Robin Murphy <[email protected]> wrote:
> On 2024-03-01 5:08 pm, Petr Tesařík wrote:
> > On Fri, 1 Mar 2024 16:39:27 +0100
> > Petr Tesařík <[email protected]> wrote:
> >
> >> On Thu, 29 Feb 2024 16:47:56 +0100
> >> Christoph Hellwig <[email protected]> wrote:
> >>
> >>> On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote:
> >>>> Any thoughts on how that historical behavior should apply if
> >>>> the DMA min_align_mask is non-zero, or the alloc_align_mask
> >>>> parameter to swiotbl_tbl_map_single() is non-zero? As currently
> >>>> used, alloc_align_mask is page aligned if the IOMMU granule is
> >>>> >= PAGE_SIZE. But a non-zero min_align_mask could mandate
> >>>> returning a buffer that is not page aligned. Perhaps do the
> >>>> historical behavior only if alloc_align_mask and min_align_mask
> >>>> are both zero?
> >>>
> >>> I think the driver setting min_align_mask is a clear indicator
> >>> that the driver requested a specific alignment and the defaults
> >>> don't apply. For swiotbl_tbl_map_single as used by dma-iommu
> >>> I'd have to tak a closer look at how it is used.
> >>
> >> I'm not sure it helps in this discussion, but let me dive into a bit
> >> of ancient history to understand how we ended up here.
> >>
> >> IIRC this behaviour was originally motivated by limitations of PC AT
> >> hardware. Intel 8237 is a 16-bit DMA controller. To make it somehow
> >> usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM
> >> added a page register, but it was on a separate chip and it did not
> >> increment when the 8237 address rolled over back to zero. Effectively,
> >> the page register selected a 64K-aligned window of 64K buffers.
> >> Consequently, DMA buffers could not cross a 64K physical boundary.
> >>
> >> Thanks to how the buddy allocator works, the 64K-boundary constraint
> >> was satisfied by allocation size, and drivers took advantage of it when
> >> allocating device buffers. IMO software bounce buffers simply followed
> >> the same logic that worked for buffers allocated by the buddy allocator.
> >>
> >> OTOH min_align_mask was motivated by NVME which prescribes the value of
> >> a certain number of low bits in the DMA address (for simplicity assumed
> >> to be identical with the same bits in the physical address).
> >>
> >> The only pre-existing user of alloc_align_mask is x86 IOMMU code, and
> >> IIUC it is used to guarantee that unaligned transactions do not share
> >> the IOMMU granule with another device. This whole thing is weird,
> >> because swiotlb_tbl_map_single() is called like this:
> >>
> >> aligned_size = iova_align(iovad, size);
> >> phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
> >> iova_mask(iovad), dir, attrs);
> >>
> >> Here:
> >>
> >> * alloc_size = iova_align(iovad, size)
> >> * alloc_align_mask = iova_mask(iovad)
> >>
> >> Now, iova_align() rounds up its argument to a multiple of iova granule
> >> and iova_mask() is simply "granule - 1". This works, because granule
> >> size must be a power of 2, and I assume it must also be >= PAGE_SIZE.
> >>
> >> In that case, the alloc_align_mask argument is not even needed if you
> >> adjust the code to match documentation---the resulting buffer will be
> >> aligned to a granule boundary by virtue of having a size that is a
> >> multiple of the granule size.
> >>
> >> To sum it up:
> >>
> >> 1. min_align_mask is by far the most important constraint. Devices will
> >> simply stop working if it is not met.
> >> 2. Alignment to the smallest PAGE_SIZE order which is greater than or
> >> equal to the requested size has been documented, and some drivers
> >> may rely on it.
> >> 3. alloc_align_mask is a misguided fix for a bug in the above.
> >>
> >> Correct me if anything of the above is wrong.
> >
> > I thought about it some more, and I believe I know what should happen
> > if the first two constraints appear to be mutually exclusive.
> >
> > First, the alignment based on size does not guarantee that the resulting
> > physical address is aligned. In fact, the lowest IO_TLB_SHIFT bits must
> > be always identical to the original buffer address.
> >
> > Let's take an example request like this:
> >
> > TLB_SIZE = 0x00000800
> > min_align_mask = 0x0000ffff
> > orig_addr = 0x....1234
> > alloc_size = 0x00002800
> >
> > Minimum alignment mask requires keeping the 1234 at the end. Allocation
> > size requires a buffer that is aligned to 16K. Of course, there is no
> > 16K-aligned slot with slot_address & 0x7ff == 0x200, but if IO_TLB_SHIFT
> > was 14, it would be slot_address & 0x3fff == 0 (low IO_TLB_SHIFT are
> > masked off). Since the SWIOTLB API does not guarantee any specific
> > value of IO_TLB_SHIFT, callers cannot rely on it. That means 0x1234 is a
> > perfectly valid bounce buffer address for this example.
> >
> > The caller may rightfully expect that the 16K granule containing the
> > bounce buffer is not shared with any other user. For the above case I
> > suggest to increase the allocation size to 0x4000 already in
> > swiotlb_tbl_map_single() and treat 0x1234 as the offset from the slot
> > address.
>
> That doesn't make sense - a caller asks to map some range of kernel
> addresses and they get back a corresponding range of DMA addresses; they
> cannot make any reasonable assumptions about DMA addresses *outside*
> that range. In the example above, the caller has explicitly chosen not
> to map the range xxx0000-xxx1234; if they expect the device to actually
> access bytes in the DMA range yyy0000-yyy1234, then they should have
> mapped the whole range starting from xxx0000 and it is their own error.
I agree that the range was not requested. But it is not wrong if
SWIOTLB overallocates. In fact, it usually does overallocate because it
works with slot granularity.
> SWIOTLB does not and cannot provide any memory protection itself, so
> there is no functional benefit to automatically over-allocating, all it
> will do is waste slots. iommu-dma *can* provide memory protection
> between individual mappings via additional layers that SWIOTLB doesn't
> know about, so in that case it's iommu-dma's responsibility to
> explicitly manage whatever over-allocation is necessary at the SWIOTLB
> level to match the IOMMU level.
I'm trying to understand what the caller expects to get if they request
both buffer alignment (either given implicitly through the mapping size
or explicitly with an alloc_align_mask) and a min_align_mask that must
preserve non-zero low bits within the range covered by that buffer
alignment.
In other words, if iommu_dma_map_page() gets into this situation:
* granule size is 4k
* device specifies 64k min_align_mask
* bit 11 of the original buffer address is non-zero
Then you ask for a pair of slots where the first slot has bit 11 == 0
(required by alignment to granule size) and also has bit 11 == 1
(required to preserve the lowest 16 bits of the original address).
Sure, you can fail such a mapping, but is it what the caller expects?
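To put numbers on that conflict, here is a tiny user-space sketch; the
4k granule, 64k min_align_mask and an orig_addr with bit 11 set are the
assumed values from above, and none of this is kernel code:

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
        unsigned long alloc_align_mask = 0x0fff;   /* 4k IOVA granule - 1    */
        unsigned long min_align_mask   = 0xffff;   /* 64k device requirement */
        unsigned long orig_addr        = 0x1800;   /* bit 11 set             */

        /*
         * alloc_align_mask wants: tlb_addr & alloc_align_mask == 0
         * min_align_mask wants:   tlb_addr & min_align_mask ==
         *                         orig_addr & min_align_mask
         * Any bit set in orig_addr and in both masks cannot satisfy both.
         */
        bool conflict = orig_addr & alloc_align_mask & min_align_mask;

        printf("constraints %s\n", conflict ? "conflict" : "are compatible");
        return 0;
}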
Thanks
Petr T
From: Petr Tesařík <[email protected]> Sent: Friday, March 1, 2024 10:42 AM
>
> On Fri, 1 Mar 2024 17:54:06 +0000
> Robin Murphy <[email protected]> wrote:
>
> > On 2024-03-01 5:08 pm, Petr Tesařík wrote:
> > > On Fri, 1 Mar 2024 16:39:27 +0100
> > > Petr Tesařík <[email protected]> wrote:
> > >
> > >> On Thu, 29 Feb 2024 16:47:56 +0100
> > >> Christoph Hellwig <[email protected]> wrote:
> > >>
> > >>> On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote:
> > >>>> Any thoughts on how that historical behavior should apply if
> > >>>> the DMA min_align_mask is non-zero, or the alloc_align_mask
> > >>>> parameter to swiotlb_tbl_map_single() is non-zero? As currently
> > >>>> used, alloc_align_mask is page aligned if the IOMMU granule is
> > >>>> >= PAGE_SIZE. But a non-zero min_align_mask could mandate
> > >>>> returning a buffer that is not page aligned. Perhaps do the
> > >>>> historical behavior only if alloc_align_mask and min_align_mask
> > >>>> are both zero?
> > >>>
> > >>> I think the driver setting min_align_mask is a clear indicator
> > >>> that the driver requested a specific alignment and the defaults
> > >>> don't apply. For swiotlb_tbl_map_single as used by dma-iommu
> > >>> I'd have to take a closer look at how it is used.
> > >>
> > >> I'm not sure it helps in this discussion, but let me dive into a bit
> > >> of ancient history to understand how we ended up here.
> > >>
> > >> IIRC this behaviour was originally motivated by limitations of PC AT
> > >> hardware. Intel 8237 is a 16-bit DMA controller. To make it somehow
> > >> usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM
> > >> added a page register, but it was on a separate chip and it did not
> > >> increment when the 8237 address rolled over back to zero. Effectively,
> > >> the page register selected a 64K-aligned window of 64K buffers.
> > >> Consequently, DMA buffers could not cross a 64K physical boundary.
> > >>
> > >> Thanks to how the buddy allocator works, the 64K-boundary constraint
> > >> was satisfied by allocation size, and drivers took advantage of it when
> > >> allocating device buffers. IMO software bounce buffers simply followed
> > >> the same logic that worked for buffers allocated by the buddy allocator.
> > >>
> > >> OTOH min_align_mask was motivated by NVME which prescribes the value of
> > >> a certain number of low bits in the DMA address (for simplicity assumed
> > >> to be identical with the same bits in the physical address).
> > >>
> > >> The only pre-existing user of alloc_align_mask is x86 IOMMU code, and
> > >> IIUC it is used to guarantee that unaligned transactions do not share
> > >> the IOMMU granule with another device. This whole thing is weird,
> > >> because swiotlb_tbl_map_single() is called like this:
> > >>
> > >> aligned_size = iova_align(iovad, size);
> > >> phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
> > >> iova_mask(iovad), dir, attrs);
> > >>
> > >> Here:
> > >>
> > >> * alloc_size = iova_align(iovad, size)
> > >> * alloc_align_mask = iova_mask(iovad)
> > >>
> > >> Now, iova_align() rounds up its argument to a multiple of iova granule
> > >> and iova_mask() is simply "granule - 1". This works, because granule
> > >> size must be a power of 2, and I assume it must also be >= PAGE_SIZE.
> > >>
> > >> In that case, the alloc_align_mask argument is not even needed if you
> > >> adjust the code to match documentation---the resulting buffer will be
> > >> aligned to a granule boundary by virtue of having a size that is a
> > >> multiple of the granule size.
> > >>
> > >> To sum it up:
> > >>
> > >> 1. min_align_mask is by far the most important constraint. Devices will
> > >> simply stop working if it is not met.
> > >> 2. Alignment to the smallest PAGE_SIZE order which is greater than or
> > >> equal to the requested size has been documented, and some drivers
> > >> may rely on it.
> > >> 3. alloc_align_mask is a misguided fix for a bug in the above.
> > >>
> > >> Correct me if anything of the above is wrong.
> > >
> > > I thought about it some more, and I believe I know what should happen
> > > if the first two constraints appear to be mutually exclusive.
> > >
> > > First, the alignment based on size does not guarantee that the resulting
> > > physical address is aligned. In fact, the lowest IO_TLB_SHIFT bits must
> > > be always identical to the original buffer address.
> > >
> > > Let's take an example request like this:
> > >
> > > TLB_SIZE = 0x00000800
> > > min_align_mask = 0x0000ffff
> > > orig_addr = 0x....1234
> > > alloc_size = 0x00002800
> > >
> > > Minimum alignment mask requires keeping the 1234 at the end. Allocation
> > > size requires a buffer that is aligned to 16K. Of course, there is no
> > > 16K-aligned slot with slot_address & 0x7ff == 0x200, but if IO_TLB_SHIFT
> > > was 14, it would be slot_address & 0x3fff == 0 (low IO_TLB_SHIFT are
> > > masked off). Since the SWIOTLB API does not guarantee any specific
> > > value of IO_TLB_SHIFT, callers cannot rely on it. That means 0x1234 is a
> > > perfectly valid bounce buffer address for this example.
> > >
> > > The caller may rightfully expect that the 16K granule containing the
> > > bounce buffer is not shared with any other user. For the above case I
> > > suggest to increase the allocation size to 0x4000 already in
> > > swiotlb_tbl_map_single() and treat 0x1234 as the offset from the slot
> > > address.
> >
> > That doesn't make sense - a caller asks to map some range of kernel
> > addresses and they get back a corresponding range of DMA addresses; they
> > cannot make any reasonable assumptions about DMA addresses *outside*
> > that range. In the example above, the caller has explicitly chosen not
> > to map the range xxx0000-xxx1234; if they expect the device to actually
> > access bytes in the DMA range yyy0000-yyy1234, then they should have
> > mapped the whole range starting from xxx0000 and it is their own error.
>
> I agree that the range was not requested. But it is not wrong if
> SWIOTLB overallocates. In fact, it usually does overallocate because it
> works with slot granularity.
>
> > SWIOTLB does not and cannot provide any memory protection itself, so
> > there is no functional benefit to automatically over-allocating, all it
> > will do is waste slots. iommu-dma *can* provide memory protection
> > between individual mappings via additional layers that SWIOTLB doesn't
> > know about, so in that case it's iommu-dma's responsibility to
> > explicitly manage whatever over-allocation is necessary at the SWIOTLB
> > level to match the IOMMU level.
>
> I'm trying to understand what the caller expects to get if they request
> both buffer alignment (either given implicitly through the mapping size
> or explicitly with an alloc_align_mask) and a min_align_mask that must
> preserve non-zero low bits within the range covered by that buffer
> alignment.
>
> In other words, if iommu_dma_map_page() gets into this situation:
>
> * granule size is 4k
> * device specifies 64k min_align_mask
> * bit 11 of the original buffer address is non-zero
>
> Then you ask for a pair of slots where the first slot has bit 11 == 0
> (required by alignment to granule size) and also has bit 11 == 1
> (required to preserve the lowest 16 bits of the original address).
>
> Sure, you can fail such a mapping, but is it what the caller expects?
>
Here's my take on tying all the threads together. There are
four alignment combinations:
1. alloc_align_mask: zero; min_align_mask: zero
2. alloc_align_mask: zero; min_align_mask: non-zero
3. alloc_align_mask: non-zero; min_align_mask: zero/ignored
4. alloc_align_mask: non-zero; min_align_mask: non-zero
xen_swiotlb_map_page() and dma_direct_map_page() are #1 or #2
via swiotlb_map() and swiotlb_tbl_map_single()
iommu_dma_map_page() is #3 and #4 via swiotlb_tbl_map_single()
swiotlb_alloc() is #3, directly to swiotlb_find_slots()
For #1, the returned physical address has no constraints if
the requested size is less than a page. For page size or
greater, the discussed historical requirement for page
alignment applies.
For #2, min_align_mask governs the bits of the returned
physical address that must match the original address. When
needed, swiotlb must also allocate pre-padding aligned to
IO_TLB_SIZE that precedes the returned physical address. A
request size <= swiotlb_max_mapping_size() will not exceed
IO_TLB_SEGSIZE even with the padding. The historical
requirement for page alignment does not apply because the
driver has explicitly used the newer min_align_mask feature.
For #3, alloc_align_mask specifies the required alignment. No
pre-padding is needed. Per earlier comments from Robin[1],
it's reasonable to assume alloc_align_mask (i.e., the granule)
is >= IO_TLB_SIZE. The original address is not relevant in
determining the alignment, and the historical page alignment
requirement does not apply since alloc_align_mask explicitly
states the alignment.
For #4, the returned physical address must match the bits
in the original address specified by min_align_mask. swiotlb
must also allocate pre-padding aligned to
alloc_align_mask that precedes the returned physical address.
Also per Robin[1], assume alloc_align_mask is >=
min_align_mask, which solves the conflicting alignment
problem pointed out by Petr[2]. Perhaps we should add a
"WARN_ON(alloc_align_mask < min_align_mask)" rather than
failing depending on which bits of the original address are
set. Again, the historical requirement for page alignment does
not apply.
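To compare the four cases side by side, here is a small user-space
sketch of how each one could feed the slot-search stride. It is purely
illustrative: stride_mask() is a made-up helper, the IO_TLB_*/PAGE_SIZE
constants simply mirror their usual kernel values, and matching the low
bits of orig_addr for #2 and #4 is assumed to happen separately inside
the search loop, as in the existing code.

#include <stdio.h>

#define IO_TLB_SHIFT    11
#define IO_TLB_SIZE     (1UL << IO_TLB_SHIFT)
#define PAGE_SIZE       4096UL

static unsigned long stride_mask(unsigned long alloc_align_mask,
                                 unsigned long min_align_mask,
                                 unsigned long alloc_size)
{
        unsigned long mask = alloc_align_mask;

        /* bits of min_align_mask below IO_TLB_SIZE are handled by offsetting */
        mask |= min_align_mask & ~(IO_TLB_SIZE - 1);

        /* historical behaviour for #1: page-align large requests */
        if (!alloc_align_mask && !min_align_mask && alloc_size >= PAGE_SIZE)
                mask |= PAGE_SIZE - 1;

        return mask;
}

int main(void)
{
        /* stride in slots, as in get_max_slots(): (mask >> IO_TLB_SHIFT) + 1 */
        printf("#1: %lu\n", (stride_mask(0,     0,     8192) >> IO_TLB_SHIFT) + 1);
        printf("#2: %lu\n", (stride_mask(0,     0xfff, 8192) >> IO_TLB_SHIFT) + 1);
        printf("#3: %lu\n", (stride_mask(0xfff, 0,     8192) >> IO_TLB_SHIFT) + 1);
        printf("#4: %lu\n", (stride_mask(0xfff, 0xfff, 8192) >> IO_TLB_SHIFT) + 1);
        return 0;
}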
I believe Will's patch set implements everything in #2, #3,
and #4, except my suggested WARN_ON in #4. The historical page
alignment in #1 presumably needs to be added. Also, the current
implementation of #4 has a bug in that IO_TLB_SEGSIZE could be
exceeded as pointed out here[3], but Robin was OK with not
fixing that now.
Michael
[1] https://lore.kernel.org/linux-iommu/[email protected]/T/#mbd31cbfbdf841336e25f37758c8af1a0b6d8f3eb
[2] https://lore.kernel.org/linux-iommu/[email protected]/T/#mf631679b302b1f5c7cacc82f4c15fb4b19f3dea1
[3] https://lore.kernel.org/linux-iommu/[email protected]/T/#m4179a909777ec751f3dc15b515617477e6682600
On 04/03/2024 11:00 am, Petr Tesařík wrote:
[...]
>> Here's my take on tying all the threads together. There are
>> four alignment combinations:
>>
>> 1. alloc_align_mask: zero; min_align_mask: zero
>> 2. alloc_align_mask: zero; min_align_mask: non-zero
>> 3. alloc_align_mask: non-zero; min_align_mask: zero/ignored
>> 4. alloc_align_mask: non-zero; min_align_mask: non-zero
>
> What does "min_align_mask: zero/ignored" mean? Under which
> circumstances should a non-zero min_align_mask be ignored?
>
>> xen_swiotlb_map_page() and dma_direct_map_page() are #1 or #2
>> via swiotlb_map() and swiotlb_tbl_map_single()
>>
>> iommu_dma_map_page() is #3 and #4 via swiotlb_tbl_map_single()
>>
>> swiotlb_alloc() is #3, directly to swiotlb_find_slots()
>>
>> For #1, the returned physical address has no constraints if
>> the requested size is less than a page. For page size or
>> greater, the discussed historical requirement for page
>> alignment applies.
>>
>> For #2, min_align_mask governs the bits of the returned
>> physical address that must match the original address. When
>> needed, swiotlb must also allocate pre-padding aligned to
>> IO_TLB_SIZE that precedes the returned physical address. A
>> request size <= swiotlb_max_mapping_size() will not exceed
>> IO_TLB_SEGSIZE even with the padding. The historical
>> requirement for page alignment does not apply because the
>> driver has explicitly used the newer min_align_mask feature.
>
> What is the idea here? Is it the assumption that only old drivers rely
> on page alignment, so if they use min_align_mask, it proves that they
> are new and must not rely on page alignment?
Yes, if a driver goes out of its way to set a min_align_mask which is
smaller than its actual alignment constraint, that is clearly the
driver's own bug. Strictly we only need to be sympathetic to drivers
which predate min_align_mask, when implicitly relying on page alignment
was all they had.
>> For #3, alloc_align_mask specifies the required alignment. No
>> pre-padding is needed. Per earlier comments from Robin[1],
>> it's reasonable to assume alloc_align_mask (i.e., the granule)
>> is >= IO_TLB_SIZE. The original address is not relevant in
>> determining the alignment, and the historical page alignment
>> requirement does not apply since alloc_align_mask explicitly
>> states the alignment.
FWIW I'm also starting to wonder about getting rid of the alloc_size
argument and just have SWIOTLB round the end address up to
alloc_align_mask itself as part of all these calculations. Seems like it
could potentially end up a little simpler, maybe?
>> For #4, the returned physical address must match the bits
>> in the original address specified by min_align_mask. swiotlb
>> must also allocate pre-padding aligned to
>> alloc_align_mask that precedes the returned physical address.
>> Also per Robin[1], assume alloc_align_mask is >=
>> min_align_mask, which solves the conflicting alignment
>> problem pointed out by Petr[2]. Perhaps we should add a
>> "WARN_ON(alloc_align_mask < min_align_mask)" rather than
>> failing depending on which bits of the original address are
>> set. Again, the historical requirement for page alignment does
>> not apply.
>
> AFAICS the only reason this works in practice is that there are only
> two in-tree users of min_align_mask: NVMe and Hyper-V. Both use a mask
> of 12 bits, and the IOVA granule size is never smaller than 4K.
If we assume a nonzero alloc_align_mask exclusively signifies iommu-dma,
then for this situation SWIOTLB should only need to worry about the
intersection of alloc_align_mask & min_align_mask, since any
min_align_mask bits larger than the IOVA granule would need to be
accounted for in the IOVA allocation regardless of SWIOTLB.
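A tiny sketch of that intersection with assumed example values (64k
min_align_mask from a device, 4k IOVA granule); only the bits below the
granule are left for SWIOTLB itself to honour:

#include <stdio.h>

int main(void)
{
        unsigned long min_align_mask   = 0xffff;   /* assumed 64k device mask */
        unsigned long alloc_align_mask = 0x0fff;   /* assumed granule - 1     */

        /* the IOVA allocation already covers the bits above the granule */
        printf("mask left for SWIOTLB: 0x%lx\n",
               alloc_align_mask & min_align_mask);
        return 0;
}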
> If we want to rely on this, then I propose to make a BUG_ON() rather
> than WARN_ON().
I've just proposed a patch to make it not matter for now - the nature of
iommu-dma makes it slightly more awkward to prevent SWIOTLB from ever
seeing this condition at all, so I chose not to do that, but as long as
swiotlb_tbl_map_single() does *something* for conflicting constraints
without completely falling over, which swiotlb_tbl_unmap_single can then
undo again, then it should be fine.
Thanks,
Robin.
On Thu, Feb 29, 2024 at 06:07:32AM +0000, Michael Kelley wrote:
> From: Will Deacon <[email protected]> Sent: Wednesday, February 28, 2024 5:40 AM
> >
> > For swiotlb allocations >= PAGE_SIZE, the slab search historically
> > adjusted the stride to avoid checking unaligned slots. However, this is
> > no longer needed now that the code around it has evolved and the
> > stride is calculated from the required alignment.
> >
> > Either 'alloc_align_mask' is used to specify the allocation alignment or
> > the DMA 'min_align_mask' is used to align the allocation with 'orig_addr'.
> > At least one of these masks is always non-zero.
>
> I think the patch is correct, but this justification is not. alloc_align_mask
> and the DMA min_align_mask are often both zero. While the NVMe
> PCI driver sets min_align_mask, SCSI disk drivers do not (except for the
> Hyper-V synthetic SCSI driver). When both are zero, presumably
> there are no alignment requirements, so a stride of 1 is appropriate.
Sorry, yes, I messed up the commit message here as I was trying to reason
through the allocation case separately from the mapping case. However, I
need to digest the rest of this thread before doing the obvious fix...
Will
On Mon, 4 Mar 2024 13:37:56 +0000
Robin Murphy <[email protected]> wrote:
> On 04/03/2024 11:00 am, Petr Tesařík wrote:
> [...]
> >> Here's my take on tying all the threads together. There are
> >> four alignment combinations:
> >>
> >> 1. alloc_align_mask: zero; min_align_mask: zero
> >> 2. alloc_align_mask: zero; min_align_mask: non-zero
> >> 3. alloc_align_mask: non-zero; min_align_mask: zero/ignored
> >> 4. alloc_align_mask: non-zero; min_align_mask: non-zero
> >
> > What does "min_align_mask: zero/ignored" mean? Under which
> > circumstances should a non-zero min_align_mask be ignored?
> >
> >> xen_swiotlb_map_page() and dma_direct_map_page() are #1 or #2
> >> via swiotlb_map() and swiotlb_tbl_map_single()
> >>
> >> iommu_dma_map_page() is #3 and #4 via swiotlb_tbl_map_single()
> >>
> >> swiotlb_alloc() is #3, directly to swiotlb_find_slots()
> >>
> >> For #1, the returned physical address has no constraints if
> >> the requested size is less than a page. For page size or
> >> greater, the discussed historical requirement for page
> >> alignment applies.
> >>
> >> For #2, min_align_mask governs the bits of the returned
> >> physical address that must match the original address. When
> >> needed, swiotlb must also allocate pre-padding aligned to
> >> IO_TLB_SIZE that precedes the returned physical address. A
> >> request size <= swiotlb_max_mapping_size() will not exceed
> >> IO_TLB_SEGSIZE even with the padding. The historical
> >> requirement for page alignment does not apply because the
> >> driver has explicitly used the newer min_align_mask feature.
> >
> > What is the idea here? Is it the assumption that only old drivers rely
> > on page alignment, so if they use min_align_mask, it proves that they
> > are new and must not rely on page alignment?
>
> Yes, if a driver goes out of its way to set a min_align_mask which is
> smaller than its actual alignment constraint, that is clearly the
> driver's own bug. Strictly we only need to be sympathetic to drivers
> which predate min_align_mask, when implicitly relying on page alignment
> was all they had.
>
> >> For #3, alloc_align_mask specifies the required alignment. No
> >> pre-padding is needed. Per earlier comments from Robin[1],
> >> it's reasonable to assume alloc_align_mask (i.e., the granule)
> >> is >= IO_TLB_SIZE. The original address is not relevant in
> >> determining the alignment, and the historical page alignment
> >> requirement does not apply since alloc_align_mask explicitly
> >> states the alignment.
>
> FWIW I'm also starting to wonder about getting rid of the alloc_size
> argument and just have SWIOTLB round the end address up to
> alloc_align_mask itself as part of all these calculations. Seems like it
> could potentially end up a little simpler, maybe?
>
> >> For #4, the returned physical address must match the bits
> >> in the original address specified by min_align_mask. swiotlb
> >> must also allocate pre-padding aligned to
> >> alloc_align_mask that precedes the returned physical address.
> >> Also per Robin[1], assume alloc_align_mask is >=
> >> min_align_mask, which solves the conflicting alignment
> >> problem pointed out by Petr[2]. Perhaps we should add a
> >> "WARN_ON(alloc_align_mask < min_align_mask)" rather than
> >> failing depending on which bits of the original address are
> >> set. Again, the historical requirement for page alignment does
> >> not apply.
> >
> > AFAICS the only reason this works in practice is that there are only
> > two in-tree users of min_align_mask: NVMe and Hyper-V. Both use a mask
> > of 12 bits, and the IOVA granule size is never smaller than 4K.
>
> If we assume a nonzero alloc_align_mask exclusively signifies iommu-dma,
> then for this situation SWIOTLB should only need to worry about the
> intersection of alloc_align_mask & min_align_mask, since any
> min_align_mask bits larger than the IOVA granule would need to be
> accounted for in the IOVA allocation regardless of SWIOTLB.
Ah, right, it's not limited to bounce buffers.
> > If we want to rely on this, then I propose to make a BUG_ON() rather
> > than WARN_ON().
>
> I've just proposed a patch to make it not matter for now - the nature of
> iommu-dma makes it slightly more awkward to prevent SWIOTLB from ever
> seeing this condition at all, so I chose not to do that, but as long as
> swiotlb_tbl_map_single() does *something* for conflicting constraints
> without completely falling over, which swiotlb_tbl_unmap_single can then
> undo again, then it should be fine.
Yes. It may allocate an unsuitably aligned bounce buffer, or it may
fail, but your IOMMU patch will continue to work (and also cover the
non-SWIOTLB case).
I believe this patch series is now good as is, except the commit
message should make it clear that alloc_align_mask and min_align_mask
can both be zero, but that simply means no alignment constraints.
Thanks,
Petr T
From: Petr Tesařík <[email protected]> Sent: Monday, March 4, 2024 7:55 AM
>
> On Mon, 4 Mar 2024 13:37:56 +0000
> Robin Murphy <[email protected]> wrote:
>
> > On 04/03/2024 11:00 am, Petr Tesařík wrote:
> > [...]
> > >> Here's my take on tying all the threads together. There are
> > >> four alignment combinations:
> > >>
> > >> 1. alloc_align_mask: zero; min_align_mask: zero
> > >> 2. alloc_align_mask: zero; min_align_mask: non-zero
> > >> 3. alloc_align_mask: non-zero; min_align_mask: zero/ignored
> > >> 4. alloc_align_mask: non-zero; min_align_mask: non-zero
> > >
> > > What does "min_align_mask: zero/ignored" mean? Under which
> > > circumstances should a non-zero min_align_mask be ignored?
"Ignored" was my short-hand for the swiotlb_alloc() case where
orig_addr is zero. Even if min_align_mask is set for the device, it
doesn't have any effect when orig_addr is zero.
> > >
> > >> xen_swiotlb_map_page() and dma_direct_map_page() are #1 or #2
> > >> via swiotlb_map() and swiotlb_tbl_map_single()
> > >>
> > >> iommu_dma_map_page() is #3 and #4 via swiotlb_tbl_map_single()
> > >>
> > >> swiotlb_alloc() is #3, directly to swiotlb_find_slots()
> > >>
> > >> For #1, the returned physical address has no constraints if
> > >> the requested size is less than a page. For page size or
> > >> greater, the discussed historical requirement for page
> > >> alignment applies.
> > >>
> > >> For #2, min_align_mask governs the bits of the returned
> > >> physical address that must match the original address. When
> > >> needed, swiotlb must also allocate pre-padding aligned to
> > >> IO_TLB_SIZE that precedes the returned physical address. A
> > >> request size <= swiotlb_max_mapping_size() will not exceed
> > >> IO_TLB_SEGSIZE even with the padding. The historical
> > >> requirement for page alignment does not apply because the
> > >> driver has explicitly used the newer min_align_mask feature.
> > >
> > > What is the idea here? Is it the assumption that only old drivers rely
> > > on page alignment, so if they use min_align_mask, it proves that they
> > > are new and must not rely on page alignment?
> >
> > Yes, if a driver goes out of its way to set a min_align_mask which is
> > smaller than its actual alignment constraint, that is clearly the
> > driver's own bug. Strictly we only need to be sympathetic to drivers
> > which predate min_align_mask, when implicitly relying on page alignment
> > was all they had.
> >
> > >> For #3, alloc_align_mask specifies the required alignment. No
> > >> pre-padding is needed. Per earlier comments from Robin[1],
> > >> it's reasonable to assume alloc_align_mask (i.e., the granule)
> > >> is >= IO_TLB_SIZE. The original address is not relevant in
> > >> determining the alignment, and the historical page alignment
> > >> requirement does not apply since alloc_align_mask explicitly
> > >> states the alignment.
> >
> > FWIW I'm also starting to wonder about getting rid of the alloc_size
> > argument and just have SWIOTLB round the end address up to
> > alloc_align_mask itself as part of all these calculations. Seems like it
> > could potentially end up a little simpler, maybe?
Yes, I was thinking exactly this. But my reasoning was to solve the
bug in #4 that I previously pointed out. If iommu_dma_map_page()
does *not* do
aligned_size = iova_align(iovad, size);
but swiotlb_tbl_map_single() rounds up the size based on
alloc_align_mask *after* adding the offset modulo
min_align_mask, then the rounded-up size won't exceed IO_TLB_SEGSIZE slots,
regardless of which bits are set in orig_addr.
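For illustration only, here is a user-space sketch of the two orderings
with assumed values (64k granule, 4k-1 min_align_mask, a request of
roughly swiotlb_max_mapping_size()); only the second ordering, where
SWIOTLB rounds up after the offset has been added, stays within the
segment:

#include <stdio.h>

#define GRANULE         65536UL                 /* assumed 64k IOVA granule     */
#define MIN_ALIGN_MASK  4095UL                  /* assumed NVMe-style 4k - 1    */
#define SEGMENT         (128UL * 2048UL)        /* IO_TLB_SEGSIZE * IO_TLB_SIZE */

static unsigned long round_up_to(unsigned long x, unsigned long a)
{
        return (x + a - 1) & ~(a - 1);
}

int main(void)
{
        unsigned long size   = SEGMENT - 4096;  /* ~swiotlb_max_mapping_size()  */
        unsigned long offset = 0xe00;           /* orig_addr & MIN_ALIGN_MASK   */

        /* today: the caller rounds first, SWIOTLB adds the offset afterwards */
        unsigned long a = round_up_to(size, GRANULE) + offset;

        /* proposed: SWIOTLB adds the offset first, then rounds up itself */
        unsigned long b = round_up_to(offset + size, GRANULE);

        printf("round then pad: %lu bytes (%s the segment)\n", a,
               a > SEGMENT ? "exceeds" : "fits in");
        printf("pad then round: %lu bytes (%s the segment)\n", b,
               b > SEGMENT ? "exceeds" : "fits in");
        return 0;
}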
> >
> > >> For #4, the returned physical address must match the bits
> > >> in the original address specified by min_align_mask. swiotlb
> > >> must also allocate pre-padding aligned to
> > >> alloc_align_mask that precedes the returned physical address.
> > >> Also per Robin[1], assume alloc_align_mask is >=
> > >> min_align_mask, which solves the conflicting alignment
> > >> problem pointed out by Petr[2]. Perhaps we should add a
> > >> "WARN_ON(alloc_align_mask < min_align_mask)" rather than
> > >> failing depending on which bits of the original address are
> > >> set. Again, the historical requirement for page alignment does
> > >> not apply.
> > >
> > > AFAICS the only reason this works in practice is that there are only
> > > two in-tree users of min_align_mask: NVMe and Hyper-V. Both use a mask
> > > of 12 bits, and the IOVA granule size is never smaller than 4K.
> >
> > If we assume a nonzero alloc_align_mask exclusively signifies iommu-dma,
> > then for this situation SWIOTLB should only need to worry about the
> > intersection of alloc_align_mask & min_align_mask, since any
> > min_align_mask bits larger than the IOVA granule would need to be
> > accounted for in the IOVA allocation regardless of SWIOTLB.
>
> Ah, right, it's not limited to bounce buffers.
>
> > > If we want to rely on this, then I propose to make a BUG_ON() rather
> > > than WARN_ON().
> >
> > I've just proposed a patch to make it not matter for now - the nature of
> > iommu-dma makes it slightly more awkward to prevent SWIOTLB from ever
> > seeing this condition at all, so I chose not to do that, but as long as
> > swiotlb_tbl_map_single() does *something* for conflicting constraints
> > without completely falling over, which swiotlb_tbl_unmap_single can then
> > undo again, then it should be fine.
>
> Yes. It may allocate an unsuitably aligned bounce buffer, or it may
> fail, but your IOMMU patch will continue to work (and also cover the
> non-SWIOTLB case).
>
> I believe this patch series is now good as is, except the commit
> message should make it clear that alloc_align_mask and min_align_mask
> can both be zero, but that simply means no alignment constraints.
>
No, I think we need to add the historical page alignment functionality
back again for my #1 (alloc_align_mask and min_align_mask both zero).
We don't know what old drivers might be depending on, and we don't
want to risk breaking them.
Michael
From: Will Deacon <[email protected]> Sent: Monday, March 4, 2024 8:02 AM
>
> Hi folks,
>
> On Mon, Mar 04, 2024 at 04:55:06PM +0100, Petr Tesařík wrote:
> > On Mon, 4 Mar 2024 13:37:56 +0000
> > Robin Murphy <[email protected]> wrote:
> > > On 04/03/2024 11:00 am, Petr Tesařík wrote:
> > > [...]
> > > >> Here's my take on tying all the threads together. There are
> > > >> four alignment combinations:
> > > >>
> > > >> 1. alloc_align_mask: zero; min_align_mask: zero
>
> Based on this ^^^ ...
>
> > > >> xen_swiotlb_map_page() and dma_direct_map_page() are #1 or #2
> > > >> via swiotlb_map() and swiotlb_tbl_map_single()
> > > >>
> > > >> iommu_dma_map_page() is #3 and #4 via swiotlb_tbl_map_single()
> > > >>
> > > >> swiotlb_alloc() is #3, directly to swiotlb_find_slots()
> > > >>
> > > >> For #1, the returned physical address has no constraints if
> > > >> the requested size is less than a page. For page size or
> > > >> greater, the discussed historical requirement for page
> > > >> alignment applies.
>
> ... and this ^^^ ...
>
>
> > I believe this patch series is now good as is, except the commit
> > message should make it clear that alloc_align_mask and min_align_mask
> > can both be zero, but that simply means no alignment constraints.
>
> ... my (possibly incorrect!) reading of the thread so far is that we
> should preserve page-aligned allocation in this case if the allocation
> size is >= PAGE_SIZE.
>
> Something like the diff below, to replace this final patch?
>
> Will
>
> --->8
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index c381a7ed718f..67eac05728c0 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -992,6 +992,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> BUG_ON(!nslots);
> BUG_ON(area_index >= pool->nareas);
>
> + /*
> + * Historically, allocations >= PAGE_SIZE were guaranteed to be
> + * page-aligned in the absence of any other alignment requirements.
> + * Since drivers may be relying on this, preserve the old behaviour.
> + */
> + if (!alloc_align_mask && !iotlb_align_mask && alloc_size >= PAGE_SIZE)
> + alloc_align_mask = PAGE_SIZE - 1;
> +
Yes, I think that should do it.
Michael
> /*
> * Ensure that the allocation is at least slot-aligned and update
> * 'iotlb_align_mask' to ignore bits that will be preserved when
> @@ -1006,13 +1014,6 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> */
> stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
>
> - /*
> - * For allocations of PAGE_SIZE or larger only look for page aligned
> - * allocations.
> - */
> - if (alloc_size >= PAGE_SIZE)
> - stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
> -
> spin_lock_irqsave(&area->lock, flags);
> if (unlikely(nslots > pool->area_nslabs - area->used))
> goto not_found;
Hi folks,
On Mon, Mar 04, 2024 at 04:55:06PM +0100, Petr Tesařík wrote:
> On Mon, 4 Mar 2024 13:37:56 +0000
> Robin Murphy <[email protected]> wrote:
> > On 04/03/2024 11:00 am, Petr Tesařík wrote:
> > [...]
> > >> Here's my take on tying all the threads together. There are
> > >> four alignment combinations:
> > >>
> > >> 1. alloc_align_mask: zero; min_align_mask: zero
Based on this ^^^ ...
> > >> xen_swiotlb_map_page() and dma_direct_map_page() are #1 or #2
> > >> via swiotlb_map() and swiotlb_tbl_map_single()
> > >>
> > >> iommu_dma_map_page() is #3 and #4 via swiotlb_tbl_map_single()
> > >>
> > >> swiotlb_alloc() is #3, directly to swiotlb_find_slots()
> > >>
> > >> For #1, the returned physical address has no constraints if
> > >> the requested size is less than a page. For page size or
> > >> greater, the discussed historical requirement for page
> > >> alignment applies.
... and this ^^^ ...
> I believe this patch series is now good as is, except the commit
> message should make it clear that alloc_align_mask and min_align_mask
> can both be zero, but that simply means no alignment constraints.
... my (possibly incorrect!) reading of the thread so far is that we
should preserve page-aligned allocation in this case if the allocation
size is >= PAGE_SIZE.
Something like the diff below, to replace this final patch?
Will
--->8
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c381a7ed718f..67eac05728c0 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -992,6 +992,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
BUG_ON(!nslots);
BUG_ON(area_index >= pool->nareas);
+ /*
+ * Historically, allocations >= PAGE_SIZE were guaranteed to be
+ * page-aligned in the absence of any other alignment requirements.
+ * Since drivers may be relying on this, preserve the old behaviour.
+ */
+ if (!alloc_align_mask && !iotlb_align_mask && alloc_size >= PAGE_SIZE)
+ alloc_align_mask = PAGE_SIZE - 1;
+
/*
* Ensure that the allocation is at least slot-aligned and update
* 'iotlb_align_mask' to ignore bits that will be preserved when
@@ -1006,13 +1014,6 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
*/
stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
- /*
- * For allocations of PAGE_SIZE or larger only look for page aligned
- * allocations.
- */
- if (alloc_size >= PAGE_SIZE)
- stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
-
spin_lock_irqsave(&area->lock, flags);
if (unlikely(nslots > pool->area_nslabs - area->used))
goto not_found;
On 04/03/2024 4:10 pm, Michael Kelley wrote:
> From: Will Deacon <[email protected]> Sent: Monday, March 4, 2024 8:02 AM
>>
>> Hi folks,
>>
>> On Mon, Mar 04, 2024 at 04:55:06PM +0100, Petr Tesařík wrote:
>>> On Mon, 4 Mar 2024 13:37:56 +0000
>>> Robin Murphy <[email protected]> wrote:
>>>> On 04/03/2024 11:00 am, Petr Tesařík wrote:
>>>> [...]
>>>>>> Here's my take on tying all the threads together. There are
>>>>>> four alignment combinations:
>>>>>>
>>>>>> 1. alloc_align_mask: zero; min_align_mask: zero
>>
>> Based on this ^^^ ...
>>
>>>>>> xen_swiotlb_map_page() and dma_direct_map_page() are #1 or #2
>>>>>> via swiotlb_map() and swiotlb_tbl_map_single()
>>>>>>
>>>>>> iommu_dma_map_page() is #3 and #4 via swiotlb_tbl_map_single()
>>>>>>
>>>>>> swiotlb_alloc() is #3, directly to swiotlb_find_slots()
>>>>>>
>>>>>> For #1, the returned physical address has no constraints if
>>>>>> the requested size is less than a page. For page size or
>>>>>> greater, the discussed historical requirement for page
>>>>>> alignment applies.
>>
>> ... and this ^^^ ...
>>
>>
>>> I believe this patch series is now good as is, except the commit
>>> message should make it clear that alloc_align_mask and min_align_mask
>>> can both be zero, but that simply means no alignment constraints.
>>
>> ... my (possibly incorrect!) reading of the thread so far is that we
>> should preserve page-aligned allocation in this case if the allocation
>> size is >= PAGE_SIZE.
>>
>> Something like the diff below, to replace this final patch?
>>
>> Will
>>
>> --->8
>>
>> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
>> index c381a7ed718f..67eac05728c0 100644
>> --- a/kernel/dma/swiotlb.c
>> +++ b/kernel/dma/swiotlb.c
>> @@ -992,6 +992,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
>> BUG_ON(!nslots);
>> BUG_ON(area_index >= pool->nareas);
>>
>> + /*
>> + * Historically, allocations >= PAGE_SIZE were guaranteed to be
>> + * page-aligned in the absence of any other alignment requirements.
>> + * Since drivers may be relying on this, preserve the old behaviour.
>> + */
>> + if (!alloc_align_mask && !iotlb_align_mask && alloc_size >= PAGE_SIZE)
>> + alloc_align_mask = PAGE_SIZE - 1;
>> +
>
> Yes, I think that should do it.
In principle it might be more logical to fudge this into
iotlb_align_mask rather than alloc_align_mask - since that's really the
effective behaviour to preserve for streaming mappings - and then pass
an explicit alloc_align_mask from swiotlb_alloc() to honour the
dma-coherent requirements. However I also wouldn't really object to not
going that far and instead just making the comment a bit clearer that
this is still serving both purposes.
Cheers,
Robin.
>
> Michael
>
>> /*
>> * Ensure that the allocation is at least slot-aligned and update
>> * 'iotlb_align_mask' to ignore bits that will be preserved when
>> @@ -1006,13 +1014,6 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
>> */
>> stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
>>
>> - /*
>> - * For allocations of PAGE_SIZE or larger only look for page aligned
>> - * allocations.
>> - */
>> - if (alloc_size >= PAGE_SIZE)
>> - stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
>> -
>> spin_lock_irqsave(&area->lock, flags);
>> if (unlikely(nslots > pool->area_nslabs - area->used))
>> goto not_found;
>
On 04/03/2024 4:04 pm, Michael Kelley wrote:
> From: Petr Tesařík <[email protected]> Sent: Monday, March 4, 2024 7:55 AM
>>
>> On Mon, 4 Mar 2024 13:37:56 +0000
>> Robin Murphy <[email protected]> wrote:
>>
>>> On 04/03/2024 11:00 am, Petr Tesařík wrote:
>>> [...]
>>>>> Here's my take on tying all the threads together. There are
>>>>> four alignment combinations:
>>>>>
>>>>> 1. alloc_align_mask: zero; min_align_mask: zero
>>>>> 2. alloc_align_mask: zero; min_align_mask: non-zero
>>>>> 3. alloc_align_mask: non-zero; min_align_mask: zero/ignored
>>>>> 4. alloc_align_mask: non-zero; min_align_mask: non-zero
>>>>
>>>> What does "min_align_mask: zero/ignored" mean? Under which
>>>> circumstances should a non-zero min_align_mask be ignored?
>
> "Ignored" was my short-hand for the swiotlb_alloc() case where
> orig_addr is zero. Even if min_align_mask is set for the device, it
> doesn't have any effect when orig_addr is zero.
>
>>>>
>>>>> xen_swiotlb_map_page() and dma_direct_map_page() are #1 or #2
>>>>> via swiotlb_map() and swiotlb_tbl_map_single()
>>>>>
>>>>> iommu_dma_map_page() is #3 and #4 via swiotlb_tbl_map_single()
>>>>>
>>>>> swiotlb_alloc() is #3, directly to swiotlb_find_slots()
>>>>>
>>>>> For #1, the returned physical address has no constraints if
>>>>> the requested size is less than a page. For page size or
>>>>> greater, the discussed historical requirement for page
>>>>> alignment applies.
>>>>>
>>>>> For #2, min_align_mask governs the bits of the returned
>>>>> physical address that must match the original address. When
>>>>> needed, swiotlb must also allocate pre-padding aligned to
>>>>> IO_TLB_SIZE that precedes the returned physical address. A
>>>>> request size <= swiotlb_max_mapping_size() will not exceed
>>>>> IO_TLB_SEGSIZE even with the padding. The historical
>>>>> requirement for page alignment does not apply because the
>>>>> driver has explicitly used the newer min_align_mask feature.
>>>>
>>>> What is the idea here? Is it the assumption that only old drivers rely
>>>> on page alignment, so if they use min_align_mask, it proves that they
>>>> are new and must not rely on page alignment?
>>>
>>> Yes, if a driver goes out of its way to set a min_align_mask which is
>>> smaller than its actual alignment constraint, that is clearly the
>>> driver's own bug. Strictly we only need to be sympathetic to drivers
>>> which predate min_align_mask, when implicitly relying on page alignment
>>> was all they had.
>>>
>>>>> For #3, alloc_align_mask specifies the required alignment. No
>>>>> pre-padding is needed. Per earlier comments from Robin[1],
>>>>> it's reasonable to assume alloc_align_mask (i.e., the granule)
>>>>> is >= IO_TLB_SIZE. The original address is not relevant in
>>>>> determining the alignment, and the historical page alignment
>>>>> requirement does not apply since alloc_align_mask explicitly
>>>>> states the alignment.
>>>
>>> FWIW I'm also starting to wonder about getting rid of the alloc_size
>>> argument and just have SWIOTLB round the end address up to
>>> alloc_align_mask itself as part of all these calculations. Seems like it
>>> could potentially end up a little simpler, maybe?
>
> Yes, I was thinking exactly this. But my reasoning was to solve the
> bug in #4 that I previously pointed out. If iommu_dma_map_page()
> does *not* do
>
> aligned_size = iova_align(iovad, size);
>
> but swiotlb_tbl_map_single() rounds up the size based on
> alloc_align_mask *after* adding the offset modulo
> min_align_mask, then the rounded-up size won't exceed IO_TLB_SEGSIZE slots,
> regardless of which bits are set in orig_addr.
Ah, neat, I had a gut feeling that something like that might also fall
out, I just didn't feel like working through the details to see if
"simpler" could lead to "objectively better" :)
I guess at worst we might also need to pass an alloc_align_mask to
swiotlb_max_mapping_size() as well, but even that's not necessarily a
bad thing if it keeps the equivalent calculations close together within
SWIOTLB and makes things more robust overall.
Cheers,
Robin.
From: Robin Murphy <[email protected]> Sent: Monday, March 4, 2024 9:11 AM
>
> On 04/03/2024 4:04 pm, Michael Kelley wrote:
> > From: Petr Tesařík <[email protected]> Sent: Monday, March 4, 2024 7:55 AM
> >>
> >> On Mon, 4 Mar 2024 13:37:56 +0000
> >> Robin Murphy <[email protected]> wrote:
> >>
> >>> On 04/03/2024 11:00 am, Petr Tesařík wrote:
> >>> [...]
> >>>
> >>>>> For #3, alloc_align_mask specifies the required alignment. No
> >>>>> pre-padding is needed. Per earlier comments from Robin[1],
> >>>>> it's reasonable to assume alloc_align_mask (i.e., the granule)
> >>>>> is >= IO_TLB_SIZE. The original address is not relevant in
> >>>>> determining the alignment, and the historical page alignment
> >>>>> requirement does not apply since alloc_align_mask explicitly
> >>>>> states the alignment.
> >>>
> >>> FWIW I'm also starting to wonder about getting rid of the alloc_size
> >>> argument and just have SWIOTLB round the end address up to
> >>> alloc_align_mask itself as part of all these calculations. Seems like it
> >>> could potentially end up a little simpler, maybe?
> >
> > Yes, I was thinking exactly this. But my reasoning was to solve the
> > bug in #4 that I previously pointed out. If iommu_dma_map_page()
> > does *not* do
> >
> > aligned_size = iova_align(iovad, size);
> >
> > but swiotlb_tbl_map_single() rounds up the size based on
> > alloc_align_mask *after* adding the offset modulo
> > min_align_mask, then the rounded-up size won't exceed IO_TLB_SEGSIZE slots,
> > regardless of which bits are set in orig_addr.
>
> Ah, neat, I had a gut feeling that something like that might also fall
> out, I just didn't feel like working through the details to see if
> "simpler" could lead to "objectively better" :)
>
> I guess at worst we might also need to pass an alloc_align_mask to
> swiotlb_max_mapping_size() as well, but even that's not necessarily a
> bad thing if it keeps the equivalent calculations close together within
> SWIOTLB and makes things more robust overall.
>
I haven't seen a reason to incorporate alloc_align_mask into
swiotlb_max_mapping_size(). But let me confirm my
understanding of this scenario:
1. The requested size without any rounding is 11K (just an example).
2. The original address starts at offset 7K modulo a 64K granule.
3. The min_align_mask for the device is 4K - 1
4. Then it's OK for swiotlb to return an address at offset 3K modulo
the 64K granule. Such an address meets the min_align_mask, even
though it is a different offset in the granule.
5. swiotlb will allocate 3K of pre-padding for the min_align_mask
requirement. If swiotlb also does the rounding up, it would take
the original 11K, add 3K of pre-padding, then round up to 64K and
effectively allocate 50K of post-padding.
6. The zeroing of the pre-padding and post-padding is messed
up in iommu_dma_map_page(), but that's a separate issue.
Assuming my understanding is correct, this works correctly
when the originally requested size is as large as
swiotlb_max_mapping_size(). It works because the
pre-padding is never more than min_align_mask, and
swiotlb_max_mapping_size() takes that pre-padding into
account.
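The arithmetic of steps 1-5 above, as a minimal sketch (the 64k
granule, 4k-1 min_align_mask, 11K size and 7K offset are the assumed
numbers from the scenario, not from any real driver):

#include <stdio.h>

int main(void)
{
        unsigned long granule        = 64 * 1024;      /* IOVA granule          */
        unsigned long min_align_mask = 4 * 1024 - 1;   /* device min_align_mask */
        unsigned long size           = 11 * 1024;      /* requested size        */
        unsigned long orig_offset    = 7 * 1024;       /* orig_addr % granule   */

        unsigned long pre_pad  = orig_offset & min_align_mask;            /* 3K  */
        unsigned long alloc    = (pre_pad + size + granule - 1) & ~(granule - 1);
        unsigned long post_pad = alloc - pre_pad - size;                  /* 50K */

        printf("pre-padding  %lu KiB\n", pre_pad / 1024);
        printf("allocation   %lu KiB\n", alloc / 1024);
        printf("post-padding %lu KiB\n", post_pad / 1024);
        return 0;
}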
I would normally propose a patch to implement these changes,
but I don't have hardware on which to test IOMMU code paths.
But I'll write an untested patch if that would be a helpful starting
point for someone else to test.
FYI, I'm treating this discussion as separate from Will's patch
set. It's something that could be done as a follow-on set.
Michael
From: Robin Murphy <[email protected]> Sent: Monday, March 4, 2024 8:54 AM
>
> On 04/03/2024 4:10 pm, Michael Kelley wrote:
> > From: Will Deacon <[email protected]> Sent: Monday, March 4, 2024 8:02 AM
> >>
> >> Hi folks,
> >>
> >> On Mon, Mar 04, 2024 at 04:55:06PM +0100, Petr Tesařík wrote:
> >>> On Mon, 4 Mar 2024 13:37:56 +0000
> >>> Robin Murphy <[email protected]> wrote:
> >>>> On 04/03/2024 11:00 am, Petr Tesařík wrote:
> >>>> [...]
> >>>>>> Here's my take on tying all the threads together. There are
> >>>>>> four alignment combinations:
> >>>>>>
> >>>>>> 1. alloc_align_mask: zero; min_align_mask: zero
> >>
> >> Based on this ^^^ ...
> >>
> >>>>>> xen_swiotlb_map_page() and dma_direct_map_page() are #1 or #2
> >>>>>> via swiotlb_map() and swiotlb_tbl_map_single()
> >>>>>>
> >>>>>> iommu_dma_map_page() is #3 and #4 via swiotlb_tbl_map_single()
> >>>>>>
> >>>>>> swiotlb_alloc() is #3, directly to swiotlb_find_slots()
> >>>>>>
> >>>>>> For #1, the returned physical address has no constraints if
> >>>>>> the requested size is less than a page. For page size or
> >>>>>> greater, the discussed historical requirement for page
> >>>>>> alignment applies.
> >>
> >> ... and this ^^^ ...
> >>
> >>
> >>> I believe this patch series is now good as is, except the commit
> >>> message should make it clear that alloc_align_mask and min_align_mask
> >>> can both be zero, but that simply means no alignment constraints.
> >>
> >> ... my (possibly incorrect!) reading of the thread so far is that we
> >> should preserve page-aligned allocation in this case if the allocation
> >> size is >= PAGE_SIZE.
> >>
> >> Something like the diff below, to replace this final patch?
> >>
> >> Will
> >>
> >> --->8
> >>
> >> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> >> index c381a7ed718f..67eac05728c0 100644
> >> --- a/kernel/dma/swiotlb.c
> >> +++ b/kernel/dma/swiotlb.c
> >> @@ -992,6 +992,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> >> BUG_ON(!nslots);
> >> BUG_ON(area_index >= pool->nareas);
> >>
> >> + /*
> >> + * Historically, allocations >= PAGE_SIZE were guaranteed to be
> >> + * page-aligned in the absence of any other alignment requirements.
> >> + * Since drivers may be relying on this, preserve the old behaviour.
> >> + */
> >> + if (!alloc_align_mask && !iotlb_align_mask && alloc_size >= PAGE_SIZE)
> >> + alloc_align_mask = PAGE_SIZE - 1;
> >> +
> >
> > Yes, I think that should do it.
>
> In principle it might be more logical to fudge this into
> iotlb_align_mask rather than alloc_align_mask
I'm not understanding what you are getting at, but maybe we are
interpreting the historical page alignment requirement differently.
I think of the page alignment requirement as independent of the
orig_addr -- the returned physical address should always be exactly
page aligned, and not offset to match bits in orig_addr. If that's
the case, then implementing the page alignment via
alloc_align_mask is logically the right place. Fudging into
iotlb_align_mask would do matching of bits in orig_addr.
Or is there something else I'm not considering?
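For what it's worth, a minimal sketch of the difference between the two
placements, with an arbitrary example orig_addr (illustrative only, not
code from the series):

#include <stdio.h>

int main(void)
{
        unsigned long page_mask_low = 4095UL;           /* PAGE_SIZE - 1     */
        unsigned long orig_addr     = 0x12345a00UL;     /* arbitrary example */

        /*
         * Folded into alloc_align_mask, the search only accepts slots
         * whose low page bits are zero, i.e. exact page alignment.
         */
        printf("via alloc_align_mask: low bits must be 0x%lx\n", 0UL);

        /*
         * Folded into iotlb_align_mask, the search would instead accept
         * slots whose low page bits match orig_addr.
         */
        printf("via iotlb_align_mask: low bits must be 0x%lx\n",
               orig_addr & page_mask_low);
        return 0;
}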
Michael
> - since that's really the
> effective behaviour to preserve for streaming mappings - and then pass
> an explicit alloc_align_mask from swiotlb_alloc() to honour the
> dma-coherent requirements. However I also wouldn't really object to not
> going that far and instead just making the comment a bit clearer that
> this is still serving both purposes.
>
> Cheers,
> Robin.
>
> >
> > Michael
> >
> >> /*
> >> * Ensure that the allocation is at least slot-aligned and update
> >> * 'iotlb_align_mask' to ignore bits that will be preserved when
> >> @@ -1006,13 +1014,6 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> >> */
> >> stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
> >>
> >> - /*
> >> - * For allocations of PAGE_SIZE or larger only look for page aligned
> >> - * allocations.
> >> - */
> >> - if (alloc_size >= PAGE_SIZE)
> >> - stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
> >> -
> >> spin_lock_irqsave(&area->lock, flags);
> >> if (unlikely(nslots > pool->area_nslabs - area->used))
> >> goto not_found;
> >
On Mon, 4 Mar 2024 16:10:46 +0000
Michael Kelley <[email protected]> wrote:
> From: Will Deacon <[email protected]> Sent: Monday, March 4, 2024 8:02 AM
> >
> > Hi folks,
> >
> > On Mon, Mar 04, 2024 at 04:55:06PM +0100, Petr Tesařík wrote:
> > > On Mon, 4 Mar 2024 13:37:56 +0000
> > > Robin Murphy <[email protected]> wrote:
> > > > On 04/03/2024 11:00 am, Petr Tesařík wrote:
> > > > [...]
> > > > >> Here's my take on tying all the threads together. There are
> > > > >> four alignment combinations:
> > > > >>
> > > > >> 1. alloc_align_mask: zero; min_align_mask: zero
> >
> > Based on this ^^^ ...
> >
> > > > >> xen_swiotlb_map_page() and dma_direct_map_page() are #1 or #2
> > > > >> via swiotlb_map() and swiotlb_tbl_map_single()
> > > > >>
> > > > >> iommu_dma_map_page() is #3 and #4 via swiotlb_tbl_map_single()
> > > > >>
> > > > >> swiotlb_alloc() is #3, directly to swiotlb_find_slots()
> > > > >>
> > > > >> For #1, the returned physical address has no constraints if
> > > > >> the requested size is less than a page. For page size or
> > > > >> greater, the discussed historical requirement for page
> > > > >> alignment applies.
> >
> > ... and this ^^^ ...
> >
> >
> > > I believe this patch series is now good as is, except the commit
> > > message should make it clear that alloc_align_mask and min_align_mask
> > > can both be zero, but that simply means no alignment constraints.
> >
> > ... my (possibly incorrect!) reading of the thread so far is that we
> > should preserve page-aligned allocation in this case if the allocation
> > size is >= PAGE_SIZE.
> >
> > Something like the diff below, to replace this final patch?
> >
> > Will
> >
> > --->8
> >
> > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > index c381a7ed718f..67eac05728c0 100644
> > --- a/kernel/dma/swiotlb.c
> > +++ b/kernel/dma/swiotlb.c
> > @@ -992,6 +992,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > BUG_ON(!nslots);
> > BUG_ON(area_index >= pool->nareas);
> >
> > + /*
> > + * Historically, allocations >= PAGE_SIZE were guaranteed to be
> > + * page-aligned in the absence of any other alignment requirements.
> > + * Since drivers may be relying on this, preserve the old behaviour.
> > + */
> > + if (!alloc_align_mask && !iotlb_align_mask && alloc_size >= PAGE_SIZE)
> > + alloc_align_mask = PAGE_SIZE - 1;
> > +
>
> Yes, I think that should do it.
Sure, this will solve the allocations. But my understanding of this
same thread is that we don't need it here. The historical page order
alignment applies ONLY to allocations, NOT to mappings. It is
documented in Documentation/core-api/dma-api-howto.rst under Consistent
DMA mappings, for dma_alloc_coherent(). IIUC it does not apply to the
streaming DMA mappings. At least, it would explain why nobody
complained that the more strict guarantee for sizes greater than
PAGE_SIZE was not kept...
The SWIOTLB can be used for allocation if CONFIG_DMA_RESTRICTED_POOL=y,
but this case is handled by patch 3/6 of this patch series.
Do I miss something again?
Petr T
> Michael
>
> > /*
> > * Ensure that the allocation is at least slot-aligned and update
> > * 'iotlb_align_mask' to ignore bits that will be preserved when
> > @@ -1006,13 +1014,6 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > */
> > stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
> >
> > - /*
> > - * For allocations of PAGE_SIZE or larger only look for page aligned
> > - * allocations.
> > - */
> > - if (alloc_size >= PAGE_SIZE)
> > - stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
> > -
> > spin_lock_irqsave(&area->lock, flags);
> > if (unlikely(nslots > pool->area_nslabs - area->used))
> > goto not_found;
>
On Mon, 4 Mar 2024 03:31:34 +0000
Michael Kelley <[email protected]> wrote:
> From: Petr Tesařík <[email protected]> Sent: Friday, March 1, 2024 10:42 AM
> >
> > On Fri, 1 Mar 2024 17:54:06 +0000
> > Robin Murphy <[email protected]> wrote:
> >
> > > On 2024-03-01 5:08 pm, Petr Tesařík wrote:
> > > > On Fri, 1 Mar 2024 16:39:27 +0100
> > > > Petr Tesařík <[email protected]> wrote:
> > > >
> > > >> On Thu, 29 Feb 2024 16:47:56 +0100
> > > >> Christoph Hellwig <[email protected]> wrote:
> > > >>
> > > >>> On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote:
> > > >>>> Any thoughts on how that historical behavior should apply if
> > > >>>> the DMA min_align_mask is non-zero, or the alloc_align_mask
> > > >>>> parameter to swiotlb_tbl_map_single() is non-zero? As currently
> > > >>>> used, alloc_align_mask is page aligned if the IOMMU granule is
> > > >>>> >= PAGE_SIZE. But a non-zero min_align_mask could mandate
> > > >>>> returning a buffer that is not page aligned. Perhaps do the
> > > >>>> historical behavior only if alloc_align_mask and min_align_mask
> > > >>>> are both zero?
> > > >>>
> > > >>> I think the driver setting min_align_mask is a clear indicator
> > > >>> that the driver requested a specific alignment and the defaults
> > > >>> don't apply. For swiotlb_tbl_map_single as used by dma-iommu
> > > >>> I'd have to take a closer look at how it is used.
> > > >>
> > > >> I'm not sure it helps in this discussion, but let me dive into a bit
> > > >> of ancient history to understand how we ended up here.
> > > >>
> > > >> IIRC this behaviour was originally motivated by limitations of PC AT
> > > >> hardware. Intel 8237 is a 16-bit DMA controller. To make it somehow
> > > >> usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM
> > > >> added a page register, but it was on a separate chip and it did not
> > > >> increment when the 8237 address rolled over back to zero. Effectively,
> > > >> the page register selected a 64K-aligned window of 64K buffers.
> > > >> Consequently, DMA buffers could not cross a 64K physical boundary.
> > > >>
> > > >> Thanks to how the buddy allocator works, the 64K-boundary constraint
> > > >> was satisfied by allocation size, and drivers took advantage of it when
> > > >> allocating device buffers. IMO software bounce buffers simply followed
> > > >> the same logic that worked for buffers allocated by the buddy allocator.
> > > >>
> > > >> OTOH min_align_mask was motivated by NVME which prescribes the value of
> > > >> a certain number of low bits in the DMA address (for simplicity assumed
> > > >> to be identical with the same bits in the physical address).
> > > >>
> > > >> The only pre-existing user of alloc_align_mask is x86 IOMMU code, and
> > > >> IIUC it is used to guarantee that unaligned transactions do not share
> > > >> the IOMMU granule with another device. This whole thing is weird,
> > > >> because swiotlb_tbl_map_single() is called like this:
> > > >>
> > > >> aligned_size = iova_align(iovad, size);
> > > >> phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
> > > >> iova_mask(iovad), dir, attrs);
> > > >>
> > > >> Here:
> > > >>
> > > >> * alloc_size = iova_align(iovad, size)
> > > >> * alloc_align_mask = iova_mask(iovad)
> > > >>
> > > >> Now, iova_align() rounds up its argument to a multiple of iova granule
> > > >> and iova_mask() is simply "granule - 1". This works, because granule
> > > >> size must be a power of 2, and I assume it must also be >= PAGE_SIZE.
> > > >>
> > > >> In that case, the alloc_align_mask argument is not even needed if you
> > > >> adjust the code to match documentation---the resulting buffer will be
> > > >> aligned to a granule boundary by virtue of having a size that is a
> > > >> multiple of the granule size.
> > > >>
> > > >> To sum it up:
> > > >>
> > > >> 1. min_align_mask is by far the most important constraint. Devices will
> > > >> simply stop working if it is not met.
> > > >> 2. Alignment to the smallest PAGE_SIZE order which is greater than or
> > > >> equal to the requested size has been documented, and some drivers
> > > >> may rely on it.
> > > >> 3. alloc_align_mask is a misguided fix for a bug in the above.
> > > >>
> > > >> Correct me if anything of the above is wrong.
> > > >
> > > > I thought about it some more, and I believe I know what should happen
> > > > if the first two constraints appear to be mutually exclusive.
> > > >
> > > > First, the alignment based on size does not guarantee that the resulting
> > > > physical address is aligned. In fact, the lowest IO_TLB_SHIFT bits must
> > > > be always identical to the original buffer address.
> > > >
> > > > Let's take an example request like this:
> > > >
> > > > TLB_SIZE = 0x00000800
> > > > min_align_mask = 0x0000ffff
> > > > orig_addr = 0x....1234
> > > > alloc_size = 0x00002800
> > > >
> > > > Minimum alignment mask requires to keep the 1234 at the end. Allocation
> > > > size requires a buffer that is aligned to 16K. Of course, there is no
> > > > 16K-aligned slot with slot_address & 0x7ff == 0x200, but if IO_TLB_SHIFT
> > > > was 14, it would be slot_address & 0x3fff == 0 (low IO_TLB_SHIFT are
> > > > masked off). Since the SWIOTLB API does not guarantee any specific
> > > > value of IO_TLB_SHIFT, callers cannot rely on it. That means 0x1234 is a
> > > > perfectly valid bounce buffer address for this example.
> > > >
> > > > The caller may rightfully expect that the 16K granule containing the
> > > > bounce buffer is not shared with any other user. For the above case I
> > > > suggest to increase the allocation size to 0x4000 already in
> > > > swiotlb_tbl_map_single() and treat 0x1234 as the offset from the slot
> > > > address.
> > >
> > > That doesn't make sense - a caller asks to map some range of kernel
> > > addresses and they get back a corresponding range of DMA addresses; they
> > > cannot make any reasonable assumptions about DMA addresses *outside*
> > > that range. In the example above, the caller has explicitly chosen not
> > > to map the range xxx0000-xxx1234; if they expect the device to actually
> > > access bytes in the DMA range yyy0000-yyy1234, then they should have
> > > mapped the whole range starting from xxx0000 and it is their own error.
> >
> > I agree that the range was not requested. But it is not wrong if
> > SWIOTLB overallocates. In fact, it usually does overallocate because it
> > works with slot granularity.
> >
> > > SWIOTLB does not and cannot provide any memory protection itself, so
> > > there is no functional benefit to automatically over-allocating, all it
> > > will do is waste slots. iommu-dma *can* provide memory protection
> > > between individual mappings via additional layers that SWIOTLB doesn't
> > > know about, so in that case it's iommu-dma's responsibility to
> > > explicitly manage whatever over-allocation is necessary at the SWIOTLB
> > > level to match the IOMMU level.
> >
> > I'm trying to understand what the caller expects to get if they request
> > buffer alignment (either implicitly through the mapping size or
> > explicitly with an alloc_align_mask) together with a min_align_mask and
> > non-zero low bits covered by the buffer alignment.
> >
> > In other words, if iommu_dma_map_page() gets into this situation:
> >
> > * granule size is 4k
> > * device specifies 64k min_align_mask
> > * bit 11 of the original buffer address is non-zero
> >
> > Then you ask for a pair of slots where the first slot has bit 11 == 0
> > (required by alignment to granule size) and also has bit 11 == 1
> > (required to preserve the lowest 16 bits of the original address).
> >
> > Sure, you can fail such a mapping, but is it what the caller expects?
> >
Upfront, thank you very much for the overview. Much appreciated!
> Here's my take on tying all the threads together. There are
> four alignment combinations:
>
> 1. alloc_align_mask: zero; min_align_mask: zero
> 2. alloc_align_mask: zero; min_align_mask: non-zero
> 3. alloc_align_mask: non-zero; min_align_mask: zero/ignored
> 4. alloc_align_mask: non-zero; min_align_mask: non-zero
What does "min_align_mask: zero/ignored" mean? Under which
circumstances should a non-zero min_align_mask be ignored?
> xen_swiotlb_map_page() and dma_direct_map_page() are #1 or #2
> via swiotlb_map() and swiotlb_tbl_map_single()
>
> iommu_dma_map_page() is #3 and #4 via swiotlb_tbl_map_single()
>
> swiotlb_alloc() is #3, directly to swiotlb_find_slots()
>
> For #1, the returned physical address has no constraints if
> the requested size is less than a page. For page size or
> greater, the discussed historical requirement for page
> alignment applies.
>
> For #2, min_align_mask governs the bits of the returned
> physical address that must match the original address. When
> needed, swiotlb must also allocate pre-padding aligned to
> IO_TLB_SIZE that precedes the returned physical address. A
> request size <= swiotlb_max_mapping_size() will not exceed
> IO_TLB_SEGSIZE even with the padding. The historical
> requirement for page alignment does not apply because the
> driver has explicitly used the newer min_align_mask feature.
What is the idea here? Is it the assumption that only old drivers rely
on page alignment, so if they use min_align_mask, it proves that they
are new and must not rely on page alignment?
> For #3, alloc_align_mask specifies the required alignment. No
> pre-padding is needed. Per earlier comments from Robin[1],
> it's reasonable to assume alloc_align_mask (i.e., the granule)
> is >= IO_TLB_SIZE. The original address is not relevant in
> determining the alignment, and the historical page alignment
> requirement does not apply since alloc_align_mask explicitly
> states the alignment.
>
> For #4, the returned physical address must match the bits
> in the original address specified by min_align_mask. swiotlb
> must also allocate pre-padding aligned to
> alloc_align_mask that precedes the returned physical address.
> Also per Robin[1], assume alloc_align_mask is >=
> min_align_mask, which solves the conflicting alignment
> problem pointed out by Petr[2]. Perhaps we should add a
> "WARN_ON(alloc_align_mask < min_align_mask)" rather than
> failing depending on which bits of the original address are
> set. Again, the historical requirement for page alignment does
> not apply.
AFAICS the only reason this works in practice is that there are only
two in-tree users of min_align_mask: NVMe and Hyper-V. Both use a mask
of 12 bits, and the IOVA granule size is never smaller than 4K.
If we want to rely on this, then I propose making it a BUG_ON() rather
than a WARN_ON().
> I believe Will's patch set implements everything in #2, #3,
> and #4, except my suggested WARN_ON in #4. The historical page
> alignment in #1 presumably needs to be added. Also, the current
> implementation of #4 has a bug in that IO_TLB_SEGSIZE could be
> exceeded as pointed out here[3], but Robin was OK with not
> fixing that now.
Agreed.
Thank you again, this helps a lot.
Petr T
>
> Michael
>
> [1] https://lore.kernel.org/linux-iommu/20240221113504.7161-1-will@kernel.org/T/#mbd31cbfbdf841336e25f37758c8af1a0b6d8f3eb
> [2] https://lore.kernel.org/linux-iommu/[email protected]/T/#mf631679b302b1f5c7cacc82f4c15fb4b19f3dea1
> [3] https://lore.kernel.org/linux-iommu/20240221113504.7161-1-will@kernel.org/T/#m4179a909777ec751f3dc15b515617477e6682600
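As a quick way to picture those four combinations in one place, here is a purely illustrative helper (the function name is invented and this is not the series' implementation; it only encodes the effective allocation alignment, not the orig_addr bit-matching that min_align_mask additionally requires):
#include <stdio.h>
#include <stddef.h>
#define PAGE_SIZE	4096UL
static unsigned long effective_alloc_mask(unsigned long alloc_align_mask,
					  unsigned int min_align_mask,
					  size_t alloc_size)
{
	if (alloc_align_mask)		/* cases 3 and 4: explicit alignment */
		return alloc_align_mask;
	if (min_align_mask)		/* case 2: placement follows orig_addr bits */
		return 0;
	if (alloc_size >= PAGE_SIZE)	/* case 1, large: historical page alignment */
		return PAGE_SIZE - 1;
	return 0;			/* case 1, small: no constraint */
}
int main(void)
{
	/* Case 1 with a two-page request: prints 0xfff, i.e. page alignment. */
	printf("%#lx\n", effective_alloc_mask(0, 0, 2 * PAGE_SIZE));
	return 0;
}
The WARN_ON()/BUG_ON() discussed for case 4 would sit alongside something like this, catching alloc_align_mask < min_align_mask.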
On 2024-03-04 6:22 pm, Michael Kelley wrote:
> From: Robin Murphy <[email protected]> Sent: Monday, March 4, 2024 8:54 AM
>>
>> On 04/03/2024 4:10 pm, Michael Kelley wrote:
>>> From: Will Deacon <[email protected]> Sent: Monday, March 4, 2024 8:02 AM
>>>>
>>>> Hi folks,
>>>>
>>>> On Mon, Mar 04, 2024 at 04:55:06PM +0100, Petr Tesařík wrote:
>>>>> On Mon, 4 Mar 2024 13:37:56 +0000
>>>>> Robin Murphy <[email protected]> wrote:
>>>>>> On 04/03/2024 11:00 am, Petr Tesařík wrote:
>>>>>> [...]
>>>>>>>> Here's my take on tying all the threads together. There are
>>>>>>>> four alignment combinations:
>>>>>>>>
>>>>>>>> 1. alloc_align_mask: zero; min_align_mask: zero
>>>>
>>>> Based on this ^^^ ...
>>>>
>>>>>>>> xen_swiotlb_map_page() and dma_direct_map_page() are #1 or #2
>>>>>>>> via swiotlb_map() and swiotlb_tbl_map_single()
>>>>>>>>
>>>>>>>> iommu_dma_map_page() is #3 and #4 via swiotlb_tbl_map_single()
>>>>>>>>
>>>>>>>> swiotlb_alloc() is #3, directly to swiotlb_find_slots()
>>>>>>>>
>>>>>>>> For #1, the returned physical address has no constraints if
>>>>>>>> the requested size is less than a page. For page size or
>>>>>>>> greater, the discussed historical requirement for page
>>>>>>>> alignment applies.
>>>>
>>>> ... and this ^^^ ...
>>>>
>>>>
>>>>> I believe this patch series is now good as is, except the commit
>>>>> message should make it clear that alloc_align_mask and min_align_mask
>>>>> can both be zero, but that simply means no alignment constraints.
>>>>
>>>> ... my (possibly incorrect!) reading of the thread so far is that we
>>>> should preserve page-aligned allocation in this case if the allocation
>>>> size is >= PAGE_SIZE.
>>>>
>>>> Something like the diff below, to replace this final patch?
>>>>
>>>> Will
>>>>
>>>> --->8
>>>>
>>>> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
>>>> index c381a7ed718f..67eac05728c0 100644
>>>> --- a/kernel/dma/swiotlb.c
>>>> +++ b/kernel/dma/swiotlb.c
>>>> @@ -992,6 +992,14 @@ static int swiotlb_search_pool_area(struct device
>>>> *dev, struct io_tlb_pool *pool
>>>> BUG_ON(!nslots);
>>>> BUG_ON(area_index >= pool->nareas);
>>>>
>>>> + /*
>>>> + * Historically, allocations >= PAGE_SIZE were guaranteed to be
>>>> + * page-aligned in the absence of any other alignment requirements.
>>>> + * Since drivers may be relying on this, preserve the old behaviour.
>>>> + */
>>>> + if (!alloc_align_mask && !iotlb_align_mask && alloc_size >= PAGE_SIZE)
>>>> + alloc_align_mask = PAGE_SIZE - 1;
>>>> +
>>>
>>> Yes, I think that should do it.
>>
>> In principle it might be more logical to fudge this into
>> iotlb_align_mask rather than alloc_align_mask
>
> I'm not understanding what you are getting at, but maybe we are
> interpreting the historical page alignment requirement differently.
> I think of the page alignment requirement as independent of the
> orig_addr -- the returned physical address should always be exactly
> page aligned, and not offset to match bits in orig_addr. If that's
> the case, then implementing the page alignment via
> alloc_align_mask is logically the right place. Fudging into
> iotlb_align_mask would do matching of bits in orig_addr.
>
> Or is there something else I'm not considering?
In short, it's that alloc_align_mask is concerned with how slots are
allocated, while min_align_mask is concerned with where the data itself
is bounced (which may also place certain constraints on allocation).
The reason this page-alignment was here in the first place was seemingly
to serve the original swiotlb_alloc() path, and thus it could be
considered functionally equivalent to what is now alloc_align_mask.
However the side-effect it happened to also have for streaming mappings
was to prevent sufficiently large page-aligned buffers being bounced to
a non-page-aligned address, which apparently managed to work well enough
for NVMe until 64K pages became more common and ruined things by being
too big, and we formalised *that* desired effect into min_align_mask.
I get that forcing iotlb_align_mask here would introduce a stronger
constraint which affects non-page-aligned buffers as well, and wanting
to avoid that is perhaps a reasonable concern, so once again I'm really
just arguing semantics. Given the head-scratching we've been through
over this already, I think it would be valuable to at least be clearer
that a significant part of the "old behaviour" is to do the right thing
for swiotlb_alloc(), which is very much still current and necessary, but
the reason we're not setting alloc_align_mask there is because doing it
here also preserves this legacy side-effect which acts as a limited
version of min_align_mask to preserve page alignment when bouncing.
Cheers,
Robin.
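To make that distinction concrete with made-up numbers (a rough picture only, not the kernel's exact bounce arithmetic): the slot search decides where the slots may start, while min_align_mask decides where the data lands inside them:
#include <stdio.h>
int main(void)
{
	/* Made-up example: 64KiB IOVA granule, NVMe-style 4KiB min_align_mask. */
	unsigned long alloc_align_mask = 0xffff;	/* slots start 64KiB-aligned */
	unsigned long min_align_mask   = 0x0fff;	/* low 12 bits follow orig_addr */
	unsigned long orig_addr        = 0x12345234;
	unsigned long slot_base        = 0x98330000;	/* pretend slot-search result */
	unsigned long data_addr;
	if (slot_base & alloc_align_mask)		/* allocation-side constraint */
		printf("slot_base violates alloc_align_mask\n");
	data_addr = slot_base + (orig_addr & min_align_mask);	/* bounce-side constraint */
	printf("data bounced to %#lx\n", data_addr);	/* 0x98330234 */
	return 0;
}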
On Mon, Mar 04, 2024 at 01:37:56PM +0000, Robin Murphy wrote:
> FWIW I'm also starting to wonder about getting rid of the alloc_size
> argument and just have SWIOTLB round the end address up to alloc_align_mask
> itself as part of all these calculations. Seems like it could potentially
> end up a little simpler, maybe?
Yes, that does sound simpler and most importantly harder to get wrong..
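One possible shape of that simplification, purely as a sketch (this is not a patch, and ALIGN() below is just a local copy of the usual power-of-two round-up idiom):
#include <stdio.h>
#include <stddef.h>
#define ALIGN(x, a)	(((x) + (a) - 1) & ~((unsigned long)(a) - 1))
int main(void)
{
	unsigned long alloc_align_mask = 0xfff;		/* 4KiB granule */
	size_t size = 0x1600;				/* caller's mapping size */
	/* swiotlb could derive the allocation size itself instead of taking a
	 * separate alloc_size argument from its callers. */
	size_t alloc_size = ALIGN(size, alloc_align_mask + 1);
	printf("alloc_size = %#zx\n", alloc_size);	/* 0x2000 */
	return 0;
}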
On Mon, Mar 04, 2024 at 08:04:28PM +0100, Petr Tesařík wrote:
> Sure, this will solve the allocations. But my understanding of this
> same thread is that we don't need it here. The historical page order
> alignment applies ONLY to allocations, NOT to mappings. It is
> documented in Documentation/core-api/dma-api-howto.rst under Consistent
> DMA mappings, for dma_alloc_coherent(). IIUC it does not apply to the
> streaming DMA mappings. At least, it would explain why nobody
> complained that the more strict guarantee for sizes greater than
> PAGE_SIZE was not kept...
Yes. arm32 (and, before the dma-direct conversion, various other
architectures) relaxed the requirement to PAGE_SIZE alignment,
and at least native dma-direct has never returned less than PAGE_SIZE
alignment even for smaller allocations (as they are all rounded up
to PAGE_SIZE). So I think the documentation could also use some
updating to match reality.
On Tue, 5 Mar 2024 11:20:13 +0000
Robin Murphy <[email protected]> wrote:
> On 2024-03-04 6:22 pm, Michael Kelley wrote:
> > From: Robin Murphy <[email protected]> Sent: Monday, March 4, 2024 8:54 AM
> >>
> >> On 04/03/2024 4:10 pm, Michael Kelley wrote:
> >>> From: Will Deacon <[email protected]> Sent: Monday, March 4, 2024 8:02 AM
> >>>>
> >>>> Hi folks,
> >>>>
> >>>> On Mon, Mar 04, 2024 at 04:55:06PM +0100, Petr Tesařík wrote:
> >>>>> On Mon, 4 Mar 2024 13:37:56 +0000
> >>>>> Robin Murphy <[email protected]> wrote:
> >>>>>> On 04/03/2024 11:00 am, Petr Tesařík wrote:
> >>>>>> [...]
> >>>>>>>> Here's my take on tying all the threads together. There are
> >>>>>>>> four alignment combinations:
> >>>>>>>>
> >>>>>>>> 1. alloc_align_mask: zero; min_align_mask: zero
> >>>>
> >>>> Based on this ^^^ ...
> >>>>
> >>>>>>>> xen_swiotlb_map_page() and dma_direct_map_page() are #1 or #2
> >>>>>>>> via swiotlb_map() and swiotlb_tbl_map_single()
> >>>>>>>>
> >>>>>>>> iommu_dma_map_page() is #3 and #4 via swiotlb_tbl_map_single()
> >>>>>>>>
> >>>>>>>> swiotlb_alloc() is #3, directly to swiotlb_find_slots()
> >>>>>>>>
> >>>>>>>> For #1, the returned physical address has no constraints if
> >>>>>>>> the requested size is less than a page. For page size or
> >>>>>>>> greater, the discussed historical requirement for page
> >>>>>>>> alignment applies.
> >>>>
> >>>> ... and this ^^^ ...
> >>>>
> >>>>
> >>>>> I believe this patch series is now good as is, except the commit
> >>>>> message should make it clear that alloc_align_mask and min_align_mask
> >>>>> can both be zero, but that simply means no alignment constraints.
> >>>>
> >>>> ... my (possibly incorrect!) reading of the thread so far is that we
> >>>> should preserve page-aligned allocation in this case if the allocation
> >>>> size is >= PAGE_SIZE.
> >>>>
> >>>> Something like the diff below, to replace this final patch?
> >>>>
> >>>> Will
> >>>>
> >>>> --->8
> >>>>
> >>>> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> >>>> index c381a7ed718f..67eac05728c0 100644
> >>>> --- a/kernel/dma/swiotlb.c
> >>>> +++ b/kernel/dma/swiotlb.c
> >>>> @@ -992,6 +992,14 @@ static int swiotlb_search_pool_area(struct device
> >>>> *dev, struct io_tlb_pool *pool
> >>>> BUG_ON(!nslots);
> >>>> BUG_ON(area_index >= pool->nareas);
> >>>>
> >>>> + /*
> >>>> + * Historically, allocations >= PAGE_SIZE were guaranteed to be
> >>>> + * page-aligned in the absence of any other alignment requirements.
> >>>> + * Since drivers may be relying on this, preserve the old behaviour.
> >>>> + */
> >>>> + if (!alloc_align_mask && !iotlb_align_mask && alloc_size >= PAGE_SIZE)
> >>>> + alloc_align_mask = PAGE_SIZE - 1;
> >>>> +
> >>>
> >>> Yes, I think that should do it.
> >>
> >> In principle it might be more logical to fudge this into
> >> iotlb_align_mask rather than alloc_align_mask
> >
> > I'm not understanding what you are getting at, but maybe we are
> > interpreting the historical page alignment requirement differently.
> > I think of the page alignment requirement as independent of the
> > orig_addr -- the returned physical address should always be exactly
> > page aligned, and not offset to match bits in orig_addr. If that's
> > the case, then implementing the page alignment via
> > alloc_align_mask is logically the right place. Fudging into
> > iotlb_align_mask would do matching of bits in orig_addr.
> >
> > Or is there something else I'm not considering?
>
> In short, it's that alloc_align_mask is concerned with how slots are
> allocated, while min_align_mask is concerned with where the data itself
> is bounced (which may also place certain constraints on allocation).
>
> The reason this page-alignment was here in the first place was seemingly
> to serve the original swiotlb_alloc() path, and thus it could be
> considered functionally equivalent to what is now alloc_align_mask.
> However the side-effect it happened to also have for streaming mappings
> was to prevent sufficiently large page-aligned buffers being bounced to
> a non-page-aligned address, which apparently managed to work well enough
> for NVMe until 64K pages became more common and ruined things by being
> too big, and we formalised *that* desired effect into min_align_mask.
>
> I get that forcing iotlb_align_mask here would introduce a stronger
> constraint which affects non-page-aligned buffers as well, and wanting
> to avoid that is perhaps a reasonable concern, so once again I'm really
> just arguing semantics. Given the head-scratching we've been through
> over this already, I think it would be valuable to at least be clearer
> that a significant part of the "old behaviour" is to do the right thing
> for swiotlb_alloc(), which is very much still current and necessary, but
> the reason we're not setting alloc_align_mask there is because doing it
> here also preserves this legacy side-effect which acts as a limited
> version of min_align_mask to preserve page alignment when bouncing.
I'm confused. Doesn't PATCH 3/6 of this series change swiotlb_alloc()
to use alloc_align_mask (and finally do it properly for sizes greater
than page size)?
Anyway, the placement of streaming DMA buffers is not under control of
the device driver IIUC. These buffers may have been allocated without
even knowing that they would ever be used for I/O. I cannot imagine how
a device driver could rely on their physical address beyond the offset
within a page. And even then it's subtly broken, because the size of
the mapping may be less than PAGE_SIZE (e.g. 512 bytes, the traditional
sector size under direct I/O). Even if we keep the alignment based on
mapping size for now, it should probably be deprecated in favour of
min_align_mask.
Petr T