2024-02-13 21:55:14

by Nicolin Chen

Subject: [PATCH v1 0/2] nvme-pci: Fix dma-iommu mapping failures when PAGE_SIZE=64KB

It's observed that an NVME device is causing timeouts when Ubuntu boots
with a kernel configured with PAGE_SIZE=64KB due to failures in swiotlb:
systemd[1]: Started Journal Service.
=> nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
note: journal-offline[392] exited with irqs disabled
note: journal-offline[392] exited with preempt_count 1

An NVMe device under a PCIe bus can be behind an IOMMU, so DMA mappings
going through dma-iommu might also be redirected to swiotlb allocations.
Similar to dma_direct_max_mapping_size(), dma-iommu should implement its
dma_map_ops->max_mapping_size to return swiotlb_max_mapping_size() too.
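
For reference, the direct-map path already does exactly this (a sketch of
dma_direct_max_mapping_size() from kernel/dma/direct.c, quoted from memory,
so details may vary across kernel versions):

size_t dma_direct_max_mapping_size(struct device *dev)
{
        /* If SWIOTLB is active, use its maximum mapping size */
        if (is_swiotlb_active(dev) &&
            (dma_addressing_limited(dev) || is_swiotlb_force_bounce(dev)))
                return swiotlb_max_mapping_size(dev);
        return SIZE_MAX;
}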

Though an iommu_dma_max_mapping_size() is a must, it alone can't fix the
issue. swiotlb_max_mapping_size() returns 252KB, i.e. the 256KB maximum
swiotlb allocation minus the min_align_mask NVME_CTRL_PAGE_SIZE=4KB.
Meanwhile, when PAGE_SIZE=64KB, dma-iommu can round a 252KB mapping up to
256KB at its "alloc_size" via iovad->granule, which is often set to
PAGE_SIZE. This mismatch between NVME_CTRL_PAGE_SIZE=4KB and PAGE_SIZE=64KB
results in a similar failure, though its signature has a fixed size of
"256KB":
systemd[1]: Started Journal Service.
=> nvme 0000:00:01.0: swiotlb buffer is full (sz: 262144 bytes), total 32768 (slots), used 128 (slots)
note: journal-offline[392] exited with irqs disabled
note: journal-offline[392] exited with preempt_count 1
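
For reference, the 252KB limit comes out of swiotlb_max_mapping_size()
(a sketch of the kernel/dma/swiotlb.c helper, quoted from memory): the
maximum of IO_TLB_SEGSIZE (128) slots * IO_TLB_SIZE (2KB) = 256KB, minus
the rounded-up min_align_mask:

size_t swiotlb_max_mapping_size(struct device *dev)
{
        int min_align_mask = dma_get_min_align_mask(dev);
        int min_align = 0;

        /* Slots skipped to honor min_align_mask reduce the usable size */
        if (min_align_mask)
                min_align = roundup(min_align_mask, IO_TLB_SIZE);

        /* 256KB - roundup(0xfff, 2KB) = 256KB - 4KB = 252KB for NVMe */
        return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE - min_align;
}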

Both failures above occur with NVMe behind an IOMMU when PAGE_SIZE=64KB.
They were likely introduced by the bounce-buffer security feature added in:
commit 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")

So this series bundles two fixes against that commit. They should be
applied together to entirely fix the mapping failures.

Thanks
Nicolin

Nicolin Chen (2):
iommu/dma: Force swiotlb_max_mapping_size on an untrusted device
nvme-pci: Fix iommu map (via swiotlb) failures when PAGE_SIZE=64KB

drivers/iommu/dma-iommu.c | 8 ++++++++
drivers/nvme/host/pci.c | 2 +-
2 files changed, 9 insertions(+), 1 deletion(-)

--
2.43.0



2024-02-13 21:56:05

by Nicolin Chen

Subject: [PATCH v1 2/2] nvme-pci: Fix iommu map (via swiotlb) failures when PAGE_SIZE=64KB

The min_align_mask set here feeds the swiotlb_max_mapping_size()
calculation used by dma_opt_mapping_size() a few lines below.

By default, swiotlb_max_mapping_size() returns 256KB minus
roundup(min_align_mask, IO_TLB_SIZE). Since NVME_CTRL_PAGE_SIZE=4KB,
swiotlb_max_mapping_size() returns 252KB (256KB - 4KB) in this case.

Meanwhile, the alloc_size that dma-iommu eventually passes to
swiotlb_tbl_map_single() is aligned to the attached domain's
iovad->granule. If the domain (backed by an IO page table) uses a 64KB
page size, alloc_size can still ask for 256KB vs. the 252KB limit, which
fails the mapping request:
systemd[1]: Started Journal Service.
=> nvme 0000:00:01.0: swiotlb buffer is full (sz: 262144 bytes), total 32768 (slots), used 128 (slots)
note: journal-offline[392] exited with irqs disabled
note: journal-offline[392] exited with preempt_count 1
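
The roundup happens on dma-iommu's bounce path (a paraphrased sketch of
the relevant lines in iommu_dma_map_page(), not a verbatim quote):

        /* bounce through swiotlb: the size is aligned up to the IOVA granule */
        aligned_size = iova_align(iovad, size); /* 252KB -> 256KB w/ 64KB granule */
        phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
                                      iova_mask(iovad), dir, attrs);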

Another factor is that the attached domain of an NVMe device can change,
and so can the value of iovad->granule, meaning the min_align_mask cannot
simply be set from the iovad->granule value of the currently attached
domain.

Since iovad->granule usually picks the smallest size in pgsize_bitmap,
which often matches the CPU's PAGE_SIZE, simply set min_align_mask to
"PAGE_SIZE - 1".

Note that the other patch "iommu/dma: Force swiotlb_max_mapping_size on
an untrusted device" is required to entirely fix this mapping failure.

Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/nvme/host/pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e6267a6aa380..ad41fe0bf2e3 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2967,7 +2967,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
                 dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
         else
                 dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
-        dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
+        dma_set_min_align_mask(&pdev->dev, PAGE_SIZE - 1);
         dma_set_max_seg_size(&pdev->dev, 0xffffffff);

/*
--
2.43.0


2024-02-13 22:02:45

by Nicolin Chen

Subject: [PATCH v1 1/2] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

The swiotlb does not support a mapping size > swiotlb_max_mapping_size().
However, with a 64KB PAGE_SIZE configuration, an NVMe device is observed
to request mappings of 300KB~512KB, which inevitably fail in swiotlb,
even though the default swiotlb pool has plenty of free slots:
systemd[1]: Started Journal Service.
=> nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
note: journal-offline[392] exited with irqs disabled
note: journal-offline[392] exited with preempt_count 1

Call trace:
[ 3.099918] swiotlb_tbl_map_single+0x214/0x240
[ 3.099921] iommu_dma_map_page+0x218/0x328
[ 3.099928] dma_map_page_attrs+0x2e8/0x3a0
[ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme]
[ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme]
[ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600
[ 3.102321] blk_add_rq_to_plug+0x180/0x2a0
[ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8
[ 3.103463] __submit_bio+0x44/0x220
[ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360
[ 3.103470] submit_bio_noacct+0x180/0x6c8
[ 3.103471] submit_bio+0x34/0x130
[ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8
[ 3.104766] mpage_submit_folio+0xa0/0x100
[ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400
[ 3.104771] ext4_do_writepages+0x6a0/0xd78
[ 3.105615] ext4_writepages+0x80/0x118
[ 3.105616] do_writepages+0x90/0x1e8
[ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0
[ 3.105622] __filemap_fdatawrite_range+0x68/0xb8
[ 3.106656] file_write_and_wait_range+0x84/0x120
[ 3.106658] ext4_sync_file+0x7c/0x4c0
[ 3.106660] vfs_fsync_range+0x3c/0xa8
[ 3.106663] do_fsync+0x44/0xc0

Since untrusted devices might go down the swiotlb pathway with dma-iommu,
these devices should not map a size larger than swiotlb_max_mapping_size().

To fix this bug, add iommu_dma_max_mapping_size() to return
swiotlb_max_mapping_size() for untrusted devices, alongside the
iova_rcache_range() value returned by iommu_dma_opt_mapping_size().

Note that the other patch "nvme-pci: Fix iommu map (via swiotlb) failures
when PAGE_SIZE=64KB" is required to entirely fix this mapping failure.

Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/dma-iommu.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 50ccc4f1ef81..7d1a20da6d94 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1706,6 +1706,13 @@ static size_t iommu_dma_opt_mapping_size(void)
         return iova_rcache_range();
 }
 
+static size_t iommu_dma_max_mapping_size(struct device *dev)
+{
+        if (is_swiotlb_active(dev) && dev_is_untrusted(dev))
+                return swiotlb_max_mapping_size(dev);
+        return SIZE_MAX;
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
         .flags = DMA_F_PCI_P2PDMA_SUPPORTED,
         .alloc = iommu_dma_alloc,
@@ -1728,6 +1735,7 @@ static const struct dma_map_ops iommu_dma_ops = {
         .unmap_resource = iommu_dma_unmap_resource,
         .get_merge_boundary = iommu_dma_get_merge_boundary,
         .opt_mapping_size = iommu_dma_opt_mapping_size,
+        .max_mapping_size = iommu_dma_max_mapping_size,
 };

/*
--
2.43.0


2024-02-13 23:31:33

by Keith Busch

Subject: Re: [PATCH v1 2/2] nvme-pci: Fix iommu map (via swiotlb) failures when PAGE_SIZE=64KB

On Tue, Feb 13, 2024 at 01:53:57PM -0800, Nicolin Chen wrote:
> @@ -2967,7 +2967,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
> dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
> else
> dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
> - dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
> + dma_set_min_align_mask(&pdev->dev, PAGE_SIZE - 1);
> dma_set_max_seg_size(&pdev->dev, 0xffffffff);

I recall we had to do this for POWER because they have 64k pages, but
page aligned addresses IOMMU map to 4k, so we needed to allow the lower
dma alignment to efficiently use it.

2024-02-14 06:16:38

by Nicolin Chen

Subject: Re: [PATCH v1 2/2] nvme-pci: Fix iommu map (via swiotlb) failures when PAGE_SIZE=64KB

On Tue, Feb 13, 2024 at 04:31:04PM -0700, Keith Busch wrote:
> On Tue, Feb 13, 2024 at 01:53:57PM -0800, Nicolin Chen wrote:
> > @@ -2967,7 +2967,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
> > dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
> > else
> > dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
> > - dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
> > + dma_set_min_align_mask(&pdev->dev, PAGE_SIZE - 1);
> > dma_set_max_seg_size(&pdev->dev, 0xffffffff);
>
> I recall we had to do this for POWER because they have 64k pages, but
> page aligned addresses IOMMU map to 4k, so we needed to allow the lower
> dma alignment to efficiently use it.

Thanks for the input!

In that case, we might have to rely on iovad->granule from the
attached iommu_domain:

+static size_t iommu_dma_max_mapping_size(struct device *dev)
+{
+        struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+
+        if (!domain || !is_swiotlb_active(dev) || !dev_is_untrusted(dev))
+                return SIZE_MAX;
+        return ALIGN_DOWN(swiotlb_max_mapping_size(dev),
+                          domain->iova_cookie->iovad.granule);
+}
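
(If I calculated right, with a 64KB granule this would cap the max
mapping size at ALIGN_DOWN(252KB, 64KB) = 192KB.)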

With this in PATCH-1, we can drop PATCH-2.

Thanks
Nicolin

2024-02-14 16:44:08

by Will Deacon

Subject: Re: [PATCH v1 0/2] nvme-pci: Fix dma-iommu mapping failures when PAGE_SIZE=64KB

Hi Nicolin,

On Tue, Feb 13, 2024 at 01:53:55PM -0800, Nicolin Chen wrote:
> It's observed that an NVME device is causing timeouts when Ubuntu boots
> with a kernel configured with PAGE_SIZE=64KB due to failures in swiotlb:
> systemd[1]: Started Journal Service.
> => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
> note: journal-offline[392] exited with irqs disabled
> note: journal-offline[392] exited with preempt_count 1
>
> An NVMe device under a PCIe bus can be behind an IOMMU, so DMA mappings
> going through dma-iommu might also be redirected to swiotlb allocations.
> Similar to dma_direct_max_mapping_size(), dma-iommu should implement its
> dma_map_ops->max_mapping_size to return swiotlb_max_mapping_size() too.
>
> Though an iommu_dma_max_mapping_size() is a must, it alone can't fix the
> issue. swiotlb_max_mapping_size() returns 252KB, i.e. the 256KB maximum
> swiotlb allocation minus the min_align_mask NVME_CTRL_PAGE_SIZE=4KB.
> Meanwhile, when PAGE_SIZE=64KB, dma-iommu can round a 252KB mapping up to
> 256KB at its "alloc_size" via iovad->granule, which is often set to
> PAGE_SIZE. This mismatch between NVME_CTRL_PAGE_SIZE=4KB and PAGE_SIZE=64KB
> results in a similar failure, though its signature has a fixed size of
> "256KB":
> systemd[1]: Started Journal Service.
> => nvme 0000:00:01.0: swiotlb buffer is full (sz: 262144 bytes), total 32768 (slots), used 128 (slots)
> note: journal-offline[392] exited with irqs disabled
> note: journal-offline[392] exited with preempt_count 1
>
> Both failures above occur with NVMe behind an IOMMU when PAGE_SIZE=64KB.
> They were likely introduced by the bounce-buffer security feature added in:
> commit 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
>
> So this series bundles two fixes against that commit. They should be
> applied together to entirely fix the mapping failures.

It's a bit of a shot in the dark, but I've got a pending fix to some of
the alignment handling in swiotlb. It would be interesting to know if
patch 1 has any impact at all on your NVME allocations:

https://lore.kernel.org/r/[email protected]

Cheers,

Will

2024-02-15 01:36:49

by Keith Busch

Subject: Re: [PATCH v1 2/2] nvme-pci: Fix iommu map (via swiotlb) failures when PAGE_SIZE=64KB

On Tue, Feb 13, 2024 at 10:09:19PM -0800, Nicolin Chen wrote:
> On Tue, Feb 13, 2024 at 04:31:04PM -0700, Keith Busch wrote:
> > On Tue, Feb 13, 2024 at 01:53:57PM -0800, Nicolin Chen wrote:
> > > @@ -2967,7 +2967,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
> > > dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
> > > else
> > > dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
> > > - dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
> > > + dma_set_min_align_mask(&pdev->dev, PAGE_SIZE - 1);
> > > dma_set_max_seg_size(&pdev->dev, 0xffffffff);
> >
> > I recall we had to do this for POWER because they have 64k pages, but
> > page aligned addresses IOMMU map to 4k, so we needed to allow the lower
> > dma alignment to efficiently use it.
>
> Thanks for the input!
>
> In that case, we might have to rely on iovad->granule from the
> attached iommu_domain:

I explored a bit more, and there is some PPC weirdness that led to
NVME_CTRL_PAGE_SIZE; I don't find the dma min align mask used in that
path. It looks like swiotlb is the only user for this, so your original
patch may be just fine.

2024-02-15 04:46:47

by Nicolin Chen

Subject: Re: [PATCH v1 2/2] nvme-pci: Fix iommu map (via swiotlb) failures when PAGE_SIZE=64KB

On Wed, Feb 14, 2024 at 06:36:38PM -0700, Keith Busch wrote:
> On Tue, Feb 13, 2024 at 10:09:19PM -0800, Nicolin Chen wrote:
> > On Tue, Feb 13, 2024 at 04:31:04PM -0700, Keith Busch wrote:
> > > On Tue, Feb 13, 2024 at 01:53:57PM -0800, Nicolin Chen wrote:
> > > > @@ -2967,7 +2967,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
> > > > dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
> > > > else
> > > > dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
> > > > - dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
> > > > + dma_set_min_align_mask(&pdev->dev, PAGE_SIZE - 1);
> > > > dma_set_max_seg_size(&pdev->dev, 0xffffffff);
> > >
> > > I recall we had to do this for POWER because they have 64k pages, but
> > > page aligned addresses IOMMU map to 4k, so we needed to allow the lower
> > > dma alignment to efficiently use it.
> >
> > Thanks for the input!
> >
> > In that case, we might have to rely on iovad->granule from the
> > attached iommu_domain:
>
> I explored a bit more, and there is some PPC weirdness that led to
> NVME_CTRL_PAGE_SIZE; I don't find the dma min align mask used in that
> path. It looks like swiotlb is the only user for this, so your original
> patch may be just fine.

Oh, that'll be great once confirmed. Also, I think I forgot to add a
Cc line for the stable trees: the two patches should apply cleanly to
older kernels too. Let's wait a few days so people can test and review.
Then I will respin a v2 with the Cc line.

Thanks
Nicolin

2024-02-15 12:02:12

by Robin Murphy

Subject: Re: [PATCH v1 2/2] nvme-pci: Fix iommu map (via swiotlb) failures when PAGE_SIZE=64KB

On 15/02/2024 4:46 am, Nicolin Chen wrote:
> On Wed, Feb 14, 2024 at 06:36:38PM -0700, Keith Busch wrote:
>> On Tue, Feb 13, 2024 at 10:09:19PM -0800, Nicolin Chen wrote:
>>> On Tue, Feb 13, 2024 at 04:31:04PM -0700, Keith Busch wrote:
>>>> On Tue, Feb 13, 2024 at 01:53:57PM -0800, Nicolin Chen wrote:
>>>>> @@ -2967,7 +2967,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
>>>>> dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
>>>>> else
>>>>> dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
>>>>> - dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
>>>>> + dma_set_min_align_mask(&pdev->dev, PAGE_SIZE - 1);
>>>>> dma_set_max_seg_size(&pdev->dev, 0xffffffff);
>>>>
>>>> I recall we had to do this for POWER because they have 64k pages, but
>>>> page aligned addresses IOMMU map to 4k, so we needed to allow the lower
>>>> dma alignment to efficiently use it.
>>>
>>> Thanks for the input!
>>>
>>> In that case, we might have to rely on iovad->granule from the
>>> attached iommu_domain:
>>
>> I explored a bit more, and there is some PPC weirdness that led to
>> NVME_CTRL_PAGE_SIZE; I don't find the dma min align mask used in that
>> path. It looks like swiotlb is the only user for this, so your original
>> patch may be just fine.
>
> Oh, that'll be great once confirmed. Also, I think I forgot to add a
> Cc line for the stable trees: the two patches should apply cleanly to
> older kernels too. Let's wait a few days so people can test and review.
> Then I will respin a v2 with the Cc line.

Hmm, as far as I understand, NVME_CTRL_PAGE_SIZE represents the
alignment that NVMe actually cares about, so if specifying that per the
intended purpose of the API doesn't work then it implies the DMA layer
is still not doing its job properly, thus I'd rather keep digging and
try to fix that properly.

FWIW I have a strong suspicion that iommu-dma may not be correctly doing
what it thinks it's trying to do, so I would definitely think it
worthwhile to give that a really close inspection in light of Will's
SWIOTLB fixes.

Thanks,
Robin.

2024-02-15 14:22:28

by Will Deacon

Subject: Re: [PATCH v1 0/2] nvme-pci: Fix dma-iommu mapping failures when PAGE_SIZE=64KB

On Wed, Feb 14, 2024 at 11:57:32AM -0800, Nicolin Chen wrote:
> On Wed, Feb 14, 2024 at 04:41:38PM +0000, Will Deacon wrote:
> > On Tue, Feb 13, 2024 at 01:53:55PM -0800, Nicolin Chen wrote:
> > > It's observed that an NVME device is causing timeouts when Ubuntu boots
> > > with a kernel configured with PAGE_SIZE=64KB due to failures in swiotlb:
> > > systemd[1]: Started Journal Service.
> > > => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
> > > note: journal-offline[392] exited with irqs disabled
> > > note: journal-offline[392] exited with preempt_count 1
> > >
> > > An NVMe device under a PCIe bus can be behind an IOMMU, so DMA mappings
> > > going through dma-iommu might also be redirected to swiotlb allocations.
> > > Similar to dma_direct_max_mapping_size(), dma-iommu should implement its
> > > dma_map_ops->max_mapping_size to return swiotlb_max_mapping_size() too.
> > >
> > > Though an iommu_dma_max_mapping_size() is a must, it alone can't fix the
> > > issue. swiotlb_max_mapping_size() returns 252KB, i.e. the 256KB maximum
> > > swiotlb allocation minus the min_align_mask NVME_CTRL_PAGE_SIZE=4KB.
> > > Meanwhile, when PAGE_SIZE=64KB, dma-iommu can round a 252KB mapping up to
> > > 256KB at its "alloc_size" via iovad->granule, which is often set to
> > > PAGE_SIZE. This mismatch between NVME_CTRL_PAGE_SIZE=4KB and PAGE_SIZE=64KB
> > > results in a similar failure, though its signature has a fixed size of
> > > "256KB":
> > > systemd[1]: Started Journal Service.
> > > => nvme 0000:00:01.0: swiotlb buffer is full (sz: 262144 bytes), total 32768 (slots), used 128 (slots)
> > > note: journal-offline[392] exited with irqs disabled
> > > note: journal-offline[392] exited with preempt_count 1
> > >
> > > Both failures above occur with NVMe behind an IOMMU when PAGE_SIZE=64KB.
> > > They were likely introduced by the bounce-buffer security feature added in:
> > > commit 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
> > >
> > > So this series bundles two fixes against that commit. They should be
> > > applied together to entirely fix the mapping failures.
> >
> > It's a bit of a shot in the dark, but I've got a pending fix to some of
> > the alignment handling in swiotlb. It would be interesting to know if
> > patch 1 has any impact at all on your NVME allocations:
> >
> > https://lore.kernel.org/r/[email protected]
>
> I applied these three patches locally for a test.

Thank you!

> Though I am building with a v6.6 kernel, I see some warnings:
> from kernel/dma/swiotlb.c:26:
> kernel/dma/swiotlb.c: In function ‘swiotlb_area_find_slots’:
> ./include/linux/minmax.h:21:35: warning: comparison of distinct pointer types lacks a cast
> 21 | (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
> | ^~
> ./include/linux/minmax.h:27:18: note: in expansion of macro ‘__typecheck’
> 27 | (__typecheck(x, y) && __no_side_effects(x, y))
> | ^~~~~~~~~~~
> ./include/linux/minmax.h:37:31: note: in expansion of macro ‘__safe_cmp’
> 37 | __builtin_choose_expr(__safe_cmp(x, y), \
> | ^~~~~~~~~~
> ./include/linux/minmax.h:75:25: note: in expansion of macro ‘__careful_cmp’
> 75 | #define max(x, y) __careful_cmp(x, y, >)
> | ^~~~~~~~~~~~~
> kernel/dma/swiotlb.c:1007:26: note: in expansion of macro ‘max’
> 1007 | stride = max(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
> | ^~~
>
> Replacing with a max_t() can fix these.

Weird, I haven't seen that. I can fix it as you suggest, but please can
you also share your .config so I can look into it further?

> And it seems to get worse, as even a 64KB mapping is failing:
> [ 0.239821] nvme 0000:00:01.0: swiotlb buffer is full (sz: 65536 bytes), total 32768 (slots), used 0 (slots)
>
> With a printk, I found the iotlb_align_mask isn't correct:
> swiotlb_area_find_slots:alloc_align_mask 0xffff, iotlb_align_mask 0x800
>
> But fixing the iotlb_align_mask to 0x7ff still fails the 64KB
> mapping..

Hmm. A mask of 0x7ff doesn't make a lot of sense given that the slabs
are 2KiB aligned. I'll try plugging in some of the constants you have
here, as something definitely isn't right...

Will

2024-02-15 16:36:00

by Will Deacon

Subject: Re: [PATCH v1 0/2] nvme-pci: Fix dma-iommu mapping failures when PAGE_SIZE=64KB

On Thu, Feb 15, 2024 at 02:22:09PM +0000, Will Deacon wrote:
> On Wed, Feb 14, 2024 at 11:57:32AM -0800, Nicolin Chen wrote:
> > On Wed, Feb 14, 2024 at 04:41:38PM +0000, Will Deacon wrote:
> > > On Tue, Feb 13, 2024 at 01:53:55PM -0800, Nicolin Chen wrote:
> > And it seems to get worse, as even a 64KB mapping is failing:
> > [ 0.239821] nvme 0000:00:01.0: swiotlb buffer is full (sz: 65536 bytes), total 32768 (slots), used 0 (slots)
> >
> > With a printk, I found the iotlb_align_mask isn't correct:
> > swiotlb_area_find_slots:alloc_align_mask 0xffff, iotlb_align_mask 0x800
> >
> > But fixing the iotlb_align_mask to 0x7ff still fails the 64KB
> > mapping..
>
> Hmm. A mask of 0x7ff doesn't make a lot of sense given that the slabs
> are 2KiB aligned. I'll try plugging in some of the constants you have
> here, as something definitely isn't right...

Sorry, another ask: please can you print 'orig_addr' in the case of the
failing allocation?

Thanks!

Will

2024-02-16 00:27:34

by Nicolin Chen

Subject: Re: [PATCH v1 0/2] nvme-pci: Fix dma-iommu mapping failures when PAGE_SIZE=64KB

On Thu, Feb 15, 2024 at 04:35:45PM +0000, Will Deacon wrote:
> On Thu, Feb 15, 2024 at 02:22:09PM +0000, Will Deacon wrote:
> > On Wed, Feb 14, 2024 at 11:57:32AM -0800, Nicolin Chen wrote:
> > > On Wed, Feb 14, 2024 at 04:41:38PM +0000, Will Deacon wrote:
> > > > On Tue, Feb 13, 2024 at 01:53:55PM -0800, Nicolin Chen wrote:
> > > And it seems to get worse, as even a 64KB mapping is failing:
> > > [ 0.239821] nvme 0000:00:01.0: swiotlb buffer is full (sz: 65536 bytes), total 32768 (slots), used 0 (slots)
> > >
> > > With a printk, I found the iotlb_align_mask isn't correct:
> > > swiotlb_area_find_slots:alloc_align_mask 0xffff, iotlb_align_mask 0x800
> > >
> > > But fixing the iotlb_align_mask to 0x7ff still fails the 64KB
> > > mapping..
> >
> > Hmm. A mask of 0x7ff doesn't make a lot of sense given that the slabs
> > are 2KiB aligned. I'll try plugging in some of the constants you have
> > here, as something definitely isn't right...
>
> Sorry, another ask: please can you print 'orig_addr' in the case of the
> failing allocation?

I added nvme_print_sgl() in the nvme-pci driver before its
dma_map_sgtable() call, so the orig_addr isn't aligned with
PAGE_SIZE=64K or NVME_CTRL_PAGE_SIZE=4K:
sg[0] phys_addr:0x0000000105774600 offset:17920 length:512 dma_address:0x0000000000000000 dma_length:0

Also attaching some verbose logs, in case you'd like to check:
nvme 0000:00:01.0: swiotlb_area_find_slots: dma_get_min_align_mask 0xfff, IO_TLB_SIZE 0xfffff7ff
nvme 0000:00:01.0: swiotlb_area_find_slots: alloc_align_mask 0xffff, iotlb_align_mask 0x7ff
nvme 0000:00:01.0: swiotlb_area_find_slots: stride 0x20, max 0xffff
nvme 0000:00:01.0: swiotlb_area_find_slots: tlb_addr=0xbd830000, iotlb_align_mask=0x7ff, alloc_align_mask=0xffff
=> nvme 0000:00:01.0: swiotlb_area_find_slots: orig_addr=0x105774600, iotlb_align_mask=0x7ff
...
nvme 0000:00:01.0: swiotlb buffer is full (sz: 65536 bytes), total 32768 (slots), used 0 (slots)
...

Call trace:
swiotlb_tbl_map_single+0x1c8/0x210
iommu_dma_map_page+0x1ec/0x280
iommu_dma_map_sg+0x1ec/0x570
__dma_map_sg_attrs+0x98/0xa8
dma_map_sgtable+0x30/0x58
nvme_prep_rq.part.0+0x1e4/0x7b8

Then, I tried this experiment:
------------------------------------------------------
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b32e3cff37b1..1c988493d3bf 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1879,3 +1879,3 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
         blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
-        blk_queue_dma_alignment(q, 3);
+        blk_queue_dma_alignment(q, NVME_CTRL_PAGE_SIZE - 1);
         blk_queue_write_cache(q, vwc, vwc);
------------------------------------------------------

Then, all the SGs seem to be aligned properly, yet some offset passed
to dma_map_page_attrs() is still misaligned:
sg[0] phys_addr:0x0000000145140000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[1] phys_addr:0x0000000145130000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[2] phys_addr:0x0000000145120000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[0] phys_addr:0x0000000145110000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[1] phys_addr:0x0000000145100000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[2] phys_addr:0x00000001450f0000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[3] phys_addr:0x00000001450e0000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[0] phys_addr:0x00000001450d0000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[1] phys_addr:0x00000001450c0000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[0] phys_addr:0x00000001451b0000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[1] phys_addr:0x0000000145140000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[2] phys_addr:0x0000000145130000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[0] phys_addr:0x0000000145120000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[1] phys_addr:0x00000001451a0000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[2] phys_addr:0x0000000145190000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[3] phys_addr:0x0000000145160000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[0] phys_addr:0x0000000145000000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[1] phys_addr:0x0000000144fd0000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[2] phys_addr:0x0000000144ff0000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[3] phys_addr:0x0000000144f60000 offset:0 length:65536 dma_address:0x0000000000000000 dma_length:0
sg[0] phys_addr:0x0000000145730000 offset:0 length:4096 dma_address:0x0000000000000000 dma_length:0
nvme 0000:00:01.0: swiotlb_area_find_slots: tlb_addr=0xbf020000, iotlb_align_mask=0x7ff, alloc_align_mask=0xffff
=> nvme 0000:00:01.0: swiotlb_area_find_slots: orig_addr=0x145700600, iotlb_align_mask=0x7ff
...
nvme 0000:00:01.0: swiotlb buffer is full (sz: 65536 bytes), total 32768 (slots), used 0 (slots)

Call trace:
swiotlb_tbl_map_single+0x1c8/0x210
iommu_dma_map_page+0x1ec/0x280
dma_map_page_attrs+0x238/0x2d8
nvme_prep_rq.part.0+0xec/0x7b8
nvme_queue_rqs+0xb8/0x290

FWIW, with your patch, it seems to crash for me every time I
boot the system (PAGE_SIZE=64KB). I guess it somehow exposes
an underlying bug.

Attaching my .config file also.

Thanks
Nicolin


Attachments:
(No filename) (5.72 kB)
nvme_test_config (329.01 kB)

2024-02-16 00:35:29

by Nicolin Chen

Subject: Re: [PATCH v1 0/2] nvme-pci: Fix dma-iommu mapping failures when PAGE_SIZE=64KB

On Thu, Feb 15, 2024 at 02:22:09PM +0000, Will Deacon wrote:

> > Though I am building with a v6.6 kernel, I see some warnings:
> > from kernel/dma/swiotlb.c:26:
> > kernel/dma/swiotlb.c: In function ‘swiotlb_area_find_slots’:
> > ./include/linux/minmax.h:21:35: warning: comparison of distinct pointer types lacks a cast
> > 21 | (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
> > | ^~
> > ./include/linux/minmax.h:27:18: note: in expansion of macro ‘__typecheck’
> > 27 | (__typecheck(x, y) && __no_side_effects(x, y))
> > | ^~~~~~~~~~~
> > ./include/linux/minmax.h:37:31: note: in expansion of macro ‘__safe_cmp’
> > 37 | __builtin_choose_expr(__safe_cmp(x, y), \
> > | ^~~~~~~~~~
> > ./include/linux/minmax.h:75:25: note: in expansion of macro ‘__careful_cmp’
> > 75 | #define max(x, y) __careful_cmp(x, y, >)
> > | ^~~~~~~~~~~~~
> > kernel/dma/swiotlb.c:1007:26: note: in expansion of macro ‘max’
> > 1007 | stride = max(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
> > | ^~~
> >
> > Replacing with a max_t() can fix these.
>
> Weird, I haven't seen that. I can fix it as you suggest, but please can
> you also share your .config so I can look into it further?

I attached it in my previous reply, but forgot to mention my gcc info
before hitting the send key:

# gcc -dumpmachine
aarch64-linux-gnu
# gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

2024-02-16 01:14:17

by Nicolin Chen

Subject: Re: [PATCH v1 2/2] nvme-pci: Fix iommu map (via swiotlb) failures when PAGE_SIZE=64KB

On Thu, Feb 15, 2024 at 12:01:34PM +0000, Robin Murphy wrote:
> On 15/02/2024 4:46 am, Nicolin Chen wrote:
> > On Wed, Feb 14, 2024 at 06:36:38PM -0700, Keith Busch wrote:
> > > On Tue, Feb 13, 2024 at 10:09:19PM -0800, Nicolin Chen wrote:
> > > > On Tue, Feb 13, 2024 at 04:31:04PM -0700, Keith Busch wrote:
> > > > > On Tue, Feb 13, 2024 at 01:53:57PM -0800, Nicolin Chen wrote:
> > > > > > @@ -2967,7 +2967,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
> > > > > > dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
> > > > > > else
> > > > > > dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
> > > > > > - dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
> > > > > > + dma_set_min_align_mask(&pdev->dev, PAGE_SIZE - 1);
> > > > > > dma_set_max_seg_size(&pdev->dev, 0xffffffff);
> > > > >
> > > > > I recall we had to do this for POWER because they have 64k pages, but
> > > > > page aligned addresses IOMMU map to 4k, so we needed to allow the lower
> > > > > dma alignment to efficiently use it.
> > > >
> > > > Thanks for the input!
> > > >
> > > > In that case, we might have to rely on iovad->granule from the
> > > > attached iommu_domain:
> > >
> > > I explored a bit more, and there is some PPC weirdness that led to
> > > NVME_CTRL_PAGE_SIZE; I don't find the dma min align mask used in that
> > > path. It looks like swiotlb is the only user for this, so your original
> > > patch may be just fine.
> >
> > Oh, that'll be great once confirmed. Also, I think I forgot to add a
> > Cc line for the stable trees: the two patches should apply cleanly to
> > older kernels too. Let's wait a few days so people can test and review.
> > Then I will respin a v2 with the Cc line.
>
> Hmm, as far as I understand, NVME_CTRL_PAGE_SIZE represents the
> alignment that NVMe actually cares about, so if specifying that per the
> intended purpose of the API doesn't work then it implies the DMA layer
> is still not doing its job properly, thus I'd rather keep digging and
> try to fix that properly.
>
> FWIW I have a strong suspicion that iommu-dma may not be correctly doing
> what it thinks it's trying to do, so I would definitely think it
> worthwhile to give that a really close inspection in light of Will's
> SWIOTLB fixes.

Yes. Let's figure out what's breaking Will's change.

Thanks
Nicolin

2024-02-16 16:13:28

by Will Deacon

Subject: Re: [PATCH v1 0/2] nvme-pci: Fix dma-iommu mapping failures when PAGE_SIZE=64KB

Hi Nicolin,

Thanks for sharing all the logs, .config etc.

On Thu, Feb 15, 2024 at 04:26:23PM -0800, Nicolin Chen wrote:
> On Thu, Feb 15, 2024 at 04:35:45PM +0000, Will Deacon wrote:
> > On Thu, Feb 15, 2024 at 02:22:09PM +0000, Will Deacon wrote:
> > > On Wed, Feb 14, 2024 at 11:57:32AM -0800, Nicolin Chen wrote:
> > > > On Wed, Feb 14, 2024 at 04:41:38PM +0000, Will Deacon wrote:
> > > > > On Tue, Feb 13, 2024 at 01:53:55PM -0800, Nicolin Chen wrote:
> > > > And it seems to get worse, as even a 64KB mapping is failing:
> > > > [ 0.239821] nvme 0000:00:01.0: swiotlb buffer is full (sz: 65536 bytes), total 32768 (slots), used 0 (slots)
> > > >
> > > > With a printk, I found the iotlb_align_mask isn't correct:
> > > > swiotlb_area_find_slots:alloc_align_mask 0xffff, iotlb_align_mask 0x800
> > > >
> > > > But fixing the iotlb_align_mask to 0x7ff still fails the 64KB
> > > > mapping..
> > >
> > > Hmm. A mask of 0x7ff doesn't make a lot of sense given that the slabs
> > > are 2KiB aligned. I'll try plugging in some of the constants you have
> > > here, as something definitely isn't right...
> >
> > Sorry, another ask: please can you print 'orig_addr' in the case of the
> > failing allocation?
>
> I added nvme_print_sgl() in the nvme-pci driver before its
> dma_map_sgtable() call, so the orig_addr isn't aligned with
> PAGE_SIZE=64K or NVME_CTRL_PAGE_SIZE=4K:
> sg[0] phys_addr:0x0000000105774600 offset:17920 length:512 dma_address:0x0000000000000000 dma_length:0
>
> Also attaching some verbose logs, in case you'd like to check:
> nvme 0000:00:01.0: swiotlb_area_find_slots: dma_get_min_align_mask 0xfff, IO_TLB_SIZE 0xfffff7ff
> nvme 0000:00:01.0: swiotlb_area_find_slots: alloc_align_mask 0xffff, iotlb_align_mask 0x7ff
> nvme 0000:00:01.0: swiotlb_area_find_slots: stride 0x20, max 0xffff
> nvme 0000:00:01.0: swiotlb_area_find_slots: tlb_addr=0xbd830000, iotlb_align_mask=0x7ff, alloc_align_mask=0xffff
> => nvme 0000:00:01.0: swiotlb_area_find_slots: orig_addr=0x105774600, iotlb_align_mask=0x7ff

With my patches, I think 'iotlb_align_mask' will be 0x800 here, so this
particular allocation might be alright, however I think I'm starting to
see the wider problem. The IOMMU code is asking for a 64k-aligned
allocation so that it can map it safely, but at the same time
dma_get_min_align_mask() is asking for congruence in the 4k NVME page
offset. Now, because we're going to allocate a 64k-aligned mapping and
offset it, I think the NVME alignment will just fall out in the wash and
checking the 'orig_addr' (which includes the offset) is wrong.

So perhaps this diff (which I'm sadly not able to test) will help? You'll
want to apply it on top of my other patches. The idea is to ignore the
bits of 'orig_addr' which will be aligned automatically by offsetting from
the aligned allocation. I fixed the max() thing too, although that's only
an issue for older kernels.

Cheers,

Will

--->8

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 283eea33dd22..4a000d97f568 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -981,8 +981,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
         dma_addr_t tbl_dma_addr =
                 phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
         unsigned long max_slots = get_max_slots(boundary_mask);
-        unsigned int iotlb_align_mask =
-                dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
+        unsigned int iotlb_align_mask = dma_get_min_align_mask(dev);
         unsigned int nslots = nr_slots(alloc_size), stride;
         unsigned int offset = swiotlb_align_offset(dev, orig_addr);
         unsigned int index, slots_checked, count = 0, i;
@@ -993,6 +992,9 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
         BUG_ON(!nslots);
         BUG_ON(area_index >= pool->nareas);
 
+        alloc_align_mask |= (IO_TLB_SIZE - 1);
+        iotlb_align_mask &= ~alloc_align_mask;
+
         /*
          * For mappings with an alignment requirement don't bother looping to
          * unaligned slots once we found an aligned one.
@@ -1004,7 +1006,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
          * allocations.
          */
         if (alloc_size >= PAGE_SIZE)
-                stride = max(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
+                stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
 
         spin_lock_irqsave(&area->lock, flags);
         if (unlikely(nslots > pool->area_nslabs - area->used))
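
To spell out the intended effect with the constants from your logs
(min_align_mask=0xfff, assuming I haven't slipped a bit): on the
dma-iommu path, alloc_align_mask=0xffff stays 0xffff after OR-ing in
IO_TLB_SIZE-1, and iotlb_align_mask becomes 0xfff & ~0xffff = 0x0, i.e.
the orig_addr check is fully relaxed because the 64KB-aligned allocation
plus the preserved offset takes care of it. On the direct-map path,
alloc_align_mask=0x0 becomes 0x7ff and iotlb_align_mask becomes
0xfff & ~0x7ff = 0x800, so only the 4KB-offset bit above the 2KB slot
size is still matched against orig_addr.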


2024-02-17 05:20:09

by Nicolin Chen

Subject: Re: [PATCH v1 0/2] nvme-pci: Fix dma-iommu mapping failures when PAGE_SIZE=64KB

Hi Will,

On Fri, Feb 16, 2024 at 04:13:12PM +0000, Will Deacon wrote:
> On Thu, Feb 15, 2024 at 04:26:23PM -0800, Nicolin Chen wrote:
> > On Thu, Feb 15, 2024 at 04:35:45PM +0000, Will Deacon wrote:
> > > On Thu, Feb 15, 2024 at 02:22:09PM +0000, Will Deacon wrote:
> > > > On Wed, Feb 14, 2024 at 11:57:32AM -0800, Nicolin Chen wrote:
> > > > > On Wed, Feb 14, 2024 at 04:41:38PM +0000, Will Deacon wrote:
> > > > > > On Tue, Feb 13, 2024 at 01:53:55PM -0800, Nicolin Chen wrote:
> > > > > And it seems to get worse, as even a 64KB mapping is failing:
> > > > > [ 0.239821] nvme 0000:00:01.0: swiotlb buffer is full (sz: 65536 bytes), total 32768 (slots), used 0 (slots)
> > > > >
> > > > > With a printk, I found the iotlb_align_mask isn't correct:
> > > > > swiotlb_area_find_slots:alloc_align_mask 0xffff, iotlb_align_mask 0x800
> > > > >
> > > > > But fixing the iotlb_align_mask to 0x7ff still fails the 64KB
> > > > > mapping..
> > > >
> > > > Hmm. A mask of 0x7ff doesn't make a lot of sense given that the slabs
> > > > are 2KiB aligned. I'll try plugging in some of the constants you have
> > > > here, as something definitely isn't right...
> > >
> > > Sorry, another ask: please can you print 'orig_addr' in the case of the
> > > failing allocation?
> >
> > I added nvme_print_sgl() in the nvme-pci driver before its
> > dma_map_sgtable() call, so the orig_addr isn't aligned with
> > PAGE_SIZE=64K or NVME_CTRL_PAGE_SIZE=4K:
> > sg[0] phys_addr:0x0000000105774600 offset:17920 length:512 dma_address:0x0000000000000000 dma_length:0
> >
> > Also attaching some verbose logs, in case you'd like to check:
> > nvme 0000:00:01.0: swiotlb_area_find_slots: dma_get_min_align_mask 0xfff, IO_TLB_SIZE 0xfffff7ff
> > nvme 0000:00:01.0: swiotlb_area_find_slots: alloc_align_mask 0xffff, iotlb_align_mask 0x7ff
> > nvme 0000:00:01.0: swiotlb_area_find_slots: stride 0x20, max 0xffff
> > nvme 0000:00:01.0: swiotlb_area_find_slots: tlb_addr=0xbd830000, iotlb_align_mask=0x7ff, alloc_align_mask=0xffff
> > => nvme 0000:00:01.0: swiotlb_area_find_slots: orig_addr=0x105774600, iotlb_align_mask=0x7ff
>
> With my patches, I think 'iotlb_align_mask' will be 0x800 here, so this

Oops, my bad. I forgot to revert the part that I mentioned in
my previous reply.

> particular allocation might be alright, however I think I'm starting to
> see the wider problem. The IOMMU code is asking for a 64k-aligned
> allocation so that it can map it safely, but at the same time
> dma_get_min_align_mask() is asking for congruence in the 4k NVME page
> offset. Now, because we're going to allocate a 64k-aligned mapping and
> offset it, I think the NVME alignment will just fall out in the wash and
> checking the 'orig_addr' (which includes the offset) is wrong.
>
> So perhaps this diff (which I'm sadly not able to test) will help? You'll
> want to apply it on top of my other patches. The idea is to ignore the
> bits of 'orig_addr' which will be aligned automatically by offsetting from
> the aligned allocation. I fixed the max() thing too, although that's only
> an issue for older kernels.

Yea, I tested all 4 patches. They still failed at some large
mapping, until I added on top of them my PATCH-1 implementing
the max_mapping_size op. IOW, with your patches it looks like
252KB max_mapping_size is working :)

Though we seem to have a solution now, I hope we can make it
applicable to older kernels too. The mapping failure on arm64 with
PAGE_SIZE=64KB looks like a regression to me, since dma-iommu started
to use the swiotlb bounce buffer.

Thanks
Nicolin

> --->8
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 283eea33dd22..4a000d97f568 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -981,8 +981,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> dma_addr_t tbl_dma_addr =
> phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
> unsigned long max_slots = get_max_slots(boundary_mask);
> - unsigned int iotlb_align_mask =
> - dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
> + unsigned int iotlb_align_mask = dma_get_min_align_mask(dev);
> unsigned int nslots = nr_slots(alloc_size), stride;
> unsigned int offset = swiotlb_align_offset(dev, orig_addr);
> unsigned int index, slots_checked, count = 0, i;
> @@ -993,6 +992,9 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> BUG_ON(!nslots);
> BUG_ON(area_index >= pool->nareas);
>
> + alloc_align_mask |= (IO_TLB_SIZE - 1);
> + iotlb_align_mask &= ~alloc_align_mask;
> +
> /*
> * For mappings with an alignment requirement don't bother looping to
> * unaligned slots once we found an aligned one.
> @@ -1004,7 +1006,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> * allocations.
> */
> if (alloc_size >= PAGE_SIZE)
> - stride = max(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
> + stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
>
> spin_lock_irqsave(&area->lock, flags);
> if (unlikely(nslots > pool->area_nslabs - area->used))
>

2024-02-19 04:05:43

by Michael Kelley

Subject: RE: [PATCH v1 0/2] nvme-pci: Fix dma-iommu mapping failures when PAGE_SIZE=64KB

From: Nicolin Chen <[email protected]> Sent: Friday, February 16, 2024 9:20 PM
>
> Hi Will,
>
> On Fri, Feb 16, 2024 at 04:13:12PM +0000, Will Deacon wrote:
> > On Thu, Feb 15, 2024 at 04:26:23PM -0800, Nicolin Chen wrote:
> > > On Thu, Feb 15, 2024 at 04:35:45PM +0000, Will Deacon wrote:
> > > > On Thu, Feb 15, 2024 at 02:22:09PM +0000, Will Deacon wrote:
> > > > > On Wed, Feb 14, 2024 at 11:57:32AM -0800, Nicolin Chen wrote:
> > > > > > On Wed, Feb 14, 2024 at 04:41:38PM +0000, Will Deacon wrote:
> > > > > > > On Tue, Feb 13, 2024 at 01:53:55PM -0800, Nicolin Chen wrote:
> > > > > > And it seems to get worse, as even a 64KB mapping is failing:
> > > > > > [ 0.239821] nvme 0000:00:01.0: swiotlb buffer is full (sz: 65536 bytes), total 32768 (slots), used 0 (slots)
> > > > > >
> > > > > > With a printk, I found the iotlb_align_mask isn't correct:
> > > > > > swiotlb_area_find_slots:alloc_align_mask 0xffff, iotlb_align_mask 0x800
> > > > > >
> > > > > > But fixing the iotlb_align_mask to 0x7ff still fails the 64KB
> > > > > > mapping..
> > > > >
> > > > > Hmm. A mask of 0x7ff doesn't make a lot of sense given that the slabs
> > > > > are 2KiB aligned. I'll try plugging in some of the constants you have
> > > > > here, as something definitely isn't right...
> > > >
> > > > Sorry, another ask: please can you print 'orig_addr' in the case of the
> > > > failing allocation?
> > >
> > > I added nvme_print_sgl() in the nvme-pci driver before its
> > > dma_map_sgtable() call, so the orig_addr isn't aligned with
> > > PAGE_SIZE=64K or NVME_CTRL_PAGE_SIZE=4K:
> > > sg[0] phys_addr:0x0000000105774600 offset:17920 length:512 dma_address:0x0000000000000000 dma_length:0
> > >
> > > Also attaching some verbose logs, in case you'd like to check:
> > > nvme 0000:00:01.0: swiotlb_area_find_slots: dma_get_min_align_mask 0xfff, IO_TLB_SIZE 0xfffff7ff
> > > nvme 0000:00:01.0: swiotlb_area_find_slots: alloc_align_mask 0xffff, iotlb_align_mask 0x7ff
> > > nvme 0000:00:01.0: swiotlb_area_find_slots: stride 0x20, max 0xffff
> > > nvme 0000:00:01.0: swiotlb_area_find_slots: tlb_addr=0xbd830000, iotlb_align_mask=0x7ff, alloc_align_mask=0xffff
> > > => nvme 0000:00:01.0: swiotlb_area_find_slots: orig_addr=0x105774600, iotlb_align_mask=0x7ff
> >
> > With my patches, I think 'iotlb_align_mask' will be 0x800 here, so this
>
> Oops, my bad. I forgot to revert the part that I mentioned in
> my previous reply.
>
> > particular allocation might be alright, however I think I'm starting to
> > see the wider problem. The IOMMU code is asking for a 64k-aligned
> > allocation so that it can map it safely, but at the same time
> > dma_get_min_align_mask() is asking for congruence in the 4k NVME page
> > offset. Now, because we're going to allocate a 64k-aligned mapping and
> > offset it, I think the NVME alignment will just fall out in the wash and
> > checking the 'orig_addr' (which includes the offset) is wrong.
> >
> > So perhaps this diff (which I'm sadly not able to test) will help? You'll
> > want to apply it on top of my other patches. The idea is to ignore the
> > bits of 'orig_addr' which will be aligned automatically by offsetting from
> > the aligned allocation. I fixed the max() thing too, although that's only
> > an issue for older kernels.
>
> Yea, I tested all 4 patches. They still failed at some large
> mapping, until I added on top of them my PATCH-1 implementing
> the max_mapping_size op. IOW, with your patches it looks like
> 252KB max_mapping_size is working :)
>
> Though we seem to have a solution now, I hope we can make it
> applicable to older kernels too. The mapping failure on arm64 with
> PAGE_SIZE=64KB looks like a regression to me, since dma-iommu started
> to use the swiotlb bounce buffer.
>
> Thanks
> Nicolin
>
> > --->8
> >
> > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > index 283eea33dd22..4a000d97f568 100644
> > --- a/kernel/dma/swiotlb.c
> > +++ b/kernel/dma/swiotlb.c
> > @@ -981,8 +981,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > dma_addr_t tbl_dma_addr =
> > phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
> > unsigned long max_slots = get_max_slots(boundary_mask);
> > - unsigned int iotlb_align_mask =
> > - dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
> > + unsigned int iotlb_align_mask = dma_get_min_align_mask(dev);
> > unsigned int nslots = nr_slots(alloc_size), stride;
> > unsigned int offset = swiotlb_align_offset(dev, orig_addr);
> > unsigned int index, slots_checked, count = 0, i;
> > @@ -993,6 +992,9 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > BUG_ON(!nslots);
> > BUG_ON(area_index >= pool->nareas);
> >
> > + alloc_align_mask |= (IO_TLB_SIZE - 1);
> > + iotlb_align_mask &= ~alloc_align_mask;
> > +
> > /*
> > * For mappings with an alignment requirement don't bother looping to
> > * unaligned slots once we found an aligned one.
> > @@ -1004,7 +1006,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > * allocations.
> > */
> > if (alloc_size >= PAGE_SIZE)
> > - stride = max(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
> > + stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
> >
> > spin_lock_irqsave(&area->lock, flags);
> > if (unlikely(nslots > pool->area_nslabs - area->used))
> >

This thread prompted me to test another scenario similar to Nicolin's, and
on the DMA direct map path I saw failures like Nicolin's. Will's Patch 1/3
fixes my scenario, though I'm still looking at the details to convince myself
that it is fully correct, even with the above tweak.

Here's my scenario:
* ARM64 VM with 8 vCPUs running in the Azure public cloud
* Linux 6.8-rc4 kernel built with 64 Kbyte pages
* Using the Azure/Hyper-V synthetic storage driver (drivers/scsi/storvsc_drv.c).
This driver sets dma_min_align_mask to 4 Kbytes, because the DMA
descriptors sent to the Hyper-V host are always in 4K units, regardless of
the guest page size.
* Running with the standard 64 Mbytes swiotlb size. Added swiotlb=force
on the kernel boot line, to see if it could handle the scenario. This
simulates what CoCo VMs do on the x86 side. We don't have arm64
CoCo VMs, so the scenario is artificial for the moment. But it ought
to work.
* There's no vIOMMU in the VM, so all DMA requests go through the
DMA direct map path, not the IOMMU path (which is Nicolin's scenario)

In my scenario, the block layer generates I/O requests up to 252 Kbytes in
size. But with the unmodified 6.8-rc4 code, these requests fail to be mapped
in the swiotlb even though there's plenty of space available. Here's why:

The 252 Kbyte request size is determined by the block layer, ultimately
based on the value returned by swiotlb_max_mapping_size(), which is
based on the dma_min_align_mask, which is 4K in my scenario. But the
current upstream code in swiotlb_search_pool_area() adds ~PAGE_MASK
into the iotlb_align_mask, which increases the alignment requirement.
With the increased alignment, the offset in orig_addr is likely larger than
4 Kbytes, and that offset plus 252 Kbytes no longer fits in the 256 Kbyte
swiotlb limit. swiotlb_search_pool_area() can never find a slot with
enough space. (We could easily add a WARN_ON in
swiotlb_search_pool_area() to catch such an occurrence.)
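
As a concrete illustration (numbers picked for the example): with 64K
pages, an orig_addr may sit, say, 32 Kbytes into its page. Preserving
that offset under the ~PAGE_MASK-widened alignment means finding
32K + 252K = 284 Kbytes of contiguous slots, which exceeds the 256 Kbyte
(IO_TLB_SEGSIZE * IO_TLB_SIZE) segment limit, so the search can never
succeed.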

Will's Patch 1/3 fixes this problem by using PAGE_SIZE/PAGE_SHIFT
only to compute the stride, and does not add ~PAGE_MASK into
iotlb_align_mask. In my scenario with Will's patch, iotlb_align_mask
expresses a 4 Kbyte alignment requirement, and
swiotlb_search_pool_area() succeeds for a 252 Kbyte request.

But what about the alloc_align_mask argument to
swiotlb_search_pool_area()? If it increases the alignment beyond
what swiotlb_max_mapping_size() is based on, the same problem
will occur. This is what happens in Nicolin's scenario when the
NVMe driver sets dma_min_align_mask to 4K, but the dma-iommu
code passes a larger value in alloc_align_mask. The later version of
Nicolin's Patch 1/2 that uses iovad.granule in
iommu_dma_max_mapping_size() solves the immediate problem.
But something still seems messed up, because the value from
iommu_dma_max_mapping_size() must be coordinated with the
value passed as the alloc_align_mask argument.

Will's 3-patch series is based on a different scenario -- the
swiotlb_alloc() case. Patch 3 of his series also uses the
alloc_align_mask argument, and I haven't thought through
all the implications.

Finally, there's this comment in swiotlb_search_pool_area():

/*
* For allocations of PAGE_SIZE or larger only look for page aligned
* allocations.
*/

The comment has been around for a while, and it confused me.
It seems to apply only for the swiotlb_alloc() case when
orig_addr is zero. From what I can tell, allocations for DMA
mapping purposes don't do such alignment. Or is there some
other meaning to that comment that I'm missing?

Michael