Per the reasoning in commit 4bf7fda4dce2 ("iommu/dma: Add config for
PCI SAC address trick") and its subsequent revert, this mechanism no
longer serves its original purpose, but now only works around broken
hardware/drivers in a way that is unfortunately too impactful to remove.

This does not, however, prevent us from solving the performance impact
which that workaround has on large-scale systems that don't need it.
Once the 32-bit IOVA space fills up and a workload starts allocating and
freeing on both sides of the boundary, the opportunistic SAC allocation
can then end up spending significant time hunting down scattered
fragments of free 32-bit space, or just reestablishing max32_alloc_size.
This can easily be exacerbated by a change in allocation pattern, such
as by changing the network MTU, which can increase pressure on the
32-bit space by leaving a large quantity of cached IOVAs which are now
the wrong size to be recycled, but also won't be freed since the
non-opportunistic allocations can still be satisfied from the whole
64-bit space without triggering the reclaim path.

However, in the context of a workaround where smaller DMA addresses
aren't simply a preference but a necessity, if we get to that point at
all then in fact it's already the endgame. The nature of the allocator
is currently such that the first IOVA we give to a device after the
32-bit space runs out will be the highest possible address for that
device, ever. If that works, then great, we know we can optimise for
speed by always allocating from the full range. And if it doesn't, then
the worst has already happened and any brokenness is now showing, so
there's little point in continuing to try to hide it.

To that end, implement a flag to refine the SAC business into a
per-device policy that can automatically get itself out of the way if
and when it stops being useful.

CC: Linus Torvalds <[email protected]>
CC: Jakub Kicinski <[email protected]>
Reviewed-by: John Garry <[email protected]>
Signed-off-by: Robin Murphy <[email protected]>
---
v4: Rebase to use the new bitfield in dev_iommu, expand commit message.
drivers/iommu/dma-iommu.c | 26 ++++++++++++++++++++------
drivers/iommu/dma-iommu.h | 8 ++++++++
drivers/iommu/iommu.c | 3 +++
include/linux/iommu.h | 2 ++
4 files changed, 33 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 99b2646cb5c7..9193ad5bc72f 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -630,7 +630,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
{
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iova_domain *iovad = &cookie->iovad;
- unsigned long shift, iova_len, iova = 0;
+ unsigned long shift, iova_len, iova;
if (cookie->type == IOMMU_DMA_MSI_COOKIE) {
cookie->msi_iova += size;
@@ -645,15 +645,29 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
if (domain->geometry.force_aperture)
dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end);
- /* Try to get PCI devices a SAC address */
- if (dma_limit > DMA_BIT_MASK(32) && !iommu_dma_forcedac && dev_is_pci(dev))
+ /*
+ * Try to use all the 32-bit PCI addresses first. The original SAC vs.
+ * DAC reasoning loses relevance with PCIe, but enough hardware and
+ * firmware bugs are still lurking out there that it's safest not to
+ * venture into the 64-bit space until necessary.
+ *
+ * If your device goes wrong after seeing the notice then likely either
+ * its driver is not setting DMA masks accurately, the hardware has
+ * some inherent bug in handling >32-bit addresses, or not all the
+ * expected address bits are wired up between the device and the IOMMU.
+ */
+ if (dma_limit > DMA_BIT_MASK(32) && dev->iommu->pci_32bit_workaround) {
iova = alloc_iova_fast(iovad, iova_len,
DMA_BIT_MASK(32) >> shift, false);
+ if (iova)
+ goto done;
- if (!iova)
- iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift,
- true);
+ dev->iommu->pci_32bit_workaround = false;
+ dev_notice(dev, "Using %d-bit DMA addresses\n", bits_per(dma_limit));
+ }
+ iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift, true);
+done:
return (dma_addr_t)iova << shift;
}
diff --git a/drivers/iommu/dma-iommu.h b/drivers/iommu/dma-iommu.h
index 942790009292..c829f1f82a99 100644
--- a/drivers/iommu/dma-iommu.h
+++ b/drivers/iommu/dma-iommu.h
@@ -17,6 +17,10 @@ int iommu_dma_init_fq(struct iommu_domain *domain);
void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list);
extern bool iommu_dma_forcedac;
+static inline void iommu_dma_set_pci_32bit_workaround(struct device *dev)
+{
+ dev->iommu->pci_32bit_workaround = !iommu_dma_forcedac;
+}
#else /* CONFIG_IOMMU_DMA */
@@ -38,5 +42,9 @@ static inline void iommu_dma_get_resv_regions(struct device *dev, struct list_he
{
}
+static inline void iommu_dma_set_pci_32bit_workaround(struct device *dev)
+{
+}
+
#endif /* CONFIG_IOMMU_DMA */
#endif /* __DMA_IOMMU_H */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 10db680acaed..8ea5821b637c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -354,6 +354,9 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
mutex_unlock(&iommu_probe_device_lock);
iommu_device_link(iommu_dev, dev);
+ if (dev_is_pci(dev))
+ iommu_dma_set_pci_32bit_workaround(dev);
+
return 0;
out_release:
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 6595454d4f48..2c908e87a89a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -403,6 +403,7 @@ struct iommu_fault_param {
* @priv: IOMMU Driver private data
* @max_pasids: number of PASIDs this device can consume
* @attach_deferred: the dma domain attachment is deferred
+ * @pci_32bit_workaround: Limit DMA allocations to 32-bit IOVAs
*
* TODO: migrate other per device data pointers under iommu_dev_data, e.g.
* struct iommu_group *iommu_group;
@@ -416,6 +417,7 @@ struct dev_iommu {
void *priv;
u32 max_pasids;
u32 attach_deferred:1;
+ u32 pci_32bit_workaround:1;
};
int iommu_device_register(struct iommu_device *iommu,
--
2.39.2.101.g768bb238c484.dirty
On Thu, 13 Apr 2023 14:40:25 +0100 Robin Murphy wrote:
> CC: Linus Torvalds <[email protected]>
> CC: Jakub Kicinski <[email protected]>
> Reviewed-by: John Garry <[email protected]>
> Signed-off-by: Robin Murphy <[email protected]>
Tested-by: Jakub Kicinski <[email protected]>
Hi Robin,
On Thu, Apr 13, 2023 at 02:40:25PM +0100, Robin Murphy wrote:
> Per the reasoning in commit 4bf7fda4dce2 ("iommu/dma: Add config for
> PCI SAC address trick") and its subsequent revert, this mechanism no
> longer serves its original purpose, but now only works around broken
> hardware/drivers in a way that is unfortunately too impactful to remove.
>
> This does not, however, prevent us from solving the performance impact
> which that workaround has on large-scale systems that don't need it.
> Once the 32-bit IOVA space fills up and a workload starts allocating and
> freeing on both sides of the boundary, the opportunistic SAC allocation
> can then end up spending significant time hunting down scattered
> fragments of free 32-bit space, or just reestablishing max32_alloc_size.
> This can easily be exacerbated by a change in allocation pattern, such
> as by changing the network MTU, which can increase pressure on the
> 32-bit space by leaving a large quantity of cached IOVAs which are now
> the wrong size to be recycled, but also won't be freed since the
> non-opportunistic allocations can still be satisfied from the whole
> 64-bit space without triggering the reclaim path.
>
> However, in the context of a workaround where smaller DMA addresses
> aren't simply a preference but a necessity, if we get to that point at
> all then in fact it's already the endgame. The nature of the allocator
> is currently such that the first IOVA we give to a device after the
> 32-bit space runs out will be the highest possible address for that
> device, ever. If that works, then great, we know we can optimise for
> speed by always allocating from the full range. And if it doesn't, then
> the worst has already happened and any brokenness is now showing, so
> there's little point in continuing to try to hide it.
>
> To that end, implement a flag to refine the SAC business into a
> per-device policy that can automatically get itself out of the way if
> and when it stops being useful.
Thanks for working on this, I think this is good to go. But given the
issues we had with last attempt I'd like to have this in linux-next for
a few weeks before sending it upstream. Therefore I will defer this
patch and merge it early in the next cycle.
Regards,
Joerg
On 2023-04-14 12:45, Joerg Roedel wrote:
> Hi Robin,
>
> On Thu, Apr 13, 2023 at 02:40:25PM +0100, Robin Murphy wrote:
>> Per the reasoning in commit 4bf7fda4dce2 ("iommu/dma: Add config for
>> PCI SAC address trick") and its subsequent revert, this mechanism no
>> longer serves its original purpose, but now only works around broken
>> hardware/drivers in a way that is unfortunately too impactful to remove.
>>
>> This does not, however, prevent us from solving the performance impact
>> which that workaround has on large-scale systems that don't need it.
>> Once the 32-bit IOVA space fills up and a workload starts allocating and
>> freeing on both sides of the boundary, the opportunistic SAC allocation
>> can then end up spending significant time hunting down scattered
>> fragments of free 32-bit space, or just reestablishing max32_alloc_size.
>> This can easily be exacerbated by a change in allocation pattern, such
>> as by changing the network MTU, which can increase pressure on the
>> 32-bit space by leaving a large quantity of cached IOVAs which are now
>> the wrong size to be recycled, but also won't be freed since the
>> non-opportunistic allocations can still be satisfied from the whole
>> 64-bit space without triggering the reclaim path.
>>
>> However, in the context of a workaround where smaller DMA addresses
>> aren't simply a preference but a necessity, if we get to that point at
>> all then in fact it's already the endgame. The nature of the allocator
>> is currently such that the first IOVA we give to a device after the
>> 32-bit space runs out will be the highest possible address for that
>> device, ever. If that works, then great, we know we can optimise for
>> speed by always allocating from the full range. And if it doesn't, then
>> the worst has already happened and any brokenness is now showing, so
>> there's little point in continuing to try to hide it.
>>
>> To that end, implement a flag to refine the SAC business into a
>> per-device policy that can automatically get itself out of the way if
>> and when it stops being useful.
>
> Thanks for working on this, I think this is good to go. But given the
> issues we had with last attempt I'd like to have this in linux-next for
> a few weeks before sending it upstream. Therefore I will defer this
> patch and merge it early in the next cycle.
Sounds good - I'm considerably more confident in this approach, but
although it should not be able to break any scenario which wasn't
already broken, it could potentially still make such a breakage more
noticeable. Thus in all honesty I'd feel happiest giving it a full cycle
of -next coverage as well.
Cheers,
Robin.
Robin,
On 4/13/2023 7:10 PM, Robin Murphy wrote:
> Per the reasoning in commit 4bf7fda4dce2 ("iommu/dma: Add config for
> PCI SAC address trick") and its subsequent revert, this mechanism no
> longer serves its original purpose, but now only works around broken
> hardware/drivers in a way that is unfortunately too impactful to remove.
>
> This does not, however, prevent us from solving the performance impact
> which that workaround has on large-scale systems that don't need it.
> Once the 32-bit IOVA space fills up and a workload starts allocating and
> freeing on both sides of the boundary, the opportunistic SAC allocation
> can then end up spending significant time hunting down scattered
> fragments of free 32-bit space, or just reestablishing max32_alloc_size.
> This can easily be exacerbated by a change in allocation pattern, such
> as by changing the network MTU, which can increase pressure on the
> 32-bit space by leaving a large quantity of cached IOVAs which are now
> the wrong size to be recycled, but also won't be freed since the
> non-opportunistic allocations can still be satisfied from the whole
> 64-bit space without triggering the reclaim path.
>
> However, in the context of a workaround where smaller DMA addresses
> aren't simply a preference but a necessity, if we get to that point at
> all then in fact it's already the endgame. The nature of the allocator
> is currently such that the first IOVA we give to a device after the
> 32-bit space runs out will be the highest possible address for that
> device, ever. If that works, then great, we know we can optimise for
> speed by always allocating from the full range. And if it doesn't, then
> the worst has already happened and any brokenness is now showing, so
> there's little point in continuing to try to hide it.
>
> To that end, implement a flag to refine the SAC business into a
> per-device policy that can automatically get itself out of the way if
> and when it stops being useful.
>
> CC: Linus Torvalds <[email protected]>
> CC: Jakub Kicinski <[email protected]>
> Reviewed-by: John Garry <[email protected]>
> Signed-off-by: Robin Murphy <[email protected]>
We hit a kernel soft lockup while running stress-ng on a system with 384 CPUs and an
NVMe disk. This patch helped to solve one soft lockup in the allocation path.
> ---
>
> v4: Rebase to use the new bitfield in dev_iommu, expand commit message.
>
> drivers/iommu/dma-iommu.c | 26 ++++++++++++++++++++------
> drivers/iommu/dma-iommu.h | 8 ++++++++
> drivers/iommu/iommu.c | 3 +++
> include/linux/iommu.h | 2 ++
> 4 files changed, 33 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 99b2646cb5c7..9193ad5bc72f 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -630,7 +630,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
> {
> struct iommu_dma_cookie *cookie = domain->iova_cookie;
> struct iova_domain *iovad = &cookie->iovad;
> - unsigned long shift, iova_len, iova = 0;
> + unsigned long shift, iova_len, iova;
>
> if (cookie->type == IOMMU_DMA_MSI_COOKIE) {
> cookie->msi_iova += size;
> @@ -645,15 +645,29 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
> if (domain->geometry.force_aperture)
> dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end);
>
> - /* Try to get PCI devices a SAC address */
> - if (dma_limit > DMA_BIT_MASK(32) && !iommu_dma_forcedac && dev_is_pci(dev))
> + /*
> + * Try to use all the 32-bit PCI addresses first. The original SAC vs.
> + * DAC reasoning loses relevance with PCIe, but enough hardware and
> + * firmware bugs are still lurking out there that it's safest not to
> + * venture into the 64-bit space until necessary.
> + *
> + * If your device goes wrong after seeing the notice then likely either
> + * its driver is not setting DMA masks accurately, the hardware has
> + * some inherent bug in handling >32-bit addresses, or not all the
> + * expected address bits are wired up between the device and the IOMMU.
> + */
> + if (dma_limit > DMA_BIT_MASK(32) && dev->iommu->pci_32bit_workaround) {
> iova = alloc_iova_fast(iovad, iova_len,
> DMA_BIT_MASK(32) >> shift, false);
> + if (iova)
> + goto done;
>
> - if (!iova)
> - iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift,
> - true);
> + dev->iommu->pci_32bit_workaround = false;
> + dev_notice(dev, "Using %d-bit DMA addresses\n", bits_per(dma_limit));
Maybe dev_notice_once()? Otherwise we may see this message multiple times for the same
device, like below:
[ 172.017120] nvme 0000:41:00.0: Using 64-bit DMA addresses
[ 172.022955] nvme 0000:41:00.0: Using 64-bit DMA addresses
[ 172.028720] nvme 0000:41:00.0: Using 64-bit DMA addresses
[ 172.031815] nvme 0000:41:00.0: Using 64-bit DMA addresses
[ 172.031816] nvme 0000:41:00.0: Using 64-bit DMA addresses
[ 172.038727] nvme 0000:41:00.0: Using 64-bit DMA addresses
[ 172.038726] nvme 0000:41:00.0: Using 64-bit DMA addresses
[ 172.038917] nvme 0000:41:00.0: Using 64-bit DMA addresses
[ 172.038968] nvme 0000:41:00.0: Using 64-bit DMA addresses
[ 172.038970] nvme 0000:41:00.0: Using 64-bit DMA addresses
[ 172.039007] nvme 0000:41:00.0: Using 64-bit DMA addresses
[ 172.039091] nvme 0000:41:00.0: Using 64-bit DMA addresses
[ 172.039102] nvme 0000:41:00.0: Using 64-bit DMA addresses
Otherwise patch worked fine for us.
Tested-by: Vasant Hegde <[email protected]>
-Vasant
On 18/04/2023 10:23, Vasant Hegde wrote:
> [ 172.017120] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.022955] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.028720] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.031815] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.031816] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.038727] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.038726] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.038917] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.038968] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.038970] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.039007] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.039091] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.039102] nvme 0000:41:00.0: Using 64-bit DMA addresses
>
> Otherwise patch worked fine for us.
Hi Vasant,
JFYI, since you are using NVMe, you could also alternatively try
something like what I did for some SCSI storage controller drivers to
limit the request_queue max_sectors soft limit, like:
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c2730b116dc6..0a99c9a629c9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1814,6 +1814,8 @@ static void nvme_set_queue_limits(struct nvme_ctrl
*ctrl,
max_segments = min_not_zero(max_segments, ctrl->max_segments);
blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
q->limits.max_sectors = min(q->limits.max_hw_sectors,
+ (unsigned int)dma_opt_mapping_size(ctrl->dev));
blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
}
blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
dma_opt_mapping_size() will return the max IOVA caching size for iommu
dma ops, so this would mean that we avoid alloc'ing and free'ing IOVAs
at such a high rate (which I assume was your problem).
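For illustration, here is a minimal sketch of that idea (a sketch only, not
the diff above): it assumes a hypothetical helper called from wherever the
driver sets up its queue limits, and uses min_t() on a wide type so the
return value of dma_opt_mapping_size() is never narrowed before the
comparison.

#include <linux/blkdev.h>
#include <linux/dma-mapping.h>

/*
 * Hypothetical helper (sketch): clamp the soft max_sectors limit to the
 * optimal DMA mapping size so requests stay within the IOVA caching range.
 */
static void clamp_max_sectors_for_dma(struct request_queue *q,
                                      struct device *dma_dev)
{
        size_t opt = dma_opt_mapping_size(dma_dev); /* bytes; 0 with HAS_DMA=n */

        if (opt)
                q->limits.max_sectors = min_t(u64, q->limits.max_hw_sectors,
                                              opt >> SECTOR_SHIFT);
}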
Thanks,
John
On 2023-04-18 10:23, Vasant Hegde wrote:
> Robin,
>
>
> On 4/13/2023 7:10 PM, Robin Murphy wrote:
>> Per the reasoning in commit 4bf7fda4dce2 ("iommu/dma: Add config for
>> PCI SAC address trick") and its subsequent revert, this mechanism no
>> longer serves its original purpose, but now only works around broken
>> hardware/drivers in a way that is unfortunately too impactful to remove.
>>
>> This does not, however, prevent us from solving the performance impact
>> which that workaround has on large-scale systems that don't need it.
>> Once the 32-bit IOVA space fills up and a workload starts allocating and
>> freeing on both sides of the boundary, the opportunistic SAC allocation
>> can then end up spending significant time hunting down scattered
>> fragments of free 32-bit space, or just reestablishing max32_alloc_size.
>> This can easily be exacerbated by a change in allocation pattern, such
>> as by changing the network MTU, which can increase pressure on the
>> 32-bit space by leaving a large quantity of cached IOVAs which are now
>> the wrong size to be recycled, but also won't be freed since the
>> non-opportunistic allocations can still be satisfied from the whole
>> 64-bit space without triggering the reclaim path.
>>
>> However, in the context of a workaround where smaller DMA addresses
>> aren't simply a preference but a necessity, if we get to that point at
>> all then in fact it's already the endgame. The nature of the allocator
>> is currently such that the first IOVA we give to a device after the
>> 32-bit space runs out will be the highest possible address for that
>> device, ever. If that works, then great, we know we can optimise for
>> speed by always allocating from the full range. And if it doesn't, then
>> the worst has already happened and any brokenness is now showing, so
>> there's little point in continuing to try to hide it.
>>
>> To that end, implement a flag to refine the SAC business into a
>> per-device policy that can automatically get itself out of the way if
>> and when it stops being useful.
>>
>> CC: Linus Torvalds <[email protected]>
>> CC: Jakub Kicinski <[email protected]>
>> Reviewed-by: John Garry <[email protected]>
>> Signed-off-by: Robin Murphy <[email protected]>
>
> We hit kernel softlockup while running stress-ng test system having 384 CPU and
> NVMe disk. This patch helped to solve one soft lockup in allocation path.
>
>> ---
>>
>> v4: Rebase to use the new bitfield in dev_iommu, expand commit message.
>>
>> drivers/iommu/dma-iommu.c | 26 ++++++++++++++++++++------
>> drivers/iommu/dma-iommu.h | 8 ++++++++
>> drivers/iommu/iommu.c | 3 +++
>> include/linux/iommu.h | 2 ++
>> 4 files changed, 33 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>> index 99b2646cb5c7..9193ad5bc72f 100644
>> --- a/drivers/iommu/dma-iommu.c
>> +++ b/drivers/iommu/dma-iommu.c
>> @@ -630,7 +630,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
>> {
>> struct iommu_dma_cookie *cookie = domain->iova_cookie;
>> struct iova_domain *iovad = &cookie->iovad;
>> - unsigned long shift, iova_len, iova = 0;
>> + unsigned long shift, iova_len, iova;
>>
>> if (cookie->type == IOMMU_DMA_MSI_COOKIE) {
>> cookie->msi_iova += size;
>> @@ -645,15 +645,29 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
>> if (domain->geometry.force_aperture)
>> dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end);
>>
>> - /* Try to get PCI devices a SAC address */
>> - if (dma_limit > DMA_BIT_MASK(32) && !iommu_dma_forcedac && dev_is_pci(dev))
>> + /*
>> + * Try to use all the 32-bit PCI addresses first. The original SAC vs.
>> + * DAC reasoning loses relevance with PCIe, but enough hardware and
>> + * firmware bugs are still lurking out there that it's safest not to
>> + * venture into the 64-bit space until necessary.
>> + *
>> + * If your device goes wrong after seeing the notice then likely either
>> + * its driver is not setting DMA masks accurately, the hardware has
>> + * some inherent bug in handling >32-bit addresses, or not all the
>> + * expected address bits are wired up between the device and the IOMMU.
>> + */
>> + if (dma_limit > DMA_BIT_MASK(32) && dev->iommu->pci_32bit_workaround) {
>> iova = alloc_iova_fast(iovad, iova_len,
>> DMA_BIT_MASK(32) >> shift, false);
>> + if (iova)
>> + goto done;
>>
>> - if (!iova)
>> - iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift,
>> - true);
>> + dev->iommu->pci_32bit_workaround = false;
>> + dev_notice(dev, "Using %d-bit DMA addresses\n", bits_per(dma_limit));
>
> May be dev_notice_once? Otherwise we may see this message multiple time for same
> device like below:
Oh, that's a bit irritating. Of course multiple threads can reach this
in parallel, silly me :(
I would really prefer the notice to be once per device rather than once
globally, since there's clearly no guarantee that the first device to
hit this case is going to be the one which is liable to go wrong. Does
the (untested) diff below do any better?
> [ 172.017120] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.022955] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.028720] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.031815] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.031816] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.038727] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.038726] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.038917] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.038968] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.038970] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.039007] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.039091] nvme 0000:41:00.0: Using 64-bit DMA addresses
> [ 172.039102] nvme 0000:41:00.0: Using 64-bit DMA addresses
>
> Otherwise patch worked fine for us.
>
> Tested-by: Vasant Hegde <[email protected]>
Thanks!
Robin.
----->8-----
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 9193ad5bc72f..db246fd3037f 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -662,8 +662,8 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
if (iova)
goto done;
- dev->iommu->pci_32bit_workaround = false;
- dev_notice(dev, "Using %d-bit DMA addresses\n", bits_per(dma_limit));
+ if (xchg(&dev->iommu->pci_32bit_workaround, false))
+ dev_notice(dev, "Using %d-bit DMA addresses\n", bits_per(dma_limit));
}
iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift, true);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 2575630d402d..8a12e923718f 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -419,7 +419,7 @@ struct dev_iommu {
void *priv;
u32 max_pasids;
u32 attach_deferred:1;
- u32 pci_32bit_workaround:1;
+ bool pci_32bit_workaround;
};
int iommu_device_register(struct iommu_device *iommu,
Robin,
On 4/18/2023 4:27 PM, Robin Murphy wrote:
> On 2023-04-18 10:23, Vasant Hegde wrote:
>> Robin,
>>
>>
>> On 4/13/2023 7:10 PM, Robin Murphy wrote:
>>> Per the reasoning in commit 4bf7fda4dce2 ("iommu/dma: Add config for
>>> PCI SAC address trick") and its subsequent revert, this mechanism no
>>> longer serves its original purpose, but now only works around broken
>>> hardware/drivers in a way that is unfortunately too impactful to remove.
>>>
>>> This does not, however, prevent us from solving the performance impact
>>> which that workaround has on large-scale systems that don't need it.
>>> Once the 32-bit IOVA space fills up and a workload starts allocating and
>>> freeing on both sides of the boundary, the opportunistic SAC allocation
>>> can then end up spending significant time hunting down scattered
>>> fragments of free 32-bit space, or just reestablishing max32_alloc_size.
>>> This can easily be exacerbated by a change in allocation pattern, such
>>> as by changing the network MTU, which can increase pressure on the
>>> 32-bit space by leaving a large quantity of cached IOVAs which are now
>>> the wrong size to be recycled, but also won't be freed since the
>>> non-opportunistic allocations can still be satisfied from the whole
>>> 64-bit space without triggering the reclaim path.
>>>
>>> However, in the context of a workaround where smaller DMA addresses
>>> aren't simply a preference but a necessity, if we get to that point at
>>> all then in fact it's already the endgame. The nature of the allocator
>>> is currently such that the first IOVA we give to a device after the
>>> 32-bit space runs out will be the highest possible address for that
>>> device, ever. If that works, then great, we know we can optimise for
>>> speed by always allocating from the full range. And if it doesn't, then
>>> the worst has already happened and any brokenness is now showing, so
>>> there's little point in continuing to try to hide it.
>>>
>>> To that end, implement a flag to refine the SAC business into a
>>> per-device policy that can automatically get itself out of the way if
>>> and when it stops being useful.
>>>
>>> CC: Linus Torvalds <[email protected]>
>>> CC: Jakub Kicinski <[email protected]>
>>> Reviewed-by: John Garry <[email protected]>
>>> Signed-off-by: Robin Murphy <[email protected]>
>>
>> We hit kernel softlockup while running stress-ng test system having 384 CPU and
>> NVMe disk. This patch helped to solve one soft lockup in allocation path.
>>
>>> ---
>>>
>>> v4: Rebase to use the new bitfield in dev_iommu, expand commit message.
>>>
>>> drivers/iommu/dma-iommu.c | 26 ++++++++++++++++++++------
>>> drivers/iommu/dma-iommu.h | 8 ++++++++
>>> drivers/iommu/iommu.c | 3 +++
>>> include/linux/iommu.h | 2 ++
>>> 4 files changed, 33 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>>> index 99b2646cb5c7..9193ad5bc72f 100644
>>> --- a/drivers/iommu/dma-iommu.c
>>> +++ b/drivers/iommu/dma-iommu.c
>>> @@ -630,7 +630,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct
>>> iommu_domain *domain,
>>> {
>>> struct iommu_dma_cookie *cookie = domain->iova_cookie;
>>> struct iova_domain *iovad = &cookie->iovad;
>>> - unsigned long shift, iova_len, iova = 0;
>>> + unsigned long shift, iova_len, iova;
>>> if (cookie->type == IOMMU_DMA_MSI_COOKIE) {
>>> cookie->msi_iova += size;
>>> @@ -645,15 +645,29 @@ static dma_addr_t iommu_dma_alloc_iova(struct
>>> iommu_domain *domain,
>>> if (domain->geometry.force_aperture)
>>> dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end);
>>> - /* Try to get PCI devices a SAC address */
>>> - if (dma_limit > DMA_BIT_MASK(32) && !iommu_dma_forcedac && dev_is_pci(dev))
>>> + /*
>>> + * Try to use all the 32-bit PCI addresses first. The original SAC vs.
>>> + * DAC reasoning loses relevance with PCIe, but enough hardware and
>>> + * firmware bugs are still lurking out there that it's safest not to
>>> + * venture into the 64-bit space until necessary.
>>> + *
>>> + * If your device goes wrong after seeing the notice then likely either
>>> + * its driver is not setting DMA masks accurately, the hardware has
>>> + * some inherent bug in handling >32-bit addresses, or not all the
>>> + * expected address bits are wired up between the device and the IOMMU.
>>> + */
>>> + if (dma_limit > DMA_BIT_MASK(32) && dev->iommu->pci_32bit_workaround) {
>>> iova = alloc_iova_fast(iovad, iova_len,
>>> DMA_BIT_MASK(32) >> shift, false);
>>> + if (iova)
>>> + goto done;
>>> - if (!iova)
>>> - iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift,
>>> - true);
>>> + dev->iommu->pci_32bit_workaround = false;
>>> + dev_notice(dev, "Using %d-bit DMA addresses\n", bits_per(dma_limit));
>>
>> May be dev_notice_once? Otherwise we may see this message multiple time for same
>> device like below:
>
> Oh, that's a bit irritating. Of course multiple threads can reach this
> in parallel, silly me :(
>
> I would really prefer the notice to be once per device rather than once
> globally, since there's clearly no guarantee that the first device to
Agree. Makes sense.
> hit this case is going to be the one which is liable to go wrong. Does
> the (untested) diff below do any better?
Thanks for the patch. I have tested it and it's working fine.
-Vasant
On Tue, Apr 18, 2023 at 3:20 AM John Garry <[email protected]> wrote:
>
> JFYI, Since you are using NVMe, you could also alternatively try
> something like which I did for some SCSI storage controller drivers to
> limit the request_queue max_sectors soft limit, like:
That patch is not only whitespace-damaged, it's randomly missing one
'+' character so it makes no sense even ignoring the whitespace
problems. _and_ it has a nonsensical cast to 'unsigned int' which
makes that 'min()' possibly do crazy and invalid things (ie imagine
dma_opt_mapping_size() returning 4GB).
You can't cast things to the smaller size just to get rid of a
warning, for chrissake!
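To make the failure mode concrete, here is a standalone illustration (not
kernel code; it uses a local macro, since the point is the narrowing cast,
which in the patch above was only there to silence min()'s type check):

#include <stdio.h>
#include <stdint.h>

/* Casting a 64-bit size down to unsigned int before taking the minimum
 * wraps a 4GB value to 0, so the "minimum" silently becomes 0. */
#define naive_min(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
        uint64_t opt_size = 1ULL << 32;      /* imagine dma_opt_mapping_size() returning 4GB */
        unsigned int max_hw = 1024 * 1024;   /* some existing hardware limit, in sectors */
        unsigned int bad = naive_min(max_hw, (unsigned int)opt_size);

        printf("clamped limit: %u\n", bad);  /* prints 0, not 1048576 */
        return 0;
}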
In fact, even without the cast, it seems entirely broken, since the
fallback for dma_opt_mapping_size() is to return 0 (admittedly _that_
case only happens with HAS_DMA=n).
Finally, doing this inside the
if (ctrl->max_hw_sectors) {
conditional seems entirely wrong, since any dma mapping limits would
be entirely independent of any driver maximum hw size, and in fact
*easier* to hit if the block device itself doesn't have any max
limits.
So please burn that patch in the darkest pits of hell and let's try to
forget it ever existed. Ok?
Also, shouldn't any possible dma mapping size affect not
'max_sectors', but 'max_segment_size'? At least the docs imply that
dma_opt_mapping_size() is about the max size of a _single_ mapping,
not of the whole thing?
Anyway, if this is actually an issue, to the point that it's now being
discussed for a _second_ block driver subsystem, then shouldn't the
queue handling just do this all automatically, instead of adding
random crap to random block driver architectures?
And no, I don't know this code, so maybe I'm entirely missing
something, but that patch just raised my hackles enough that I had to
say something.
Linus
On 18/04/2023 18:36, Linus Torvalds wrote:
>> JFYI, Since you are using NVMe, you could also alternatively try
>> something like which I did for some SCSI storage controller drivers to
>> limit the request_queue max_sectors soft limit, like:
> That patch is not only whitespace-damaged, it's randomly missing one
> '+' character
My copy and paste error
> so it makes no sense even ignoring the whitespace
> problems. _and_ it has a nonsensical cast to 'unsigned int' which
> makes that 'min()' possibly do crazy and invalid things (ie imagine
> dma_opt_mapping_size() returning 4GB).
>
> You can't cast things to the smaller size just to get rid of a
> warning, for chrissake!
Yeah, sorry, I was just trying to show a very quick demo of how this can
actually be done.
Indeed, I could have mentioned that it would actually have been easier
to test by feeding a lower limit into /sys/block/<dev>/queue/max_sectors_kb
>
> In fact, even without the cast, it seems entirely broken, since the
> fallback for dma_opt_mapping_size() is to return 0 (admittedly _that_
> case only happens with HAS_DMA=n).
>
> Finally, doing this inside the
>
> if (ctrl->max_hw_sectors) {
I think that this would be set for PCI NVMe controllers, which we were
interested in here. But, indeed, I could check for a better place to set
this.
>
> conditional seems entirely wrong, since any dma mapping limits would
> be entirely independent of any driver maximum hw size, and in fact
> *easier* to hit if the block device itself doesn't have any max
> limits.
>
> So please burn that patch in the darkest pits of hell and let's try to
> forget it ever existed. Ok?
Sure
>
> Also, shouldn't any possible dma mapping size affect not
> 'max_sectors', but 'max_segment_size'? At least the docs imply that
> dma_opt_mapping_size() is about the max size of a _single_ mapping,
> not of the whole thing?
It's meant to apply to total mapping length and not a single segment, so
then the doc would be misleading.
>
> Anyway, if this is actually an issue, to the point that it's now being
> discussed for a _second_ block driver subsystem, then shouldn't the
> queue handling just do this all automatically, instead of adding
> random crap to random block driver architectures?
Other storage controllers may enjoy better performance with very large
DMA mappings (whose total length exceeds the IOVA caching limit), so it
was too risky to apply a performance-related change of this nature
across the board when that API was introduced.
So far it had only been a single controller where we were actually
seeing the issue of alloc'ing IOVAs giving very (very) poor performance.
However, as far as I am aware, there was nothing special about that
controller, apart from the fact that it was often creating requests
whose length exceeded that IOVA caching limit, and it also filling the
32b IOVA space quickly - that may be because the system had lots of CPUs.
Since there are now reports of poor performance in other storage
controllers and also in networking adapters, I can only assume that
people are testing more often with IOMMU-enabled systems with lots of
CPUs. Having said that, I would still be cautious of applying that limit
everywhere.
>
> And no, I don't know this code, so maybe I'm entirely missing
> something, but that patch just raised my hackles enough that I had to
> say something.
Sure.
Thanks,
John
On Fri, Apr 14, 2023 at 06:45:57PM +0100, Robin Murphy wrote:
> Sounds good - I'm considerably more confident in this approach, but although
> it should not be able to break any scenario which wasn't already broken, it
> could potentially still make such a breakage more noticeable. Thus in all
> honesty I'd feel happiest giving it a full cycle of -next coverage as well.
I had some second thoughts on this, wouldn't it be better to change the
allocator to allocate from lowest addresses first? Then we can just
remove the SAC trick and rely on dma-masks only.
Thoughts?
Regards,
Joerg
On 23/05/2023 5:06 pm, Joerg Roedel wrote:
> On Fri, Apr 14, 2023 at 06:45:57PM +0100, Robin Murphy wrote:
>> Sounds good - I'm considerably more confident in this approach, but although
>> it should not be able to break any scenario which wasn't already broken, it
>> could potentially still make such a breakage more noticeable. Thus in all
>> honesty I'd feel happiest giving it a full cycle of -next coverage as well.
>
> I had some second thoughts on this, wouldn't it be better to change the
> allocator to allocate from lowest addresses first? Then we can just
> remove the SAC trick and rely on dma-masks only.
>
> Thoughts?
Yes, in the long term I definitely would like to have a more flexible
allocator - this is more of a stop-gap measure that's an easy win with
what we have now.
Top-down allocation is nice in that it makes for easily recognisable DMA
addresses, and does do a great job of flushing out bugs, but having the
option of bottom-up allocation would definitely be useful in various
cases - realistically it's pretty much a prerequisite for converting
arch/arm to use iommu-dma. However, given all the other scalability
issues that keep coming to light, I think that's going to be the tipping
point for ripping up the existing code and giving it a major overhaul,
which I would love to be able to get stuck in to, but don't have the
capacity for at the moment.
Thanks,
Robin.
On Fri, 14 Apr 2023 13:45:34 +0200 Joerg Roedel wrote:
> Thanks for working on this, I think this is good to go. But given the
> issues we had with last attempt I'd like to have this in linux-next for
> a few weeks before sending it upstream. Therefore I will defer this
> patch and merge it early in the next cycle.
Is this patch queued up? I don't see it in linux-next and we keep
hitting this issue in production. After a few NIC reconfigurations
IOMMU starts consuming 90+% of CPU time.
On 13/06/2023 18:58, Jakub Kicinski wrote:
> On Fri, 14 Apr 2023 13:45:34 +0200 Joerg Roedel wrote:
>> Thanks for working on this, I think this is good to go. But given the
>> issues we had with last attempt I'd like to have this in linux-next for
>> a few weeks before sending it upstream. Therefore I will defer this
>> patch and merge it early in the next cycle.
> Is this patch queued up? I don't see it in linux-next and we keep
> hitting this issue in production. After a few NIC reconfigurations
> IOMMU starts consuming 90+% of CPU time.
>
Since we're at rc6 time and a cautious approach was wanted to merge this
change, I doubt that this will be merged for this cycle. That's quite
unfortunate.
Please note what I mentioned earlier about using dma_opt_mapping_size().
This API is used by some block storage drivers to avoid the same
problem, by clamping max_sectors_kb at this size - see sysfs-block Doc
for info there. Maybe it can be used similarly for network drivers.
Thanks,
John
On 2023-06-15 08:49, John Garry wrote:
> On 13/06/2023 18:58, Jakub Kicinski wrote:
>> On Fri, 14 Apr 2023 13:45:34 +0200 Joerg Roedel wrote:
>>> Thanks for working on this, I think this is good to go. But given the
>>> issues we had with last attempt I'd like to have this in linux-next for
>>> a few weeks before sending it upstream. Therefore I will defer this
>>> patch and merge it early in the next cycle.
>> Is this patch queued up? I don't see it in linux-next and we keep
>> hitting this issue in production. After a few NIC reconfigurations
>> IOMMU starts consuming 90+% of CPU time.
>>
>
> Since we're at rc6 time and a cautious approach was wanted to merge this
> change, I doubt that this will be merged for this cycle. That's quite
> unfortunate.
>
> Please note what I mentioned earlier about using dma_opt_mapping_size().
> This API is used by some block storage drivers to avoid your same
> problem, by clamping max_sectors_kb at this size - see sysfs-block Doc
> for info there. Maybe it can be used similarly for network drivers.
It's not the same problem - in this case the mappings are already small
enough to use the rcaches, and it seems more to do with the total number
of unusable cached IOVAs being enough to keep the 32-bit space
almost-but-not-quite full most of the time, defeating the
max32_alloc_size optimisation whenever the caches run out of the right
size entries.
The manual workaround for now would be to boot with "iommu.forcedac=1"
and hope that no other devices break because of it.
Thanks,
Robin.
On 15/06/2023 10:04, Robin Murphy wrote:
>> Since we're at rc6 time and a cautious approach was wanted to merge
>> this change, I doubt that this will be merged for this cycle. That's
>> quite unfortunate.
>>
>> Please note what I mentioned earlier about using
>> dma_opt_mapping_size(). This API is used by some block storage drivers
>> to avoid your same problem, by clamping max_sectors_kb at this size -
>> see sysfs-block Doc for info there. Maybe it can be used similarly for
>> network drivers.
>
> It's not the same problem - in this case the mappings are already small
> enough to use the rcaches, and it seems more to do with the total number
> of unusable cached IOVAs being enough to keep the 32-bit space
> almost-but-not-quite full most of the time, defeating the
> max32_alloc_size optimisation whenever the caches run out of the right
> size entries.
Sure, not the same problem.
However, when we switched storage drivers to use dma_opt_mapping_size(),
performance was similar to iommu.forcedac=1 - that's what I found,
anyway.
This tells me that even though IOVA allocator performance is poor
when the 32b space fills, it was those large IOVAs which don't fit in
the rcache which were the major contributor to hogging the CPU in the
allocator.
>
> The manual workaround for now would be to boot with "iommu.forcedac=1"
> and hope that no other devices break because of it.
Thanks,
John
On 2023-06-15 11:11, John Garry wrote:
> On 15/06/2023 10:04, Robin Murphy wrote:
>>> Since we're at rc6 time and a cautious approach was wanted to merge
>>> this change, I doubt that this will be merged for this cycle. That's
>>> quite unfortunate.
>>>
>>> Please note what I mentioned earlier about using
>>> dma_opt_mapping_size(). This API is used by some block storage
>>> drivers to avoid your same problem, by clamping max_sectors_kb at
>>> this size - see sysfs-block Doc for info there. Maybe it can be used
>>> similarly for network drivers.
>>
>> It's not the same problem - in this case the mappings are already
>> small enough to use the rcaches, and it seems more to do with the
>> total number of unusable cached IOVAs being enough to keep the 32-bit
>> space almost-but-not-quite full most of the time, defeating the
>> max32_alloc_size optimisation whenever the caches run out of the right
>> size entries.
>
> Sure, not the same problem.
>
> However when we switched storage drivers to use dma_opt_mapping_size()
> then performance is similar to iommu.forcedac=1 - that's what I found,
> anyway.
>
> This tells me that that even though IOVA allocator performance is poor
> when the 32b space fills, it was those large IOVAs which don't fit in
> the rcache which were the major contributor to hogging the CPU in the
> allocator.
The root cause is that every time the last usable 32-bit IOVA is
allocated, the *next* PCI caller to hit the rbtree for a SAC allocation
is burdened with walking the whole 32-bit subtree to determine that it's
full again and re-set max32_alloc_size. That's the overhead that
forcedac avoids.
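In rough outline, that mechanism looks like the toy model below (an
editorial paraphrase of the logic in drivers/iommu/iova.c, not the verbatim
kernel code; the toy_ names are invented for illustration):

#include <stdbool.h>

/*
 * Toy model: the first caller to find the 32-bit space full pays for a walk
 * of the whole subtree, then caches the failing size so later callers fail
 * fast -- until any free below the 32-bit boundary resets the cache and the
 * next size-limited caller pays for the full walk again.
 */
struct toy_iova_domain {
        unsigned long dma_32bit_pfn;    /* top of the 32-bit IOVA space */
        unsigned long max32_alloc_size; /* smallest size known not to fit */
};

/* Stand-in for the expensive rbtree walk; pretend the space is full. */
static bool walk_for_gap(unsigned long size, unsigned long limit_pfn)
{
        (void)size; (void)limit_pfn;
        return false;
}

static bool toy_alloc(struct toy_iova_domain *d, unsigned long size,
                      unsigned long limit_pfn)
{
        /* Fail fast if an earlier walk already proved this size can't fit. */
        if (limit_pfn <= d->dma_32bit_pfn && size >= d->max32_alloc_size)
                return false;

        if (!walk_for_gap(size, limit_pfn)) {
                d->max32_alloc_size = size;     /* remember the failure */
                return false;
        }
        return true;
}

static void toy_free(struct toy_iova_domain *d, unsigned long pfn_lo)
{
        /* Any free below the boundary makes the 32-bit space look usable again. */
        if (pfn_lo < d->dma_32bit_pfn)
                d->max32_alloc_size = d->dma_32bit_pfn + 1;
}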
In the storage case with larger buffers, dma_opt_mapping_size() also
means you spend less time in the rbtree, but because you're inherently
hitting it less often at all, since most allocations can now hopefully
be fulfilled by the caches. That's obviously moot when the mappings are
already small enough to be cached and the only reason for hitting the
rbtree is overflow/underflow in the depot because the working set is
sufficiently large and the allocation pattern sufficiently "bursty".
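And the depot overflow in question looks roughly like the sketch below
(again a paraphrase of the per-CPU rcache code of that era, with toy_ names
invented for illustration): the free path only falls back to the rbtree once
both per-CPU magazines and the fixed-size global depot are full, at which
point a whole magazine's worth of IOVAs spills back into the rbtree in one
burst (the alloc path similarly falls back when everything runs empty).

#include <stdbool.h>
#include <stddef.h>

#define TOY_MAG_SIZE    128
#define TOY_MAX_DEPOT   32      /* mirrors the fixed-size depot of that era */

struct toy_magazine {
        unsigned long pfns[TOY_MAG_SIZE];
        unsigned long size;
};

struct toy_rcache {
        struct toy_magazine *loaded, *prev;             /* per-CPU pair */
        struct toy_magazine *depot[TOY_MAX_DEPOT];      /* global depot */
        unsigned long depot_size;
};

static bool mag_full(const struct toy_magazine *m)
{
        return m->size == TOY_MAG_SIZE;
}

/* Returns the magazine whose IOVAs must be given back to the rbtree, if any. */
static struct toy_magazine *toy_rcache_insert(struct toy_rcache *rc,
                                              struct toy_magazine *empty_mag,
                                              unsigned long pfn)
{
        struct toy_magazine *mag_to_free = NULL;
        struct toy_magazine *tmp;

        if (!mag_full(rc->loaded)) {
                /* Common case: room in the loaded magazine, no rbtree work. */
        } else if (!mag_full(rc->prev)) {
                tmp = rc->prev; rc->prev = rc->loaded; rc->loaded = tmp;
        } else if (rc->depot_size < TOY_MAX_DEPOT) {
                /* Both magazines full: retire the loaded one to the depot. */
                rc->depot[rc->depot_size++] = rc->loaded;
                rc->loaded = empty_mag;
        } else {
                /* Depot full too: this burst spills back into the rbtree. */
                mag_to_free = rc->loaded;
                rc->loaded = empty_mag;
        }

        rc->loaded->pfns[rc->loaded->size++] = pfn;
        return mag_to_free;
}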
Thanks,
Robin.
On 15/06/2023 12:41, Robin Murphy wrote:
>>
>> Sure, not the same problem.
>>
>> However when we switched storage drivers to use dma_opt_mapping_size()
>> then performance is similar to iommu.forcedac=1 - that's what I found,
>> anyway.
>>
>> This tells me that that even though IOVA allocator performance is poor
>> when the 32b space fills, it was those large IOVAs which don't fit in
>> the rcache which were the major contributor to hogging the CPU in the
>> allocator.
>
> The root cause is that every time the last usable 32-bit IOVA is
> allocated, the *next* PCI caller to hit the rbtree for a SAC allocation
> is burdened with walking the whole 32-bit subtree to determine that it's
> full again and re-set max32_alloc_size. That's the overhead that
> forcedac avoids.
>
Sure
> In the storage case with larger buffers, dma_opt_mapping_size() also
> means you spend less time in the rbtree, but because you're inherently
> hitting it less often at all, since most allocations can now hopefully
> be fulfilled by the caches.
Sure
> That's obviously moot when the mappings are
> already small enough to be cached and the only reason for hitting the
> rbtree is overflow/underflow in the depot because the working set is
> sufficiently large and the allocation pattern sufficiently "bursty".
After a bit of checking, this is the same issue as
https://lore.kernel.org/linux-iommu/[email protected]/,
and indeed we would always be using rcache'able-sized mappings.
So you think that we are reaching the depot full issue when we start to
free depot magazines in __iova_rcache_insert(), right? From my
experience in storage testing, it takes a long time to get to this state.
Thanks,
John
On Thu, Apr 13, 2023 at 02:40:25PM +0100, Robin Murphy wrote:
> To that end, implement a flag to refine the SAC business into a
> per-device policy that can automatically get itself out of the way if
> and when it stops being useful.
It is summer time and it may be the heat, but I am in a brave mood and
decided to give this a shot. I applied it to the core branch and will
push it to linux-next for testing.
Regards,
Joerg
On 14/07/2023 15:09, Joerg Roedel wrote:
> On Thu, Apr 13, 2023 at 02:40:25PM +0100, Robin Murphy wrote:
>> To that end, implement a flag to refine the SAC business into a
>> per-device policy that can automatically get itself out of the way if
>> and when it stops being useful.
> It is summer time and it may be the heat, but I am in a brave mood and
> decided to give this a shoot. I applied it to the core branch and will
> push it to linux-next for testing.
Cool.
@Robin, I think that you sent an incremental change to this original
patch in this same thread. The issue which it solved was that we saw
dev_notice() logs many times. I glanced at linux-next and Joerg does not
seem to have included that.
Thanks,
John