VFIO currently exposes the usable IOVA regions through the
VFIO_IOMMU_GET_INFO ioctl / VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
capability. However it fails to take into account the dma_mask
of the devices within the container. The top limit currently is
defined by the iommu aperture.
So, for instance, if the IOMMU supports up to 48bits, it may give
the impression the max IOVA is 48b while a device may have a
dma_mask of 42b. So this API cannot really be used to compute
the max usable IOVA.
This patch removes the IOVA region beyond the dma_mask's.
As we start to expose this reserved region in the sysfs file
/sys/kernel/iommu_groups/<n>/reserved_regions, we also need to
handle the IOVA range beyond the IOMMU aperture to handle the case
where the dma_mask would have a higher number of bits than the iommu
max input address.
This is a change to the ABI as this reserved region was not yet
exposed in sysfs /sys/kernel/iommu_groups/<n>/reserved_regions or
through the VFIO ioctl. At VFIO level we increment the version of
the VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE capability to advertise
that change.
Best Regards
Eric
This series can be found at:
https://github.com/eauger/qemu/tree/dma_mask_rfc
Eric Auger (3):
iommu: Fix merging in iommu_insert_resv_region
iommu: Account for dma_mask and iommu aperture in IOVA reserved
regions
vfio/type1: Increase the version of
VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
.../ABI/testing/sysfs-kernel-iommu_groups | 7 ++++
drivers/iommu/iommu.c | 41 ++++++++++++++++++-
drivers/vfio/vfio_iommu_type1.c | 2 +-
3 files changed, 48 insertions(+), 2 deletions(-)
--
2.21.3
VFIO currently exposes the usable IOVA regions through the
VFIO_IOMMU_GET_INFO ioctl. However it fails to take into account
the dma_mask of the devices within the container. The top limit
currently is defined by the iommu aperture.
So, for instance, if the IOMMU supports up to 48bits, it may give
the impression the max IOVA is 48b while a device may have a
dma_mask of 42b. So this API cannot really be used to compute
the max usable IOVA.
This patch removes the IOVA region beyond the dma_mask's. As we
start to expose this reserved region in the sysfs file
/sys/kernel/iommu_groups/<n>/reserved_regions, we also need to
expose the IOVA range beyond the IOMMU aperture to handle the case
where the dma_mask would have a higher number of bits than the iommu
max input address. Those out-of-reach regions get the
IOMMU_RESV_RESERVED type.
This is a change to the ABI as this reserved region was not yet
exposed in sysfs /sys/kernel/iommu_groups/<n>/reserved_regions or
through the VFIO ioctl. Document that change.
Signed-off-by: Eric Auger <[email protected]>
---
.../ABI/testing/sysfs-kernel-iommu_groups | 7 ++++
drivers/iommu/iommu.c | 39 +++++++++++++++++++
2 files changed, 46 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-kernel-iommu_groups b/Documentation/ABI/testing/sysfs-kernel-iommu_groups
index 017f5bc3920c..2f316686c88b 100644
--- a/Documentation/ABI/testing/sysfs-kernel-iommu_groups
+++ b/Documentation/ABI/testing/sysfs-kernel-iommu_groups
@@ -33,3 +33,10 @@ Description: In case an RMRR is used only by graphics or USB devices
it is now exposed as "direct-relaxable" instead of "direct".
In device assignment use case, for instance, those RMRR
are considered to be relaxable and safe.
+
+What: /sys/kernel/iommu_groups/reserved_regions
+Date: Sept 2020
+KernelVersion: v5.11
+Contact: Eric Auger <[email protected]>
+Description: Regions beyond the device dma_mask and the iommu aperture
+ now are exposed as IOMMU_RESV_RESERVED reserved regions.
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index dd8cda340e62..d797f07b3625 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2511,9 +2511,48 @@ EXPORT_SYMBOL_GPL(iommu_domain_set_attr);
void iommu_get_resv_regions(struct device *dev, struct list_head *list)
{
const struct iommu_ops *ops = dev->bus->iommu_ops;
+ struct iommu_resv_region *region;
+ struct iommu_domain *domain;
+
+ domain = iommu_get_domain_for_dev(dev);
+
+ if (domain) {
+ struct iommu_domain_geometry geo;
+
+ if (iommu_domain_get_attr(domain, DOMAIN_ATTR_GEOMETRY, &geo))
+ return;
+
+ if (geo.aperture_end < ULLONG_MAX && geo.aperture_end != geo.aperture_start) {
+ region = iommu_alloc_resv_region(geo.aperture_end + 1,
+ ULLONG_MAX - geo.aperture_end,
+ 0, IOMMU_RESV_RESERVED);
+ if (!region)
+ return;
+ list_add_tail(®ion->list, list);
+ }
+
+ if (geo.aperture_start > 0) {
+ region = iommu_alloc_resv_region(0, geo.aperture_start,
+ 0, IOMMU_RESV_RESERVED);
+ if (!region)
+ return;
+ list_add_tail(®ion->list, list);
+ }
+ }
if (ops && ops->get_resv_regions)
ops->get_resv_regions(dev, list);
+
+ if (!dev->dma_mask || *dev->dma_mask == ULLONG_MAX)
+ return;
+
+ region = iommu_alloc_resv_region(*dev->dma_mask + 1,
+ ULLONG_MAX - *dev->dma_mask,
+ 0, IOMMU_RESV_RESERVED);
+ if (!region)
+ return;
+
+ list_add_tail(®ion->list, list);
}
void iommu_put_resv_regions(struct device *dev, struct list_head *list)
--
2.21.3
Now the IOVA regions beyond the dma_mask and the vfio aperture are
removed from the usable IOVA ranges, the API becomes reliable to
compute the max IOVA. Let's advertise this by using a new version
for the capability.
Signed-off-by: Eric Auger <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 5fbf0c1f7433..af4596123954 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2541,7 +2541,7 @@ static int vfio_iommu_iova_add_cap(struct vfio_info_cap *caps,
struct vfio_iommu_type1_info_cap_iova_range *iova_cap;
header = vfio_info_cap_add(caps, size,
- VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE, 1);
+ VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE, 2);
if (IS_ERR(header))
return PTR_ERR(header);
--
2.21.3
On Mon, 28 Sep 2020 21:50:34 +0200
Eric Auger <[email protected]> wrote:
> VFIO currently exposes the usable IOVA regions through the
> VFIO_IOMMU_GET_INFO ioctl / VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
> capability. However it fails to take into account the dma_mask
> of the devices within the container. The top limit currently is
> defined by the iommu aperture.
I think that dma_mask is traditionally a DMA API interface for a device
driver to indicate to the DMA layer which mappings are accessible to the
device. On the other hand, vfio makes use of the IOMMU API where the
driver is in userspace. That userspace driver has full control of the
IOVA range of the device, therefore dma_mask is mostly irrelevant to
vfio. I think the issue you're trying to tackle is that the IORT code
is making use of the dma_mask to try to describe a DMA address
limitation imposed by the PCI root bus, living between the endpoint
device and the IOMMU. Therefore, if the IORT code is exposing a
topology or system imposed device limitation, this seems much more akin
to something like an MSI reserved range, where it's not necessarily the
device or the IOMMU with the limitation, but something that sits
between them.
> So, for instance, if the IOMMU supports up to 48bits, it may give
> the impression the max IOVA is 48b while a device may have a
> dma_mask of 42b. So this API cannot really be used to compute
> the max usable IOVA.
>
> This patch removes the IOVA region beyond the dma_mask's.
Rather it adds a reserved region accounting for the range above the
device's dma_mask. I don't think the IOMMU API should be consuming
dma_mask like this though. For example, what happens in
pci_dma_configure() when there are no OF or ACPI DMA restrictions? It
appears to me that the dma_mask from whatever previous driver had the
device carries over to the new driver. That's generally ok for the DMA
API because a driver is required to set the device's DMA mask. It
doesn't make sense however to blindly consume that dma_mask and export
it via an IOMMU API. For example I would expect to see different
results depending on whether a host driver has been bound to a device.
It seems the correct IOMMU API approach would be for the IORT code to
specifically register reserved ranges for the device.
> As we start to expose this reserved region in the sysfs file
> /sys/kernel/iommu_groups/<n>/reserved_regions, we also need to
> handle the IOVA range beyond the IOMMU aperture to handle the case
> where the dma_mask would have a higher number of bits than the iommu
> max input address.
Why? The IOMMU geometry already describes this and vfio combines both
the IOMMU geometry and the device reserved regions when generating the
IOVA ranges? Who is going to consume this information? Additionally
it appears that reserved regions will report different information
depending on whether a device is attached to a domain.
> This is a change to the ABI as this reserved region was not yet
> exposed in sysfs /sys/kernel/iommu_groups/<n>/reserved_regions or
> through the VFIO ioctl. At VFIO level we increment the version of
> the VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE capability to advertise
> that change.
Is this really an ABI change? The original entry for reserved regions
includes:
Not necessarily all reserved regions are listed. This is typically
used to output direct-mapped, MSI, non mappable regions.
I imagine the intention here was non-mappable relative to the IOMMU,
but non-mappable to the device is essentially what we're including
here.
I'm also concerned about bumping the vfio interface version for the
IOVA range. We're not changing the interface, we're modifying the
result, and even then only for a fraction of users. How many users are
potentially broken by that change? Are we going to bump the version
for everyone any time the result changes on any platform? Thanks,
Alex
On Mon, Sep 28, 2020 at 09:50:36PM +0200, Eric Auger wrote:
> VFIO currently exposes the usable IOVA regions through the
> VFIO_IOMMU_GET_INFO ioctl. However it fails to take into account
> the dma_mask of the devices within the container. The top limit
> currently is defined by the iommu aperture.
Can we take a step back here? The dma_mask only has a meaning for
the DMA API, and not the iommu API, it should have no relevance here.
More importantly if we are using vfio no dma_mask should be set to
start with.
> + if (geo.aperture_end < ULLONG_MAX && geo.aperture_end != geo.aperture_start) {
Please avoid pointlessly overlong lines.
Hi all,
[also correcting some outdated email addresses + adding Lorenzo in cc]
On 9/29/20 12:42 AM, Alex Williamson wrote:
> On Mon, 28 Sep 2020 21:50:34 +0200
> Eric Auger <[email protected]> wrote:
>
>> VFIO currently exposes the usable IOVA regions through the
>> VFIO_IOMMU_GET_INFO ioctl / VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
>> capability. However it fails to take into account the dma_mask
>> of the devices within the container. The top limit currently is
>> defined by the iommu aperture.
>
> I think that dma_mask is traditionally a DMA API interface for a device
> driver to indicate to the DMA layer which mappings are accessible to the
> device. On the other hand, vfio makes use of the IOMMU API where the
> driver is in userspace. That userspace driver has full control of the
> IOVA range of the device, therefore dma_mask is mostly irrelevant to
> vfio. I think the issue you're trying to tackle is that the IORT code
> is making use of the dma_mask to try to describe a DMA address
> limitation imposed by the PCI root bus, living between the endpoint
> device and the IOMMU. Therefore, if the IORT code is exposing a
> topology or system imposed device limitation, this seems much more akin
> to something like an MSI reserved range, where it's not necessarily the
> device or the IOMMU with the limitation, but something that sits
> between them.
First I think I failed to explain the context. I worked on NVMe
passthrough on ARM. The QEMU NVMe backend uses VFIO to program the
physical device. The IOVA allocator there currently uses an IOVA range
within [0x10000, 1ULL << 39]. This IOVA layout rather is arbitrary if I
understand correctly. I noticed we rapidly get some VFIO MAP DMA
failures because the allocated IOVA collide with the ARM MSI reserved
IOVA window [0x8000000, 0x8100000]. Since 9b77e5c79840 ("vfio/type1:
Check reserved region conflict and update iova list"), such VFIO MAP DMA
attempts to map IOVAs belonging to host reserved IOVA windows fail. So,
by using the VFIO_IOMMU_GET_INFO ioctl /
VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE I can change the IOVA allocator to
avoid allocating within this range and others. While working on this, I
tried to automatically compute the min/max IOVAs and change the
arbitrary [0x10000, 1ULL << 39]. My SMMUv2 supports up to 48b so
naturally the max IOVA was computed as 1ULL << 48. The QEMU NVMe backend
allocates at the bottom and at the top of the range. I noticed the use
case was not working as soon as the top IOVA was more than 1ULL << 42.
And then we noticed the dma_mask was set to 42 by using
cat /sys/bus/pci/devices/0005:01:00.0/dma_mask_bits. So my
interpretation is the dma_mask was somehow containing the info the
device couldn't handle IOVAs beyond a certain limit.
In my case the 42b limit is computed in iort_dma_setup() by
acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
Referring to the comment, it does "Evaluate DMA regions and return
respectively DMA region start, offset and size in dma_addr, offset and
size on parsing success". This parses the ACPI table, looking for ACPI
companions with _DMA methods.
But as Alex mentioned, the IORT also allows to define limits on "the
number of address bits, starting from the least significant bit that can
be generated by a device when it accesses memory". See Named component
node.Device Memory Address Size limit or PCI root complex node. Memory
address size limit.
ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
if (ret == -ENODEV)
ret = dev_is_pci(dev) ? rc_dma_get_range(dev, &size)
: nc_dma_get_range(dev, &size);
So eventually those info collected from the ACPI tables which do impact
the usable IOVA range seem to be stored in the dma_mask, hence that
proposal.
>
>> So, for instance, if the IOMMU supports up to 48bits, it may give
>> the impression the max IOVA is 48b while a device may have a
>> dma_mask of 42b. So this API cannot really be used to compute
>> the max usable IOVA.
>>
>> This patch removes the IOVA region beyond the dma_mask's.
>
> Rather it adds a reserved region accounting for the range above the
> device's dma_mask.
Yep. It adds new reserved regions in
/sys/kernel/iommu_groups/<n>/reserved_regions and remove those from the
usable regions exposed by VFIO GET_INFO.
I don't think the IOMMU API should be consuming
> dma_mask like this though. For example, what happens in
> pci_dma_configure() when there are no OF or ACPI DMA restrictions?
My guess was that the dma_mask was set to the max range but I did not
test it.
It
> appears to me that the dma_mask from whatever previous driver had the
> device carries over to the new driver. That's generally ok for the DMA
> API because a driver is required to set the device's DMA mask. It
> doesn't make sense however to blindly consume that dma_mask and export
> it via an IOMMU API. For example I would expect to see different
> results depending on whether a host driver has been bound to a device.
> It seems the correct IOMMU API approach would be for the IORT code to
> specifically register reserved ranges for the device.
Is it only specific to IORT table? acpi_dma_get_range() in
drivers/acpi/scan.c is generic.
>
>> As we start to expose this reserved region in the sysfs file
>> /sys/kernel/iommu_groups/<n>/reserved_regions, we also need to
>> handle the IOVA range beyond the IOMMU aperture to handle the case
>> where the dma_mask would have a higher number of bits than the iommu
>> max input address.
>
> Why? The IOMMU geometry already describes this and vfio combines both
> the IOMMU geometry and the device reserved regions when generating the
> IOVA ranges?
Yes VFIO layer does add the info about the topology but
/sys/kernel/iommu_groups/<n>/reserved_regions, generated by the IOMMU
code, does not. this latter only exposes reserved regions. Assume the
dma_mask is 48b and the IOMMU aperture is 42b (assuming it is possible),
if you only take into account the "dma_mask" limitation, the end-user
will interpret this as: I can use up to 48b.
Who is going to consume this information? Additionally
> it appears that reserved regions will report different information
> depending on whether a device is attached to a domain.
yes that's correct. Well at some point we decided to expose (some)
reserved regions through sysfs. Only printing a reduced set of those
also can be misleading, hence my attempt to be more comprehensive.
>
>> This is a change to the ABI as this reserved region was not yet
>> exposed in sysfs /sys/kernel/iommu_groups/<n>/reserved_regions or
>> through the VFIO ioctl. At VFIO level we increment the version of
>> the VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE capability to advertise
>> that change.
>
> Is this really an ABI change? The original entry for reserved regions
> includes:
>
> Not necessarily all reserved regions are listed. This is typically
> used to output direct-mapped, MSI, non mappable regions.
I agree. That's not really a change in the ABI but I wanted to make
things clear about the induced changes for the end-user. On the other
end there will be a change in the number of reported resv regions.
>
> I imagine the intention here was non-mappable relative to the IOMMU,
> but non-mappable to the device is essentially what we're including
> here.
>
> I'm also concerned about bumping the vfio interface version for the
> IOVA range. We're not changing the interface, we're modifying the
> result, and even then only for a fraction of users. How many users are
> potentially broken by that change? Are we going to bump the version
> for everyone any time the result changes on any platform? Thanks,
The userspace needs to know if the GET_INFO is reliable to compute the
min/max IOVAs. If we do not change the version, it cannot know and must
assume it is not. Wasn't the version field meant for that somehow?
Thanks
Eric
>
> Alex
>
Hi Christoph,
On 9/29/20 8:03 AM, Christoph Hellwig wrote:
> On Mon, Sep 28, 2020 at 09:50:36PM +0200, Eric Auger wrote:
>> VFIO currently exposes the usable IOVA regions through the
>> VFIO_IOMMU_GET_INFO ioctl. However it fails to take into account
>> the dma_mask of the devices within the container. The top limit
>> currently is defined by the iommu aperture.
>
> Can we take a step back here? The dma_mask only has a meaning for
> the DMA API, and not the iommu API, it should have no relevance here.
>
> More importantly if we are using vfio no dma_mask should be set to
> start with.
You will find more context in my reply to Alex.
Thanks
Eric
>
>> + if (geo.aperture_end < ULLONG_MAX && geo.aperture_end != geo.aperture_start) {
>
> Please avoid pointlessly overlong lines.
>
On Tue, 29 Sep 2020 09:18:22 +0200
Auger Eric <[email protected]> wrote:
> Hi all,
>
> [also correcting some outdated email addresses + adding Lorenzo in cc]
>
> On 9/29/20 12:42 AM, Alex Williamson wrote:
> > On Mon, 28 Sep 2020 21:50:34 +0200
> > Eric Auger <[email protected]> wrote:
> >
> >> VFIO currently exposes the usable IOVA regions through the
> >> VFIO_IOMMU_GET_INFO ioctl / VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
> >> capability. However it fails to take into account the dma_mask
> >> of the devices within the container. The top limit currently is
> >> defined by the iommu aperture.
> >
> > I think that dma_mask is traditionally a DMA API interface for a device
> > driver to indicate to the DMA layer which mappings are accessible to the
> > device. On the other hand, vfio makes use of the IOMMU API where the
> > driver is in userspace. That userspace driver has full control of the
> > IOVA range of the device, therefore dma_mask is mostly irrelevant to
> > vfio. I think the issue you're trying to tackle is that the IORT code
> > is making use of the dma_mask to try to describe a DMA address
> > limitation imposed by the PCI root bus, living between the endpoint
> > device and the IOMMU. Therefore, if the IORT code is exposing a
> > topology or system imposed device limitation, this seems much more akin
> > to something like an MSI reserved range, where it's not necessarily the
> > device or the IOMMU with the limitation, but something that sits
> > between them.
>
> First I think I failed to explain the context. I worked on NVMe
> passthrough on ARM. The QEMU NVMe backend uses VFIO to program the
> physical device. The IOVA allocator there currently uses an IOVA range
> within [0x10000, 1ULL << 39]. This IOVA layout rather is arbitrary if I
> understand correctly.
39 bits is the minimum available on some VT-d systems, so it was
probably considered a reasonable minimum address width to consider.
> I noticed we rapidly get some VFIO MAP DMA
> failures because the allocated IOVA collide with the ARM MSI reserved
> IOVA window [0x8000000, 0x8100000]. Since 9b77e5c79840 ("vfio/type1:
> Check reserved region conflict and update iova list"), such VFIO MAP DMA
> attempts to map IOVAs belonging to host reserved IOVA windows fail. So,
> by using the VFIO_IOMMU_GET_INFO ioctl /
> VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE I can change the IOVA allocator to
> avoid allocating within this range and others. While working on this, I
> tried to automatically compute the min/max IOVAs and change the
> arbitrary [0x10000, 1ULL << 39]. My SMMUv2 supports up to 48b so
> naturally the max IOVA was computed as 1ULL << 48. The QEMU NVMe backend
> allocates at the bottom and at the top of the range. I noticed the use
> case was not working as soon as the top IOVA was more than 1ULL << 42.
> And then we noticed the dma_mask was set to 42 by using
> cat /sys/bus/pci/devices/0005:01:00.0/dma_mask_bits. So my
> interpretation is the dma_mask was somehow containing the info the
> device couldn't handle IOVAs beyond a certain limit.
I see that there are both OF and ACPI hooks in pci_dma_configure() and
both modify dev->dma_mask, which is what pci-sysfs is exposing here,
but I'm not convinced this even does what it's intended to do. The
driver core calls this via the bus->dma_configure callback before
probing a driver, but then what happens when the driver calls
pci_set_dma_mask()? This is just a wrapper for dma_set_mask() and I
don't see anywhere that would take into account the existing
dev->dma_mask. It seems for example that pci_dma_configure() could
produce a 42 bit mask as we have here, then the driver could override
that with anything that the dma_ops.dma_supported() callback finds
acceptable, and I don't see any instances where the current
dev->dma_mask is considered. Am I overlooking something?
> In my case the 42b limit is computed in iort_dma_setup() by
> acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
>
> Referring to the comment, it does "Evaluate DMA regions and return
> respectively DMA region start, offset and size in dma_addr, offset and
> size on parsing success". This parses the ACPI table, looking for ACPI
> companions with _DMA methods.
>
> But as Alex mentioned, the IORT also allows to define limits on "the
> number of address bits, starting from the least significant bit that can
> be generated by a device when it accesses memory". See Named component
> node.Device Memory Address Size limit or PCI root complex node. Memory
> address size limit.
>
> ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
> if (ret == -ENODEV)
> ret = dev_is_pci(dev) ? rc_dma_get_range(dev, &size)
> : nc_dma_get_range(dev, &size);
>
> So eventually those info collected from the ACPI tables which do impact
> the usable IOVA range seem to be stored in the dma_mask, hence that
> proposal.
As above, it's not clear to me that anyone other than the driver and
the dma_supported() callback on dma_ops have any input on the value of
dma_mask, so I'm a little baffled by the current operation.
> >> So, for instance, if the IOMMU supports up to 48bits, it may give
> >> the impression the max IOVA is 48b while a device may have a
> >> dma_mask of 42b. So this API cannot really be used to compute
> >> the max usable IOVA.
> >>
> >> This patch removes the IOVA region beyond the dma_mask's.
> >
> > Rather it adds a reserved region accounting for the range above the
> > device's dma_mask.
>
> Yep. It adds new reserved regions in
> /sys/kernel/iommu_groups/<n>/reserved_regions and remove those from the
> usable regions exposed by VFIO GET_INFO.
>
> I don't think the IOMMU API should be consuming
> > dma_mask like this though. For example, what happens in
> > pci_dma_configure() when there are no OF or ACPI DMA restrictions?
> My guess was that the dma_mask was set to the max range but I did not
> test it.
Still, we're making use of a driver configured property for the
purposes of using the DMA API and consuming it in the IOMMU API,
specifically to satisfy a userspace driver where the in-kernel meta-
driver can't make any assumptions about the device DMA mask. It's all
rather convoluted.
> It
> > appears to me that the dma_mask from whatever previous driver had the
> > device carries over to the new driver. That's generally ok for the DMA
> > API because a driver is required to set the device's DMA mask. It
> > doesn't make sense however to blindly consume that dma_mask and export
> > it via an IOMMU API. For example I would expect to see different
> > results depending on whether a host driver has been bound to a device.
> > It seems the correct IOMMU API approach would be for the IORT code to
> > specifically register reserved ranges for the device.
>
> Is it only specific to IORT table? acpi_dma_get_range() in
> drivers/acpi/scan.c is generic.
Yes, anything trying to implement similar restrictions. It appears to
me that platform code is stepping on a driver owned field used by
dma_ops already here. Maybe reserved regions should be consumed by
dma_ops to understand restrictions between the device and the IOMMU.
> >> As we start to expose this reserved region in the sysfs file
> >> /sys/kernel/iommu_groups/<n>/reserved_regions, we also need to
> >> handle the IOVA range beyond the IOMMU aperture to handle the case
> >> where the dma_mask would have a higher number of bits than the iommu
> >> max input address.
> >
> > Why? The IOMMU geometry already describes this and vfio combines both
> > the IOMMU geometry and the device reserved regions when generating the
> > IOVA ranges?
> Yes VFIO layer does add the info about the topology but
> /sys/kernel/iommu_groups/<n>/reserved_regions, generated by the IOMMU
> code, does not. this latter only exposes reserved regions. Assume the
> dma_mask is 48b and the IOMMU aperture is 42b (assuming it is possible),
> if you only take into account the "dma_mask" limitation, the end-user
> will interpret this as: I can use up to 48b.
What end user? The DMA API is allocating within the address space of
the IOMMU, so it will simply never encounter the issue. Within the
IOMMU API we can already query the geometry of the IOMMU to know its
width. It seems like reserved regions is trying to take on new
responsibilities here.
> Who is going to consume this information? Additionally
> > it appears that reserved regions will report different information
> > depending on whether a device is attached to a domain.
> yes that's correct. Well at some point we decided to expose (some)
> reserved regions through sysfs. Only printing a reduced set of those
> also can be misleading, hence my attempt to be more comprehensive.
Seems it further blurs the lines of what reserved regions is intended
to report. Is it trying to replace the IOMMU API geometry interface?
> >> This is a change to the ABI as this reserved region was not yet
> >> exposed in sysfs /sys/kernel/iommu_groups/<n>/reserved_regions or
> >> through the VFIO ioctl. At VFIO level we increment the version of
> >> the VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE capability to advertise
> >> that change.
> >
> > Is this really an ABI change? The original entry for reserved regions
> > includes:
> >
> > Not necessarily all reserved regions are listed. This is typically
> > used to output direct-mapped, MSI, non mappable regions.
>
> I agree. That's not really a change in the ABI but I wanted to make
> things clear about the induced changes for the end-user. On the other
> end there will be a change in the number of reported resv regions.
> >
> > I imagine the intention here was non-mappable relative to the IOMMU,
> > but non-mappable to the device is essentially what we're including
> > here.
> >
> > I'm also concerned about bumping the vfio interface version for the
> > IOVA range. We're not changing the interface, we're modifying the
> > result, and even then only for a fraction of users. How many users are
> > potentially broken by that change? Are we going to bump the version
> > for everyone any time the result changes on any platform? Thanks,
>
> The userspace needs to know if the GET_INFO is reliable to compute the
> min/max IOVAs. If we do not change the version, it cannot know and must
> assume it is not. Wasn't the version field meant for that somehow?
Our IOVA RANGE is correct to the best of our knowledge, but I don't
think we can or should rev the version every time we find something
incorrect. For instance, we wouldn't do the same for a firmware
induced range that was missing or incorrect. I think the version field
is for creating a new data structure to replace or expand the existing
one, not to simply say the returned data might be different. I
understand the userspace dilemma, but this doesn't seem like the right
solution. Thanks,
Alex
Hi Alex,
On 9/29/20 8:18 PM, Alex Williamson wrote:
> On Tue, 29 Sep 2020 09:18:22 +0200
> Auger Eric <[email protected]> wrote:
>
>> Hi all,
>>
>> [also correcting some outdated email addresses + adding Lorenzo in cc]
>>
>> On 9/29/20 12:42 AM, Alex Williamson wrote:
>>> On Mon, 28 Sep 2020 21:50:34 +0200
>>> Eric Auger <[email protected]> wrote:
>>>
>>>> VFIO currently exposes the usable IOVA regions through the
>>>> VFIO_IOMMU_GET_INFO ioctl / VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
>>>> capability. However it fails to take into account the dma_mask
>>>> of the devices within the container. The top limit currently is
>>>> defined by the iommu aperture.
>>>
>>> I think that dma_mask is traditionally a DMA API interface for a device
>>> driver to indicate to the DMA layer which mappings are accessible to the
>>> device. On the other hand, vfio makes use of the IOMMU API where the
>>> driver is in userspace. That userspace driver has full control of the
>>> IOVA range of the device, therefore dma_mask is mostly irrelevant to
>>> vfio. I think the issue you're trying to tackle is that the IORT code
>>> is making use of the dma_mask to try to describe a DMA address
>>> limitation imposed by the PCI root bus, living between the endpoint
>>> device and the IOMMU. Therefore, if the IORT code is exposing a
>>> topology or system imposed device limitation, this seems much more akin
>>> to something like an MSI reserved range, where it's not necessarily the
>>> device or the IOMMU with the limitation, but something that sits
>>> between them.
>>
>> First I think I failed to explain the context. I worked on NVMe
>> passthrough on ARM. The QEMU NVMe backend uses VFIO to program the
>> physical device. The IOVA allocator there currently uses an IOVA range
>> within [0x10000, 1ULL << 39]. This IOVA layout rather is arbitrary if I
>> understand correctly.
>
> 39 bits is the minimum available on some VT-d systems, so it was
> probably considered a reasonable minimum address width to consider.
OK
>
>> I noticed we rapidly get some VFIO MAP DMA
>> failures because the allocated IOVA collide with the ARM MSI reserved
>> IOVA window [0x8000000, 0x8100000]. Since 9b77e5c79840 ("vfio/type1:
>> Check reserved region conflict and update iova list"), such VFIO MAP DMA
>> attempts to map IOVAs belonging to host reserved IOVA windows fail. So,
>> by using the VFIO_IOMMU_GET_INFO ioctl /
>> VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE I can change the IOVA allocator to
>> avoid allocating within this range and others. While working on this, I
>> tried to automatically compute the min/max IOVAs and change the
>> arbitrary [0x10000, 1ULL << 39]. My SMMUv2 supports up to 48b so
>> naturally the max IOVA was computed as 1ULL << 48. The QEMU NVMe backend
>> allocates at the bottom and at the top of the range. I noticed the use
>> case was not working as soon as the top IOVA was more than 1ULL << 42.
>> And then we noticed the dma_mask was set to 42 by using
>> cat /sys/bus/pci/devices/0005:01:00.0/dma_mask_bits. So my
>> interpretation is the dma_mask was somehow containing the info the
>> device couldn't handle IOVAs beyond a certain limit.
>
> I see that there are both OF and ACPI hooks in pci_dma_configure() and
> both modify dev->dma_mask, which is what pci-sysfs is exposing here,
> but I'm not convinced this even does what it's intended to do. The
> driver core calls this via the bus->dma_configure callback before
> probing a driver, but then what happens when the driver calls
> pci_set_dma_mask()? This is just a wrapper for dma_set_mask() and I
> don't see anywhere that would take into account the existing
> dev->dma_mask. It seems for example that pci_dma_configure() could
> produce a 42 bit mask as we have here, then the driver could override
> that with anything that the dma_ops.dma_supported() callback finds
> acceptable, and I don't see any instances where the current
> dev->dma_mask is considered. Am I overlooking something?
I don't see it either. So the dma_mask set by the driver would never be
checked against the dma_mask limited found when parsing OF/ACPI?
>
>> In my case the 42b limit is computed in iort_dma_setup() by
>> acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
>>
>> Referring to the comment, it does "Evaluate DMA regions and return
>> respectively DMA region start, offset and size in dma_addr, offset and
>> size on parsing success". This parses the ACPI table, looking for ACPI
>> companions with _DMA methods.
>>
>> But as Alex mentioned, the IORT also allows to define limits on "the
>> number of address bits, starting from the least significant bit that can
>> be generated by a device when it accesses memory". See Named component
>> node.Device Memory Address Size limit or PCI root complex node. Memory
>> address size limit.
>>
>> ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
>> if (ret == -ENODEV)
>> ret = dev_is_pci(dev) ? rc_dma_get_range(dev, &size)
>> : nc_dma_get_range(dev, &size);
>>
>> So eventually those info collected from the ACPI tables which do impact
>> the usable IOVA range seem to be stored in the dma_mask, hence that
>> proposal.
>
> As above, it's not clear to me that anyone other than the driver and
> the dma_supported() callback on dma_ops have any input on the value of
> dma_mask, so I'm a little baffled by the current operation.
me too
>
>>>> So, for instance, if the IOMMU supports up to 48bits, it may give
>>>> the impression the max IOVA is 48b while a device may have a
>>>> dma_mask of 42b. So this API cannot really be used to compute
>>>> the max usable IOVA.
>>>>
>>>> This patch removes the IOVA region beyond the dma_mask's.
>>>
>>> Rather it adds a reserved region accounting for the range above the
>>> device's dma_mask.
>>
>> Yep. It adds new reserved regions in
>> /sys/kernel/iommu_groups/<n>/reserved_regions and remove those from the
>> usable regions exposed by VFIO GET_INFO.
>>
>> I don't think the IOMMU API should be consuming
>>> dma_mask like this though. For example, what happens in
>>> pci_dma_configure() when there are no OF or ACPI DMA restrictions?
>> My guess was that the dma_mask was set to the max range but I did not
>> test it.
>
> Still, we're making use of a driver configured property for the
> purposes of using the DMA API and consuming it in the IOMMU API,
> specifically to satisfy a userspace driver where the in-kernel meta-
> driver can't make any assumptions about the device DMA mask. It's all
> rather convoluted.
I can't object here. Still we are missing missing consolidated info
about max IOVA range. I interpreted the dma_mask as the info missing to
get it consolidated but now we are not even sure it does the job and
even I would hijack its original goal.
>> It
>>> appears to me that the dma_mask from whatever previous driver had the
>>> device carries over to the new driver. That's generally ok for the DMA
>>> API because a driver is required to set the device's DMA mask. It
>>> doesn't make sense however to blindly consume that dma_mask and export
>>> it via an IOMMU API. For example I would expect to see different
>>> results depending on whether a host driver has been bound to a device.
>>> It seems the correct IOMMU API approach would be for the IORT code to
>>> specifically register reserved ranges for the device.
>>
>> Is it only specific to IORT table? acpi_dma_get_range() in
>> drivers/acpi/scan.c is generic.
>
> Yes, anything trying to implement similar restrictions. It appears to
> me that platform code is stepping on a driver owned field used by
> dma_ops already here. Maybe reserved regions should be consumed by
> dma_ops to understand restrictions between the device and the IOMMU.
>
>>>> As we start to expose this reserved region in the sysfs file
>>>> /sys/kernel/iommu_groups/<n>/reserved_regions, we also need to
>>>> handle the IOVA range beyond the IOMMU aperture to handle the case
>>>> where the dma_mask would have a higher number of bits than the iommu
>>>> max input address.
>>>
>>> Why? The IOMMU geometry already describes this and vfio combines both
>>> the IOMMU geometry and the device reserved regions when generating the
>>> IOVA ranges?
>> Yes VFIO layer does add the info about the topology but
>> /sys/kernel/iommu_groups/<n>/reserved_regions, generated by the IOMMU
>> code, does not. this latter only exposes reserved regions. Assume the
>> dma_mask is 48b and the IOMMU aperture is 42b (assuming it is possible),
>> if you only take into account the "dma_mask" limitation, the end-user
>> will interpret this as: I can use up to 48b.
>
> What end user?
Anyone who reads /sys/kernel/iommu_groups/<n>/reserved_regions. A user
application could use that info. Anyway I think I indeed wants this file
to do more than it was originally meant for, enumerating *some* of the
reserved regions. That's a bit frustrating though (same for the VFIO ioctl).
The DMA API is allocating within the address space of
> the IOMMU, so it will simply never encounter the issue. Within the
> IOMMU API we can already query the geometry of the IOMMU to know its
> width. It seems like reserved regions is trying to take on new
> responsibilities here.
>
>> Who is going to consume this information? Additionally
>>> it appears that reserved regions will report different information
>>> depending on whether a device is attached to a domain.
>> yes that's correct. Well at some point we decided to expose (some)
>> reserved regions through sysfs. Only printing a reduced set of those
>> also can be misleading, hence my attempt to be more comprehensive.
>
> Seems it further blurs the lines of what reserved regions is intended
> to report. Is it trying to replace the IOMMU API geometry interface?
>
>>>> This is a change to the ABI as this reserved region was not yet
>>>> exposed in sysfs /sys/kernel/iommu_groups/<n>/reserved_regions or
>>>> through the VFIO ioctl. At VFIO level we increment the version of
>>>> the VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE capability to advertise
>>>> that change.
>>>
>>> Is this really an ABI change? The original entry for reserved regions
>>> includes:
>>>
>>> Not necessarily all reserved regions are listed. This is typically
>>> used to output direct-mapped, MSI, non mappable regions.
>>
>> I agree. That's not really a change in the ABI but I wanted to make
>> things clear about the induced changes for the end-user. On the other
>> end there will be a change in the number of reported resv regions.
>>>
>>> I imagine the intention here was non-mappable relative to the IOMMU,
>>> but non-mappable to the device is essentially what we're including
>>> here.
>>>
>>> I'm also concerned about bumping the vfio interface version for the
>>> IOVA range. We're not changing the interface, we're modifying the
>>> result, and even then only for a fraction of users. How many users are
>>> potentially broken by that change? Are we going to bump the version
>>> for everyone any time the result changes on any platform? Thanks,
>>
>> The userspace needs to know if the GET_INFO is reliable to compute the
>> min/max IOVAs. If we do not change the version, it cannot know and must
>> assume it is not. Wasn't the version field meant for that somehow?
>
> Our IOVA RANGE is correct to the best of our knowledge, but I don't
> think we can or should rev the version every time we find something
> incorrect. For instance, we wouldn't do the same for a firmware
> induced range that was missing or incorrect. I think the version field
> is for creating a new data structure to replace or expand the existing
> one, not to simply say the returned data might be different. I
> understand the userspace dilemma, but this doesn't seem like the right
> solution. Thanks,
understood
Thanks
Eric
>
> Alex
>
[+Christoph]
On Tue, Sep 29, 2020 at 12:18:49PM -0600, Alex Williamson wrote:
> On Tue, 29 Sep 2020 09:18:22 +0200
> Auger Eric <[email protected]> wrote:
>
> > Hi all,
> >
> > [also correcting some outdated email addresses + adding Lorenzo in cc]
> >
> > On 9/29/20 12:42 AM, Alex Williamson wrote:
> > > On Mon, 28 Sep 2020 21:50:34 +0200
> > > Eric Auger <[email protected]> wrote:
> > >
> > >> VFIO currently exposes the usable IOVA regions through the
> > >> VFIO_IOMMU_GET_INFO ioctl / VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
> > >> capability. However it fails to take into account the dma_mask
> > >> of the devices within the container. The top limit currently is
> > >> defined by the iommu aperture.
> > >
> > > I think that dma_mask is traditionally a DMA API interface for a device
> > > driver to indicate to the DMA layer which mappings are accessible to the
> > > device. On the other hand, vfio makes use of the IOMMU API where the
> > > driver is in userspace. That userspace driver has full control of the
> > > IOVA range of the device, therefore dma_mask is mostly irrelevant to
> > > vfio. I think the issue you're trying to tackle is that the IORT code
> > > is making use of the dma_mask to try to describe a DMA address
> > > limitation imposed by the PCI root bus, living between the endpoint
> > > device and the IOMMU. Therefore, if the IORT code is exposing a
> > > topology or system imposed device limitation, this seems much more akin
> > > to something like an MSI reserved range, where it's not necessarily the
> > > device or the IOMMU with the limitation, but something that sits
> > > between them.
> >
> > First I think I failed to explain the context. I worked on NVMe
> > passthrough on ARM. The QEMU NVMe backend uses VFIO to program the
> > physical device. The IOVA allocator there currently uses an IOVA range
> > within [0x10000, 1ULL << 39]. This IOVA layout rather is arbitrary if I
> > understand correctly.
>
> 39 bits is the minimum available on some VT-d systems, so it was
> probably considered a reasonable minimum address width to consider.
>
> > I noticed we rapidly get some VFIO MAP DMA
> > failures because the allocated IOVA collide with the ARM MSI reserved
> > IOVA window [0x8000000, 0x8100000]. Since 9b77e5c79840 ("vfio/type1:
> > Check reserved region conflict and update iova list"), such VFIO MAP DMA
> > attempts to map IOVAs belonging to host reserved IOVA windows fail. So,
> > by using the VFIO_IOMMU_GET_INFO ioctl /
> > VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE I can change the IOVA allocator to
> > avoid allocating within this range and others. While working on this, I
> > tried to automatically compute the min/max IOVAs and change the
> > arbitrary [0x10000, 1ULL << 39]. My SMMUv2 supports up to 48b so
> > naturally the max IOVA was computed as 1ULL << 48. The QEMU NVMe backend
> > allocates at the bottom and at the top of the range. I noticed the use
> > case was not working as soon as the top IOVA was more than 1ULL << 42.
> > And then we noticed the dma_mask was set to 42 by using
> > cat /sys/bus/pci/devices/0005:01:00.0/dma_mask_bits. So my
> > interpretation is the dma_mask was somehow containing the info the
> > device couldn't handle IOVAs beyond a certain limit.
>
> I see that there are both OF and ACPI hooks in pci_dma_configure() and
> both modify dev->dma_mask, which is what pci-sysfs is exposing here,
> but I'm not convinced this even does what it's intended to do. The
> driver core calls this via the bus->dma_configure callback before
> probing a driver, but then what happens when the driver calls
> pci_set_dma_mask()? This is just a wrapper for dma_set_mask() and I
> don't see anywhere that would take into account the existing
> dev->dma_mask. It seems for example that pci_dma_configure() could
> produce a 42 bit mask as we have here, then the driver could override
> that with anything that the dma_ops.dma_supported() callback finds
> acceptable, and I don't see any instances where the current
> dev->dma_mask is considered. Am I overlooking something?
I don't think so but Christoph and Robin can provide more input on
this - it is a long story.
ACPI and OF bindings set a default dma_mask (and dev->bus_dma_limit),
this does not prevent a driver from overriding the dev->dma_mask but DMA
mapping code still takes into account the dev->bus_dma_limit.
This may help:
git log -p 03bfdc31176c
Thanks,
Lorenzo
On Mon, Oct 05, 2020 at 11:44:10AM +0100, Lorenzo Pieralisi wrote:
> > I see that there are both OF and ACPI hooks in pci_dma_configure() and
> > both modify dev->dma_mask, which is what pci-sysfs is exposing here,
> > but I'm not convinced this even does what it's intended to do. The
> > driver core calls this via the bus->dma_configure callback before
> > probing a driver, but then what happens when the driver calls
> > pci_set_dma_mask()? This is just a wrapper for dma_set_mask() and I
> > don't see anywhere that would take into account the existing
> > dev->dma_mask. It seems for example that pci_dma_configure() could
> > produce a 42 bit mask as we have here, then the driver could override
> > that with anything that the dma_ops.dma_supported() callback finds
> > acceptable, and I don't see any instances where the current
> > dev->dma_mask is considered. Am I overlooking something?
>
> I don't think so but Christoph and Robin can provide more input on
> this - it is a long story.
>
> ACPI and OF bindings set a default dma_mask (and dev->bus_dma_limit),
> this does not prevent a driver from overriding the dev->dma_mask but DMA
> mapping code still takes into account the dev->bus_dma_limit.
>
> This may help:
>
> git log -p 03bfdc31176c
This is at best a historic artefact. Bus drivers have no business
messing with the DMA mask, dev->bus_dma_limit is the way to communicate
addressing limits on the bus (or another interconnect closer to the CPU).
Hi all,
On 10/5/20 3:08 PM, Christoph Hellwig wrote:
> On Mon, Oct 05, 2020 at 11:44:10AM +0100, Lorenzo Pieralisi wrote:
>>> I see that there are both OF and ACPI hooks in pci_dma_configure() and
>>> both modify dev->dma_mask, which is what pci-sysfs is exposing here,
>>> but I'm not convinced this even does what it's intended to do. The
>>> driver core calls this via the bus->dma_configure callback before
>>> probing a driver, but then what happens when the driver calls
>>> pci_set_dma_mask()? This is just a wrapper for dma_set_mask() and I
>>> don't see anywhere that would take into account the existing
>>> dev->dma_mask. It seems for example that pci_dma_configure() could
>>> produce a 42 bit mask as we have here, then the driver could override
>>> that with anything that the dma_ops.dma_supported() callback finds
>>> acceptable, and I don't see any instances where the current
>>> dev->dma_mask is considered. Am I overlooking something?
>>
>> I don't think so but Christoph and Robin can provide more input on
>> this - it is a long story.
>>
>> ACPI and OF bindings set a default dma_mask (and dev->bus_dma_limit),
>> this does not prevent a driver from overriding the dev->dma_mask but DMA
>> mapping code still takes into account the dev->bus_dma_limit.
>>
>> This may help:
>>
>> git log -p 03bfdc31176c
Thank you Lorenzo for the pointer.
>
> This is at best a historic artefact. Bus drivers have no business
> messing with the DMA mask, dev->bus_dma_limit is the way to communicate
> addressing limits on the bus (or another interconnect closer to the CPU).
>
Then could I envision to use the dev->bus_dma_limit instead of the
dev->dma_mask?
Nevertheless, I would need a way to let the userspace know that the
usable IOVA ranges reported by VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
takes into account the addressing limits of the bus.
Thanks
Eric