2021-09-19 12:22:23

by Yi Liu

[permalink] [raw]
Subject: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

After a device is bound to the iommufd, userspace can use this interface
to query the underlying IOMMU capability and format info for this device.
Based on this information the user then creates an I/O address space in a
format compatible with the to-be-attached devices.

The device cookie, which is registered at binding time, is used to
identify the device being queried here.

Signed-off-by: Liu Yi L <[email protected]>
---
drivers/iommu/iommufd/iommufd.c | 68 +++++++++++++++++++++++++++++++++
include/uapi/linux/iommu.h | 49 ++++++++++++++++++++++++
2 files changed, 117 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index e16ca21e4534..641f199f2d41 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode *inode, struct file *filep)
return 0;
}

+static struct device *
+iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64 dev_cookie)
+{
+	struct iommufd_device *idev;
+	struct device *dev = NULL;
+	unsigned long index;
+
+	mutex_lock(&ictx->lock);
+	xa_for_each(&ictx->device_xa, index, idev) {
+		if (idev->dev_cookie == dev_cookie) {
+			dev = idev->dev;
+			break;
+		}
+	}
+	mutex_unlock(&ictx->lock);
+
+	return dev;
+}
+
+static void iommu_device_build_info(struct device *dev,
+				    struct iommu_device_info *info)
+{
+	bool snoop;
+	u64 awidth, pgsizes;
+
+	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop))
+		info->flags |= snoop ? IOMMU_DEVICE_INFO_ENFORCE_SNOOP : 0;
+
+	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_PAGE_SIZE, &pgsizes)) {
+		info->pgsize_bitmap = pgsizes;
+		info->flags |= IOMMU_DEVICE_INFO_PGSIZES;
+	}
+
+	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &awidth)) {
+		info->addr_width = awidth;
+		info->flags |= IOMMU_DEVICE_INFO_ADDR_WIDTH;
+	}
+}
+
+static int iommufd_get_device_info(struct iommufd_ctx *ictx,
+				   unsigned long arg)
+{
+	struct iommu_device_info info;
+	unsigned long minsz;
+	struct device *dev;
+
+	minsz = offsetofend(struct iommu_device_info, addr_width);
+
+	if (copy_from_user(&info, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (info.argsz < minsz)
+		return -EINVAL;
+
+	info.flags = 0;
+
+	dev = iommu_find_device_from_cookie(ictx, info.dev_cookie);
+	if (!dev)
+		return -EINVAL;
+
+	iommu_device_build_info(dev, &info);
+
+	return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+}
+
static long iommufd_fops_unl_ioctl(struct file *filep,
				   unsigned int cmd, unsigned long arg)
{
@@ -127,6 +192,9 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
		return ret;

	switch (cmd) {
+	case IOMMU_DEVICE_GET_INFO:
+		ret = iommufd_get_device_info(ictx, arg);
+		break;
	default:
		pr_err_ratelimited("unsupported cmd %u\n", cmd);
		break;
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 59178fc229ca..76b71f9d6b34 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -7,6 +7,55 @@
#define _UAPI_IOMMU_H

#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/* -------- IOCTLs for IOMMU file descriptor (/dev/iommu) -------- */
+
+#define IOMMU_TYPE (';')
+#define IOMMU_BASE 100
+
+/*
+ * IOMMU_DEVICE_GET_INFO - _IOR(IOMMU_TYPE, IOMMU_BASE + 1,
+ * struct iommu_device_info)
+ *
+ * Check IOMMU capabilities and format information on a bound device.
+ *
+ * The device is identified by device cookie (registered when binding
+ * this device).
+ *
+ * @argsz: user filled size of this data.
+ * @flags: tells userspace which capability info is available
+ * @dev_cookie: user assigned cookie.
+ * @pgsize_bitmap: Bitmap of supported page sizes. 1-setting of the
+ * bit in pgsize_bitmap[63:12] indicates a supported
+ * page size. Details as below table:
+ *
+ * +===============+============+
+ * | Bit[index] | Page Size |
+ * +---------------+------------+
+ * | 12 | 4 KB |
+ * +---------------+------------+
+ * | 13 | 8 KB |
+ * +---------------+------------+
+ * | 14 | 16 KB |
+ * +---------------+------------+
+ * ...
+ * @addr_width: the address width of supported I/O address spaces.
+ *
+ * Availability: after device is bound to iommufd
+ */
+struct iommu_device_info {
+	__u32 argsz;
+	__u32 flags;
+#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU enforced snoop */
+#define IOMMU_DEVICE_INFO_PGSIZES	(1 << 1) /* supported page sizes */
+#define IOMMU_DEVICE_INFO_ADDR_WIDTH	(1 << 2) /* addr_width field valid */
+	__u64 dev_cookie;
+	__u64 pgsize_bitmap;
+	__u32 addr_width;
+};
+
+#define IOMMU_DEVICE_GET_INFO _IO(IOMMU_TYPE, IOMMU_BASE + 1)

#define IOMMU_FAULT_PERM_READ (1 << 0) /* read */
#define IOMMU_FAULT_PERM_WRITE (1 << 1) /* write */
--
2.25.1


2021-09-21 17:43:31

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Sun, Sep 19, 2021 at 02:38:38PM +0800, Liu Yi L wrote:
> After a device is bound to the iommufd, userspace can use this interface
> to query the underlying iommu capability and format info for this device.
> Based on this information the user then creates I/O address space in a
> compatible format with the to-be-attached devices.
>
> Device cookie which is registered at binding time is used to mark the
> device which is being queried here.
>
> Signed-off-by: Liu Yi L <[email protected]>
> drivers/iommu/iommufd/iommufd.c | 68 +++++++++++++++++++++++++++++++++
> include/uapi/linux/iommu.h | 49 ++++++++++++++++++++++++
> 2 files changed, 117 insertions(+)
>
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index e16ca21e4534..641f199f2d41 100644
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode *inode, struct file *filep)
> return 0;
> }
>
> +static struct device *
> +iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64 dev_cookie)
> +{

We have an xarray ID for the device, why are we allowing userspace to
use the dev_cookie as input?

Userspace should always pass in the ID. The only place dev_cookie
should appear is if the kernel generates an event back to
userspace. Then the kernel should return both the ID and the
dev_cookie in the event to allow userspace to correlate it.

> +static void iommu_device_build_info(struct device *dev,
> + struct iommu_device_info *info)
> +{
> + bool snoop;
> + u64 awidth, pgsizes;
> +
> + if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop))
> + info->flags |= snoop ? IOMMU_DEVICE_INFO_ENFORCE_SNOOP : 0;
> +
> + if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_PAGE_SIZE, &pgsizes)) {
> + info->pgsize_bitmap = pgsizes;
> + info->flags |= IOMMU_DEVICE_INFO_PGSIZES;
> + }
> +
> + if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &awidth)) {
> + info->addr_width = awidth;
> + info->flags |= IOMMU_DEVICE_INFO_ADDR_WIDTH;
> + }

Another good option is to push the iommu_device_info uAPI struct down
through to the iommu driver to fill it in and forget about the crazy
enum.

A big part of thinking of this iommu interface is a way to bind the HW
IOMMU driver to a uAPI and allow the HW driver to expose its unique
functionalities.

> +static int iommufd_get_device_info(struct iommufd_ctx *ictx,
> + unsigned long arg)
> +{
> + struct iommu_device_info info;
> + unsigned long minsz;
> + struct device *dev;
> +
> + minsz = offsetofend(struct iommu_device_info, addr_width);
> +
> + if (copy_from_user(&info, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (info.argsz < minsz)
> + return -EINVAL;

All of these patterns everywhere are wrongly coded for forward/back
compatibility.

static int iommufd_get_device_info(struct iommufd_ctx *ictx,
				   struct iommu_device_info __user *arg,
				   size_t usize)
{
	struct iommu_device_info info;
	int ret;

	if (usize < offsetofend(struct iommu_device_info, addr_flags))
		return -EINVAL;

	ret = copy_struct_from_user(&info, sizeof(info), arg, usize);
	if (ret)
		return ret;
'usize' should be in a 'common' header extracted by the main ioctl handler.

> +struct iommu_device_info {
> + __u32 argsz;
> + __u32 flags;
> +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP (1 << 0) /* IOMMU enforced snoop */
> +#define IOMMU_DEVICE_INFO_PGSIZES (1 << 1) /* supported page sizes */
> +#define IOMMU_DEVICE_INFO_ADDR_WIDTH (1 << 2) /* addr_width field valid */
> + __u64 dev_cookie;
> + __u64 pgsize_bitmap;
> + __u32 addr_width;
> +};

Be explicit with padding here too.

Jason

2021-09-22 03:34:35

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, September 22, 2021 1:41 AM
>
> On Sun, Sep 19, 2021 at 02:38:38PM +0800, Liu Yi L wrote:
> > After a device is bound to the iommufd, userspace can use this interface
> > to query the underlying iommu capability and format info for this device.
> > Based on this information the user then creates I/O address space in a
> > compatible format with the to-be-attached devices.
> >
> > Device cookie which is registered at binding time is used to mark the
> > device which is being queried here.
> >
> > Signed-off-by: Liu Yi L <[email protected]>
> > drivers/iommu/iommufd/iommufd.c | 68
> +++++++++++++++++++++++++++++++++
> > include/uapi/linux/iommu.h | 49 ++++++++++++++++++++++++
> > 2 files changed, 117 insertions(+)
> >
> > diff --git a/drivers/iommu/iommufd/iommufd.c
> b/drivers/iommu/iommufd/iommufd.c
> > index e16ca21e4534..641f199f2d41 100644
> > +++ b/drivers/iommu/iommufd/iommufd.c
> > @@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode
> *inode, struct file *filep)
> > return 0;
> > }
> >
> > +static struct device *
> > +iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64
> dev_cookie)
> > +{
>
> We have an xarray ID for the device, why are we allowing userspace to
> use the dev_cookie as input?
>
> Userspace should always pass in the ID. The only place dev_cookie
> should appear is if the kernel generates an event back to
> userspace. Then the kernel should return both the ID and the
> dev_cookie in the event to allow userspace to correlate it.
>

A little background.

In an earlier design proposal we discussed two options. One was to return
a kernel-allocated ID (label) to userspace. The other was to have the user
register a cookie and use it in the iommufd uAPI. At that time the two
options were discussed as mutually exclusive, and the cookie one was preferred.

Now you instead recommend a mixed option. We can follow it for
sure if nobody objects.

Thanks
Kevin

2021-09-22 12:47:05

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Wed, Sep 22, 2021 at 03:30:09AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, September 22, 2021 1:41 AM
> >
> > On Sun, Sep 19, 2021 at 02:38:38PM +0800, Liu Yi L wrote:
> > > After a device is bound to the iommufd, userspace can use this interface
> > > to query the underlying iommu capability and format info for this device.
> > > Based on this information the user then creates I/O address space in a
> > > compatible format with the to-be-attached devices.
> > >
> > > Device cookie which is registered at binding time is used to mark the
> > > device which is being queried here.
> > >
> > > Signed-off-by: Liu Yi L <[email protected]>
> > > drivers/iommu/iommufd/iommufd.c | 68
> > +++++++++++++++++++++++++++++++++
> > > include/uapi/linux/iommu.h | 49 ++++++++++++++++++++++++
> > > 2 files changed, 117 insertions(+)
> > >
> > > diff --git a/drivers/iommu/iommufd/iommufd.c
> > b/drivers/iommu/iommufd/iommufd.c
> > > index e16ca21e4534..641f199f2d41 100644
> > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > @@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode
> > *inode, struct file *filep)
> > > return 0;
> > > }
> > >
> > > +static struct device *
> > > +iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64
> > dev_cookie)
> > > +{
> >
> > We have an xarray ID for the device, why are we allowing userspace to
> > use the dev_cookie as input?
> >
> > Userspace should always pass in the ID. The only place dev_cookie
> > should appear is if the kernel generates an event back to
> > userspace. Then the kernel should return both the ID and the
> > dev_cookie in the event to allow userspace to correlate it.
> >
>
> A little background.
>
> In earlier design proposal we discussed two options. One is to return
> an kernel-allocated ID (label) to userspace. The other is to have user
> register a cookie and use it in iommufd uAPI. At that time the two
> options were discussed exclusively and the cookie one is preferred.
>
> Now you instead recommended a mixed option. We can follow it for
> sure if nobody objects.

Either one for the return is fine; I'd return both just because it is
more flexible.

But the cookie should never be an input from userspace, and the kernel
should never search for it. Locating the kernel object is what the ID
and xarray is for.

Jason

2021-09-22 21:28:01

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Sun, 19 Sep 2021 14:38:38 +0800
Liu Yi L <[email protected]> wrote:

> +struct iommu_device_info {
> + __u32 argsz;
> + __u32 flags;
> +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP (1 << 0) /* IOMMU enforced snoop */

Is this too PCI specific, or perhaps too much of the mechanism rather
than the result? ie. should we just indicate if the IOMMU guarantees
coherent DMA? Thanks,

Alex

> +#define IOMMU_DEVICE_INFO_PGSIZES (1 << 1) /* supported page sizes */
> +#define IOMMU_DEVICE_INFO_ADDR_WIDTH (1 << 2) /* addr_width field valid */
> + __u64 dev_cookie;
> + __u64 pgsize_bitmap;
> + __u32 addr_width;
> +};
> +
> +#define IOMMU_DEVICE_GET_INFO _IO(IOMMU_TYPE, IOMMU_BASE + 1)
>
> #define IOMMU_FAULT_PERM_READ (1 << 0) /* read */
> #define IOMMU_FAULT_PERM_WRITE (1 << 1) /* write */

2021-09-22 23:51:47

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Wed, Sep 22, 2021 at 03:24:07PM -0600, Alex Williamson wrote:
> On Sun, 19 Sep 2021 14:38:38 +0800
> Liu Yi L <[email protected]> wrote:
>
> > +struct iommu_device_info {
> > + __u32 argsz;
> > + __u32 flags;
> > +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP (1 << 0) /* IOMMU enforced snoop */
>
> Is this too PCI specific, or perhaps too much of the mechanism rather
> than the result? ie. should we just indicate if the IOMMU guarantees
> coherent DMA? Thanks,

I think the name of "coherent DMA" for this feature inside the kernel
is very, very confusing. We already have something called coherent dma
and this usage on Intel has nothing at all to do with that.

In fact it looks like this confusing name has already caused
implementation problems, as I see dma-iommu is connecting
dev->dma_coherent to IOMMU_CACHE! eg in dma_info_to_prot(). This is
completely wrong if IOMMU_CACHE is linked to no_snoop.

And ARM seems to have fallen out of step with x86 as the ARM IOMMU
drivers are mapping IOMMU_CACHE to ARM_LPAE_PTE_MEMATTR_OIWB,
ARM_LPAE_MAIR_ATTR_IDX_CACHE

The SMMU spec for ARMv8 is pretty clear:

13.6.1.1 No_snoop

Support for No_snoop is system-dependent and, if implemented, No_snoop
transforms a final access attribute of a Normal cacheable type to
Normal-iNC-oNC-OSH downstream of (or appearing to be performed
downstream of) the SMMU. No_snoop does not transform a final access
attribute of any-Device.

Meaning setting ARM_LPAE_MAIR_ATTR_IDX_CACHE from IOMMU_CACHE does NOT
block non-snoop, in fact it *enables* it - the reverse of what Intel
is doing!

So this is all a mess.

Better to start clear and unambiguous names in the uAPI and someone
can try to clean up the kernel eventually.

The required behavior for iommufd is to have the IOMMU ignore the
no-snoop bit so that Intel HW can disable wbinvd. This bit should be
clearly documented for its exact purpose and if other arches also have
instructions that need to be disabled if snoop TLPs are allowed then
they can re-use this bit. It appears ARM does not have this issue and
does not need the bit.

What ARM is doing with IOMMU_CACHE is unclear to me, and I'm unclear
if/how iommufd should expose it as a controllable PTE flag. The ARM

Jason

2021-09-23 03:13:07

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, September 23, 2021 7:50 AM
>
> On Wed, Sep 22, 2021 at 03:24:07PM -0600, Alex Williamson wrote:
> > On Sun, 19 Sep 2021 14:38:38 +0800
> > Liu Yi L <[email protected]> wrote:
> >
> > > +struct iommu_device_info {
> > > + __u32 argsz;
> > > + __u32 flags;
> > > +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP (1 << 0) /* IOMMU
> enforced snoop */
> >
> > Is this too PCI specific, or perhaps too much of the mechanism rather

Isn't snoop vs. !snoop a general concept, not something PCI-specific?

> > than the result? ie. should we just indicate if the IOMMU guarantees
> > coherent DMA? Thanks,
>
> I think the name of "coherent DMA" for this feature inside the kernel
> is very, very confusing. We already have something called coherent dma
> and this usage on Intel has nothing at all to do with that.
>
> In fact it looks like this confusing name has already caused
> implementation problems as I see dma-iommu, is connecting
> dev->dma_coherent to IOMMU_CACHE! eg in dma_info_to_prot(). This is
> completely wrong if IOMMU_CACHE is linked to no_snoop.
>
> And ARM seems to have fallen out of step with x86 as the ARM IOMMU
> drivers are mapping IOMMU_CACHE to ARM_LPAE_PTE_MEMATTR_OIWB,
> ARM_LPAE_MAIR_ATTR_IDX_CACHE
>
> The SMMU spec for ARMv8 is pretty clear:
>
> 13.6.1.1 No_snoop
>
> Support for No_snoop is system-dependent and, if implemented, No_snoop
> transforms a final access attribute of a Normal cacheable type to
> Normal-iNC-oNC-OSH downstream of (or appearing to be performed
> downstream of) the SMMU. No_snoop does not transform a final access
> attribute of any-Device.
>
> Meaning setting ARM_LPAE_MAIR_ATTR_IDX_CACHE from IOMMU_CACHE
> does NOT
> block non-snoop, in fact it *enables* it - the reverse of what Intel
> is doing!

Checking the code:

	if (data->iop.fmt == ARM_64_LPAE_S2 ||
	    data->iop.fmt == ARM_32_LPAE_S2) {
		if (prot & IOMMU_MMIO)
			pte |= ARM_LPAE_PTE_MEMATTR_DEV;
		else if (prot & IOMMU_CACHE)
			pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
		else
			pte |= ARM_LPAE_PTE_MEMATTR_NC;

It does set attribute to WB for IOMMU_CACHE and then NC (Non-cacheable)
for !IOMMU_CACHE. The main difference between Intel and ARM is that Intel
by default allows both snoop and non-snoop traffic with one additional bit
to enforce snoop, while ARM requires explicit SMMU configuration for snoop
and non-snoop respectively.

	} else {
		if (prot & IOMMU_MMIO)
			pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
		else if (prot & IOMMU_CACHE)
			pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
	}

same for this one. The MAIR_ELx register is programmed to
ARM_LPAE_MAIR_ATTR_WBRWA for the IDX_CACHE index. I'm not sure why it
doesn't use IDX_NC though, when !IOMMU_CACHE.

>
> So this is all a mess.
>
> Better to start clear and unambiguous names in the uAPI and someone
> can try to clean up the kernel eventually.
>
> The required behavior for iommufd is to have the IOMMU ignore the
> no-snoop bit so that Intel HW can disable wbinvd. This bit should be
> clearly documented for its exact purpose and if other arches also have
> instructions that need to be disabled if snoop TLPs are allowed then
> they can re-use this bit. It appears ARM does not have this issue and
> does not need the bit.

Disabling wbinvd is one purpose. imo the more important intention
is that an iommu vendor may use different PTE formats between snoop and
!snoop. As long as we want to allow userspace to opt in for isoch
performance requirements (unlike current vfio, which always chooses the
snoop format if available), such a mechanism is required for all vendors.

When creating an ioas there could be three snoop modes:

1) snoop for all attached devices;
2) non-snoop for all attached devices;
3) device-selected snoop;

Intel supports 1) <enforce-snoop on> and 3) <enforce-snoop off>. snoop
and nonsnoop devices can be attached to a same ioas in 3).

ARM supports 1) <snoop format> and 2) <nonsnoop format>. snoop devices
and nonsnoop devices must be attached to different ioas's in 1) and 2)
respectively.

Then the device info should report:

/* iommu enforced snoop */
+#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP (1 << 0)
/* iommu enforced nonsnoop */
+#define IOMMU_DEVICE_INFO_ENFORCE_NONSNOOP (1 << 1)
/* device selected snoop */
+#define IOMMU_DEVICE_INFO_DEVICE_SNOOP (1 << 2)

>
> What ARM is doing with IOMMU_CACHE is unclear to me, and I'm unclear
> if/how iommufd should expose it as a controllable PTE flag. The ARM
>

Based on above analysis I think the ARM usage with IOMMU_CACHE
doesn't change.

Thanks
Kevin

2021-09-23 03:40:14

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: Tian, Kevin
> Sent: Thursday, September 23, 2021 11:11 AM
>
> >
> > The required behavior for iommufd is to have the IOMMU ignore the
> > no-snoop bit so that Intel HW can disable wbinvd. This bit should be
> > clearly documented for its exact purpose and if other arches also have
> > instructions that need to be disabled if snoop TLPs are allowed then
> > they can re-use this bit. It appears ARM does not have this issue and
> > does not need the bit.
>
> Disabling wbinvd is one purpose. imo the more important intention
> is that iommu vendor uses different PTE formats between snoop and
> !snoop. As long as we want allow userspace to opt in case of isoch
> performance requirement (unlike current vfio which always choose
> snoop format if available), such mechanism is required for all vendors.
>

btw I'm not sure whether the wbinvd trick is Intel specific. All other
platforms (amd, arm, s390, etc.) currently always claim
IOMMU_CAP_CACHE_COHERENCY (the source of IOMMU_CACHE). They didn't hit
this problem because vfio always sets IOMMU_CACHE to force every
DMA to snoop. Will they need to handle a similar wbinvd-like trick (plus
necessary memory type virtualization) when a non-snoop format is enabled?
Or are their architectures highly optimized to afford isoch traffic even
with snoop (then fine to not support user opt-in)?

Thanks
Kevin

2021-09-23 10:18:41

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Sep 23, 2021 at 03:10:47AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Thursday, September 23, 2021 7:50 AM
> >
> > On Wed, Sep 22, 2021 at 03:24:07PM -0600, Alex Williamson wrote:
> > > On Sun, 19 Sep 2021 14:38:38 +0800
> > > Liu Yi L <[email protected]> wrote:
> > >
> > > > +struct iommu_device_info {
> > > > + __u32 argsz;
> > > > + __u32 flags;
> > > > +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP (1 << 0) /* IOMMU
> > enforced snoop */
> > >
> > > Is this too PCI specific, or perhaps too much of the mechanism rather
>
> Isn't snoop vs. !snoop a general concept not pci specific?
>
> > > than the result? ie. should we just indicate if the IOMMU guarantees
> > > coherent DMA? Thanks,
> >
> > I think the name of "coherent DMA" for this feature inside the kernel
> > is very, very confusing. We already have something called coherent dma
> > and this usage on Intel has nothing at all to do with that.
> >
> > In fact it looks like this confusing name has already caused
> > implementation problems as I see dma-iommu, is connecting
> > dev->dma_coherent to IOMMU_CACHE! eg in dma_info_to_prot(). This is
> > completely wrong if IOMMU_CACHE is linked to no_snoop.
> >
> > And ARM seems to have fallen out of step with x86 as the ARM IOMMU
> > drivers are mapping IOMMU_CACHE to ARM_LPAE_PTE_MEMATTR_OIWB,
> > ARM_LPAE_MAIR_ATTR_IDX_CACHE
> >
> > The SMMU spec for ARMv8 is pretty clear:
> >
> > 13.6.1.1 No_snoop
> >
> > Support for No_snoop is system-dependent and, if implemented, No_snoop
> > transforms a final access attribute of a Normal cacheable type to
> > Normal-iNC-oNC-OSH downstream of (or appearing to be performed
> > downstream of) the SMMU. No_snoop does not transform a final access
> > attribute of any-Device.
> >
> > Meaning setting ARM_LPAE_MAIR_ATTR_IDX_CACHE from IOMMU_CACHE
> > does NOT
> > block non-snoop, in fact it *enables* it - the reverse of what Intel
> > is doing!
>
> Checking the code:
>
> if (data->iop.fmt == ARM_64_LPAE_S2 ||
> data->iop.fmt == ARM_32_LPAE_S2) {
> if (prot & IOMMU_MMIO)
> pte |= ARM_LPAE_PTE_MEMATTR_DEV;
> else if (prot & IOMMU_CACHE)
> pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> else
> pte |= ARM_LPAE_PTE_MEMATTR_NC;
>
> It does set attribute to WB for IOMMU_CACHE and then NC (Non-cacheable)
> for !IOMMU_CACHE. The main difference between Intel and ARM is that Intel
> by default allows both snoop and non-snoop traffic with one additional bit
> to enforce snoop, while ARM requires explicit SMMU configuration for snoop
> and non-snoop respectively.
>
> } else {
> if (prot & IOMMU_MMIO)
> pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
> << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> else if (prot & IOMMU_CACHE)
> pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
> << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> }
>
> same for this one. MAIR_ELx register is programmed to ARM_LPAE_MAIR_
> ATTR_WBRWA for IDX_CACHE bit. I'm not sure why it doesn't use
> IDX_NC though, when !IOMMU_CACHE.

It is in effect since IDX_NC == 0

>
> >
> > So this is all a mess.
> >
> > Better to start clear and unambiguous names in the uAPI and someone
> > can try to clean up the kernel eventually.
> >
> > The required behavior for iommufd is to have the IOMMU ignore the
> > no-snoop bit so that Intel HW can disable wbinvd. This bit should be
> > clearly documented for its exact purpose and if other arches also have
> > instructions that need to be disabled if snoop TLPs are allowed then
> > they can re-use this bit. It appears ARM does not have this issue and
> > does not need the bit.
>
> Disabling wbinvd is one purpose. imo the more important intention
> is that iommu vendor uses different PTE formats between snoop and
> !snoop. As long as we want allow userspace to opt in case of isoch
> performance requirement (unlike current vfio which always choose
> snoop format if available), such mechanism is required for all vendors.
>
> When creating an ioas there could be three snoop modes:
>
> 1) snoop for all attached devices;
> 2) non-snoop for all attached devices;
> 3) device-selected snoop;
>
> Intel supports 1) <enforce-snoop on> and 3) <enforce-snoop off>. snoop
> and nonsnoop devices can be attached to a same ioas in 3).
>
> ARM supports 1) <snoop format> and 2) <nonsnoop format>. snoop devices
> and nonsnoop devices must be attached to different ioas's in 1) and 2)
> respectively.

I think Arm mainly supports 3), ie. No_snoop PCI transactions on pages
mapped cacheable become non-cacheable memory accesses.

But the Arm Base System Architecture 1.0
(https://developer.arm.com/documentation/den0094/a) states that it's
implementation dependent whether the system supports No_snoop.

In the case where the system has a System MMU translating and
attributing the transactions from the root complex, the PCI Express
transactions must keep the memory attributes assigned by the System
MMU. If the System MMU-assigned attribute is cacheable then it is
IMPLEMENTATION DEFINED if No_snoop transactions replace the attribute
with non-cached.

So we can only tell userspace "No_snoop is not supported" (provided we
even want to allow them to enable No_snoop). Users in control of stage-1
tables can create non-cacheable mappings through MAIR attributes.

Thanks,
Jean

>
> Then the device info should reports:
>
> /* iommu enforced snoop */
> +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP (1 << 0)
> /* iommu enforced nonsnoop */
> +#define IOMMU_DEVICE_INFO_ENFORCE_NONSNOOP (1 << 1)
> /* device selected snoop */
> +#define IOMMU_DEVICE_INFO_DEVICE_SNOOP (1 << 2)
>
> >
> > What ARM is doing with IOMMU_CACHE is unclear to me, and I'm unclear
> > if/how iommufd should expose it as a controllable PTE flag. The ARM
> >
>
> Based on above analysis I think the ARM usage with IOMMU_CACHE
> doesn't change.
>
> Thanks
> Kevin

2021-09-23 11:29:10

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Sep 23, 2021 at 11:15:24AM +0100, Jean-Philippe Brucker wrote:

> So we can only tell userspace "No_snoop is not supported" (provided we
> even want to allow them to enable No_snoop). Users in control of stage-1
> tables can create non-cacheable mappings through MAIR attributes.

My point is that ARM is using IOMMU_CACHE to control the overall
cachability of the DMA

ie not specifying IOMMU_CACHE requires using the arch specific DMA
cache flushers.

Intel never uses arch specific DMA cache flushers, and instead is
abusing IOMMU_CACHE to mean IOMMU_BLOCK_NO_SNOOP on DMA that is always
cacheable.

These are different things and need different bits. Since the ARM path
has a lot more code supporting it, I'd suggest Intel should change
their code to use IOMMU_BLOCK_NO_SNOOP and abandon IOMMU_CACHE.

Which clarifies what to do here as uAPI - these things need to have
different bits and Intel's should still have NO SNOOP in the
name. What the no-snoop bit is called on other busses can be clarified
in comments if that case ever arises.

Jason

2021-09-23 11:38:04

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Sep 23, 2021 at 03:10:47AM +0000, Tian, Kevin wrote:

> Disabling wbinvd is one purpose. imo the more important intention
> is that iommu vendor uses different PTE formats between snoop and
> !snoop.

The PTE format for userspace is communicated through the format input,
not through random flags. If Intel has two different PTE formats then
userspace must negotiate which to use via the format input.

If the kernel controls the PTE then the format doesn't matter and the
kernel should configure things to match the requested behavior

> When creating an ioas there could be three snoop modes:
>
> 1) snoop for all attached devices;
> 2) non-snoop for all attached devices;
> 3) device-selected snoop;

I'd express the three cases like this:

0
	ARM can avoid cache snooping, must use arch cache flush helpers
IOMMU_CACHE
	Normal DMAs get cache coherence, do not need arch cache flush helpers
IOMMU_CACHE | IOMMU_BLOCK_NO_SNOOP
	All DMAs get cache coherence, not supported on ARM

Jason

2021-09-23 11:43:44

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Sep 23, 2021 at 03:38:10AM +0000, Tian, Kevin wrote:
> > From: Tian, Kevin
> > Sent: Thursday, September 23, 2021 11:11 AM
> >
> > >
> > > The required behavior for iommufd is to have the IOMMU ignore the
> > > no-snoop bit so that Intel HW can disable wbinvd. This bit should be
> > > clearly documented for its exact purpose and if other arches also have
> > > instructions that need to be disabled if snoop TLPs are allowed then
> > > they can re-use this bit. It appears ARM does not have this issue and
> > > does not need the bit.
> >
> > Disabling wbinvd is one purpose. imo the more important intention
> > is that iommu vendor uses different PTE formats between snoop and
> > !snoop. As long as we want allow userspace to opt in case of isoch
> > performance requirement (unlike current vfio which always choose
> > snoop format if available), such mechanism is required for all vendors.
> >
>
> btw I'm not sure whether the wbinvd trick is Intel specific. All other
> platforms (amd, arm, s390, etc.) currently always claim
> IOMMU_CAP_CACHE_COHERENCY (the source of IOMMU_CACHE).

This only means they don't need to use the arch cache flush
helpers. It has nothing to do with no-snoop on those platforms.

> They didn't hit this problem because vfio always sets IOMMU_CACHE to
> force every DMA to snoop. Will they need to handle similar
> wbinvd-like trick (plus necessary memory type virtualization) when
> non-snoop format is enabled? Or are their architectures highly
> optimized to afford isoch traffic even with snoop (then fine to not
> support user opt-in)?

In other arches the question is:
 - Do they allow non-coherent DMA to exist in a VM?
 - Can the VM issue cache maintenance ops to fix the decoherence?

The Intel functional issue is that Intel blocks the cache maintenance
ops from the VM and the VM has no way to self-discover that the cache
maintenance ops don't work.

Other arches don't seem to have this specific problem...

The other warped part of this is that Linux doesn't actually support
no-snoop DMA through the DMA API. The users in the Intel GPU drivers are
all hacking it up, so it may well be that on other arches Linux never
asks devices to issue no-snoop DMA, because there is no portable way
for the driver to restore coherence on a DMA-by-DMA basis.

Jason

2021-09-23 12:07:55

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, September 23, 2021 7:27 PM
>
> On Thu, Sep 23, 2021 at 11:15:24AM +0100, Jean-Philippe Brucker wrote:
>
> > So we can only tell userspace "No_snoop is not supported" (provided we
> > even want to allow them to enable No_snoop). Users in control of stage-1
> > tables can create non-cacheable mappings through MAIR attributes.
>
> My point is that ARM is using IOMMU_CACHE to control the overall
> cachability of the DMA
>
> ie not specifying IOMMU_CACHE requires using the arch specific DMA
> cache flushers.
>
> Intel never uses arch specific DMA cache flushers, and instead is
> abusing IOMMU_CACHE to mean IOMMU_BLOCK_NO_SNOOP on DMA that
> is always cachable.

it uses IOMMU_CACHE to force all DMAs to snoop, including those which
have the non-snoop flag and wouldn't snoop the cache if the iommu were
disabled. Nothing is blocked.

but why do you call it abuse? IOMMU_CACHE was first introduced for the
Intel platform:

commit 9cf0669746be19a4906a6c48920060bcf54c708b
Author: Sheng Yang <[email protected]>
Date: Wed Mar 18 15:33:07 2009 +0800

intel-iommu: VT-d page table to support snooping control bit

The user can request to enable snooping control through VT-d page table.

Signed-off-by: Sheng Yang <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>

>
> These are different things and need different bits. Since the ARM path
> has a lot more code supporting it, I'd suggest Intel should change
> their code to use IOMMU_BLOCK_NO_SNOOP and abandon IOMMU_CACHE.

I didn't fully get this point. The end result is the same, i.e. making
the DMA cache-coherent when IOMMU_CACHE is set. Or, if you were to
define the behavior of IOMMU_CACHE, how would you define it now?

>
> Which clarifies what to do here as uAPI - these things need to have
> different bits and Intel's should still have NO SNOOP in the
> name. What the no-snoop bit is called on other busses can be clarified
> in comments if that case ever arises.
>
> Jason

2021-09-23 12:24:30

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Sep 23, 2021 at 12:05:29PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Thursday, September 23, 2021 7:27 PM
> >
> > On Thu, Sep 23, 2021 at 11:15:24AM +0100, Jean-Philippe Brucker wrote:
> >
> > > So we can only tell userspace "No_snoop is not supported" (provided we
> > > even want to allow them to enable No_snoop). Users in control of stage-1
> > > tables can create non-cacheable mappings through MAIR attributes.
> >
> > My point is that ARM is using IOMMU_CACHE to control the overall
> > cachability of the DMA
> >
> > ie not specifying IOMMU_CACHE requires using the arch specific DMA
> > cache flushers.
> >
> > Intel never uses arch specifc DMA cache flushers, and instead is
> > abusing IOMMU_CACHE to mean IOMMU_BLOCK_NO_SNOOP on DMA that
> > is always
> > cachable.
>
> it uses IOMMU_CACHE to force all DMAs to snoop, including those which
> has non_snoop flag and wouldn't snoop cache if iommu is disabled. Nothing
> is blocked.

I see it differently, on Intel the only way to bypass the cache with
DMA is to specify the no-snoop bit in the TLP. The IOMMU PTE flag we
are talking about tells the IOMMU to ignore the no snoop bit.

Again, Intel arch in the kernel does not support the DMA cache flush
arch API and *DOES NOT* support incoherent DMA at all.

ARM *does* implement the DMA cache flush arch API and is using
IOMMU_CACHE to control if the caller will, or will not call the cache
flushes.

This is fundamentally different from what Intel is using it for.

> but why do you call it abuse? IOMMU_CACHE was first introduced for
> Intel platform:

IMHO ARM changed the meaning when Robin linked IOMMU_CACHE to
dma_is_coherent stuff. At that point it became linked to 'do I need to
call arch cache flushers or not'.

> > These are different things and need different bits. Since the ARM path
> > has a lot more code supporting it, I'd suggest Intel should change
> > their code to use IOMMU_BLOCK_NO_SNOOP and abandon IOMMU_CACHE.
>
> I didn't fully get this point. The end result is same, i.e. making the DMA
> cache-coherent when IOMMU_CACHE is set. Or if you help define the
> behavior of IOMMU_CACHE, what will you define now?

It is clearly specifying how the kernel API works:

!IOMMU_CACHE -
    must call arch cache flushers
IOMMU_CACHE -
    do not call arch cache flushers
IOMMU_CACHE|IOMMU_BLOCK_NO_SNOOP -
    do not call arch cache flushers, and ignore the no-snoop bit.

On Intel it should refuse to create a !IOMMU_CACHE mapping since the HW
can't do that. All IOMMU formats can support IOMMU_CACHE. Only the
special no-snoop IOPTE format can support the final one, and it is only
useful for iommufd/vfio users that are interacting with VMs and wbinvd.
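The semantics spelled out above can be sketched as a couple of pure predicates. This is purely illustrative: the IOMMU_BLOCK_NO_SNOOP bit and these helpers do not exist in the kernel, and the flag values below are invented rather than taken from include/linux/iommu.h.

```c
/*
 * Illustrative model of the proposed flag semantics. The
 * IOMMU_BLOCK_NO_SNOOP bit and these helpers are hypothetical; the
 * flag values are made up for this sketch.
 */
#include <stdbool.h>

#define IOMMU_CACHE          (1 << 0)
#define IOMMU_BLOCK_NO_SNOOP (1 << 1)

/* !IOMMU_CACHE: the mapping is non-coherent, so DMA users must call
 * the arch cache maintenance helpers around each transfer. */
static bool must_call_arch_flushers(int prot)
{
	return !(prot & IOMMU_CACHE);
}

/* Only the special no-snoop IOPTE format forces coherence even for
 * TLPs that carry the PCI no-snoop attribute. */
static bool no_snoop_tlps_ignored(int prot)
{
	return (prot & IOMMU_CACHE) && (prot & IOMMU_BLOCK_NO_SNOOP);
}

/* Intel HW cannot express a non-coherent mapping at all, so under
 * this spec the driver would reject any !IOMMU_CACHE request. */
static bool intel_accepts(int prot)
{
	return (prot & IOMMU_CACHE) != 0;
}
```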

Jason

2021-09-29 06:29:56

by David Gibson

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Sun, Sep 19, 2021 at 02:38:38PM +0800, Liu Yi L wrote:
> After a device is bound to the iommufd, userspace can use this interface
> to query the underlying iommu capability and format info for this device.
> Based on this information the user then creates I/O address space in a
> compatible format with the to-be-attached devices.
>
> Device cookie which is registered at binding time is used to mark the
> device which is being queried here.
>
> Signed-off-by: Liu Yi L <[email protected]>
> ---
> drivers/iommu/iommufd/iommufd.c | 68 +++++++++++++++++++++++++++++++++
> include/uapi/linux/iommu.h | 49 ++++++++++++++++++++++++
> 2 files changed, 117 insertions(+)
>
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index e16ca21e4534..641f199f2d41 100644
> --- a/drivers/iommu/iommufd/iommufd.c
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode *inode, struct file *filep)
> return 0;
> }
>
> +static struct device *
> +iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64 dev_cookie)
> +{
> + struct iommufd_device *idev;
> + struct device *dev = NULL;
> + unsigned long index;
> +
> + mutex_lock(&ictx->lock);
> + xa_for_each(&ictx->device_xa, index, idev) {
> + if (idev->dev_cookie == dev_cookie) {
> + dev = idev->dev;
> + break;
> + }
> + }
> + mutex_unlock(&ictx->lock);
> +
> + return dev;
> +}
> +
> +static void iommu_device_build_info(struct device *dev,
> + struct iommu_device_info *info)
> +{
> + bool snoop;
> + u64 awidth, pgsizes;
> +
> + if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop))
> + info->flags |= snoop ? IOMMU_DEVICE_INFO_ENFORCE_SNOOP : 0;
> +
> + if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_PAGE_SIZE, &pgsizes)) {
> + info->pgsize_bitmap = pgsizes;
> + info->flags |= IOMMU_DEVICE_INFO_PGSIZES;
> + }
> +
> + if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &awidth)) {
> + info->addr_width = awidth;
> + info->flags |= IOMMU_DEVICE_INFO_ADDR_WIDTH;
> + }
> +}
> +
> +static int iommufd_get_device_info(struct iommufd_ctx *ictx,
> + unsigned long arg)
> +{
> + struct iommu_device_info info;
> + unsigned long minsz;
> + struct device *dev;
> +
> + minsz = offsetofend(struct iommu_device_info, addr_width);
> +
> + if (copy_from_user(&info, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (info.argsz < minsz)
> + return -EINVAL;
> +
> + info.flags = 0;
> +
> + dev = iommu_find_device_from_cookie(ictx, info.dev_cookie);
> + if (!dev)
> + return -EINVAL;
> +
> + iommu_device_build_info(dev, &info);
> +
> + return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
> +}
> +
> static long iommufd_fops_unl_ioctl(struct file *filep,
> unsigned int cmd, unsigned long arg)
> {
> @@ -127,6 +192,9 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
> return ret;
>
> switch (cmd) {
> + case IOMMU_DEVICE_GET_INFO:
> + ret = iommufd_get_device_info(ictx, arg);
> + break;
> default:
> pr_err_ratelimited("unsupported cmd %u\n", cmd);
> break;
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 59178fc229ca..76b71f9d6b34 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -7,6 +7,55 @@
> #define _UAPI_IOMMU_H
>
> #include <linux/types.h>
> +#include <linux/ioctl.h>
> +
> +/* -------- IOCTLs for IOMMU file descriptor (/dev/iommu) -------- */
> +
> +#define IOMMU_TYPE (';')
> +#define IOMMU_BASE 100
> +
> +/*
> + * IOMMU_DEVICE_GET_INFO - _IOR(IOMMU_TYPE, IOMMU_BASE + 1,
> + * struct iommu_device_info)
> + *
> + * Check IOMMU capabilities and format information on a bound device.
> + *
> + * The device is identified by device cookie (registered when binding
> + * this device).
> + *
> + * @argsz: user filled size of this data.
> + * @flags: tells userspace which capability info is available
> + * @dev_cookie: user assigned cookie.
> + * @pgsize_bitmap: Bitmap of supported page sizes. 1-setting of the
> + * bit in pgsize_bitmap[63:12] indicates a supported
> + * page size. Details as below table:
> + *
> + * +===============+============+
> + * | Bit[index] | Page Size |
> + * +---------------+------------+
> + * | 12 | 4 KB |
> + * +---------------+------------+
> + * | 13 | 8 KB |
> + * +---------------+------------+
> + * | 14 | 16 KB |
> + * +---------------+------------+
> + * ...
> + * @addr_width: the address width of supported I/O address spaces.
> + *
> + * Availability: after device is bound to iommufd
> + */
> +struct iommu_device_info {
> + __u32 argsz;
> + __u32 flags;
> +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP (1 << 0) /* IOMMU enforced snoop */
> +#define IOMMU_DEVICE_INFO_PGSIZES (1 << 1) /* supported page sizes */
> +#define IOMMU_DEVICE_INFO_ADDR_WIDTH (1 << 2) /* addr_width field valid */
> + __u64 dev_cookie;
> + __u64 pgsize_bitmap;
> + __u32 addr_width;

I think this is where you should be reporting available IOVA windows,
rather than just an address width. I know that for ppc a real
situation will be to have two different windows of different sizes:
that is the effective address width depends on which IOVA window
you're mapping into.
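This suggestion could look something like the sketch below: report a variable-length list of IOVA windows rather than a single addr_width. All struct names, fields, and layout here are hypothetical, invented for illustration, and not part of the RFC uAPI.

```c
/*
 * Hypothetical uAPI sketch for reporting discrete IOVA windows, as
 * suggested above. Names and layout are invented for illustration.
 */
#include <stddef.h>
#include <stdint.h>

struct iommu_iova_range {
	uint64_t start;	/* first usable IOVA in the window */
	uint64_t last;	/* last usable IOVA, inclusive */
};

struct iommu_device_info_ranges {
	uint32_t argsz;			/* user-filled total buffer size */
	uint32_t num_iova_ranges;	/* windows the kernel reports */
	struct iommu_iova_range ranges[];	/* flexible array of windows */
};

/* Buffer size needed for nranges windows, e.g. nranges == 2 for a ppc
 * device with a 32-bit window plus a separate 64-bit window. */
static size_t ranges_bufsize(uint32_t nranges)
{
	return sizeof(struct iommu_device_info_ranges) +
	       nranges * sizeof(struct iommu_iova_range);
}
```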


> +};
> +
> +#define IOMMU_DEVICE_GET_INFO _IO(IOMMU_TYPE, IOMMU_BASE + 1)
>
> #define IOMMU_FAULT_PERM_READ (1 << 0) /* read */
> #define IOMMU_FAULT_PERM_WRITE (1 << 1) /* write */

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-09-29 06:56:05

by David Gibson

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Wed, Sep 22, 2021 at 09:41:50AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 22, 2021 at 03:30:09AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Wednesday, September 22, 2021 1:41 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:38PM +0800, Liu Yi L wrote:
> > > > After a device is bound to the iommufd, userspace can use this interface
> > > > to query the underlying iommu capability and format info for this device.
> > > > Based on this information the user then creates I/O address space in a
> > > > compatible format with the to-be-attached devices.
> > > >
> > > > Device cookie which is registered at binding time is used to mark the
> > > > device which is being queried here.
> > > >
> > > > Signed-off-by: Liu Yi L <[email protected]>
> > > > drivers/iommu/iommufd/iommufd.c | 68 +++++++++++++++++++++++++++++++++
> > > > include/uapi/linux/iommu.h | 49 ++++++++++++++++++++++++
> > > > 2 files changed, 117 insertions(+)
> > > >
> > > > diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> > > > index e16ca21e4534..641f199f2d41 100644
> > > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > > @@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode *inode, struct file *filep)
> > > > return 0;
> > > > }
> > > >
> > > > +static struct device *
> > > > +iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64 dev_cookie)
> > > > +{
> > >
> > > We have an xarray ID for the device, why are we allowing userspace to
> > > use the dev_cookie as input?
> > >
> > > Userspace should always pass in the ID. The only place dev_cookie
> > > should appear is if the kernel generates an event back to
> > > userspace. Then the kernel should return both the ID and the
> > > dev_cookie in the event to allow userspace to correlate it.
> > >
> >
> > A little background.
> >
> > In earlier design proposal we discussed two options. One is to return
> > an kernel-allocated ID (label) to userspace. The other is to have user
> > register a cookie and use it in iommufd uAPI. At that time the two
> > options were discussed exclusively and the cookie one is preferred.
> >
> > Now you instead recommended a mixed option. We can follow it for
> > sure if nobody objects.
>
> Either one for the return is fine; I'd return both just because it is
> more flexible
>
> But the cookie should never be an input from userspace, and the kernel
> should never search for it. Locating the kernel object is what the ID
> and xarray is for.

Why do we need two IDs at all? Can't we just use the cookie as the
sole ID?

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-09-29 08:54:41

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

+Robin.

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, September 23, 2021 8:22 PM
>
> On Thu, Sep 23, 2021 at 12:05:29PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Thursday, September 23, 2021 7:27 PM
> > >
> > > On Thu, Sep 23, 2021 at 11:15:24AM +0100, Jean-Philippe Brucker wrote:
> > >
> > > > So we can only tell userspace "No_snoop is not supported" (provided we
> > > > even want to allow them to enable No_snoop). Users in control of stage-1
> > > > tables can create non-cacheable mappings through MAIR attributes.
> > >
> > > My point is that ARM is using IOMMU_CACHE to control the overall
> > > cachability of the DMA
> > >
> > > ie not specifying IOMMU_CACHE requires using the arch specific DMA
> > > cache flushers.
> > >
> > > Intel never uses arch specific DMA cache flushers, and instead is
> > > abusing IOMMU_CACHE to mean IOMMU_BLOCK_NO_SNOOP on DMA that
> > > is always cachable.
> >
> > it uses IOMMU_CACHE to force all DMAs to snoop, including those which
> > has non_snoop flag and wouldn't snoop cache if iommu is disabled. Nothing
> > is blocked.
>
> I see it differently, on Intel the only way to bypass the cache with
> DMA is to specify the no-snoop bit in the TLP. The IOMMU PTE flag we
> are talking about tells the IOMMU to ignore the no snoop bit.
>
> Again, Intel arch in the kernel does not support the DMA cache flush
> arch API and *DOES NOT* support incoherent DMA at all.
>
> ARM *does* implement the DMA cache flush arch API and is using
> IOMMU_CACHE to control if the caller will, or will not call the cache
> flushes.

I still didn't fully understand this point after reading the code.
Looking at dma-iommu, its cache flush functions are all coded with the
below as their first check:

if (dev_is_dma_coherent(dev) && !dev_is_untrusted(dev))
return;

dev->dma_coherent is initialized upon firmware info, not decided by
IOMMU_CACHE.

i.e. it's not IOMMU_CACHE to decide whether cache flushes should
be called.

Probably the confusion comes from __iommu_dma_alloc_noncontiguous:

if (!(ioprot & IOMMU_CACHE)) {
struct scatterlist *sg;
int i;

for_each_sg(sgt->sgl, sg, sgt->orig_nents, i)
arch_dma_prep_coherent(sg_page(sg), sg->length);
}

Here it makes more sense to be if (!coherent) {}.
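In other words, the early return quoted above can be modeled as a predicate on the device's firmware-described coherence and trust status, with IOMMU_CACHE playing no role. A tiny illustrative model (helper name invented):

```c
/*
 * Minimal model of the dma-iommu check quoted above: whether a cache
 * sync is needed depends on the device's coherence and trust status,
 * not on IOMMU_CACHE. Illustrative only.
 */
#include <stdbool.h>

/* mirrors: if (dev_is_dma_coherent(dev) && !dev_is_untrusted(dev)) return; */
static bool cache_sync_needed(bool dma_coherent, bool untrusted)
{
	return !dma_coherent || untrusted;
}
```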

With the above corrected, I think all iommu drivers do associate
IOMMU_CACHE with the snoop aspect:

Intel:
- either force snooping by ignoring snoop bit in TLP (IOMMU_CACHE)
- or has snoop decided by TLP (!IOMMU_CACHE)

ARM:
- set to snoop format if IOMMU_CACHE
- set to nonsnoop format if !IOMMU_CACHE
(in both cases TLP snoop bit is ignored?)

Other archs
- ignore IOMMU_CACHE as cache is always snooped via their IOMMUs

>
> This is fundamentally different from what Intel is using it for.
>
> > but why do you call it abuse? IOMMU_CACHE was first introduced for
> > Intel platform:
>
> IMHO ARM changed the meaning when Robin linked IOMMU_CACHE to
> dma_is_coherent stuff. At that point it became linked to 'do I need to
> call arch cache flushers or not'.
>

I didn't identify the exact commit for above meaning change.

Robin, could you help share some thoughts here?

Thanks
Kevin

2021-09-29 14:11:31

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Wed, Sep 29, 2021 at 08:48:28AM +0000, Tian, Kevin wrote:

> ARM:
> - set to snoop format if IOMMU_CACHE
> - set to nonsnoop format if !IOMMU_CACHE
> (in both cases TLP snoop bit is ignored?)

Where do you see this? I couldn't even find this functionality in the
ARM HW manual??

What I saw is ARM linking IOMMU_CACHE to an IO PTE bit that causes
the cache coherence to be disabled, which is not ignoring no snoop.

> I didn't identify the exact commit for above meaning change.
>
> Robin, could you help share some thoughts here?

It is this:

static int dma_info_to_prot(enum dma_data_direction dir, bool coherent,
unsigned long attrs)
{
int prot = coherent ? IOMMU_CACHE : 0;

Which sets IOMMU_CACHE based on:

static void *iommu_dma_alloc(struct device *dev, size_t size,
dma_addr_t *handle, gfp_t gfp, unsigned long attrs)
{
bool coherent = dev_is_dma_coherent(dev);
int ioprot = dma_info_to_prot(DMA_BIDIRECTIONAL, coherent, attrs);

Driving IOMMU_CACHE from dev_is_dma_coherent() has *NOTHING* to do
with no-snoop TLPs and everything to do with the arch cache
maintenance API

Jason

2021-09-30 12:08:42

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: Jason Gunthorpe
> Sent: Wednesday, September 29, 2021 8:37 PM
>
> On Wed, Sep 29, 2021 at 08:48:28AM +0000, Tian, Kevin wrote:
>
> > ARM:
> > - set to snoop format if IOMMU_CACHE
> > - set to nonsnoop format if !IOMMU_CACHE
> > (in both cases TLP snoop bit is ignored?)
>
> Where do you see this? I couldn't even find this functionality in the
> ARM HW manual??

Honestly speaking I'm getting confused by the complex attribute
transformation control (default, replace, combine, input, output, etc.)
in the SMMU manual. The above was my impression after my last check, but
now I cannot find the necessary info to build the same picture (except
the code below). :/

>
> What I saw is ARM linking the IOMMU_CACHE to a IO PTE bit that causes
> the cache coherence to be disabled, which is not ignoring no snoop.

My impression was that snoop is one way of implementing cache
coherency and now since the PTE can explicitly specify cache coherency
like below:

else if (prot & IOMMU_CACHE)
pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
else
pte |= ARM_LPAE_PTE_MEMATTR_NC;

This setting in concept overrides the snoop attribute from the device,
thus making it sort of ignored?
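The quoted attribute selection can be viewed as a tiny pure function from IOMMU prot flags to the LPAE memory attribute. A toy model, with invented constants and only the selection logic mirroring the quoted snippet:

```c
/*
 * Toy model of the quoted io-pgtable-arm attribute selection. The
 * constant values are invented; only the logic mirrors the snippet.
 */
#define IOMMU_CACHE 1

enum memattr { MEMATTR_OIWB, MEMATTR_NC };

/* IOMMU_CACHE -> Outer/Inner Write-Back (coherent), else Non-Cacheable */
static enum memattr pte_memattr(int prot)
{
	return (prot & IOMMU_CACHE) ? MEMATTR_OIWB : MEMATTR_NC;
}
```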

But I did see the manual says that:
--
Note: To achieve this 'pull-down' behavior, the No_snoop flag might
be carried through the SMMU and used to transform the SMMU output
downstream.
--

So again, just got confused here...

>
> > I didn't identify the exact commit for above meaning change.
> >
> > Robin, could you help share some thoughts here?
>
> It is this:
>
> static int dma_info_to_prot(enum dma_data_direction dir, bool coherent,
> unsigned long attrs)
> {
> int prot = coherent ? IOMMU_CACHE : 0;
>
> Which sets IOMMU_CACHE based on:
>
> static void *iommu_dma_alloc(struct device *dev, size_t size,
> dma_addr_t *handle, gfp_t gfp, unsigned long attrs)
> {
> bool coherent = dev_is_dma_coherent(dev);
> int ioprot = dma_info_to_prot(DMA_BIDIRECTIONAL, coherent, attrs);
>
> Driving IOMMU_CACHE from dev_is_dma_coherent() has *NOTHING* to do
> with no-snoop TLPs and everything to do with the arch cache
> maintenance API

Maybe I'll get a clearer picture on this after understanding the
difference between cache coherency and snoop on ARM. They are sort of
interchangeable on Intel (or possibly on x86, since I just found that
AMD completely ignores IOMMU_CACHE).

Thanks
Kevin

2021-09-30 13:01:07

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, September 23, 2021 7:42 PM
>
> On Thu, Sep 23, 2021 at 03:38:10AM +0000, Tian, Kevin wrote:
> > > From: Tian, Kevin
> > > Sent: Thursday, September 23, 2021 11:11 AM
> > >
> > > >
> > > > The required behavior for iommufd is to have the IOMMU ignore the
> > > > no-snoop bit so that Intel HW can disable wbinvd. This bit should be
> > > > clearly documented for its exact purpose and if other arches also have
> > > > instructions that need to be disabled if snoop TLPs are allowed then
> > > > they can re-use this bit. It appears ARM does not have this issue and
> > > > does not need the bit.
> > >
> > > Disabling wbinvd is one purpose. imo the more important intention
> > > is that iommu vendor uses different PTE formats between snoop and
> > > !snoop. As long as we want allow userspace to opt in case of isoch
> > > performance requirement (unlike current vfio which always choose
> > > snoop format if available), such mechanism is required for all vendors.
> > >
> >
> > btw I'm not sure whether the wbinvd trick is Intel specific. All other
> > platforms (amd, arm, s390, etc.) currently always claim
> > IOMMU_CAP_CACHE_COHERENCY (the source of IOMMU_CACHE).
>
> This only means they don't need to use the arch cache flush
> helpers. It has nothing to do with no-snoop on those platforms.
>
> > They didn't hit this problem because vfio always sets IOMMU_CACHE to
> > force every DMA to snoop. Will they need to handle similar
> > wbinvd-like trick (plus necessary memory type virtualization) when
> > non-snoop format is enabled? Or are their architectures highly
> > optimized to afford isoch traffic even with snoop (then fine to not
> > support user opt-in)?
>
> In other arches the question is:
> - Do they allow non-coherent DMA to exist in a VM?

And is coherency a static attribute per device, or can it be opted
into by the driver on such an arch? If the latter, then the same opt-in
path from userspace also sounds reasonable, since the driver is in
userspace now.

> - Can the VM issue cache maintaince ops to fix the decoherence?

As you listed them, the questions are all about non-coherent DMA, not
about how non-coherent DMAs are implemented underneath. From this
angle, isn't focusing on the coherency part as Alex suggested more
forward looking than tying the uAPI to a specific coherency
implementation using snoop?

>
> The Intel functional issue is that Intel blocks the cache maintaince
> ops from the VM and the VM has no way to self-discover that the cache
> maintaince ops don't work.

the VM doesn't need to know whether the maintenance ops
actually work. It just treats the device as if those ops are always
required. The hypervisor will figure out whether those ops should
be blocked based on whether coherency is guaranteed by the iommu
via iommufd/vfio.

>
> Other arches don't seem to have this specific problem...

I think the key is whether other archs allow the driver to decide DMA
coherency, and indirectly the underlying I/O page table format.
If yes, then I don't see a reason why such a decision should not be
given to userspace in the passthrough case.

Thanks
Kevin

2021-09-30 13:47:21

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On 2021/9/30 16:49, Tian, Kevin wrote:
>> From: Jason Gunthorpe <[email protected]>
>> Sent: Thursday, September 23, 2021 8:22 PM
>>
>>>> These are different things and need different bits. Since the ARM path
>>>> has a lot more code supporting it, I'd suggest Intel should change
>>>> their code to use IOMMU_BLOCK_NO_SNOOP and abandon IOMMU_CACHE.
>>>
>>> I didn't fully get this point. The end result is same, i.e. making the DMA
>>> cache-coherent when IOMMU_CACHE is set. Or if you help define the
>>> behavior of IOMMU_CACHE, what will you define now?
>>
>> It is clearly specifying how the kernel API works:
>>
>> !IOMMU_CACHE
>> must call arch cache flushers
>> IOMMU_CACHE -
>> do not call arch cache flushers
>> IOMMU_CACHE|IOMMU_BLOCK_NO_SNOOP -
>> do not call arch cache flushers, and ignore the no snoop bit.
>
> Who will set IOMMU_BLOCK_NO_SNOOP? I feel this is arch specific
> knowledge about how cache coherency is implemented, i.e.
> when IOMMU_CACHE is set intel-iommu driver just maps it to
> blocking no-snoop. It's not necessarily to be an attribute in
> the same level as IOMMU_CACHE?
>
>>
>> On Intel it should refuse to create a !IOMMU_CACHE since the HW can't
>> do that.
>
> Agree. In reality I guess this is not hit because all devices are marked
> coherent on Intel platforms...
>
> Baolu, any insight here?

I am trying to follow the discussion here. Please guide me if I didn't
get the right context.

Here, we are discussing arch_sync_dma_for_cpu() and
arch_sync_dma_for_device(). The x86 arch has clflush to sync the dma
buffer for the device, but I can't see any instruction to sync the dma
buffer for the cpu if the device is not cache coherent. Is that the
reason why x86 can't have an implementation for arch_sync_dma_for_cpu(),
and hence all devices are marked coherent?

> Thanks
> Kevin
>

Best regards,
baolu

2021-09-30 14:22:37

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, September 23, 2021 8:22 PM
>
> > > These are different things and need different bits. Since the ARM path
> > > has a lot more code supporting it, I'd suggest Intel should change
> > > their code to use IOMMU_BLOCK_NO_SNOOP and abandon IOMMU_CACHE.
> >
> > I didn't fully get this point. The end result is same, i.e. making the DMA
> > cache-coherent when IOMMU_CACHE is set. Or if you help define the
> > behavior of IOMMU_CACHE, what will you define now?
>
> It is clearly specifying how the kernel API works:
>
> !IOMMU_CACHE
> must call arch cache flushers
> IOMMU_CACHE -
> do not call arch cache flushers
> IOMMU_CACHE|IOMMU_BLOCK_NO_SNOOP -
> do not call arch cache flushers, and ignore the no snoop bit.

Who will set IOMMU_BLOCK_NO_SNOOP? I feel this is arch-specific
knowledge about how cache coherency is implemented, i.e.
when IOMMU_CACHE is set the intel-iommu driver just maps it to
blocking no-snoop. It doesn't necessarily need to be an attribute
at the same level as IOMMU_CACHE, does it?

>
> On Intel it should refuse to create a !IOMMU_CACHE since the HW can't
> do that.

Agree. In reality I guess this is not hit because all devices are marked
coherent on Intel platforms...

Baolu, any insight here?

Thanks
Kevin

2021-09-30 14:30:32

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Sep 30, 2021 at 08:30:42AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Wednesday, September 29, 2021 8:37 PM
> >
> > On Wed, Sep 29, 2021 at 08:48:28AM +0000, Tian, Kevin wrote:
> >
> > > ARM:
> > > - set to snoop format if IOMMU_CACHE
> > > - set to nonsnoop format if !IOMMU_CACHE
> > > (in both cases TLP snoop bit is ignored?)
> >
> > Where do you see this? I couldn't even find this functionality in the
> > ARM HW manual??
>
> Honestly speaking I'm getting confused by the complex attribute
> transformation control (default, replace, combine, input, output, etc.)
> in SMMU manual. Above was my impression after last check, but now
> I cannot find necessary info to build the same picture (except below
> code). :/
>
> >
> > What I saw is ARM linking the IOMMU_CACHE to a IO PTE bit that causes
> > the cache coherence to be disabled, which is not ignoring no snoop.
>
> My impression was that snoop is one way of implementing cache
> coherency and now since the PTE can explicitly specify cache coherency
> like below:
>
> else if (prot & IOMMU_CACHE)
> pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> else
> pte |= ARM_LPAE_PTE_MEMATTR_NC;
>
> This setting in concept overrides the snoop attribute from the device thus
> make it sort of ignored?

To make sure we're talking about the same thing: "the snoop attribute from
the device" is the "No snoop" attribute in the PCI TLP, right?

The PTE flags define whether the memory access is cache-coherent or not.
* WB is cacheable (short for write-back cacheable. Doesn't matter here
what OI or RWA mean.)
* NC is non-cacheable.

       | Normal PCI access | No_snoop PCI access
-------+-------------------+--------------------
PTE WB | Cacheable         | Non-cacheable
PTE NC | Non-cacheable     | Non-cacheable

Cacheable memory accesses participate in cache coherency. Non-cacheable
accesses go directly to memory and do not cause cache allocation.
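The table reduces to a single predicate: an access is cache-coherent only when the PTE is write-back cacheable and the TLP does not carry No_snoop. A small model of that rule (not SMMU code; names invented):

```c
/* Model of the attribute-combination table above: the effective access
 * is cacheable only for PTE WB plus a normal (snooping) PCI access. */
#include <stdbool.h>

enum pte_attr { PTE_WB, PTE_NC };

static bool access_cacheable(enum pte_attr attr, bool tlp_no_snoop)
{
	return attr == PTE_WB && !tlp_no_snoop;
}
```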

On Arm cache coherency is configured through PTE attributes. I don't think
PCI No_snoop should be used because it's not necessarily supported
throughout the system and, as far as I understand, software can't discover
whether it is.

[...]
> Maybe I'll get a clearer picture on this after understanding the difference
> between cache coherency and snoop on ARM.

The architecture uses terms "cacheable" and "coherent". The term "snoop"
is used when referring specifically to the PCI "No snoop" attribute. It is
also used within the interconnect coherency protocols, which are invisible
to software.

Thanks,
Jean

2021-09-30 22:25:40

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Sep 30, 2021 at 11:33:13AM +0100, Jean-Philippe Brucker wrote:
> On Thu, Sep 30, 2021 at 08:30:42AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Wednesday, September 29, 2021 8:37 PM
> > >
> > > On Wed, Sep 29, 2021 at 08:48:28AM +0000, Tian, Kevin wrote:
> > >
> > > > ARM:
> > > > - set to snoop format if IOMMU_CACHE
> > > > - set to nonsnoop format if !IOMMU_CACHE
> > > > (in both cases TLP snoop bit is ignored?)
> > >
> > > Where do you see this? I couldn't even find this functionality in the
> > > ARM HW manual??
> >
> > Honestly speaking I'm getting confused by the complex attribute
> > transformation control (default, replace, combine, input, output, etc.)
> > in SMMU manual. Above was my impression after last check, but now
> > I cannot find necessary info to build the same picture (except below
> > code). :/
> >
> > >
> > > What I saw is ARM linking the IOMMU_CACHE to a IO PTE bit that causes
> > > the cache coherence to be disabled, which is not ignoring no snoop.
> >
> > My impression was that snoop is one way of implementing cache
> > coherency and now since the PTE can explicitly specify cache coherency
> > like below:
> >
> > else if (prot & IOMMU_CACHE)
> > pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> > else
> > pte |= ARM_LPAE_PTE_MEMATTR_NC;
> >
> > This setting in concept overrides the snoop attribute from the device thus
> > make it sort of ignored?
>
> To make sure we're talking about the same thing: "the snoop attribute from
> the device" is the "No snoop" attribute in the PCI TLP, right?
>
> The PTE flags define whether the memory access is cache-coherent or not.
> * WB is cacheable (short for write-back cacheable. Doesn't matter here
> what OI or RWA mean.)
> * NC is non-cacheable.
>
> | Normal PCI access | No_snoop PCI access
> -------+-------------------+-------------------
> PTE WB | Cacheable | Non-cacheable
> PTE NC | Non-cacheable | Non-cacheable
>
> Cacheable memory access participate in cache coherency. Non-cacheable
> accesses go directly to memory, do not cause cache allocation.

This table is what I was thinking after reading through the ARM docs.

> On Arm cache coherency is configured through PTE attributes. I don't think
> PCI No_snoop should be used because it's not necessarily supported
> throughout the system and, as far as I understand, software can't discover
> whether it is.

The usage of no-snoop is a behavior of a device. A generic PCI driver
should be able to program the device to generate no-snoop TLPs and
ideally rely on an arch specific API in the OS to trigger the required
cache maintenance.

It doesn't make much sense for a portable driver to rely on a
non-portable IO PTE flag to control coherency, since that is not a
standards based approach.

That said, Linux doesn't have a generic DMA API to support
no-snoop. The few GPU drivers that use this stuff just hardwired
wbinvd on Intel..

What I don't really understand is why ARM, with an IOMMU that supports
PTE WB, has devices where dev_is_dma_coherent() == false ?

Is it the case that DMA from those devices ignores the IO PTE's
cachable mode?

Jason

2021-09-30 22:41:55

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Sep 30, 2021 at 09:35:45AM +0000, Tian, Kevin wrote:

> > The Intel functional issue is that Intel blocks the cache maintaince
> > ops from the VM and the VM has no way to self-discover that the cache
> > maintaince ops don't work.
>
> the VM doesn't need to know whether the maintenance ops
> actually works.

Which is the whole problem.

Intel has a design where the device driver tells the device to issue
non-cachable TLPs.

The driver is supposed to know if it can issue the cache maintenance
instructions - if it can then it should ask the device to issue
no-snoop TLPs.

For instance the same PCI driver on non-x86 should never ask the
device to issue no-snoop TLPs because it has no idea how to restore
cache coherence on eg ARM.

Do you see the issue? This configuration where the hypervisor silently
makes wbinvd a NOP breaks the x86 architecture because the guest has no
idea it can no longer use no-snoop features.

Using the IOMMU to forcibly prevent the device from issuing no-snoop
makes this whole issue of the broken wbinvd moot.

It is important to be really clear on what this is about - this is not
some idealized nice iommu feature - it is working around a lot of
backwards compatibility baggage that is probably completely unique to
x86.

> > Other arches don't seem to have this specific problem...
>
> I think the key is whether other archs allow driver to decide DMA
> coherency and indirectly the underlying I/O page table format.
> If yes, then I don't see a reason why such decision should not be
> given to userspace for passthrough case.

The choice all comes down to if the other arches have cache
maintenance instructions in the VM that *don't work*

Jason

2021-09-30 22:58:13

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Sep 30, 2021 at 08:49:03AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Thursday, September 23, 2021 8:22 PM
> >
> > > > These are different things and need different bits. Since the ARM path
> > > > has a lot more code supporting it, I'd suggest Intel should change
> > > > their code to use IOMMU_BLOCK_NO_SNOOP and abandon
> > IOMMU_CACHE.
> > >
> > > I didn't fully get this point. The end result is same, i.e. making the DMA
> > > cache-coherent when IOMMU_CACHE is set. Or if you help define the
> > > behavior of IOMMU_CACHE, what will you define now?
> >
> > It is clearly specifying how the kernel API works:
> >
> > !IOMMU_CACHE
> > must call arch cache flushers
> > IOMMU_CACHE -
> > do not call arch cache flushers
> > IOMMU_CACHE|IOMMU_BLOCK_NO_SNOOP -
> > do not call arch cache flushers, and ignore the no snoop bit.
>
> Who will set IOMMU_BLOCK_NO_SNOOP?

Basically only qemu due to specialized x86 hypervisor knowledge.

The only purpose of this attribute is to support a specific
virtualization use case where a whole bunch of stuff is broken
together:
- the cache maintenance instructions are not available to a guest
- the guest isn't aware that the instructions don't work and tells
the device to issue no-snoop TLPs
- The device ignores the 'disable no-snoop' flag in the PCIe config
space

Thus things become broken.

Jason

2021-10-01 03:37:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Sep 30, 2021 at 07:23:55PM -0300, Jason Gunthorpe wrote:
> > > The Intel functional issue is that Intel blocks the cache maintaince
> > > ops from the VM and the VM has no way to self-discover that the cache
> > > maintaince ops don't work.
> >
> > the VM doesn't need to know whether the maintenance ops
> > actually works.
>
> Which is the whole problem.
>
> Intel has a design where the device driver tells the device to issue
> non-cachable TLPs.
>
> The driver is supposed to know if it can issue the cache maintaince
> instructions - if it can then it should ask the device to issue
> no-snoop TLPs.

The driver should never issue them. This whole idea that a driver
can just magically poke the cache directly is just one of those horrible
shortcuts that seem to happen in GPU land all the time but nowhere
else.

> > coherency and indirectly the underlying I/O page table format.
> > If yes, then I don't see a reason why such decision should not be
> > given to userspace for passthrough case.
>
> The choice all comes down to if the other arches have cache
> maintenance instructions in the VM that *don't work*

Or have them at all.

2021-10-01 04:03:10

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Sep 30, 2021 at 07:04:46PM -0300, Jason Gunthorpe wrote:
> > On Arm cache coherency is configured through PTE attributes. I don't think
> > PCI No_snoop should be used because it's not necessarily supported
> > throughout the system and, as far as I understand, software can't discover
> > whether it is.
>
> The usage of no-snoop is a behavior of a device. A generic PCI driver
> should be able to program the device to generate no-snoop TLPs and
> ideally rely on an arch specific API in the OS to trigger the required
> cache maintenance.

Well, it is a combination of the device, the root port and the driver
which all need to be in line to use this.

> It doesn't make much sense for a portable driver to rely on a
> non-portable IO PTE flag to control coherency, since that is not a
> standards based approach.
>
> That said, Linux doesn't have a generic DMA API to support
> no-snoop. The few GPUs drivers that use this stuff just hardwired
> wbsync on Intel..

Yes, as usual the GPU folks come up with nasty hacks instead of
providing a generic helper. Basically all we'd need to support it
in a generic way is:

- a DMA_ATTR_NO_SNOOP (or DMA_ATTR_FORCE_NONCOHERENT to fit the Linux
terminology) which treats the current dma_map/unmap/sync calls as
if dev_is_dma_coherent was false
- a way for the driver to discover that a given architecture / running
system actually supports this

> What I don't really understand is why ARM, with an IOMMU that supports
> PTE WB, has devices where dev_is_dma_coherent() == false ?

Because no IOMMU in the world can help the fact that a peripheral on the
SoC is not part of the cache coherency protocol.

2021-10-01 04:03:49

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Sep 30, 2021 at 09:43:58PM +0800, Lu Baolu wrote:
> Here, we are discussing arch_sync_dma_for_cpu() and
> arch_sync_dma_for_device(). The x86 arch has clflush to sync dma buffer
> for device, but I can't see any instruction to sync dma buffer for cpu
> if the device is not cache coherent. Is that the reason why x86 can't
> have an implementation for arch_sync_dma_for_cpu(), hence all devices
> are marked coherent?

arch_sync_dma_for_cpu and arch_sync_dma_for_device are only used if
the device is marked non-coherent, that is if Linux knows the device
can't be part of the cache coherency protocol. There are no known
x86 systems with entirely not cache coherent devices so these helpers
won't be useful as-is.

2021-10-14 08:03:23

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: Jean-Philippe Brucker <[email protected]>
> Sent: Thursday, September 30, 2021 6:33 PM
>
> The PTE flags define whether the memory access is cache-coherent or not.
> * WB is cacheable (short for write-back cacheable. Doesn't matter here
> what OI or RWA mean.)
> * NC is non-cacheable.
>
> | Normal PCI access | No_snoop PCI access
> -------+-------------------+-------------------
> PTE WB | Cacheable | Non-cacheable
> PTE NC | Non-cacheable | Non-cacheable

This implies that PCI no-snoop supersedes PTE flags when it's supported
by the system?

2021-10-14 08:15:31

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: [email protected] <[email protected]>
> Sent: Friday, October 1, 2021 11:28 AM
>
> On Thu, Sep 30, 2021 at 07:04:46PM -0300, Jason Gunthorpe wrote:
> > > On Arm cache coherency is configured through PTE attributes. I don't
> think
> > > PCI No_snoop should be used because it's not necessarily supported
> > > throughout the system and, as far as I understand, software can't
> discover
> > > whether it is.
> >
> > The usage of no-snoop is a behavior of a device. A generic PCI driver
> > should be able to program the device to generate no-snoop TLPs and
> > ideally rely on an arch specific API in the OS to trigger the required
> > cache maintenance.
>
> Well, it is a combination of the device, the root port and the driver
> which all need to be in line to use this.
>
> > It doesn't make much sense for a portable driver to rely on a
> > non-portable IO PTE flag to control coherency, since that is not a
> > standards based approach.
> >
> > That said, Linux doesn't have a generic DMA API to support
> > no-snoop. The few GPUs drivers that use this stuff just hardwired
> > wbsync on Intel..
>
> Yes, as usual the GPU folks come up with nasty hacks instead of
> providing generic helper. Basically all we'd need to support it
> in a generic way is:
>
> - a DMA_ATTR_NO_SNOOP (or DMA_ATTR_FORCE_NONCOHERENT to fit the
> Linux
> terminology) which treats the current dma_map/unmap/sync calls as
> if dev_is_dma_coherent was false
> - a way for the driver to discover that a given architecture / running
> system actually supports this

Based on the above information my interpretation is that the existing
DMA API manages coherency per device and it's not designed for
devices which are coherent in nature but also set PCI no-snoop
for selective traffic. Then the new DMA_ATTR_NO_SNOOP, once
set in dma_map, allows the driver to follow non-coherent
semantics even when the device itself is considered coherent.

Does that capture the whole story correctly?

>
> > What I don't really understand is why ARM, with an IOMMU that supports
> > PTE WB, has devices where dev_is_dma_coherent() == false ?
>
> Because no IOMMU in the world can help that fact that a periphal on the
> SOC is not part of the cache coherency protocol.

but since DMA goes through the IOMMU, isn't the IOMMU the one that
should decide the final cache coherency? What would be the case
if the IOMMU sets WB while the peripheral doesn't want it?

Thanks
Kevin

2021-10-14 08:24:25

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Oct 14, 2021 at 08:13:03AM +0000, Tian, Kevin wrote:
> Based on above information my interpretation is that existing
> DMA API manages coherency per device and It's not designed for
> devices which are coherent in nature but also set PCI no-snoop
> for selective traffic. Then the new DMA_ATTR_NO_SNOOP, once
> set in dma_map, allows the driver to follow non-coherent
> semantics even when the device itself is considered coherent.
>
> Does it capture the whole story correct?

Yes.

> > > What I don't really understand is why ARM, with an IOMMU that supports
> > > PTE WB, has devices where dev_is_dma_coherent() == false ?
> >
> > Because no IOMMU in the world can help that fact that a periphal on the
> > SOC is not part of the cache coherency protocol.
>
> but since DMA goes through IOMMU then isn't IOMMU the one who
> should decide the final cache coherency? What would be the case
> if the IOMMU sets WB while the peripheral doesn't want it?

No. An IOMMU deals with address translation; it can't paper over
the fact that there is no coherency possible.

2021-10-14 08:31:17

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: [email protected] <[email protected]>
> Sent: Thursday, October 14, 2021 4:22 PM
>
> > > > What I don't really understand is why ARM, with an IOMMU that
> supports
> > > > PTE WB, has devices where dev_is_dma_coherent() == false ?
> > >
> > > Because no IOMMU in the world can help that fact that a periphal on the
> > > SOC is not part of the cache coherency protocol.
> >
> > but since DMA goes through IOMMU then isn't IOMMU the one who
> > should decide the final cache coherency? What would be the case
> > if the IOMMU sets WB while the peripheral doesn't want it?
>
> No. And IOMMU deal with address translation, it can't paper over
> a fact that there is no coherency possible.

Does it relate to the ATS story, where the device gets the translated
address from the IOMMU and then directly sends the request to the
memory controller? In that way, if the device is not in the cache
coherency domain then nothing can change it.

Then if ATS is disabled, suppose the untranslated request from the
device is translated and forwarded by the IOMMU to the memory controller.
In this case the IOMMU should be able to join the coherency protocol
even when the originating device itself cannot.

Is above understanding correct?

Thanks
Kevin

2021-10-14 09:14:23

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, October 1, 2021 6:24 AM
>
> On Thu, Sep 30, 2021 at 09:35:45AM +0000, Tian, Kevin wrote:
>
> > > The Intel functional issue is that Intel blocks the cache maintaince
> > > ops from the VM and the VM has no way to self-discover that the cache
> > > maintaince ops don't work.
> >
> > the VM doesn't need to know whether the maintenance ops
> > actually works.
>
> Which is the whole problem.
>
> Intel has a design where the device driver tells the device to issue
> non-cachable TLPs.
>
> The driver is supposed to know if it can issue the cache maintaince
> instructions - if it can then it should ask the device to issue
> no-snoop TLPs.
>
> For instance the same PCI driver on non-x86 should never ask the
> device to issue no-snoop TLPs because it has no idea how to restore
> cache coherence on eg ARM.
>
> Do you see the issue? This configuration where the hypervisor silently
> make wbsync a NOP breaks the x86 architecture because the guest has no
> idea it can no longer use no-snoop features.

Thanks for explanation. But I still have one puzzle about the 'break'
part.

If the hypervisor makes wbinvd a NOP then it will also set the
enforce_snoop bit in the PTE to convert non-snoop packets to snoop. No
function in the guest is broken, just the performance may lag.

If performance matters then the hypervisor configures the IOMMU to allow
non-snoop packets and emulates wbinvd properly.

The contract between vfio and kvm is to convey the above requirement
on how wbinvd is handled.

But in both cases cache maintenance instructions are available from
the guest p.o.v. and no coherency semantics would be violated.

>
> Using the IOMMU to forcibly prevent the device from issuing no-snoop
> makes this whole issue of the broken wbsync moot.

it's not preventing issuing. Instead, the IOMMU converts non-snoop
requests to snoop.

>
> It is important to be really clear on what this is about - this is not
> some idealized nice iommu feature - it is working around alot of
> backwards compatability baggage that is probably completely unique to
> x86.
>
> > > Other arches don't seem to have this specific problem...
> >
> > I think the key is whether other archs allow driver to decide DMA
> > coherency and indirectly the underlying I/O page table format.
> > If yes, then I don't see a reason why such decision should not be
> > given to userspace for passthrough case.
>
> The choice all comes down to if the other arches have cache
> maintenance instructions in the VM that *don't work*
>

Looks like vfio always sets IOMMU_CACHE on all platforms as long as
the iommu supports it (true on all platforms except the intel iommu
which is dedicated for GPU):

vfio_iommu_type1_attach_group()
{
...
if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
domain->prot |= IOMMU_CACHE;
...
}

Should above be set according to whether a device is coherent?

Thanks
Kevin

2021-10-14 09:18:52

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Oct 14, 2021 at 08:01:49AM +0000, Tian, Kevin wrote:
> > From: Jean-Philippe Brucker <[email protected]>
> > Sent: Thursday, September 30, 2021 6:33 PM
> >
> > The PTE flags define whether the memory access is cache-coherent or not.
> > * WB is cacheable (short for write-back cacheable. Doesn't matter here
> > what OI or RWA mean.)
> > * NC is non-cacheable.
> >
> > | Normal PCI access | No_snoop PCI access
> > -------+-------------------+-------------------
> > PTE WB | Cacheable | Non-cacheable
> > PTE NC | Non-cacheable | Non-cacheable
>
> This implies that PCI no-snoop supersedes PTE flags when it's supported
> by the system?
>

Yes, no way for the SMMU to ignore no-snoop, as far as I can see

Thanks,
Jean

2021-10-14 18:31:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Oct 14, 2021 at 09:11:58AM +0000, Tian, Kevin wrote:

> But in both cases cache maintenance instructions are available from
> guest p.o.v and no coherency semantics would be violated.

You've described how Intel's solution papers over the problem.

In part wbinvd is defined to restore CPU cache coherence after a
no-snoop DMA. Having wbinvd NOP breaks this contract.

To counter-act the broken wbinvd the IOMMU completely prevents the use
of no-snoop DMA. It converts them to snoop instead.

The driver thinks it has no-snoop. The platform appears to support
no-snoop. The driver issues wbinvd - but it does nothing.

Don't think any of this is even remotely related to what ARM is doing
here. ARM has neither the broken VM cache ops, nor the IOMMU ability
to suppress no-snoop.

> > > I think the key is whether other archs allow driver to decide DMA
> > > coherency and indirectly the underlying I/O page table format.
> > > If yes, then I don't see a reason why such decision should not be
> > > given to userspace for passthrough case.
> >
> > The choice all comes down to if the other arches have cache
> > maintenance instructions in the VM that *don't work*
>
> Looks vfio always sets IOMMU_CACHE on all platforms as long as
> iommu supports it (true on all platforms except intel iommu which
> is dedicated for GPU):
>
> vfio_iommu_type1_attach_group()
> {
> ...
> if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
> domain->prot |= IOMMU_CACHE;
> ...
> }
>
> Should above be set according to whether a device is coherent?

For IOMMU_CACHE there are two questions related to the overloaded
meaning:

- Should VFIO ask the IOMMU to use non-coherent DMA (ARM meaning)
This depends on how the VFIO user expects to operate the DMA.
If the VFIO user can issue cache maintenance ops then IOMMU_CACHE
should be controlled by the user. I have no idea what platforms
support user space cache maintenance ops.

- Should VFIO ask the IOMMU to suppress no-snoop (Intel meaning)
This depends if the VFIO user has access to wbinvd or not.

wbinvd is a privileged instruction so normally userspace will not
be able to access it.

Per Paolo recommendation there should be a uAPI someplace that
allows userspace to issue wbinvd - basically the suppress no-snoop
is also user controllable.

The two things are very similar and ultimately are a choice userspace
should be making.

From something like a qemu perspective things are more murkey - eg on
ARM qemu needs to co-ordinate with the guest. Whatever IOMMU_CACHE
mode VFIO is using must match the device coherent flag in the Linux
guest. I'm guessing all Linux guest VMs only use coherent DMA for all
devices today. I don't know if the cache maintenance ops are even
permitted in an ARM VM.

Jason

2021-10-15 10:07:32

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, October 14, 2021 11:43 PM
>
> > > > I think the key is whether other archs allow driver to decide DMA
> > > > coherency and indirectly the underlying I/O page table format.
> > > > If yes, then I don't see a reason why such decision should not be
> > > > given to userspace for passthrough case.
> > >
> > > The choice all comes down to if the other arches have cache
> > > maintenance instructions in the VM that *don't work*
> >
> > Looks vfio always sets IOMMU_CACHE on all platforms as long as
> > iommu supports it (true on all platforms except intel iommu which
> > is dedicated for GPU):
> >
> > vfio_iommu_type1_attach_group()
> > {
> > ...
> > if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
> > domain->prot |= IOMMU_CACHE;
> > ...
> > }
> >
> > Should above be set according to whether a device is coherent?
>
> For IOMMU_CACHE there are two questions related to the overloaded
> meaning:
>
> - Should VFIO ask the IOMMU to use non-coherent DMA (ARM meaning)
> This depends on how the VFIO user expects to operate the DMA.
> If the VFIO user can issue cache maintenance ops then IOMMU_CACHE
> should be controlled by the user. I have no idea what platforms
> support user space cache maintenance ops.

But just like you said for the Intel meaning below, even if those ops are
privileged a uAPI can be provided to support such usage if necessary.

>
> - Should VFIO ask the IOMMU to suppress no-snoop (Intel meaning)
> This depends if the VFIO user has access to wbinvd or not.
>
> wbinvd is a privileged instruction so normally userspace will not
> be able to access it.
>
> Per Paolo recommendation there should be a uAPI someplace that
> allows userspace to issue wbinvd - basically the suppress no-snoop
> is also user controllable.
>
> The two things are very similar and ultimately are a choice userspace
> should be making.

yes

>
> From something like a qemu perspective things are more murkey - eg on
> ARM qemu needs to co-ordinate with the guest. Whatever IOMMU_CACHE
> mode VFIO is using must match the device coherent flag in the Linux
> guest. I'm guessing all Linux guest VMs only use coherent DMA for all
> devices today. I don't know if the cache maintaince ops are even
> permitted in an ARM VM.
>

I'll leave it to Jean to confirm. If only coherent DMA can be used in
the guest on other platforms, I suppose VFIO should not blindly set
IOMMU_CACHE, and in concept it should deny assigning a non-coherent
device since no co-ordination with the guest exists today.

So the bottomline is that we'll keep this no-snoop thing Intel-specific.
For the basic skeleton we'll not support no-snoop thus the user
needs to set enforce-snoop flag when creating an IOAS like this RFC v1
does. Also need to introduce a new flag instead of abusing
IOMMU_CACHE in the kernel. For other platforms it may need a fix
to deny non-coherent device (based on above open) for now.

Thanks
Kevin

2021-10-21 02:29:19

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: Tian, Kevin
> Sent: Friday, October 15, 2021 9:02 AM
>
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Thursday, October 14, 2021 11:43 PM
> >
> > > > > I think the key is whether other archs allow driver to decide DMA
> > > > > coherency and indirectly the underlying I/O page table format.
> > > > > If yes, then I don't see a reason why such decision should not be
> > > > > given to userspace for passthrough case.
> > > >
> > > > The choice all comes down to if the other arches have cache
> > > > maintenance instructions in the VM that *don't work*
> > >
> > > Looks vfio always sets IOMMU_CACHE on all platforms as long as
> > > iommu supports it (true on all platforms except intel iommu which
> > > is dedicated for GPU):
> > >
> > > vfio_iommu_type1_attach_group()
> > > {
> > > ...
> > > if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
> > > domain->prot |= IOMMU_CACHE;
> > > ...
> > > }
> > >
> > > Should above be set according to whether a device is coherent?
> >
> > For IOMMU_CACHE there are two questions related to the overloaded
> > meaning:
> >
> > - Should VFIO ask the IOMMU to use non-coherent DMA (ARM meaning)
> > This depends on how the VFIO user expects to operate the DMA.
> > If the VFIO user can issue cache maintenance ops then IOMMU_CACHE
> > should be controlled by the user. I have no idea what platforms
> > support user space cache maintenance ops.
>
> But just like you said for intel meaning below, even if those ops are
> privileged a uAPI can be provided to support such usage if necessary.
>
> >
> > - Should VFIO ask the IOMMU to suppress no-snoop (Intel meaning)
> > This depends if the VFIO user has access to wbinvd or not.
> >
> > wbinvd is a privileged instruction so normally userspace will not
> > be able to access it.
> >
> > Per Paolo recommendation there should be a uAPI someplace that
> > allows userspace to issue wbinvd - basically the suppress no-snoop
> > is also user controllable.
> >
> > The two things are very similar and ultimately are a choice userspace
> > should be making.
>
> yes
>
> >
> > From something like a qemu perspective things are more murkey - eg on
> > ARM qemu needs to co-ordinate with the guest. Whatever IOMMU_CACHE
> > mode VFIO is using must match the device coherent flag in the Linux
> > guest. I'm guessing all Linux guest VMs only use coherent DMA for all
> > devices today. I don't know if the cache maintaince ops are even
> > permitted in an ARM VM.
> >
>
> I'll leave it to Jean to confirm. If only coherent DMA can be used in
> the guest on other platforms, suppose VFIO should not blindly set
> IOMMU_CACHE and in concept it should deny assigning a non-coherent
> device since no co-ordination with guest exists today.

Jean, what's your opinion?

>
> So the bottomline is that we'll keep this no-snoop thing Intel-specific.
> For the basic skeleton we'll not support no-snoop thus the user
> needs to set enforce-snoop flag when creating an IOAS like this RFC v1
> does. Also need to introduce a new flag instead of abusing
> IOMMU_CACHE in the kernel. For other platforms it may need a fix
> to deny non-coherent device (based on above open) for now.
>

Jason, want to check whether another option works here.

The current proposal lets the user choose whether the I/O page
table should be put in an enforced-snoop format, with the assumption
that the user may have better knowledge than the kernel of the
no-snoop requirement. This leads to the current design, which exposes
whether the IOMMU behind a device supports enforce-snoop via
IOMMU_DEVICE_GET_INFO to the user and then has the user
set/clear the enforce-snoop flag in IOMMU_IOASID_ALLOC.

This makes sense if there are no-snoop devices behind an IOMMU
supporting enforce-snoop.

But in reality only Intel integrated GPUs have this special no-snoop
trick (fixed knowledge), with a dedicated IOMMU which doesn't
support the enforce-snoop format at all. In this case there is no
further choice for the user to make.

Also per Christoph's comment no-snoop is not an encouraged
usage overall.

Given that, I wonder whether the current vfio model better suits
this corner case, i.e. just let the kernel handle it instead of
exposing it in the uAPI. The simple policy (as vfio does) is to
automatically set enforce-snoop when the target IOMMU supports it,
and otherwise enable the vfio/kvm contract to handle the no-snoop
requirement.

I don't see any interest in implementing an Intel GPU driver fully
in userspace. If just talking about possibility, a separate uAPI can
be still introduced to allow the userspace to issue wbinvd as Paolo
suggested.

One side-effect of doing so is that we may then have to support
multiple domains per IOAS when an Intel GPU and other devices are
attached to the same IOAS. But this doesn't have to be implemented
in the basic skeleton now; it can be extended later when we start
working on Intel GPU support. Overall it also improves performance:
assuming one domain per IOAS, the user would otherwise have to create
two duplicated IOASes (one for the GPU, one for other devices) and
issue every map request twice, once in each IOAS.

Does this option make sense?

btw fixing the abuse of IOMMU_CACHE is orthogonal to this uAPI
open anyway.

Thanks
Kevin

2021-10-21 15:02:03

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Oct 21, 2021 at 02:26:00AM +0000, Tian, Kevin wrote:
> > I'll leave it to Jean to confirm. If only coherent DMA can be used in
> > the guest on other platforms, suppose VFIO should not blindly set
> > IOMMU_CACHE and in concept it should deny assigning a non-coherent
> > device since no co-ordination with guest exists today.
>
> Jean, what's your opinion?

Yes a sanity check to prevent assigning non-coherent devices would be
good, though I'm not particularly worried about non-coherent devices. PCIe
on Arm should be coherent (according to the Base System Architecture). So
vfio-pci devices should be coherent, but vfio-platform and mdev are
case-by-case (hopefully all coherent since it concerns newer platforms).

More worrying, I thought we disabled No-Snoop for VFIO but I was wrong,
it's left enabled. On Arm I don't think userspace can perform the right
cache maintenance operations to maintain coherency with a device that
issues No-Snoop writes. Userspace can issue clean+invalidate but not
invalidate alone, so there is no equivalent to arch_sync_dma_for_cpu().
I think the worst that can happen is the device owner shooting itself in
the foot by using No-Snoop, but would it hurt to disable it?

Thanks,
Jean

2021-10-21 23:23:38

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Oct 21, 2021 at 03:58:02PM +0100, Jean-Philippe Brucker wrote:
> On Thu, Oct 21, 2021 at 02:26:00AM +0000, Tian, Kevin wrote:
> > > I'll leave it to Jean to confirm. If only coherent DMA can be used in
> > > the guest on other platforms, suppose VFIO should not blindly set
> > > IOMMU_CACHE and in concept it should deny assigning a non-coherent
> > > device since no co-ordination with guest exists today.
> >
> > Jean, what's your opinion?
>
> Yes a sanity check to prevent assigning non-coherent devices would be
> good, though I'm not particularly worried about non-coherent devices. PCIe
> on Arm should be coherent (according to the Base System Architecture). So
> vfio-pci devices should be coherent, but vfio-platform and mdev are
> case-by-case (hopefully all coherent since it concerns newer platforms).
>
> More worrying, I thought we disabled No-Snoop for VFIO but I was wrong,
> it's left enabled. On Arm I don't think userspace can perform the right
> cache maintenance operations to maintain coherency with a device that
> issues No-Snoop writes. Userspace can issue clean+invalidate but not
> invalidate alone, so there is no equivalent to
> arch_sync_dma_for_cpu().

So what happens in a VM? Does a VM know that arch_sync_dma_for_cpu()
is not available?

And how does this work with the nested IOMMU translation? I thought I
read in the SMMU spec that the io page table entries could control
cachability including in nesting cases?

> I think the worse that can happen is the device owner shooting itself in
> the foot by using No-Snoop, but would it hurt to disable it?

No, the worst is the same as Intel - a driver running in the guest VM
assumes it can use arch_sync_dma_for_cpu() and acts accordingly,
resulting in a broken VM.

Jason

2021-10-21 23:32:54

by Jason Gunthorpe

Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Oct 21, 2021 at 02:26:00AM +0000, Tian, Kevin wrote:

> But in reality only Intel integrated GPUs have this special no-snoop
> trick (fixed knowledge), with a dedicated IOMMU which doesn't
> support enforce-snoop format at all. In this case there is no choice
> that the user can further make.

huh? That is not true at all. no-snoop is a PCIe spec behavior, any
device can trigger it

What is true today is that only Intel GPU drivers are crazy enough to
use it on Linux without platform support.

> Also per Christoph's comment no-snoop is not an encouraged
> usage overall.

I wouldn't say that, I think Christoph said using it without API
support through the DMA layer is very wrong.

DMA layer support could be added if there was interest, all the pieces
are there to do it.

> Given that I wonder whether the current vfio model better suites for
> this corner case, i.e. just let the kernel to handle instead of
> exposing it in uAPI. The simple policy (as vfio does) is to
> automatically set enforce-snoop when the target IOMMU supports it,
> otherwise enable vfio/kvm contract to handle no-snoop requirement.

IMHO you need to model it as the KVM people said - if KVM can execute
a real wbinvd in a VM then an ioctl should be available to normal
userspace to run the same instruction.

So, figure out some rules to add a wbinvd ioctl to iommufd that makes
some kind of sense and logically kvm is just triggering that ioctl,
including whatever security model protects it.

I have no idea what security model makes sense for wbinvd, that is the
major question you have to answer.

And obviously none of this should be hidden behind a private API to
KVM.

> I don't see any interest in implementing an Intel GPU driver fully
> in userspace. If just talking about possibility, a separate uAPI can
> be still introduced to allow the userspace to issue wbinvd as Paolo
> suggested.
>
> One side-effect of doing so is that then we may have to support
> multiple domains per IOAS when Intel GPU and other devices are
> attached to the same IOAS.

I think we already said the IOAS should represent a single IO page
table layout?

So if there is a need for incompatible layouts then the IOAS should be
duplicated.

Otherwise, I also think the iommu core code should eventually learn to
share the io page table across HW instances. Eg ARM has a similar
efficiency issue if there are multiple SMMU HW blocks.

Jason

2021-10-22 03:13:15

by Tian, Kevin

Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, October 22, 2021 7:31 AM
>
> On Thu, Oct 21, 2021 at 02:26:00AM +0000, Tian, Kevin wrote:
>
> > But in reality only Intel integrated GPUs have this special no-snoop
> > trick (fixed knowledge), with a dedicated IOMMU which doesn't
> > support enforce-snoop format at all. In this case there is no choice
> > that the user can further make.
>
> huh? That is not true at all. no-snoop is a PCIe spec behavior, any
> device can trigger it

yes, I should say Intel GPU 'drivers'.

>
> What is true today is that only Intel GPU drivers are crazy enough to
> use it on Linux without platform support.
>
> > Also per Christoph's comment no-snoop is not an encouraged
> > usage overall.
>
> I wouldn't say that, I think Christoph said using it without API
> support through the DMA layer is very wrong.

ok, sounds like I drew a wrong impression from the previous discussion.

>
> DMA layer support could be added if there was interest, all the pieces
> are there to do it.
>
> > Given that I wonder whether the current vfio model better suites for
> > this corner case, i.e. just let the kernel to handle instead of
> > exposing it in uAPI. The simple policy (as vfio does) is to
> > automatically set enforce-snoop when the target IOMMU supports it,
> > otherwise enable vfio/kvm contract to handle no-snoop requirement.
>
> IMHO you need to model it as the KVM people said - if KVM can execute
> a real wbinvd in a VM then an ioctl should be available to normal
> userspace to run the same instruction.
>
> So, figure out some rules to add a wbinvd ioctl to iommufd that makes
> some kind of sense and logically kvm is just triggering that ioctl,
> including whatever security model protects it.

The wbinvd instruction is x86-specific. Here we'd want a generic cache
invalidation ioctl and then some form of arch callbacks, though x86 is
the only platform concerned for now.

>
> I have no idea what security model makes sense for wbinvd, that is the
> major question you have to answer.

wbinvd flushes the entire cache of the local cpu. It's more a performance
isolation problem, but nothing can prevent it once the user is allowed
to call this ioctl. This is the main reason why wbinvd is a privileged
instruction and is emulated by kvm as a nop unless an assigned device
has a no-snoop requirement. Alternatively the user may call clflush,
which is unprivileged and can invalidate a specific cache line, though
it is not efficient for flushing a big buffer.

One tricky thing is that the process might be scheduled to different
cpus between writing buffers and calling the wbinvd ioctl. Since wbinvd
only has local behavior, it requires the ioctl to call wbinvd on all
cpus that this process has previously been scheduled on.

kvm maintains a dirty cpu mask in its preempt notifier (see
kvm_sched_in/out).

Is there any concern if iommufd also follows the same mechanism?
Currently it looks like the preempt notifier is only used by kvm; not
sure whether there are strong criteria around using it. And this local
behavior may not apply to all platforms (then better hidden behind an
arch callback?)

>
> And obviously none of this should be hidden behind a private API to
> KVM.
>
> > I don't see any interest in implementing an Intel GPU driver fully
> > in userspace. If just talking about possibility, a separate uAPI can
> > be still introduced to allow the userspace to issue wbinvd as Paolo
> > suggested.
> >
> > One side-effect of doing so is that then we may have to support
> > multiple domains per IOAS when Intel GPU and other devices are
> > attached to the same IOAS.
>
> I think we already said the IOAS should represent a single IO page
> table layout?

yes. I was just talking about a tradeoff possibility, if the
aforementioned option were feasible. Based on the above discussion we
will return to this one-ioas-one-layout model.

>
> So if there is a need for incompatible layouts then the IOAS should be
> duplicated.
>
> Otherwise, I also think the iommu core code should eventually learn to
> share the io page table across HW instances. Eg ARM has a similar
> efficiency issue if there are multiple SMMU HW blocks.
>

Or we may introduce an alias ioas concept, where any change on one
ioas is automatically replayed on the alias ioas, if two ioas's are
created just due to incompatible layouts.

Thanks
Kevin

2021-10-22 07:54:27

by Jean-Philippe Brucker

Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Oct 21, 2021 at 08:22:23PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 21, 2021 at 03:58:02PM +0100, Jean-Philippe Brucker wrote:
> > On Thu, Oct 21, 2021 at 02:26:00AM +0000, Tian, Kevin wrote:
> > > > I'll leave it to Jean to confirm. If only coherent DMA can be used in
> > > > the guest on other platforms, suppose VFIO should not blindly set
> > > > IOMMU_CACHE and in concept it should deny assigning a non-coherent
> > > > device since no co-ordination with guest exists today.
> > >
> > > Jean, what's your opinion?
> >
> > Yes a sanity check to prevent assigning non-coherent devices would be
> > good, though I'm not particularly worried about non-coherent devices. PCIe
> > on Arm should be coherent (according to the Base System Architecture). So
> > vfio-pci devices should be coherent, but vfio-platform and mdev are
> > case-by-case (hopefully all coherent since it concerns newer platforms).
> >
> > More worrying, I thought we disabled No-Snoop for VFIO but I was wrong,
> > it's left enabled. On Arm I don't think userspace can perform the right
> > cache maintenance operations to maintain coherency with a device that
> > issues No-Snoop writes. Userspace can issue clean+invalidate but not
> > invalidate alone, so there is no equivalent to
> > arch_sync_dma_for_cpu().
>
> So what happens in a VM? Does a VM know that arch_sync_dma_for_cpu()
> is not available?

This would only affect userspace drivers; it's only host or guest
userspace that cannot issue the maintenance operations. The VM can do
arch_sync_dma_for_cpu().

Thanks,
Jean

>
> And how does this work with the nested IOMMU translation? I thought I
> read in the SMMU spec that the io page table entries could control
> cachability including in nesting cases?
>
> > I think the worse that can happen is the device owner shooting itself in
> > the foot by using No-Snoop, but would it hurt to disable it?
>
> No, the worst is the same as Intel - a driver running in the guest VM
> assumes it can use arch_sync_dma_for_cpu() and acts accordingly,
> resulting in a broken VM.
>
> Jason

2021-10-25 16:52:46

by Jason Gunthorpe

Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Fri, Oct 22, 2021 at 08:49:03AM +0100, Jean-Philippe Brucker wrote:
> On Thu, Oct 21, 2021 at 08:22:23PM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 21, 2021 at 03:58:02PM +0100, Jean-Philippe Brucker wrote:
> > > On Thu, Oct 21, 2021 at 02:26:00AM +0000, Tian, Kevin wrote:
> > > > > I'll leave it to Jean to confirm. If only coherent DMA can be used in
> > > > > the guest on other platforms, suppose VFIO should not blindly set
> > > > > IOMMU_CACHE and in concept it should deny assigning a non-coherent
> > > > > device since no co-ordination with guest exists today.
> > > >
> > > > Jean, what's your opinion?
> > >
> > > Yes a sanity check to prevent assigning non-coherent devices would be
> > > good, though I'm not particularly worried about non-coherent devices. PCIe
> > > on Arm should be coherent (according to the Base System Architecture). So
> > > vfio-pci devices should be coherent, but vfio-platform and mdev are
> > > case-by-case (hopefully all coherent since it concerns newer platforms).
> > >
> > > More worrying, I thought we disabled No-Snoop for VFIO but I was wrong,
> > > it's left enabled. On Arm I don't think userspace can perform the right
> > > cache maintenance operations to maintain coherency with a device that
> > > issues No-Snoop writes. Userspace can issue clean+invalidate but not
> > > invalidate alone, so there is no equivalent to
> > > arch_sync_dma_for_cpu().
> >
> > So what happens in a VM? Does a VM know that arch_sync_dma_for_cpu()
> > is not available?
>
> This would only affect userspace drivers, it's only host or guest
> userspace that cannot issue the maintenance operations. The VM can do
> arch_sync_dma_for_cpu()

This seems out of sync with what the KVM people asked for - any
operation a VM can do should be doable inside a normal process as
well?

Jason

2021-10-26 02:59:19

by Jason Gunthorpe

Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Fri, Oct 22, 2021 at 03:08:06AM +0000, Tian, Kevin wrote:

> > I have no idea what security model makes sense for wbinvd, that is the
> > major question you have to answer.
>
> wbinvd flushes the entire cache in local cpu. It's more a performance
> isolation problem but nothing can prevent it once the user is allowed
> to call this ioctl. This is the main reason why wbinvd is a privileged
> instruction and is emulated by kvm as a nop unless an assigned device
> has no-snoop requirement. alternatively the user may call clflush
> which is unprivileged and can invalidate a specific cache line, though
> not efficient for flushing a big buffer.
>
> One tricky thing is that the process might be scheduled to different
> cpus between writing buffers and calling wbinvd ioctl. Since wbvind
> only has local behavior, it requires the ioctl to call wbinvd on all
> cpus that this process has previously been scheduled on.

That is such a hassle, you may want to re-open this with the kvm
people as it seems ARM also has different behavior between VM and
process here.

The ideal is already not being met, so maybe we can keep special
casing cache ops?

> Is there any concern if iommufd also follows the same mechanism?
> Currently looks preempt notifier is only used by kvm. Not sure whether
> there is strong criteria around using it. and this local behavior may
> not apply to all platforms (then better hidden behind arch callback?)

I don't have any desire to see a performance cost to implement an
ioctl that nothing will ever call just to satisfy an idealized target
from the kvm folks...

Jason

2021-10-27 15:27:59

by Tian, Kevin

Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, October 26, 2021 7:35 AM
>
> On Fri, Oct 22, 2021 at 03:08:06AM +0000, Tian, Kevin wrote:
>
> > > I have no idea what security model makes sense for wbinvd, that is the
> > > major question you have to answer.
> >
> > wbinvd flushes the entire cache in local cpu. It's more a performance
> > isolation problem but nothing can prevent it once the user is allowed
> > to call this ioctl. This is the main reason why wbinvd is a privileged
> > instruction and is emulated by kvm as a nop unless an assigned device
> > has no-snoop requirement. alternatively the user may call clflush
> > which is unprivileged and can invalidate a specific cache line, though
> > not efficient for flushing a big buffer.
> >
> > One tricky thing is that the process might be scheduled to different
> > cpus between writing buffers and calling wbinvd ioctl. Since wbvind
> > only has local behavior, it requires the ioctl to call wbinvd on all
> > cpus that this process has previously been scheduled on.
>
> That is such a hassle, you may want to re-open this with the kvm
> people as it seems ARM also has different behavior between VM and
> process here.
>
> The ideal is already not being met, so maybe we can keep special
> casing cache ops?
>

will do.

2021-10-28 02:09:25

by Tian, Kevin

Subject: RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, October 26, 2021 7:35 AM
>
> On Fri, Oct 22, 2021 at 03:08:06AM +0000, Tian, Kevin wrote:
>
> > > I have no idea what security model makes sense for wbinvd, that is the
> > > major question you have to answer.
> >
> > wbinvd flushes the entire cache in local cpu. It's more a performance
> > isolation problem but nothing can prevent it once the user is allowed
> > to call this ioctl. This is the main reason why wbinvd is a privileged
> > instruction and is emulated by kvm as a nop unless an assigned device
> > has no-snoop requirement. alternatively the user may call clflush
> > which is unprivileged and can invalidate a specific cache line, though
> > not efficient for flushing a big buffer.
> >
> > One tricky thing is that the process might be scheduled to different
> > cpus between writing buffers and calling wbinvd ioctl. Since wbvind
> > only has local behavior, it requires the ioctl to call wbinvd on all
> > cpus that this process has previously been scheduled on.
>
> That is such a hassle, you may want to re-open this with the kvm
> people as it seems ARM also has different behavior between VM and
> process here.
>
> The ideal is already not being met, so maybe we can keep special
> casing cache ops?
>

Now Paolo confirmed the wbinvd ioctl is just a thought experiment.

Then Jason, I'd like a clarification on 'keep special casing' here.

Did you mean adopting the vfio model which neither allows the user
to decide no-snoop format nor provides a wbinvd ioctl for the user
to manage buffers used for no-snoop traffic, or still wanting the user
to decide no-snoop format but not implementing a wbinvd ioctl?

The latter option sounds a bit incomplete from a uAPI p.o.v., but it
allows us to stay with the one-format-one-ioas policy. And anyway
userspace can still call clflush to do cacheline-based invalidation,
if necessary.

The former option would force us to support multi-formats-one-ioas.

In either case it's iommufd which decides and tells kvm whether wbinvd
is allowed for a process.

Thanks
Kevin

2021-10-29 13:58:17

by Jason Gunthorpe

Subject: Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

On Thu, Oct 28, 2021 at 02:07:46AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Tuesday, October 26, 2021 7:35 AM
> >
> > On Fri, Oct 22, 2021 at 03:08:06AM +0000, Tian, Kevin wrote:
> >
> > > > I have no idea what security model makes sense for wbinvd, that is the
> > > > major question you have to answer.
> > >
> > > wbinvd flushes the entire cache in local cpu. It's more a performance
> > > isolation problem but nothing can prevent it once the user is allowed
> > > to call this ioctl. This is the main reason why wbinvd is a privileged
> > > instruction and is emulated by kvm as a nop unless an assigned device
> > > has no-snoop requirement. alternatively the user may call clflush
> > > which is unprivileged and can invalidate a specific cache line, though
> > > not efficient for flushing a big buffer.
> > >
> > > One tricky thing is that the process might be scheduled to different
> > > cpus between writing buffers and calling wbinvd ioctl. Since wbvind
> > > only has local behavior, it requires the ioctl to call wbinvd on all
> > > cpus that this process has previously been scheduled on.
> >
> > That is such a hassle, you may want to re-open this with the kvm
> > people as it seems ARM also has different behavior between VM and
> > process here.
> >
> > The ideal is already not being met, so maybe we can keep special
> > casing cache ops?
> >
>
> Now Paolo confirmed wbinvd ioctl is just a thought experiment.
>
> Then Jason, want to have a clarification on 'keep special casing' here.
>
> Did you mean adopting the vfio model which neither allows the user
> to decide no-snoop format nor provides a wbinvd ioctl for the user
> to manage buffers used for no-snoop traffic, or still wanting the user
> to decide no-snoop format but not implementing a wbinvd ioctl?

IMHO if the wbinvd is just a thought experiment then userspace should
directly control the wbinvd enable and present the iommufd as 'proof'
to enable it.

The thought experiment tells us that iommufd should have a wbinvd
ioctl, even if we don't implement it today. So access to iommufd is
also access to wbinvd.

iommufd should control/report via Intel-specific areas whether the IOAS
is no-snoop blocking or not, so userspace can decide what it wants to do.

Jason