2021-09-19 12:17:52

by Yi Liu

Subject: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

This patch adds IOASID allocation/free interface per iommufd. When
allocating an IOASID, userspace is expected to specify the type and
format information for the target I/O page table.

This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
implying a kernel-managed I/O page table with vfio type1v2 mapping
semantics. For this type the user should specify the addr_width of
the I/O address space and whether the I/O page table is created in
an iommu enforce_snoop format. enforce_snoop must be true at this point,
as the false setting requires an additional contract with KVM on handling
WBINVD emulation, which can be added later.

Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
for what formats can be specified when allocating an IOASID.

Open:
- Devices on the PPC platform currently use a different iommu driver in vfio.
Per previous discussion they can also use vfio type1v2 as long as there
is a way to claim a specific iova range from a system-wide address space.
This requirement doesn't sound PPC specific, as addr_width for pci devices
can also be represented by a range [0, 2^addr_width-1]. This RFC hasn't
adopted this design yet. We hope to reach formal alignment in the v1
discussion and then decide how to incorporate it in v2.

- Currently ioasid term has already been used in the kernel (drivers/iommu/
ioasid.c) to represent the hardware I/O address space ID in the wire. It
covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
ID). We need to find a way to resolve the naming conflict between the hardware
ID and software handle. One option is to rename the existing ioasid to be
pasid or ssid, given their full names still sound generic. Appreciate more
thoughts on this open!

Signed-off-by: Liu Yi L <[email protected]>
---
drivers/iommu/iommufd/iommufd.c | 120 ++++++++++++++++++++++++++++++++
include/linux/iommufd.h | 3 +
include/uapi/linux/iommu.h | 54 ++++++++++++++
3 files changed, 177 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 641f199f2d41..4839f128b24a 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -24,6 +24,7 @@
struct iommufd_ctx {
refcount_t refs;
struct mutex lock;
+ struct xarray ioasid_xa; /* xarray of ioasids */
struct xarray device_xa; /* xarray of bound devices */
};

@@ -42,6 +43,16 @@ struct iommufd_device {
u64 dev_cookie;
};

+/* Represent an I/O address space */
+struct iommufd_ioas {
+ int ioasid;
+ u32 type;
+ u32 addr_width;
+ bool enforce_snoop;
+ struct iommufd_ctx *ictx;
+ refcount_t refs;
+};
+
static int iommufd_fops_open(struct inode *inode, struct file *filep)
{
struct iommufd_ctx *ictx;
@@ -53,6 +64,7 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)

refcount_set(&ictx->refs, 1);
mutex_init(&ictx->lock);
+ xa_init_flags(&ictx->ioasid_xa, XA_FLAGS_ALLOC);
xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
filep->private_data = ictx;

@@ -102,16 +114,118 @@ static void iommufd_ctx_put(struct iommufd_ctx *ictx)
if (!refcount_dec_and_test(&ictx->refs))
return;

+ WARN_ON(!xa_empty(&ictx->ioasid_xa));
WARN_ON(!xa_empty(&ictx->device_xa));
kfree(ictx);
}

+/* Caller should hold ictx->lock */
+static void ioas_put_locked(struct iommufd_ioas *ioas)
+{
+ struct iommufd_ctx *ictx = ioas->ictx;
+ int ioasid = ioas->ioasid;
+
+ if (!refcount_dec_and_test(&ioas->refs))
+ return;
+
+ xa_erase(&ictx->ioasid_xa, ioasid);
+ iommufd_ctx_put(ictx);
+ kfree(ioas);
+}
+
+static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
+{
+ struct iommu_ioasid_alloc req;
+ struct iommufd_ioas *ioas;
+ unsigned long minsz;
+ int ioasid, ret;
+
+ minsz = offsetofend(struct iommu_ioasid_alloc, addr_width);
+
+ if (copy_from_user(&req, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (req.argsz < minsz || !req.addr_width ||
+ req.flags != IOMMU_IOASID_ENFORCE_SNOOP ||
+ req.type != IOMMU_IOASID_TYPE_KERNEL_TYPE1V2)
+ return -EINVAL;
+
+ ioas = kzalloc(sizeof(*ioas), GFP_KERNEL);
+ if (!ioas)
+ return -ENOMEM;
+
+ mutex_lock(&ictx->lock);
+ ret = xa_alloc(&ictx->ioasid_xa, &ioasid, ioas,
+ XA_LIMIT(IOMMUFD_IOASID_MIN, IOMMUFD_IOASID_MAX),
+ GFP_KERNEL);
+ mutex_unlock(&ictx->lock);
+ if (ret) {
+ pr_err_ratelimited("Failed to alloc ioasid\n");
+ kfree(ioas);
+ return ret;
+ }
+
+ ioas->ioasid = ioasid;
+
+ /* only supports kernel managed I/O page table so far */
+ ioas->type = IOMMU_IOASID_TYPE_KERNEL_TYPE1V2;
+
+ ioas->addr_width = req.addr_width;
+
+ /* only supports enforce snoop today */
+ ioas->enforce_snoop = true;
+
+ iommufd_ctx_get(ictx);
+ ioas->ictx = ictx;
+
+ refcount_set(&ioas->refs, 1);
+
+ return ioasid;
+}
+
+static int iommufd_ioasid_free(struct iommufd_ctx *ictx, unsigned long arg)
+{
+ struct iommufd_ioas *ioas = NULL;
+ int ioasid, ret = 0;
+
+ if (copy_from_user(&ioasid, (void __user *)arg, sizeof(ioasid)))
+ return -EFAULT;
+
+ if (ioasid < 0)
+ return -EINVAL;
+
+ mutex_lock(&ictx->lock);
+ ioas = xa_load(&ictx->ioasid_xa, ioasid);
+ if (!ioas) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ /* Disallow free if refcount is not 1 */
+ if (refcount_read(&ioas->refs) > 1) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
+ ioas_put_locked(ioas);
+out_unlock:
+ mutex_unlock(&ictx->lock);
+ return ret;
+}
+
static int iommufd_fops_release(struct inode *inode, struct file *filep)
{
struct iommufd_ctx *ictx = filep->private_data;
+ struct iommufd_ioas *ioas;
+ unsigned long index;

filep->private_data = NULL;

+ mutex_lock(&ictx->lock);
+ xa_for_each(&ictx->ioasid_xa, index, ioas)
+ ioas_put_locked(ioas);
+ mutex_unlock(&ictx->lock);
+
iommufd_ctx_put(ictx);

return 0;
@@ -195,6 +309,12 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
case IOMMU_DEVICE_GET_INFO:
ret = iommufd_get_device_info(ictx, arg);
break;
+ case IOMMU_IOASID_ALLOC:
+ ret = iommufd_ioasid_alloc(ictx, arg);
+ break;
+ case IOMMU_IOASID_FREE:
+ ret = iommufd_ioasid_free(ictx, arg);
+ break;
default:
pr_err_ratelimited("unsupported cmd %u\n", cmd);
break;
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 1603a13937e9..1dd6515e7816 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -14,6 +14,9 @@
#include <linux/err.h>
#include <linux/device.h>

+#define IOMMUFD_IOASID_MAX ((unsigned int)(0x7FFFFFFF))
+#define IOMMUFD_IOASID_MIN 0
+
#define IOMMUFD_DEVID_MAX ((unsigned int)(0x7FFFFFFF))
#define IOMMUFD_DEVID_MIN 0

diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 76b71f9d6b34..5cbd300eb0ee 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -57,6 +57,60 @@ struct iommu_device_info {

#define IOMMU_DEVICE_GET_INFO _IO(IOMMU_TYPE, IOMMU_BASE + 1)

+/*
+ * IOMMU_IOASID_ALLOC - _IOWR(IOMMU_TYPE, IOMMU_BASE + 2,
+ * struct iommu_ioasid_alloc)
+ *
+ * Allocate an IOASID.
+ *
+ * IOASID is the FD-local software handle representing an I/O address
+ * space. Each IOASID is associated with a single I/O page table. User
+ * must call this ioctl to get an IOASID for every I/O address space
+ * that is intended to be tracked by the kernel.
+ *
+ * The user must specify the attributes of the IOASID and the I/O page
+ * table format information according to the device(s) that will be
+ * attached to this IOASID right afterwards. The I/O page table is
+ * activated in the IOMMU when a device is attached to it. An
+ * incompatible format between a device and the IOASID leads to an
+ * attach failure on the device side.
+ *
+ * Currently only one flag (IOMMU_IOASID_ENFORCE_SNOOP) is supported
+ * and must always be set.
+ *
+ * Only one I/O page table type (kernel-managed) is supported, with vfio
+ * type1v2 mapping semantics.
+ *
+ * User should call IOMMU_CHECK_EXTENSION for future extensions.
+ *
+ * @argsz: user filled size of this data.
+ * @flags: additional information for IOASID allocation.
+ * @type: I/O address space page table type.
+ * @addr_width: address width of the I/O address space.
+ *
+ * Return: allocated ioasid on success, -errno on failure.
+ */
+struct iommu_ioasid_alloc {
+ __u32 argsz;
+ __u32 flags;
+#define IOMMU_IOASID_ENFORCE_SNOOP (1 << 0)
+ __u32 type;
+#define IOMMU_IOASID_TYPE_KERNEL_TYPE1V2 1
+ __u32 addr_width;
+};
+
+#define IOMMU_IOASID_ALLOC _IO(IOMMU_TYPE, IOMMU_BASE + 2)
+
+/**
+ * IOMMU_IOASID_FREE - _IOWR(IOMMU_TYPE, IOMMU_BASE + 3, int)
+ *
+ * Free an IOASID.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+
+#define IOMMU_IOASID_FREE _IO(IOMMU_TYPE, IOMMU_BASE + 3)
+
#define IOMMU_FAULT_PERM_READ (1 << 0) /* read */
#define IOMMU_FAULT_PERM_WRITE (1 << 1) /* write */
#define IOMMU_FAULT_PERM_EXEC (1 << 2) /* exec */
--
2.25.1


2021-09-21 17:46:15

by Jason Gunthorpe

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> This patch adds IOASID allocation/free interface per iommufd. When
> allocating an IOASID, userspace is expected to specify the type and
> format information for the target I/O page table.
>
> This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> implying a kernel-managed I/O page table with vfio type1v2 mapping
> semantics. For this type the user should specify the addr_width of
> the I/O address space and whether the I/O page table is created in
> an iommu enfore_snoop format. enforce_snoop must be true at this point,
> as the false setting requires additional contract with KVM on handling
> WBINVD emulation, which can be added later.
>
> Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> for what formats can be specified when allocating an IOASID.
>
> Open:
> - Devices on PPC platform currently use a different iommu driver in vfio.
> Per previous discussion they can also use vfio type1v2 as long as there
> is a way to claim a specific iova range from a system-wide address space.
> This requirement doesn't sound PPC specific, as addr_width for pci devices
> can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> adopted this design yet. We hope to have formal alignment in v1 discussion
> and then decide how to incorporate it in v2.

I think the request was to include a start/end IO address hint when
creating the ios. When the kernel creates it then it can return the
actual geometry including any holes via a query.

> - Currently ioasid term has already been used in the kernel (drivers/iommu/
> ioasid.c) to represent the hardware I/O address space ID in the wire. It
> covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
> ID). We need find a way to resolve the naming conflict between the hardware
> ID and software handle. One option is to rename the existing ioasid to be
> pasid or ssid, given their full names still sound generic. Appreciate more
> thoughts on this open!

ioas works well here I think. Use ioas_id to refer to the xarray
index.

> Signed-off-by: Liu Yi L <[email protected]>
> drivers/iommu/iommufd/iommufd.c | 120 ++++++++++++++++++++++++++++++++
> include/linux/iommufd.h | 3 +
> include/uapi/linux/iommu.h | 54 ++++++++++++++
> 3 files changed, 177 insertions(+)
>
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index 641f199f2d41..4839f128b24a 100644
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -24,6 +24,7 @@
> struct iommufd_ctx {
> refcount_t refs;
> struct mutex lock;
> + struct xarray ioasid_xa; /* xarray of ioasids */
> struct xarray device_xa; /* xarray of bound devices */
> };
>
> @@ -42,6 +43,16 @@ struct iommufd_device {
> u64 dev_cookie;
> };
>
> +/* Represent an I/O address space */
> +struct iommufd_ioas {
> + int ioasid;

xarray id's should consistently be u32s everywhere.

Many of the same prior comments repeated here

Jason

2021-09-22 03:41:34

by Tian, Kevin

Subject: RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, September 22, 2021 1:45 AM
>
> On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > This patch adds IOASID allocation/free interface per iommufd. When
> > allocating an IOASID, userspace is expected to specify the type and
> > format information for the target I/O page table.
> >
> > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > semantics. For this type the user should specify the addr_width of
> > the I/O address space and whether the I/O page table is created in
> > an iommu enfore_snoop format. enforce_snoop must be true at this point,
> > as the false setting requires additional contract with KVM on handling
> > WBINVD emulation, which can be added later.
> >
> > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > for what formats can be specified when allocating an IOASID.
> >
> > Open:
> > - Devices on PPC platform currently use a different iommu driver in vfio.
> > Per previous discussion they can also use vfio type1v2 as long as there
> > is a way to claim a specific iova range from a system-wide address space.
> > This requirement doesn't sound PPC specific, as addr_width for pci
> devices
> > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > adopted this design yet. We hope to have formal alignment in v1
> discussion
> > and then decide how to incorporate it in v2.
>
> I think the request was to include a start/end IO address hint when
> creating the ios. When the kernel creates it then it can return the

is the hint a single range or could it be multiple ranges?

> actual geometry including any holes via a query.

I'd like to see a detailed flow from David on how the uAPI works today with
the existing spapr driver and what exact changes he'd like to make to this
proposed interface. The above info is still insufficient for us to think
about the right solution.

>
> > - Currently ioasid term has already been used in the kernel
> (drivers/iommu/
> > ioasid.c) to represent the hardware I/O address space ID in the wire. It
> > covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-
> Stream
> > ID). We need find a way to resolve the naming conflict between the
> hardware
> > ID and software handle. One option is to rename the existing ioasid to be
> > pasid or ssid, given their full names still sound generic. Appreciate more
> > thoughts on this open!
>
> ioas works well here I think. Use ioas_id to refer to the xarray
> index.

What about when introducing pasid to this uAPI? Then use ioas_id
for the xarray index and ioasid to represent pasid/ssid? At this point
the software handle and hardware id are mixed together and thus need
clear terminology to differentiate them.


Thanks
Kevin

2021-09-22 12:55:36

by Yi Liu

Subject: RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, September 22, 2021 1:45 AM
>
[...]
> > diff --git a/drivers/iommu/iommufd/iommufd.c
> b/drivers/iommu/iommufd/iommufd.c
> > index 641f199f2d41..4839f128b24a 100644
> > +++ b/drivers/iommu/iommufd/iommufd.c
> > @@ -24,6 +24,7 @@
> > struct iommufd_ctx {
> > refcount_t refs;
> > struct mutex lock;
> > + struct xarray ioasid_xa; /* xarray of ioasids */
> > struct xarray device_xa; /* xarray of bound devices */
> > };
> >
> > @@ -42,6 +43,16 @@ struct iommufd_device {
> > u64 dev_cookie;
> > };
> >
> > +/* Represent an I/O address space */
> > +struct iommufd_ioas {
> > + int ioasid;
>
> xarray id's should consistently be u32s everywhere.

sure. just one more check, this id is supposed to be returned to
userspace as the return value of ioctl(IOASID_ALLOC). That's why
I chose to use "int" as its prototype to make it aligned with the
return type of ioctl(). Based on this, do you think it's still better
to use "u32" here?

Regards,
Yi Liu

> Many of the same prior comments repeated here
>
> Jason

2021-09-22 13:36:39

by Jason Gunthorpe

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Wed, Sep 22, 2021 at 12:51:38PM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, September 22, 2021 1:45 AM
> >
> [...]
> > > diff --git a/drivers/iommu/iommufd/iommufd.c
> > b/drivers/iommu/iommufd/iommufd.c
> > > index 641f199f2d41..4839f128b24a 100644
> > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > @@ -24,6 +24,7 @@
> > > struct iommufd_ctx {
> > > refcount_t refs;
> > > struct mutex lock;
> > > + struct xarray ioasid_xa; /* xarray of ioasids */
> > > struct xarray device_xa; /* xarray of bound devices */
> > > };
> > >
> > > @@ -42,6 +43,16 @@ struct iommufd_device {
> > > u64 dev_cookie;
> > > };
> > >
> > > +/* Represent an I/O address space */
> > > +struct iommufd_ioas {
> > > + int ioasid;
> >
> > xarray id's should consistently be u32s everywhere.
>
> sure. just one more check, this id is supposed to be returned to
> userspace as the return value of ioctl(IOASID_ALLOC). That's why
> I chose to use "int" as its prototype to make it aligned with the
> return type of ioctl(). Based on this, do you think it's still better
> to use "u32" here?

I suggest not using the return code from ioctl to exchange data.. The
rest of the uAPI uses an in/out struct, everything should do
that consistently.

Jason

2021-09-22 13:47:36

by Jean-Philippe Brucker

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> This patch adds IOASID allocation/free interface per iommufd. When
> allocating an IOASID, userspace is expected to specify the type and
> format information for the target I/O page table.
>
> This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> implying a kernel-managed I/O page table with vfio type1v2 mapping
> semantics. For this type the user should specify the addr_width of
> the I/O address space and whether the I/O page table is created in
> an iommu enfore_snoop format. enforce_snoop must be true at this point,
> as the false setting requires additional contract with KVM on handling
> WBINVD emulation, which can be added later.
>
> Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> for what formats can be specified when allocating an IOASID.
>
> Open:
> - Devices on PPC platform currently use a different iommu driver in vfio.
> Per previous discussion they can also use vfio type1v2 as long as there
> is a way to claim a specific iova range from a system-wide address space.

Is this the reason for passing addr_width to IOASID_ALLOC? I didn't get
what it's used for or why it's mandatory. But for PPC it sounds like it
should be an address range instead of an upper limit?

Thanks,
Jean

> This requirement doesn't sound PPC specific, as addr_width for pci devices
> can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> adopted this design yet. We hope to have formal alignment in v1 discussion
> and then decide how to incorporate it in v2.
>
> - Currently ioasid term has already been used in the kernel (drivers/iommu/
> ioasid.c) to represent the hardware I/O address space ID in the wire. It
> covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
> ID). We need find a way to resolve the naming conflict between the hardware
> ID and software handle. One option is to rename the existing ioasid to be
> pasid or ssid, given their full names still sound generic. Appreciate more
> thoughts on this open!

2021-09-22 14:14:17

by Jason Gunthorpe

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Wed, Sep 22, 2021 at 03:40:25AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, September 22, 2021 1:45 AM
> >
> > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > This patch adds IOASID allocation/free interface per iommufd. When
> > > allocating an IOASID, userspace is expected to specify the type and
> > > format information for the target I/O page table.
> > >
> > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > semantics. For this type the user should specify the addr_width of
> > > the I/O address space and whether the I/O page table is created in
> > > an iommu enfore_snoop format. enforce_snoop must be true at this point,
> > > as the false setting requires additional contract with KVM on handling
> > > WBINVD emulation, which can be added later.
> > >
> > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > for what formats can be specified when allocating an IOASID.
> > >
> > > Open:
> > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > Per previous discussion they can also use vfio type1v2 as long as there
> > > is a way to claim a specific iova range from a system-wide address space.
> > > This requirement doesn't sound PPC specific, as addr_width for pci
> > devices
> > > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > adopted this design yet. We hope to have formal alignment in v1
> > discussion
> > > and then decide how to incorporate it in v2.
> >
> > I think the request was to include a start/end IO address hint when
> > creating the ios. When the kernel creates it then it can return the
>
> is the hint a single range or could it be multiple ranges?

David explained it here:

https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/

qemu needs to be able to choose if it gets the 32-bit range or the 64-bit
range.

So a 'range hint' will do the job

David also suggested this:

https://lore.kernel.org/kvm/YL6%2FbjHyuHJTn4Rd@yekko/

So I like this better:

struct iommu_ioasid_alloc {
__u32 argsz;

__u32 flags;
#define IOMMU_IOASID_ENFORCE_SNOOP (1 << 0)
#define IOMMU_IOASID_HINT_BASE_IOVA (1 << 1)

__aligned_u64 max_iova_hint;
__aligned_u64 base_iova_hint; // Used only if IOMMU_IOASID_HINT_BASE_IOVA

// For creating nested page tables
__u32 parent_ios_id;
__u32 format;
#define IOMMU_FORMAT_KERNEL 0
#define IOMMU_FORMAT_PPC_XXX 2
#define IOMMU_FORMAT_[..]
u32 format_flags; // Layout depends on format above

__aligned_u64 user_page_directory; // Used if parent_ios_id != 0
};

Again 'type' as an overall API indicator should not exist, feature
flags need to have clear narrow meanings.

This does both of David's suggestions at once. If qemu wants the 1G
limited region it could specify max_iova_hint = 1G, if it wants the
extended 64-bit region with the hole it can give either the high base or
a large max_iova_hint. format/format_flags allows a further
device-specific escape if more specific customization is needed and is
needed to specify user space page tables anyhow.

> > ioas works well here I think. Use ioas_id to refer to the xarray
> > index.
>
> What about when introducing pasid to this uAPI? Then use ioas_id
> for the xarray index

Yes, ioas_id should always be the xarray index.

PASID needs to be called out as PASID or as a generic "hw description"
blob.

kvm's API to program the vPASID translation table should probably take
in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
information using an in-kernel API. Userspace shouldn't have to
shuttle it around.

I'm starting to feel like the struct approach for describing this uAPI
might not scale well, but lets see..

Jason

2021-09-23 06:30:24

by Yi Liu

Subject: RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, September 22, 2021 9:32 PM
>
> On Wed, Sep 22, 2021 at 12:51:38PM +0000, Liu, Yi L wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Wednesday, September 22, 2021 1:45 AM
> > >
> > [...]
> > > > diff --git a/drivers/iommu/iommufd/iommufd.c
> > > b/drivers/iommu/iommufd/iommufd.c
> > > > index 641f199f2d41..4839f128b24a 100644
> > > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > > @@ -24,6 +24,7 @@
> > > > struct iommufd_ctx {
> > > > refcount_t refs;
> > > > struct mutex lock;
> > > > + struct xarray ioasid_xa; /* xarray of ioasids */
> > > > struct xarray device_xa; /* xarray of bound devices */
> > > > };
> > > >
> > > > @@ -42,6 +43,16 @@ struct iommufd_device {
> > > > u64 dev_cookie;
> > > > };
> > > >
> > > > +/* Represent an I/O address space */
> > > > +struct iommufd_ioas {
> > > > + int ioasid;
> > >
> > > xarray id's should consistently be u32s everywhere.
> >
> > sure. just one more check, this id is supposed to be returned to
> > userspace as the return value of ioctl(IOASID_ALLOC). That's why
> > I chose to use "int" as its prototype to make it aligned with the
> > return type of ioctl(). Based on this, do you think it's still better
> > to use "u32" here?
>
> I suggest not using the return code from ioctl to exchange data.. The
> rest of the uAPI uses an in/out struct, everything should do
> that consistently.

got it.

Thanks,
Yi Liu

2021-09-23 09:16:33

by Tian, Kevin

Subject: RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, September 22, 2021 10:09 PM
>
> On Wed, Sep 22, 2021 at 03:40:25AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Wednesday, September 22, 2021 1:45 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > allocating an IOASID, userspace is expected to specify the type and
> > > > format information for the target I/O page table.
> > > >
> > > > This RFC supports only one type
> (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > semantics. For this type the user should specify the addr_width of
> > > > the I/O address space and whether the I/O page table is created in
> > > > an iommu enfore_snoop format. enforce_snoop must be true at this
> point,
> > > > as the false setting requires additional contract with KVM on handling
> > > > WBINVD emulation, which can be added later.
> > > >
> > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next
> patch)
> > > > for what formats can be specified when allocating an IOASID.
> > > >
> > > > Open:
> > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > > Per previous discussion they can also use vfio type1v2 as long as there
> > > > is a way to claim a specific iova range from a system-wide address
> space.
> > > > This requirement doesn't sound PPC specific, as addr_width for pci
> > > devices
> > > > can be also represented by a range [0, 2^addr_width-1]. This RFC
> hasn't
> > > > adopted this design yet. We hope to have formal alignment in v1
> > > discussion
> > > > and then decide how to incorporate it in v2.
> > >
> > > I think the request was to include a start/end IO address hint when
> > > creating the ios. When the kernel creates it then it can return the
> >
> > is the hint a single range or could it be multiple ranges?
>
> David explained it here:
>
> https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
>
> qemu needs to be able to choose if it gets the 32-bit range or the 64-bit
> range.
>
> So a 'range hint' will do the job
>
> David also suggested this:
>
> https://lore.kernel.org/kvm/YL6%2FbjHyuHJTn4Rd@yekko/
>
> So I like this better:
>
> struct iommu_ioasid_alloc {
> __u32 argsz;
>
> __u32 flags;
> #define IOMMU_IOASID_ENFORCE_SNOOP (1 << 0)
> #define IOMMU_IOASID_HINT_BASE_IOVA (1 << 1)
>
> __aligned_u64 max_iova_hint;
> __aligned_u64 base_iova_hint; // Used only if
> IOMMU_IOASID_HINT_BASE_IOVA
>
> // For creating nested page tables
> __u32 parent_ios_id;
> __u32 format;
> #define IOMMU_FORMAT_KERNEL 0
> #define IOMMU_FORMAT_PPC_XXX 2
> #define IOMMU_FORMAT_[..]
> u32 format_flags; // Layout depends on format above
>
> __aligned_u64 user_page_directory; // Used if parent_ios_id != 0
> };
>
> Again 'type' as an overall API indicator should not exist, feature
> flags need to have clear narrow meanings.

currently the type is aimed to differentiate three usages:

- kernel-managed I/O page table
- user-managed I/O page table
- shared I/O page table (e.g. with mm, or ept)

we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
indicator? Their difference is not about format.

>
> This does both of David's suggestions at once. If qemu wants the 1G
> limited region it could specify max_iova_hint = 1G, if it wants the
> extended 64-bit region with the hole it can give either the high base or
> a large max_iova_hint. format/format_flags allows a further

Dave's links didn't answer one puzzle of mine. Does PPC need accurate
range information, or is it OK with a large range including holes (letting
the kernel figure out where the holes are)?

> device-specific escape if more specific customization is needed and is
> needed to specify user space page tables anyhow.

and I didn't understand the 2nd link. How does a user-managed page
table come into this range-claim problem? I'm getting confused...

>
> > > ioas works well here I think. Use ioas_id to refer to the xarray
> > > index.
> >
> > What about when introducing pasid to this uAPI? Then use ioas_id
> > for the xarray index
>
> Yes, ioas_id should always be the xarray index.
>
> PASID needs to be called out as PASID or as a generic "hw description"
> blob.

ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid?

and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the
kernel. Do we want to clear this confusion? Or possibly it's fine because
ioas_id is never used outside of iommufd and iommufd doesn't directly
call ioasid_alloc() from ioasid.c?

>
> kvm's API to program the vPASID translation table should probably take
> in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
> information using an in-kernel API. Userspace shouldn't have to
> shuttle it around.

the vPASID info is carried in the VFIO_DEVICE_ATTACH_IOASID uAPI.
When kvm calls iommufd with the above tuple, vPASID->pPASID is
returned to kvm. So we still need a generic blob to represent
vPASID in the uAPI.

>
> I'm starting to feel like the struct approach for describing this uAPI
> might not scale well, but lets see..
>
> Jason

2021-09-23 12:11:08

by Jason Gunthorpe

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Thu, Sep 23, 2021 at 09:14:58AM +0000, Tian, Kevin wrote:

> currently the type is aimed to differentiate three usages:
>
> - kernel-managed I/O page table
> - user-managed I/O page table
> - shared I/O page table (e.g. with mm, or ept)

Creating a shared ios is something that should probably be a different
command.

> we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> indicator? Their difference is not about format.

Format should be

FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc

> Dave's links didn't answer one puzzle of mine. Does PPC need accurate
> range information, or is it OK with a large range including holes (letting
> the kernel figure out where the holes are)?

My impression was it only needed a way to select between the two
different cases as they are exclusive. I'd see this API as being a
hint and userspace should query the exact ranges to learn what was
actually created.

> > device-specific escape if more specific customization is needed and is
> > needed to specify user space page tables anyhow.
>
> and I didn't understand the 2nd link. How does a user-managed page
> table come into this range-claim problem? I'm getting confused...

PPC could also model it using a FORMAT_KERNEL_PPC_X, FORMAT_KERNEL_PPC_Y
though it is less nice..

> > Yes, ioas_id should always be the xarray index.
> >
> > PASID needs to be called out as PASID or as a generic "hw description"
> > blob.
>
> ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid?

ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
and it MUST be exposed in that format to be programmed into the PCI
device itself.

All of this should be able to support a userspace, like DPDK, creating
a PASID on its own without any special VFIO drivers.

- Open iommufd
- Attach the vfio device FD
- Request a PASID device id
- Create an ioas against the pasid device id
- Query the ioas for the PCI PASID #
- Program the HW to issue TLPs with the PASID
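The steps above can be sketched as an in-memory mock. Every struct and function here is a hypothetical stand-in for iommufd calls that do not exist yet; the point is only the order of operations and that the kernel, not userspace, picks the PASID value:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mock of an iommufd context; not a real kernel object. */
struct mock_iommufd {
    int      dev_attached;
    uint32_t next_pasid;      /* kernel-chosen PASID allocator */
    uint32_t ioas_pasid[8];   /* ioas id -> PCI PASID */
    uint32_t nr_ioas;
};

/* "Attach the vfio device FD" */
static void mock_attach_device(struct mock_iommufd *fd)
{
    fd->dev_attached = 1;
}

/* "Request a PASID device id": the kernel picks the PASID value. */
static uint32_t mock_request_pasid(struct mock_iommufd *fd)
{
    return fd->next_pasid++;
}

/* "Create an ioas against the pasid device id" */
static uint32_t mock_create_ioas(struct mock_iommufd *fd, uint32_t pasid)
{
    uint32_t id = fd->nr_ioas++;
    fd->ioas_pasid[id] = pasid;
    return id;
}

/* "Query the ioas for the PCI PASID #": the value programmed into HW TLPs. */
static uint32_t mock_query_pasid(struct mock_iommufd *fd, uint32_t ioas)
{
    return fd->ioas_pasid[ioas];
}
```

Nothing in this flow requires a vIOMMU or any vPASID translation, which is the point of the DPDK example.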

> and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the
> kernel. Do we want to clear this confusion? Or possibly it's fine because
> ioas_id is never used outside of iommufd and iommufd doesn't directly
> call ioasid_alloc() from ioasid.c?

As long as it is ioas_id and ioasid it is probably fine..

> > kvm's API to program the vPASID translation table should probably take
> > in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
> > information using an in-kernel API. Userspace shouldn't have to
> > shuttle it around.
>
> the vPASID info is carried in VFIO_DEVICE_ATTACH_IOASID uAPI.
> when kvm calls iommufd with above tuple, vPASID->pPASID is
> returned to kvm. So we still need a generic blob to represent
> vPASID in the uAPI.

I think you have to be clear about what the value is being used
for. Is it an IOMMU page table handle or is it a PCI PASID value?

AFAICT I think it is the former in the Intel scheme as the "vPASID" is
really about presenting a consistent IOMMU handle to the guest across
migration, it is not the value that shows up on the PCI bus.

Jason

2021-09-23 12:24:52

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, September 23, 2021 8:07 PM
>
> On Thu, Sep 23, 2021 at 09:14:58AM +0000, Tian, Kevin wrote:
>
> > currently the type is aimed to differentiate three usages:
> >
> > - kernel-managed I/O page table
> > - user-managed I/O page table
> > - shared I/O page table (e.g. with mm, or ept)
>
> Creating a shared ios is something that should probably be a different
> command.

why? I didn't understand the criteria here...

>
> > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good
> > indicator? their difference is not about format.
>
> Format should be
>
> FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc

INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?

>
> > Dave's links didn't answer one puzzle from me. Does PPC needs accurate
> > range information or be ok with a large range including holes (then let
> > the kernel to figure out where the holes locate)?
>
> My impression was it only needed a way to select between the two
> different cases as they are exclusive. I'd see this API as being a
> hint and userspace should query the exact ranges to learn what was
> actually created.

yes, the user can query the permitted range using DEVICE_GET_INFO.
But in the end, if the user wants two separate regions, I'm afraid
the underlying iommu driver wants to know the exact info. iirc PPC
has one global system address space shared by all devices. It is possible
that the user may want to claim range-A and range-C, with range-B
in-between claimed by another user. Then simply using one hint
range [A-lowend, C-highend] might not work.
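This can be made concrete with a small sketch. The struct and helper below are hypothetical illustrations (not kernel code); they show why one covering hint range behaves differently from two exact claims when another user holds the range in between:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical IOVA range, inclusive on both ends. */
struct iova_range {
    uint64_t start;
    uint64_t last;
};

/* Two ranges conflict if they share any address. */
static int ranges_overlap(struct iova_range a, struct iova_range b)
{
    return a.start <= b.last && b.start <= a.last;
}
```

Claiming range-A and range-C individually avoids range-B, but the single covering hint [A-lowend, C-highend] necessarily conflicts with B, so the kernel would have to ignore or subdivide the hint.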

>
> > > device-specific escape if more specific customization is needed and is
> > > needed to specify user space page tables anyhow.
> >
> > and I didn't understand the 2nd link. How does user-managed page
> > table jump into this range claim problem? I'm getting confused...
>
> PPC could also model it using a FORMAT_KERNEL_PPC_X,
> FORMAT_KERNEL_PPC_Y
> though it is less nice..

yes PPC can use a different format, but I didn't understand why it is
related to user-managed page tables, which further require nesting. sounds
like disconnected topics here...

>
> > > Yes, ioas_id should always be the xarray index.
> > >
> > > PASID needs to be called out as PASID or as a generic "hw description"
> > > blob.
> >
> > ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid?
>
> ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
> and it MUST be exposed in that format to be programmed into the PCI
> device itself.

In the entire discussion in the previous design RFC, I kept the impression
that the ARM equivalent of PASID is called SSID. If we can use PASID as a
general term in the iommufd context, definitely it's much better!

>
> All of this should be able to support a userspace, like DPDK, creating
> a PASID on its own without any special VFIO drivers.
>
> - Open iommufd
> - Attach the vfio device FD
> - Request a PASID device id
> - Create an ios against the pasid device id
> - Query the ios for the PCI PASID #
> - Program the HW to issue TLPs with the PASID

this all makes me very confused, and completely different from what
we agreed in previous v2 design proposal:

- open iommufd
- create an ioas
- attach vfio device to ioasid, with vPASID info
* vfio converts vPASID to pPASID and then call iommufd_device_attach_ioasid()
* the latter then installs ioas to the IOMMU with RID/PASID

>
> > and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the
> > kernel. Do we want to clear this confusion? Or possibly it's fine because
> > ioas_id is never used outside of iommufd and iommufd doesn't directly
> > call ioasid_alloc() from ioasid.c?
>
> As long as it is ioas_id and ioasid it is probably fine..

let's align with others in a few hours.

>
> > > kvm's API to program the vPASID translation table should probably take
> > > in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
> > > information using an in-kernel API. Userspace shouldn't have to
> > > shuttle it around.
> >
> > the vPASID info is carried in VFIO_DEVICE_ATTACH_IOASID uAPI.
> > when kvm calls iommufd with above tuple, vPASID->pPASID is
> > returned to kvm. So we still need a generic blob to represent
> > vPASID in the uAPI.
>
> I think you have to be clear about what the value is being used
> for. Is it an IOMMU page table handle or is it a PCI PASID value?
>
> AFAICT I think it is the former in the Intel scheme as the "vPASID" is
> really about presenting a consistent IOMMU handle to the guest across
> migration, it is not the value that shows up on the PCI bus.
>

It's the former. But the vfio driver needs to maintain the vPASID->pPASID
translation in the mediation path, since what the guest programs is the vPASID.

Thanks
Kevin

2021-09-23 12:35:14

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Thu, Sep 23, 2021 at 12:22:23PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Thursday, September 23, 2021 8:07 PM
> >
> > On Thu, Sep 23, 2021 at 09:14:58AM +0000, Tian, Kevin wrote:
> >
> > > currently the type is aimed to differentiate three usages:
> > >
> > > - kernel-managed I/O page table
> > > - user-managed I/O page table
> > > - shared I/O page table (e.g. with mm, or ept)
> >
> > Creating a shared ios is something that should probably be a different
> > command.
>
> why? I didn't understand the criteria here...

I suspect the input args will be very different, no?

> > > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good
> > > indicator? their difference is not about format.
> >
> > Format should be
> >
> > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
>
> INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?

So long as we are using structs, the field needs some value even when
it isn't really being used. FORMAT_KERNEL is a reasonable value to have
when we are not creating a userspace page table.

Alternatively a userspace page table could have a different API

> yes, the user can query the permitted range using DEVICE_GET_INFO.
> But in the end if the user wants two separate regions, I'm afraid that
> the underlying iommu driver wants to know the exact info. iirc PPC
> has one global system address space shared by all devices. It is possible
> that the user may want to claim range-A and range-C, with range-B
> in-between but claimed by another user. Then simply using one hint
> range [A-lowend, C-highend] might not work.

I don't know, that sounds strange.. In any event a hint is a hint, it
can be ignored; the only information the kernel needs to extract is
the low/high bank?

> yes PPC can use different format, but I didn't understand why it is
> related user-managed page table which further requires nesting. sound
> disconnected topics here...

It is just a way to feed through more information if we get stuck
someday.

> > ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
> > and it MUST be exposed in that format to be programmed into the PCI
> > device itself.
>
> In the entire discussion in previous design RFC, I kept an impression that
> ARM-equivalent PASID is called SSID. If we can use PASID as a general
> term in iommufd context, definitely it's much better!

SSID is inside the chip and part of the IOMMU. PASID is part of the
PCI spec.

iommufd should keep these things distinct.

If we are talking about a PCI TLP then the name to use is PASID.

> > All of this should be able to support a userspace, like DPDK, creating
> > a PASID on its own without any special VFIO drivers.
> >
> > - Open iommufd
> > - Attach the vfio device FD
> > - Request a PASID device id
> > - Create an ios against the pasid device id
> > - Query the ios for the PCI PASID #
> > - Program the HW to issue TLPs with the PASID
>
> this all makes me very confused, and completely different from what
> we agreed in previous v2 design proposal:
>
> - open iommufd
> - create an ioas
> - attach vfio device to ioasid, with vPASID info
> * vfio converts vPASID to pPASID and then call iommufd_device_attach_ioasid()
> * the latter then installs ioas to the IOMMU with RID/PASID

This was your flow for mdevs; I've always been talking about wanting
to see this supported for all use cases, including physical PCI
devices w/ PASID support.

A normal vfio_pci userspace should be able to create PASIDs unrelated
to the mdev stuff.

> > AFAICT I think it is the former in the Intel scheme as the "vPASID" is
> > really about presenting a consistent IOMMU handle to the guest across
> > migration, it is not the value that shows up on the PCI bus.
>
> It's the former. But vfio driver needs to maintain vPASID->pPASID
> translation in the mediation path, since what guest programs is vPASID.

The pPASID definitely is a PASID, as it goes out on the PCIe wire

Suggest you come up with a more general name for vPASID?

Jason

2021-09-23 12:47:58

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, September 23, 2021 8:31 PM
>
> On Thu, Sep 23, 2021 at 12:22:23PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Thursday, September 23, 2021 8:07 PM
> > >
> > > On Thu, Sep 23, 2021 at 09:14:58AM +0000, Tian, Kevin wrote:
> > >
> > > > currently the type is aimed to differentiate three usages:
> > > >
> > > > - kernel-managed I/O page table
> > > > - user-managed I/O page table
> > > > - shared I/O page table (e.g. with mm, or ept)
> > >
> > > Creating a shared ios is something that should probably be a different
> > > command.
> >
> > why? I didn't understand the criteria here...
>
> I suspect the input args will be very different, no?

yes, but can't the structure be extended to incorporate it?

>
> > > > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good
> > > > indicator? their difference is not about format.
> > >
> > > Format should be
> > >
> > > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
> >
> > INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?
>
> So long as we are using structs we need to have values then the field
> isn't being used. FORMAT_KERNEL is a reasonable value to have when we
> are not creating a userspace page table.
>
> Alternatively a userspace page table could have a different API

I don't know. Your comments really confused me about the right way
to design the uAPI. If you still remember, the original v1 proposal
introduced different uAPIs for the kernel- and user-managed cases. Then
you recommended consolidating everything related to the ioas into one
allocation command.

Can you help articulate the criteria first?

>
> > yes, the user can query the permitted range using DEVICE_GET_INFO.
> > But in the end if the user wants two separate regions, I'm afraid that
> > the underlying iommu driver wants to know the exact info. iirc PPC
> > has one global system address space shared by all devices. It is possible
> > that the user may want to claim range-A and range-C, with range-B
> > in-between but claimed by another user. Then simply using one hint
> > range [A-lowend, C-highend] might not work.
>
> I don't know, that sounds strange.. In any event hint is a hint, it
> can be ignored, the only information the kernel needs to extract is
> low/high bank?

iirc Dave said that the user needs to claim a range explicitly. 'claim'
doesn't sound like a hint to me. Possibly it's time for Dave to chime in.

>
> > yes PPC can use different format, but I didn't understand why it is
> > related user-managed page table which further requires nesting. sound
> > disconnected topics here...
>
> It is just a way to feed through more information if we get stuck
> someday.

You mean that we should define the uAPI for all possible future extensions
now, to minimize the frequency of changing it?

>
> > > ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
> > > and it MUST be exposed in that format to be programmed into the PCI
> > > device itself.
> >
> > In the entire discussion in previous design RFC, I kept an impression that
> > ARM-equivalent PASID is called SSID. If we can use PASID as a general
> > term in iommufd context, definitely it's much better!
>
> SSID is inside the chip and part of the IOMMU. PASID is part of the
> PCI spec.
>
> iommufd should keep these things distinct.
>
> If we are talking about a PCI TLP then the name to use is PASID.

If Jean doesn't object...

>
> > > All of this should be able to support a userspace, like DPDK, creating
> > > a PASID on its own without any special VFIO drivers.
> > >
> > > - Open iommufd
> > > - Attach the vfio device FD
> > > - Request a PASID device id
> > > - Create an ios against the pasid device id
> > > - Query the ios for the PCI PASID #
> > > - Program the HW to issue TLPs with the PASID
> >
> > this all makes me very confused, and completely different from what
> > we agreed in previous v2 design proposal:
> >
> > - open iommufd
> > - create an ioas
> > - attach vfio device to ioasid, with vPASID info
> > * vfio converts vPASID to pPASID and then call iommufd_device_attach_ioasid()
> > * the latter then installs ioas to the IOMMU with RID/PASID
>
> This was your flow for mdev's, I've always been talking about wanting
> to see this supported for all use cases, including physical PCI
> devices w/ PASID support.

this is not a flow just for mdev. It's also required for pdev on the Intel
platform, because the pasid table is in HPA space and thus must be managed
by the host kernel. Even with no translation, we still need the user to
provide the pasid info.

>
> A normal vfio_pci userspace should be able to create PASIDs unrelated
> to the mdev stuff.
>
> > > AFAICT I think it is the former in the Intel scheme as the "vPASID" is
> > > really about presenting a consistent IOMMU handle to the guest across
> > > migration, it is not the value that shows up on the PCI bus.
> >
> > It's the former. But vfio driver needs to maintain vPASID->pPASID
> > translation in the mediation path, since what guest programs is vPASID.
>
> The pPASID definately is a PASID as it goes out on the PCIe wire
>
> Suggest you come up with a more general name for vPASID?
>

as explained earlier, on the Intel platform the user always needs to provide
a PASID in the attach call. whether it's directly used (for pdev)
or translated (for mdev) is an underlying-driver detail. From the kernel's
p.o.v., since this PASID is provided by the user, it's fine to call it vPASID
in the uAPI.

Thanks
Kevin

2021-09-23 13:04:55

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Thu, Sep 23, 2021 at 12:45:17PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Thursday, September 23, 2021 8:31 PM
> >
> > On Thu, Sep 23, 2021 at 12:22:23PM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <[email protected]>
> > > > Sent: Thursday, September 23, 2021 8:07 PM
> > > >
> > > > On Thu, Sep 23, 2021 at 09:14:58AM +0000, Tian, Kevin wrote:
> > > >
> > > > > currently the type is aimed to differentiate three usages:
> > > > >
> > > > > - kernel-managed I/O page table
> > > > > - user-managed I/O page table
> > > > > - shared I/O page table (e.g. with mm, or ept)
> > > >
> > > > Creating a shared ios is something that should probably be a different
> > > > command.
> > >
> > > why? I didn't understand the criteria here...
> >
> > I suspect the input args will be very different, no?
>
> yes, but can't the structure be extended to incorporate it?

You need to be thoughtful; giant structures with endless combinations
of optional fields turn out to be very hard.
this shared thing will need, but I'm guessing it is almost none, so
maybe a new call is OK?

If it is literally just 'give me an ioas for current mm' then it has
no args or complexity at all.
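If the shared-with-mm case really takes no input, the command degenerates to "allocate and return an id". A minimal hypothetical mock (no such function exists in the RFC) makes the contrast with the struct-heavy alloc call clear:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: an ioas shared with the current mm needs no format,
 * addr_width, or ranges -- the mm itself defines the address space.
 * The kernel would only hand back a fresh ioas id. */
static uint32_t mock_alloc_shared_mm_ioas(uint32_t *next_id)
{
    return (*next_id)++;
}
```

A call this simple is a reasonable argument for making it a separate command rather than another combination of optional fields in the main allocation struct.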

> > > > > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good
> > > > > indicator? their difference is not about format.
> > > >
> > > > Format should be
> > > >
> > > > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
> > >
> > > INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?
> >
> > So long as we are using structs we need to have values then the field
> > isn't being used. FORMAT_KERNEL is a reasonable value to have when we
> > are not creating a userspace page table.
> >
> > Alternatively a userspace page table could have a different API
>
> I don't know. Your comments really confused me on what's the right
> way to design the uAPI. If you still remember, the original v1 proposal
> introduced different uAPIs for kernel/user-managed cases. Then you
> recommended to consolidate everything related to ioas in one allocation
> command.

This is because you had almost completely duplicated the input args
between the two calls.

If it turns out they have very different args, then they should have
different calls.

> > > - open iommufd
> > > - create an ioas
> > > - attach vfio device to ioasid, with vPASID info
> > > * vfio converts vPASID to pPASID and then call iommufd_device_attach_ioasid()
> > > * the latter then installs ioas to the IOMMU with RID/PASID
> >
> > This was your flow for mdev's, I've always been talking about wanting
> > to see this supported for all use cases, including physical PCI
> > devices w/ PASID support.
>
> this is not a flow for mdev. It's also required for pdev on Intel platform,
> because the pasid table is in HPA space thus must be managed by host
> kernel. Even no translation we still need the user to provide the pasid info.

There should be no mandatory vPASID stuff in most of these flows, that
is just a special thing ENQCMD virtualization needs. If userspace
isn't doing ENQCMD virtualization it shouldn't need to touch this
stuff.

> as explained earlier, on Intel platform the user always needs to provide
> a PASID in the attaching call. whether it's directly used (for pdev)
> or translated (for mdev) is the underlying driver thing. From kernel
> p.o.v, since this PASID is provided by the user, it's fine to call it vPASID
> in the uAPI.

I've always disagreed with this. There should be an option for the
kernel to pick an appropriate PASID for portability to other IOMMUs
and simplicity of the interface.

You need to keep it clear what is in the minimum basic path and what
is needed for special cases, like ENQCMD virtualization.

Not every user of iommufd is doing virtualization.

Jason

2021-09-23 13:22:49

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, September 23, 2021 9:02 PM
>
> On Thu, Sep 23, 2021 at 12:45:17PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Thursday, September 23, 2021 8:31 PM
> > >
> > > On Thu, Sep 23, 2021 at 12:22:23PM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe <[email protected]>
> > > > > Sent: Thursday, September 23, 2021 8:07 PM
> > > > >
> > > > > On Thu, Sep 23, 2021 at 09:14:58AM +0000, Tian, Kevin wrote:
> > > > >
> > > > > > currently the type is aimed to differentiate three usages:
> > > > > >
> > > > > > - kernel-managed I/O page table
> > > > > > - user-managed I/O page table
> > > > > > - shared I/O page table (e.g. with mm, or ept)
> > > > >
> > > > > Creating a shared ios is something that should probably be a different
> > > > > command.
> > > >
> > > > why? I didn't understand the criteria here...
> > >
> > > I suspect the input args will be very different, no?
> >
> > yes, but can't the structure be extended to incorporate it?
>
> You need to be thoughtful, giant structures with endless combinations
> of optional fields turn out very hard. I haven't even seen what args
> this shared thing will need, but I'm guessing it is almost none, so
> maybe a new call is OK?

To judge this, it looks like we may have to do some practice on this front,
e.g. coming up with an example structure for the future intended usages and
then seeing whether one structure can fit?

>
> If it is literally just 'give me an ioas for current mm' then it has
> no args or complexity at all.

for mm, yes, it should be simple. for ept it might be more complex, e.g.
requiring a handle into kvm and some other format info to match the ept
page table.

>
> > > > > > we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good
> > > > > > indicator? their difference is not about format.
> > > > >
> > > > > Format should be
> > > > >
> > > > > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
> > > >
> > > > INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?
> > >
> > > So long as we are using structs we need to have values then the field
> > > isn't being used. FORMAT_KERNEL is a reasonable value to have when we
> > > are not creating a userspace page table.
> > >
> > > Alternatively a userspace page table could have a different API
> >
> > I don't know. Your comments really confused me on what's the right
> > way to design the uAPI. If you still remember, the original v1 proposal
> > introduced different uAPIs for kernel/user-managed cases. Then you
> > recommended to consolidate everything related to ioas in one allocation
> > command.
>
> This is because you had almost completely duplicated the input args
> between the two calls.
>
> If it turns out they have very different args, then they should have
> different calls.
>
> > > > - open iommufd
> > > > - create an ioas
> > > > - attach vfio device to ioasid, with vPASID info
> > > > * vfio converts vPASID to pPASID and then call iommufd_device_attach_ioasid()
> > > > * the latter then installs ioas to the IOMMU with RID/PASID
> > >
> > > This was your flow for mdev's, I've always been talking about wanting
> > > to see this supported for all use cases, including physical PCI
> > > devices w/ PASID support.
> >
> > this is not a flow for mdev. It's also required for pdev on Intel platform,
> > because the pasid table is in HPA space thus must be managed by host
> > kernel. Even no translation we still need the user to provide the pasid info.
>
> There should be no mandatory vPASID stuff in most of these flows, that
> is just a special thing ENQCMD virtualization needs. If userspace
> isn't doing ENQCMD virtualization it shouldn't need to touch this
> stuff.

No. for one, we also support SVA w/o using ENQCMD. For two, the key
is that the PASID table cannot be delegated to userspace as on ARM
or AMD. This implies that any pasid that userspace wants to enable
must be configured via the kernel.

>
> > as explained earlier, on Intel platform the user always needs to provide
> > a PASID in the attaching call. whether it's directly used (for pdev)
> > or translated (for mdev) is the underlying driver thing. From kernel
> > p.o.v, since this PASID is provided by the user, it's fine to call it vPASID
> > in the uAPI.
>
> I've always disagreed with this. There should be an option for the
> kernel to pick an appropriate PASID for portability to other IOMMUs
> and simplicity of the interface.
>
> You need to keep it clear what is in the minimum basic path and what
> is needed for special cases, like ENQCMD virtualization.
>
> Not every user of iommufd is doing virtualization.
>

just a short summary of the PASID model from the previous design RFC:

for arm/amd:
- pasid space delegated to userspace
- pasid table delegated to userspace
- just one call to bind pasid_table(), then pasids are fully managed by the user

for intel:
- pasid table is always managed by the kernel
- for pdev,
- pasid space is delegated to userspace
- attach_ioasid(dev, ioasid, pasid) so the kernel can set up the pasid entry
- for mdev,
- pasid space is managed by userspace
- attach_ioasid(dev, ioasid, vpasid). vfio converts vpasid to ppasid; iommufd sets up the ppasid entry
- additionally, a contract with kvm to set up CPU pasid translation if enqcmd is used
- to unify pdev/mdev, just always call it vpasid in attach_ioasid(); let the underlying driver figure out whether the vpasid should be translated.
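The unified pdev/mdev convention in the summary above can be mocked in a few lines. All names here are hypothetical illustrations, not the RFC's uAPI: userspace always passes a vpasid, and only the mdev path translates it before the entry is installed:

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_MAX_VPASID 16

/* Hypothetical device handle; is_mdev selects the translation path. */
struct mock_dev {
    int      is_mdev;
    uint32_t vpasid_to_ppasid[MOCK_MAX_VPASID]; /* mdev-only table kept by vfio */
};

/* Returns the PASID actually installed into the IOMMU's PASID table. */
static uint32_t mock_attach_ioasid(struct mock_dev *dev,
                                   uint32_t ioas, uint32_t vpasid)
{
    (void)ioas;  /* the ioas would be bound to this PASID entry */
    if (dev->is_mdev)
        return dev->vpasid_to_ppasid[vpasid]; /* translated by the driver */
    return vpasid;                            /* pdev: used as-is */
}
```

The asymmetry lives entirely below the call: from the uAPI's point of view both device types take the same (dev, ioasid, vpasid) triple.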

Thanks
Kevin

2021-09-23 13:35:51

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Thu, Sep 23, 2021 at 01:20:55PM +0000, Tian, Kevin wrote:

> > > this is not a flow for mdev. It's also required for pdev on Intel platform,
> > > because the pasid table is in HPA space thus must be managed by host
> > > kernel. Even no translation we still need the user to provide the pasid info.
> >
> > There should be no mandatory vPASID stuff in most of these flows, that
> > is just a special thing ENQCMD virtualization needs. If userspace
> > isn't doing ENQCMD virtualization it shouldn't need to touch this
> > stuff.
>
> No. for one, we also support SVA w/o using ENQCMD. For two, the key
> is that the PASID table cannot be delegated to the userspace like ARM
> or AMD. This implies that for any pasid that the userspace wants to
> enable, it must be configured via the kernel.

Yes, configured through the kernel, but the simplified flow should
have the kernel handle everything and just emit a PASID for userspace
to use.


> just for a short summary of PASID model from previous design RFC:
>
> for arm/amd:
> - pasid space delegated to userspace
> - pasid table delegated to userspace
> - just one call to bind pasid_table() then pasids are fully managed by user
>
> for intel:
> - pasid table is always managed by kernel
> - for pdev,
> - pasid space is delegated to userspace
> - attach_ioasid(dev, ioasid, pasid) so the kernel can setup the pasid entry
> - for mdev,
> - pasid space is managed by userspace
> - attach_ioasid(dev, ioasid, vpasid). vfio converts vpasid to ppasid. iommufd setups the ppasid entry
> - additional a contract to kvm for setup CPU pasid translation if enqcmd is used
> - to unify pdev/mdev, just always call it vpasid in attach_ioasid(). let underlying driver to figure out whether vpasid should be translated.

All cases should support a kernel owned ioas associated with a
PASID. This is the universal basic API that all PASID supporting
IOMMUs need to implement.

I should not need to write generic userspace that has to know how to
set up architecture-specific nested userspace page tables just to use
PASID!

All of the above is qemu-accelerated vIOMMU stuff. It is a good idea
to keep the two areas separate as it greatly informs what is general
code and what is HW-specific code.

Jason

2021-09-23 13:45:15

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, September 23, 2021 9:31 PM
>
> On Thu, Sep 23, 2021 at 01:20:55PM +0000, Tian, Kevin wrote:
>
> > > > this is not a flow for mdev. It's also required for pdev on Intel platform,
> > > > because the pasid table is in HPA space thus must be managed by host
> > > > kernel. Even no translation we still need the user to provide the pasid info.
> > >
> > > There should be no mandatory vPASID stuff in most of these flows, that
> > > is just a special thing ENQCMD virtualization needs. If userspace
> > > isn't doing ENQCMD virtualization it shouldn't need to touch this
> > > stuff.
> >
> > No. for one, we also support SVA w/o using ENQCMD. For two, the key
> > is that the PASID table cannot be delegated to the userspace like ARM
> > or AMD. This implies that for any pasid that the userspace wants to
> > enable, it must be configured via the kernel.
>
> Yes, configured through the kernel, but the simplified flow should
> have the kernel handle everything and just emit a PASID for userspace
> to use.
>
>
> > just for a short summary of PASID model from previous design RFC:
> >
> > for arm/amd:
> > - pasid space delegated to userspace
> > - pasid table delegated to userspace
> > - just one call to bind pasid_table() then pasids are fully managed by user
> >
> > for intel:
> > - pasid table is always managed by kernel
> > - for pdev,
> > - pasid space is delegated to userspace
> > - attach_ioasid(dev, ioasid, pasid) so the kernel can setup the pasid entry
> > - for mdev,
> > - pasid space is managed by userspace
> > - attach_ioasid(dev, ioasid, vpasid). vfio converts vpasid to ppasid. iommufd setups the ppasid entry
> > - additional a contract to kvm for setup CPU pasid translation if enqcmd is used
> > - to unify pdev/mdev, just always call it vpasid in attach_ioasid(). let underlying driver to figure out whether vpasid should be translated.
>
> All cases should support a kernel owned ioas associated with a
> PASID. This is the universal basic API that all PASID supporting
> IOMMUs need to implement.
>
> I should not need to write generic users space that has to know how to
> setup architecture specific nested userspace page tables just to use
> PASID!

ah, got you! I have to admit that my previous thoughts were all from
the VM p.o.v., with true userspace applications ignored...

>
> All of the above is qemu accelerated vIOMMU stuff. It is a good idea
> to keep the two areas seperate as it greatly informs what is general
> code and what is HW specific code.
>

Agree. will think more along this direction. possibly this discussion
has deviated a lot from what this skeleton series provides. We still have
plenty of time to figure it out when starting the pasid support. For now
at least the minimal outcome is that PASID might be a good candidate to
be used in iommufd. :)

Thanks
Kevin

2021-09-29 11:51:54

by Yi Liu

[permalink] [raw]
Subject: RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

> From: Jean-Philippe Brucker <[email protected]>
> Sent: Wednesday, September 22, 2021 9:45 PM
>
> On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > This patch adds IOASID allocation/free interface per iommufd. When
> > allocating an IOASID, userspace is expected to specify the type and
> > format information for the target I/O page table.
> >
> > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > semantics. For this type the user should specify the addr_width of
> > the I/O address space and whether the I/O page table is created in
> > an iommu enforce_snoop format. enforce_snoop must be true at this
> point,
> > as the false setting requires additional contract with KVM on handling
> > WBINVD emulation, which can be added later.
> >
> > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > for what formats can be specified when allocating an IOASID.
> >
> > Open:
> > - Devices on PPC platform currently use a different iommu driver in vfio.
> > Per previous discussion they can also use vfio type1v2 as long as there
> > is a way to claim a specific iova range from a system-wide address space.
>
> Is this the reason for passing addr_width to IOASID_ALLOC? I didn't get
> what it's used for or why it's mandatory. But for PPC it sounds like it
> should be an address range instead of an upper limit?

yes, as this open describes, it may need to be a range. But I'm not sure
whether PPC requires multiple ranges or just one. Perhaps David can
guide us there.

Regards,
Yi Liu

> Thanks,
> Jean
>
> > This requirement doesn't sound PPC specific, as addr_width for pci
> devices
> > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > adopted this design yet. We hope to have formal alignment in v1
> discussion
> > and then decide how to incorporate it in v2.
> >
> > - Currently ioasid term has already been used in the kernel
> (drivers/iommu/
> > ioasid.c) to represent the hardware I/O address space ID in the wire. It
> > covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-
> Stream
> > ID). We need to find a way to resolve the naming conflict between the
> hardware
> > ID and software handle. One option is to rename the existing ioasid to be
> > pasid or ssid, given their full names still sound generic. Appreciate more
> > thoughts on this open!

2021-10-01 06:31:51

by David Gibson

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Tue, Sep 21, 2021 at 02:44:38PM -0300, Jason Gunthorpe wrote:
> On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > This patch adds IOASID allocation/free interface per iommufd. When
> > allocating an IOASID, userspace is expected to specify the type and
> > format information for the target I/O page table.
> >
> > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > semantics. For this type the user should specify the addr_width of
> > the I/O address space and whether the I/O page table is created in
> > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > as the false setting requires additional contract with KVM on handling
> > WBINVD emulation, which can be added later.
> >
> > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > for what formats can be specified when allocating an IOASID.
> >
> > Open:
> > - Devices on PPC platform currently use a different iommu driver in vfio.
> > Per previous discussion they can also use vfio type1v2 as long as there
> > is a way to claim a specific iova range from a system-wide address space.
> > This requirement doesn't sound PPC specific, as addr_width for pci devices
> > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > adopted this design yet. We hope to have formal alignment in v1 discussion
> > and then decide how to incorporate it in v2.
>
> I think the request was to include a start/end IO address hint when
> creating the ios. When the kernel creates it then it can return the
> actual geometry including any holes via a query.

So part of the point of specifying start/end addresses is that
explicitly querying holes shouldn't be necessary: if the requested
range crosses a hole, it should fail. If you didn't really need all
that range, you shouldn't have asked for it.

Which means these aren't really "hints" but optionally supplied
constraints.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-01 06:31:57

by David Gibson

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> This patch adds IOASID allocation/free interface per iommufd. When
> allocating an IOASID, userspace is expected to specify the type and
> format information for the target I/O page table.
>
> This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> implying a kernel-managed I/O page table with vfio type1v2 mapping
> semantics. For this type the user should specify the addr_width of
> the I/O address space and whether the I/O page table is created in
> an iommu enforce_snoop format. enforce_snoop must be true at this point,
> as the false setting requires additional contract with KVM on handling
> WBINVD emulation, which can be added later.
>
> Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> for what formats can be specified when allocating an IOASID.
>
> Open:
> - Devices on PPC platform currently use a different iommu driver in vfio.
> Per previous discussion they can also use vfio type1v2 as long as there
> is a way to claim a specific iova range from a system-wide address space.
> This requirement doesn't sound PPC specific, as addr_width for pci devices
> can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> adopted this design yet. We hope to have formal alignment in v1 discussion
> and then decide how to incorporate it in v2.

Ok, there are several things we need for ppc. None of which are
inherently ppc specific and some of which will I think be useful for
most platforms. So, starting from most general to most specific
here's basically what's needed:

1. We need to represent the fact that the IOMMU can only translate
*some* IOVAs, not a full 64-bit range. You have the addr_width
already, but I'm not entirely sure whether the translatable range on ppc
(or other platforms) is always a power-of-2 size. It usually will
be, of course, but I'm not sure that's a hard requirement. So
using a size/max rather than just a number of bits might be safer.

I think basically every platform will need this. Most platforms
don't actually implement full 64-bit translation in any case, but
rather some smaller number of bits that fits their page table
format.

2. The translatable range of IOVAs may not begin at 0. So we need to
advertise to userspace what the base address is, as well as the
size. POWER's main IOVA range begins at 2^59 (at least on the
models I know about).

I think a number of platforms are likely to want this, though I
couldn't name them apart from POWER. Putting the translated IOVA
window at some huge address is a pretty obvious approach to making
an IOMMU which can translate a wide address range without colliding
with any legacy PCI addresses down low (the IOMMU can check if this
transaction is for it by just looking at some high bits in the
address).

3. There might be multiple translatable ranges. So, on POWER the
IOMMU can typically translate IOVAs from 0..2GiB, and also from
2^59..2^59+<RAM size>. The two ranges have completely separate IO
page tables, with (usually) different layouts. (The low range will
nearly always be a single-level page table with 4kiB or 64kiB
entries, the high one will be multiple levels depending on the size
of the range and pagesize).

This may be less common, but I suspect POWER won't be the only
platform to do something like this. As above, using a high range
is a pretty obvious approach, but clearly won't handle older
devices which can't do 64-bit DMA. So adding a smaller range for
those devices is again a pretty obvious solution. Any platform
with an "IO hole" can be treated as having two ranges, one below
the hole and one above it (although in that case they may well not
have separate page tables).

4. The translatable ranges might not be fixed. On ppc that 0..2GiB
and 2^59..whatever ranges are kernel conventions, not specified by
the hardware or firmware. When running as a guest (which is the
normal case on POWER), there are explicit hypercalls for
configuring the allowed IOVA windows (along with pagesize, number
of levels etc.). At the moment it is fixed in hardware that there
are only 2 windows, one starting at 0 and one at 2^59 but there's
no inherent reason those couldn't also be configurable.

This will probably be rarer, but I wouldn't be surprised if it
appears on another platform. If you were designing an IOMMU ASIC
for use in a variety of platforms, making the base address and size
of the translatable range(s) configurable in registers would make
sense.


Now, for (3) and (4), representing lists of windows explicitly in
ioctl()s is likely to be pretty ugly. We might be able to avoid that,
for at least some of the interfaces, by using the nested IOAS stuff.
One way or another, though, the IOASes which are actually attached to
devices need to represent both windows.

e.g.
Create a "top-level" IOAS <A> representing the device's view. This
would be either TYPE_KERNEL or maybe a special type. Into that you'd
make just two iomappings, one for each of the translation windows,
pointing to IOASes <B> and <C>. IOAS <B> and <C> would have a single
window, and would represent the IO page tables for each of the
translation windows. These could be either TYPE_KERNEL or (say)
TYPE_POWER_TCE for a user managed table. Well.. in theory, anyway.
The way paravirtualization on POWER is done might mean user managed
tables aren't really possible for other reasons, but that's not
relevant here.

The next problem here is that we don't want userspace to have to do
different things for POWER, at least not for the easy case of a
userspace driver that just wants a chunk of IOVA space and doesn't
really care where it is.

In general I think the right approach to handle that is to
de-emphasize "info" or "query" interfaces. We'll probably still need
some for debugging and edge cases, but in the normal case userspace
should just specify what it *needs* and (ideally) no more with
optional hints, and the kernel will either supply that or fail.

e.g. A simple userspace driver would simply say "I need an IOAS with
at least 1GiB of IOVA space" and the kernel says "Ok, you can use
2^59..2^59+2GiB". qemu, emulating the POWER vIOMMU might say "I need
an IOAS with translatable addresses from 0..2GiB with 4kiB page size
and from 2^59..2^59+1TiB with 64kiB page size" and the kernel would
either say "ok", or "I can't do that".

> - Currently ioasid term has already been used in the kernel (drivers/iommu/
> ioasid.c) to represent the hardware I/O address space ID in the wire. It
> covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
> ID). We need to find a way to resolve the naming conflict between the hardware
> ID and software handle. One option is to rename the existing ioasid to be
> pasid or ssid, given their full names still sound generic. Appreciate more
> thoughts on this open!
>
> Signed-off-by: Liu Yi L <[email protected]>
> ---
> drivers/iommu/iommufd/iommufd.c | 120 ++++++++++++++++++++++++++++++++
> include/linux/iommufd.h | 3 +
> include/uapi/linux/iommu.h | 54 ++++++++++++++
> 3 files changed, 177 insertions(+)
>
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index 641f199f2d41..4839f128b24a 100644
> --- a/drivers/iommu/iommufd/iommufd.c
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -24,6 +24,7 @@
> struct iommufd_ctx {
> refcount_t refs;
> struct mutex lock;
> + struct xarray ioasid_xa; /* xarray of ioasids */
> struct xarray device_xa; /* xarray of bound devices */
> };
>
> @@ -42,6 +43,16 @@ struct iommufd_device {
> u64 dev_cookie;
> };
>
> +/* Represent an I/O address space */
> +struct iommufd_ioas {
> + int ioasid;
> + u32 type;
> + u32 addr_width;
> + bool enforce_snoop;
> + struct iommufd_ctx *ictx;
> + refcount_t refs;
> +};
> +
> static int iommufd_fops_open(struct inode *inode, struct file *filep)
> {
> struct iommufd_ctx *ictx;
> @@ -53,6 +64,7 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)
>
> refcount_set(&ictx->refs, 1);
> mutex_init(&ictx->lock);
> + xa_init_flags(&ictx->ioasid_xa, XA_FLAGS_ALLOC);
> xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
> filep->private_data = ictx;
>
> @@ -102,16 +114,118 @@ static void iommufd_ctx_put(struct iommufd_ctx *ictx)
> if (!refcount_dec_and_test(&ictx->refs))
> return;
>
> + WARN_ON(!xa_empty(&ictx->ioasid_xa));
> WARN_ON(!xa_empty(&ictx->device_xa));
> kfree(ictx);
> }
>
> +/* Caller should hold ictx->lock */
> +static void ioas_put_locked(struct iommufd_ioas *ioas)
> +{
> + struct iommufd_ctx *ictx = ioas->ictx;
> + int ioasid = ioas->ioasid;
> +
> + if (!refcount_dec_and_test(&ioas->refs))
> + return;
> +
> + xa_erase(&ictx->ioasid_xa, ioasid);
> + iommufd_ctx_put(ictx);
> + kfree(ioas);
> +}
> +
> +static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
> +{
> + struct iommu_ioasid_alloc req;
> + struct iommufd_ioas *ioas;
> + unsigned long minsz;
> + int ioasid, ret;
> +
> + minsz = offsetofend(struct iommu_ioasid_alloc, addr_width);
> +
> + if (copy_from_user(&req, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (req.argsz < minsz || !req.addr_width ||
> + req.flags != IOMMU_IOASID_ENFORCE_SNOOP ||
> + req.type != IOMMU_IOASID_TYPE_KERNEL_TYPE1V2)
> + return -EINVAL;
> +
> + ioas = kzalloc(sizeof(*ioas), GFP_KERNEL);
> + if (!ioas)
> + return -ENOMEM;
> +
> + mutex_lock(&ictx->lock);
> + ret = xa_alloc(&ictx->ioasid_xa, &ioasid, ioas,
> + XA_LIMIT(IOMMUFD_IOASID_MIN, IOMMUFD_IOASID_MAX),
> + GFP_KERNEL);
> + mutex_unlock(&ictx->lock);
> + if (ret) {
> + pr_err_ratelimited("Failed to alloc ioasid\n");
> + kfree(ioas);
> + return ret;
> + }
> +
> + ioas->ioasid = ioasid;
> +
> + /* only supports kernel managed I/O page table so far */
> + ioas->type = IOMMU_IOASID_TYPE_KERNEL_TYPE1V2;
> +
> + ioas->addr_width = req.addr_width;
> +
> + /* only supports enforce snoop today */
> + ioas->enforce_snoop = true;
> +
> + iommufd_ctx_get(ictx);
> + ioas->ictx = ictx;
> +
> + refcount_set(&ioas->refs, 1);
> +
> + return ioasid;
> +}
> +
> +static int iommufd_ioasid_free(struct iommufd_ctx *ictx, unsigned long arg)
> +{
> + struct iommufd_ioas *ioas = NULL;
> + int ioasid, ret = 0;
> +
> + if (copy_from_user(&ioasid, (void __user *)arg, sizeof(ioasid)))
> + return -EFAULT;
> +
> + if (ioasid < 0)
> + return -EINVAL;
> +
> + mutex_lock(&ictx->lock);
> + ioas = xa_load(&ictx->ioasid_xa, ioasid);
> + if (!ioas) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
> + /* Disallow free if refcount is not 1 */
> + if (refcount_read(&ioas->refs) > 1) {
> + ret = -EBUSY;
> + goto out_unlock;
> + }
> +
> + ioas_put_locked(ioas);
> +out_unlock:
> + mutex_unlock(&ictx->lock);
> + return ret;
> +};
> +
> static int iommufd_fops_release(struct inode *inode, struct file *filep)
> {
> struct iommufd_ctx *ictx = filep->private_data;
> + struct iommufd_ioas *ioas;
> + unsigned long index;
>
> filep->private_data = NULL;
>
> + mutex_lock(&ictx->lock);
> + xa_for_each(&ictx->ioasid_xa, index, ioas)
> + ioas_put_locked(ioas);
> + mutex_unlock(&ictx->lock);
> +
> iommufd_ctx_put(ictx);
>
> return 0;
> @@ -195,6 +309,12 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
> case IOMMU_DEVICE_GET_INFO:
> ret = iommufd_get_device_info(ictx, arg);
> break;
> + case IOMMU_IOASID_ALLOC:
> + ret = iommufd_ioasid_alloc(ictx, arg);
> + break;
> + case IOMMU_IOASID_FREE:
> + ret = iommufd_ioasid_free(ictx, arg);
> + break;
> default:
> pr_err_ratelimited("unsupported cmd %u\n", cmd);
> break;
> diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
> index 1603a13937e9..1dd6515e7816 100644
> --- a/include/linux/iommufd.h
> +++ b/include/linux/iommufd.h
> @@ -14,6 +14,9 @@
> #include <linux/err.h>
> #include <linux/device.h>
>
> +#define IOMMUFD_IOASID_MAX ((unsigned int)(0x7FFFFFFF))
> +#define IOMMUFD_IOASID_MIN 0
> +
> #define IOMMUFD_DEVID_MAX ((unsigned int)(0x7FFFFFFF))
> #define IOMMUFD_DEVID_MIN 0
>
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 76b71f9d6b34..5cbd300eb0ee 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -57,6 +57,60 @@ struct iommu_device_info {
>
> #define IOMMU_DEVICE_GET_INFO _IO(IOMMU_TYPE, IOMMU_BASE + 1)
>
> +/*
> + * IOMMU_IOASID_ALLOC - _IOWR(IOMMU_TYPE, IOMMU_BASE + 2,
> + * struct iommu_ioasid_alloc)
> + *
> + * Allocate an IOASID.
> + *
> + * IOASID is the FD-local software handle representing an I/O address
> + * space. Each IOASID is associated with a single I/O page table. User
> + * must call this ioctl to get an IOASID for every I/O address space
> + * that is intended to be tracked by the kernel.
> + *
> + * User needs to specify the attributes of the IOASID and associated
> + * I/O page table format information according to one or multiple devices
> + * which will be attached to this IOASID right after. The I/O page table
> + * is activated in the IOMMU when it's attached by a device. Incompatible
> + * format between device and IOASID will lead to attaching failure in
> + * device side.
> + *
> + * Currently only one flag (IOMMU_IOASID_ENFORCE_SNOOP) is supported and
> + * must be always set.
> + *
> + * Only one I/O page table type (kernel-managed) is supported, with vfio
> + * type1v2 mapping semantics.
> + *
> + * User should call IOMMU_CHECK_EXTENSION for future extensions.
> + *
> + * @argsz: user filled size of this data.
> + * @flags: additional information for IOASID allocation.
> + * @type: I/O address space page table type.
> + * @addr_width: address width of the I/O address space.
> + *
> + * Return: allocated ioasid on success, -errno on failure.
> + */
> +struct iommu_ioasid_alloc {
> + __u32 argsz;
> + __u32 flags;
> +#define IOMMU_IOASID_ENFORCE_SNOOP (1 << 0)
> + __u32 type;
> +#define IOMMU_IOASID_TYPE_KERNEL_TYPE1V2 1
> + __u32 addr_width;
> +};
> +
> +#define IOMMU_IOASID_ALLOC _IO(IOMMU_TYPE, IOMMU_BASE + 2)
> +
> +/**
> + * IOMMU_IOASID_FREE - _IOWR(IOMMU_TYPE, IOMMU_BASE + 3, int)
> + *
> + * Free an IOASID.
> + *
> + * returns: 0 on success, -errno on failure
> + */
> +
> +#define IOMMU_IOASID_FREE _IO(IOMMU_TYPE, IOMMU_BASE + 3)
> +
> #define IOMMU_FAULT_PERM_READ (1 << 0) /* read */
> #define IOMMU_FAULT_PERM_WRITE (1 << 1) /* write */
> #define IOMMU_FAULT_PERM_EXEC (1 << 2) /* exec */

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-01 06:33:04

by David Gibson

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Thu, Sep 23, 2021 at 09:14:58AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, September 22, 2021 10:09 PM
> >
> > On Wed, Sep 22, 2021 at 03:40:25AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <[email protected]>
> > > > Sent: Wednesday, September 22, 2021 1:45 AM
> > > >
> > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > > allocating an IOASID, userspace is expected to specify the type and
> > > > > format information for the target I/O page table.
> > > > >
> > > > > This RFC supports only one type
> > (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > > semantics. For this type the user should specify the addr_width of
> > > > > the I/O address space and whether the I/O page table is created in
> > > > > an iommu enforce_snoop format. enforce_snoop must be true at this
> > point,
> > > > > as the false setting requires additional contract with KVM on handling
> > > > > WBINVD emulation, which can be added later.
> > > > >
> > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next
> > patch)
> > > > > for what formats can be specified when allocating an IOASID.
> > > > >
> > > > > Open:
> > > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > > > Per previous discussion they can also use vfio type1v2 as long as there
> > > > > is a way to claim a specific iova range from a system-wide address
> > space.
> > > > > This requirement doesn't sound PPC specific, as addr_width for pci
> > > > devices
> > > > > can be also represented by a range [0, 2^addr_width-1]. This RFC
> > hasn't
> > > > > adopted this design yet. We hope to have formal alignment in v1
> > > > discussion
> > > > > and then decide how to incorporate it in v2.
> > > >
> > > > I think the request was to include a start/end IO address hint when
> > > > creating the ios. When the kernel creates it then it can return the
> > >
> > > is the hint single-range or could be multiple-ranges?
> >
> > David explained it here:
> >
> > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
> >
> > qemu needs to be able to choose if it gets the 32 bit range or 64
> > bit range.
> >
> > So a 'range hint' will do the job
> >
> > David also suggested this:
> >
> > https://lore.kernel.org/kvm/YL6%2FbjHyuHJTn4Rd@yekko/
> >
> > So I like this better:
> >
> > struct iommu_ioasid_alloc {
> > __u32 argsz;
> >
> > __u32 flags;
> > #define IOMMU_IOASID_ENFORCE_SNOOP (1 << 0)
> > #define IOMMU_IOASID_HINT_BASE_IOVA (1 << 1)
> >
> > __aligned_u64 max_iova_hint;
> > __aligned_u64 base_iova_hint; // Used only if
> > IOMMU_IOASID_HINT_BASE_IOVA
> >
> > // For creating nested page tables
> > __u32 parent_ios_id;
> > __u32 format;
> > #define IOMMU_FORMAT_KERNEL 0
> > #define IOMMU_FORMAT_PPC_XXX 2
> > #define IOMMU_FORMAT_[..]
> > u32 format_flags; // Layout depends on format above
> >
> > __aligned_u64 user_page_directory; // Used if parent_ios_id != 0
> > };
> >
> > Again 'type' as an overall API indicator should not exist, feature
> > flags need to have clear narrow meanings.
>
> currently the type is aimed to differentiate three usages:
>
> - kernel-managed I/O page table
> - user-managed I/O page table
> - shared I/O page table (e.g. with mm, or ept)
>
> we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> indicator? their difference is not about format.

To me "format" indicates how the IO translation information is
encoded. We potentially have two different encodings: from userspace
to the kernel and from the kernel to the hardware. But since this is
the userspace API, it's only the userspace to kernel one that matters
here.

In that sense, KERNEL, is a "format": we encode the translation
information as a series of IOMAP operations to the kernel, rather than
as an in-memory structure.

> > This does both of David's suggestions at once. If quemu wants the 1G
> > limited region it could specify max_iova_hint = 1G, if it wants the
> > extend 64bit region with the hole it can give either the high base or
> > a large max_iova_hint. format/format_flags allows a further
>
> Dave's links didn't answer one puzzle from me. Does PPC needs accurate
> range information or be ok with a large range including holes (then let
> the kernel to figure out where the holes locate)?

I need more specifics to answer that. Are you talking from a
userspace PoV, a guest kernel's or the host kernel's? In general I
think requiring userspace to locate and work around holes is a bad
idea. If userspace requests a range, it should get *all* of that
range.

The ppc case is further complicated because there are multiple ranges
and each range could have separate IO page tables. In practice
non-kernel managed IO pagetables are likely to be hard on ppc (or at
least rely on firmware/hypervisor interfaces which don't exist yet,
AFAIK). But even then, the underlying hardware page table format can
affect the minimum pagesize of each range, which could be different.

How all of this interacts with PASIDs I really haven't figured out.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-01 06:33:28

by David Gibson

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Wed, Sep 22, 2021 at 11:09:11AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 22, 2021 at 03:40:25AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Wednesday, September 22, 2021 1:45 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > allocating an IOASID, userspace is expected to specify the type and
> > > > format information for the target I/O page table.
> > > >
> > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > semantics. For this type the user should specify the addr_width of
> > > > the I/O address space and whether the I/O page table is created in
> > > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > > as the false setting requires additional contract with KVM on handling
> > > > WBINVD emulation, which can be added later.
> > > >
> > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > for what formats can be specified when allocating an IOASID.
> > > >
> > > > Open:
> > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > > Per previous discussion they can also use vfio type1v2 as long as there
> > > > is a way to claim a specific iova range from a system-wide address space.
> > > > This requirement doesn't sound PPC specific, as addr_width for pci
> > > devices
> > > > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > > adopted this design yet. We hope to have formal alignment in v1
> > > discussion
> > > > and then decide how to incorporate it in v2.
> > >
> > > I think the request was to include a start/end IO address hint when
> > > creating the ios. When the kernel creates it then it can return the
> >
> > is the hint single-range or could be multiple-ranges?
>
> David explained it here:
>
> https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/

Apparently not well enough. I've attempted again in this thread.

> qemu needs to be able to choose if it gets the 32 bit range or 64
> bit range.

No. qemu needs to supply *both* the 32-bit and 64-bit range to its
guest, and therefore needs to request both from the host.

Or rather, it *might* need to supply both. It will supply just the
32-bit range by default, but the guest can request the 64-bit range
and/or remove and resize the 32-bit range via hypercall interfaces.
Vaguely recent Linux guests certainly will request the 64-bit range in
addition to the default 32-bit range.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-01 06:35:44

by David Gibson

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Thu, Sep 23, 2021 at 12:22:23PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Thursday, September 23, 2021 8:07 PM
> >
> > On Thu, Sep 23, 2021 at 09:14:58AM +0000, Tian, Kevin wrote:
> >
> > > currently the type is aimed to differentiate three usages:
> > >
> > > - kernel-managed I/O page table
> > > - user-managed I/O page table
> > > - shared I/O page table (e.g. with mm, or ept)
> >
> > Creating a shared ios is something that should probably be a different
> > command.
>
> why? I didn't understand the criteria here...
>
> >
> > > we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> > > indicator? their difference is not about format.
> >
> > Format should be
> >
> > FORMAT_KERNEL/FORMAT_INTEL_PTE_V1/FORMAT_INTEL_PTE_V2/etc
>
> INTEL_PTE_V1/V2 are formats. Why is kernel-managed called a format?
>
> >
> > > Dave's links didn't answer one puzzle from me. Does PPC needs accurate
> > > range information or be ok with a large range including holes (then let
> > > the kernel to figure out where the holes locate)?
> >
> > My impression was it only needed a way to select between the two
> > different cases as they are exclusive. I'd see this API as being a
> > hint and userspace should query the exact ranges to learn what was
> > actually created.
>
> yes, the user can query the permitted range using DEVICE_GET_INFO.
> But in the end if the user wants two separate regions, I'm afraid that
> the underlying iommu driver wants to know the exact info. iirc PPC
> has one global system address space shared by all devices.

I think certain POWER models do this, yes, there's *protection*
between DMAs from different devices, but you can't translate the same
address to different places for different devices. I *think* that's a
firmware/hypervisor convention rather than a hardware limitation, but
I'm not entirely sure. We don't do things this way when emulating the
POWER vIOMMU in POWER, but PowerVM might and we still have to deal
with that when running as a PowerVM guest.

> It is possible
> that the user may want to claim range-A and range-C, with range-B
> in-between but claimed by another user. Then simply using one hint
> range [A-lowend, C-highend] might not work.
>
> >
> > > > device-specific escape if more specific customization is needed and is
> > > > needed to specify user space page tables anyhow.
> > >
> > > and I didn't understand the 2nd link. How does user-managed page
> > > table jump into this range claim problem? I'm getting confused...
> >
> > PPC could also model it using a FORMAT_KERNEL_PPC_X,
> > FORMAT_KERNEL_PPC_Y
> > though it is less nice..
>
> yes PPC can use different format, but I didn't understand why it is
> related user-managed page table which further requires nesting. sound
> disconnected topics here...
>
> >
> > > > Yes, ioas_id should always be the xarray index.
> > > >
> > > > PASID needs to be called out as PASID or as a generic "hw description"
> > > > blob.
> > >
> > > ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid?
> >
> > ARM *does* need PASID! PASID is the label of the DMA on the PCI bus,
> > and it MUST be exposed in that format to be programmed into the PCI
> > device itself.
>
> In the entire discussion in previous design RFC, I kept an impression that
> ARM-equivalent PASID is called SSID. If we can use PASID as a general
> term in iommufd context, definitely it's much better!
>
> >
> > All of this should be able to support a userspace, like DPDK, creating
> > a PASID on its own without any special VFIO drivers.
> >
> > - Open iommufd
> > - Attach the vfio device FD
> > - Request a PASID device id
> > - Create an ios against the pasid device id
> > - Query the ios for the PCI PASID #
> > - Program the HW to issue TLPs with the PASID
>
> this all makes me very confused, and completely different from what
> we agreed in previous v2 design proposal:
>
> - open iommufd
> - create an ioas
> - attach vfio device to ioasid, with vPASID info
> * vfio converts vPASID to pPASID and then call iommufd_device_attach_ioasid()
> * the latter then installs ioas to the IOMMU with RID/PASID
>
> >
> > > and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the
> > > kernel. Do we want to clear this confusion? Or possibly it's fine because
> > > ioas_id is never used outside of iommufd and iommufd doesn't directly
> > > call ioasid_alloc() from ioasid.c?
> >
> > As long as it is ioas_id and ioasid it is probably fine..
>
> let's align with others in a few hours.
>
> >
> > > > kvm's API to program the vPASID translation table should probably take
> > > > in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
> > > > information using an in-kernel API. Userspace shouldn't have to
> > > > shuttle it around.
> > >
> > > the vPASID info is carried in VFIO_DEVICE_ATTACH_IOASID uAPI.
> > > when kvm calls iommufd with above tuple, vPASID->pPASID is
> > > returned to kvm. So we still need a generic blob to represent
> > > vPASID in the uAPI.
> >
> > I think you have to be clear about what the value is being used
> > for. Is it an IOMMU page table handle or is it a PCI PASID value?
> >
> > AFAICT I think it is the former in the Intel scheme as the "vPASID" is
> > really about presenting a consistent IOMMU handle to the guest across
> > migration, it is not the value that shows up on the PCI bus.
> >
>
> It's the former. But vfio driver needs to maintain vPASID->pPASID
> translation in the mediation path, since what guest programs is vPASID.
>
> Thanks
> Kevin
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Attachments:
(No filename) (5.80 kB)
signature.asc (849.00 B)
Download all attachments

2021-10-01 07:02:31

by David Gibson

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Wed, Sep 22, 2021 at 03:40:25AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, September 22, 2021 1:45 AM
> >
> > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > This patch adds an IOASID allocation/free interface per iommufd. When
> > > allocating an IOASID, userspace is expected to specify the type and
> > > format information for the target I/O page table.
> > >
> > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > semantics. For this type the user should specify the addr_width of
> > > the I/O address space and whether the I/O page table is created in
> > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > as the false setting requires additional contract with KVM on handling
> > > WBINVD emulation, which can be added later.
> > >
> > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > for what formats can be specified when allocating an IOASID.
> > >
> > > Open:
> > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > Per previous discussion they can also use vfio type1v2 as long as there
> > > is a way to claim a specific iova range from a system-wide address space.
> > > This requirement doesn't sound PPC specific, as addr_width for pci devices
> > > can also be represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > adopted this design yet. We hope to have formal alignment in v1 discussion
> > > and then decide how to incorporate it in v2.
> >
> > I think the request was to include a start/end IO address hint when
> > creating the ios. When the kernel creates it then it can return the
>
> is the hint a single range, or could it be multiple ranges?
>
> > actual geometry including any holes via a query.
>
> I'd like to see a detailed flow from David on how the uAPI works today with
> the existing spapr driver and what exact changes he'd like to make to this
> proposed interface. The above info is still insufficient for us to think
> about the right solution.
>
> >
> > > - Currently the ioasid term has already been used in the kernel
> > > (drivers/iommu/ioasid.c) to represent the hardware I/O address space ID
> > > on the wire. It covers both PCI PASID (Process Address Space ID) and
> > > ARM SSID (Sub-Stream ID). We need to find a way to resolve the naming
> > > conflict between the hardware ID and software handle. One option is to
> > > rename the existing ioasid to be pasid or ssid, given their full names
> > > still sound generic. Appreciate more thoughts on this open!
> >
> > ioas works well here I think. Use ioas_id to refer to the xarray
> > index.
>
> What about when introducing pasid to this uAPI? Then use ioas_id
> for the xarray index and ioasid to represent pasid/ssid?

This is probably obsoleted by Jason's other comments, but definitely
don't use "ioas_id" and "ioasid" to mean different things. Having
meaningfully different things distinguished only by an underscore is
not a good idea.

> At this point
> the software handle and hardware id are mixed together thus need
> a clear terminology to differentiate them.
>
>
> Thanks
> Kevin
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-01 13:43:37

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Fri, Oct 01, 2021 at 04:19:22PM +1000, [email protected] wrote:
> On Wed, Sep 22, 2021 at 11:09:11AM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 22, 2021 at 03:40:25AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <[email protected]>
> > > > Sent: Wednesday, September 22, 2021 1:45 AM
> > > >
> > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > > [commit description snipped, quoted in full earlier in the thread]
> > > >
> > > > I think the request was to include a start/end IO address hint when
> > > > creating the ios. When the kernel creates it then it can return the
> > >
> > > is the hint a single range, or could it be multiple ranges?
> >
> > David explained it here:
> >
> > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
>
> Apparently not well enough. I've attempted again in this thread.
>
> > qemu needs to be able to choose if it gets the 32-bit range or 64-bit
> > range.
>
> No. qemu needs to supply *both* the 32-bit and 64-bit range to its
> guest, and therefore needs to request both from the host.

As I understood your remarks, each IOAS can only be one of the formats
as they have different PTE layouts. So here I meant that qemu needs to
be able to pick *for each IOAS* which of the two formats it is.

> Or rather, it *might* need to supply both. It will supply just the
> 32-bit range by default, but the guest can request the 64-bit range
> and/or remove and resize the 32-bit range via hypercall interfaces.
> Vaguely recent Linux guests certainly will request the 64-bit range in
> addition to the default 32-bit range.

And this would result in two different IOAS objects

Jason

2021-10-01 16:59:23

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Fri, Oct 01, 2021 at 04:13:58PM +1000, David Gibson wrote:
> On Tue, Sep 21, 2021 at 02:44:38PM -0300, Jason Gunthorpe wrote:
> > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > [commit description snipped, quoted in full earlier in the thread]
> >
> > I think the request was to include a start/end IO address hint when
> > creating the ios. When the kernel creates it then it can return the
> > actual geometry including any holes via a query.
>
> So part of the point of specifying start/end addresses is that
> explicitly querying holes shouldn't be necessary: if the requested
> range crosses a hole, it should fail. If you didn't really need all
> that range, you shouldn't have asked for it.
>
> Which means these aren't really "hints" but optionally supplied
> constraints.

We have to be very careful here, there are two very different use
cases. When we are talking about the generic API I am mostly
interested to see that applications like DPDK can use this API and be
portable to any IOMMU HW the kernel supports. I view the fact that
there is VFIO PPC specific code in DPDK as a failing of the kernel to
provide a HW abstraction.

This means we cannot define an input that has a magic HW specific
value. DPDK can never provide that portably. Thus all these kinds of
inputs in the generic API need to be hints, if they exist at all.

As an 'address space size hint'/'address space start hint' is generic,
useful, and providable by DPDK, I think it is OK. PPC can use
it to pick which of the two page table formats to use for this IOAS if
it wants.

The second use case is when we have a userspace driver for a specific
HW IOMMU. Eg a vIOMMU in qemu doing specific PPC/ARM/x86 acceleration.
We can look here for things to make general, but I would expect a
fairly high bar. Instead, I would rather see the userspace driver
communicate with the kernel driver in its own private language, so
that the entire functionality of the unique HW can be used.

So, when it comes to providing exact ranges as an input parameter we
have to decide if that is done as some additional general data, or if
it should be part of a IOAS_FORMAT_KERNEL_PPC. In this case I suggest
the guiding factor should be if every single IOMMU implementation can
be updated to support the value.

Jason


2021-10-02 06:18:55

by David Gibson

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Fri, Oct 01, 2021 at 09:25:05AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 01, 2021 at 04:19:22PM +1000, [email protected] wrote:
> > On Wed, Sep 22, 2021 at 11:09:11AM -0300, Jason Gunthorpe wrote:
> > > On Wed, Sep 22, 2021 at 03:40:25AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe <[email protected]>
> > > > > Sent: Wednesday, September 22, 2021 1:45 AM
> > > > >
> > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > > > [commit description snipped, quoted in full earlier in the thread]
> > > > >
> > > > > I think the request was to include a start/end IO address hint when
> > > > > creating the ios. When the kernel creates it then it can return the
> > > >
> > > > is the hint a single range, or could it be multiple ranges?
> > >
> > > David explained it here:
> > >
> > > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
> >
> > Apparently not well enough. I've attempted again in this thread.
> >
> > > qemu needs to be able to choose if it gets the 32-bit range or 64-bit
> > > range.
> >
> > No. qemu needs to supply *both* the 32-bit and 64-bit range to its
> > guest, and therefore needs to request both from the host.
>
> As I understood your remarks, each IOAS can only be one of the formats
> as they have different PTE layouts. So here I meant that qemu needs to
> be able to pick *for each IOAS* which of the two formats it is.

No. Both windows are in the same IOAS. A device could do DMA
simultaneously to both windows. More realistically, a 64-bit DMA
capable and a non-64-bit DMA capable device could be in the same group
and be doing DMAs to different windows simultaneously.

> > Or rather, it *might* need to supply both. It will supply just the
> > 32-bit range by default, but the guest can request the 64-bit range
> > and/or remove and resize the 32-bit range via hypercall interfaces.
> > Vaguely recent Linux guests certainly will request the 64-bit range in
> > addition to the default 32-bit range.
>
> And this would result in two different IOAS objects

There might be two different IOAS objects for setup, but at some point
they need to be combined into one IOAS to which the device is actually
attached.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-02 12:42:21

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Sat, Oct 02, 2021 at 02:21:38PM +1000, [email protected] wrote:

> > > No. qemu needs to supply *both* the 32-bit and 64-bit range to its
> > > guest, and therefore needs to request both from the host.
> >
> > As I understood your remarks, each IOAS can only be one of the formats
> > as they have different PTE layouts. So here I meant that qemu needs to
> > be able to pick *for each IOAS* which of the two formats it is.
>
> No. Both windows are in the same IOAS. A device could do DMA
> simultaneously to both windows.

Sure, but that doesn't force us to model it as one IOAS in the
iommufd. A while back you were talking about using nesting and 3
IOAS's, right?

1, 2 or 3 IOAS's seems like a decision we can make.

PASID support will already require that a device can be multi-bound to
many IOAS's; couldn't PPC do the same with the windows?

Jason

2021-10-11 11:32:37

by David Gibson

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Fri, Oct 01, 2021 at 09:22:25AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 01, 2021 at 04:13:58PM +1000, David Gibson wrote:
> > On Tue, Sep 21, 2021 at 02:44:38PM -0300, Jason Gunthorpe wrote:
> > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > [commit description snipped, quoted in full earlier in the thread]
> > >
> > > I think the request was to include a start/end IO address hint when
> > > creating the ios. When the kernel creates it then it can return the
> > > actual geometry including any holes via a query.
> >
> > So part of the point of specifying start/end addresses is that
> > explicitly querying holes shouldn't be necessary: if the requested
> > range crosses a hole, it should fail. If you didn't really need all
> > that range, you shouldn't have asked for it.
> >
> > Which means these aren't really "hints" but optionally supplied
> > constraints.
>
> We have to be very careful here, there are two very different use
> cases. When we are talking about the generic API I am mostly
> interested to see that applications like DPDK can use this API and be
> portable to any IOMMU HW the kernel supports. I view the fact that
> there is VFIO PPC specific code in DPDK as a failing of the kernel to
> provide a HW abstraction.

I would agree. At the time we were making this, we thought there were
irreconcilable differences between what could be done with the x86 vs
ppc IOMMUs. Turns out we just didn't think it through hard enough to
find a common model.

> This means we cannot define an input that has a magic HW specific
> value.

I'm not entirely sure what you mean by that.

> DPDK can never provide that portably. Thus all these kinds of
> inputs in the generic API need to be hints, if they exist at all.

I don't follow your reasoning. First, note that in qemu these values
are *target* hardware specific, not *host* hardware specific. If
those requests aren't honoured, qemu cannot faithfully emulate the
target hardware and has to fail. That's what I mean when I say this
is a constraint, not a hint.

But when I say the constraint is optional, I mean that things which
don't have that requirement - like DPDK - shouldn't apply the
constraint.

> As 'address space size hint'/'address space start hint' is both
> generic, useful, and providable by DPDK I think it is OK.

Size is certainly providable, and probably useful. For DPDK, I don't
think start is useful.

> PPC can use
> it to pick which of the two page table formats to use for this IOAS if
> it wants.

Clarification: it's not that each window has a specific page table
format. The two windows are independent of each other, which means
you can separately select the page table format for each one (although
the 32-bit one generally won't be big enough that there's any point
selecting something other than a 1-level TCE table). When I say
format here, I basically mean number of levels and size of each level
- the IOPTE (a.k.a. TCE) format is the same in each case.
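As a rough back-of-envelope sketch (the 8-byte TCE entry size is an
assumption of this illustration), the 32-bit window is small enough that a
single-level table stays tiny:

```c
#include <stdint.h>

/* Sketch: bytes needed for a single-level TCE table, assuming an 8-byte
 * TCE (IOPTE) per IOMMU page. A 2GiB 32-bit window with 4KiB pages needs
 * 512K entries, i.e. a 4MiB table, small enough that extra levels buy
 * nothing, which is why the 32-bit window rarely needs more than 1 level. */
static uint64_t tce_table_bytes(uint64_t window_bytes, uint64_t page_bytes)
{
	return (window_bytes / page_bytes) * 8;
}
```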

> The second use case is when we have a userspace driver for a specific
> HW IOMMU. Eg a vIOMMU in qemu doing specific PPC/ARM/x86 acceleration.
> We can look here for things to make general, but I would expect a
> fairly high bar. Instead, I would rather see the userspace driver
> communicate with the kernel driver in its own private language, so
> that the entire functionality of the unique HW can be used.

I don't think we actually need to do this. Or rather, we might want
to do this for maximum performance in some cases, but I think we can
have something that at least usually works without having explicit
host == target logic for each case. I believe this can work (at least
when using kernel managed IO page tables) in a lot of cases even with
a different vIOMMU from the host IOMMU.

e.g. suppose the host is some x86 (or arm, or whatever) machine with
an IOMMU capable of translating any address from 0..2^60, with maybe
the exception of an IO hole somewhere between 2GiB and 4GiB.

qemu wants to emulate a PAPR vIOMMU, so it says (via interfaces yet to
be determined) that it needs an IOAS where things can be mapped in the
range 0..2GiB (for the 32-bit window) and 2^59..2^59+1TiB (for the
64-bit window).

Ideally the host /dev/iommu will say "ok!", since both those ranges
are within the 0..2^60 translated range of the host IOMMU, and don't
touch the IO hole. When the guest calls the IO mapping hypercalls,
qemu translates those into DMA_MAP operations, and since they're all
within the previously verified windows, they should work fine.
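The acceptance logic sketched above fits in a few lines. This is only a
model with invented names, since the /dev/iommu request interface is yet to
be determined: the host advertises a translatable range 0..limit with an IO
hole, and each requested window must fit the range and avoid the hole.

```c
#include <stdbool.h>
#include <stdint.h>

struct win { uint64_t start, len; };

/* Return true if the requested window fits entirely inside [0, limit)
 * and does not touch the IO hole [hole_start, hole_end). */
static bool window_ok(struct win w, uint64_t limit,
		      uint64_t hole_start, uint64_t hole_end)
{
	uint64_t end = w.start + w.len;	/* exclusive end */

	if (w.len == 0 || end < w.start || end > limit)
		return false;		/* empty, wraps, or beyond HW range */
	return end <= hole_start || w.start >= hole_end;
}
```

With the example numbers, both PAPR windows pass: the 32-bit window ends
exactly at the 2GiB hole, and the 64-bit window starts well above it.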

> So, when it comes to providing exact ranges as an input parameter we
> have to decide if that is done as some additional general data, or if
> it should be part of a IOAS_FORMAT_KERNEL_PPC. In this case I suggest
> the guiding factor should be if every single IOMMU implementation can
> be updated to support the value.

No, I don't think that needs to be a condition. I think it's
perfectly reasonable for a constraint to be given, and for the host
IOMMU to just say "no, I can't do that". But that does mean that each
of these values has to have an explicit way of userspace specifying "I
don't care", so that the kernel will select a suitable value for those
instead - that's what DPDK or other userspace would use nearly all the
time.
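One possible shape for such an "I don't care" encoding (purely hypothetical
field and constant names, not a proposed uAPI):

```c
#include <stdint.h>

/* Every optional constraint gets an explicit sentinel. DPDK-style users
 * leave the fields at the sentinel and let the kernel choose; qemu pins
 * exact windows and accepts failure if the HW cannot honour them. */

#define IOMMU_IOVA_ANY UINT64_MAX	/* sentinel: kernel picks a value */

struct ioas_alloc_req {
	uint64_t iova_start;		/* exact window start, or ANY */
	uint64_t iova_last;		/* exact window end, or ANY */
};

static int req_is_constrained(const struct ioas_alloc_req *req)
{
	return req->iova_start != IOMMU_IOVA_ANY ||
	       req->iova_last != IOMMU_IOVA_ANY;
}
```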

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-11 11:47:25

by David Gibson

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Sat, Oct 02, 2021 at 09:25:42AM -0300, Jason Gunthorpe wrote:
> On Sat, Oct 02, 2021 at 02:21:38PM +1000, [email protected] wrote:
>
> > > > No. qemu needs to supply *both* the 32-bit and 64-bit range to its
> > > > guest, and therefore needs to request both from the host.
> > >
> > > As I understood your remarks each IOAS can only be one of the formats
> > > as they have a different PTE layout. So here I ment that qmeu needs to
> > > be able to pick *for each IOAS* which of the two formats it is.
> >
> > No. Both windows are in the same IOAS. A device could do DMA
> > simultaneously to both windows.
>
> Sure, but that doesn't force us to model it as one IOAS in the
> iommufd. A while back you were talking about using nesting and 3
> IOAS's, right?
>
> 1, 2 or 3 IOAS's seems like a decision we can make.

Well, up to a point. We can decide how such a thing should be
constructed. However at some point there needs to exist an IOAS in
which both windows are mapped, whether it's directly or indirectly.
That's what the device will be attached to.

> PASID support will already require that a device can be multi-bound to
> many IOAS's, couldn't PPC do the same with the windows?

I don't see how that would make sense. The device has no awareness of
multiple windows the way it does of PASIDs. It just sends
transactions over the bus with the IOVAs it's told. If those IOVAs
lie within one of the windows, the IOMMU picks them up and translates
them. If they don't, it doesn't.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-11 12:54:22

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
> qemu wants to emulate a PAPR vIOMMU, so it says (via interfaces yet to
> be determined) that it needs an IOAS where things can be mapped in the
> range 0..2GiB (for the 32-bit window) and 2^59..2^59+1TiB (for the
> 64-bit window).
>
> Ideally the host /dev/iommu will say "ok!", since both those ranges
> are within the 0..2^60 translated range of the host IOMMU, and don't
> touch the IO hole. When the guest calls the IO mapping hypercalls,
> qemu translates those into DMA_MAP operations, and since they're all
> within the previously verified windows, they should work fine.

Seems like we don't need the negotiation part? The host kernel
communicates available IOVA ranges to userspace including holes (patch
17), and userspace can check that the ranges it needs are within the IOVA
space boundaries. That part is necessary for DPDK as well since it needs
to know about holes in the IOVA space where DMA wouldn't work as expected
(MSI doorbells for example). And there already is a negotiation happening,
when the host kernel rejects MAP ioctl outside the advertised area.

Thanks,
Jean

2021-10-11 17:21:15

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Mon, Oct 11, 2021 at 04:37:38PM +1100, [email protected] wrote:
> > PASID support will already require that a device can be multi-bound to
> > many IOAS's, couldn't PPC do the same with the windows?
>
> I don't see how that would make sense. The device has no awareness of
> multiple windows the way it does of PASIDs. It just sends
> transactions over the bus with the IOVAs it's told. If those IOVAs
> lie within one of the windows, the IOMMU picks them up and translates
> them. If they don't, it doesn't.

To my mind, that address-centric routing is awareness.

If the HW can attach multiple non-overlapping IOAS's to the same
device then the HW is routing to the correct IOAS by using the address
bits. This is not much different from the prior discussion we had
where we were thinking of the PASID as an 80-bit address.

The fact the PPC HW actually has multiple page table roots and those
roots even have different page tables layouts while still connected to
the same device suggests this is not even an unnatural modelling
approach...

Jason


2021-10-11 18:52:22

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:

> > This means we cannot define an input that has a magic HW specific
> > value.
>
> I'm not entirely sure what you mean by that.

I mean if you make a general property 'foo' that userspace must
specify correctly then your API isn't general anymore. Userspace must
know if it is A or B HW to set foo=A or foo=B.

Supported IOVA ranges are easily like that, as every IOMMU is
different. So DPDK shouldn't provide such specific or binding
information.

> No, I don't think that needs to be a condition. I think it's
> perfectly reasonable for a constraint to be given, and for the host
> IOMMU to just say "no, I can't do that". But that does mean that each
> of these values has to have an explicit way of userspace specifying "I
> don't care", so that the kernel will select a suitable value for those
> instead - that's what DPDK or other userspace would use nearly all the
> time.

My feeling is that qemu should be dealing with the host != target
case, not the kernel.

The kernel's job should be to expose the IOMMU HW it has, with all
features accessible, to userspace.

Qemu's job should be to have a userspace driver for each kernel IOMMU
and the internal infrastructure to make accelerated emulations for all
supported target IOMMUs.

In other words, it is not the kernel's job to provide target IOMMU
emulation.

The kernel should provide a truly generic "works everywhere" interface
that qemu/etc can rely on to implement the least accelerated emulation
path.

So when I see proposals to have "generic" interfaces that actually
require very HW specific setup, and cannot be used by a generic qemu
userspace driver, I think it breaks this model. If qemu needs to know
it is on PPC (as it does today with VFIO's PPC specific API) then it
may as well speak PPC specific language and forget about pretending to
be generic.

This approach is grounded in 15 years of trying to build these
user/kernel split HW subsystems (particularly RDMA) where it has
become painfully obvious that the kernel is the worst place to try and
wrangle really divergent HW into a "common" uAPI.

This is because the kernel/user boundary is fixed. Introducing
anything generic here requires a lot of time, thought, arguing and
risk. Usually it ends up being done wrong (like the PPC specific
ioctls, for instance) and when this happens we can't learn and adapt,
we are stuck with stable uABI forever.

Exposing a device's native programming interface is much simpler. Each
device is fixed, defined and someone can sit down and figure out how
to expose it. Then that is it, it doesn't need revisiting, it doesn't
need harmonizing with a future slightly different device, it just
stays as is.

The cost, is that there must be a userspace driver component for each
HW piece - which we are already paying here!

> Ideally the host /dev/iommu will say "ok!", since both those ranges
> are within the 0..2^60 translated range of the host IOMMU, and don't
> touch the IO hole. When the guest calls the IO mapping hypercalls,
> qemu translates those into DMA_MAP operations, and since they're all
> within the previously verified windows, they should work fine.

For instance, we are going to see HW with nested page tables, user
space owned page tables and even kernel-bypass fast IOTLB
invalidation.

In that world does it even make sense for qemu to use slow DMA_MAP
ioctls for emulation?

A userspace framework in qemu can make these optimizations and is
also necessarily HW specific as the host page table is HW specific.

Jason

2021-10-11 23:44:07

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Mon, Oct 11, 2021 at 09:49:57AM +0100, Jean-Philippe Brucker wrote:

> Seems like we don't need the negotiation part? The host kernel
> communicates available IOVA ranges to userspace including holes (patch
> 17), and userspace can check that the ranges it needs are within the IOVA
> space boundaries. That part is necessary for DPDK as well since it needs
> to know about holes in the IOVA space where DMA wouldn't work as expected
> (MSI doorbells for example).

I haven't looked super closely at DPDK, but the other simple VFIO app
I am aware of struggled to properly implement this semantic (indeed,
it wasn't even clear to the author that this was needed).

It requires interval tree logic inside the application which is not a
trivial algorithm to implement in C.

I do wonder if the "simple" interface should have an option more like
the DMA API where userspace just asks to DMA map some user memory and
gets back the dma_addr_t to use. Kernel manages the allocation
space/etc.
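
As a back-of-the-envelope illustration of the validation in question, a
sketch with made-up names (struct iova_range and iova_ok are not from
any proposed uAPI): before issuing a map, the application checks the
span against the ranges the kernel advertised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One allowed IOVA range as the kernel might advertise it
 * (cf. vfio's iova range capability); names are illustrative. */
struct iova_range {
	uint64_t start;	/* first valid IOVA */
	uint64_t end;	/* last valid IOVA, inclusive */
};

/*
 * Return true if [iova, iova + len - 1] lies entirely inside one of
 * the advertised ranges. An application with many ranges would keep
 * them in an interval tree; for the handful of ranges platforms
 * typically report, a linear scan over a sorted array is enough.
 */
static bool iova_ok(const struct iova_range *r, size_t n,
		    uint64_t iova, uint64_t len)
{
	uint64_t last = iova + len - 1;

	if (len == 0 || last < iova)	/* reject empty and wrapping spans */
		return false;
	for (size_t i = 0; i < n; i++)
		if (iova >= r[i].start && last <= r[i].end)
			return true;
	return false;
}
```

Even this trivial version has the empty-span and wrap-around corner
cases that apps get wrong, which rather supports hiding it behind a
DMA-API-style interface.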

Jason

2021-10-12 08:38:49

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Mon, Oct 11, 2021 at 08:38:17PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 09:49:57AM +0100, Jean-Philippe Brucker wrote:
>
> > Seems like we don't need the negotiation part? The host kernel
> > communicates available IOVA ranges to userspace including holes (patch
> > 17), and userspace can check that the ranges it needs are within the IOVA
> > space boundaries. That part is necessary for DPDK as well since it needs
> > to know about holes in the IOVA space where DMA wouldn't work as expected
> > (MSI doorbells for example).
>
> I haven't looked super closely at DPDK, but the other simple VFIO app
> I am aware of struggled to properly implement this semantic (indeed,
> it wasn't even clear to the author that this was needed).
>
> It requires interval tree logic inside the application which is not a
> trivial algorithm to implement in C.
>
> I do wonder if the "simple" interface should have an option more like
> the DMA API where userspace just asks to DMA map some user memory and
> gets back the dma_addr_t to use. Kernel manages the allocation
> space/etc.

Agreed, it's tempting to use IOVA = VA but the two spaces aren't
necessarily compatible. An extension that plugs into the IOVA allocator
could be useful to userspace drivers.

Thanks,
Jean

2021-10-13 07:03:28

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

> From: David Gibson
> Sent: Friday, October 1, 2021 2:11 PM
>
> On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > This patch adds IOASID allocation/free interface per iommufd. When
> > allocating an IOASID, userspace is expected to specify the type and
> > format information for the target I/O page table.
> >
> > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > semantics. For this type the user should specify the addr_width of
> > the I/O address space and whether the I/O page table is created in
> > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > as the false setting requires additional contract with KVM on handling
> > WBINVD emulation, which can be added later.
> >
> > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > for what formats can be specified when allocating an IOASID.
> >
> > Open:
> > - Devices on PPC platform currently use a different iommu driver in vfio.
> > Per previous discussion they can also use vfio type1v2 as long as there
> > is a way to claim a specific iova range from a system-wide address space.
> > This requirement doesn't sound PPC specific, as addr_width for pci
> > devices can be also represented by a range [0, 2^addr_width-1]. This
> > RFC hasn't adopted this design yet. We hope to have formal alignment
> > in v1 discussion and then decide how to incorporate it in v2.
>
> Ok, there are several things we need for ppc. None of which are
> inherently ppc specific and some of which will I think be useful for
> most platforms. So, starting from most general to most specific
> here's basically what's needed:
>
> 1. We need to represent the fact that the IOMMU can only translate
> *some* IOVAs, not a full 64-bit range. You have the addr_width
> already, but I'm not entirely sure if the translatable range on ppc
> (or other platforms) is always a power-of-2 size. It usually will
> be, of course, but I'm not sure that's a hard requirement. So
> using a size/max rather than just a number of bits might be safer.
>
> I think basically every platform will need this. Most platforms
> don't actually implement full 64-bit translation in any case, but
> rather some smaller number of bits that fits their page table
> format.
>
> 2. The translatable range of IOVAs may not begin at 0. So we need to
> advertise to userspace what the base address is, as well as the
> size. POWER's main IOVA range begins at 2^59 (at least on the
> models I know about).
>
> I think a number of platforms are likely to want this, though I
> couldn't name them apart from POWER. Putting the translated IOVA
> window at some huge address is a pretty obvious approach to making
> an IOMMU which can translate a wide address range without colliding
> with any legacy PCI addresses down low (the IOMMU can check if this
> transaction is for it by just looking at some high bits in the
> address).
>
> 3. There might be multiple translatable ranges. So, on POWER the
> IOMMU can typically translate IOVAs from 0..2GiB, and also from
> 2^59..2^59+<RAM size>. The two ranges have completely separate IO
> page tables, with (usually) different layouts. (The low range will
> nearly always be a single-level page table with 4kiB or 64kiB
> entries, the high one will be multiple levels depending on the size
> of the range and pagesize).
>
> This may be less common, but I suspect POWER won't be the only
> platform to do something like this. As above, using a high range
> is a pretty obvious approach, but clearly won't handle older
> devices which can't do 64-bit DMA. So adding a smaller range for
> those devices is again a pretty obvious solution. Any platform
> with an "IO hole" can be treated as having two ranges, one below
> the hole and one above it (although in that case they may well not
> have separate page tables).

1-3 are common on all platforms with fixed reserved ranges. Current
vfio already reports permitted iova ranges to user via
VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to construct
maps only in those ranges. iommufd can follow the same logic for the
baseline uAPI.

For above cases a [base, max] hint can be provided by the user per
Jason's recommendation. It is a hint as no additional restriction is
imposed, since the kernel only cares about no violation on permitted
ranges that it reports to the user. Underlying iommu driver may use
this hint to optimize e.g. deciding how many levels are used for
the kernel-managed page table according to max addr.
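
To illustrate that last optimization, a minimal sketch assuming an
x86-style layout with 4KiB pages and 9 translation bits per level (the
constants and the pt_levels name are illustrative, not from any real
driver):

```c
#include <assert.h>

/*
 * Number of page-table levels needed to translate addr_width bits,
 * assuming 4KiB pages (12 offset bits) and 9 index bits per level.
 * An iommu driver given a max-addr hint could size its kernel-managed
 * table this way instead of always building the full depth.
 */
static unsigned int pt_levels(unsigned int addr_width)
{
	unsigned int va_bits = addr_width > 12 ? addr_width - 12 : 0;

	return (va_bits + 8) / 9;	/* round up to whole levels */
}
```

e.g. under these assumptions a max-addr hint below 2^39 lets the driver
build a 3-level table rather than the 4 or 5 levels the hardware maximum
would require.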

>
> 4. The translatable ranges might not be fixed. On ppc that 0..2GiB
> and 2^59..whatever ranges are kernel conventions, not specified by
> the hardware or firmware. When running as a guest (which is the
> normal case on POWER), there are explicit hypercalls for
> configuring the allowed IOVA windows (along with pagesize, number
> of levels etc.). At the moment it is fixed in hardware that there
> are only 2 windows, one starting at 0 and one at 2^59 but there's
> no inherent reason those couldn't also be configurable.

If the ppc iommu driver needs to configure hardware according to the
specified ranges, then it requires more than a hint and is better
conveyed via a ppc specific cmd as Jason suggested.

>
> This will probably be rarer, but I wouldn't be surprised if it
> appears on another platform. If you were designing an IOMMU ASIC
> for use in a variety of platforms, making the base address and size
> of the translatable range(s) configurable in registers would make
> sense.
>
>
> Now, for (3) and (4), representing lists of windows explicitly in
> ioctl()s is likely to be pretty ugly. We might be able to avoid that,
> for at least some of the interfaces, by using the nested IOAS stuff.
> One way or another, though, the IOASes which are actually attached to
> devices need to represent both windows.
>
> e.g.
> Create a "top-level" IOAS <A> representing the device's view. This
> would be either TYPE_KERNEL or maybe a special type. Into that you'd
> make just two iomappings one for each of the translation windows,
> pointing to IOASes <B> and <C>. IOAS <B> and <C> would have a single
> window, and would represent the IO page tables for each of the
> translation windows. These could be either TYPE_KERNEL or (say)
> TYPE_POWER_TCE for a user managed table. Well.. in theory, anyway.
> The way paravirtualization on POWER is done might mean user managed
> tables aren't really possible for other reasons, but that's not
> relevant here.
>
> The next problem here is that we don't want userspace to have to do
> different things for POWER, at least not for the easy case of a
> userspace driver that just wants a chunk of IOVA space and doesn't
> really care where it is.
>
> In general I think the right approach to handle that is to
> de-emphasize "info" or "query" interfaces. We'll probably still need
> some for debugging and edge cases, but in the normal case userspace
> should just specify what it *needs* and (ideally) no more with
> optional hints, and the kernel will either supply that or fail.
>
> e.g. A simple userspace driver would simply say "I need an IOAS with
> at least 1GiB of IOVA space" and the kernel says "Ok, you can use
> 2^59..2^59+2GiB". qemu, emulating the POWER vIOMMU might say "I need
> an IOAS with translatable addresses from 0..2GiB with 4kiB page size
> and from 2^59..2^59+1TiB with 64kiB page size" and the kernel would
> either say "ok", or "I can't do that".
>

This doesn't work for other platforms, which don't mandate a vIOMMU
as ppc does. For those platforms, the initial address space
is GPA (for the vm case) and Qemu needs to mark those GPA holes as
reserved in the firmware structure. I don't think anyone wants a tedious
try-and-fail process to figure out how many holes exist in a 64bit
address space...

Thanks
Kevin

2021-10-13 07:10:08

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

> From: Jean-Philippe Brucker <[email protected]>
> Sent: Monday, October 11, 2021 4:50 PM
>
> On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
> > qemu wants to emulate a PAPR vIOMMU, so it says (via interfaces yet to
> > be determined) that it needs an IOAS where things can be mapped in the
> > range 0..2GiB (for the 32-bit window) and 2^59..2^59+1TiB (for the
> > 64-bit window).
> >
> > Ideally the host /dev/iommu will say "ok!", since both those ranges
> > are within the 0..2^60 translated range of the host IOMMU, and don't
> > touch the IO hole. When the guest calls the IO mapping hypercalls,
> > qemu translates those into DMA_MAP operations, and since they're all
> > within the previously verified windows, they should work fine.
>
> Seems like we don't need the negotiation part? The host kernel
> communicates available IOVA ranges to userspace including holes (patch
> 17), and userspace can check that the ranges it needs are within the IOVA
> space boundaries. That part is necessary for DPDK as well since it needs
> to know about holes in the IOVA space where DMA wouldn't work as
> expected (MSI doorbells for example). And there already is a negotiation
> happening,
> when the host kernel rejects MAP ioctl outside the advertised area.
>

Agree. This can cover the ppc platforms with fixed reserved ranges.
It's meaningless to have the user further tell the kernel that it is
only willing to use a subset of the advertised area. For ppc platforms
with dynamic reserved ranges which are claimed by the user, we can
leave them out of the common set and handle them in a different way,
either leveraging ioas nesting if applicable or having a ppc specific cmd.

Thanks
Kevin

2021-10-13 07:17:08

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

> From: Jean-Philippe Brucker <[email protected]>
> Sent: Tuesday, October 12, 2021 4:34 PM
>
> On Mon, Oct 11, 2021 at 08:38:17PM -0300, Jason Gunthorpe wrote:
> > On Mon, Oct 11, 2021 at 09:49:57AM +0100, Jean-Philippe Brucker wrote:
> >
> > > Seems like we don't need the negotiation part? The host kernel
> > > communicates available IOVA ranges to userspace including holes (patch
> > > 17), and userspace can check that the ranges it needs are within the IOVA
> > > space boundaries. That part is necessary for DPDK as well since it needs
> > > to know about holes in the IOVA space where DMA wouldn't work as
> > > expected (MSI doorbells for example).
> >
> > I haven't looked super closely at DPDK, but the other simple VFIO app
> > I am aware of struggled to properly implement this semantic (indeed,
> > it wasn't even clear to the author that this was needed).
> >
> > It requires interval tree logic inside the application which is not a
> > trivial algorithm to implement in C.
> >
> > I do wonder if the "simple" interface should have an option more like
> > the DMA API where userspace just asks to DMA map some user memory and
> > gets back the dma_addr_t to use. Kernel manages the allocation
> > space/etc.
>
> Agreed, it's tempting to use IOVA = VA but the two spaces aren't
> necessarily compatible. An extension that plugs into the IOVA allocator
> could be useful to userspace drivers.
>

Makes sense. We can have a flag in IOMMUFD_MAP_DMA to tell whether
the user provides vaddr or expects the kernel to allocate and return.
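
To sketch what that could look like (the flag name and the toy
single-window allocator below are hypothetical, not proposed uAPI):
with the flag set, the kernel would pick the IOVA and return it,
DMA-API style.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag: when set, the kernel picks the IOVA and returns
 * it; when clear, the user supplies the IOVA as today. */
#define HYP_MAP_FLAG_ALLOC_IOVA	(1u << 0)

/* Toy bump allocator over one permitted window, standing in for the
 * kernel-side allocation; real code would track frees and support
 * multiple windows. */
struct iova_window {
	uint64_t next;	/* next free IOVA */
	uint64_t end;	/* one past the last valid IOVA */
};

/* Returns the allocated IOVA, or UINT64_MAX on exhaustion. */
static uint64_t iova_alloc(struct iova_window *w, uint64_t len)
{
	uint64_t iova;

	if (len == 0 || w->end - w->next < len)
		return UINT64_MAX;
	iova = w->next;
	w->next += len;
	return iova;
}
```

The point is that the interval bookkeeping then lives once in the
kernel rather than in every application.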

Thanks
Kevin

2021-10-14 06:23:48

by David Gibson

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Wed, Oct 13, 2021 at 07:00:58AM +0000, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Friday, October 1, 2021 2:11 PM
> >
> > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > This patch adds IOASID allocation/free interface per iommufd. When
> > > allocating an IOASID, userspace is expected to specify the type and
> > > format information for the target I/O page table.
> > >
> > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > semantics. For this type the user should specify the addr_width of
> > > the I/O address space and whether the I/O page table is created in
> > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > as the false setting requires additional contract with KVM on handling
> > > WBINVD emulation, which can be added later.
> > >
> > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > for what formats can be specified when allocating an IOASID.
> > >
> > > Open:
> > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > Per previous discussion they can also use vfio type1v2 as long as there
> > > is a way to claim a specific iova range from a system-wide address space.
> > > This requirement doesn't sound PPC specific, as addr_width for pci
> > > devices can be also represented by a range [0, 2^addr_width-1]. This
> > > RFC hasn't adopted this design yet. We hope to have formal alignment
> > > in v1 discussion and then decide how to incorporate it in v2.
> >
> > Ok, there are several things we need for ppc. None of which are
> > inherently ppc specific and some of which will I think be useful for
> > most platforms. So, starting from most general to most specific
> > here's basically what's needed:
> >
> > 1. We need to represent the fact that the IOMMU can only translate
> > *some* IOVAs, not a full 64-bit range. You have the addr_width
> > already, but I'm not entirely sure if the translatable range on ppc
> > (or other platforms) is always a power-of-2 size. It usually will
> > be, of course, but I'm not sure that's a hard requirement. So
> > using a size/max rather than just a number of bits might be safer.
> >
> > I think basically every platform will need this. Most platforms
> > don't actually implement full 64-bit translation in any case, but
> > rather some smaller number of bits that fits their page table
> > format.
> >
> > 2. The translatable range of IOVAs may not begin at 0. So we need to
> > advertise to userspace what the base address is, as well as the
> > size. POWER's main IOVA range begins at 2^59 (at least on the
> > models I know about).
> >
> > I think a number of platforms are likely to want this, though I
> > couldn't name them apart from POWER. Putting the translated IOVA
> > window at some huge address is a pretty obvious approach to making
> > an IOMMU which can translate a wide address range without colliding
> > with any legacy PCI addresses down low (the IOMMU can check if this
> > transaction is for it by just looking at some high bits in the
> > address).
> >
> > 3. There might be multiple translatable ranges. So, on POWER the
> > IOMMU can typically translate IOVAs from 0..2GiB, and also from
> > 2^59..2^59+<RAM size>. The two ranges have completely separate IO
> > page tables, with (usually) different layouts. (The low range will
> > nearly always be a single-level page table with 4kiB or 64kiB
> > entries, the high one will be multiple levels depending on the size
> > of the range and pagesize).
> >
> > This may be less common, but I suspect POWER won't be the only
> > platform to do something like this. As above, using a high range
> > is a pretty obvious approach, but clearly won't handle older
> > devices which can't do 64-bit DMA. So adding a smaller range for
> > those devices is again a pretty obvious solution. Any platform
> > with an "IO hole" can be treated as having two ranges, one below
> > the hole and one above it (although in that case they may well not
> > have separate page tables).
>
> 1-3 are common on all platforms with fixed reserved ranges. Current
> vfio already reports permitted iova ranges to user via
> VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to construct
> maps only in those ranges. iommufd can follow the same logic for the
> baseline uAPI.
>
> For above cases a [base, max] hint can be provided by the user per
> Jason's recommendation.

Provided at which stage?

> It is a hint as no additional restriction is
> imposed,

For the qemu type use case, that's not true. In that case we
*require* the available mapping ranges to match what the guest
platform expects.

> since the kernel only cares about no violation on permitted
> ranges that it reports to the user. Underlying iommu driver may use
> this hint to optimize e.g. deciding how many levels are used for
> the kernel-managed page table according to max addr.
>
> >
> > 4. The translatable ranges might not be fixed. On ppc that 0..2GiB
> > and 2^59..whatever ranges are kernel conventions, not specified by
> > the hardware or firmware. When running as a guest (which is the
> > normal case on POWER), there are explicit hypercalls for
> > configuring the allowed IOVA windows (along with pagesize, number
> > of levels etc.). At the moment it is fixed in hardware that there
> > are only 2 windows, one starting at 0 and one at 2^59 but there's
> > no inherent reason those couldn't also be configurable.
>
> If the ppc iommu driver needs to configure hardware according to the
> specified ranges, then it requires more than a hint and is better
> conveyed via a ppc specific cmd as Jason suggested.

Again, a hint at what stage of the setup process are you thinking?

> > This will probably be rarer, but I wouldn't be surprised if it
> > appears on another platform. If you were designing an IOMMU ASIC
> > for use in a variety of platforms, making the base address and size
> > of the translatable range(s) configurable in registers would make
> > sense.
> >
> >
> > Now, for (3) and (4), representing lists of windows explicitly in
> > ioctl()s is likely to be pretty ugly. We might be able to avoid that,
> > for at least some of the interfaces, by using the nested IOAS stuff.
> > One way or another, though, the IOASes which are actually attached to
> > devices need to represent both windows.
> >
> > e.g.
> > Create a "top-level" IOAS <A> representing the device's view. This
> > would be either TYPE_KERNEL or maybe a special type. Into that you'd
> > make just two iomappings one for each of the translation windows,
> > pointing to IOASes <B> and <C>. IOAS <B> and <C> would have a single
> > window, and would represent the IO page tables for each of the
> > translation windows. These could be either TYPE_KERNEL or (say)
> > TYPE_POWER_TCE for a user managed table. Well.. in theory, anyway.
> > The way paravirtualization on POWER is done might mean user managed
> > tables aren't really possible for other reasons, but that's not
> > relevant here.
> >
> > The next problem here is that we don't want userspace to have to do
> > different things for POWER, at least not for the easy case of a
> > userspace driver that just wants a chunk of IOVA space and doesn't
> > really care where it is.
> >
> > In general I think the right approach to handle that is to
> > de-emphasize "info" or "query" interfaces. We'll probably still need
> > some for debugging and edge cases, but in the normal case userspace
> > should just specify what it *needs* and (ideally) no more with
> > optional hints, and the kernel will either supply that or fail.
> >
> > e.g. A simple userspace driver would simply say "I need an IOAS with
> > at least 1GiB of IOVA space" and the kernel says "Ok, you can use
> > 2^59..2^59+2GiB". qemu, emulating the POWER vIOMMU might say "I need
> > an IOAS with translatable addresses from 0..2GiB with 4kiB page size
> > and from 2^59..2^59+1TiB with 64kiB page size" and the kernel would
> > either say "ok", or "I can't do that".
> >
>
> This doesn't work for other platforms, which don't mandate a vIOMMU
> as ppc does. For those platforms, the initial address space
> is GPA (for the vm case) and Qemu needs to mark those GPA holes as
> reserved in the firmware structure. I don't think anyone wants a tedious
> try-and-fail process to figure out how many holes exist in a 64bit
> address space...

Ok, I'm not quite sure how this works. The holes are guest visible,
which generally means they have to be fixed by the guest *platform*
and can't depend on host information. Otherwise, migration is totally
broken. I'm wondering if this only works by accident now, because the
holes are usually in the same place on all x86 machines.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-14 06:23:53

by David Gibson

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Mon, Oct 11, 2021 at 09:49:57AM +0100, Jean-Philippe Brucker wrote:
> On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
> > qemu wants to emulate a PAPR vIOMMU, so it says (via interfaces yet to
> > be determined) that it needs an IOAS where things can be mapped in the
> > range 0..2GiB (for the 32-bit window) and 2^59..2^59+1TiB (for the
> > 64-bit window).
> >
> > Ideally the host /dev/iommu will say "ok!", since both those ranges
> > are within the 0..2^60 translated range of the host IOMMU, and don't
> > touch the IO hole. When the guest calls the IO mapping hypercalls,
> > qemu translates those into DMA_MAP operations, and since they're all
> > within the previously verified windows, they should work fine.
>
> Seems like we don't need the negotiation part? The host kernel
> communicates available IOVA ranges to userspace including holes (patch
> 17), and userspace can check that the ranges it needs are within the IOVA
> space boundaries. That part is necessary for DPDK as well since it needs
> to know about holes in the IOVA space where DMA wouldn't work as expected
> (MSI doorbells for example). And there already is a negotiation happening,
> when the host kernel rejects MAP ioctl outside the advertised area.

The problem with the approach where the kernel advertises and
userspace selects based on that, is that it locks us into a specific
representation of what's possible. If we get new hardware with new
weird constraints that can't be expressed with the representation we
chose, we're kind of stuffed. Userspace will have to change to
accommodate the new extension to have any chance of working on the new
hardware.

With the model where userspace requests, and the kernel acks or nacks,
we can still support existing userspace if the only things it requests
can still be accommodated in the new constraints. That's pretty likely
if the majority of userspaces request very simple things (say a single
IOVA block where it doesn't care about the base address).

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-14 06:24:02

by David Gibson

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Mon, Oct 11, 2021 at 03:49:14PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
>
> > > This means we cannot define an input that has a magic HW specific
> > > value.
> >
> > I'm not entirely sure what you mean by that.
>
> I mean if you make a general property 'foo' that userspace must
> specify correctly then your API isn't general anymore. Userspace must
> know if it is A or B HW to set foo=A or foo=B.

I absolutely agree. Which is exactly why I'm advocating that
userspace should request from the kernel what it needs (providing a
*minimum* of information) and the kernel satisfies that (filling in
the missing information as suitable for the platform) or outright
fails.

I think that is more robust across multiple platforms and usecases
than advertising a bunch of capabilities and forcing userspace to
interpret those to work out what it can do.

> Supported IOVA ranges are easily like that as every IOMMU is
> different. So DPDK shouldn't provide such specific or binding
> information.

Absolutely, DPDK should not provide that. qemu *should* provide that,
because the specific IOVAs matter to the guest. That will inevitably
mean that the request is more likely to fail, but that's a fundamental
tradeoff.

> > No, I don't think that needs to be a condition. I think it's
> > perfectly reasonable for a constraint to be given, and for the host
> > IOMMU to just say "no, I can't do that". But that does mean that each
> > of these values has to have an explicit way of userspace specifying "I
> > don't care", so that the kernel will select a suitable value for those
> > instead - that's what DPDK or other userspace would use nearly all the
> > time.
>
> My feeling is that qemu should be dealing with the host != target
> case, not the kernel.
>
> The kernel's job should be to expose the IOMMU HW it has, with all
> features accessible, to userspace.

See... to me this is contrary to the point we agreed on above.

> Qemu's job should be to have a userspace driver for each kernel IOMMU
> and the internal infrastructure to make accelerated emulations for all
> supported target IOMMUs.

This seems the wrong way around to me. I see qemu as providing logic
to emulate each target IOMMU. Where that matches the host, there's
the potential for an accelerated implementation, but it makes life a
lot easier if we can at least have a fallback that will work on any
sufficiently capable host IOMMU.

> In other words, it is not the kernel's job to provide target IOMMU
> emulation.

Absolutely not. But it *is* the kernel's job to let qemu do as much
as it can with the *host* IOMMU.

> The kernel should provide a truly generic "works everywhere" interface
> that qemu/etc can rely on to implement the least accelerated emulation
> path.

Right... seems like we're agreeing again.

> So when I see proposals to have "generic" interfaces that actually
> require very HW specific setup, and cannot be used by a generic qemu
> userspace driver, I think it breaks this model. If qemu needs to know
> it is on PPC (as it does today with VFIO's PPC specific API) then it
> may as well speak PPC specific language and forget about pretending to
> be generic.

Absolutely, the current situation is a mess.

> This approach is grounded in 15 years of trying to build these
> user/kernel split HW subsystems (particularly RDMA) where it has
> become painfully obvious that the kernel is the worst place to try and
> wrangle really divergent HW into a "common" uAPI.
>
> This is because the kernel/user boundary is fixed. Introducing
> anything generic here requires a lot of time, thought, arguing and
> risk. Usually it ends up being done wrong (like the PPC specific
> ioctls, for instance)

Those are certainly wrong, but they came about explicitly by *not*
being generic rather than by being too generic. So I'm really
confused as to what you're arguing for / against.

> and when this happens we can't learn and adapt,
> we are stuck with stable uABI forever.
>
> Exposing a device's native programming interface is much simpler. Each
> device is fixed, defined and someone can sit down and figure out how
> to expose it. Then that is it, it doesn't need revisiting, it doesn't
> need harmonizing with a future slightly different device, it just
> stays as is.

I can certainly see the case for that approach. That seems utterly at
odds with what /dev/iommu is trying to do, though.

> The cost is that there must be a userspace driver component for each
> HW piece - which we are already paying here!
>
> > Ideally the host /dev/iommu will say "ok!", since both those ranges
> > are within the 0..2^60 translated range of the host IOMMU, and don't
> > touch the IO hole. When the guest calls the IO mapping hypercalls,
> > qemu translates those into DMA_MAP operations, and since they're all
> > within the previously verified windows, they should work fine.
>
> For instance, we are going to see HW with nested page tables, user
> space owned page tables and even kernel-bypass fast IOTLB
> invalidation.

> In that world does it even make sense for qemu to use slow DMA_MAP
> ioctls for emulation?

Probably not what you want ideally, but it's a really useful fallback
case to have.

> A userspace framework in qemu can make these optimizations and is
> also necessarily HW specific as the host page table is HW specific.
>
> Jason
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-14 06:24:26

by David Gibson

[permalink] [raw]
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Mon, Oct 11, 2021 at 02:17:48PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 04:37:38PM +1100, [email protected] wrote:
> > > PASID support will already require that a device can be multi-bound to
> > > many IOAS's, couldn't PPC do the same with the windows?
> >
> > I don't see how that would make sense. The device has no awareness of
> > multiple windows the way it does of PASIDs. It just sends
> > transactions over the bus with the IOVAs it's told. If those IOVAs
> > lie within one of the windows, the IOMMU picks them up and translates
> > them. If they don't, it doesn't.
>
> To my mind that address centric routing is awareness.

I don't really understand that position. A PASID capable device has
to be built to be PASID capable, and will generally have registers
into which you store PASIDs to use.

Any 64-bit DMA capable device can use the POWER IOMMU just fine - it's
up to the driver to program it with addresses that will be translated
(and in Linux the driver will get those from the DMA subsystem).

> If the HW can attach multiple non-overlapping IOAS's to the same
> device then the HW is routing to the correct IOAS by using the address
> bits. This is not much different from the prior discussion we had
> where we were thinking of the PASID as an 80 bit address

Ah... that might be a workable approach. And it even helps me get my
head around multiple attachment which I was struggling with before.

So, the rule would be that you can attach multiple IOASes to a device,
as long as none of them overlap. The non-overlapping could be because
each IOAS covers a disjoint address range, or it could be because
there's some attached information - such as a PASID - to disambiguate.

What remains a question is where the disambiguating information comes
from in each case: does it come from properties of the IOAS,
properties of the device, or from extra parameters supplied at attach
time. IIUC, the current draft suggests it always comes at attach time
for the PASID information. Obviously the more consistency we can have
here the better.


I can also see an additional problem in implementation, once we start
looking at hot-adding devices to existing address spaces. Suppose our
software (maybe qemu) wants to set up a single DMA view for a bunch of
devices, that has such a split window. It can set up IOASes easily
enough for the two windows, then it needs to attach them. Presumably,
it attaches them one at a time, which means that each device (or
group) goes through an interim state where it's attached to one, but
not the other. That can probably be achieved by using an extra IOMMU
domain (or the local equivalent) in the hardware for that interim
state. However it means we have to repeatedly create and destroy that
extra domain for each device after the first we add, rather than
simply adding each device to the domain which has both windows.

[I think this doesn't arise on POWER when running under PowerVM. That
has no concept like IOMMU domains, and instead the mapping is always
done per "partitionable endpoint" (PE), essentially a group. That
means it's just a question of whether we mirror mappings on both
windows into a given PE or just those from one IOAS. It's not an
unreasonable extension/combination of existing hardware quirks to
consider, though]

> The fact the PPC HW actually has multiple page table roots and those
> roots even have different page tables layouts while still connected to
> the same device suggests this is not even an unnatural modelling
> approach...
>
> Jason
>
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-14 06:56:27

by Tian, Kevin

Subject: RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

> From: David Gibson <[email protected]>
> Sent: Thursday, October 14, 2021 1:00 PM
>
> On Wed, Oct 13, 2021 at 07:00:58AM +0000, Tian, Kevin wrote:
> > > From: David Gibson
> > > Sent: Friday, October 1, 2021 2:11 PM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > allocating an IOASID, userspace is expected to specify the type and
> > > > format information for the target I/O page table.
> > > >
> > > > This RFC supports only one type
> (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > semantics. For this type the user should specify the addr_width of
> > > > the I/O address space and whether the I/O page table is created in
> > > > an iommu enforce_snoop format. enforce_snoop must be true at this
> point,
> > > > as the false setting requires additional contract with KVM on handling
> > > > WBINVD emulation, which can be added later.
> > > >
> > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next
> patch)
> > > > for what formats can be specified when allocating an IOASID.
> > > >
> > > > Open:
> > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > > Per previous discussion they can also use vfio type1v2 as long as there
> > > > is a way to claim a specific iova range from a system-wide address
> space.
> > > > This requirement doesn't sound PPC specific, as addr_width for pci
> > > devices
> > > > can be also represented by a range [0, 2^addr_width-1]. This RFC
> hasn't
> > > > adopted this design yet. We hope to have formal alignment in v1
> > > discussion
> > > > and then decide how to incorporate it in v2.
> > >
> > > Ok, there are several things we need for ppc. None of which are
> > > inherently ppc specific and some of which will I think be useful for
> > > most platforms. So, starting from most general to most specific
> > > here's basically what's needed:
> > >
> > > 1. We need to represent the fact that the IOMMU can only translate
> > > *some* IOVAs, not a full 64-bit range. You have the addr_width
> > > already, but I'm entirely sure if the translatable range on ppc
> > > (or other platforms) is always a power-of-2 size. It usually will
> > > be, of course, but I'm not sure that's a hard requirement. So
> > > using a size/max rather than just a number of bits might be safer.
> > >
> > > I think basically every platform will need this. Most platforms
> > > don't actually implement full 64-bit translation in any case, but
> > > rather some smaller number of bits that fits their page table
> > > format.
> > >
> > > 2. The translatable range of IOVAs may not begin at 0. So we need to
> > > advertise to userspace what the base address is, as well as the
> > > size. POWER's main IOVA range begins at 2^59 (at least on the
> > > models I know about).
> > >
> > > I think a number of platforms are likely to want this, though I
> > > couldn't name them apart from POWER. Putting the translated IOVA
> > > window at some huge address is a pretty obvious approach to making
> > > an IOMMU which can translate a wide address range without colliding
> > > with any legacy PCI addresses down low (the IOMMU can check if this
> > > transaction is for it by just looking at some high bits in the
> > > address).
> > >
> > > 3. There might be multiple translatable ranges. So, on POWER the
> > > IOMMU can typically translate IOVAs from 0..2GiB, and also from
> > > 2^59..2^59+<RAM size>. The two ranges have completely separate IO
> > > page tables, with (usually) different layouts. (The low range will
> > > nearly always be a single-level page table with 4kiB or 64kiB
> > > entries, the high one will be multiple levels depending on the size
> > > of the range and pagesize).
> > >
> > > This may be less common, but I suspect POWER won't be the only
> > > platform to do something like this. As above, using a high range
> > > is a pretty obvious approach, but clearly won't handle older
> > > devices which can't do 64-bit DMA. So adding a smaller range for
> > > those devices is again a pretty obvious solution. Any platform
> > > with an "IO hole" can be treated as having two ranges, one below
> > > the hole and one above it (although in that case they may well not
> > > have separate page tables).
> >
> > 1-3 are common on all platforms with fixed reserved ranges. Current
> > vfio already reports permitted iova ranges to user via VFIO_IOMMU_
> > TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to construct
> > maps only in those ranges. iommufd can follow the same logic for the
> > baseline uAPI.
> >
> > For above cases a [base, max] hint can be provided by the user per
> > Jason's recommendation.
>
> Provided at which stage?

IOMMU_IOASID_ALLOC

>
> > It is a hint as no additional restriction is
> > imposed,
>
> For the qemu type use case, that's not true. In that case we
> *require* the available mapping ranges to match what the guest
> platform expects.

I didn't get the 'match' part. Here we are talking about your case 3
where the available ranges are fixed. There is nothing that the
guest can change in this case, as long as it always allocates iova in
the reported ranges.

>
> > since the kernel only cares about no violation on permitted
> > ranges that it reports to the user. Underlying iommu driver may use
> > this hint to optimize e.g. deciding how many levels are used for
> > the kernel-managed page table according to max addr.
> >
> > >
> > > 4. The translatable ranges might not be fixed. On ppc that 0..2GiB
> > > and 2^59..whatever ranges are kernel conventions, not specified by
> > > the hardware or firmware. When running as a guest (which is the
> > > normal case on POWER), there are explicit hypercalls for
> > > configuring the allowed IOVA windows (along with pagesize, number
> > > of levels etc.). At the moment it is fixed in hardware that there
> > > are only 2 windows, one starting at 0 and one at 2^59 but there's
> > > no inherent reason those couldn't also be configurable.
> >
> > If ppc iommu driver needs to configure hardware according to the
> > specified ranges, then it requires more than a hint thus better be
> > conveyed via ppc specific cmd as Jason suggested.
>
> Again, a hint at what stage of the setup process are you thinking?
>
> > > This will probably be rarer, but I wouldn't be surprised if it
> > > appears on another platform. If you were designing an IOMMU ASIC
> > > for use in a variety of platforms, making the base address and size
> > > of the translatable range(s) configurable in registers would make
> > > sense.
> > >
> > >
> > > Now, for (3) and (4), representing lists of windows explicitly in
> > > ioctl()s is likely to be pretty ugly. We might be able to avoid that,
> > > for at least some of the interfaces, by using the nested IOAS stuff.
> > > One way or another, though, the IOASes which are actually attached to
> > > devices need to represent both windows.
> > >
> > > e.g.
> > > Create a "top-level" IOAS <A> representing the device's view. This
> > > would be either TYPE_KERNEL or maybe a special type. Into that you'd
> > > make just two iomappings one for each of the translation windows,
> > > pointing to IOASes <B> and <C>. IOAS <B> and <C> would have a single
> > > window, and would represent the IO page tables for each of the
> > > translation windows. These could be either TYPE_KERNEL or (say)
> > > TYPE_POWER_TCE for a user managed table. Well.. in theory, anyway.
> > > The way paravirtualization on POWER is done might mean user managed
> > > tables aren't really possible for other reasons, but that's not
> > > relevant here.
> > >
> > > The next problem here is that we don't want userspace to have to do
> > > different things for POWER, at least not for the easy case of a
> > > userspace driver that just wants a chunk of IOVA space and doesn't
> > > really care where it is.
> > >
> > > In general I think the right approach to handle that is to
> > > de-emphasize "info" or "query" interfaces. We'll probably still need
> > > some for debugging and edge cases, but in the normal case userspace
> > > should just specify what it *needs* and (ideally) no more with
> > > optional hints, and the kernel will either supply that or fail.
> > >
> > > e.g. A simple userspace driver would simply say "I need an IOAS with
> > > at least 1GiB of IOVA space" and the kernel says "Ok, you can use
> > > 2^59..2^59+2GiB". qemu, emulating the POWER vIOMMU might say "I
> need
> > > an IOAS with translatable addresses from 0..2GiB with 4kiB page size
> > > and from 2^59..2^59+1TiB with 64kiB page size" and the kernel would
> > > either say "ok", or "I can't do that".
> > >
> >
> > This doesn't work for other platforms, which don't have vIOMMU
> > mandatory as on ppc. For those platforms, the initial address space
> > is GPA (for vm case) and Qemu needs to mark those GPA holes as
> > reserved in firmware structure. I don't think anyone wants a tedious
> try-and-fail process to figure out how many holes exist in a 64bit
> > address space...
>
> Ok, I'm not quite sure how this works. The holes are guest visible,
> which generally means they have to be fixed by the guest *platform*
> and can't depend on host information. Otherwise, migration is totally
> broken. I'm wondering if this only works by accident now, because the
> holes are usually in the same place on all x86 machines.
>

I haven't checked how qemu handles it today after vfio introduced the
capability of reporting valid iova ranges (Alex, can you help confirm?).
But there is no elegant answer. If qemu doesn't put the holes in
GPA space, a guest driver might be broken if a dma buffer happens
to sit in a hole; this is even more severe than missing live migration.
For x86 the situation is simpler as the only hole is 0xfeexxxxx on all
platforms (with gpu as an exception). Other archs may have more holes
though.

Regarding live migration with vfio devices, it's still at an early stage.
There are tons of open compatibility issues to be addressed before it can
be widely deployed. This might just add another annoying item to that
long list...

Thanks
Kevin

2021-10-14 15:14:34

by Jason Gunthorpe

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Thu, Oct 14, 2021 at 03:33:21PM +1100, [email protected] wrote:

> > If the HW can attach multiple non-overlapping IOAS's to the same
> > device then the HW is routing to the correct IOAS by using the address
> > bits. This is not much different from the prior discussion we had
> > where we were thinking of the PASID as an 80 bit address
>
> Ah... that might be a workable approach. And it even helps me get my
> head around multiple attachment which I was struggling with before.
>
> So, the rule would be that you can attach multiple IOASes to a device,
> as long as none of them overlap. The non-overlapping could be because
> each IOAS covers a disjoint address range, or it could be because
> there's some attached information - such as a PASID - to disambiguate.

Right exactly - it is very parallel to PASID

And obviously HW support is required to have multiple page table
pointers per RID - which sounds like PPC does (high/low pointer?)

> What remains a question is where the disambiguating information comes
> from in each case: does it come from properties of the IOAS,
> properties of the device, or from extra parameters supplied at attach
> time. IIUC, the current draft suggests it always comes at attach time
> for the PASID information. Obviously the more consistency we can have
> here the better.

From a generic view point I'd say all are fair game. It is up to the
IOMMU driver to take the requested set of IOAS's, the "at attachment"
information (like PASID) and decide what to do, or fail.

> I can also see an additional problem in implementation, once we start
> looking at hot-adding devices to existing address spaces.

I won't pretend to guess how to implement this :) Just from a modeling
perspective it is something that works logically. If the kernel
implementation is too hard then PPC should do one of the other ideas.

Personally I'd probably try for a nice multi-domain attachment model
like PASID and not try to create/destroy domains.

As I said in my last email I think it is up to each IOMMU HW driver to
make these decisions, the iommufd framework just provides a
standardized API toward the attaching driver that the IOMMU HW must
fit into.

Jason

2021-10-14 16:17:08

by Jason Gunthorpe

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Thu, Oct 14, 2021 at 03:53:33PM +1100, David Gibson wrote:

> > My feeling is that qemu should be dealing with the host != target
> > case, not the kernel.
> >
> > The kernel's job should be to expose the IOMMU HW it has, with all
> > features accessible, to userspace.
>
> See... to me this is contrary to the point we agreed on above.

I'm not thinking of these as exclusive ideas.

The IOCTL interface in iommu can quite happily expose:
Create IOAS generically
Manipulate IOAS generically
Create IOAS with IOMMU driver specific attributes
HW specific Manipulate IOAS

IOCTL commands all together.

So long as everything is focused on a generic in-kernel IOAS object it
is fine to have multiple ways in the uAPI to create and manipulate the
objects.

When I speak about a generic interface I mean "Create IOAS
generically" - ie a set of IOCTLs that work on most IOMMU HW and can
be relied upon by things like DPDK/etc to always work and be portable.
This is why I like "hints" to provide some limited widely applicable
micro-optimization.

When I said "expose the IOMMU HW it has with all features accessible"
I mean also providing "Create IOAS with IOMMU driver specific
attributes".

These other IOCTLs would allow the IOMMU driver to expose every
configuration knob its HW has, in a natural HW centric language.
There is no pretense of genericness here, no crazy foo=A, foo=B hidden
device specific interface.

Think of it as a high level/low level interface to the same thing.

> Those are certainly wrong, but they came about explicitly by *not*
> being generic rather than by being too generic. So I'm really
confused as to what you're arguing for / against.

IMHO it is not having a PPC specific interface that was the problem,
it was making the PPC specific interface exclusive to the type 1
interface. If type 1 continued to work on PPC then DPDK/etc would
never have learned PPC specific code.

For iommufd with the high/low interface each IOMMU HW should ask basic
questions:

- What should the generic high level interface do on this HW?
For instance what should 'Create IOAS generically' do for PPC?
It should not fail, it should create *something*
What is the best thing for DPDK?
I guess the 64 bit window is most broadly useful.

- How to accurately describe the HW in terms of standard IOAS objects
and where to put HW specific structs to support this.

This is where PPC would decide how best to expose a control over
its low/high window (eg 1,2,3 IOAS). Whatever the IOMMU driver
wants, so long as it fits into the kernel IOAS model facing the
connected device driver.

QEMU would have IOMMU userspace drivers. One would be the "generic
driver" using only the high level generic interface. It should work as
best it can on all HW devices. This is the fallback path you talked
of.

QEMU would also have HW specific IOMMU userspace drivers that know how
to operate the exact HW. eg these drivers would know how to use
userspace page tables, how to form IOPTEs and how to access the
special features.

This is how QEMU could use an optimized path with nested page tables,
for instance.

Jason

2021-10-18 04:34:11

by David Gibson

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Thu, Oct 14, 2021 at 12:06:10PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 14, 2021 at 03:33:21PM +1100, [email protected] wrote:
>
> > > If the HW can attach multiple non-overlapping IOAS's to the same
> > > device then the HW is routing to the correct IOAS by using the address
> > > bits. This is not much different from the prior discussion we had
> > > where we were thinking of the PASID as an 80 bit address
> >
> > Ah... that might be a workable approach. And it even helps me get my
> > head around multiple attachment which I was struggling with before.
> >
> > So, the rule would be that you can attach multiple IOASes to a device,
> > as long as none of them overlap. The non-overlapping could be because
> > each IOAS covers a disjoint address range, or it could be because
> > there's some attached information - such as a PASID - to disambiguate.
>
> Right exactly - it is very parallel to PASID
>
> And obviously HW support is required to have multiple page table
> pointers per RID - which sounds like PPC does (high/low pointer?)

Hardware support is required *in the IOMMU*. Nothing (beyond regular
64-bit DMA support) is required in the endpoint devices. That's not
true of PASID.

> > What remains a question is where the disambiguating information comes
> > from in each case: does it come from properties of the IOAS,
> properties of the device, or from extra parameters supplied at attach
> > time. IIUC, the current draft suggests it always comes at attach time
> > for the PASID information. Obviously the more consistency we can have
> > here the better.
>
> From a generic view point I'd say all are fair game. It is up to the
> IOMMU driver to take the requested set of IOAS's, the "at attachment"
> information (like PASID) and decide what to do, or fail.

Ok, that's a model that makes sense to me.

> > I can also see an additional problem in implementation, once we start
> > looking at hot-adding devices to existing address spaces.
>
> I won't pretend to guess how to implement this :) Just from a modeling
> perspective it is something that works logically. If the kernel
> implementation is too hard then PPC should do one of the other ideas.
>
> Personally I'd probably try for a nice multi-domain attachment model
> like PASID and not try to create/destroy domains.

I don't really follow what you mean by that.

> As I said in my last email I think it is up to each IOMMU HW driver to
> make these decisions, the iommufd framework just provides a
> standardized API toward the attaching driver that the IOMMU HW must
> fit into.
>
> Jason
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-18 04:35:14

by David Gibson

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Thu, Oct 14, 2021 at 11:52:08AM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 14, 2021 at 03:53:33PM +1100, David Gibson wrote:
>
> > > My feeling is that qemu should be dealing with the host != target
> > > case, not the kernel.
> > >
> > > The kernel's job should be to expose the IOMMU HW it has, with all
> > > features accessible, to userspace.
> >
> > See... to me this is contrary to the point we agreed on above.
>
> I'm not thinking of these as exclusive ideas.
>
> The IOCTL interface in iommu can quite happily expose:
> Create IOAS generically
> Manipulate IOAS generically
> Create IOAS with IOMMU driver specific attributes
> HW specific Manipulate IOAS
>
> IOCTL commands all together.
>
> So long as everything is focused on a generic in-kernel IOAS object it
> is fine to have multiple ways in the uAPI to create and manipulate the
> objects.
>
> When I speak about a generic interface I mean "Create IOAS
> generically" - ie a set of IOCTLs that work on most IOMMU HW and can
> be relied upon by things like DPDK/etc to always work and be portable.
> This is why I like "hints" to provide some limited widely applicable
> micro-optimization.
>
> When I said "expose the IOMMU HW it has with all features accessible"
> I mean also providing "Create IOAS with IOMMU driver specific
> attributes".
>
> These other IOCTLs would allow the IOMMU driver to expose every
> configuration knob its HW has, in a natural HW centric language.
> There is no pretense of genericness here, no crazy foo=A, foo=B hidden
> device specific interface.
>
> Think of it as a high level/low level interface to the same thing.

Ok, I see what you mean.

> > Those are certainly wrong, but they came about explicitly by *not*
> > being generic rather than by being too generic. So I'm really
> confused as to what you're arguing for / against.
>
> IMHO it is not having a PPC specific interface that was the problem,
> it was making the PPC specific interface exclusive to the type 1
> interface. If type 1 continued to work on PPC then DPDK/etc would
> never have learned PPC specific code.

Ok, but the reason this happened is that the initial version of type 1
*could not* be used on PPC. The original Type 1 implicitly promised a
"large" IOVA range beginning at IOVA 0 without any real way of
specifying or discovering how large that range was. Since ppc could
typically only give a 2GiB range at IOVA 0, that wasn't usable.

That's why I say the problem was not making type1 generic enough. I
believe the current version of Type1 has addressed this - at least
enough to be usable in common cases. But by this time the ppc backend
is already out there, so no-one's had the capacity to go back and make
ppc work with Type1.

> For iommufd with the high/low interface each IOMMU HW should ask basic
> questions:
>
> - What should the generic high level interface do on this HW?
> For instance what should 'Create IOAS generically' do for PPC?
> It should not fail, it should create *something*
> What is the best thing for DPDK?
> I guess the 64 bit window is most broadly useful.

Right, which means the kernel must (at least in the common case) have
the capacity to choose and report a non-zero base-IOVA.

Hrm... which makes me think... if we allow this for the common
kernel-managed case, do we even need to have capacity in the high-level
interface for reporting IO holes? If the kernel can choose a non-zero
base, it could just choose on x86 to place its advertised window
above the IO hole.

> - How to accurately describe the HW in terms of standard IOAS objects
> and where to put HW specific structs to support this.
>
> This is where PPC would decide how best to expose a control over
> its low/high window (eg 1,2,3 IOAS). Whatever the IOMMU driver
> wants, so long as it fits into the kernel IOAS model facing the
> connected device driver.
>
> QEMU would have IOMMU userspace drivers. One would be the "generic
> driver" using only the high level generic interface. It should work as
> best it can on all HW devices. This is the fallback path you talked
> of.
>
> QEMU would also have HW specific IOMMU userspace drivers that know how
> to operate the exact HW. eg these drivers would know how to use
> userspace page tables, how to form IOPTEs and how to access the
> special features.
>
> This is how QEMU could use an optimized path with nested page tables,
> for instance.

The concept makes sense in general. The devil's in the details, as usual.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-18 17:44:11

by Jason Gunthorpe

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Mon, Oct 18, 2021 at 02:50:54PM +1100, David Gibson wrote:

> Hrm... which makes me think... if we allow this for the common
> kernel-managed case, do we even need to have capacity in the high-level
> interface for reporting IO holes? If the kernel can choose a non-zero
> base, it could just choose on x86 to place its advertised window
> above the IO hole.

If the high level interface is like dma_map() then, no it doesn't need
the ability to report holes. Kernel would find and return the IOVA
from dma_map not accept it in.

Since dma_map is a well-proven model I'm inclined to model the
simplified interface after it.

That said, if we have some ioctl 'query iova ranges' I would expect it
to work on an IOAS created by the simplified interface too.

Jason

2021-10-25 05:31:30

by David Gibson

Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Thu, Oct 14, 2021 at 06:53:01AM +0000, Tian, Kevin wrote:
> > From: David Gibson <[email protected]>
> > Sent: Thursday, October 14, 2021 1:00 PM
> >
> > On Wed, Oct 13, 2021 at 07:00:58AM +0000, Tian, Kevin wrote:
> > > > From: David Gibson
> > > > Sent: Friday, October 1, 2021 2:11 PM
> > > >
> > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > > allocating an IOASID, userspace is expected to specify the type and
> > > > > format information for the target I/O page table.
> > > > >
> > > > > This RFC supports only one type
> > (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > > semantics. For this type the user should specify the addr_width of
> > > > > the I/O address space and whether the I/O page table is created in
> > > > > an iommu enforce_snoop format. enforce_snoop must be true at this
> > point,
> > > > > as the false setting requires additional contract with KVM on handling
> > > > > WBINVD emulation, which can be added later.
> > > > >
> > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next
> > patch)
> > > > > for what formats can be specified when allocating an IOASID.
> > > > >
> > > > > Open:
> > > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > > > Per previous discussion they can also use vfio type1v2 as long as there
> > > > > is a way to claim a specific iova range from a system-wide address
> > space.
> > > > > This requirement doesn't sound PPC specific, as addr_width for pci
> > > > devices
> > > > > can be also represented by a range [0, 2^addr_width-1]. This RFC
> > hasn't
> > > > > adopted this design yet. We hope to have formal alignment in v1
> > > > discussion
> > > > > and then decide how to incorporate it in v2.
> > > >
> > > > Ok, there are several things we need for ppc. None of which are
> > > > inherently ppc specific and some of which will I think be useful for
> > > > most platforms. So, starting from most general to most specific
> > > > here's basically what's needed:
> > > >
> > > > 1. We need to represent the fact that the IOMMU can only translate
> > > > *some* IOVAs, not a full 64-bit range. You have the addr_width
> > > > already, but I'm not entirely sure if the translatable range on ppc
> > > > (or other platforms) is always a power-of-2 size. It usually will
> > > > be, of course, but I'm not sure that's a hard requirement. So
> > > > using a size/max rather than just a number of bits might be safer.
> > > >
> > > > I think basically every platform will need this. Most platforms
> > > > don't actually implement full 64-bit translation in any case, but
> > > > rather some smaller number of bits that fits their page table
> > > > format.
> > > >
> > > > 2. The translatable range of IOVAs may not begin at 0. So we need to
> > > > advertise to userspace what the base address is, as well as the
> > > > size. POWER's main IOVA range begins at 2^59 (at least on the
> > > > models I know about).
> > > >
> > > > I think a number of platforms are likely to want this, though I
> > > > couldn't name them apart from POWER. Putting the translated IOVA
> > > > window at some huge address is a pretty obvious approach to making
> > > > an IOMMU which can translate a wide address range without colliding
> > > > with any legacy PCI addresses down low (the IOMMU can check if this
> > > > transaction is for it by just looking at some high bits in the
> > > > address).
> > > >
> > > > 3. There might be multiple translatable ranges. So, on POWER the
> > > > IOMMU can typically translate IOVAs from 0..2GiB, and also from
> > > > 2^59..2^59+<RAM size>. The two ranges have completely separate IO
> > > > page tables, with (usually) different layouts. (The low range will
> > > > nearly always be a single-level page table with 4kiB or 64kiB
> > > > entries, the high one will be multiple levels depending on the size
> > > > of the range and pagesize).
> > > >
> > > > This may be less common, but I suspect POWER won't be the only
> > > > platform to do something like this. As above, using a high range
> > > > is a pretty obvious approach, but clearly won't handle older
> > > > devices which can't do 64-bit DMA. So adding a smaller range for
> > > > those devices is again a pretty obvious solution. Any platform
> > > > with an "IO hole" can be treated as having two ranges, one below
> > > > the hole and one above it (although in that case they may well not
> > > > have separate page tables).
> > >
> > > 1-3 are common on all platforms with fixed reserved ranges. Current
> > > vfio already reports permitted iova ranges to the user via
> > > VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to
> > > construct maps only in those ranges. iommufd can follow the same
> > > logic for the baseline uAPI.
> > >
> > > For above cases a [base, max] hint can be provided by the user per
> > > Jason's recommendation.
> >
> > Provided at which stage?
>
> IOMMU_IOASID_ALLOC

Ok. I have mixed thoughts on this. Doing this at ALLOC time was my
first instinct as well. However with Jason's suggestion that any of a
number of things could disambiguate multiple IOAS attached to a
device, I wonder if it makes more sense for consistency to put base
address at attach time, as with PASID.

I do think putting the size of the IOVA range makes sense to add at
IOASID_ALLOC time - for basically every type of window. They'll
nearly always have some limit, which is relevant pretty early.

> > > It is a hint as no additional restriction is
> > > imposed,
> >
> > For the qemu type use case, that's not true. In that case we
> > *require* the available mapping ranges to match what the guest
> > platform expects.
>
> I didn't get the 'match' part. Here we are talking about your case 3
> where the available ranges are fixed. There is nothing that the
> guest can change in this case, as long as it always allocates IOVAs
> within the reported ranges.

Sorry, I don't understand the question.

> > > since the kernel only cares about no violation on permitted
> > > ranges that it reports to the user. Underlying iommu driver may use
> > > this hint to optimize e.g. deciding how many levels are used for
> > > the kernel-managed page table according to max addr.
> > >
> > > >
> > > > 4. The translatable ranges might not be fixed. On ppc that 0..2GiB
> > > > and 2^59..whatever ranges are kernel conventions, not specified by
> > > > the hardware or firmware. When running as a guest (which is the
> > > > normal case on POWER), there are explicit hypercalls for
> > > > configuring the allowed IOVA windows (along with pagesize, number
> > > > of levels etc.). At the moment it is fixed in hardware that there
> > > > are only 2 windows, one starting at 0 and one at 2^59 but there's
> > > > no inherent reason those couldn't also be configurable.
> > >
> > > If ppc iommu driver needs to configure hardware according to the
> > > specified ranges, then it requires more than a hint thus better be
> > > conveyed via ppc specific cmd as Jason suggested.
> >
> > Again, a hint at what stage of the setup process are you thinking?
> >
> > > > This will probably be rarer, but I wouldn't be surprised if it
> > > > appears on another platform. If you were designing an IOMMU ASIC
> > > > for use in a variety of platforms, making the base address and size
> > > > of the translatable range(s) configurable in registers would make
> > > > sense.
> > > >
> > > >
> > > > Now, for (3) and (4), representing lists of windows explicitly in
> > > > ioctl()s is likely to be pretty ugly. We might be able to avoid that,
> > > > for at least some of the interfaces, by using the nested IOAS stuff.
> > > > One way or another, though, the IOASes which are actually attached to
> > > > devices need to represent both windows.
> > > >
> > > > e.g.
> > > > Create a "top-level" IOAS <A> representing the device's view. This
> > > > would be either TYPE_KERNEL or maybe a special type. Into that you'd
> > > > make just two iomappings, one for each of the translation windows,
> > > > pointing to IOASes <B> and <C>. IOAS <B> and <C> would have a single
> > > > window, and would represent the IO page tables for each of the
> > > > translation windows. These could be either TYPE_KERNEL or (say)
> > > > TYPE_POWER_TCE for a user managed table. Well.. in theory, anyway.
> > > > The way paravirtualization on POWER is done might mean user managed
> > > > tables aren't really possible for other reasons, but that's not
> > > > relevant here.
> > > >
> > > > The next problem here is that we don't want userspace to have to do
> > > > different things for POWER, at least not for the easy case of a
> > > > userspace driver that just wants a chunk of IOVA space and doesn't
> > > > really care where it is.
> > > >
> > > > In general I think the right approach to handle that is to
> > > > de-emphasize "info" or "query" interfaces. We'll probably still need
> > > > some for debugging and edge cases, but in the normal case userspace
> > > > should just specify what it *needs* and (ideally) no more with
> > > > optional hints, and the kernel will either supply that or fail.
> > > >
> > > > e.g. A simple userspace driver would simply say "I need an IOAS with
> > > > at least 1GiB of IOVA space" and the kernel says "Ok, you can use
> > > > 2^59..2^59+2GiB". qemu, emulating the POWER vIOMMU might say "I
> > > > need an IOAS with translatable addresses from 0..2GiB with 4kiB
> > > > page size and from 2^59..2^59+1TiB with 64kiB page size" and the
> > > > kernel would either say "ok", or "I can't do that".
> > > >
> > >
> > > This doesn't work for other platforms, which don't have a mandatory
> > > vIOMMU as ppc does. For those platforms, the initial address space
> > > is GPA (for vm case) and Qemu needs to mark those GPA holes as
> > > reserved in the firmware structures. I don't think anyone wants a
> > > tedious try-and-fail process to figure out how many holes exist in a
> > > 64-bit address space...
> >
> > Ok, I'm not quite sure how this works. The holes are guest visible,
> > which generally means they have to be fixed by the guest *platform*
> > and can't depend on host information. Otherwise, migration is totally
> > broken. I'm wondering if this only works by accident now, because the
> > holes are usually in the same place on all x86 machines.
>
> I haven't checked how qemu handles it today after vfio introduced the
> capability of reporting valid iova ranges (Alex, can you help confirm?).
> But there is no elegant answer. If qemu doesn't put the holes in
> GPA space it means a guest driver might break if a dma buffer happens
> to sit in a hole. This is even more severe than missing live migration.
> For x86 the situation is simpler as the only hole is 0xfeexxxxx on all
> platforms (with gpu as an exception).

Right.. I suspect this is the only reason it's working now on x86.

> other arch may have more holes
> though.
>
> regarding live migration with vfio devices, it's still in early stage. there
> are tons of compatibility check opens to be addressed before it can
> be widely deployed. this might just add another annoying open to that
> long list...

So, yes, live migration with VFIO is limited, unfortunately this
still affects us even if we don't (currently) have VFIO devices. The
problem arises from the combination of two limitations:

1) Live migration means that we can't dynamically select guest visible
IOVA parameters at qemu start up time. We need to get consistent
guest visible behaviour for a given set of qemu options, so that we
can migrate between them.

2) Device hotplug means that we don't know if a PCI domain will have
VFIO devices on it when we start qemu. So, we don't know if host
limitations on IOVA ranges will affect the guest or not.

Together these mean that the best we can do is to define a *fixed*
(per machine type) configuration based on qemu options only. That is,
defined by the guest platform we're trying to present, only, never
host capabilities. We can then see if that configuration is possible
on the host and pass or fail. It's never safe to go the other
direction and take host capabilities and present those to the guest.

Obviously, we then try to define the default platform configuration in
qemu to be usable on the greatest number of hosts we can.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-27 15:47:40

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

> From: David Gibson <[email protected]>
> Sent: Monday, October 25, 2021 1:05 PM
>
> > > > For above cases a [base, max] hint can be provided by the user per
> > > > Jason's recommendation.
> > >
> > > Provided at which stage?
> >
> > IOMMU_IOASID_ALLOC
>
> Ok. I have mixed thoughts on this. Doing this at ALLOC time was my
> first instinct as well. However with Jason's suggestion that any of a
> number of things could disambiguate multiple IOAS attached to a
> device, I wonder if it makes more sense for consistency to put base
> address at attach time, as with PASID.

In that case the base address provided at attach time is used as an
address space ID similar to PASID, which imho is orthogonal to the
generic [base, size] info for IOAS itself. The 2nd base sort of becomes
an offset on top of the first base in the ppc case.

> >
> > regarding live migration with vfio devices, it's still in early stage. there
> > are tons of compatibility check opens to be addressed before it can
> > be widely deployed. this might just add another annoying open to that
> > long list...
>
> So, yes, live migration with VFIO is limited, unfortunately this
> still affects us even if we don't (currently) have VFIO devices. The
> problem arises from the combination of two limitations:
>
> 1) Live migration means that we can't dynamically select guest visible
> IOVA parameters at qemu start up time. We need to get consistent
> guest visible behaviour for a given set of qemu options, so that we
> can migrate between them.
>
> 2) Device hotplug means that we don't know if a PCI domain will have
> VFIO devices on it when we start qemu. So, we don't know if host
> limitations on IOVA ranges will affect the guest or not.
>
> Together these mean that the best we can do is to define a *fixed*
> (per machine type) configuration based on qemu options only. That is,
> defined by the guest platform we're trying to present, only, never
> host capabilities. We can then see if that configuration is possible
> on the host and pass or fail. It's never safe to go the other
> direction and take host capabilities and present those to the guest.
>

That is just one userspace policy. We don't want to design a uAPI
just for a specific userspace implementation. In concept, userspace
could:

1) use DMA-API-like map/unmap, i.e. letting the IOVA address space
be managed by the kernel;

* suitable for simple applications e.g. dpdk.

2) manage IOVA address space with *fixed* layout:

* fail device passthrough at MAP_DMA if a conflict is detected
between the mapped range and device-specific IOVA holes

* suitable for VM when live migration is highly concerned

* potential problem with vIOMMU since the guest is unaware
of host constraints thus undefined behavior may occur if
guest IOVA addresses happen to overlap with host IOVA holes.

* ppc is special as you need to claim guest IOVA ranges in
the host. But it's not the case for other emulated IOMMUs.

3) manage IOVA address space with host constraints:

* create IOVA layout by combining qemu options and IOVA holes
of all boot-time passthrough devices

* reject hotplugged device if it has conflicting IOVA holes with
the initial IOVA layout

* suitable for vIOMMU since host constraints can be further
reported to the guest

* suitable for VM w/o live migration requirement, e.g. in many
client virtualization scenarios

* suboptimal with VM live migration with compatibility limitation

Overall the proposed uAPI will provide:

1) a simple DMA-API-like mapping protocol for kernel-managed IOVA
address space;

2) a vfio-like mapping protocol for user-managed IOVA address space:

a) check IOVA conflict in MAP_DMA ioctl;
b) allows the user to query available IOVA ranges;

Then it's totally user policy on how it wants to utilize those ioctls.

Thanks
Kevin