LinuxLists.cc - [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs

2020-03-22 12:28:03

Subject: [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs

From: Liu Yi L <[email protected]>

Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
Intel platforms allows address space sharing between device DMA and
applications. SVA can reduce programming complexity and enhance security.

This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
guest application address space with passthru devices. This is called
vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
changes. For IOMMU and QEMU changes, they are in separate series (listed
in the "Related series").

The high-level architecture for SVA virtualization is as below, the key
design of vSVA support is to utilize the dual-stage IOMMU translation (
also known as IOMMU nesting translation) capability in host IOMMU.

.-------------. .---------------------------.
| vIOMMU | | Guest process CR3, FL only|
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush -
'-------------' |
| | V
| | CR3 in GPA
'-------------'
Guest
------| Shadow |--------------------------|--------
v v v
Host
.-------------. .----------------------.
| pIOMMU | | Bind FL for GVA-GPA |
| | '----------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.------------------------------.
| | |SL for GPA-HPA, default domain|
| | '------------------------------'
'-------------'
Where:
- FL = First level/stage one page tables
- SL = Second level/stage two page tables

There are roughly four parts in this patchset which are
corresponding to the basic vSVA support for PCI device
assignment
1. vfio support for PASID allocation and free for VMs
2. vfio support for guest page table binding request from VMs
3. vfio support for IOMMU cache invalidation from VMs
4. vfio support for vSVA usage on IOMMU-backed mdevs

The complete vSVA kernel upstream patches are divided into three phases:
1. Common APIs and PCI device direct assignment
2. IOMMU-backed Mediated Device assignment
3. Page Request Services (PRS) support

This patchset is aiming for the phase 1 and phase 2, and based on Jacob's
below series.
[PATCH V10 00/11] Nested Shared Virtual Address (SVA) VT-d support:
https://lkml.org/lkml/2020/3/20/1172

Complete set for current vSVA can be found in below branch.
https://github.com/luxis1999/linux-vsva.git: vsva-linux-5.6-rc6

The corresponding QEMU patch series is as below, complete QEMU set can be
found in below branch.
[PATCH v1 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
complete QEMU set can be found in below link:
https://github.com/luxis1999/qemu.git: sva_vtd_v10_v1

Regards,
Yi Liu

Changelog:
- RFC v1 -> Patch v1:
a) Address comments to the PASID request(alloc/free) path
b) Report PASID alloc/free availabitiy to user-space
c) Add a vfio_iommu_type1 parameter to support pasid quota tuning
d) Adjusted to latest ioasid code implementation. e.g. remove the
code for tracking the allocated PASIDs as latest ioasid code
will track it, VFIO could use ioasid_free_set() to free all
PASIDs.

- RFC v2 -> v3:
a) Refine the whole patchset to fit the roughly parts in this series
b) Adds complete vfio PASID management framework. e.g. pasid alloc,
free, reclaim in VM crash/down and per-VM PASID quota to prevent
PASID abuse.
c) Adds IOMMU uAPI version check and page table format check to ensure
version compatibility and hardware compatibility.
d) Adds vSVA vfio support for IOMMU-backed mdevs.

- RFC v1 -> v2:
Dropped vfio: VFIO_IOMMU_ATTACH/DETACH_PASID_TABLE.

Liu Yi L (8):
vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
vfio/type1: Report PASID alloc/free support to userspace
vfio: Check nesting iommu uAPI version
vfio/type1: Report 1st-level/stage-1 format to userspace
vfio/type1: Bind guest page tables to host
vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
vfio/type1: Add vSVA support for IOMMU-backed mdevs

drivers/vfio/vfio.c | 136 +++++++++++++
drivers/vfio/vfio_iommu_type1.c | 419 ++++++++++++++++++++++++++++++++++++++++
include/linux/vfio.h | 21 ++
include/uapi/linux/vfio.h | 127 ++++++++++++
4 files changed, 703 insertions(+)

--
2.7.4

2020-03-22 12:28:05

by Yi Liu

[permalink] [raw]

Subject: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host

From: Liu Yi L <[email protected]>

VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by hardware
IOMMUs that have nesting DMA translation (a.k.a dual stage address
translation). For such hardware IOMMUs, there are two stages/levels of
address translation, and software may let userspace/VM to own the first-
level/stage-1 translation structures. Example of such usage is vSVA (
virtual Shared Virtual Addressing). VM owns the first-level/stage-1
translation structures and bind the structures to host, then hardware
IOMMU would utilize nesting translation when doing DMA translation fo
the devices behind such hardware IOMMU.

This patch adds vfio support for binding guest translation (a.k.a stage 1)
structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU, not only bind
guest page table is needed, it also requires to expose interface to guest
for iommu cache invalidation when guest modified the first-level/stage-1
translation structures since hardware needs to be notified to flush stale
iotlbs. This would be introduced in next patch.

In this patch, guest page table bind and unbind are done by using flags
VFIO_IOMMU_BIND_GUEST_PGTBL and VFIO_IOMMU_UNBIND_GUEST_PGTBL under IOCTL
VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
struct iommu_gpasid_bind_data. Before binding guest page table to host,
VM should have got a PASID allocated by host via VFIO_IOMMU_PASID_REQUEST.

Bind guest translation structures (here is guest page table) to host
are the first step to setup vSVA (Virtual Shared Virtual Addressing).

Cc: Kevin Tian <[email protected]>
CC: Jacob Pan <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Eric Auger <[email protected]>
Cc: Jean-Philippe Brucker <[email protected]>
Signed-off-by: Jean-Philippe Brucker <[email protected]>
Signed-off-by: Liu Yi L <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 158 ++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/vfio.h | 46 ++++++++++++
2 files changed, 204 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 82a9e0b..a877747 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -130,6 +130,33 @@ struct vfio_regions {
#define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) \
(!list_empty(&iommu->domain_list))

+struct domain_capsule {
+ struct iommu_domain *domain;
+ void *data;
+};
+
+/* iommu->lock must be held */
+static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
+ int (*fn)(struct device *dev, void *data),
+ void *data)
+{
+ struct domain_capsule dc = {.data = data};
+ struct vfio_domain *d;
+ struct vfio_group *g;
+ int ret = 0;
+
+ list_for_each_entry(d, &iommu->domain_list, next) {
+ dc.domain = d->domain;
+ list_for_each_entry(g, &d->group_list, next) {
+ ret = iommu_group_for_each_dev(g->iommu_group,
+ &dc, fn);
+ if (ret)
+ break;
+ }
+ }
+ return ret;
+}
+
static int put_pfn(unsigned long pfn, int prot);

/*
@@ -2314,6 +2341,88 @@ static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
return 0;
}

+static int vfio_bind_gpasid_fn(struct device *dev, void *data)
+{
+ struct domain_capsule *dc = (struct domain_capsule *)data;
+ struct iommu_gpasid_bind_data *gbind_data =
+ (struct iommu_gpasid_bind_data *) dc->data;
+
+ return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
+}
+
+static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
+{
+ struct domain_capsule *dc = (struct domain_capsule *)data;
+ struct iommu_gpasid_bind_data *gbind_data =
+ (struct iommu_gpasid_bind_data *) dc->data;
+
+ return iommu_sva_unbind_gpasid(dc->domain, dev,
+ gbind_data->hpasid);
+}
+
+/**
+ * Unbind specific gpasid, caller of this function requires hold
+ * vfio_iommu->lock
+ */
+static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu *iommu,
+ struct iommu_gpasid_bind_data *gbind_data)
+{
+ return vfio_iommu_for_each_dev(iommu,
+ vfio_unbind_gpasid_fn, gbind_data);
+}
+
+static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
+ struct iommu_gpasid_bind_data *gbind_data)
+{
+ int ret = 0;
+
+ mutex_lock(&iommu->lock);
+ if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ ret = vfio_iommu_for_each_dev(iommu,
+ vfio_bind_gpasid_fn, gbind_data);
+ /*
+ * If bind failed, it may not be a total failure. Some devices
+ * within the iommu group may have bind successfully. Although
+ * we don't enable pasid capability for non-singletion iommu
+ * groups, a unbind operation would be helpful to ensure no
+ * partial binding for an iommu group.
+ */
+ if (ret)
+ /*
+ * Undo all binds that already succeeded, no need to
+ * check the return value here since some device within
+ * the group has no successful bind when coming to this
+ * place switch.
+ */
+ vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
+
+out_unlock:
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
+static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
+ struct iommu_gpasid_bind_data *gbind_data)
+{
+ int ret = 0;
+
+ mutex_lock(&iommu->lock);
+ if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ ret = vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
+
+out_unlock:
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
static long vfio_iommu_type1_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
{
@@ -2471,6 +2580,55 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
default:
return -EINVAL;
}
+
+ } else if (cmd == VFIO_IOMMU_BIND) {
+ struct vfio_iommu_type1_bind bind;
+ u32 version;
+ int data_size;
+ void *gbind_data;
+ int ret;
+
+ minsz = offsetofend(struct vfio_iommu_type1_bind, flags);
+
+ if (copy_from_user(&bind, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (bind.argsz < minsz)
+ return -EINVAL;
+
+ /* Get the version of struct iommu_gpasid_bind_data */
+ if (copy_from_user(&version,
+ (void __user *) (arg + minsz),
+ sizeof(version)))
+ return -EFAULT;
+
+ data_size = iommu_uapi_get_data_size(
+ IOMMU_UAPI_BIND_GPASID, version);
+ gbind_data = kzalloc(data_size, GFP_KERNEL);
+ if (!gbind_data)
+ return -ENOMEM;
+
+ if (copy_from_user(gbind_data,
+ (void __user *) (arg + minsz), data_size)) {
+ kfree(gbind_data);
+ return -EFAULT;
+ }
+
+ switch (bind.flags & VFIO_IOMMU_BIND_MASK) {
+ case VFIO_IOMMU_BIND_GUEST_PGTBL:
+ ret = vfio_iommu_type1_bind_gpasid(iommu,
+ gbind_data);
+ break;
+ case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
+ ret = vfio_iommu_type1_unbind_gpasid(iommu,
+ gbind_data);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ kfree(gbind_data);
+ return ret;
}

return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ebeaf3e..2235bc6 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -14,6 +14,7 @@

#include <linux/types.h>
#include <linux/ioctl.h>
+#include <linux/iommu.h>

#define VFIO_API_VERSION 0

@@ -853,6 +854,51 @@ struct vfio_iommu_type1_pasid_request {
*/
#define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE + 22)

+/**
+ * Supported flags:
+ * - VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host for
+ * nesting type IOMMUs. In @data field It takes struct
+ * iommu_gpasid_bind_data.
+ * - VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page table operation
+ * invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.
+ *
+ */
+struct vfio_iommu_type1_bind {
+ __u32 argsz;
+ __u32 flags;
+#define VFIO_IOMMU_BIND_GUEST_PGTBL (1 << 0)
+#define VFIO_IOMMU_UNBIND_GUEST_PGTBL (1 << 1)
+ __u8 data[];
+};
+
+#define VFIO_IOMMU_BIND_MASK (VFIO_IOMMU_BIND_GUEST_PGTBL | \
+ VFIO_IOMMU_UNBIND_GUEST_PGTBL)
+
+/**
+ * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 23,
+ * struct vfio_iommu_type1_bind)
+ *
+ * Manage address spaces of devices in this container. Initially a TYPE1
+ * container can only have one address space, managed with
+ * VFIO_IOMMU_MAP/UNMAP_DMA.
+ *
+ * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by both MAP/UNMAP
+ * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host) page
+ * tables, and BIND manages the stage-1 (guest) page tables. Other types of
+ * IOMMU may allow MAP/UNMAP and BIND to coexist, where MAP/UNMAP controls
+ * the traffics only require single stage translation while BIND controls the
+ * traffics require nesting translation. But this depends on the underlying
+ * IOMMU architecture and isn't guaranteed. Example of this is the guest SVA
+ * traffics, such traffics need nesting translation to gain gVA->gPA and then
+ * gPA->hPA translation.
+ *
+ * Availability of this feature depends on the device, its bus, the underlying
+ * IOMMU and the CPU architecture.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+#define VFIO_IOMMU_BIND _IO(VFIO_TYPE, VFIO_BASE + 23)
+
/* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */

/*
--
2.7.4

2020-03-22 12:28:56

by Yi Liu

[permalink] [raw]

Subject: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace

From: Liu Yi L <[email protected]>

This patch reports PASID alloc/free availability to userspace (e.g. QEMU)
thus userspace could do a pre-check before utilizing this feature.

Cc: Kevin Tian <[email protected]>
CC: Jacob Pan <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Eric Auger <[email protected]>
Cc: Jean-Philippe Brucker <[email protected]>
Signed-off-by: Liu Yi L <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 28 ++++++++++++++++++++++++++++
include/uapi/linux/vfio.h | 8 ++++++++
2 files changed, 36 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e40afc0..ddd1ffe 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2234,6 +2234,30 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
return ret;
}

+static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
+ struct vfio_info_cap *caps)
+{
+ struct vfio_info_cap_header *header;
+ struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
+
+ header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
+ VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
+ if (IS_ERR(header))
+ return PTR_ERR(header);
+
+ nesting_cap = container_of(header,
+ struct vfio_iommu_type1_info_cap_nesting,
+ header);
+
+ nesting_cap->nesting_capabilities = 0;
+ if (iommu->nesting) {
+ /* nesting iommu type supports PASID requests (alloc/free) */
+ nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
+ }
+
+ return 0;
+}
+
static long vfio_iommu_type1_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
{
@@ -2283,6 +2307,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
if (ret)
return ret;

+ ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
+ if (ret)
+ return ret;
+
if (caps.size) {
info.flags |= VFIO_IOMMU_INFO_CAPS;

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 298ac80..8837219 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -748,6 +748,14 @@ struct vfio_iommu_type1_info_cap_iova_range {
struct vfio_iova_range iova_ranges[];
};

+#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING 2
+
+struct vfio_iommu_type1_info_cap_nesting {
+ struct vfio_info_cap_header header;
+#define VFIO_IOMMU_PASID_REQS (1 << 0)
+ __u32 nesting_capabilities;
+};
+
#define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)

/**
--
2.7.4

2020-03-22 18:12:06

Hi Alex,

> From: Alex Williamson <[email protected]>
> Sent: Friday, April 3, 2020 3:57 AM
> To: Liu, Yi L <[email protected]>
> Subject: Re: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
>
> On Sun, 22 Mar 2020 05:32:03 -0700
> "Liu, Yi L" <[email protected]> wrote:
>
> > From: Liu Yi L <[email protected]>
> >
> > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
[...]
> > +/**
> > + * Unbind specific gpasid, caller of this function requires hold
> > + * vfio_iommu->lock
> > + */
> > +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu *iommu,
> > + struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > + return vfio_iommu_for_each_dev(iommu,
> > + vfio_unbind_gpasid_fn, gbind_data);
> > +}
> > +
> > +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> > + struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > + int ret = 0;
> > +
> > + mutex_lock(&iommu->lock);
> > + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > + ret = -EINVAL;
> > + goto out_unlock;
> > + }
> > +
> > + ret = vfio_iommu_for_each_dev(iommu,
> > + vfio_bind_gpasid_fn, gbind_data);
> > + /*
> > + * If bind failed, it may not be a total failure. Some devices
> > + * within the iommu group may have bind successfully. Although
> > + * we don't enable pasid capability for non-singletion iommu
> > + * groups, a unbind operation would be helpful to ensure no
> > + * partial binding for an iommu group.
>
> Where was the non-singleton group restriction done, I missed that.
>
> > + */
> > + if (ret)
> > + /*
> > + * Undo all binds that already succeeded, no need to
> > + * check the return value here since some device within
> > + * the group has no successful bind when coming to this
> > + * place switch.
> > + */
> > + vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
>
> However, the for_each_dev function stops when the callback function
> returns error, are we just assuming we stop at the same device as we
> faulted on the first time and that we traverse the same set of devices
> the second time? It seems strange to me that unbind should be able to
> fail.

I think the code needs enhancement. Although one group per container
and one device per group is the most typical and desired case, but
the code here loops domains, groups, devices. It should be able to
unwind prior bind when the loop failed for a device. So I plan to do
below change for bind path.

list_for_each_entry(domain, &iommu->domain_list, next) {
list_for_each_entry(group, &domain->group_list, next) {
/*
* if bind failed on a certain device, should unbind prior successful
* bind iommu_group_for_each_dev() should be modified to take two
* callbacks, one for forward loop and one for reverse loop when failure
* happened. "return_upon_failure" indicates whether return upon failure
* during forward loop or not. If yes, iommu_group_for_each_dev() should
* unwind the prior bind in this iommu group before return.
*/
ret = iommu_group_for_each_dev(iommu_group, bind_gpasid_fn,
unbind_gpasid_fn, data, return_upon_failure);
if (ret)
break;
}
if (ret) {
/* unwind bindings with prior groups */
list_for_each_entry_continue_reverse(group,
&domain->group_list, next) {
iommu_group_for_each_dev(iommu_group, unbind_gpasid_fn,
NULL, data, ignore_intermediate_failure);
}
break;
}
}

if (ret) {
/* unwind bindings with prior domains */
list_for_each_entry_continue_reverse(domain, &iommu->domain_list, next) {
iommu_group_for_each_dev(iommu_group, unbind_gpasid_fn,
NULL, data, ignore_intermediate_failure);
}
}
}

return ret;

Regards,
Yi Liu