2020-03-22 12:28:01

by Yi Liu

[permalink] [raw]
Subject: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

From: Liu Yi L <[email protected]>

For a long time, devices have only one DMA address space from platform
IOMMU's point of view. This is true for both bare metal and directed-
access in virtualization environment. Reason is the source ID of DMA in
PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
DMA isolation. However, this is changing with the latest advancement in
I/O technology area. More and more platform vendors are utilizing the PCIe
PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
address spaces as identified by their individual PASIDs. For example,
Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
let device access multiple process virtual address space by binding the
virtual address space with a PASID. Wherein the PASID is allocated in
software and programmed to device per device specific manner. Devices
which support PASID capability are called PASID-capable devices. If such
devices are passed through to VMs, guest software are also able to bind
guest process virtual address space on such devices. Therefore, the guest
software could reuse the bare metal software programming model, which
means guest software will also allocate PASID and program it to device
directly. This is a dangerous situation since it has potential PASID
conflicts and unauthorized address space access. It would be safer to
let host intercept in the guest software's PASID allocation. Thus PASID
are managed system-wide.

This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to passdown
PASID allocation/free request from the virtual IOMMU. Additionally, such
requests are intended to be invoked by QEMU or other applications which
are running in userspace, it is necessary to have a mechanism to prevent
single application from abusing available PASIDs in system. With such
consideration, this patch tracks the VFIO PASID allocation per-VM. There
was a discussion to make quota to be per assigned devices. e.g. if a VM
has many assigned devices, then it should have more quota. However, it
is not sure how many PASIDs an assigned devices will use. e.g. it is
possible that a VM with multiples assigned devices but requests less
PASIDs. Therefore per-VM quota would be better.

This patch uses struct mm pointer as a per-VM token. We also considered
using task structure pointer and vfio_iommu structure pointer. However,
task structure is per-thread, which means it cannot achieve per-VM PASID
alloc tracking purpose. While for vfio_iommu structure, it is visible
only within vfio. Therefore, structure mm pointer is selected. This patch
adds a structure vfio_mm. A vfio_mm is created when the first vfio
container is opened by a VM. On the reverse order, vfio_mm is free when
the last vfio container is released. Each VM is assigned with a PASID
quota, so that it is not able to request PASID beyond its quota. This
patch adds a default quota of 1000. This quota could be tuned by
administrator. Making PASID quota tunable will be added in another patch
in this series.

Previous discussions:
https://patchwork.kernel.org/patch/11209429/

Cc: Kevin Tian <[email protected]>
CC: Jacob Pan <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Eric Auger <[email protected]>
Cc: Jean-Philippe Brucker <[email protected]>
Signed-off-by: Liu Yi L <[email protected]>
Signed-off-by: Yi Sun <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/vfio/vfio.c | 130 ++++++++++++++++++++++++++++++++++++++++
drivers/vfio/vfio_iommu_type1.c | 104 ++++++++++++++++++++++++++++++++
include/linux/vfio.h | 20 +++++++
include/uapi/linux/vfio.h | 41 +++++++++++++
4 files changed, 295 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index c848262..d13b483 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -32,6 +32,7 @@
#include <linux/vfio.h>
#include <linux/wait.h>
#include <linux/sched/signal.h>
+#include <linux/sched/mm.h>

#define DRIVER_VERSION "0.3"
#define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
@@ -46,6 +47,8 @@ static struct vfio {
struct mutex group_lock;
struct cdev group_cdev;
dev_t group_devt;
+ struct list_head vfio_mm_list;
+ struct mutex vfio_mm_lock;
wait_queue_head_t release_q;
} vfio;

@@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev, enum vfio_notify_type type,
EXPORT_SYMBOL(vfio_unregister_notifier);

/**
+ * VFIO_MM objects - create, release, get, put, search
+ * Caller of the function should have held vfio.vfio_mm_lock.
+ */
+static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
+{
+ struct vfio_mm *vmm;
+ struct vfio_mm_token *token;
+ int ret = 0;
+
+ vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
+ if (!vmm)
+ return ERR_PTR(-ENOMEM);
+
+ /* Per mm IOASID set used for quota control and group operations */
+ ret = ioasid_alloc_set((struct ioasid_set *) mm,
+ VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
+ if (ret) {
+ kfree(vmm);
+ return ERR_PTR(ret);
+ }
+
+ kref_init(&vmm->kref);
+ token = &vmm->token;
+ token->val = mm;
+ vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
+ mutex_init(&vmm->pasid_lock);
+
+ list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
+
+ return vmm;
+}
+
+static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
+{
+ /* destroy the ioasid set */
+ ioasid_free_set(vmm->ioasid_sid, true);
+ mutex_unlock(&vfio.vfio_mm_lock);
+ kfree(vmm);
+}
+
+/* called with vfio.vfio_mm_lock held */
+static void vfio_mm_release(struct kref *kref)
+{
+ struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
+
+ list_del(&vmm->vfio_next);
+ vfio_mm_unlock_and_free(vmm);
+}
+
+void vfio_mm_put(struct vfio_mm *vmm)
+{
+ kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio.vfio_mm_lock);
+}
+EXPORT_SYMBOL_GPL(vfio_mm_put);
+
+/* Assume vfio_mm_lock or vfio_mm reference is held */
+static void vfio_mm_get(struct vfio_mm *vmm)
+{
+ kref_get(&vmm->kref);
+}
+
+struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
+{
+ struct mm_struct *mm = get_task_mm(task);
+ struct vfio_mm *vmm;
+ unsigned long long val = (unsigned long long) mm;
+
+ mutex_lock(&vfio.vfio_mm_lock);
+ list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
+ if (vmm->token.val == val) {
+ vfio_mm_get(vmm);
+ goto out;
+ }
+ }
+
+ vmm = vfio_create_mm(mm);
+ if (IS_ERR(vmm))
+ vmm = NULL;
+out:
+ mutex_unlock(&vfio.vfio_mm_lock);
+ mmput(mm);
+ return vmm;
+}
+EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
+
+int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
+{
+ ioasid_t pasid;
+ int ret = -ENOSPC;
+
+ mutex_lock(&vmm->pasid_lock);
+
+ pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
+ if (pasid == INVALID_IOASID) {
+ ret = -ENOSPC;
+ goto out_unlock;
+ }
+
+ ret = pasid;
+out_unlock:
+ mutex_unlock(&vmm->pasid_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
+
+int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
+{
+ void *pdata;
+ int ret = 0;
+
+ mutex_lock(&vmm->pasid_lock);
+ pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
+ if (IS_ERR(pdata)) {
+ ret = PTR_ERR(pdata);
+ goto out_unlock;
+ }
+ ioasid_free(pasid);
+
+out_unlock:
+ mutex_unlock(&vmm->pasid_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
+
+/**
* Module/class support
*/
static char *vfio_devnode(struct device *dev, umode_t *mode)
@@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
idr_init(&vfio.group_idr);
mutex_init(&vfio.group_lock);
mutex_init(&vfio.iommu_drivers_lock);
+ mutex_init(&vfio.vfio_mm_lock);
INIT_LIST_HEAD(&vfio.group_list);
INIT_LIST_HEAD(&vfio.iommu_drivers_list);
+ INIT_LIST_HEAD(&vfio.vfio_mm_list);
init_waitqueue_head(&vfio.release_q);

ret = misc_register(&vfio_dev);
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index a177bf2..331ceee 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -70,6 +70,7 @@ struct vfio_iommu {
unsigned int dma_avail;
bool v2;
bool nesting;
+ struct vfio_mm *vmm;
};

struct vfio_domain {
@@ -2018,6 +2019,7 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
static void *vfio_iommu_type1_open(unsigned long arg)
{
struct vfio_iommu *iommu;
+ struct vfio_mm *vmm = NULL;

iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
if (!iommu)
@@ -2043,6 +2045,10 @@ static void *vfio_iommu_type1_open(unsigned long arg)
iommu->dma_avail = dma_entry_limit;
mutex_init(&iommu->lock);
BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
+ vmm = vfio_mm_get_from_task(current);
+ if (!vmm)
+ pr_err("Failed to get vfio_mm track\n");
+ iommu->vmm = vmm;

return iommu;
}
@@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void *iommu_data)
}

vfio_iommu_iova_free(&iommu->iova_list);
+ if (iommu->vmm)
+ vfio_mm_put(iommu->vmm);

kfree(iommu);
}
@@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
return ret;
}

+static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
+{
+ return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
+ (flags & VFIO_IOMMU_PASID_ALLOC &&
+ flags & VFIO_IOMMU_PASID_FREE));
+}
+
+static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
+ int min,
+ int max)
+{
+ struct vfio_mm *vmm = iommu->vmm;
+ int ret = 0;
+
+ mutex_lock(&iommu->lock);
+ if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+ if (vmm)
+ ret = vfio_mm_pasid_alloc(vmm, min, max);
+ else
+ ret = -EINVAL;
+out_unlock:
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
+static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
+ unsigned int pasid)
+{
+ struct vfio_mm *vmm = iommu->vmm;
+ int ret = 0;
+
+ mutex_lock(&iommu->lock);
+ if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+
+ if (vmm)
+ ret = vfio_mm_pasid_free(vmm, pasid);
+ else
+ ret = -EINVAL;
+out_unlock:
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
static long vfio_iommu_type1_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
{
@@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,

return copy_to_user((void __user *)arg, &unmap, minsz) ?
-EFAULT : 0;
+
+ } else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
+ struct vfio_iommu_type1_pasid_request req;
+ unsigned long offset;
+
+ minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
+ flags);
+
+ if (copy_from_user(&req, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (req.argsz < minsz ||
+ !vfio_iommu_type1_pasid_req_valid(req.flags))
+ return -EINVAL;
+
+ if (copy_from_user((void *)&req + minsz,
+ (void __user *)arg + minsz,
+ sizeof(req) - minsz))
+ return -EFAULT;
+
+ switch (req.flags & VFIO_PASID_REQUEST_MASK) {
+ case VFIO_IOMMU_PASID_ALLOC:
+ {
+ int ret = 0, result;
+
+ result = vfio_iommu_type1_pasid_alloc(iommu,
+ req.alloc_pasid.min,
+ req.alloc_pasid.max);
+ if (result > 0) {
+ offset = offsetof(
+ struct vfio_iommu_type1_pasid_request,
+ alloc_pasid.result);
+ ret = copy_to_user(
+ (void __user *) (arg + offset),
+ &result, sizeof(result));
+ } else {
+ pr_debug("%s: PASID alloc failed\n", __func__);
+ ret = -EFAULT;
+ }
+ return ret;
+ }
+ case VFIO_IOMMU_PASID_FREE:
+ return vfio_iommu_type1_pasid_free(iommu,
+ req.free_pasid);
+ default:
+ return -EINVAL;
+ }
}

return -ENOTTY;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index e42a711..75f9f7f1 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
extern void vfio_unregister_iommu_driver(
const struct vfio_iommu_driver_ops *ops);

+#define VFIO_DEFAULT_PASID_QUOTA 1000
+struct vfio_mm_token {
+ unsigned long long val;
+};
+
+struct vfio_mm {
+ struct kref kref;
+ struct vfio_mm_token token;
+ int ioasid_sid;
+ /* protect @pasid_quota field and pasid allocation/free */
+ struct mutex pasid_lock;
+ int pasid_quota;
+ struct list_head vfio_next;
+};
+
+extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
+extern void vfio_mm_put(struct vfio_mm *vmm);
+extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
+extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
+
/*
* External user API
*/
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9e843a1..298ac80 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
#define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
#define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)

+/*
+ * PASID (Process Address Space ID) is a PCIe concept which
+ * has been extended to support DMA isolation in fine-grain.
+ * With device assigned to user space (e.g. VMs), PASID alloc
+ * and free need to be system wide. This structure defines
+ * the info for pasid alloc/free between user space and kernel
+ * space.
+ *
+ * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
+ * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
+ */
+struct vfio_iommu_type1_pasid_request {
+ __u32 argsz;
+#define VFIO_IOMMU_PASID_ALLOC (1 << 0)
+#define VFIO_IOMMU_PASID_FREE (1 << 1)
+ __u32 flags;
+ union {
+ struct {
+ __u32 min;
+ __u32 max;
+ __u32 result;
+ } alloc_pasid;
+ __u32 free_pasid;
+ };
+};
+
+#define VFIO_PASID_REQUEST_MASK (VFIO_IOMMU_PASID_ALLOC | \
+ VFIO_IOMMU_PASID_FREE)
+
+/**
+ * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
+ * struct vfio_iommu_type1_pasid_request)
+ *
+ * Availability of this feature depends on PASID support in the device,
+ * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
+ * is available after VFIO_SET_IOMMU.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+#define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE + 22)
+
/* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */

/*
--
2.7.4


2020-03-22 16:22:54

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

Hi Yi,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on vfio/next]
[also build test WARNING on v5.6-rc6 next-20200320]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url: https://github.com/0day-ci/linux/commits/Liu-Yi-L/vfio-expose-virtual-Shared-Virtual-Addressing-to-VMs/20200322-213259
base: https://github.com/awilliam/linux-vfio.git next
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gcc (GCC) 9.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=9.2.0 make.cross ARCH=arm64

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <[email protected]>

All warnings (new ones prefixed by >>):

drivers/vfio/vfio.c: In function 'vfio_create_mm':
drivers/vfio/vfio.c:2149:8: error: implicit declaration of function 'ioasid_alloc_set'; did you mean 'ioasid_alloc'? [-Werror=implicit-function-declaration]
2149 | ret = ioasid_alloc_set((struct ioasid_set *) mm,
| ^~~~~~~~~~~~~~~~
| ioasid_alloc
>> drivers/vfio/vfio.c:2158:13: warning: assignment to 'long long unsigned int' from 'struct mm_struct *' makes integer from pointer without a cast [-Wint-conversion]
2158 | token->val = mm;
| ^
drivers/vfio/vfio.c: In function 'vfio_mm_unlock_and_free':
drivers/vfio/vfio.c:2170:2: error: implicit declaration of function 'ioasid_free_set'; did you mean 'ioasid_free'? [-Werror=implicit-function-declaration]
2170 | ioasid_free_set(vmm->ioasid_sid, true);
| ^~~~~~~~~~~~~~~
| ioasid_free
drivers/vfio/vfio.c: In function 'vfio_mm_pasid_alloc':
drivers/vfio/vfio.c:2227:26: warning: passing argument 1 of 'ioasid_alloc' makes pointer from integer without a cast [-Wint-conversion]
2227 | pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
| ~~~^~~~~~~~~~~~
| |
| int
In file included from include/linux/iommu.h:16,
from drivers/vfio/vfio.c:20:
include/linux/ioasid.h:45:56: note: expected 'struct ioasid_set *' but argument is of type 'int'
45 | static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
| ~~~~~~~~~~~~~~~~~~~^~~
drivers/vfio/vfio.c: In function 'vfio_mm_pasid_free':
drivers/vfio/vfio.c:2246:25: warning: passing argument 1 of 'ioasid_find' makes pointer from integer without a cast [-Wint-conversion]
2246 | pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
| ~~~^~~~~~~~~~~~
| |
| int
In file included from include/linux/iommu.h:16,
from drivers/vfio/vfio.c:20:
include/linux/ioasid.h:55:52: note: expected 'struct ioasid_set *' but argument is of type 'int'
55 | static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
| ~~~~~~~~~~~~~~~~~~~^~~
cc1: some warnings being treated as errors

vim +2158 drivers/vfio/vfio.c

2133
2134 /**
2135 * VFIO_MM objects - create, release, get, put, search
2136 * Caller of the function should have held vfio.vfio_mm_lock.
2137 */
2138 static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
2139 {
2140 struct vfio_mm *vmm;
2141 struct vfio_mm_token *token;
2142 int ret = 0;
2143
2144 vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
2145 if (!vmm)
2146 return ERR_PTR(-ENOMEM);
2147
2148 /* Per mm IOASID set used for quota control and group operations */
2149 ret = ioasid_alloc_set((struct ioasid_set *) mm,
2150 VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
2151 if (ret) {
2152 kfree(vmm);
2153 return ERR_PTR(ret);
2154 }
2155
2156 kref_init(&vmm->kref);
2157 token = &vmm->token;
> 2158 token->val = mm;
2159 vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
2160 mutex_init(&vmm->pasid_lock);
2161
2162 list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
2163
2164 return vmm;
2165 }
2166

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (4.61 kB)
.config.gz (45.38 kB)
Download all attachments

2020-03-30 08:33:27

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

> From: Liu, Yi L <[email protected]>
> Sent: Sunday, March 22, 2020 8:32 PM
>
> From: Liu Yi L <[email protected]>
>
> For a long time, devices have only one DMA address space from platform
> IOMMU's point of view. This is true for both bare metal and directed-
> access in virtualization environment. Reason is the source ID of DMA in
> PCIe are BDF (bus/dev/fnc ID), which results in only device granularity

are->is

> DMA isolation. However, this is changing with the latest advancement in
> I/O technology area. More and more platform vendors are utilizing the PCIe
> PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> address spaces as identified by their individual PASIDs. For example,
> Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> let device access multiple process virtual address space by binding the

"address space" -> "address spaces"

"binding the" -> "binding each"

> virtual address space with a PASID. Wherein the PASID is allocated in
> software and programmed to device per device specific manner. Devices
> which support PASID capability are called PASID-capable devices. If such
> devices are passed through to VMs, guest software are also able to bind
> guest process virtual address space on such devices. Therefore, the guest
> software could reuse the bare metal software programming model, which
> means guest software will also allocate PASID and program it to device
> directly. This is a dangerous situation since it has potential PASID
> conflicts and unauthorized address space access. It would be safer to
> let host intercept in the guest software's PASID allocation. Thus PASID
> are managed system-wide.
>
> This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> passdown
> PASID allocation/free request from the virtual IOMMU. Additionally, such

"Additionally, because such"

> requests are intended to be invoked by QEMU or other applications which

simplify to "intended to be invoked from userspace"

> are running in userspace, it is necessary to have a mechanism to prevent
> single application from abusing available PASIDs in system. With such
> consideration, this patch tracks the VFIO PASID allocation per-VM. There
> was a discussion to make quota to be per assigned devices. e.g. if a VM
> has many assigned devices, then it should have more quota. However, it
> is not sure how many PASIDs an assigned devices will use. e.g. it is

devices -> device

> possible that a VM with multiples assigned devices but requests less
> PASIDs. Therefore per-VM quota would be better.
>
> This patch uses struct mm pointer as a per-VM token. We also considered
> using task structure pointer and vfio_iommu structure pointer. However,
> task structure is per-thread, which means it cannot achieve per-VM PASID
> alloc tracking purpose. While for vfio_iommu structure, it is visible
> only within vfio. Therefore, structure mm pointer is selected. This patch
> adds a structure vfio_mm. A vfio_mm is created when the first vfio
> container is opened by a VM. On the reverse order, vfio_mm is free when
> the last vfio container is released. Each VM is assigned with a PASID
> quota, so that it is not able to request PASID beyond its quota. This
> patch adds a default quota of 1000. This quota could be tuned by
> administrator. Making PASID quota tunable will be added in another patch
> in this series.
>
> Previous discussions:
> https://patchwork.kernel.org/patch/11209429/
>
> Cc: Kevin Tian <[email protected]>
> CC: Jacob Pan <[email protected]>
> Cc: Alex Williamson <[email protected]>
> Cc: Eric Auger <[email protected]>
> Cc: Jean-Philippe Brucker <[email protected]>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Yi Sun <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/vfio/vfio.c | 130
> ++++++++++++++++++++++++++++++++++++++++
> drivers/vfio/vfio_iommu_type1.c | 104
> ++++++++++++++++++++++++++++++++
> include/linux/vfio.h | 20 +++++++
> include/uapi/linux/vfio.h | 41 +++++++++++++
> 4 files changed, 295 insertions(+)
>
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index c848262..d13b483 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -32,6 +32,7 @@
> #include <linux/vfio.h>
> #include <linux/wait.h>
> #include <linux/sched/signal.h>
> +#include <linux/sched/mm.h>
>
> #define DRIVER_VERSION "0.3"
> #define DRIVER_AUTHOR "Alex Williamson
> <[email protected]>"
> @@ -46,6 +47,8 @@ static struct vfio {
> struct mutex group_lock;
> struct cdev group_cdev;
> dev_t group_devt;
> + struct list_head vfio_mm_list;
> + struct mutex vfio_mm_lock;
> wait_queue_head_t release_q;
> } vfio;
>
> @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev,
> enum vfio_notify_type type,
> EXPORT_SYMBOL(vfio_unregister_notifier);
>
> /**
> + * VFIO_MM objects - create, release, get, put, search

why capitalizing vfio_mm?

> + * Caller of the function should have held vfio.vfio_mm_lock.
> + */
> +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> +{
> + struct vfio_mm *vmm;
> + struct vfio_mm_token *token;
> + int ret = 0;
> +
> + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> + if (!vmm)
> + return ERR_PTR(-ENOMEM);
> +
> + /* Per mm IOASID set used for quota control and group operations
> */
> + ret = ioasid_alloc_set((struct ioasid_set *) mm,
> + VFIO_DEFAULT_PASID_QUOTA, &vmm-
> >ioasid_sid);
> + if (ret) {
> + kfree(vmm);
> + return ERR_PTR(ret);
> + }
> +
> + kref_init(&vmm->kref);
> + token = &vmm->token;
> + token->val = mm;
> + vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> + mutex_init(&vmm->pasid_lock);
> +
> + list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> +
> + return vmm;
> +}
> +
> +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
> +{
> + /* destroy the ioasid set */
> + ioasid_free_set(vmm->ioasid_sid, true);

do we need hold pasid lock here, since it attempts to destroy a
set which might be referenced by vfio_mm_pasid_free? or is
there guarantee that such race won't happen?

> + mutex_unlock(&vfio.vfio_mm_lock);
> + kfree(vmm);
> +}
> +
> +/* called with vfio.vfio_mm_lock held */
> +static void vfio_mm_release(struct kref *kref)
> +{
> + struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> +
> + list_del(&vmm->vfio_next);
> + vfio_mm_unlock_and_free(vmm);
> +}
> +
> +void vfio_mm_put(struct vfio_mm *vmm)
> +{
> + kref_put_mutex(&vmm->kref, vfio_mm_release,
> &vfio.vfio_mm_lock);
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_put);
> +
> +/* Assume vfio_mm_lock or vfio_mm reference is held */
> +static void vfio_mm_get(struct vfio_mm *vmm)
> +{
> + kref_get(&vmm->kref);
> +}
> +
> +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> +{
> + struct mm_struct *mm = get_task_mm(task);
> + struct vfio_mm *vmm;
> + unsigned long long val = (unsigned long long) mm;
> +
> + mutex_lock(&vfio.vfio_mm_lock);
> + list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> + if (vmm->token.val == val) {
> + vfio_mm_get(vmm);
> + goto out;
> + }
> + }
> +
> + vmm = vfio_create_mm(mm);
> + if (IS_ERR(vmm))
> + vmm = NULL;
> +out:
> + mutex_unlock(&vfio.vfio_mm_lock);
> + mmput(mm);

I assume this has been discussed before, but from readability p.o.v
it might be good to add a comment for this function to explain
how the recording of mm in vfio_mm can be correctly removed
when the mm is being destroyed, since we don't hold a reference
of mm here.

> + return vmm;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> +
> +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> +{
> + ioasid_t pasid;
> + int ret = -ENOSPC;
> +
> + mutex_lock(&vmm->pasid_lock);
> +
> + pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> + if (pasid == INVALID_IOASID) {
> + ret = -ENOSPC;
> + goto out_unlock;
> + }
> +
> + ret = pasid;
> +out_unlock:
> + mutex_unlock(&vmm->pasid_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> +
> +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
> +{
> + void *pdata;
> + int ret = 0;
> +
> + mutex_lock(&vmm->pasid_lock);
> + pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> + if (IS_ERR(pdata)) {
> + ret = PTR_ERR(pdata);
> + goto out_unlock;
> + }
> + ioasid_free(pasid);
> +
> +out_unlock:
> + mutex_unlock(&vmm->pasid_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> +
> +/**
> * Module/class support
> */
> static char *vfio_devnode(struct device *dev, umode_t *mode)
> @@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
> idr_init(&vfio.group_idr);
> mutex_init(&vfio.group_lock);
> mutex_init(&vfio.iommu_drivers_lock);
> + mutex_init(&vfio.vfio_mm_lock);
> INIT_LIST_HEAD(&vfio.group_list);
> INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> + INIT_LIST_HEAD(&vfio.vfio_mm_list);
> init_waitqueue_head(&vfio.release_q);
>
> ret = misc_register(&vfio_dev);
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index a177bf2..331ceee 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -70,6 +70,7 @@ struct vfio_iommu {
> unsigned int dma_avail;
> bool v2;
> bool nesting;
> + struct vfio_mm *vmm;
> };
>
> struct vfio_domain {
> @@ -2018,6 +2019,7 @@ static void vfio_iommu_type1_detach_group(void
> *iommu_data,
> static void *vfio_iommu_type1_open(unsigned long arg)
> {
> struct vfio_iommu *iommu;
> + struct vfio_mm *vmm = NULL;
>
> iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> if (!iommu)
> @@ -2043,6 +2045,10 @@ static void *vfio_iommu_type1_open(unsigned
> long arg)
> iommu->dma_avail = dma_entry_limit;
> mutex_init(&iommu->lock);
> BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> + vmm = vfio_mm_get_from_task(current);
> + if (!vmm)
> + pr_err("Failed to get vfio_mm track\n");

I assume error should be returned when pr_err is used...

> + iommu->vmm = vmm;
>
> return iommu;
> }
> @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void
> *iommu_data)
> }
>
> vfio_iommu_iova_free(&iommu->iova_list);
> + if (iommu->vmm)
> + vfio_mm_put(iommu->vmm);
>
> kfree(iommu);
> }
> @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct
> vfio_iommu *iommu,
> return ret;
> }
>
> +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)

I don't think you need prefix "vfio_iommu_type1" for every new
function here, especially for leaf internal function as this one.

> +{
> + return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> + (flags & VFIO_IOMMU_PASID_ALLOC &&
> + flags & VFIO_IOMMU_PASID_FREE));
> +}
> +
> +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> + int min,
> + int max)
> +{
> + struct vfio_mm *vmm = iommu->vmm;
> + int ret = 0;
> +
> + mutex_lock(&iommu->lock);
> + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> + ret = -EFAULT;

why -EFAULT?

> + goto out_unlock;
> + }
> + if (vmm)
> + ret = vfio_mm_pasid_alloc(vmm, min, max);
> + else
> + ret = -EINVAL;
> +out_unlock:
> + mutex_unlock(&iommu->lock);
> + return ret;
> +}
> +
> +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> + unsigned int pasid)
> +{
> + struct vfio_mm *vmm = iommu->vmm;
> + int ret = 0;
> +
> + mutex_lock(&iommu->lock);
> + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> + ret = -EFAULT;

ditto

> + goto out_unlock;
> + }
> +
> + if (vmm)
> + ret = vfio_mm_pasid_free(vmm, pasid);
> + else
> + ret = -EINVAL;
> +out_unlock:
> + mutex_unlock(&iommu->lock);
> + return ret;
> +}
> +
> static long vfio_iommu_type1_ioctl(void *iommu_data,
> unsigned int cmd, unsigned long arg)
> {
> @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
>
> return copy_to_user((void __user *)arg, &unmap, minsz) ?
> -EFAULT : 0;
> +
> + } else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> + struct vfio_iommu_type1_pasid_request req;
> + unsigned long offset;
> +
> + minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> + flags);
> +
> + if (copy_from_user(&req, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (req.argsz < minsz ||
> + !vfio_iommu_type1_pasid_req_valid(req.flags))
> + return -EINVAL;
> +
> + if (copy_from_user((void *)&req + minsz,
> + (void __user *)arg + minsz,
> + sizeof(req) - minsz))
> + return -EFAULT;

why copying in two steps instead of copying them together?

> +
> + switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> + case VFIO_IOMMU_PASID_ALLOC:
> + {
> + int ret = 0, result;
> +
> + result = vfio_iommu_type1_pasid_alloc(iommu,
> + req.alloc_pasid.min,
> + req.alloc_pasid.max);
> + if (result > 0) {
> + offset = offsetof(
> + struct
> vfio_iommu_type1_pasid_request,
> + alloc_pasid.result);
> + ret = copy_to_user(
> + (void __user *) (arg + offset),
> + &result, sizeof(result));
> + } else {
> + pr_debug("%s: PASID alloc failed\n",
> __func__);
> + ret = -EFAULT;

no, this branch is not for copy_to_user error. it is about pasid alloc
failure. you should handle both.

> + }
> + return ret;
> + }
> + case VFIO_IOMMU_PASID_FREE:
> + return vfio_iommu_type1_pasid_free(iommu,
> + req.free_pasid);
> + default:
> + return -EINVAL;
> + }
> }
>
> return -ENOTTY;
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index e42a711..75f9f7f1 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct
> vfio_iommu_driver_ops *ops);
> extern void vfio_unregister_iommu_driver(
> const struct vfio_iommu_driver_ops *ops);
>
> +#define VFIO_DEFAULT_PASID_QUOTA 1000
> +struct vfio_mm_token {
> + unsigned long long val;
> +};
> +
> +struct vfio_mm {
> + struct kref kref;
> + struct vfio_mm_token token;
> + int ioasid_sid;
> + /* protect @pasid_quota field and pasid allocation/free */
> + struct mutex pasid_lock;
> + int pasid_quota;
> + struct list_head vfio_next;
> +};
> +
> +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> +extern void vfio_mm_put(struct vfio_mm *vmm);
> +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> +
> /*
> * External user API
> */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a1..298ac80 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> #define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
> #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
>
> +/*
> + * PASID (Process Address Space ID) is a PCIe concept which
> + * has been extended to support DMA isolation in fine-grain.
> + * With device assigned to user space (e.g. VMs), PASID alloc
> + * and free need to be system wide. This structure defines
> + * the info for pasid alloc/free between user space and kernel
> + * space.
> + *
> + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> + */
> +struct vfio_iommu_type1_pasid_request {
> + __u32 argsz;
> +#define VFIO_IOMMU_PASID_ALLOC (1 << 0)
> +#define VFIO_IOMMU_PASID_FREE (1 << 1)
> + __u32 flags;
> + union {
> + struct {
> + __u32 min;
> + __u32 max;
> + __u32 result;

result->pasid?

> + } alloc_pasid;
> + __u32 free_pasid;

what about putting a common pasid field after flags?

> + };
> +};
> +
> +#define VFIO_PASID_REQUEST_MASK (VFIO_IOMMU_PASID_ALLOC | \
> + VFIO_IOMMU_PASID_FREE)
> +
> +/**
> + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> + * struct vfio_iommu_type1_pasid_request)
> + *
> + * Availability of this feature depends on PASID support in the device,
> + * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
> + * is available after VFIO_SET_IOMMU.
> + *
> + * returns: 0 on success, -errno on failure.
> + */
> +#define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE +
> 22)
> +
> /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>
> /*
> --
> 2.7.4

2020-03-30 14:43:44

by Yi Liu

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

> From: Tian, Kevin <[email protected]>
> Sent: Monday, March 30, 2020 4:32 PM
> To: Liu, Yi L <[email protected]>; [email protected];
> Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
>
> > From: Liu, Yi L <[email protected]>
> > Sent: Sunday, March 22, 2020 8:32 PM
> >
> > From: Liu Yi L <[email protected]>
> >
> > For a long time, devices have only one DMA address space from platform
> > IOMMU's point of view. This is true for both bare metal and directed-
> > access in virtualization environment. Reason is the source ID of DMA in
> > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
>
> are->is

thanks.

> > DMA isolation. However, this is changing with the latest advancement in
> > I/O technology area. More and more platform vendors are utilizing the PCIe
> > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > address spaces as identified by their individual PASIDs. For example,
> > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > let device access multiple process virtual address space by binding the
>
> "address space" -> "address spaces"
>
> "binding the" -> "binding each"

will correct both.

> > virtual address space with a PASID. Wherein the PASID is allocated in
> > software and programmed to device per device specific manner. Devices
> > which support PASID capability are called PASID-capable devices. If such
> > devices are passed through to VMs, guest software are also able to bind
> > guest process virtual address space on such devices. Therefore, the guest
> > software could reuse the bare metal software programming model, which
> > means guest software will also allocate PASID and program it to device
> > directly. This is a dangerous situation since it has potential PASID
> > conflicts and unauthorized address space access. It would be safer to
> > let host intercept in the guest software's PASID allocation. Thus PASID
> > are managed system-wide.
> >
> > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> > passdown
> > PASID allocation/free request from the virtual IOMMU. Additionally, such
>
> "Additionally, because such"
>
> > requests are intended to be invoked by QEMU or other applications which
>
> simplify to "intended to be invoked from userspace"

got it.

> > are running in userspace, it is necessary to have a mechanism to prevent
> > single application from abusing available PASIDs in system. With such
> > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > has many assigned devices, then it should have more quota. However, it
> > is not sure how many PASIDs an assigned devices will use. e.g. it is
>
> devices -> device

got it.

> > possible that a VM with multiples assigned devices but requests less
> > PASIDs. Therefore per-VM quota would be better.
> >
> > This patch uses struct mm pointer as a per-VM token. We also considered
> > using task structure pointer and vfio_iommu structure pointer. However,
> > task structure is per-thread, which means it cannot achieve per-VM PASID
> > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > only within vfio. Therefore, structure mm pointer is selected. This patch
> > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > container is opened by a VM. On the reverse order, vfio_mm is free when
> > the last vfio container is released. Each VM is assigned with a PASID
> > quota, so that it is not able to request PASID beyond its quota. This
> > patch adds a default quota of 1000. This quota could be tuned by
> > administrator. Making PASID quota tunable will be added in another patch
> > in this series.
> >
> > Previous discussions:
> > https://patchwork.kernel.org/patch/11209429/
> >
> > Cc: Kevin Tian <[email protected]>
> > CC: Jacob Pan <[email protected]>
> > Cc: Alex Williamson <[email protected]>
> > Cc: Eric Auger <[email protected]>
> > Cc: Jean-Philippe Brucker <[email protected]>
> > Signed-off-by: Liu Yi L <[email protected]>
> > Signed-off-by: Yi Sun <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > drivers/vfio/vfio.c | 130
> > ++++++++++++++++++++++++++++++++++++++++
> > drivers/vfio/vfio_iommu_type1.c | 104
> > ++++++++++++++++++++++++++++++++
> > include/linux/vfio.h | 20 +++++++
> > include/uapi/linux/vfio.h | 41 +++++++++++++
> > 4 files changed, 295 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index c848262..d13b483 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -32,6 +32,7 @@
> > #include <linux/vfio.h>
> > #include <linux/wait.h>
> > #include <linux/sched/signal.h>
> > +#include <linux/sched/mm.h>
> >
> > #define DRIVER_VERSION "0.3"
> > #define DRIVER_AUTHOR "Alex Williamson
> > <[email protected]>"
> > @@ -46,6 +47,8 @@ static struct vfio {
> > struct mutex group_lock;
> > struct cdev group_cdev;
> > dev_t group_devt;
> > + struct list_head vfio_mm_list;
> > + struct mutex vfio_mm_lock;
> > wait_queue_head_t release_q;
> > } vfio;
> >
> > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev,
> > enum vfio_notify_type type,
> > EXPORT_SYMBOL(vfio_unregister_notifier);
> >
> > /**
> > + * VFIO_MM objects - create, release, get, put, search
>
> why capitalizing vfio_mm?

oops, it's not intended, will fix it.

> > + * Caller of the function should have held vfio.vfio_mm_lock.
> > + */
> > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> > +{
> > + struct vfio_mm *vmm;
> > + struct vfio_mm_token *token;
> > + int ret = 0;
> > +
> > + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > + if (!vmm)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + /* Per mm IOASID set used for quota control and group operations
> > */
> > + ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > + VFIO_DEFAULT_PASID_QUOTA, &vmm-
> > >ioasid_sid);
> > + if (ret) {
> > + kfree(vmm);
> > + return ERR_PTR(ret);
> > + }
> > +
> > + kref_init(&vmm->kref);
> > + token = &vmm->token;
> > + token->val = mm;
> > + vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > + mutex_init(&vmm->pasid_lock);
> > +
> > + list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> > +
> > + return vmm;
> > +}
> > +
> > +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
> > +{
> > + /* destroy the ioasid set */
> > + ioasid_free_set(vmm->ioasid_sid, true);
>
> do we need hold pasid lock here, since it attempts to destroy a
> set which might be referenced by vfio_mm_pasid_free? or is
> there guarantee that such race won't happen?

Emmm, if considering the race between ioasid_free_set and
vfio_mm_pasid_free, I guess ioasid core should sequence the
two operations with its internal lock. right?

> > + mutex_unlock(&vfio.vfio_mm_lock);
> > + kfree(vmm);
> > +}
> > +
> > +/* called with vfio.vfio_mm_lock held */
> > +static void vfio_mm_release(struct kref *kref)
> > +{
> > + struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> > +
> > + list_del(&vmm->vfio_next);
> > + vfio_mm_unlock_and_free(vmm);
> > +}
> > +
> > +void vfio_mm_put(struct vfio_mm *vmm)
> > +{
> > + kref_put_mutex(&vmm->kref, vfio_mm_release,
> > &vfio.vfio_mm_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> > +
> > +/* Assume vfio_mm_lock or vfio_mm reference is held */
> > +static void vfio_mm_get(struct vfio_mm *vmm)
> > +{
> > + kref_get(&vmm->kref);
> > +}
> > +
> > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> > +{
> > + struct mm_struct *mm = get_task_mm(task);
> > + struct vfio_mm *vmm;
> > + unsigned long long val = (unsigned long long) mm;
> > +
> > + mutex_lock(&vfio.vfio_mm_lock);
> > + list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> > + if (vmm->token.val == val) {
> > + vfio_mm_get(vmm);
> > + goto out;
> > + }
> > + }
> > +
> > + vmm = vfio_create_mm(mm);
> > + if (IS_ERR(vmm))
> > + vmm = NULL;
> > +out:
> > + mutex_unlock(&vfio.vfio_mm_lock);
> > + mmput(mm);
>
> I assume this has been discussed before, but from readability p.o.v
> it might be good to add a comment for this function to explain
> how the recording of mm in vfio_mm can be correctly removed
> when the mm is being destroyed, since we don't hold a reference
> of mm here.

yeah, I'll add it.

> > + return vmm;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > +
> > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > +{
> > + ioasid_t pasid;
> > + int ret = -ENOSPC;
> > +
> > + mutex_lock(&vmm->pasid_lock);
> > +
> > + pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> > + if (pasid == INVALID_IOASID) {
> > + ret = -ENOSPC;
> > + goto out_unlock;
> > + }
> > +
> > + ret = pasid;
> > +out_unlock:
> > + mutex_unlock(&vmm->pasid_lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> > +
> > +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
> > +{
> > + void *pdata;
> > + int ret = 0;
> > +
> > + mutex_lock(&vmm->pasid_lock);
> > + pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > + if (IS_ERR(pdata)) {
> > + ret = PTR_ERR(pdata);
> > + goto out_unlock;
> > + }
> > + ioasid_free(pasid);
> > +
> > +out_unlock:
> > + mutex_unlock(&vmm->pasid_lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> > +
> > +/**
> > * Module/class support
> > */
> > static char *vfio_devnode(struct device *dev, umode_t *mode)
> > @@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
> > idr_init(&vfio.group_idr);
> > mutex_init(&vfio.group_lock);
> > mutex_init(&vfio.iommu_drivers_lock);
> > + mutex_init(&vfio.vfio_mm_lock);
> > INIT_LIST_HEAD(&vfio.group_list);
> > INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> > + INIT_LIST_HEAD(&vfio.vfio_mm_list);
> > init_waitqueue_head(&vfio.release_q);
> >
> > ret = misc_register(&vfio_dev);
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c
> > index a177bf2..331ceee 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -70,6 +70,7 @@ struct vfio_iommu {
> > unsigned int dma_avail;
> > bool v2;
> > bool nesting;
> > + struct vfio_mm *vmm;
> > };
> >
> > struct vfio_domain {
> > @@ -2018,6 +2019,7 @@ static void vfio_iommu_type1_detach_group(void
> > *iommu_data,
> > static void *vfio_iommu_type1_open(unsigned long arg)
> > {
> > struct vfio_iommu *iommu;
> > + struct vfio_mm *vmm = NULL;
> >
> > iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > if (!iommu)
> > @@ -2043,6 +2045,10 @@ static void *vfio_iommu_type1_open(unsigned
> > long arg)
> > iommu->dma_avail = dma_entry_limit;
> > mutex_init(&iommu->lock);
> > BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > + vmm = vfio_mm_get_from_task(current);
> > + if (!vmm)
> > + pr_err("Failed to get vfio_mm track\n");
>
> I assume error should be returned when pr_err is used...

got it. I didn't do it as I don't think vfio_mm is necessary for
every iommu open. It is necessary for the nesting type iommu. I'll
make it fetch vmm when it is opening nesting type and return error
if failed.

> > + iommu->vmm = vmm;
> >
> > return iommu;
> > }
> > @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void
> > *iommu_data)
> > }
> >
> > vfio_iommu_iova_free(&iommu->iova_list);
> > + if (iommu->vmm)
> > + vfio_mm_put(iommu->vmm);
> >
> > kfree(iommu);
> > }
> > @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct
> > vfio_iommu *iommu,
> > return ret;
> > }
> >
> > +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
>
> I don't think you need prefix "vfio_iommu_type1" for every new
> function here, especially for leaf internal function as this one.

got it. thanks.

> > +{
> > + return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> > + (flags & VFIO_IOMMU_PASID_ALLOC &&
> > + flags & VFIO_IOMMU_PASID_FREE));
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > + int min,
> > + int max)
> > +{
> > + struct vfio_mm *vmm = iommu->vmm;
> > + int ret = 0;
> > +
> > + mutex_lock(&iommu->lock);
> > + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > + ret = -EFAULT;
>
> why -EFAULT?

well, it's from a prior comment as below:
vfio_mm_pasid_alloc() can return -ENOSPC though, so it'd be nice to
differentiate the errors. We could use EFAULT for the no IOMMU case
and EINVAL here?
http://lkml.iu.edu/hypermail/linux/kernel/2001.3/05964.html

>
> > + goto out_unlock;
> > + }
> > + if (vmm)
> > + ret = vfio_mm_pasid_alloc(vmm, min, max);
> > + else
> > + ret = -EINVAL;
> > +out_unlock:
> > + mutex_unlock(&iommu->lock);
> > + return ret;
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > + unsigned int pasid)
> > +{
> > + struct vfio_mm *vmm = iommu->vmm;
> > + int ret = 0;
> > +
> > + mutex_lock(&iommu->lock);
> > + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > + ret = -EFAULT;
>
> ditto
>
> > + goto out_unlock;
> > + }
> > +
> > + if (vmm)
> > + ret = vfio_mm_pasid_free(vmm, pasid);
> > + else
> > + ret = -EINVAL;
> > +out_unlock:
> > + mutex_unlock(&iommu->lock);
> > + return ret;
> > +}
> > +
> > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > unsigned int cmd, unsigned long arg)
> > {
> > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void
> > *iommu_data,
> >
> > return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > -EFAULT : 0;
> > +
> > + } else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > + struct vfio_iommu_type1_pasid_request req;
> > + unsigned long offset;
> > +
> > + minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > + flags);
> > +
> > + if (copy_from_user(&req, (void __user *)arg, minsz))
> > + return -EFAULT;
> > +
> > + if (req.argsz < minsz ||
> > + !vfio_iommu_type1_pasid_req_valid(req.flags))
> > + return -EINVAL;
> > +
> > + if (copy_from_user((void *)&req + minsz,
> > + (void __user *)arg + minsz,
> > + sizeof(req) - minsz))
> > + return -EFAULT;
>
> why copying in two steps instead of copying them together?

just want to do sanity check before copying all the data. I
can move it as one copy if it's better. :-)

> > +
> > + switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > + case VFIO_IOMMU_PASID_ALLOC:
> > + {
> > + int ret = 0, result;
> > +
> > + result = vfio_iommu_type1_pasid_alloc(iommu,
> > + req.alloc_pasid.min,
> > + req.alloc_pasid.max);
> > + if (result > 0) {
> > + offset = offsetof(
> > + struct
> > vfio_iommu_type1_pasid_request,
> > + alloc_pasid.result);
> > + ret = copy_to_user(
> > + (void __user *) (arg + offset),
> > + &result, sizeof(result));
> > + } else {
> > + pr_debug("%s: PASID alloc failed\n",
> > __func__);
> > + ret = -EFAULT;
>
> no, this branch is not for copy_to_user error. it is about pasid alloc
> failure. you should handle both.

Emmm, I just want to fail the IOCTL in such case, so the @result field
is meaningless in the user side. How about using another return value
(e.g. ENOSPC) to indicate the IOCTL failure?

> > + }
> > + return ret;
> > + }
> > + case VFIO_IOMMU_PASID_FREE:
> > + return vfio_iommu_type1_pasid_free(iommu,
> > + req.free_pasid);
> > + default:
> > + return -EINVAL;
> > + }
> > }
> >
> > return -ENOTTY;
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index e42a711..75f9f7f1 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct
> > vfio_iommu_driver_ops *ops);
> > extern void vfio_unregister_iommu_driver(
> > const struct vfio_iommu_driver_ops *ops);
> >
> > +#define VFIO_DEFAULT_PASID_QUOTA 1000
> > +struct vfio_mm_token {
> > + unsigned long long val;
> > +};
> > +
> > +struct vfio_mm {
> > + struct kref kref;
> > + struct vfio_mm_token token;
> > + int ioasid_sid;
> > + /* protect @pasid_quota field and pasid allocation/free */
> > + struct mutex pasid_lock;
> > + int pasid_quota;
> > + struct list_head vfio_next;
> > +};
> > +
> > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> > +extern void vfio_mm_put(struct vfio_mm *vmm);
> > +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> > +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> > +
> > /*
> > * External user API
> > */
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 9e843a1..298ac80 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > #define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
> > #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
> >
> > +/*
> > + * PASID (Process Address Space ID) is a PCIe concept which
> > + * has been extended to support DMA isolation in fine-grain.
> > + * With device assigned to user space (e.g. VMs), PASID alloc
> > + * and free need to be system wide. This structure defines
> > + * the info for pasid alloc/free between user space and kernel
> > + * space.
> > + *
> > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > + */
> > +struct vfio_iommu_type1_pasid_request {
> > + __u32 argsz;
> > +#define VFIO_IOMMU_PASID_ALLOC (1 << 0)
> > +#define VFIO_IOMMU_PASID_FREE (1 << 1)
> > + __u32 flags;
> > + union {
> > + struct {
> > + __u32 min;
> > + __u32 max;
> > + __u32 result;
>
> result->pasid?

yes, the pasid allocated.

>
> > + } alloc_pasid;
> > + __u32 free_pasid;
>
> what about putting a common pasid field after flags?

looks good to me. But it would make the union part only meaningful
to alloc pasid. if so, maybe make the union part as a data field and
only alloc pasid will have it. For free pasid, it is not necessary
to read it from userspace. does it look good?

Regards,
Yi Liu

> > + };
> > +};
> > +
> > +#define VFIO_PASID_REQUEST_MASK (VFIO_IOMMU_PASID_ALLOC | \
> > + VFIO_IOMMU_PASID_FREE)
> > +
> > +/**
> > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> > + * struct vfio_iommu_type1_pasid_request)
> > + *
> > + * Availability of this feature depends on PASID support in the device,
> > + * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
> > + * is available after VFIO_SET_IOMMU.
> > + *
> > + * returns: 0 on success, -errno on failure.
> > + */
> > +#define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE +
> > 22)
> > +
> > /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> >
> > /*
> > --
> > 2.7.4

2020-03-31 05:41:18

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

> From: Liu, Yi L <[email protected]>
> Sent: Monday, March 30, 2020 10:37 PM
>
> > From: Tian, Kevin <[email protected]>
> > Sent: Monday, March 30, 2020 4:32 PM
> > To: Liu, Yi L <[email protected]>; [email protected];
> > Subject: RE: [PATCH v1 1/8] vfio: Add
> VFIO_IOMMU_PASID_REQUEST(alloc/free)
> >
> > > From: Liu, Yi L <[email protected]>
> > > Sent: Sunday, March 22, 2020 8:32 PM
> > >
> > > From: Liu Yi L <[email protected]>
> > >
> > > For a long time, devices have only one DMA address space from platform
> > > IOMMU's point of view. This is true for both bare metal and directed-
> > > access in virtualization environment. Reason is the source ID of DMA in
> > > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> >
> > are->is
>
> thanks.
>
> > > DMA isolation. However, this is changing with the latest advancement in
> > > I/O technology area. More and more platform vendors are utilizing the
> PCIe
> > > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > > address spaces as identified by their individual PASIDs. For example,
> > > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > > let device access multiple process virtual address space by binding the
> >
> > "address space" -> "address spaces"
> >
> > "binding the" -> "binding each"
>
> will correct both.
>
> > > virtual address space with a PASID. Wherein the PASID is allocated in
> > > software and programmed to device per device specific manner. Devices
> > > which support PASID capability are called PASID-capable devices. If such
> > > devices are passed through to VMs, guest software are also able to bind
> > > guest process virtual address space on such devices. Therefore, the guest
> > > software could reuse the bare metal software programming model,
> which
> > > means guest software will also allocate PASID and program it to device
> > > directly. This is a dangerous situation since it has potential PASID
> > > conflicts and unauthorized address space access. It would be safer to
> > > let host intercept in the guest software's PASID allocation. Thus PASID
> > > are managed system-wide.
> > >
> > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> > > passdown
> > > PASID allocation/free request from the virtual IOMMU. Additionally, such
> >
> > "Additionally, because such"
> >
> > > requests are intended to be invoked by QEMU or other applications
> which
> >
> > simplify to "intended to be invoked from userspace"
>
> got it.
>
> > > are running in userspace, it is necessary to have a mechanism to prevent
> > > single application from abusing available PASIDs in system. With such
> > > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > > has many assigned devices, then it should have more quota. However, it
> > > is not sure how many PASIDs an assigned devices will use. e.g. it is
> >
> > devices -> device
>
> got it.
>
> > > possible that a VM with multiples assigned devices but requests less
> > > PASIDs. Therefore per-VM quota would be better.
> > >
> > > This patch uses struct mm pointer as a per-VM token. We also considered
> > > using task structure pointer and vfio_iommu structure pointer. However,
> > > task structure is per-thread, which means it cannot achieve per-VM PASID
> > > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > > only within vfio. Therefore, structure mm pointer is selected. This patch
> > > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > > container is opened by a VM. On the reverse order, vfio_mm is free when
> > > the last vfio container is released. Each VM is assigned with a PASID
> > > quota, so that it is not able to request PASID beyond its quota. This
> > > patch adds a default quota of 1000. This quota could be tuned by
> > > administrator. Making PASID quota tunable will be added in another
> patch
> > > in this series.
> > >
> > > Previous discussions:
> > > https://patchwork.kernel.org/patch/11209429/
> > >
> > > Cc: Kevin Tian <[email protected]>
> > > CC: Jacob Pan <[email protected]>
> > > Cc: Alex Williamson <[email protected]>
> > > Cc: Eric Auger <[email protected]>
> > > Cc: Jean-Philippe Brucker <[email protected]>
> > > Signed-off-by: Liu Yi L <[email protected]>
> > > Signed-off-by: Yi Sun <[email protected]>
> > > Signed-off-by: Jacob Pan <[email protected]>
> > > ---
> > > drivers/vfio/vfio.c | 130
> > > ++++++++++++++++++++++++++++++++++++++++
> > > drivers/vfio/vfio_iommu_type1.c | 104
> > > ++++++++++++++++++++++++++++++++
> > > include/linux/vfio.h | 20 +++++++
> > > include/uapi/linux/vfio.h | 41 +++++++++++++
> > > 4 files changed, 295 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > index c848262..d13b483 100644
> > > --- a/drivers/vfio/vfio.c
> > > +++ b/drivers/vfio/vfio.c
> > > @@ -32,6 +32,7 @@
> > > #include <linux/vfio.h>
> > > #include <linux/wait.h>
> > > #include <linux/sched/signal.h>
> > > +#include <linux/sched/mm.h>
> > >
> > > #define DRIVER_VERSION "0.3"
> > > #define DRIVER_AUTHOR "Alex Williamson
> > > <[email protected]>"
> > > @@ -46,6 +47,8 @@ static struct vfio {
> > > struct mutex group_lock;
> > > struct cdev group_cdev;
> > > dev_t group_devt;
> > > + struct list_head vfio_mm_list;
> > > + struct mutex vfio_mm_lock;
> > > wait_queue_head_t release_q;
> > > } vfio;
> > >
> > > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device
> *dev,
> > > enum vfio_notify_type type,
> > > EXPORT_SYMBOL(vfio_unregister_notifier);
> > >
> > > /**
> > > + * VFIO_MM objects - create, release, get, put, search
> >
> > why capitalizing vfio_mm?
>
> oops, it's not intended, will fix it.
>
> > > + * Caller of the function should have held vfio.vfio_mm_lock.
> > > + */
> > > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> > > +{
> > > + struct vfio_mm *vmm;
> > > + struct vfio_mm_token *token;
> > > + int ret = 0;
> > > +
> > > + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > > + if (!vmm)
> > > + return ERR_PTR(-ENOMEM);
> > > +
> > > + /* Per mm IOASID set used for quota control and group operations
> > > */
> > > + ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > > + VFIO_DEFAULT_PASID_QUOTA, &vmm-
> > > >ioasid_sid);
> > > + if (ret) {
> > > + kfree(vmm);
> > > + return ERR_PTR(ret);
> > > + }
> > > +
> > > + kref_init(&vmm->kref);
> > > + token = &vmm->token;
> > > + token->val = mm;
> > > + vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > > + mutex_init(&vmm->pasid_lock);
> > > +
> > > + list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> > > +
> > > + return vmm;
> > > +}
> > > +
> > > +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
> > > +{
> > > + /* destroy the ioasid set */
> > > + ioasid_free_set(vmm->ioasid_sid, true);
> >
> > do we need hold pasid lock here, since it attempts to destroy a
> > set which might be referenced by vfio_mm_pasid_free? or is
> > there guarantee that such race won't happen?
>
> Emmm, if considering the race between ioasid_free_set and
> vfio_mm_pasid_free, I guess ioasid core should sequence the
> two operations with its internal lock. right?

I looked at below code in free path:

+ mutex_lock(&vmm->pasid_lock);
+ pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
+ if (IS_ERR(pdata)) {
+ ret = PTR_ERR(pdata);
+ goto out_unlock;
+ }
+ ioasid_free(pasid);
+
+out_unlock:
+ mutex_unlock(&vmm->pasid_lock);

above implies that ioasid_find/free must be paired within the pasid_lock.
Then if we don't hold pasid_lock above, ioasid_free_set could
happen between find/free. I'm not sure whether this race would
lead to real problem, but it doesn't look correct simply by looking at
this file.

>
> > > + mutex_unlock(&vfio.vfio_mm_lock);
> > > + kfree(vmm);
> > > +}
> > > +
> > > +/* called with vfio.vfio_mm_lock held */
> > > +static void vfio_mm_release(struct kref *kref)
> > > +{
> > > + struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> > > +
> > > + list_del(&vmm->vfio_next);
> > > + vfio_mm_unlock_and_free(vmm);
> > > +}
> > > +
> > > +void vfio_mm_put(struct vfio_mm *vmm)
> > > +{
> > > + kref_put_mutex(&vmm->kref, vfio_mm_release,
> > > &vfio.vfio_mm_lock);
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> > > +
> > > +/* Assume vfio_mm_lock or vfio_mm reference is held */
> > > +static void vfio_mm_get(struct vfio_mm *vmm)
> > > +{
> > > + kref_get(&vmm->kref);
> > > +}
> > > +
> > > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> > > +{
> > > + struct mm_struct *mm = get_task_mm(task);
> > > + struct vfio_mm *vmm;
> > > + unsigned long long val = (unsigned long long) mm;
> > > +
> > > + mutex_lock(&vfio.vfio_mm_lock);
> > > + list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> > > + if (vmm->token.val == val) {
> > > + vfio_mm_get(vmm);
> > > + goto out;
> > > + }
> > > + }
> > > +
> > > + vmm = vfio_create_mm(mm);
> > > + if (IS_ERR(vmm))
> > > + vmm = NULL;
> > > +out:
> > > + mutex_unlock(&vfio.vfio_mm_lock);
> > > + mmput(mm);
> >
> > I assume this has been discussed before, but from readability p.o.v
> > it might be good to add a comment for this function to explain
> > how the recording of mm in vfio_mm can be correctly removed
> > when the mm is being destroyed, since we don't hold a reference
> > of mm here.
>
> yeah, I'll add it.
>
> > > + return vmm;
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > +
> > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > > +{
> > > + ioasid_t pasid;
> > > + int ret = -ENOSPC;
> > > +
> > > + mutex_lock(&vmm->pasid_lock);
> > > +
> > > + pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> > > + if (pasid == INVALID_IOASID) {
> > > + ret = -ENOSPC;
> > > + goto out_unlock;
> > > + }
> > > +
> > > + ret = pasid;
> > > +out_unlock:
> > > + mutex_unlock(&vmm->pasid_lock);
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> > > +
> > > +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
> > > +{
> > > + void *pdata;
> > > + int ret = 0;
> > > +
> > > + mutex_lock(&vmm->pasid_lock);
> > > + pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > > + if (IS_ERR(pdata)) {
> > > + ret = PTR_ERR(pdata);
> > > + goto out_unlock;
> > > + }
> > > + ioasid_free(pasid);
> > > +
> > > +out_unlock:
> > > + mutex_unlock(&vmm->pasid_lock);
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> > > +
> > > +/**
> > > * Module/class support
> > > */
> > > static char *vfio_devnode(struct device *dev, umode_t *mode)
> > > @@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
> > > idr_init(&vfio.group_idr);
> > > mutex_init(&vfio.group_lock);
> > > mutex_init(&vfio.iommu_drivers_lock);
> > > + mutex_init(&vfio.vfio_mm_lock);
> > > INIT_LIST_HEAD(&vfio.group_list);
> > > INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> > > + INIT_LIST_HEAD(&vfio.vfio_mm_list);
> > > init_waitqueue_head(&vfio.release_q);
> > >
> > > ret = misc_register(&vfio_dev);
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > b/drivers/vfio/vfio_iommu_type1.c
> > > index a177bf2..331ceee 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -70,6 +70,7 @@ struct vfio_iommu {
> > > unsigned int dma_avail;
> > > bool v2;
> > > bool nesting;
> > > + struct vfio_mm *vmm;
> > > };
> > >
> > > struct vfio_domain {
> > > @@ -2018,6 +2019,7 @@ static void
> vfio_iommu_type1_detach_group(void
> > > *iommu_data,
> > > static void *vfio_iommu_type1_open(unsigned long arg)
> > > {
> > > struct vfio_iommu *iommu;
> > > + struct vfio_mm *vmm = NULL;
> > >
> > > iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > > if (!iommu)
> > > @@ -2043,6 +2045,10 @@ static void
> *vfio_iommu_type1_open(unsigned
> > > long arg)
> > > iommu->dma_avail = dma_entry_limit;
> > > mutex_init(&iommu->lock);
> > > BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > > + vmm = vfio_mm_get_from_task(current);
> > > + if (!vmm)
> > > + pr_err("Failed to get vfio_mm track\n");
> >
> > I assume error should be returned when pr_err is used...
>
> got it. I didn't do it as I don’t think vfio_mm is necessary for
> every iommu open. It is necessary for the nesting type iommu. I'll
> make it fetch vmm when it is opening nesting type and return error
> if failed.

sounds good.

>
> > > + iommu->vmm = vmm;
> > >
> > > return iommu;
> > > }
> > > @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void
> > > *iommu_data)
> > > }
> > >
> > > vfio_iommu_iova_free(&iommu->iova_list);
> > > + if (iommu->vmm)
> > > + vfio_mm_put(iommu->vmm);
> > >
> > > kfree(iommu);
> > > }
> > > @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct
> > > vfio_iommu *iommu,
> > > return ret;
> > > }
> > >
> > > +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> >
> > I don't think you need prefix "vfio_iommu_type1" for every new
> > function here, especially for leaf internal function as this one.
>
> got it. thanks.
>
> > > +{
> > > + return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> > > + (flags & VFIO_IOMMU_PASID_ALLOC &&
> > > + flags & VFIO_IOMMU_PASID_FREE));
> > > +}
> > > +
> > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > + int min,
> > > + int max)
> > > +{
> > > + struct vfio_mm *vmm = iommu->vmm;
> > > + int ret = 0;
> > > +
> > > + mutex_lock(&iommu->lock);
> > > + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > + ret = -EFAULT;
> >
> > why -EFAULT?
>
> well, it's from a prior comment as below:
> vfio_mm_pasid_alloc() can return -ENOSPC though, so it'd be nice to
> differentiate the errors. We could use EFAULT for the no IOMMU case
> and EINVAL here?
> http://lkml.iu.edu/hypermail/linux/kernel/2001.3/05964.html
>
> >
> > > + goto out_unlock;
> > > + }
> > > + if (vmm)
> > > + ret = vfio_mm_pasid_alloc(vmm, min, max);
> > > + else
> > > + ret = -EINVAL;
> > > +out_unlock:
> > > + mutex_unlock(&iommu->lock);
> > > + return ret;
> > > +}
> > > +
> > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > + unsigned int pasid)
> > > +{
> > > + struct vfio_mm *vmm = iommu->vmm;
> > > + int ret = 0;
> > > +
> > > + mutex_lock(&iommu->lock);
> > > + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > + ret = -EFAULT;
> >
> > ditto
> >
> > > + goto out_unlock;
> > > + }
> > > +
> > > + if (vmm)
> > > + ret = vfio_mm_pasid_free(vmm, pasid);
> > > + else
> > > + ret = -EINVAL;
> > > +out_unlock:
> > > + mutex_unlock(&iommu->lock);
> > > + return ret;
> > > +}
> > > +
> > > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > unsigned int cmd, unsigned long arg)
> > > {
> > > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void
> > > *iommu_data,
> > >
> > > return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > > -EFAULT : 0;
> > > +
> > > + } else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > > + struct vfio_iommu_type1_pasid_request req;
> > > + unsigned long offset;
> > > +
> > > + minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > > + flags);
> > > +
> > > + if (copy_from_user(&req, (void __user *)arg, minsz))
> > > + return -EFAULT;
> > > +
> > > + if (req.argsz < minsz ||
> > > + !vfio_iommu_type1_pasid_req_valid(req.flags))
> > > + return -EINVAL;
> > > +
> > > + if (copy_from_user((void *)&req + minsz,
> > > + (void __user *)arg + minsz,
> > > + sizeof(req) - minsz))
> > > + return -EFAULT;
> >
> > why copying in two steps instead of copying them together?
>
> just want to do sanity check before copying all the data. I
> can move it as one copy if it's better. :-)

it's possible fine. I just saw you did same thing for other uapis.

>
> > > +
> > > + switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > > + case VFIO_IOMMU_PASID_ALLOC:
> > > + {
> > > + int ret = 0, result;
> > > +
> > > + result = vfio_iommu_type1_pasid_alloc(iommu,
> > > + req.alloc_pasid.min,
> > > + req.alloc_pasid.max);
> > > + if (result > 0) {
> > > + offset = offsetof(
> > > + struct
> > > vfio_iommu_type1_pasid_request,
> > > + alloc_pasid.result);
> > > + ret = copy_to_user(
> > > + (void __user *) (arg + offset),
> > > + &result, sizeof(result));
> > > + } else {
> > > + pr_debug("%s: PASID alloc failed\n",
> > > __func__);
> > > + ret = -EFAULT;
> >
> > no, this branch is not for copy_to_user error. it is about pasid alloc
> > failure. you should handle both.
>
> Emmm, I just want to fail the IOCTL in such case, so the @result field
> is meaningless in the user side. How about using another return value
> (e.g. ENOSPC) to indicate the IOCTL failure?

If pasid_alloc fails, you return its result to userspace
if copy_to_user fails, then return -EFAULT.

however, above you return -EFAULT for pasid_alloc failure, and
then the number of not-copied bytes for copy_to_user.

>
> > > + }
> > > + return ret;
> > > + }
> > > + case VFIO_IOMMU_PASID_FREE:
> > > + return vfio_iommu_type1_pasid_free(iommu,
> > > + req.free_pasid);
> > > + default:
> > > + return -EINVAL;
> > > + }
> > > }
> > >
> > > return -ENOTTY;
> > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > index e42a711..75f9f7f1 100644
> > > --- a/include/linux/vfio.h
> > > +++ b/include/linux/vfio.h
> > > @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct
> > > vfio_iommu_driver_ops *ops);
> > > extern void vfio_unregister_iommu_driver(
> > > const struct vfio_iommu_driver_ops *ops);
> > >
> > > +#define VFIO_DEFAULT_PASID_QUOTA 1000
> > > +struct vfio_mm_token {
> > > + unsigned long long val;
> > > +};
> > > +
> > > +struct vfio_mm {
> > > + struct kref kref;
> > > + struct vfio_mm_token token;
> > > + int ioasid_sid;
> > > + /* protect @pasid_quota field and pasid allocation/free */
> > > + struct mutex pasid_lock;
> > > + int pasid_quota;
> > > + struct list_head vfio_next;
> > > +};
> > > +
> > > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct
> *task);
> > > +extern void vfio_mm_put(struct vfio_mm *vmm);
> > > +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> > > +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> > > +
> > > /*
> > > * External user API
> > > */
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 9e843a1..298ac80 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > > #define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
> > > #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
> > >
> > > +/*
> > > + * PASID (Process Address Space ID) is a PCIe concept which
> > > + * has been extended to support DMA isolation in fine-grain.
> > > + * With device assigned to user space (e.g. VMs), PASID alloc
> > > + * and free need to be system wide. This structure defines
> > > + * the info for pasid alloc/free between user space and kernel
> > > + * space.
> > > + *
> > > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > > + */
> > > +struct vfio_iommu_type1_pasid_request {
> > > + __u32 argsz;
> > > +#define VFIO_IOMMU_PASID_ALLOC (1 << 0)
> > > +#define VFIO_IOMMU_PASID_FREE (1 << 1)
> > > + __u32 flags;
> > > + union {
> > > + struct {
> > > + __u32 min;
> > > + __u32 max;
> > > + __u32 result;
> >
> > result->pasid?
>
> yes, the pasid allocated.
>
> >
> > > + } alloc_pasid;
> > > + __u32 free_pasid;
> >
> > what about putting a common pasid field after flags?
>
> looks good to me. But it would make the union part only meaningful
> to alloc pasid. if so, maybe make the union part as a data field and
> only alloc pasid will have it. For free pasid, it is not necessary
> to read it from userspace. does it look good?

maybe keeping the union is also OK, just with {min, max} for alloc.
who knows whether more pasid ops will be added in the future
which may require its specific union structure. ???? putting pasid
as a common field is reasonable because the whole cmd is for
pasid.

>
> Regards,
> Yi Liu
>
> > > + };
> > > +};
> > > +
> > > +#define VFIO_PASID_REQUEST_MASK (VFIO_IOMMU_PASID_ALLOC
> | \
> > > + VFIO_IOMMU_PASID_FREE)
> > > +
> > > +/**
> > > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> > > + * struct vfio_iommu_type1_pasid_request)
> > > + *
> > > + * Availability of this feature depends on PASID support in the device,
> > > + * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
> > > + * is available after VFIO_SET_IOMMU.
> > > + *
> > > + * returns: 0 on success, -errno on failure.
> > > + */
> > > +#define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE +
> > > 22)
> > > +
> > > /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU --------
> */
> > >
> > > /*
> > > --
> > > 2.7.4

2020-03-31 07:54:37

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

Who is going to use thse exports? Please submit them together with
a driver actually using them.

2020-03-31 08:18:45

by Yi Liu

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

> From: Christoph Hellwig <[email protected]>
> Sent: Tuesday, March 31, 2020 3:54 PM
> To: Liu, Yi L <[email protected]>
> Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
>
> Who is going to use thse exports? Please submit them together with
> a driver actually using them.

Hi Hellwig,

These are exposed for SVA (Shared Virtual Addressing) usage in VMs. If
say a driver who actually using them, it is the iommu driver running in
guest. The flow is: guest iommu driver programs the virtual command interface
and it traps to host. The virtual IOMMU device model lays in QEMU will
utilize the exported ioctl to get PASIDs.
Here is iommu kernel driver patch which utilizes virtual command interface
to request pasid alloc/free.
https://lkml.org/lkml/2020/3/20/1176
And, the below patch is one which utilizes the ioctl exported in this patch:
https://patchwork.kernel.org/patch/11464601/

Regards,
Yi Liu

2020-03-31 08:33:57

by Yi Liu

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

> From: Christoph Hellwig <[email protected]>
> Sent: Tuesday, March 31, 2020 3:54 PM
> To: Liu, Yi L <[email protected]>
> Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
>
> Who is going to use thse exports? Please submit them together with
> a driver actually using them.

Hi Hellwig,

Sorry, maybe I misunderstood your point. Do you mean the exported symbol
below? They are used by the vfio_iommu_type1 driver which is a separate
driver besides the vfio.ko driver.

+EXPORT_SYMBOL_GPL(vfio_mm_put);
+EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
+EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
+EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);

Regards,
Yi Liu

2020-03-31 08:37:06

by Yi Liu

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

> From: Liu, Yi L
> Sent: Tuesday, March 31, 2020 4:33 PM
> To: 'Christoph Hellwig' <[email protected]>
> Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
>
> > From: Christoph Hellwig <[email protected]>
> > Sent: Tuesday, March 31, 2020 3:54 PM
> > To: Liu, Yi L <[email protected]>
> > Subject: Re: [PATCH v1 1/8] vfio: Add
> > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> >
> > Who is going to use thse exports? Please submit them together with a
> > driver actually using them.
the user of the symbols are already in this patch. sorry for the split answer..

Regards,
Yi Liu

2020-03-31 09:15:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

On Tue, Mar 31, 2020 at 08:36:32AM +0000, Liu, Yi L wrote:
> > From: Liu, Yi L
> > Sent: Tuesday, March 31, 2020 4:33 PM
> > To: 'Christoph Hellwig' <[email protected]>
> > Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> >
> > > From: Christoph Hellwig <[email protected]>
> > > Sent: Tuesday, March 31, 2020 3:54 PM
> > > To: Liu, Yi L <[email protected]>
> > > Subject: Re: [PATCH v1 1/8] vfio: Add
> > > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > >
> > > Who is going to use thse exports? Please submit them together with a
> > > driver actually using them.
> the user of the symbols are already in this patch. sorry for the split answer..

Thanks, sorry for the noise!

2020-03-31 13:22:42

by Yi Liu

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

> From: Tian, Kevin <[email protected]>
> Sent: Tuesday, March 31, 2020 1:41 PM
> To: Liu, Yi L <[email protected]>; [email protected];
> [email protected]
> Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
>
> > From: Liu, Yi L <[email protected]>
> > Sent: Monday, March 30, 2020 10:37 PM
> >
> > > From: Tian, Kevin <[email protected]>
> > > Sent: Monday, March 30, 2020 4:32 PM
> > > To: Liu, Yi L <[email protected]>; [email protected];
> > > Subject: RE: [PATCH v1 1/8] vfio: Add
> > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > >
> > > > From: Liu, Yi L <[email protected]>
> > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > >
> > > > From: Liu Yi L <[email protected]>
> > > >
> > > > For a long time, devices have only one DMA address space from
> > > > platform IOMMU's point of view. This is true for both bare metal
> > > > and directed- access in virtualization environment. Reason is the
> > > > source ID of DMA in PCIe are BDF (bus/dev/fnc ID), which results
> > > > in only device granularity
> > >
> > > are->is
> >
> > thanks.
> >
> > > > DMA isolation. However, this is changing with the latest
> > > > advancement in I/O technology area. More and more platform vendors
> > > > are utilizing the
> > PCIe
> > > > PASID TLP prefix in DMA requests, thus to give devices with
> > > > multiple DMA address spaces as identified by their individual
> > > > PASIDs. For example, Shared Virtual Addressing (SVA, a.k.a Shared
> > > > Virtual Memory) is able to let device access multiple process
> > > > virtual address space by binding the
> > >
> > > "address space" -> "address spaces"
> > >
> > > "binding the" -> "binding each"
> >
> > will correct both.
> >
> > > > virtual address space with a PASID. Wherein the PASID is allocated
> > > > in software and programmed to device per device specific manner.
> > > > Devices which support PASID capability are called PASID-capable
> > > > devices. If such devices are passed through to VMs, guest software
> > > > are also able to bind guest process virtual address space on such
> > > > devices. Therefore, the guest software could reuse the bare metal
> > > > software programming model,
> > which
> > > > means guest software will also allocate PASID and program it to
> > > > device directly. This is a dangerous situation since it has
> > > > potential PASID conflicts and unauthorized address space access.
> > > > It would be safer to let host intercept in the guest software's
> > > > PASID allocation. Thus PASID are managed system-wide.
> > > >
> > > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> > > > passdown PASID allocation/free request from the virtual IOMMU.
> > > > Additionally, such
> > >
> > > "Additionally, because such"
> > >
> > > > requests are intended to be invoked by QEMU or other applications
> > which
> > >
> > > simplify to "intended to be invoked from userspace"
> >
> > got it.
> >
> > > > are running in userspace, it is necessary to have a mechanism to
> > > > prevent single application from abusing available PASIDs in
> > > > system. With such consideration, this patch tracks the VFIO PASID
> > > > allocation per-VM. There was a discussion to make quota to be per
> > > > assigned devices. e.g. if a VM has many assigned devices, then it
> > > > should have more quota. However, it is not sure how many PASIDs an
> > > > assigned devices will use. e.g. it is
> > >
> > > devices -> device
> >
> > got it.
> >
> > > > possible that a VM with multiples assigned devices but requests
> > > > less PASIDs. Therefore per-VM quota would be better.
> > > >
> > > > This patch uses struct mm pointer as a per-VM token. We also
> > > > considered using task structure pointer and vfio_iommu structure
> > > > pointer. However, task structure is per-thread, which means it
> > > > cannot achieve per-VM PASID alloc tracking purpose. While for
> > > > vfio_iommu structure, it is visible only within vfio. Therefore,
> > > > structure mm pointer is selected. This patch adds a structure
> > > > vfio_mm. A vfio_mm is created when the first vfio container is
> > > > opened by a VM. On the reverse order, vfio_mm is free when the
> > > > last vfio container is released. Each VM is assigned with a PASID
> > > > quota, so that it is not able to request PASID beyond its quota.
> > > > This patch adds a default quota of 1000. This quota could be tuned
> > > > by administrator. Making PASID quota tunable will be added in
> > > > another
> > patch
> > > > in this series.
> > > >
> > > > Previous discussions:
> > > > https://patchwork.kernel.org/patch/11209429/
> > > >
> > > > Cc: Kevin Tian <[email protected]>
> > > > CC: Jacob Pan <[email protected]>
> > > > Cc: Alex Williamson <[email protected]>
> > > > Cc: Eric Auger <[email protected]>
> > > > Cc: Jean-Philippe Brucker <[email protected]>
> > > > Signed-off-by: Liu Yi L <[email protected]>
> > > > Signed-off-by: Yi Sun <[email protected]>
> > > > Signed-off-by: Jacob Pan <[email protected]>
> > > > ---
> > > > drivers/vfio/vfio.c | 130
> > > > ++++++++++++++++++++++++++++++++++++++++
> > > > drivers/vfio/vfio_iommu_type1.c | 104
> > > > ++++++++++++++++++++++++++++++++
> > > > include/linux/vfio.h | 20 +++++++
> > > > include/uapi/linux/vfio.h | 41 +++++++++++++
> > > > 4 files changed, 295 insertions(+)
> > > >
> > > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
> > > > c848262..d13b483 100644
> > > > --- a/drivers/vfio/vfio.c
> > > > +++ b/drivers/vfio/vfio.c
> > > > @@ -32,6 +32,7 @@
> > > > #include <linux/vfio.h>
> > > > #include <linux/wait.h>
> > > > #include <linux/sched/signal.h>
> > > > +#include <linux/sched/mm.h>
> > > >
> > > > #define DRIVER_VERSION "0.3"
> > > > #define DRIVER_AUTHOR "Alex Williamson
> > > > <[email protected]>"
> > > > @@ -46,6 +47,8 @@ static struct vfio {
> > > > struct mutex group_lock;
> > > > struct cdev group_cdev;
> > > > dev_t group_devt;
> > > > + struct list_head vfio_mm_list;
> > > > + struct mutex vfio_mm_lock;
> > > > wait_queue_head_t release_q;
> > > > } vfio;
> > > >
> > > > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device
> > *dev,
> > > > enum vfio_notify_type type,
> > > > EXPORT_SYMBOL(vfio_unregister_notifier);
> > > >
> > > > /**
> > > > + * VFIO_MM objects - create, release, get, put, search
> > >
> > > why capitalizing vfio_mm?
> >
> > oops, it's not intended, will fix it.
> >
> > > > + * Caller of the function should have held vfio.vfio_mm_lock.
> > > > + */
> > > > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm) {
> > > > + struct vfio_mm *vmm;
> > > > + struct vfio_mm_token *token;
> > > > + int ret = 0;
> > > > +
> > > > + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > > > + if (!vmm)
> > > > + return ERR_PTR(-ENOMEM);
> > > > +
> > > > + /* Per mm IOASID set used for quota control and group operations
> > > > */
> > > > + ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > > > + VFIO_DEFAULT_PASID_QUOTA, &vmm-
> > > > >ioasid_sid);
> > > > + if (ret) {
> > > > + kfree(vmm);
> > > > + return ERR_PTR(ret);
> > > > + }
> > > > +
> > > > + kref_init(&vmm->kref);
> > > > + token = &vmm->token;
> > > > + token->val = mm;
> > > > + vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > > > + mutex_init(&vmm->pasid_lock);
> > > > +
> > > > + list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> > > > +
> > > > + return vmm;
> > > > +}
> > > > +
> > > > +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm) {
> > > > + /* destroy the ioasid set */
> > > > + ioasid_free_set(vmm->ioasid_sid, true);
> > >
> > > do we need hold pasid lock here, since it attempts to destroy a set
> > > which might be referenced by vfio_mm_pasid_free? or is there
> > > guarantee that such race won't happen?
> >
> > Emmm, if considering the race between ioasid_free_set and
> > vfio_mm_pasid_free, I guess ioasid core should sequence the two
> > operations with its internal lock. right?
>
> I looked at below code in free path:
>
> + mutex_lock(&vmm->pasid_lock);
> + pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> + if (IS_ERR(pdata)) {
> + ret = PTR_ERR(pdata);
> + goto out_unlock;
> + }
> + ioasid_free(pasid);
> +
> +out_unlock:
> + mutex_unlock(&vmm->pasid_lock);
>
> above implies that ioasid_find/free must be paired within the pasid_lock.
> Then if we don't hold pasid_lock above, ioasid_free_set could happen between
> find/free. I'm not sure whether this race would lead to real problem, but it doesn't
> look correct simply by looking at this file.

Well, Jacob told me to remove the ioasid_find in another email as he
believes ioasid core should be able to take care of it. and also need to
be protected by lock. If so, does it look good? :-)

" [Jacob Pan] this might be better to put under ioasid code such that it
is under the ioasid lock. no one else can free the ioasid between find() and free().
e.g. ioasid_free(sid, pasid)
if sid == INVALID_IOASID_SET, then no set ownership checking.
thoughts?"

> >
> > > > + mutex_unlock(&vfio.vfio_mm_lock);
> > > > + kfree(vmm);
> > > > +}
> > > > +
> > > > +/* called with vfio.vfio_mm_lock held */ static void
> > > > +vfio_mm_release(struct kref *kref) {
> > > > + struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> > > > +
> > > > + list_del(&vmm->vfio_next);
> > > > + vfio_mm_unlock_and_free(vmm);
> > > > +}
> > > > +
> > > > +void vfio_mm_put(struct vfio_mm *vmm) {
> > > > + kref_put_mutex(&vmm->kref, vfio_mm_release,
> > > > &vfio.vfio_mm_lock);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> > > > +
> > > > +/* Assume vfio_mm_lock or vfio_mm reference is held */ static
> > > > +void vfio_mm_get(struct vfio_mm *vmm) {
> > > > + kref_get(&vmm->kref);
> > > > +}
> > > > +
> > > > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task) {
> > > > + struct mm_struct *mm = get_task_mm(task);
> > > > + struct vfio_mm *vmm;
> > > > + unsigned long long val = (unsigned long long) mm;
> > > > +
> > > > + mutex_lock(&vfio.vfio_mm_lock);
> > > > + list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> > > > + if (vmm->token.val == val) {
> > > > + vfio_mm_get(vmm);
> > > > + goto out;
> > > > + }
> > > > + }
> > > > +
> > > > + vmm = vfio_create_mm(mm);
> > > > + if (IS_ERR(vmm))
> > > > + vmm = NULL;
> > > > +out:
> > > > + mutex_unlock(&vfio.vfio_mm_lock);
> > > > + mmput(mm);
> > >
> > > I assume this has been discussed before, but from readability p.o.v
> > > it might be good to add a comment for this function to explain how
> > > the recording of mm in vfio_mm can be correctly removed when the mm
> > > is being destroyed, since we don't hold a reference of mm here.
> >
> > yeah, I'll add it.
> >
> > > > + return vmm;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > > +
> > > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max) {
> > > > + ioasid_t pasid;
> > > > + int ret = -ENOSPC;
> > > > +
> > > > + mutex_lock(&vmm->pasid_lock);
> > > > +
> > > > + pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> > > > + if (pasid == INVALID_IOASID) {
> > > > + ret = -ENOSPC;
> > > > + goto out_unlock;
> > > > + }
> > > > +
> > > > + ret = pasid;
> > > > +out_unlock:
> > > > + mutex_unlock(&vmm->pasid_lock);
> > > > + return ret;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> > > > +
> > > > +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid) {
> > > > + void *pdata;
> > > > + int ret = 0;
> > > > +
> > > > + mutex_lock(&vmm->pasid_lock);
> > > > + pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > > > + if (IS_ERR(pdata)) {
> > > > + ret = PTR_ERR(pdata);
> > > > + goto out_unlock;
> > > > + }
> > > > + ioasid_free(pasid);
> > > > +
> > > > +out_unlock:
> > > > + mutex_unlock(&vmm->pasid_lock);
> > > > + return ret;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> > > > +
> > > > +/**
> > > > * Module/class support
> > > > */
> > > > static char *vfio_devnode(struct device *dev, umode_t *mode) @@
> > > > -2151,8 +2279,10 @@ static int __init vfio_init(void)
> > > > idr_init(&vfio.group_idr);
> > > > mutex_init(&vfio.group_lock);
> > > > mutex_init(&vfio.iommu_drivers_lock);
> > > > + mutex_init(&vfio.vfio_mm_lock);
> > > > INIT_LIST_HEAD(&vfio.group_list);
> > > > INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> > > > + INIT_LIST_HEAD(&vfio.vfio_mm_list);
> > > > init_waitqueue_head(&vfio.release_q);
> > > >
> > > > ret = misc_register(&vfio_dev);
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > > b/drivers/vfio/vfio_iommu_type1.c index a177bf2..331ceee 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -70,6 +70,7 @@ struct vfio_iommu {
> > > > unsigned int dma_avail;
> > > > bool v2;
> > > > bool nesting;
> > > > + struct vfio_mm *vmm;
> > > > };
> > > >
> > > > struct vfio_domain {
> > > > @@ -2018,6 +2019,7 @@ static void
> > vfio_iommu_type1_detach_group(void
> > > > *iommu_data,
> > > > static void *vfio_iommu_type1_open(unsigned long arg) {
> > > > struct vfio_iommu *iommu;
> > > > + struct vfio_mm *vmm = NULL;
> > > >
> > > > iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > > > if (!iommu)
> > > > @@ -2043,6 +2045,10 @@ static void
> > *vfio_iommu_type1_open(unsigned
> > > > long arg)
> > > > iommu->dma_avail = dma_entry_limit;
> > > > mutex_init(&iommu->lock);
> > > > BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > > > + vmm = vfio_mm_get_from_task(current);
> > > > + if (!vmm)
> > > > + pr_err("Failed to get vfio_mm track\n");
> > >
> > > I assume error should be returned when pr_err is used...
> >
> > got it. I didn't do it as I don’t think vfio_mm is necessary for
> > every iommu open. It is necessary for the nesting type iommu. I'll
> > make it fetch vmm when it is opening nesting type and return error
> > if failed.
>
> sounds good.
>
> >
> > > > + iommu->vmm = vmm;
> > > >
> > > > return iommu;
> > > > }
> > > > @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void
> > > > *iommu_data)
> > > > }
> > > >
> > > > vfio_iommu_iova_free(&iommu->iova_list);
> > > > + if (iommu->vmm)
> > > > + vfio_mm_put(iommu->vmm);
> > > >
> > > > kfree(iommu);
> > > > }
> > > > @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct
> > > > vfio_iommu *iommu,
> > > > return ret;
> > > > }
> > > >
> > > > +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> > >
> > > I don't think you need prefix "vfio_iommu_type1" for every new
> > > function here, especially for leaf internal function as this one.
> >
> > got it. thanks.
> >
> > > > +{
> > > > + return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> > > > + (flags & VFIO_IOMMU_PASID_ALLOC &&
> > > > + flags & VFIO_IOMMU_PASID_FREE));
> > > > +}
> > > > +
> > > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > > + int min,
> > > > + int max)
> > > > +{
> > > > + struct vfio_mm *vmm = iommu->vmm;
> > > > + int ret = 0;
> > > > +
> > > > + mutex_lock(&iommu->lock);
> > > > + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > + ret = -EFAULT;
> > >
> > > why -EFAULT?
> >
> > well, it's from a prior comment as below:
> > vfio_mm_pasid_alloc() can return -ENOSPC though, so it'd be nice to
> > differentiate the errors. We could use EFAULT for the no IOMMU case
> > and EINVAL here?
> > http://lkml.iu.edu/hypermail/linux/kernel/2001.3/05964.html
> >
> > >
> > > > + goto out_unlock;
> > > > + }
> > > > + if (vmm)
> > > > + ret = vfio_mm_pasid_alloc(vmm, min, max);
> > > > + else
> > > > + ret = -EINVAL;
> > > > +out_unlock:
> > > > + mutex_unlock(&iommu->lock);
> > > > + return ret;
> > > > +}
> > > > +
> > > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > > + unsigned int pasid)
> > > > +{
> > > > + struct vfio_mm *vmm = iommu->vmm;
> > > > + int ret = 0;
> > > > +
> > > > + mutex_lock(&iommu->lock);
> > > > + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > + ret = -EFAULT;
> > >
> > > ditto
> > >
> > > > + goto out_unlock;
> > > > + }
> > > > +
> > > > + if (vmm)
> > > > + ret = vfio_mm_pasid_free(vmm, pasid);
> > > > + else
> > > > + ret = -EINVAL;
> > > > +out_unlock:
> > > > + mutex_unlock(&iommu->lock);
> > > > + return ret;
> > > > +}
> > > > +
> > > > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > > unsigned int cmd, unsigned long arg)
> > > > {
> > > > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void
> > > > *iommu_data,
> > > >
> > > > return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > > > -EFAULT : 0;
> > > > +
> > > > + } else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > > > + struct vfio_iommu_type1_pasid_request req;
> > > > + unsigned long offset;
> > > > +
> > > > + minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > > > + flags);
> > > > +
> > > > + if (copy_from_user(&req, (void __user *)arg, minsz))
> > > > + return -EFAULT;
> > > > +
> > > > + if (req.argsz < minsz ||
> > > > + !vfio_iommu_type1_pasid_req_valid(req.flags))
> > > > + return -EINVAL;
> > > > +
> > > > + if (copy_from_user((void *)&req + minsz,
> > > > + (void __user *)arg + minsz,
> > > > + sizeof(req) - minsz))
> > > > + return -EFAULT;
> > >
> > > why copying in two steps instead of copying them together?
> >
> > just want to do sanity check before copying all the data. I
> > can move it as one copy if it's better. :-)
>
> it's possible fine. I just saw you did same thing for other uapis.
>
> >
> > > > +
> > > > + switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > > > + case VFIO_IOMMU_PASID_ALLOC:
> > > > + {
> > > > + int ret = 0, result;
> > > > +
> > > > + result = vfio_iommu_type1_pasid_alloc(iommu,
> > > > + req.alloc_pasid.min,
> > > > + req.alloc_pasid.max);
> > > > + if (result > 0) {
> > > > + offset = offsetof(
> > > > + struct
> > > > vfio_iommu_type1_pasid_request,
> > > > + alloc_pasid.result);
> > > > + ret = copy_to_user(
> > > > + (void __user *) (arg + offset),
> > > > + &result, sizeof(result));
> > > > + } else {
> > > > + pr_debug("%s: PASID alloc failed\n",
> > > > __func__);
> > > > + ret = -EFAULT;
> > >
> > > no, this branch is not for copy_to_user error. it is about pasid alloc
> > > failure. you should handle both.
> >
> > Emmm, I just want to fail the IOCTL in such case, so the @result field
> > is meaningless in the user side. How about using another return value
> > (e.g. ENOSPC) to indicate the IOCTL failure?
>
> If pasid_alloc fails, you return its result to userspace
> if copy_to_user fails, then return -EFAULT.
>
> however, above you return -EFAULT for pasid_alloc failure, and
> then the number of not-copied bytes for copy_to_user.

not quite get. Let me re-paste the code. :-)

+ case VFIO_IOMMU_PASID_ALLOC:
+ {
+ int ret = 0, result;
+
+ result = vfio_iommu_type1_pasid_alloc(iommu,
+ req.alloc_pasid.min,
+ req.alloc_pasid.max);
+ if (result > 0) {
+ offset = offsetof(
+ struct vfio_iommu_type1_pasid_request,
+ alloc_pasid.result);
+ ret = copy_to_user(
+ (void __user *) (arg + offset),
+ &result, sizeof(result));
if copy_to_user failed, ret is the number of uncopied bytes and
will be returned to userspace to indicate failure. userspace will
not use the data in result field. perhaps, I should check the ret
here and return -EFAULT or another errno, instead of return the
number of un-copied bytes.
+ } else {
+ pr_debug("%s: PASID alloc failed\n", __func__);
+ ret = -EFAULT;
if vfio_iommu_type1_pasid_alloc() failed, no doubt, return -EFAULT
to userspace to indicate failure.
+ }
+ return ret;
+ }

is there still porblem here?
> >
> > > > + }
> > > > + return ret;
> > > > + }
> > > > + case VFIO_IOMMU_PASID_FREE:
> > > > + return vfio_iommu_type1_pasid_free(iommu,
> > > > + req.free_pasid);
> > > > + default:
> > > > + return -EINVAL;
> > > > + }
> > > > }
> > > >
> > > > return -ENOTTY;
> > > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > > index e42a711..75f9f7f1 100644
> > > > --- a/include/linux/vfio.h
> > > > +++ b/include/linux/vfio.h
> > > > @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct
> > > > vfio_iommu_driver_ops *ops);
> > > > extern void vfio_unregister_iommu_driver(
> > > > const struct vfio_iommu_driver_ops *ops);
> > > >
> > > > +#define VFIO_DEFAULT_PASID_QUOTA 1000
> > > > +struct vfio_mm_token {
> > > > + unsigned long long val;
> > > > +};
> > > > +
> > > > +struct vfio_mm {
> > > > + struct kref kref;
> > > > + struct vfio_mm_token token;
> > > > + int ioasid_sid;
> > > > + /* protect @pasid_quota field and pasid allocation/free */
> > > > + struct mutex pasid_lock;
> > > > + int pasid_quota;
> > > > + struct list_head vfio_next;
> > > > +};
> > > > +
> > > > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct
> > *task);
> > > > +extern void vfio_mm_put(struct vfio_mm *vmm);
> > > > +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> > > > +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> > > > +
> > > > /*
> > > > * External user API
> > > > */
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index 9e843a1..298ac80 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > > > #define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
> > > > #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
> > > >
> > > > +/*
> > > > + * PASID (Process Address Space ID) is a PCIe concept which
> > > > + * has been extended to support DMA isolation in fine-grain.
> > > > + * With device assigned to user space (e.g. VMs), PASID alloc
> > > > + * and free need to be system wide. This structure defines
> > > > + * the info for pasid alloc/free between user space and kernel
> > > > + * space.
> > > > + *
> > > > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > > > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > > > + */
> > > > +struct vfio_iommu_type1_pasid_request {
> > > > + __u32 argsz;
> > > > +#define VFIO_IOMMU_PASID_ALLOC (1 << 0)
> > > > +#define VFIO_IOMMU_PASID_FREE (1 << 1)
> > > > + __u32 flags;
> > > > + union {
> > > > + struct {
> > > > + __u32 min;
> > > > + __u32 max;
> > > > + __u32 result;
> > >
> > > result->pasid?
> >
> > yes, the pasid allocated.
> >
> > >
> > > > + } alloc_pasid;
> > > > + __u32 free_pasid;
> > >
> > > what about putting a common pasid field after flags?
> >
> > looks good to me. But it would make the union part only meaningful
> > to alloc pasid. if so, maybe make the union part as a data field and
> > only alloc pasid will have it. For free pasid, it is not necessary
> > to read it from userspace. does it look good?
>
> maybe keeping the union is also OK, just with {min, max} for alloc.
> who knows whether more pasid ops will be added in the future
> which may require its specific union structure. ?? putting pasid
> as a common field is reasonable because the whole cmd is for
> pasid.

got it.

Thanks,
Yi Liu


Attachments:
(No filename) (2.20 kB)

2020-04-01 05:45:00

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

> From: Liu, Yi L <[email protected]>
> Sent: Tuesday, March 31, 2020 9:22 PM
>
> > From: Tian, Kevin <[email protected]>
> > Sent: Tuesday, March 31, 2020 1:41 PM
> > To: Liu, Yi L <[email protected]>; [email protected];
> > [email protected]
> > Subject: RE: [PATCH v1 1/8] vfio: Add
> VFIO_IOMMU_PASID_REQUEST(alloc/free)
> >
> > > From: Liu, Yi L <[email protected]>
> > > Sent: Monday, March 30, 2020 10:37 PM
> > >
> > > > From: Tian, Kevin <[email protected]>
> > > > Sent: Monday, March 30, 2020 4:32 PM
> > > > To: Liu, Yi L <[email protected]>; [email protected];
> > > > Subject: RE: [PATCH v1 1/8] vfio: Add
> > > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > > >
> > > > > From: Liu, Yi L <[email protected]>
> > > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > > >
> > > > > From: Liu Yi L <[email protected]>
> > > > >
> > > > > For a long time, devices have only one DMA address space from
> > > > > platform IOMMU's point of view. This is true for both bare metal
> > > > > and directed- access in virtualization environment. Reason is the
> > > > > source ID of DMA in PCIe are BDF (bus/dev/fnc ID), which results
> > > > > in only device granularity
> > > >
> > > > are->is
> > >
> > > thanks.
> > >
> > > > > DMA isolation. However, this is changing with the latest
> > > > > advancement in I/O technology area. More and more platform
> vendors
> > > > > are utilizing the
> > > PCIe
> > > > > PASID TLP prefix in DMA requests, thus to give devices with
> > > > > multiple DMA address spaces as identified by their individual
> > > > > PASIDs. For example, Shared Virtual Addressing (SVA, a.k.a Shared
> > > > > Virtual Memory) is able to let device access multiple process
> > > > > virtual address space by binding the
> > > >
> > > > "address space" -> "address spaces"
> > > >
> > > > "binding the" -> "binding each"
> > >
> > > will correct both.
> > >
> > > > > virtual address space with a PASID. Wherein the PASID is allocated
> > > > > in software and programmed to device per device specific manner.
> > > > > Devices which support PASID capability are called PASID-capable
> > > > > devices. If such devices are passed through to VMs, guest software
> > > > > are also able to bind guest process virtual address space on such
> > > > > devices. Therefore, the guest software could reuse the bare metal
> > > > > software programming model,
> > > which
> > > > > means guest software will also allocate PASID and program it to
> > > > > device directly. This is a dangerous situation since it has
> > > > > potential PASID conflicts and unauthorized address space access.
> > > > > It would be safer to let host intercept in the guest software's
> > > > > PASID allocation. Thus PASID are managed system-wide.
> > > > >
> > > > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> > > > > passdown PASID allocation/free request from the virtual IOMMU.
> > > > > Additionally, such
> > > >
> > > > "Additionally, because such"
> > > >
> > > > > requests are intended to be invoked by QEMU or other applications
> > > which
> > > >
> > > > simplify to "intended to be invoked from userspace"
> > >
> > > got it.
> > >
> > > > > are running in userspace, it is necessary to have a mechanism to
> > > > > prevent single application from abusing available PASIDs in
> > > > > system. With such consideration, this patch tracks the VFIO PASID
> > > > > allocation per-VM. There was a discussion to make quota to be per
> > > > > assigned devices. e.g. if a VM has many assigned devices, then it
> > > > > should have more quota. However, it is not sure how many PASIDs an
> > > > > assigned devices will use. e.g. it is
> > > >
> > > > devices -> device
> > >
> > > got it.
> > >
> > > > > possible that a VM with multiples assigned devices but requests
> > > > > less PASIDs. Therefore per-VM quota would be better.
> > > > >
> > > > > This patch uses struct mm pointer as a per-VM token. We also
> > > > > considered using task structure pointer and vfio_iommu structure
> > > > > pointer. However, task structure is per-thread, which means it
> > > > > cannot achieve per-VM PASID alloc tracking purpose. While for
> > > > > vfio_iommu structure, it is visible only within vfio. Therefore,
> > > > > structure mm pointer is selected. This patch adds a structure
> > > > > vfio_mm. A vfio_mm is created when the first vfio container is
> > > > > opened by a VM. On the reverse order, vfio_mm is free when the
> > > > > last vfio container is released. Each VM is assigned with a PASID
> > > > > quota, so that it is not able to request PASID beyond its quota.
> > > > > This patch adds a default quota of 1000. This quota could be tuned
> > > > > by administrator. Making PASID quota tunable will be added in
> > > > > another
> > > patch
> > > > > in this series.
> > > > >
> > > > > Previous discussions:
> > > > > https://patchwork.kernel.org/patch/11209429/
> > > > >
> > > > > Cc: Kevin Tian <[email protected]>
> > > > > CC: Jacob Pan <[email protected]>
> > > > > Cc: Alex Williamson <[email protected]>
> > > > > Cc: Eric Auger <[email protected]>
> > > > > Cc: Jean-Philippe Brucker <[email protected]>
> > > > > Signed-off-by: Liu Yi L <[email protected]>
> > > > > Signed-off-by: Yi Sun <[email protected]>
> > > > > Signed-off-by: Jacob Pan <[email protected]>
> > > > > ---
> > > > > drivers/vfio/vfio.c | 130
> > > > > ++++++++++++++++++++++++++++++++++++++++
> > > > > drivers/vfio/vfio_iommu_type1.c | 104
> > > > > ++++++++++++++++++++++++++++++++
> > > > > include/linux/vfio.h | 20 +++++++
> > > > > include/uapi/linux/vfio.h | 41 +++++++++++++
> > > > > 4 files changed, 295 insertions(+)
> > > > >
> > > > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
> > > > > c848262..d13b483 100644
> > > > > --- a/drivers/vfio/vfio.c
> > > > > +++ b/drivers/vfio/vfio.c
> > > > > @@ -32,6 +32,7 @@
> > > > > #include <linux/vfio.h>
> > > > > #include <linux/wait.h>
> > > > > #include <linux/sched/signal.h>
> > > > > +#include <linux/sched/mm.h>
> > > > >
> > > > > #define DRIVER_VERSION "0.3"
> > > > > #define DRIVER_AUTHOR "Alex Williamson
> > > > > <[email protected]>"
> > > > > @@ -46,6 +47,8 @@ static struct vfio {
> > > > > struct mutex group_lock;
> > > > > struct cdev group_cdev;
> > > > > dev_t group_devt;
> > > > > + struct list_head vfio_mm_list;
> > > > > + struct mutex vfio_mm_lock;
> > > > > wait_queue_head_t release_q;
> > > > > } vfio;
> > > > >
> > > > > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device
> > > *dev,
> > > > > enum vfio_notify_type type,
> > > > > EXPORT_SYMBOL(vfio_unregister_notifier);
> > > > >
> > > > > /**
> > > > > + * VFIO_MM objects - create, release, get, put, search
> > > >
> > > > why capitalizing vfio_mm?
> > >
> > > oops, it's not intended, will fix it.
> > >
> > > > > + * Caller of the function should have held vfio.vfio_mm_lock.
> > > > > + */
> > > > > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm) {
> > > > > + struct vfio_mm *vmm;
> > > > > + struct vfio_mm_token *token;
> > > > > + int ret = 0;
> > > > > +
> > > > > + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > > > > + if (!vmm)
> > > > > + return ERR_PTR(-ENOMEM);
> > > > > +
> > > > > + /* Per mm IOASID set used for quota control and group
> operations
> > > > > */
> > > > > + ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > > > > + VFIO_DEFAULT_PASID_QUOTA, &vmm-
> > > > > >ioasid_sid);
> > > > > + if (ret) {
> > > > > + kfree(vmm);
> > > > > + return ERR_PTR(ret);
> > > > > + }
> > > > > +
> > > > > + kref_init(&vmm->kref);
> > > > > + token = &vmm->token;
> > > > > + token->val = mm;
> > > > > + vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > > > > + mutex_init(&vmm->pasid_lock);
> > > > > +
> > > > > + list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> > > > > +
> > > > > + return vmm;
> > > > > +}
> > > > > +
> > > > > +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm) {
> > > > > + /* destroy the ioasid set */
> > > > > + ioasid_free_set(vmm->ioasid_sid, true);
> > > >
> > > > do we need hold pasid lock here, since it attempts to destroy a set
> > > > which might be referenced by vfio_mm_pasid_free? or is there
> > > > guarantee that such race won't happen?
> > >
> > > Emmm, if considering the race between ioasid_free_set and
> > > vfio_mm_pasid_free, I guess ioasid core should sequence the two
> > > operations with its internal lock. right?
> >
> > I looked at below code in free path:
> >
> > + mutex_lock(&vmm->pasid_lock);
> > + pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > + if (IS_ERR(pdata)) {
> > + ret = PTR_ERR(pdata);
> > + goto out_unlock;
> > + }
> > + ioasid_free(pasid);
> > +
> > +out_unlock:
> > + mutex_unlock(&vmm->pasid_lock);
> >
> > above implies that ioasid_find/free must be paired within the pasid_lock.
> > Then if we don't hold pasid_lock above, ioasid_free_set could happen
> between
> > find/free. I'm not sure whether this race would lead to real problem, but it
> doesn't
> > look correct simply by looking at this file.
>
> Well, Jacob told me to remove the ioasid_find in another email as he
> believes ioasid core should be able to take care of it. and also need to
> be protected by lock. If so, does it look good? :-)
>
> " [Jacob Pan] this might be better to put under ioasid code such that it
> is under the ioasid lock. no one else can free the ioasid between find() and
> free().
> e.g. ioasid_free(sid, pasid)
> if sid == INVALID_IOASID_SET, then no set ownership checking.
> thoughts?"

yes, that way looks better.

>
> > >
> > > > > + mutex_unlock(&vfio.vfio_mm_lock);
> > > > > + kfree(vmm);
> > > > > +}
> > > > > +
> > > > > +/* called with vfio.vfio_mm_lock held */ static void
> > > > > +vfio_mm_release(struct kref *kref) {
> > > > > + struct vfio_mm *vmm = container_of(kref, struct vfio_mm,
> kref);
> > > > > +
> > > > > + list_del(&vmm->vfio_next);
> > > > > + vfio_mm_unlock_and_free(vmm);
> > > > > +}
> > > > > +
> > > > > +void vfio_mm_put(struct vfio_mm *vmm) {
> > > > > + kref_put_mutex(&vmm->kref, vfio_mm_release,
> > > > > &vfio.vfio_mm_lock);
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> > > > > +
> > > > > +/* Assume vfio_mm_lock or vfio_mm reference is held */ static
> > > > > +void vfio_mm_get(struct vfio_mm *vmm) {
> > > > > + kref_get(&vmm->kref);
> > > > > +}
> > > > > +
> > > > > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task) {
> > > > > + struct mm_struct *mm = get_task_mm(task);
> > > > > + struct vfio_mm *vmm;
> > > > > + unsigned long long val = (unsigned long long) mm;
> > > > > +
> > > > > + mutex_lock(&vfio.vfio_mm_lock);
> > > > > + list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> > > > > + if (vmm->token.val == val) {
> > > > > + vfio_mm_get(vmm);
> > > > > + goto out;
> > > > > + }
> > > > > + }
> > > > > +
> > > > > + vmm = vfio_create_mm(mm);
> > > > > + if (IS_ERR(vmm))
> > > > > + vmm = NULL;
> > > > > +out:
> > > > > + mutex_unlock(&vfio.vfio_mm_lock);
> > > > > + mmput(mm);
> > > >
> > > > I assume this has been discussed before, but from readability p.o.v
> > > > it might be good to add a comment for this function to explain how
> > > > the recording of mm in vfio_mm can be correctly removed when the
> mm
> > > > is being destroyed, since we don't hold a reference of mm here.
> > >
> > > yeah, I'll add it.
> > >
> > > > > + return vmm;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > > > +
> > > > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max) {
> > > > > + ioasid_t pasid;
> > > > > + int ret = -ENOSPC;
> > > > > +
> > > > > + mutex_lock(&vmm->pasid_lock);
> > > > > +
> > > > > + pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> > > > > + if (pasid == INVALID_IOASID) {
> > > > > + ret = -ENOSPC;
> > > > > + goto out_unlock;
> > > > > + }
> > > > > +
> > > > > + ret = pasid;
> > > > > +out_unlock:
> > > > > + mutex_unlock(&vmm->pasid_lock);
> > > > > + return ret;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> > > > > +
> > > > > +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid) {
> > > > > + void *pdata;
> > > > > + int ret = 0;
> > > > > +
> > > > > + mutex_lock(&vmm->pasid_lock);
> > > > > + pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > > > > + if (IS_ERR(pdata)) {
> > > > > + ret = PTR_ERR(pdata);
> > > > > + goto out_unlock;
> > > > > + }
> > > > > + ioasid_free(pasid);
> > > > > +
> > > > > +out_unlock:
> > > > > + mutex_unlock(&vmm->pasid_lock);
> > > > > + return ret;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> > > > > +
> > > > > +/**
> > > > > * Module/class support
> > > > > */
> > > > > static char *vfio_devnode(struct device *dev, umode_t *mode) @@
> > > > > -2151,8 +2279,10 @@ static int __init vfio_init(void)
> > > > > idr_init(&vfio.group_idr);
> > > > > mutex_init(&vfio.group_lock);
> > > > > mutex_init(&vfio.iommu_drivers_lock);
> > > > > + mutex_init(&vfio.vfio_mm_lock);
> > > > > INIT_LIST_HEAD(&vfio.group_list);
> > > > > INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> > > > > + INIT_LIST_HEAD(&vfio.vfio_mm_list);
> > > > > init_waitqueue_head(&vfio.release_q);
> > > > >
> > > > > ret = misc_register(&vfio_dev);
> > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > > > b/drivers/vfio/vfio_iommu_type1.c index a177bf2..331ceee 100644
> > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > @@ -70,6 +70,7 @@ struct vfio_iommu {
> > > > > unsigned int dma_avail;
> > > > > bool v2;
> > > > > bool nesting;
> > > > > + struct vfio_mm *vmm;
> > > > > };
> > > > >
> > > > > struct vfio_domain {
> > > > > @@ -2018,6 +2019,7 @@ static void
> > > vfio_iommu_type1_detach_group(void
> > > > > *iommu_data,
> > > > > static void *vfio_iommu_type1_open(unsigned long arg) {
> > > > > struct vfio_iommu *iommu;
> > > > > + struct vfio_mm *vmm = NULL;
> > > > >
> > > > > iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > > > > if (!iommu)
> > > > > @@ -2043,6 +2045,10 @@ static void
> > > *vfio_iommu_type1_open(unsigned
> > > > > long arg)
> > > > > iommu->dma_avail = dma_entry_limit;
> > > > > mutex_init(&iommu->lock);
> > > > > BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > > > > + vmm = vfio_mm_get_from_task(current);
> > > > > + if (!vmm)
> > > > > + pr_err("Failed to get vfio_mm track\n");
> > > >
> > > > I assume error should be returned when pr_err is used...
> > >
> > > got it. I didn't do it as I don’t think vfio_mm is necessary for
> > > every iommu open. It is necessary for the nesting type iommu. I'll
> > > make it fetch vmm when it is opening nesting type and return error
> > > if failed.
> >
> > sounds good.
> >
> > >
> > > > > + iommu->vmm = vmm;
> > > > >
> > > > > return iommu;
> > > > > }
> > > > > @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void
> > > > > *iommu_data)
> > > > > }
> > > > >
> > > > > vfio_iommu_iova_free(&iommu->iova_list);
> > > > > + if (iommu->vmm)
> > > > > + vfio_mm_put(iommu->vmm);
> > > > >
> > > > > kfree(iommu);
> > > > > }
> > > > > @@ -2172,6 +2180,55 @@ static int
> vfio_iommu_iova_build_caps(struct
> > > > > vfio_iommu *iommu,
> > > > > return ret;
> > > > > }
> > > > >
> > > > > +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> > > >
> > > > I don't think you need prefix "vfio_iommu_type1" for every new
> > > > function here, especially for leaf internal function as this one.
> > >
> > > got it. thanks.
> > >
> > > > > +{
> > > > > + return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> > > > > + (flags & VFIO_IOMMU_PASID_ALLOC &&
> > > > > + flags & VFIO_IOMMU_PASID_FREE));
> > > > > +}
> > > > > +
> > > > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > > > + int min,
> > > > > + int max)
> > > > > +{
> > > > > + struct vfio_mm *vmm = iommu->vmm;
> > > > > + int ret = 0;
> > > > > +
> > > > > + mutex_lock(&iommu->lock);
> > > > > + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > + ret = -EFAULT;
> > > >
> > > > why -EFAULT?
> > >
> > > well, it's from a prior comment as below:
> > > vfio_mm_pasid_alloc() can return -ENOSPC though, so it'd be nice to
> > > differentiate the errors. We could use EFAULT for the no IOMMU case
> > > and EINVAL here?
> > > http://lkml.iu.edu/hypermail/linux/kernel/2001.3/05964.html
> > >
> > > >
> > > > > + goto out_unlock;
> > > > > + }
> > > > > + if (vmm)
> > > > > + ret = vfio_mm_pasid_alloc(vmm, min, max);
> > > > > + else
> > > > > + ret = -EINVAL;
> > > > > +out_unlock:
> > > > > + mutex_unlock(&iommu->lock);
> > > > > + return ret;
> > > > > +}
> > > > > +
> > > > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > > > + unsigned int pasid)
> > > > > +{
> > > > > + struct vfio_mm *vmm = iommu->vmm;
> > > > > + int ret = 0;
> > > > > +
> > > > > + mutex_lock(&iommu->lock);
> > > > > + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > + ret = -EFAULT;
> > > >
> > > > ditto
> > > >
> > > > > + goto out_unlock;
> > > > > + }
> > > > > +
> > > > > + if (vmm)
> > > > > + ret = vfio_mm_pasid_free(vmm, pasid);
> > > > > + else
> > > > > + ret = -EINVAL;
> > > > > +out_unlock:
> > > > > + mutex_unlock(&iommu->lock);
> > > > > + return ret;
> > > > > +}
> > > > > +
> > > > > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > > > unsigned int cmd, unsigned long arg)
> > > > > {
> > > > > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void
> > > > > *iommu_data,
> > > > >
> > > > > return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > > > > -EFAULT : 0;
> > > > > +
> > > > > + } else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > > > > + struct vfio_iommu_type1_pasid_request req;
> > > > > + unsigned long offset;
> > > > > +
> > > > > + minsz = offsetofend(struct
> vfio_iommu_type1_pasid_request,
> > > > > + flags);
> > > > > +
> > > > > + if (copy_from_user(&req, (void __user *)arg, minsz))
> > > > > + return -EFAULT;
> > > > > +
> > > > > + if (req.argsz < minsz ||
> > > > > + !vfio_iommu_type1_pasid_req_valid(req.flags))
> > > > > + return -EINVAL;
> > > > > +
> > > > > + if (copy_from_user((void *)&req + minsz,
> > > > > + (void __user *)arg + minsz,
> > > > > + sizeof(req) - minsz))
> > > > > + return -EFAULT;
> > > >
> > > > why copying in two steps instead of copying them together?
> > >
> > > just want to do sanity check before copying all the data. I
> > > can move it as one copy if it's better. :-)
> >
> > it's possible fine. I just saw you did same thing for other uapis.
> >
> > >
> > > > > +
> > > > > + switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > > > > + case VFIO_IOMMU_PASID_ALLOC:
> > > > > + {
> > > > > + int ret = 0, result;
> > > > > +
> > > > > + result =
> vfio_iommu_type1_pasid_alloc(iommu,
> > > > > +
> req.alloc_pasid.min,
> > > > > +
> req.alloc_pasid.max);
> > > > > + if (result > 0) {
> > > > > + offset = offsetof(
> > > > > + struct
> > > > > vfio_iommu_type1_pasid_request,
> > > > > + alloc_pasid.result);
> > > > > + ret = copy_to_user(
> > > > > + (void __user *) (arg +
> offset),
> > > > > + &result, sizeof(result));
> > > > > + } else {
> > > > > + pr_debug("%s: PASID alloc failed\n",
> > > > > __func__);
> > > > > + ret = -EFAULT;
> > > >
> > > > no, this branch is not for copy_to_user error. it is about pasid alloc
> > > > failure. you should handle both.
> > >
> > > Emmm, I just want to fail the IOCTL in such case, so the @result field
> > > is meaningless in the user side. How about using another return value
> > > (e.g. ENOSPC) to indicate the IOCTL failure?
> >
> > If pasid_alloc fails, you return its result to userspace
> > if copy_to_user fails, then return -EFAULT.
> >
> > however, above you return -EFAULT for pasid_alloc failure, and
> > then the number of not-copied bytes for copy_to_user.
>
> not quite get. Let me re-paste the code. :-)
>
> + case VFIO_IOMMU_PASID_ALLOC:
> + {
> + int ret = 0, result;
> +
> + result = vfio_iommu_type1_pasid_alloc(iommu,
> + req.alloc_pasid.min,
> + req.alloc_pasid.max);
> + if (result > 0) {
> + offset = offsetof(
> + struct
> vfio_iommu_type1_pasid_request,
> + alloc_pasid.result);
> + ret = copy_to_user(
> + (void __user *) (arg + offset),
> + &result, sizeof(result));
> if copy_to_user failed, ret is the number of uncopied bytes and
> will be returned to userspace to indicate failure. userspace will
> not use the data in result field. perhaps, I should check the ret
> here and return -EFAULT or another errno, instead of return the
> number of un-copied bytes.

here should return -EFAULT.

> + } else {
> + pr_debug("%s: PASID alloc failed\n",
> __func__);
> + ret = -EFAULT;
> if vfio_iommu_type1_pasid_alloc() failed, no doubt, return -EFAULT
> to userspace to indicate failure.

pasid_alloc has its own error types returned. why blindly replace it
with -EFAULT?

> + }
> + return ret;
> + }
>
> is there still porblem here?
> > >
> > > > > + }
> > > > > + return ret;
> > > > > + }
> > > > > + case VFIO_IOMMU_PASID_FREE:
> > > > > + return vfio_iommu_type1_pasid_free(iommu,
> > > > > +
> req.free_pasid);
> > > > > + default:
> > > > > + return -EINVAL;
> > > > > + }
> > > > > }
> > > > >
> > > > > return -ENOTTY;
> > > > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > > > index e42a711..75f9f7f1 100644
> > > > > --- a/include/linux/vfio.h
> > > > > +++ b/include/linux/vfio.h
> > > > > @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const
> struct
> > > > > vfio_iommu_driver_ops *ops);
> > > > > extern void vfio_unregister_iommu_driver(
> > > > > const struct vfio_iommu_driver_ops *ops);
> > > > >
> > > > > +#define VFIO_DEFAULT_PASID_QUOTA 1000
> > > > > +struct vfio_mm_token {
> > > > > + unsigned long long val;
> > > > > +};
> > > > > +
> > > > > +struct vfio_mm {
> > > > > + struct kref kref;
> > > > > + struct vfio_mm_token token;
> > > > > + int ioasid_sid;
> > > > > + /* protect @pasid_quota field and pasid allocation/free */
> > > > > + struct mutex pasid_lock;
> > > > > + int pasid_quota;
> > > > > + struct list_head vfio_next;
> > > > > +};
> > > > > +
> > > > > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct
> > > *task);
> > > > > +extern void vfio_mm_put(struct vfio_mm *vmm);
> > > > > +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int
> max);
> > > > > +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t
> pasid);
> > > > > +
> > > > > /*
> > > > > * External user API
> > > > > */
> > > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > > index 9e843a1..298ac80 100644
> > > > > --- a/include/uapi/linux/vfio.h
> > > > > +++ b/include/uapi/linux/vfio.h
> > > > > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > > > > #define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
> > > > > #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
> > > > >
> > > > > +/*
> > > > > + * PASID (Process Address Space ID) is a PCIe concept which
> > > > > + * has been extended to support DMA isolation in fine-grain.
> > > > > + * With device assigned to user space (e.g. VMs), PASID alloc
> > > > > + * and free need to be system wide. This structure defines
> > > > > + * the info for pasid alloc/free between user space and kernel
> > > > > + * space.
> > > > > + *
> > > > > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > > > > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > > > > + */
> > > > > +struct vfio_iommu_type1_pasid_request {
> > > > > + __u32 argsz;
> > > > > +#define VFIO_IOMMU_PASID_ALLOC (1 << 0)
> > > > > +#define VFIO_IOMMU_PASID_FREE (1 << 1)
> > > > > + __u32 flags;
> > > > > + union {
> > > > > + struct {
> > > > > + __u32 min;
> > > > > + __u32 max;
> > > > > + __u32 result;
> > > >
> > > > result->pasid?
> > >
> > > yes, the pasid allocated.
> > >
> > > >
> > > > > + } alloc_pasid;
> > > > > + __u32 free_pasid;
> > > >
> > > > what about putting a common pasid field after flags?
> > >
> > > looks good to me. But it would make the union part only meaningful
> > > to alloc pasid. if so, maybe make the union part as a data field and
> > > only alloc pasid will have it. For free pasid, it is not necessary
> > > to read it from userspace. does it look good?
> >
> > maybe keeping the union is also OK, just with {min, max} for alloc.
> > who knows whether more pasid ops will be added in the future
> > which may require its specific union structure. ?? putting pasid
> > as a common field is reasonable because the whole cmd is for
> > pasid.
>
> got it.
>
> Thanks,
> Yi Liu

2020-04-01 05:59:22

by Yi Liu

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

> From: Tian, Kevin <[email protected]>
> Sent: Wednesday, April 1, 2020 1:43 PM
> To: Liu, Yi L <[email protected]>; [email protected];
> Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
>
> > From: Liu, Yi L <[email protected]>
> > Sent: Tuesday, March 31, 2020 9:22 PM
> >
> > > From: Tian, Kevin <[email protected]>
> > > Sent: Tuesday, March 31, 2020 1:41 PM
> > > To: Liu, Yi L <[email protected]>; [email protected];
> > > [email protected]
> > > Subject: RE: [PATCH v1 1/8] vfio: Add
> > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > >
> > > > From: Liu, Yi L <[email protected]>
> > > > Sent: Monday, March 30, 2020 10:37 PM
> > > >
> > > > > From: Tian, Kevin <[email protected]>
> > > > > Sent: Monday, March 30, 2020 4:32 PM
> > > > > To: Liu, Yi L <[email protected]>; [email protected];
> > > > > Subject: RE: [PATCH v1 1/8] vfio: Add
> > > > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > > > >
> > > > > > From: Liu, Yi L <[email protected]>
> > > > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > > > >
> > > > > > From: Liu Yi L <[email protected]>
> > > > > >
> > > > > > For a long time, devices have only one DMA address space from
> > > > > > platform IOMMU's point of view. This is true for both bare
> > > > > > metal and directed- access in virtualization environment.
> > > > > > Reason is the source ID of DMA in PCIe are BDF (bus/dev/fnc
> > > > > > ID), which results in only device granularity
[...]
> > >
> > > >
> > > > > > +
> > > > > > + switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > > > > > + case VFIO_IOMMU_PASID_ALLOC:
> > > > > > + {
> > > > > > + int ret = 0, result;
> > > > > > +
> > > > > > + result =
> > vfio_iommu_type1_pasid_alloc(iommu,
> > > > > > +
> > req.alloc_pasid.min,
> > > > > > +
> > req.alloc_pasid.max);
> > > > > > + if (result > 0) {
> > > > > > + offset = offsetof(
> > > > > > + struct
> > > > > > vfio_iommu_type1_pasid_request,
> > > > > > + alloc_pasid.result);
> > > > > > + ret = copy_to_user(
> > > > > > + (void __user *) (arg +
> > offset),
> > > > > > + &result, sizeof(result));
> > > > > > + } else {
> > > > > > + pr_debug("%s: PASID alloc failed\n",
> > > > > > __func__);
> > > > > > + ret = -EFAULT;
> > > > >
> > > > > no, this branch is not for copy_to_user error. it is about pasid
> > > > > alloc failure. you should handle both.
> > > >
> > > > Emmm, I just want to fail the IOCTL in such case, so the @result
> > > > field is meaningless in the user side. How about using another
> > > > return value (e.g. ENOSPC) to indicate the IOCTL failure?
> > >
> > > If pasid_alloc fails, you return its result to userspace if
> > > copy_to_user fails, then return -EFAULT.
> > >
> > > however, above you return -EFAULT for pasid_alloc failure, and then
> > > the number of not-copied bytes for copy_to_user.
> >
> > not quite get. Let me re-paste the code. :-)
> >
> > + case VFIO_IOMMU_PASID_ALLOC:
> > + {
> > + int ret = 0, result;
> > +
> > + result = vfio_iommu_type1_pasid_alloc(iommu,
> > + req.alloc_pasid.min,
> > + req.alloc_pasid.max);
> > + if (result > 0) {
> > + offset = offsetof(
> > + struct
> > vfio_iommu_type1_pasid_request,
> > + alloc_pasid.result);
> > + ret = copy_to_user(
> > + (void __user *) (arg + offset),
> > + &result, sizeof(result));
> > if copy_to_user failed, ret is the number of uncopied bytes and will
> > be returned to userspace to indicate failure. userspace will not use
> > the data in result field. perhaps, I should check the ret here and
> > return -EFAULT or another errno, instead of return the number of
> > un-copied bytes.
>
> here should return -EFAULT.

got it. so if any failure, the return value of this ioctl is a -ERROR_VAL.

>
> > + } else {
> > + pr_debug("%s: PASID alloc failed\n",
> > __func__);
> > + ret = -EFAULT;
> > if vfio_iommu_type1_pasid_alloc() failed, no doubt, return -EFAULT to
> > userspace to indicate failure.
>
> pasid_alloc has its own error types returned. why blindly replace it with -EFAULT?

right, should use its own error types.

Thanks,
Yi Liu

2020-04-02 13:54:02

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

Hi Yi,

On Sun, Mar 22, 2020 at 05:31:58AM -0700, Liu, Yi L wrote:
> From: Liu Yi L <[email protected]>
>
> For a long time, devices have only one DMA address space from platform
> IOMMU's point of view. This is true for both bare metal and directed-
> access in virtualization environment. Reason is the source ID of DMA in
> PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> DMA isolation. However, this is changing with the latest advancement in
> I/O technology area. More and more platform vendors are utilizing the PCIe
> PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> address spaces as identified by their individual PASIDs. For example,
> Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> let device access multiple process virtual address space by binding the
> virtual address space with a PASID. Wherein the PASID is allocated in
> software and programmed to device per device specific manner. Devices
> which support PASID capability are called PASID-capable devices. If such
> devices are passed through to VMs, guest software are also able to bind
> guest process virtual address space on such devices. Therefore, the guest
> software could reuse the bare metal software programming model, which
> means guest software will also allocate PASID and program it to device
> directly. This is a dangerous situation since it has potential PASID
> conflicts and unauthorized address space access.

It's worth noting that this applies to Intel VT-d with scalable mode, not
IOMMUs that use one PASID space per VM

> It would be safer to
> let host intercept in the guest software's PASID allocation. Thus PASID
> are managed system-wide.
>
> This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to passdown
> PASID allocation/free request from the virtual IOMMU. Additionally, such
> requests are intended to be invoked by QEMU or other applications which
> are running in userspace, it is necessary to have a mechanism to prevent
> single application from abusing available PASIDs in system. With such
> consideration, this patch tracks the VFIO PASID allocation per-VM. There
> was a discussion to make quota to be per assigned devices. e.g. if a VM
> has many assigned devices, then it should have more quota. However, it
> is not sure how many PASIDs an assigned devices will use. e.g. it is
> possible that a VM with multiples assigned devices but requests less
> PASIDs. Therefore per-VM quota would be better.
>
> This patch uses struct mm pointer as a per-VM token. We also considered
> using task structure pointer and vfio_iommu structure pointer. However,
> task structure is per-thread, which means it cannot achieve per-VM PASID
> alloc tracking purpose. While for vfio_iommu structure, it is visible
> only within vfio. Therefore, structure mm pointer is selected. This patch
> adds a structure vfio_mm. A vfio_mm is created when the first vfio
> container is opened by a VM. On the reverse order, vfio_mm is free when
> the last vfio container is released. Each VM is assigned with a PASID
> quota, so that it is not able to request PASID beyond its quota. This
> patch adds a default quota of 1000. This quota could be tuned by
> administrator. Making PASID quota tunable will be added in another patch
> in this series.
>
> Previous discussions:
> https://patchwork.kernel.org/patch/11209429/
>
> Cc: Kevin Tian <[email protected]>
> CC: Jacob Pan <[email protected]>
> Cc: Alex Williamson <[email protected]>
> Cc: Eric Auger <[email protected]>
> Cc: Jean-Philippe Brucker <[email protected]>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Yi Sun <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/vfio/vfio.c | 130 ++++++++++++++++++++++++++++++++++++++++
> drivers/vfio/vfio_iommu_type1.c | 104 ++++++++++++++++++++++++++++++++
> include/linux/vfio.h | 20 +++++++
> include/uapi/linux/vfio.h | 41 +++++++++++++
> 4 files changed, 295 insertions(+)
>
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index c848262..d13b483 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -32,6 +32,7 @@
> #include <linux/vfio.h>
> #include <linux/wait.h>
> #include <linux/sched/signal.h>
> +#include <linux/sched/mm.h>
>
> #define DRIVER_VERSION "0.3"
> #define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
> @@ -46,6 +47,8 @@ static struct vfio {
> struct mutex group_lock;
> struct cdev group_cdev;
> dev_t group_devt;
> + struct list_head vfio_mm_list;
> + struct mutex vfio_mm_lock;
> wait_queue_head_t release_q;
> } vfio;
>
> @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev, enum vfio_notify_type type,
> EXPORT_SYMBOL(vfio_unregister_notifier);
>
> /**
> + * VFIO_MM objects - create, release, get, put, search
> + * Caller of the function should have held vfio.vfio_mm_lock.
> + */
> +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> +{
> + struct vfio_mm *vmm;
> + struct vfio_mm_token *token;
> + int ret = 0;
> +
> + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> + if (!vmm)
> + return ERR_PTR(-ENOMEM);
> +
> + /* Per mm IOASID set used for quota control and group operations */
> + ret = ioasid_alloc_set((struct ioasid_set *) mm,

Hmm, either we need to change the token of ioasid_alloc_set() to "void *",
or pass an actual ioasid_set struct, but this cast doesn't look good :)

As I commented on the IOASID series, I think we could embed a struct
ioasid_set into vfio_mm, pass that struct to all other ioasid_* functions,
and get rid of ioasid_sid.

> + VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
> + if (ret) {
> + kfree(vmm);
> + return ERR_PTR(ret);
> + }
> +
> + kref_init(&vmm->kref);
> + token = &vmm->token;
> + token->val = mm;

Why the intermediate token struct? Could we just store the mm_struct
pointer within vfio_mm?

Thanks,
Jean

> + vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> + mutex_init(&vmm->pasid_lock);
> +
> + list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> +
> + return vmm;
> +}

2020-04-02 18:20:30

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

On Sun, 22 Mar 2020 05:31:58 -0700
"Liu, Yi L" <[email protected]> wrote:

> From: Liu Yi L <[email protected]>
>
> For a long time, devices have only one DMA address space from platform
> IOMMU's point of view. This is true for both bare metal and directed-
> access in virtualization environment. Reason is the source ID of DMA in
> PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> DMA isolation. However, this is changing with the latest advancement in
> I/O technology area. More and more platform vendors are utilizing the PCIe
> PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> address spaces as identified by their individual PASIDs. For example,
> Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> let device access multiple process virtual address space by binding the
> virtual address space with a PASID. Wherein the PASID is allocated in
> software and programmed to device per device specific manner. Devices
> which support PASID capability are called PASID-capable devices. If such
> devices are passed through to VMs, guest software are also able to bind
> guest process virtual address space on such devices. Therefore, the guest
> software could reuse the bare metal software programming model, which
> means guest software will also allocate PASID and program it to device
> directly. This is a dangerous situation since it has potential PASID
> conflicts and unauthorized address space access. It would be safer to
> let host intercept in the guest software's PASID allocation. Thus PASID
> are managed system-wide.

Providing an allocation interface only allows for collaborative usage
of PASIDs though. Do we have any ability to enforce PASID usage or can
a user spoof other PASIDs on the same BDF?

> This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to passdown
> PASID allocation/free request from the virtual IOMMU. Additionally, such
> requests are intended to be invoked by QEMU or other applications which
> are running in userspace, it is necessary to have a mechanism to prevent
> single application from abusing available PASIDs in system. With such
> consideration, this patch tracks the VFIO PASID allocation per-VM. There
> was a discussion to make quota to be per assigned devices. e.g. if a VM
> has many assigned devices, then it should have more quota. However, it
> is not sure how many PASIDs an assigned devices will use. e.g. it is
> possible that a VM with multiples assigned devices but requests less
> PASIDs. Therefore per-VM quota would be better.
>
> This patch uses struct mm pointer as a per-VM token. We also considered
> using task structure pointer and vfio_iommu structure pointer. However,
> task structure is per-thread, which means it cannot achieve per-VM PASID
> alloc tracking purpose. While for vfio_iommu structure, it is visible
> only within vfio. Therefore, structure mm pointer is selected. This patch
> adds a structure vfio_mm. A vfio_mm is created when the first vfio
> container is opened by a VM. On the reverse order, vfio_mm is free when
> the last vfio container is released. Each VM is assigned with a PASID
> quota, so that it is not able to request PASID beyond its quota. This
> patch adds a default quota of 1000. This quota could be tuned by
> administrator. Making PASID quota tunable will be added in another patch
> in this series.
>
> Previous discussions:
> https://patchwork.kernel.org/patch/11209429/
>
> Cc: Kevin Tian <[email protected]>
> CC: Jacob Pan <[email protected]>
> Cc: Alex Williamson <[email protected]>
> Cc: Eric Auger <[email protected]>
> Cc: Jean-Philippe Brucker <[email protected]>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Yi Sun <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/vfio/vfio.c | 130 ++++++++++++++++++++++++++++++++++++++++
> drivers/vfio/vfio_iommu_type1.c | 104 ++++++++++++++++++++++++++++++++
> include/linux/vfio.h | 20 +++++++
> include/uapi/linux/vfio.h | 41 +++++++++++++
> 4 files changed, 295 insertions(+)
>
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index c848262..d13b483 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -32,6 +32,7 @@
> #include <linux/vfio.h>
> #include <linux/wait.h>
> #include <linux/sched/signal.h>
> +#include <linux/sched/mm.h>
>
> #define DRIVER_VERSION "0.3"
> #define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
> @@ -46,6 +47,8 @@ static struct vfio {
> struct mutex group_lock;
> struct cdev group_cdev;
> dev_t group_devt;
> + struct list_head vfio_mm_list;
> + struct mutex vfio_mm_lock;
> wait_queue_head_t release_q;
> } vfio;
>
> @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev, enum vfio_notify_type type,
> EXPORT_SYMBOL(vfio_unregister_notifier);
>
> /**
> + * VFIO_MM objects - create, release, get, put, search
> + * Caller of the function should have held vfio.vfio_mm_lock.
> + */
> +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> +{
> + struct vfio_mm *vmm;
> + struct vfio_mm_token *token;
> + int ret = 0;
> +
> + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> + if (!vmm)
> + return ERR_PTR(-ENOMEM);
> +
> + /* Per mm IOASID set used for quota control and group operations */
> + ret = ioasid_alloc_set((struct ioasid_set *) mm,
> + VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
> + if (ret) {
> + kfree(vmm);
> + return ERR_PTR(ret);
> + }
> +
> + kref_init(&vmm->kref);
> + token = &vmm->token;
> + token->val = mm;
> + vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> + mutex_init(&vmm->pasid_lock);
> +
> + list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> +
> + return vmm;
> +}
> +
> +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
> +{
> + /* destroy the ioasid set */
> + ioasid_free_set(vmm->ioasid_sid, true);
> + mutex_unlock(&vfio.vfio_mm_lock);
> + kfree(vmm);
> +}
> +
> +/* called with vfio.vfio_mm_lock held */
> +static void vfio_mm_release(struct kref *kref)
> +{
> + struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> +
> + list_del(&vmm->vfio_next);
> + vfio_mm_unlock_and_free(vmm);
> +}
> +
> +void vfio_mm_put(struct vfio_mm *vmm)
> +{
> + kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio.vfio_mm_lock);
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_put);
> +
> +/* Assume vfio_mm_lock or vfio_mm reference is held */
> +static void vfio_mm_get(struct vfio_mm *vmm)
> +{
> + kref_get(&vmm->kref);
> +}
> +
> +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> +{
> + struct mm_struct *mm = get_task_mm(task);
> + struct vfio_mm *vmm;
> + unsigned long long val = (unsigned long long) mm;
> +
> + mutex_lock(&vfio.vfio_mm_lock);
> + list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> + if (vmm->token.val == val) {
> + vfio_mm_get(vmm);
> + goto out;
> + }
> + }
> +
> + vmm = vfio_create_mm(mm);
> + if (IS_ERR(vmm))
> + vmm = NULL;
> +out:
> + mutex_unlock(&vfio.vfio_mm_lock);
> + mmput(mm);
> + return vmm;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> +
> +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> +{
> + ioasid_t pasid;
> + int ret = -ENOSPC;
> +
> + mutex_lock(&vmm->pasid_lock);
> +
> + pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> + if (pasid == INVALID_IOASID) {
> + ret = -ENOSPC;
> + goto out_unlock;
> + }
> +
> + ret = pasid;
> +out_unlock:
> + mutex_unlock(&vmm->pasid_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> +
> +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
> +{
> + void *pdata;
> + int ret = 0;
> +
> + mutex_lock(&vmm->pasid_lock);
> + pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> + if (IS_ERR(pdata)) {
> + ret = PTR_ERR(pdata);
> + goto out_unlock;
> + }
> + ioasid_free(pasid);
> +
> +out_unlock:
> + mutex_unlock(&vmm->pasid_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> +
> +/**
> * Module/class support
> */
> static char *vfio_devnode(struct device *dev, umode_t *mode)
> @@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
> idr_init(&vfio.group_idr);
> mutex_init(&vfio.group_lock);
> mutex_init(&vfio.iommu_drivers_lock);
> + mutex_init(&vfio.vfio_mm_lock);
> INIT_LIST_HEAD(&vfio.group_list);
> INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> + INIT_LIST_HEAD(&vfio.vfio_mm_list);
> init_waitqueue_head(&vfio.release_q);
>
> ret = misc_register(&vfio_dev);

Is vfio.c the right place for any of the above? It seems like it could
all be in a separate vfio_pasid module, similar to our virqfd module.

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index a177bf2..331ceee 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -70,6 +70,7 @@ struct vfio_iommu {
> unsigned int dma_avail;
> bool v2;
> bool nesting;
> + struct vfio_mm *vmm;
> };
>
> struct vfio_domain {
> @@ -2018,6 +2019,7 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> static void *vfio_iommu_type1_open(unsigned long arg)
> {
> struct vfio_iommu *iommu;
> + struct vfio_mm *vmm = NULL;
>
> iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> if (!iommu)
> @@ -2043,6 +2045,10 @@ static void *vfio_iommu_type1_open(unsigned long arg)
> iommu->dma_avail = dma_entry_limit;
> mutex_init(&iommu->lock);
> BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> + vmm = vfio_mm_get_from_task(current);
> + if (!vmm)
> + pr_err("Failed to get vfio_mm track\n");

Doesn't this presume everyone is instantly running PASID capable hosts?
Looks like a noisy support regression to me.

> + iommu->vmm = vmm;
>
> return iommu;
> }
> @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void *iommu_data)
> }
>
> vfio_iommu_iova_free(&iommu->iova_list);
> + if (iommu->vmm)
> + vfio_mm_put(iommu->vmm);
>
> kfree(iommu);
> }
> @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
> return ret;
> }
>
> +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> +{
> + return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> + (flags & VFIO_IOMMU_PASID_ALLOC &&
> + flags & VFIO_IOMMU_PASID_FREE));
> +}
> +
> +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> + int min,
> + int max)
> +{
> + struct vfio_mm *vmm = iommu->vmm;
> + int ret = 0;
> +
> + mutex_lock(&iommu->lock);
> + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> + ret = -EFAULT;
> + goto out_unlock;
> + }

Non-iommu backed mdevs are excluded from this? Is this a matter of
wiring the call out through the mdev parent device, or is this just
possible?

> + if (vmm)
> + ret = vfio_mm_pasid_alloc(vmm, min, max);
> + else
> + ret = -EINVAL;
> +out_unlock:
> + mutex_unlock(&iommu->lock);
> + return ret;
> +}
> +
> +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> + unsigned int pasid)
> +{
> + struct vfio_mm *vmm = iommu->vmm;
> + int ret = 0;
> +
> + mutex_lock(&iommu->lock);
> + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> + ret = -EFAULT;
> + goto out_unlock;
> + }

So if a container had an iommu backed device when the pasid was
allocated, but it was removed, now they can't free it? Why do we need
the check above?

> +
> + if (vmm)
> + ret = vfio_mm_pasid_free(vmm, pasid);
> + else
> + ret = -EINVAL;
> +out_unlock:
> + mutex_unlock(&iommu->lock);
> + return ret;
> +}
> +
> static long vfio_iommu_type1_ioctl(void *iommu_data,
> unsigned int cmd, unsigned long arg)
> {
> @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>
> return copy_to_user((void __user *)arg, &unmap, minsz) ?
> -EFAULT : 0;
> +
> + } else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> + struct vfio_iommu_type1_pasid_request req;
> + unsigned long offset;
> +
> + minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> + flags);
> +
> + if (copy_from_user(&req, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (req.argsz < minsz ||
> + !vfio_iommu_type1_pasid_req_valid(req.flags))
> + return -EINVAL;
> +
> + if (copy_from_user((void *)&req + minsz,
> + (void __user *)arg + minsz,
> + sizeof(req) - minsz))
> + return -EFAULT;

Huh? Why do we have argsz if we're going to assume this is here?

> +
> + switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> + case VFIO_IOMMU_PASID_ALLOC:
> + {
> + int ret = 0, result;
> +
> + result = vfio_iommu_type1_pasid_alloc(iommu,
> + req.alloc_pasid.min,
> + req.alloc_pasid.max);
> + if (result > 0) {
> + offset = offsetof(
> + struct vfio_iommu_type1_pasid_request,
> + alloc_pasid.result);
> + ret = copy_to_user(
> + (void __user *) (arg + offset),
> + &result, sizeof(result));

Again assuming argsz supports this.

> + } else {
> + pr_debug("%s: PASID alloc failed\n", __func__);

rate limit?

> + ret = -EFAULT;
> + }
> + return ret;
> + }
> + case VFIO_IOMMU_PASID_FREE:
> + return vfio_iommu_type1_pasid_free(iommu,
> + req.free_pasid);
> + default:
> + return -EINVAL;
> + }
> }
>
> return -ENOTTY;
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index e42a711..75f9f7f1 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
> extern void vfio_unregister_iommu_driver(
> const struct vfio_iommu_driver_ops *ops);
>
> +#define VFIO_DEFAULT_PASID_QUOTA 1000
> +struct vfio_mm_token {
> + unsigned long long val;
> +};
> +
> +struct vfio_mm {
> + struct kref kref;
> + struct vfio_mm_token token;
> + int ioasid_sid;
> + /* protect @pasid_quota field and pasid allocation/free */
> + struct mutex pasid_lock;
> + int pasid_quota;
> + struct list_head vfio_next;
> +};
> +
> +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> +extern void vfio_mm_put(struct vfio_mm *vmm);
> +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> +
> /*
> * External user API
> */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a1..298ac80 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> #define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
> #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
>
> +/*
> + * PASID (Process Address Space ID) is a PCIe concept which
> + * has been extended to support DMA isolation in fine-grain.
> + * With device assigned to user space (e.g. VMs), PASID alloc
> + * and free need to be system wide. This structure defines
> + * the info for pasid alloc/free between user space and kernel
> + * space.
> + *
> + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> + */
> +struct vfio_iommu_type1_pasid_request {
> + __u32 argsz;
> +#define VFIO_IOMMU_PASID_ALLOC (1 << 0)
> +#define VFIO_IOMMU_PASID_FREE (1 << 1)
> + __u32 flags;
> + union {
> + struct {
> + __u32 min;
> + __u32 max;
> + __u32 result;
> + } alloc_pasid;
> + __u32 free_pasid;
> + };

We seem to be using __u8 data[] lately where the struct at data is
defined by the flags. should we do that here?

> +};
> +
> +#define VFIO_PASID_REQUEST_MASK (VFIO_IOMMU_PASID_ALLOC | \
> + VFIO_IOMMU_PASID_FREE)
> +
> +/**
> + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> + * struct vfio_iommu_type1_pasid_request)
> + *
> + * Availability of this feature depends on PASID support in the device,
> + * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
> + * is available after VFIO_SET_IOMMU.
> + *
> + * returns: 0 on success, -errno on failure.
> + */
> +#define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE + 22)

So a user needs to try to allocate a PASID in order to test for the
support? Should we have a PROBE flag?

> +
> /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>
> /*

2020-04-03 06:00:56

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

> From: Alex Williamson <[email protected]>
> Sent: Friday, April 3, 2020 1:50 AM
>
> On Sun, 22 Mar 2020 05:31:58 -0700
> "Liu, Yi L" <[email protected]> wrote:
>
> > From: Liu Yi L <[email protected]>
> >
> > For a long time, devices have only one DMA address space from platform
> > IOMMU's point of view. This is true for both bare metal and directed-
> > access in virtualization environment. Reason is the source ID of DMA in
> > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > DMA isolation. However, this is changing with the latest advancement in
> > I/O technology area. More and more platform vendors are utilizing the
> PCIe
> > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > address spaces as identified by their individual PASIDs. For example,
> > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > let device access multiple process virtual address space by binding the
> > virtual address space with a PASID. Wherein the PASID is allocated in
> > software and programmed to device per device specific manner. Devices
> > which support PASID capability are called PASID-capable devices. If such
> > devices are passed through to VMs, guest software are also able to bind
> > guest process virtual address space on such devices. Therefore, the guest
> > software could reuse the bare metal software programming model, which
> > means guest software will also allocate PASID and program it to device
> > directly. This is a dangerous situation since it has potential PASID
> > conflicts and unauthorized address space access. It would be safer to
> > let host intercept in the guest software's PASID allocation. Thus PASID
> > are managed system-wide.
>
> Providing an allocation interface only allows for collaborative usage
> of PASIDs though. Do we have any ability to enforce PASID usage or can
> a user spoof other PASIDs on the same BDF?

An user can access only PASIDs allocated to itself, i.e. the specific IOASID
set tied to its mm_struct.

Thanks
Kevin

>
> > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> passdown
> > PASID allocation/free request from the virtual IOMMU. Additionally, such
> > requests are intended to be invoked by QEMU or other applications which
> > are running in userspace, it is necessary to have a mechanism to prevent
> > single application from abusing available PASIDs in system. With such
> > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > has many assigned devices, then it should have more quota. However, it
> > is not sure how many PASIDs an assigned devices will use. e.g. it is
> > possible that a VM with multiples assigned devices but requests less
> > PASIDs. Therefore per-VM quota would be better.
> >
> > This patch uses struct mm pointer as a per-VM token. We also considered
> > using task structure pointer and vfio_iommu structure pointer. However,
> > task structure is per-thread, which means it cannot achieve per-VM PASID
> > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > only within vfio. Therefore, structure mm pointer is selected. This patch
> > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > container is opened by a VM. On the reverse order, vfio_mm is free when
> > the last vfio container is released. Each VM is assigned with a PASID
> > quota, so that it is not able to request PASID beyond its quota. This
> > patch adds a default quota of 1000. This quota could be tuned by
> > administrator. Making PASID quota tunable will be added in another patch
> > in this series.
> >
> > Previous discussions:
> > https://patchwork.kernel.org/patch/11209429/
> >
> > Cc: Kevin Tian <[email protected]>
> > CC: Jacob Pan <[email protected]>
> > Cc: Alex Williamson <[email protected]>
> > Cc: Eric Auger <[email protected]>
> > Cc: Jean-Philippe Brucker <[email protected]>
> > Signed-off-by: Liu Yi L <[email protected]>
> > Signed-off-by: Yi Sun <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > drivers/vfio/vfio.c | 130
> ++++++++++++++++++++++++++++++++++++++++
> > drivers/vfio/vfio_iommu_type1.c | 104
> ++++++++++++++++++++++++++++++++
> > include/linux/vfio.h | 20 +++++++
> > include/uapi/linux/vfio.h | 41 +++++++++++++
> > 4 files changed, 295 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index c848262..d13b483 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -32,6 +32,7 @@
> > #include <linux/vfio.h>
> > #include <linux/wait.h>
> > #include <linux/sched/signal.h>
> > +#include <linux/sched/mm.h>
> >
> > #define DRIVER_VERSION "0.3"
> > #define DRIVER_AUTHOR "Alex Williamson
> <[email protected]>"
> > @@ -46,6 +47,8 @@ static struct vfio {
> > struct mutex group_lock;
> > struct cdev group_cdev;
> > dev_t group_devt;
> > + struct list_head vfio_mm_list;
> > + struct mutex vfio_mm_lock;
> > wait_queue_head_t release_q;
> > } vfio;
> >
> > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev,
> enum vfio_notify_type type,
> > EXPORT_SYMBOL(vfio_unregister_notifier);
> >
> > /**
> > + * VFIO_MM objects - create, release, get, put, search
> > + * Caller of the function should have held vfio.vfio_mm_lock.
> > + */
> > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> > +{
> > + struct vfio_mm *vmm;
> > + struct vfio_mm_token *token;
> > + int ret = 0;
> > +
> > + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > + if (!vmm)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + /* Per mm IOASID set used for quota control and group operations
> */
> > + ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > + VFIO_DEFAULT_PASID_QUOTA, &vmm-
> >ioasid_sid);
> > + if (ret) {
> > + kfree(vmm);
> > + return ERR_PTR(ret);
> > + }
> > +
> > + kref_init(&vmm->kref);
> > + token = &vmm->token;
> > + token->val = mm;
> > + vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > + mutex_init(&vmm->pasid_lock);
> > +
> > + list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> > +
> > + return vmm;
> > +}
> > +
> > +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
> > +{
> > + /* destroy the ioasid set */
> > + ioasid_free_set(vmm->ioasid_sid, true);
> > + mutex_unlock(&vfio.vfio_mm_lock);
> > + kfree(vmm);
> > +}
> > +
> > +/* called with vfio.vfio_mm_lock held */
> > +static void vfio_mm_release(struct kref *kref)
> > +{
> > + struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> > +
> > + list_del(&vmm->vfio_next);
> > + vfio_mm_unlock_and_free(vmm);
> > +}
> > +
> > +void vfio_mm_put(struct vfio_mm *vmm)
> > +{
> > + kref_put_mutex(&vmm->kref, vfio_mm_release,
> &vfio.vfio_mm_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> > +
> > +/* Assume vfio_mm_lock or vfio_mm reference is held */
> > +static void vfio_mm_get(struct vfio_mm *vmm)
> > +{
> > + kref_get(&vmm->kref);
> > +}
> > +
> > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> > +{
> > + struct mm_struct *mm = get_task_mm(task);
> > + struct vfio_mm *vmm;
> > + unsigned long long val = (unsigned long long) mm;
> > +
> > + mutex_lock(&vfio.vfio_mm_lock);
> > + list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> > + if (vmm->token.val == val) {
> > + vfio_mm_get(vmm);
> > + goto out;
> > + }
> > + }
> > +
> > + vmm = vfio_create_mm(mm);
> > + if (IS_ERR(vmm))
> > + vmm = NULL;
> > +out:
> > + mutex_unlock(&vfio.vfio_mm_lock);
> > + mmput(mm);
> > + return vmm;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > +
> > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > +{
> > + ioasid_t pasid;
> > + int ret = -ENOSPC;
> > +
> > + mutex_lock(&vmm->pasid_lock);
> > +
> > + pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> > + if (pasid == INVALID_IOASID) {
> > + ret = -ENOSPC;
> > + goto out_unlock;
> > + }
> > +
> > + ret = pasid;
> > +out_unlock:
> > + mutex_unlock(&vmm->pasid_lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> > +
> > +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
> > +{
> > + void *pdata;
> > + int ret = 0;
> > +
> > + mutex_lock(&vmm->pasid_lock);
> > + pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > + if (IS_ERR(pdata)) {
> > + ret = PTR_ERR(pdata);
> > + goto out_unlock;
> > + }
> > + ioasid_free(pasid);
> > +
> > +out_unlock:
> > + mutex_unlock(&vmm->pasid_lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> > +
> > +/**
> > * Module/class support
> > */
> > static char *vfio_devnode(struct device *dev, umode_t *mode)
> > @@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
> > idr_init(&vfio.group_idr);
> > mutex_init(&vfio.group_lock);
> > mutex_init(&vfio.iommu_drivers_lock);
> > + mutex_init(&vfio.vfio_mm_lock);
> > INIT_LIST_HEAD(&vfio.group_list);
> > INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> > + INIT_LIST_HEAD(&vfio.vfio_mm_list);
> > init_waitqueue_head(&vfio.release_q);
> >
> > ret = misc_register(&vfio_dev);
>
> Is vfio.c the right place for any of the above? It seems like it could
> all be in a separate vfio_pasid module, similar to our virqfd module.
>
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> > index a177bf2..331ceee 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -70,6 +70,7 @@ struct vfio_iommu {
> > unsigned int dma_avail;
> > bool v2;
> > bool nesting;
> > + struct vfio_mm *vmm;
> > };
> >
> > struct vfio_domain {
> > @@ -2018,6 +2019,7 @@ static void
> vfio_iommu_type1_detach_group(void *iommu_data,
> > static void *vfio_iommu_type1_open(unsigned long arg)
> > {
> > struct vfio_iommu *iommu;
> > + struct vfio_mm *vmm = NULL;
> >
> > iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > if (!iommu)
> > @@ -2043,6 +2045,10 @@ static void *vfio_iommu_type1_open(unsigned
> long arg)
> > iommu->dma_avail = dma_entry_limit;
> > mutex_init(&iommu->lock);
> > BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > + vmm = vfio_mm_get_from_task(current);
> > + if (!vmm)
> > + pr_err("Failed to get vfio_mm track\n");
>
> Doesn't this presume everyone is instantly running PASID capable hosts?
> Looks like a noisy support regression to me.
>
> > + iommu->vmm = vmm;
> >
> > return iommu;
> > }
> > @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void
> *iommu_data)
> > }
> >
> > vfio_iommu_iova_free(&iommu->iova_list);
> > + if (iommu->vmm)
> > + vfio_mm_put(iommu->vmm);
> >
> > kfree(iommu);
> > }
> > @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct
> vfio_iommu *iommu,
> > return ret;
> > }
> >
> > +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> > +{
> > + return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> > + (flags & VFIO_IOMMU_PASID_ALLOC &&
> > + flags & VFIO_IOMMU_PASID_FREE));
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > + int min,
> > + int max)
> > +{
> > + struct vfio_mm *vmm = iommu->vmm;
> > + int ret = 0;
> > +
> > + mutex_lock(&iommu->lock);
> > + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > + ret = -EFAULT;
> > + goto out_unlock;
> > + }
>
> Non-iommu backed mdevs are excluded from this? Is this a matter of
> wiring the call out through the mdev parent device, or is this just
> possible?
>
> > + if (vmm)
> > + ret = vfio_mm_pasid_alloc(vmm, min, max);
> > + else
> > + ret = -EINVAL;
> > +out_unlock:
> > + mutex_unlock(&iommu->lock);
> > + return ret;
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > + unsigned int pasid)
> > +{
> > + struct vfio_mm *vmm = iommu->vmm;
> > + int ret = 0;
> > +
> > + mutex_lock(&iommu->lock);
> > + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > + ret = -EFAULT;
> > + goto out_unlock;
> > + }
>
> So if a container had an iommu backed device when the pasid was
> allocated, but it was removed, now they can't free it? Why do we need
> the check above?
>
> > +
> > + if (vmm)
> > + ret = vfio_mm_pasid_free(vmm, pasid);
> > + else
> > + ret = -EINVAL;
> > +out_unlock:
> > + mutex_unlock(&iommu->lock);
> > + return ret;
> > +}
> > +
> > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > unsigned int cmd, unsigned long arg)
> > {
> > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
> >
> > return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > -EFAULT : 0;
> > +
> > + } else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > + struct vfio_iommu_type1_pasid_request req;
> > + unsigned long offset;
> > +
> > + minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > + flags);
> > +
> > + if (copy_from_user(&req, (void __user *)arg, minsz))
> > + return -EFAULT;
> > +
> > + if (req.argsz < minsz ||
> > + !vfio_iommu_type1_pasid_req_valid(req.flags))
> > + return -EINVAL;
> > +
> > + if (copy_from_user((void *)&req + minsz,
> > + (void __user *)arg + minsz,
> > + sizeof(req) - minsz))
> > + return -EFAULT;
>
> Huh? Why do we have argsz if we're going to assume this is here?
>
> > +
> > + switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > + case VFIO_IOMMU_PASID_ALLOC:
> > + {
> > + int ret = 0, result;
> > +
> > + result = vfio_iommu_type1_pasid_alloc(iommu,
> > + req.alloc_pasid.min,
> > + req.alloc_pasid.max);
> > + if (result > 0) {
> > + offset = offsetof(
> > + struct
> vfio_iommu_type1_pasid_request,
> > + alloc_pasid.result);
> > + ret = copy_to_user(
> > + (void __user *) (arg + offset),
> > + &result, sizeof(result));
>
> Again assuming argsz supports this.
>
> > + } else {
> > + pr_debug("%s: PASID alloc failed\n",
> __func__);
>
> rate limit?
>
> > + ret = -EFAULT;
> > + }
> > + return ret;
> > + }
> > + case VFIO_IOMMU_PASID_FREE:
> > + return vfio_iommu_type1_pasid_free(iommu,
> > + req.free_pasid);
> > + default:
> > + return -EINVAL;
> > + }
> > }
> >
> > return -ENOTTY;
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index e42a711..75f9f7f1 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct
> vfio_iommu_driver_ops *ops);
> > extern void vfio_unregister_iommu_driver(
> > const struct vfio_iommu_driver_ops *ops);
> >
> > +#define VFIO_DEFAULT_PASID_QUOTA 1000
> > +struct vfio_mm_token {
> > + unsigned long long val;
> > +};
> > +
> > +struct vfio_mm {
> > + struct kref kref;
> > + struct vfio_mm_token token;
> > + int ioasid_sid;
> > + /* protect @pasid_quota field and pasid allocation/free */
> > + struct mutex pasid_lock;
> > + int pasid_quota;
> > + struct list_head vfio_next;
> > +};
> > +
> > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> > +extern void vfio_mm_put(struct vfio_mm *vmm);
> > +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> > +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> > +
> > /*
> > * External user API
> > */
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 9e843a1..298ac80 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > #define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
> > #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
> >
> > +/*
> > + * PASID (Process Address Space ID) is a PCIe concept which
> > + * has been extended to support DMA isolation in fine-grain.
> > + * With device assigned to user space (e.g. VMs), PASID alloc
> > + * and free need to be system wide. This structure defines
> > + * the info for pasid alloc/free between user space and kernel
> > + * space.
> > + *
> > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > + */
> > +struct vfio_iommu_type1_pasid_request {
> > + __u32 argsz;
> > +#define VFIO_IOMMU_PASID_ALLOC (1 << 0)
> > +#define VFIO_IOMMU_PASID_FREE (1 << 1)
> > + __u32 flags;
> > + union {
> > + struct {
> > + __u32 min;
> > + __u32 max;
> > + __u32 result;
> > + } alloc_pasid;
> > + __u32 free_pasid;
> > + };
>
> We seem to be using __u8 data[] lately where the struct at data is
> defined by the flags. should we do that here?
>
> > +};
> > +
> > +#define VFIO_PASID_REQUEST_MASK (VFIO_IOMMU_PASID_ALLOC
> | \
> > + VFIO_IOMMU_PASID_FREE)
> > +
> > +/**
> > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> > + * struct vfio_iommu_type1_pasid_request)
> > + *
> > + * Availability of this feature depends on PASID support in the device,
> > + * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
> > + * is available after VFIO_SET_IOMMU.
> > + *
> > + * returns: 0 on success, -errno on failure.
> > + */
> > +#define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE +
> 22)
>
> So a user needs to try to allocate a PASID in order to test for the
> support? Should we have a PROBE flag?
>
> > +
> > /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU --------
> */
> >
> > /*

2020-04-03 12:34:09

by Yi Liu

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

Hi Jean,

> From: Jean-Philippe Brucker <[email protected]>
> Sent: Thursday, April 2, 2020 9:53 PM
> To: Liu, Yi L <[email protected]>
> Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
>
> Hi Yi,
>
> On Sun, Mar 22, 2020 at 05:31:58AM -0700, Liu, Yi L wrote:
> > From: Liu Yi L <[email protected]>
> >
> > For a long time, devices have only one DMA address space from platform
> > IOMMU's point of view. This is true for both bare metal and directed-
> > access in virtualization environment. Reason is the source ID of DMA in
> > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > DMA isolation. However, this is changing with the latest advancement in
> > I/O technology area. More and more platform vendors are utilizing the PCIe
> > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > address spaces as identified by their individual PASIDs. For example,
> > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > let device access multiple process virtual address space by binding the
> > virtual address space with a PASID. Wherein the PASID is allocated in
> > software and programmed to device per device specific manner. Devices
> > which support PASID capability are called PASID-capable devices. If such
> > devices are passed through to VMs, guest software are also able to bind
> > guest process virtual address space on such devices. Therefore, the guest
> > software could reuse the bare metal software programming model, which
> > means guest software will also allocate PASID and program it to device
> > directly. This is a dangerous situation since it has potential PASID
> > conflicts and unauthorized address space access.
>
> It's worth noting that this applies to Intel VT-d with scalable mode, not
> IOMMUs that use one PASID space per VM

Oh yes. will add it.

>
> > It would be safer to
> > let host intercept in the guest software's PASID allocation. Thus PASID
> > are managed system-wide.
> >
> > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to passdown
> > PASID allocation/free request from the virtual IOMMU. Additionally, such
> > requests are intended to be invoked by QEMU or other applications which
> > are running in userspace, it is necessary to have a mechanism to prevent
> > single application from abusing available PASIDs in system. With such
> > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > has many assigned devices, then it should have more quota. However, it
> > is not sure how many PASIDs an assigned devices will use. e.g. it is
> > possible that a VM with multiples assigned devices but requests less
> > PASIDs. Therefore per-VM quota would be better.
> >
> > This patch uses struct mm pointer as a per-VM token. We also considered
> > using task structure pointer and vfio_iommu structure pointer. However,
> > task structure is per-thread, which means it cannot achieve per-VM PASID
> > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > only within vfio. Therefore, structure mm pointer is selected. This patch
> > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > container is opened by a VM. On the reverse order, vfio_mm is free when
> > the last vfio container is released. Each VM is assigned with a PASID
> > quota, so that it is not able to request PASID beyond its quota. This
> > patch adds a default quota of 1000. This quota could be tuned by
> > administrator. Making PASID quota tunable will be added in another patch
> > in this series.
> >
> > Previous discussions:
> > https://patchwork.kernel.org/patch/11209429/
> >
> > Cc: Kevin Tian <[email protected]>
> > CC: Jacob Pan <[email protected]>
> > Cc: Alex Williamson <[email protected]>
> > Cc: Eric Auger <[email protected]>
> > Cc: Jean-Philippe Brucker <[email protected]>
> > Signed-off-by: Liu Yi L <[email protected]>
> > Signed-off-by: Yi Sun <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > drivers/vfio/vfio.c | 130 ++++++++++++++++++++++++++++++++++++++++
> > drivers/vfio/vfio_iommu_type1.c | 104 ++++++++++++++++++++++++++++++++
> > include/linux/vfio.h | 20 +++++++
> > include/uapi/linux/vfio.h | 41 +++++++++++++
> > 4 files changed, 295 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index c848262..d13b483 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -32,6 +32,7 @@
> > #include <linux/vfio.h>
> > #include <linux/wait.h>
> > #include <linux/sched/signal.h>
> > +#include <linux/sched/mm.h>
> >
> > #define DRIVER_VERSION "0.3"
> > #define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
> > @@ -46,6 +47,8 @@ static struct vfio {
> > struct mutex group_lock;
> > struct cdev group_cdev;
> > dev_t group_devt;
> > + struct list_head vfio_mm_list;
> > + struct mutex vfio_mm_lock;
> > wait_queue_head_t release_q;
> > } vfio;
> >
> > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev, enum
> vfio_notify_type type,
> > EXPORT_SYMBOL(vfio_unregister_notifier);
> >
> > /**
> > + * VFIO_MM objects - create, release, get, put, search
> > + * Caller of the function should have held vfio.vfio_mm_lock.
> > + */
> > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> > +{
> > + struct vfio_mm *vmm;
> > + struct vfio_mm_token *token;
> > + int ret = 0;
> > +
> > + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > + if (!vmm)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + /* Per mm IOASID set used for quota control and group operations */
> > + ret = ioasid_alloc_set((struct ioasid_set *) mm,
>
> Hmm, either we need to change the token of ioasid_alloc_set() to "void *",
> or pass an actual ioasid_set struct, but this cast doesn't look good :)
>
> As I commented on the IOASID series, I think we could embed a struct
> ioasid_set into vfio_mm, pass that struct to all other ioasid_* functions,
> and get rid of ioasid_sid.

I think change to "void *" is better as we needs the token to ensure all
threads within a single VM share the same ioasid_set.

> > + VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
> > + if (ret) {
> > + kfree(vmm);
> > + return ERR_PTR(ret);
> > + }
> > +
> > + kref_init(&vmm->kref);
> > + token = &vmm->token;
> > + token->val = mm;
>
> Why the intermediate token struct? Could we just store the mm_struct
> pointer within vfio_mm?

Hmm, here we only want to use the pointer as a token, instead of using
the structure behind the pointer. If store the mm_struct directly, may
leave a space to further use its content, this is not good.

Regards,
Yi Liu

2020-04-03 13:01:32

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

On Fri, Apr 03, 2020 at 11:56:09AM +0000, Liu, Yi L wrote:
> > > /**
> > > + * VFIO_MM objects - create, release, get, put, search
> > > + * Caller of the function should have held vfio.vfio_mm_lock.
> > > + */
> > > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> > > +{
> > > + struct vfio_mm *vmm;
> > > + struct vfio_mm_token *token;
> > > + int ret = 0;
> > > +
> > > + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > > + if (!vmm)
> > > + return ERR_PTR(-ENOMEM);
> > > +
> > > + /* Per mm IOASID set used for quota control and group operations */
> > > + ret = ioasid_alloc_set((struct ioasid_set *) mm,
> >
> > Hmm, either we need to change the token of ioasid_alloc_set() to "void *",
> > or pass an actual ioasid_set struct, but this cast doesn't look good :)
> >
> > As I commented on the IOASID series, I think we could embed a struct
> > ioasid_set into vfio_mm, pass that struct to all other ioasid_* functions,
> > and get rid of ioasid_sid.
>
> I think change to "void *" is better as we needs the token to ensure all
> threads within a single VM share the same ioasid_set.

Don't they share the same vfio_mm?

Thanks,
Jean
>
> > > + VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
> > > + if (ret) {
> > > + kfree(vmm);
> > > + return ERR_PTR(ret);
> > > + }
> > > +
> > > + kref_init(&vmm->kref);
> > > + token = &vmm->token;
> > > + token->val = mm;
> >
> > Why the intermediate token struct? Could we just store the mm_struct
> > pointer within vfio_mm?
>
> Hmm, here we only want to use the pointer as a token, instead of using
> the structure behind the pointer. If store the mm_struct directly, may
> leave a space to further use its content, this is not good.
>
> Regards,
> Yi Liu
>

2020-04-03 13:03:04

by Yi Liu

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

> From: Jean-Philippe Brucker <[email protected]>
> Sent: Friday, April 3, 2020 8:40 PM
> Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
>
> On Fri, Apr 03, 2020 at 11:56:09AM +0000, Liu, Yi L wrote:
> > > > /**
> > > > + * VFIO_MM objects - create, release, get, put, search
> > > > + * Caller of the function should have held vfio.vfio_mm_lock.
> > > > + */
> > > > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm) {
> > > > + struct vfio_mm *vmm;
> > > > + struct vfio_mm_token *token;
> > > > + int ret = 0;
> > > > +
> > > > + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > > > + if (!vmm)
> > > > + return ERR_PTR(-ENOMEM);
> > > > +
> > > > + /* Per mm IOASID set used for quota control and group operations */
> > > > + ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > >
> > > Hmm, either we need to change the token of ioasid_alloc_set() to
> > > "void *", or pass an actual ioasid_set struct, but this cast doesn't
> > > look good :)
> > >
> > > As I commented on the IOASID series, I think we could embed a struct
> > > ioasid_set into vfio_mm, pass that struct to all other ioasid_*
> > > functions, and get rid of ioasid_sid.
> >
> > I think change to "void *" is better as we needs the token to ensure
> > all threads within a single VM share the same ioasid_set.
>
> Don't they share the same vfio_mm?

that's right. then both works well for me.

Regards,
Yi Liu

2020-04-03 13:14:53

by Yi Liu

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

Hi Alex,

> From: Alex Williamson <[email protected]>
> Sent: Friday, April 3, 2020 1:50 AM
> Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
>
> On Sun, 22 Mar 2020 05:31:58 -0700
> "Liu, Yi L" <[email protected]> wrote:
>
> > From: Liu Yi L <[email protected]>
> >
> > For a long time, devices have only one DMA address space from platform
> > IOMMU's point of view. This is true for both bare metal and directed-
> > access in virtualization environment. Reason is the source ID of DMA in
> > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > DMA isolation. However, this is changing with the latest advancement in
> > I/O technology area. More and more platform vendors are utilizing the PCIe
> > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > address spaces as identified by their individual PASIDs. For example,
> > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > let device access multiple process virtual address space by binding the
> > virtual address space with a PASID. Wherein the PASID is allocated in
> > software and programmed to device per device specific manner. Devices
> > which support PASID capability are called PASID-capable devices. If such
> > devices are passed through to VMs, guest software are also able to bind
> > guest process virtual address space on such devices. Therefore, the guest
> > software could reuse the bare metal software programming model, which
> > means guest software will also allocate PASID and program it to device
> > directly. This is a dangerous situation since it has potential PASID
> > conflicts and unauthorized address space access. It would be safer to
> > let host intercept in the guest software's PASID allocation. Thus PASID
> > are managed system-wide.
>
> Providing an allocation interface only allows for collaborative usage
> of PASIDs though. Do we have any ability to enforce PASID usage or can
> a user spoof other PASIDs on the same BDF?
>
> > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to passdown
> > PASID allocation/free request from the virtual IOMMU. Additionally, such
> > requests are intended to be invoked by QEMU or other applications which
> > are running in userspace, it is necessary to have a mechanism to prevent
> > single application from abusing available PASIDs in system. With such
> > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > has many assigned devices, then it should have more quota. However, it
> > is not sure how many PASIDs an assigned devices will use. e.g. it is
> > possible that a VM with multiples assigned devices but requests less
> > PASIDs. Therefore per-VM quota would be better.
> >
> > This patch uses struct mm pointer as a per-VM token. We also considered
> > using task structure pointer and vfio_iommu structure pointer. However,
> > task structure is per-thread, which means it cannot achieve per-VM PASID
> > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > only within vfio. Therefore, structure mm pointer is selected. This patch
> > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > container is opened by a VM. On the reverse order, vfio_mm is free when
> > the last vfio container is released. Each VM is assigned with a PASID
> > quota, so that it is not able to request PASID beyond its quota. This
> > patch adds a default quota of 1000. This quota could be tuned by
> > administrator. Making PASID quota tunable will be added in another patch
> > in this series.
> >
> > Previous discussions:
> > https://patchwork.kernel.org/patch/11209429/
> >
> > Cc: Kevin Tian <[email protected]>
> > CC: Jacob Pan <[email protected]>
> > Cc: Alex Williamson <[email protected]>
> > Cc: Eric Auger <[email protected]>
> > Cc: Jean-Philippe Brucker <[email protected]>
> > Signed-off-by: Liu Yi L <[email protected]>
> > Signed-off-by: Yi Sun <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > drivers/vfio/vfio.c | 130 ++++++++++++++++++++++++++++++++++++++++
> > drivers/vfio/vfio_iommu_type1.c | 104 ++++++++++++++++++++++++++++++++
> > include/linux/vfio.h | 20 +++++++
> > include/uapi/linux/vfio.h | 41 +++++++++++++
> > 4 files changed, 295 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index c848262..d13b483 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -32,6 +32,7 @@
> > #include <linux/vfio.h>
> > #include <linux/wait.h>
> > #include <linux/sched/signal.h>
> > +#include <linux/sched/mm.h>
> >
> > #define DRIVER_VERSION "0.3"
> > #define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
> > @@ -46,6 +47,8 @@ static struct vfio {
> > struct mutex group_lock;
> > struct cdev group_cdev;
> > dev_t group_devt;
> > + struct list_head vfio_mm_list;
> > + struct mutex vfio_mm_lock;
> > wait_queue_head_t release_q;
> > } vfio;
> >
> > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev, enum
> vfio_notify_type type,
> > EXPORT_SYMBOL(vfio_unregister_notifier);
> >
> > /**
> > + * VFIO_MM objects - create, release, get, put, search
> > + * Caller of the function should have held vfio.vfio_mm_lock.
> > + */
> > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> > +{
> > + struct vfio_mm *vmm;
> > + struct vfio_mm_token *token;
> > + int ret = 0;
> > +
> > + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > + if (!vmm)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + /* Per mm IOASID set used for quota control and group operations */
> > + ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > + VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
> > + if (ret) {
> > + kfree(vmm);
> > + return ERR_PTR(ret);
> > + }
> > +
> > + kref_init(&vmm->kref);
> > + token = &vmm->token;
> > + token->val = mm;
> > + vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > + mutex_init(&vmm->pasid_lock);
> > +
> > + list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> > +
> > + return vmm;
> > +}
> > +
> > +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
> > +{
> > + /* destroy the ioasid set */
> > + ioasid_free_set(vmm->ioasid_sid, true);
> > + mutex_unlock(&vfio.vfio_mm_lock);
> > + kfree(vmm);
> > +}
> > +
> > +/* called with vfio.vfio_mm_lock held */
> > +static void vfio_mm_release(struct kref *kref)
> > +{
> > + struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> > +
> > + list_del(&vmm->vfio_next);
> > + vfio_mm_unlock_and_free(vmm);
> > +}
> > +
> > +void vfio_mm_put(struct vfio_mm *vmm)
> > +{
> > + kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio.vfio_mm_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> > +
> > +/* Assume vfio_mm_lock or vfio_mm reference is held */
> > +static void vfio_mm_get(struct vfio_mm *vmm)
> > +{
> > + kref_get(&vmm->kref);
> > +}
> > +
> > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> > +{
> > + struct mm_struct *mm = get_task_mm(task);
> > + struct vfio_mm *vmm;
> > + unsigned long long val = (unsigned long long) mm;
> > +
> > + mutex_lock(&vfio.vfio_mm_lock);
> > + list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> > + if (vmm->token.val == val) {
> > + vfio_mm_get(vmm);
> > + goto out;
> > + }
> > + }
> > +
> > + vmm = vfio_create_mm(mm);
> > + if (IS_ERR(vmm))
> > + vmm = NULL;
> > +out:
> > + mutex_unlock(&vfio.vfio_mm_lock);
> > + mmput(mm);
> > + return vmm;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > +
> > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > +{
> > + ioasid_t pasid;
> > + int ret = -ENOSPC;
> > +
> > + mutex_lock(&vmm->pasid_lock);
> > +
> > + pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> > + if (pasid == INVALID_IOASID) {
> > + ret = -ENOSPC;
> > + goto out_unlock;
> > + }
> > +
> > + ret = pasid;
> > +out_unlock:
> > + mutex_unlock(&vmm->pasid_lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> > +
> > +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
> > +{
> > + void *pdata;
> > + int ret = 0;
> > +
> > + mutex_lock(&vmm->pasid_lock);
> > + pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > + if (IS_ERR(pdata)) {
> > + ret = PTR_ERR(pdata);
> > + goto out_unlock;
> > + }
> > + ioasid_free(pasid);
> > +
> > +out_unlock:
> > + mutex_unlock(&vmm->pasid_lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> > +
> > +/**
> > * Module/class support
> > */
> > static char *vfio_devnode(struct device *dev, umode_t *mode)
> > @@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
> > idr_init(&vfio.group_idr);
> > mutex_init(&vfio.group_lock);
> > mutex_init(&vfio.iommu_drivers_lock);
> > + mutex_init(&vfio.vfio_mm_lock);
> > INIT_LIST_HEAD(&vfio.group_list);
> > INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> > + INIT_LIST_HEAD(&vfio.vfio_mm_list);
> > init_waitqueue_head(&vfio.release_q);
> >
> > ret = misc_register(&vfio_dev);
>
> Is vfio.c the right place for any of the above? It seems like it could
> all be in a separate vfio_pasid module, similar to our virqfd module.

I think it could be a separate vfio_pasid module. let me make it in next
version if it's your preference. :-)

> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index a177bf2..331ceee 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -70,6 +70,7 @@ struct vfio_iommu {
> > unsigned int dma_avail;
> > bool v2;
> > bool nesting;
> > + struct vfio_mm *vmm;
> > };
> >
> > struct vfio_domain {
> > @@ -2018,6 +2019,7 @@ static void vfio_iommu_type1_detach_group(void
> *iommu_data,
> > static void *vfio_iommu_type1_open(unsigned long arg)
> > {
> > struct vfio_iommu *iommu;
> > + struct vfio_mm *vmm = NULL;
> >
> > iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > if (!iommu)
> > @@ -2043,6 +2045,10 @@ static void *vfio_iommu_type1_open(unsigned long
> arg)
> > iommu->dma_avail = dma_entry_limit;
> > mutex_init(&iommu->lock);
> > BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > + vmm = vfio_mm_get_from_task(current);
> > + if (!vmm)
> > + pr_err("Failed to get vfio_mm track\n");
>
> Doesn't this presume everyone is instantly running PASID capable hosts?
> Looks like a noisy support regression to me.

right, it is. Kevin also questioned this part, I'll refine it and avoid
regression noisy.

> > + iommu->vmm = vmm;
> >
> > return iommu;
> > }
> > @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void
> *iommu_data)
> > }
> >
> > vfio_iommu_iova_free(&iommu->iova_list);
> > + if (iommu->vmm)
> > + vfio_mm_put(iommu->vmm);
> >
> > kfree(iommu);
> > }
> > @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct
> vfio_iommu *iommu,
> > return ret;
> > }
> >
> > +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> > +{
> > + return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> > + (flags & VFIO_IOMMU_PASID_ALLOC &&
> > + flags & VFIO_IOMMU_PASID_FREE));
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > + int min,
> > + int max)
> > +{
> > + struct vfio_mm *vmm = iommu->vmm;
> > + int ret = 0;
> > +
> > + mutex_lock(&iommu->lock);
> > + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > + ret = -EFAULT;
> > + goto out_unlock;
> > + }
>
> Non-iommu backed mdevs are excluded from this? Is this a matter of
> wiring the call out through the mdev parent device, or is this just
> possible?

At the beginning, non-iommu backed mdevs are excluded. However,
Combined with your succeeded comment. I think this check should be
removed as the PASID alloc/free capability should be available as
long as the container is backed by a pasid-capable iommu backend.
So should remove it, and it is the same with the free path.

>
> > + if (vmm)
> > + ret = vfio_mm_pasid_alloc(vmm, min, max);
> > + else
> > + ret = -EINVAL;
> > +out_unlock:
> > + mutex_unlock(&iommu->lock);
> > + return ret;
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > + unsigned int pasid)
> > +{
> > + struct vfio_mm *vmm = iommu->vmm;
> > + int ret = 0;
> > +
> > + mutex_lock(&iommu->lock);
> > + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > + ret = -EFAULT;
> > + goto out_unlock;
> > + }
>
> So if a container had an iommu backed device when the pasid was
> allocated, but it was removed, now they can't free it? Why do we need
> the check above?

should be removed. thanks for spotting it.

> > +
> > + if (vmm)
> > + ret = vfio_mm_pasid_free(vmm, pasid);
> > + else
> > + ret = -EINVAL;
> > +out_unlock:
> > + mutex_unlock(&iommu->lock);
> > + return ret;
> > +}
> > +
> > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > unsigned int cmd, unsigned long arg)
> > {
> > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
> >
> > return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > -EFAULT : 0;
> > +
> > + } else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > + struct vfio_iommu_type1_pasid_request req;
> > + unsigned long offset;
> > +
> > + minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > + flags);
> > +
> > + if (copy_from_user(&req, (void __user *)arg, minsz))
> > + return -EFAULT;
> > +
> > + if (req.argsz < minsz ||
> > + !vfio_iommu_type1_pasid_req_valid(req.flags))
> > + return -EINVAL;
> > +
> > + if (copy_from_user((void *)&req + minsz,
> > + (void __user *)arg + minsz,
> > + sizeof(req) - minsz))
> > + return -EFAULT;
>
> Huh? Why do we have argsz if we're going to assume this is here?

do you mean replacing sizeof(req) with argsz? if yes, I can do that.

> > +
> > + switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > + case VFIO_IOMMU_PASID_ALLOC:
> > + {
> > + int ret = 0, result;
> > +
> > + result = vfio_iommu_type1_pasid_alloc(iommu,
> > + req.alloc_pasid.min,
> > + req.alloc_pasid.max);
> > + if (result > 0) {
> > + offset = offsetof(
> > + struct vfio_iommu_type1_pasid_request,
> > + alloc_pasid.result);
> > + ret = copy_to_user(
> > + (void __user *) (arg + offset),
> > + &result, sizeof(result));
>
> Again assuming argsz supports this.

same as above.

>
> > + } else {
> > + pr_debug("%s: PASID alloc failed\n", __func__);
>
> rate limit?

not quite get. could you give more hints?

> > + ret = -EFAULT;
> > + }
> > + return ret;
> > + }
> > + case VFIO_IOMMU_PASID_FREE:
> > + return vfio_iommu_type1_pasid_free(iommu,
> > + req.free_pasid);
> > + default:
> > + return -EINVAL;
> > + }
> > }
> >
> > return -ENOTTY;
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index e42a711..75f9f7f1 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct
> vfio_iommu_driver_ops *ops);
> > extern void vfio_unregister_iommu_driver(
> > const struct vfio_iommu_driver_ops *ops);
> >
> > +#define VFIO_DEFAULT_PASID_QUOTA 1000
> > +struct vfio_mm_token {
> > + unsigned long long val;
> > +};
> > +
> > +struct vfio_mm {
> > + struct kref kref;
> > + struct vfio_mm_token token;
> > + int ioasid_sid;
> > + /* protect @pasid_quota field and pasid allocation/free */
> > + struct mutex pasid_lock;
> > + int pasid_quota;
> > + struct list_head vfio_next;
> > +};
> > +
> > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> > +extern void vfio_mm_put(struct vfio_mm *vmm);
> > +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> > +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> > +
> > /*
> > * External user API
> > */
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 9e843a1..298ac80 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > #define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
> > #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
> >
> > +/*
> > + * PASID (Process Address Space ID) is a PCIe concept which
> > + * has been extended to support DMA isolation in fine-grain.
> > + * With device assigned to user space (e.g. VMs), PASID alloc
> > + * and free need to be system wide. This structure defines
> > + * the info for pasid alloc/free between user space and kernel
> > + * space.
> > + *
> > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > + */
> > +struct vfio_iommu_type1_pasid_request {
> > + __u32 argsz;
> > +#define VFIO_IOMMU_PASID_ALLOC (1 << 0)
> > +#define VFIO_IOMMU_PASID_FREE (1 << 1)
> > + __u32 flags;
> > + union {
> > + struct {
> > + __u32 min;
> > + __u32 max;
> > + __u32 result;
> > + } alloc_pasid;
> > + __u32 free_pasid;
> > + };
>
> We seem to be using __u8 data[] lately where the struct at data is
> defined by the flags. should we do that here?

yeah, I can do that. BTW. Do you want to let the structure in the
lately patch share the same structure with this one? As I can foresee,
the two structures would look like similar as both of them include
argsz, flags and data[] fields. The difference is the definition of
flags. what about your opinion?

struct vfio_iommu_type1_pasid_request {
__u32 argsz;
#define VFIO_IOMMU_PASID_ALLOC (1 << 0)
#define VFIO_IOMMU_PASID_FREE (1 << 1)
__u32 flags;
__u8 data[];
};

struct vfio_iommu_type1_bind {
__u32 argsz;
__u32 flags;
#define VFIO_IOMMU_BIND_GUEST_PGTBL (1 << 0)
#define VFIO_IOMMU_UNBIND_GUEST_PGTBL (1 << 1)
__u8 data[];
};

>
> > +};
> > +
> > +#define VFIO_PASID_REQUEST_MASK (VFIO_IOMMU_PASID_ALLOC | \
> > + VFIO_IOMMU_PASID_FREE)
> > +
> > +/**
> > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> > + * struct vfio_iommu_type1_pasid_request)
> > + *
> > + * Availability of this feature depends on PASID support in the device,
> > + * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
> > + * is available after VFIO_SET_IOMMU.
> > + *
> > + * returns: 0 on success, -errno on failure.
> > + */
> > +#define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE + 22)
>
> So a user needs to try to allocate a PASID in order to test for the
> support? Should we have a PROBE flag?

answered in in later patch. :-)

Regards,
Yi Liu

2020-04-03 15:17:36

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

On Fri, 3 Apr 2020 05:58:55 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Alex Williamson <[email protected]>
> > Sent: Friday, April 3, 2020 1:50 AM
> >
> > On Sun, 22 Mar 2020 05:31:58 -0700
> > "Liu, Yi L" <[email protected]> wrote:
> >
> > > From: Liu Yi L <[email protected]>
> > >
> > > For a long time, devices have only one DMA address space from platform
> > > IOMMU's point of view. This is true for both bare metal and directed-
> > > access in virtualization environment. Reason is the source ID of DMA in
> > > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > > DMA isolation. However, this is changing with the latest advancement in
> > > I/O technology area. More and more platform vendors are utilizing the
> > PCIe
> > > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > > address spaces as identified by their individual PASIDs. For example,
> > > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > > let device access multiple process virtual address space by binding the
> > > virtual address space with a PASID. Wherein the PASID is allocated in
> > > software and programmed to device per device specific manner. Devices
> > > which support PASID capability are called PASID-capable devices. If such
> > > devices are passed through to VMs, guest software are also able to bind
> > > guest process virtual address space on such devices. Therefore, the guest
> > > software could reuse the bare metal software programming model, which
> > > means guest software will also allocate PASID and program it to device
> > > directly. This is a dangerous situation since it has potential PASID
> > > conflicts and unauthorized address space access. It would be safer to
> > > let host intercept in the guest software's PASID allocation. Thus PASID
> > > are managed system-wide.
> >
> > Providing an allocation interface only allows for collaborative usage
> > of PASIDs though. Do we have any ability to enforce PASID usage or can
> > a user spoof other PASIDs on the same BDF?
>
> An user can access only PASIDs allocated to itself, i.e. the specific IOASID
> set tied to its mm_struct.

A user is only _supposed_ to access PASIDs allocated to itself. AIUI
the mm_struct is used for managing the pool of IOASIDs from which the
user may allocate that PASID. We also state that programming the PASID
into the device is device specific. Therefore, are we simply trusting
the user to use a PASID that's been allocated to them when they program
the device? If a user can program an arbitrary PASID into the device,
then what prevents them from attempting to access data from another
user via the device? I think I've asked this question before, so if
there's a previous explanation or spec section I need to review, please
point me to it. Thanks,

Alex

2020-04-03 17:52:09

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

On Fri, 3 Apr 2020 13:12:50 +0000
"Liu, Yi L" <[email protected]> wrote:

> Hi Alex,
>
> > From: Alex Williamson <[email protected]>
> > Sent: Friday, April 3, 2020 1:50 AM
> > Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> >
> > On Sun, 22 Mar 2020 05:31:58 -0700
> > "Liu, Yi L" <[email protected]> wrote:
> >
> > > From: Liu Yi L <[email protected]>
> > >
> > > For a long time, devices have only one DMA address space from platform
> > > IOMMU's point of view. This is true for both bare metal and directed-
> > > access in virtualization environment. Reason is the source ID of DMA in
> > > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > > DMA isolation. However, this is changing with the latest advancement in
> > > I/O technology area. More and more platform vendors are utilizing the PCIe
> > > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > > address spaces as identified by their individual PASIDs. For example,
> > > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > > let device access multiple process virtual address space by binding the
> > > virtual address space with a PASID. Wherein the PASID is allocated in
> > > software and programmed to device per device specific manner. Devices
> > > which support PASID capability are called PASID-capable devices. If such
> > > devices are passed through to VMs, guest software are also able to bind
> > > guest process virtual address space on such devices. Therefore, the guest
> > > software could reuse the bare metal software programming model, which
> > > means guest software will also allocate PASID and program it to device
> > > directly. This is a dangerous situation since it has potential PASID
> > > conflicts and unauthorized address space access. It would be safer to
> > > let host intercept in the guest software's PASID allocation. Thus PASID
> > > are managed system-wide.
> >
> > Providing an allocation interface only allows for collaborative usage
> > of PASIDs though. Do we have any ability to enforce PASID usage or can
> > a user spoof other PASIDs on the same BDF?
> >
> > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to passdown
> > > PASID allocation/free request from the virtual IOMMU. Additionally, such
> > > requests are intended to be invoked by QEMU or other applications which
> > > are running in userspace, it is necessary to have a mechanism to prevent
> > > single application from abusing available PASIDs in system. With such
> > > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > > has many assigned devices, then it should have more quota. However, it
> > > is not sure how many PASIDs an assigned devices will use. e.g. it is
> > > possible that a VM with multiples assigned devices but requests less
> > > PASIDs. Therefore per-VM quota would be better.
> > >
> > > This patch uses struct mm pointer as a per-VM token. We also considered
> > > using task structure pointer and vfio_iommu structure pointer. However,
> > > task structure is per-thread, which means it cannot achieve per-VM PASID
> > > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > > only within vfio. Therefore, structure mm pointer is selected. This patch
> > > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > > container is opened by a VM. On the reverse order, vfio_mm is free when
> > > the last vfio container is released. Each VM is assigned with a PASID
> > > quota, so that it is not able to request PASID beyond its quota. This
> > > patch adds a default quota of 1000. This quota could be tuned by
> > > administrator. Making PASID quota tunable will be added in another patch
> > > in this series.
> > >
> > > Previous discussions:
> > > https://patchwork.kernel.org/patch/11209429/
> > >
> > > Cc: Kevin Tian <[email protected]>
> > > CC: Jacob Pan <[email protected]>
> > > Cc: Alex Williamson <[email protected]>
> > > Cc: Eric Auger <[email protected]>
> > > Cc: Jean-Philippe Brucker <[email protected]>
> > > Signed-off-by: Liu Yi L <[email protected]>
> > > Signed-off-by: Yi Sun <[email protected]>
> > > Signed-off-by: Jacob Pan <[email protected]>
> > > ---
> > > drivers/vfio/vfio.c | 130 ++++++++++++++++++++++++++++++++++++++++
> > > drivers/vfio/vfio_iommu_type1.c | 104 ++++++++++++++++++++++++++++++++
> > > include/linux/vfio.h | 20 +++++++
> > > include/uapi/linux/vfio.h | 41 +++++++++++++
> > > 4 files changed, 295 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > index c848262..d13b483 100644
> > > --- a/drivers/vfio/vfio.c
> > > +++ b/drivers/vfio/vfio.c
> > > @@ -32,6 +32,7 @@
> > > #include <linux/vfio.h>
> > > #include <linux/wait.h>
> > > #include <linux/sched/signal.h>
> > > +#include <linux/sched/mm.h>
> > >
> > > #define DRIVER_VERSION "0.3"
> > > #define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
> > > @@ -46,6 +47,8 @@ static struct vfio {
> > > struct mutex group_lock;
> > > struct cdev group_cdev;
> > > dev_t group_devt;
> > > + struct list_head vfio_mm_list;
> > > + struct mutex vfio_mm_lock;
> > > wait_queue_head_t release_q;
> > > } vfio;
> > >
> > > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev, enum
> > vfio_notify_type type,
> > > EXPORT_SYMBOL(vfio_unregister_notifier);
> > >
> > > /**
> > > + * VFIO_MM objects - create, release, get, put, search
> > > + * Caller of the function should have held vfio.vfio_mm_lock.
> > > + */
> > > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> > > +{
> > > + struct vfio_mm *vmm;
> > > + struct vfio_mm_token *token;
> > > + int ret = 0;
> > > +
> > > + vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > > + if (!vmm)
> > > + return ERR_PTR(-ENOMEM);
> > > +
> > > + /* Per mm IOASID set used for quota control and group operations */
> > > + ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > > + VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
> > > + if (ret) {
> > > + kfree(vmm);
> > > + return ERR_PTR(ret);
> > > + }
> > > +
> > > + kref_init(&vmm->kref);
> > > + token = &vmm->token;
> > > + token->val = mm;
> > > + vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > > + mutex_init(&vmm->pasid_lock);
> > > +
> > > + list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> > > +
> > > + return vmm;
> > > +}
> > > +
> > > +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
> > > +{
> > > + /* destroy the ioasid set */
> > > + ioasid_free_set(vmm->ioasid_sid, true);
> > > + mutex_unlock(&vfio.vfio_mm_lock);
> > > + kfree(vmm);
> > > +}
> > > +
> > > +/* called with vfio.vfio_mm_lock held */
> > > +static void vfio_mm_release(struct kref *kref)
> > > +{
> > > + struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> > > +
> > > + list_del(&vmm->vfio_next);
> > > + vfio_mm_unlock_and_free(vmm);
> > > +}
> > > +
> > > +void vfio_mm_put(struct vfio_mm *vmm)
> > > +{
> > > + kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio.vfio_mm_lock);
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> > > +
> > > +/* Assume vfio_mm_lock or vfio_mm reference is held */
> > > +static void vfio_mm_get(struct vfio_mm *vmm)
> > > +{
> > > + kref_get(&vmm->kref);
> > > +}
> > > +
> > > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> > > +{
> > > + struct mm_struct *mm = get_task_mm(task);
> > > + struct vfio_mm *vmm;
> > > + unsigned long long val = (unsigned long long) mm;
> > > +
> > > + mutex_lock(&vfio.vfio_mm_lock);
> > > + list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> > > + if (vmm->token.val == val) {
> > > + vfio_mm_get(vmm);
> > > + goto out;
> > > + }
> > > + }
> > > +
> > > + vmm = vfio_create_mm(mm);
> > > + if (IS_ERR(vmm))
> > > + vmm = NULL;
> > > +out:
> > > + mutex_unlock(&vfio.vfio_mm_lock);
> > > + mmput(mm);
> > > + return vmm;
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > +
> > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > > +{
> > > + ioasid_t pasid;
> > > + int ret = -ENOSPC;
> > > +
> > > + mutex_lock(&vmm->pasid_lock);
> > > +
> > > + pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> > > + if (pasid == INVALID_IOASID) {
> > > + ret = -ENOSPC;
> > > + goto out_unlock;
> > > + }
> > > +
> > > + ret = pasid;
> > > +out_unlock:
> > > + mutex_unlock(&vmm->pasid_lock);
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> > > +
> > > +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
> > > +{
> > > + void *pdata;
> > > + int ret = 0;
> > > +
> > > + mutex_lock(&vmm->pasid_lock);
> > > + pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > > + if (IS_ERR(pdata)) {
> > > + ret = PTR_ERR(pdata);
> > > + goto out_unlock;
> > > + }
> > > + ioasid_free(pasid);
> > > +
> > > +out_unlock:
> > > + mutex_unlock(&vmm->pasid_lock);
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> > > +
> > > +/**
> > > * Module/class support
> > > */
> > > static char *vfio_devnode(struct device *dev, umode_t *mode)
> > > @@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
> > > idr_init(&vfio.group_idr);
> > > mutex_init(&vfio.group_lock);
> > > mutex_init(&vfio.iommu_drivers_lock);
> > > + mutex_init(&vfio.vfio_mm_lock);
> > > INIT_LIST_HEAD(&vfio.group_list);
> > > INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> > > + INIT_LIST_HEAD(&vfio.vfio_mm_list);
> > > init_waitqueue_head(&vfio.release_q);
> > >
> > > ret = misc_register(&vfio_dev);
> >
> > Is vfio.c the right place for any of the above? It seems like it could
> > all be in a separate vfio_pasid module, similar to our virqfd module.
>
> I think it could be a separate vfio_pasid module. let me make it in next
> version if it's your preference. :-)
>
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index a177bf2..331ceee 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -70,6 +70,7 @@ struct vfio_iommu {
> > > unsigned int dma_avail;
> > > bool v2;
> > > bool nesting;
> > > + struct vfio_mm *vmm;
> > > };
> > >
> > > struct vfio_domain {
> > > @@ -2018,6 +2019,7 @@ static void vfio_iommu_type1_detach_group(void
> > *iommu_data,
> > > static void *vfio_iommu_type1_open(unsigned long arg)
> > > {
> > > struct vfio_iommu *iommu;
> > > + struct vfio_mm *vmm = NULL;
> > >
> > > iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > > if (!iommu)
> > > @@ -2043,6 +2045,10 @@ static void *vfio_iommu_type1_open(unsigned long
> > arg)
> > > iommu->dma_avail = dma_entry_limit;
> > > mutex_init(&iommu->lock);
> > > BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > > + vmm = vfio_mm_get_from_task(current);
> > > + if (!vmm)
> > > + pr_err("Failed to get vfio_mm track\n");
> >
> > Doesn't this presume everyone is instantly running PASID capable hosts?
> > Looks like a noisy support regression to me.
>
> right, it is. Kevin also questioned this part, I'll refine it and avoid
> regression noisy.
>
> > > + iommu->vmm = vmm;
> > >
> > > return iommu;
> > > }
> > > @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void
> > *iommu_data)
> > > }
> > >
> > > vfio_iommu_iova_free(&iommu->iova_list);
> > > + if (iommu->vmm)
> > > + vfio_mm_put(iommu->vmm);
> > >
> > > kfree(iommu);
> > > }
> > > @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct
> > vfio_iommu *iommu,
> > > return ret;
> > > }
> > >
> > > +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> > > +{
> > > + return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> > > + (flags & VFIO_IOMMU_PASID_ALLOC &&
> > > + flags & VFIO_IOMMU_PASID_FREE));
> > > +}
> > > +
> > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > + int min,
> > > + int max)
> > > +{
> > > + struct vfio_mm *vmm = iommu->vmm;
> > > + int ret = 0;
> > > +
> > > + mutex_lock(&iommu->lock);
> > > + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > + ret = -EFAULT;
> > > + goto out_unlock;
> > > + }
> >
> > Non-iommu backed mdevs are excluded from this? Is this a matter of
> > wiring the call out through the mdev parent device, or is this just
> > possible?
>
> At the beginning, non-iommu backed mdevs are excluded. However,
> Combined with your succeeded comment. I think this check should be
> removed as the PASID alloc/free capability should be available as
> long as the container is backed by a pasid-capable iommu backend.
> So should remove it, and it is the same with the free path.
>
> >
> > > + if (vmm)
> > > + ret = vfio_mm_pasid_alloc(vmm, min, max);
> > > + else
> > > + ret = -EINVAL;
> > > +out_unlock:
> > > + mutex_unlock(&iommu->lock);
> > > + return ret;
> > > +}
> > > +
> > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > + unsigned int pasid)
> > > +{
> > > + struct vfio_mm *vmm = iommu->vmm;
> > > + int ret = 0;
> > > +
> > > + mutex_lock(&iommu->lock);
> > > + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > + ret = -EFAULT;
> > > + goto out_unlock;
> > > + }
> >
> > So if a container had an iommu backed device when the pasid was
> > allocated, but it was removed, now they can't free it? Why do we need
> > the check above?
>
> should be removed. thanks for spotting it.
>
> > > +
> > > + if (vmm)
> > > + ret = vfio_mm_pasid_free(vmm, pasid);
> > > + else
> > > + ret = -EINVAL;
> > > +out_unlock:
> > > + mutex_unlock(&iommu->lock);
> > > + return ret;
> > > +}
> > > +
> > > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > unsigned int cmd, unsigned long arg)
> > > {
> > > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void
> > *iommu_data,
> > >
> > > return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > > -EFAULT : 0;
> > > +
> > > + } else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > > + struct vfio_iommu_type1_pasid_request req;
> > > + unsigned long offset;
> > > +
> > > + minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > > + flags);
> > > +
> > > + if (copy_from_user(&req, (void __user *)arg, minsz))
> > > + return -EFAULT;
> > > +
> > > + if (req.argsz < minsz ||
> > > + !vfio_iommu_type1_pasid_req_valid(req.flags))
> > > + return -EINVAL;
> > > +
> > > + if (copy_from_user((void *)&req + minsz,
> > > + (void __user *)arg + minsz,
> > > + sizeof(req) - minsz))
> > > + return -EFAULT;
> >
> > Huh? Why do we have argsz if we're going to assume this is here?
>
> do you mean replacing sizeof(req) with argsz? if yes, I can do that.

No, I mean the user tells us how much data they've provided via argsz.
We create minsz the the end of flags and verify argsz includes flags.
Then we proceed to ignore argsz to see if the user has provided the
remainder of the structure.

> > > +
> > > + switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > > + case VFIO_IOMMU_PASID_ALLOC:
> > > + {
> > > + int ret = 0, result;
> > > +
> > > + result = vfio_iommu_type1_pasid_alloc(iommu,
> > > + req.alloc_pasid.min,
> > > + req.alloc_pasid.max);
> > > + if (result > 0) {
> > > + offset = offsetof(
> > > + struct vfio_iommu_type1_pasid_request,
> > > + alloc_pasid.result);
> > > + ret = copy_to_user(
> > > + (void __user *) (arg + offset),
> > > + &result, sizeof(result));
> >
> > Again assuming argsz supports this.
>
> same as above.
>
> >
> > > + } else {
> > > + pr_debug("%s: PASID alloc failed\n", __func__);
> >
> > rate limit?
>
> not quite get. could you give more hints?

A user can spam the host logs simply by allocating their quota of
PASIDs and then trying to allocate more, or by specifying min/max such
that they cannot allocate the requested PASID. If this logging is
necessary for debugging, it should be ratelimited to avoid a DoS on the
host.

> > > + ret = -EFAULT;
> > > + }
> > > + return ret;
> > > + }
> > > + case VFIO_IOMMU_PASID_FREE:
> > > + return vfio_iommu_type1_pasid_free(iommu,
> > > + req.free_pasid);
> > > + default:
> > > + return -EINVAL;
> > > + }
> > > }
> > >
> > > return -ENOTTY;
> > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > index e42a711..75f9f7f1 100644
> > > --- a/include/linux/vfio.h
> > > +++ b/include/linux/vfio.h
> > > @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct
> > vfio_iommu_driver_ops *ops);
> > > extern void vfio_unregister_iommu_driver(
> > > const struct vfio_iommu_driver_ops *ops);
> > >
> > > +#define VFIO_DEFAULT_PASID_QUOTA 1000
> > > +struct vfio_mm_token {
> > > + unsigned long long val;
> > > +};
> > > +
> > > +struct vfio_mm {
> > > + struct kref kref;
> > > + struct vfio_mm_token token;
> > > + int ioasid_sid;
> > > + /* protect @pasid_quota field and pasid allocation/free */
> > > + struct mutex pasid_lock;
> > > + int pasid_quota;
> > > + struct list_head vfio_next;
> > > +};
> > > +
> > > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> > > +extern void vfio_mm_put(struct vfio_mm *vmm);
> > > +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> > > +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> > > +
> > > /*
> > > * External user API
> > > */
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 9e843a1..298ac80 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > > #define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
> > > #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
> > >
> > > +/*
> > > + * PASID (Process Address Space ID) is a PCIe concept which
> > > + * has been extended to support DMA isolation in fine-grain.
> > > + * With device assigned to user space (e.g. VMs), PASID alloc
> > > + * and free need to be system wide. This structure defines
> > > + * the info for pasid alloc/free between user space and kernel
> > > + * space.
> > > + *
> > > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > > + */
> > > +struct vfio_iommu_type1_pasid_request {
> > > + __u32 argsz;
> > > +#define VFIO_IOMMU_PASID_ALLOC (1 << 0)
> > > +#define VFIO_IOMMU_PASID_FREE (1 << 1)
> > > + __u32 flags;
> > > + union {
> > > + struct {
> > > + __u32 min;
> > > + __u32 max;
> > > + __u32 result;
> > > + } alloc_pasid;
> > > + __u32 free_pasid;
> > > + };
> >
> > We seem to be using __u8 data[] lately where the struct at data is
> > defined by the flags. should we do that here?
>
> yeah, I can do that. BTW. Do you want to let the structure in the
> lately patch share the same structure with this one? As I can foresee,
> the two structures would look like similar as both of them include
> argsz, flags and data[] fields. The difference is the definition of
> flags. what about your opinion?
>
> struct vfio_iommu_type1_pasid_request {
> __u32 argsz;
> #define VFIO_IOMMU_PASID_ALLOC (1 << 0)
> #define VFIO_IOMMU_PASID_FREE (1 << 1)
> __u32 flags;
> __u8 data[];
> };
>
> struct vfio_iommu_type1_bind {
> __u32 argsz;
> __u32 flags;
> #define VFIO_IOMMU_BIND_GUEST_PGTBL (1 << 0)
> #define VFIO_IOMMU_UNBIND_GUEST_PGTBL (1 << 1)
> __u8 data[];
> };


Yes, I was even wondering the same for the cache invalidate ioctl, or
whether this is going too far for a general purpose "everything related
to PASIDs" ioctl. We need to factor usability into the equation too.
I'd be interested in opinions from others here too. Clearly I don't
like single use, throw-away ioctls, but I can find myself on either
side of the argument that allocation, binding, and invalidating are all
within the domain of PASIDs and could fall within a single ioctl or
they each represent different facets of managing PASIDs and should have
separate ioctls. Thanks,

Alex


> > > +};
> > > +
> > > +#define VFIO_PASID_REQUEST_MASK (VFIO_IOMMU_PASID_ALLOC | \
> > > + VFIO_IOMMU_PASID_FREE)
> > > +
> > > +/**
> > > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> > > + * struct vfio_iommu_type1_pasid_request)
> > > + *
> > > + * Availability of this feature depends on PASID support in the device,
> > > + * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
> > > + * is available after VFIO_SET_IOMMU.
> > > + *
> > > + * returns: 0 on success, -errno on failure.
> > > + */
> > > +#define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE + 22)
> >
> > So a user needs to try to allocate a PASID in order to test for the
> > support? Should we have a PROBE flag?
>
> answered in in later patch. :-)
>
> Regards,
> Yi Liu
>

2020-04-07 04:44:01

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

> From: Alex Williamson
> Sent: Friday, April 3, 2020 11:14 PM
>
> On Fri, 3 Apr 2020 05:58:55 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Alex Williamson <[email protected]>
> > > Sent: Friday, April 3, 2020 1:50 AM
> > >
> > > On Sun, 22 Mar 2020 05:31:58 -0700
> > > "Liu, Yi L" <[email protected]> wrote:
> > >
> > > > From: Liu Yi L <[email protected]>
> > > >
> > > > For a long time, devices have only one DMA address space from
> platform
> > > > IOMMU's point of view. This is true for both bare metal and directed-
> > > > access in virtualization environment. Reason is the source ID of DMA in
> > > > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > > > DMA isolation. However, this is changing with the latest advancement in
> > > > I/O technology area. More and more platform vendors are utilizing the
> > > PCIe
> > > > PASID TLP prefix in DMA requests, thus to give devices with multiple
> DMA
> > > > address spaces as identified by their individual PASIDs. For example,
> > > > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > > > let device access multiple process virtual address space by binding the
> > > > virtual address space with a PASID. Wherein the PASID is allocated in
> > > > software and programmed to device per device specific manner.
> Devices
> > > > which support PASID capability are called PASID-capable devices. If such
> > > > devices are passed through to VMs, guest software are also able to bind
> > > > guest process virtual address space on such devices. Therefore, the
> guest
> > > > software could reuse the bare metal software programming model,
> which
> > > > means guest software will also allocate PASID and program it to device
> > > > directly. This is a dangerous situation since it has potential PASID
> > > > conflicts and unauthorized address space access. It would be safer to
> > > > let host intercept in the guest software's PASID allocation. Thus PASID
> > > > are managed system-wide.
> > >
> > > Providing an allocation interface only allows for collaborative usage
> > > of PASIDs though. Do we have any ability to enforce PASID usage or can
> > > a user spoof other PASIDs on the same BDF?
> >
> > An user can access only PASIDs allocated to itself, i.e. the specific IOASID
> > set tied to its mm_struct.
>
> A user is only _supposed_ to access PASIDs allocated to itself. AIUI
> the mm_struct is used for managing the pool of IOASIDs from which the
> user may allocate that PASID. We also state that programming the PASID
> into the device is device specific. Therefore, are we simply trusting
> the user to use a PASID that's been allocated to them when they program
> the device? If a user can program an arbitrary PASID into the device,
> then what prevents them from attempting to access data from another
> user via the device? I think I've asked this question before, so if
> there's a previous explanation or spec section I need to review, please
> point me to it. Thanks,
>

There are two scenarios:

(1) for PF/VF, the iommu driver maintains an individual PASID table per
PDF. Although the PASID namespace is global, the per-BDF PASID table
contains only valid entries for those PASIDs which are allocated to the
mm_struct. The user is free to program arbitrary PASID into the assigned
device, but using invalid PASIDs simply hit iommu fault.

(2) for mdev, multiple mdev instances share the same PASID table of
the parent BDF. However, PASID programming is a privileged operation
in multiplexing usage, thus must be mediated by mdev device driver.
The mediation logic will guarantee that only allocated PASIDs are
forwarded to the device.

Thanks
Kevin

2020-04-07 04:53:08

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

> From: Alex Williamson
> Sent: Saturday, April 4, 2020 1:50 AM
[...]
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index 9e843a1..298ac80 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > > > #define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
> > > > #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
> > > >
> > > > +/*
> > > > + * PASID (Process Address Space ID) is a PCIe concept which
> > > > + * has been extended to support DMA isolation in fine-grain.
> > > > + * With device assigned to user space (e.g. VMs), PASID alloc
> > > > + * and free need to be system wide. This structure defines
> > > > + * the info for pasid alloc/free between user space and kernel
> > > > + * space.
> > > > + *
> > > > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > > > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > > > + */
> > > > +struct vfio_iommu_type1_pasid_request {
> > > > + __u32 argsz;
> > > > +#define VFIO_IOMMU_PASID_ALLOC (1 << 0)
> > > > +#define VFIO_IOMMU_PASID_FREE (1 << 1)
> > > > + __u32 flags;
> > > > + union {
> > > > + struct {
> > > > + __u32 min;
> > > > + __u32 max;
> > > > + __u32 result;
> > > > + } alloc_pasid;
> > > > + __u32 free_pasid;
> > > > + };
> > >
> > > We seem to be using __u8 data[] lately where the struct at data is
> > > defined by the flags. should we do that here?
> >
> > yeah, I can do that. BTW. Do you want to let the structure in the
> > lately patch share the same structure with this one? As I can foresee,
> > the two structures would look like similar as both of them include
> > argsz, flags and data[] fields. The difference is the definition of
> > flags. what about your opinion?
> >
> > struct vfio_iommu_type1_pasid_request {
> > __u32 argsz;
> > #define VFIO_IOMMU_PASID_ALLOC (1 << 0)
> > #define VFIO_IOMMU_PASID_FREE (1 << 1)
> > __u32 flags;
> > __u8 data[];
> > };
> >
> > struct vfio_iommu_type1_bind {
> > __u32 argsz;
> > __u32 flags;
> > #define VFIO_IOMMU_BIND_GUEST_PGTBL (1 << 0)
> > #define VFIO_IOMMU_UNBIND_GUEST_PGTBL (1 << 1)
> > __u8 data[];
> > };
>
>
> Yes, I was even wondering the same for the cache invalidate ioctl, or
> whether this is going too far for a general purpose "everything related
> to PASIDs" ioctl. We need to factor usability into the equation too.
> I'd be interested in opinions from others here too. Clearly I don't
> like single use, throw-away ioctls, but I can find myself on either
> side of the argument that allocation, binding, and invalidating are all
> within the domain of PASIDs and could fall within a single ioctl or
> they each represent different facets of managing PASIDs and should have
> separate ioctls. Thanks,
>

Looking at uapi/linux/iommu.h:

* Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take any argument other than
* @version and @cache.

Although intel-iommu handles only PASID-related invalidation now, I
suppose other vendors (or future usages?) may allow non-pasid
based invalidation too based on above comment.

Thanks
Kevin

2020-04-07 15:17:18

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

On Tue, 7 Apr 2020 04:42:02 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Alex Williamson
> > Sent: Friday, April 3, 2020 11:14 PM
> >
> > On Fri, 3 Apr 2020 05:58:55 +0000
> > "Tian, Kevin" <[email protected]> wrote:
> >
> > > > From: Alex Williamson <[email protected]>
> > > > Sent: Friday, April 3, 2020 1:50 AM
> > > >
> > > > On Sun, 22 Mar 2020 05:31:58 -0700
> > > > "Liu, Yi L" <[email protected]> wrote:
> > > >
> > > > > From: Liu Yi L <[email protected]>
> > > > >
> > > > > For a long time, devices have only one DMA address space from
> > platform
> > > > > IOMMU's point of view. This is true for both bare metal and directed-
> > > > > access in virtualization environment. Reason is the source ID of DMA in
> > > > > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > > > > DMA isolation. However, this is changing with the latest advancement in
> > > > > I/O technology area. More and more platform vendors are utilizing the
> > > > PCIe
> > > > > PASID TLP prefix in DMA requests, thus to give devices with multiple
> > DMA
> > > > > address spaces as identified by their individual PASIDs. For example,
> > > > > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > > > > let device access multiple process virtual address space by binding the
> > > > > virtual address space with a PASID. Wherein the PASID is allocated in
> > > > > software and programmed to device per device specific manner.
> > Devices
> > > > > which support PASID capability are called PASID-capable devices. If such
> > > > > devices are passed through to VMs, guest software are also able to bind
> > > > > guest process virtual address space on such devices. Therefore, the
> > guest
> > > > > software could reuse the bare metal software programming model,
> > which
> > > > > means guest software will also allocate PASID and program it to device
> > > > > directly. This is a dangerous situation since it has potential PASID
> > > > > conflicts and unauthorized address space access. It would be safer to
> > > > > let host intercept in the guest software's PASID allocation. Thus PASID
> > > > > are managed system-wide.
> > > >
> > > > Providing an allocation interface only allows for collaborative usage
> > > > of PASIDs though. Do we have any ability to enforce PASID usage or can
> > > > a user spoof other PASIDs on the same BDF?
> > >
> > > An user can access only PASIDs allocated to itself, i.e. the specific IOASID
> > > set tied to its mm_struct.
> >
> > A user is only _supposed_ to access PASIDs allocated to itself. AIUI
> > the mm_struct is used for managing the pool of IOASIDs from which the
> > user may allocate that PASID. We also state that programming the PASID
> > into the device is device specific. Therefore, are we simply trusting
> > the user to use a PASID that's been allocated to them when they program
> > the device? If a user can program an arbitrary PASID into the device,
> > then what prevents them from attempting to access data from another
> > user via the device? I think I've asked this question before, so if
> > there's a previous explanation or spec section I need to review, please
> > point me to it. Thanks,
> >
>
> There are two scenarios:
>
> (1) for PF/VF, the iommu driver maintains an individual PASID table per
> PDF. Although the PASID namespace is global, the per-BDF PASID table
> contains only valid entries for those PASIDs which are allocated to the
> mm_struct. The user is free to program arbitrary PASID into the assigned
> device, but using invalid PASIDs simply hit iommu fault.
>
> (2) for mdev, multiple mdev instances share the same PASID table of
> the parent BDF. However, PASID programming is a privileged operation
> in multiplexing usage, thus must be mediated by mdev device driver.
> The mediation logic will guarantee that only allocated PASIDs are
> forwarded to the device.

Thanks, I was confused about multiple tenants sharing a BDF when PASID
programming to the device is device specific, and therefore not
something we can virtualize. However, the solution is device specific
virtualization via mdev. Thus, any time we're sharing a BDF between
tenants, we must virtualize the PASID programming and therefore it must
be an mdev device currently. If a tenant is the exclusive user of the
BDF, then no virtualization of the PASID programming is required. I
think it's clear now (again). Thanks,

Alex

2020-04-08 00:55:41

by Yi Liu

[permalink] [raw]
Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

Hi Alex,
> From: Alex Williamson <[email protected]>
> Sent: Saturday, April 4, 2020 1:50 AM
> Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
>
> On Fri, 3 Apr 2020 13:12:50 +0000
> "Liu, Yi L" <[email protected]> wrote:
>
> > Hi Alex,
> >
> > > From: Alex Williamson <[email protected]>
> > > Sent: Friday, April 3, 2020 1:50 AM
> > > Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > >
> > > On Sun, 22 Mar 2020 05:31:58 -0700
> > > "Liu, Yi L" <[email protected]> wrote:
> > >
> > > > From: Liu Yi L <[email protected]>
> > > >
[...]
> > > > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > > unsigned int cmd, unsigned long arg)
> > > > {
> > > > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void
> > > *iommu_data,
> > > >
> > > > return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > > > -EFAULT : 0;
> > > > +
> > > > + } else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > > > + struct vfio_iommu_type1_pasid_request req;
> > > > + unsigned long offset;
> > > > +
> > > > + minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > > > + flags);
> > > > +
> > > > + if (copy_from_user(&req, (void __user *)arg, minsz))
> > > > + return -EFAULT;
> > > > +
> > > > + if (req.argsz < minsz ||
> > > > + !vfio_iommu_type1_pasid_req_valid(req.flags))
> > > > + return -EINVAL;
> > > > +
> > > > + if (copy_from_user((void *)&req + minsz,
> > > > + (void __user *)arg + minsz,
> > > > + sizeof(req) - minsz))
> > > > + return -EFAULT;
> > >
> > > Huh? Why do we have argsz if we're going to assume this is here?
> >
> > do you mean replacing sizeof(req) with argsz? if yes, I can do that.
>
> No, I mean the user tells us how much data they've provided via argsz.
> We create minsz the the end of flags and verify argsz includes flags.
> Then we proceed to ignore argsz to see if the user has provided the
> remainder of the structure.

I think I should avoid using sizeof(req) as it may be variable
new flag is added. I think better to make a data[] field in struct
vfio_iommu_type1_pasid_request and copy data[] per flag. I'll
make this change in new version.

> > > > +
> > > > + switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > > > + case VFIO_IOMMU_PASID_ALLOC:
> > > > + {
> > > > + int ret = 0, result;
> > > > +
> > > > + result = vfio_iommu_type1_pasid_alloc(iommu,
> > > > + req.alloc_pasid.min,
> > > > + req.alloc_pasid.max);
> > > > + if (result > 0) {
> > > > + offset = offsetof(
> > > > + struct vfio_iommu_type1_pasid_request,
> > > > + alloc_pasid.result);
> > > > + ret = copy_to_user(
> > > > + (void __user *) (arg + offset),
> > > > + &result, sizeof(result));
> > >
> > > Again assuming argsz supports this.
> >
> > same as above.
> >
> > >
> > > > + } else {
> > > > + pr_debug("%s: PASID alloc failed\n", __func__);
> > >
> > > rate limit?
> >
> > not quite get. could you give more hints?
>
> A user can spam the host logs simply by allocating their quota of
> PASIDs and then trying to allocate more, or by specifying min/max such
> that they cannot allocate the requested PASID. If this logging is
> necessary for debugging, it should be ratelimited to avoid a DoS on the
> host.

got it. thanks for the coaching. will use pr_debug_ratelimited().

Regards,
Yi Liu