2021-09-19 14:44:49

Subject: [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management

Linux now includes multiple device-passthrough frameworks (e.g. VFIO and
vDPA) to manage secure device access from the userspace. One critical task
of those frameworks is to put the assigned device in a secure, IOMMU-
protected context so user-initiated DMAs are prevented from doing harm to
the rest of the system.

Currently those frameworks implement their own logic for managing I/O page
tables to isolate user-initiated DMAs. This doesn't scale to support many
new IOMMU features, such as PASID-granular DMA remapping, nested translation,
I/O page fault, IOMMU dirty bit, etc.

/dev/iommu is introduced as a unified interface for managing I/O address
spaces and DMA isolation for passthrough devices. It originated from the
upstream discussion of the vSVA enabling work [1].

This RFC aims to provide a basic skeleton for the above proposal, w/o adding
any new feature beyond what vfio type1 provides today. For an overview of
future extensions, please refer to the full design proposal [2].

The core concepts in /dev/iommu are iommufd and ioasid. An iommufd (obtained
by opening /dev/iommu) is the container holding multiple I/O address spaces,
while an ioasid is the fd-local software handle which represents an I/O
address space and is associated with a single I/O page table. The user
manages those address spaces through fd operations, e.g. by using vfio
type1v2 mapping semantics to manage the respective I/O page tables.

An I/O address space takes effect in the iommu only after a device is
attached to it. Multiple devices can be attached to one I/O address space.
In this RFC one device can be attached to only a single I/O address space,
to match vfio type1 behavior as the starting point.

A device must be bound to an iommufd before an attach operation can be
conducted. The binding operation builds the connection between the devicefd
(opened via the device-passthrough framework) and the iommufd. Most
importantly, the entire /dev/iommu framework adopts a device-centric model
w/o carrying any container/group legacy as current vfio does. This requires
that the binding operation also establish a security context which prevents
the bound device from accessing the rest of the system, as the contract for
vfio to grant user access to the assigned device. A detailed explanation of
this aspect can be found in patch 06.

Last, the format of an I/O page table must be compatible with the attached
devices (or more specifically with the IOMMU which serves the DMA from the
attached devices). The user is responsible for specifying the format when
allocating an IOASID, according to the one or more devices which will be
attached right after. The device IOMMU format can be queried via iommufd
once a device is successfully bound to the iommufd. An attempt to attach a
device to an IOASID with an incompatible format is simply rejected.
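
The attach rules described so far (bind first, one I/O address space per
device, matching page table format) can be sketched as a toy userspace
model. This is not the kernel code: all toy_* names, the format tags and
the errno choices are assumptions made purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page table format tags; the real encoding is not shown here. */
enum toy_format { TOY_FMT_A = 1, TOY_FMT_B = 2 };

struct toy_ioasid {
	int id;
	enum toy_format format;   /* chosen by the user at allocation time */
};

struct toy_device {
	bool bound;               /* has the device been bound to the iommufd? */
	enum toy_format format;   /* format of the IOMMU serving this device */
	struct toy_ioasid *ioas;  /* NULL while not attached */
};

/* Attach @dev to @ioas, applying the three rules from the text. */
int toy_attach(struct toy_device *dev, struct toy_ioasid *ioas)
{
	assert(dev && ioas);

	if (!dev->bound)
		return -ENODEV;       /* must bind before attach (illustrative errno) */
	if (dev->ioas)
		return -EBUSY;        /* one I/O address space per device in this RFC */
	if (dev->format != ioas->format)
		return -EINVAL;       /* incompatible format (illustrative errno) */

	dev->ioas = ioas;
	return 0;
}

/* Detach @dev from whatever I/O address space it is attached to. */
void toy_detach(struct toy_device *dev)
{
	assert(dev);
	dev->ioas = NULL;
}
```

Several toy_device objects may point at the same toy_ioasid, which mirrors
the "one address space, multiple devices" rule; the reverse is refused.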

The skeleton is mostly implemented in iommufd, except that the bind_iommufd/
ioasid_attach operations are initiated via device-passthrough framework
specific uAPIs. This RFC only changes vfio to work with iommufd. vdpa
support can be added at a later stage.

Basically iommufd provides the following uAPIs and helper functions:

- IOMMU_DEVICE_GET_INFO, for querying per-device iommu capability/format;
- IOMMU_IOASID_ALLOC/FREE, as the names suggest;
- IOMMU_[UN]MAP_DMA, providing vfio type1v2 semantics for managing a
specific I/O page table;
- helper functions for vfio to bind_iommufd/attach_ioasid with devices;

vfio extensions include:
- A new interface for user to open a device w/o using container/group uAPI;
- VFIO_DEVICE_BIND_IOMMUFD, for binding a vfio device to an iommufd;
* unbind is automatically done when devicefd is closed;
- VFIO_DEVICE_[DE]ATTACH_IOASID, for attaching/detaching a vfio device
to/from an ioasid in the specified iommufd;
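
As a rough illustration of how a caller might prepare an attach request,
the sketch below copies the struct layout from patch 15 of this series and
repeats the sanity checks its ioctl handler performs. The helper names
make_attach_req()/check_attach_req() and the OFFSETOFEND macro are
hypothetical userspace stand-ins, not part of the proposed uAPI.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Layout copied from the uAPI struct added in patch 15 of this series. */
struct vfio_device_attach_ioasid {
	uint32_t argsz;
	uint32_t flags;
	int32_t  iommu_fd;
	int32_t  ioasid;
};

/* Userspace stand-in for the kernel's offsetofend() helper. */
#define OFFSETOFEND(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Hypothetical helper: build a request for an iommufd/ioasid pair. */
struct vfio_device_attach_ioasid make_attach_req(int iommu_fd, int ioasid)
{
	struct vfio_device_attach_ioasid req = {
		.argsz    = sizeof(req),
		.flags    = 0,            /* reserved, must be zero */
		.iommu_fd = iommu_fd,
		.ioasid   = ioasid,
	};
	return req;
}

/* Hypothetical helper: the same checks the ioctl handler applies. */
int check_attach_req(const struct vfio_device_attach_ioasid *req)
{
	size_t minsz = OFFSETOFEND(struct vfio_device_attach_ioasid, ioasid);

	assert(req);
	if (req->argsz < minsz || req->flags ||
	    req->iommu_fd < 0 || req->ioasid < 0)
		return -EINVAL;
	return 0;
}
```

In a real program the filled struct would be passed to ioctl() on the
devicefd with VFIO_DEVICE_ATTACH_IOASID; that call is omitted here because
it needs a kernel carrying this RFC.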

[TODO in RFC v2]

We did one temporary hack in v1 by reusing vfio_iommu_type1.c to implement
IOMMU_[UN]MAP_DMA. This leads to some dirty code in patch 16/17/18. We
estimate that almost 80% of the current type1 code is related to map/unmap.
It takes non-trivial effort to either duplicate it in iommufd or make it
shared by both vfio and iommufd. We hope this hack doesn't affect the
review of the overall skeleton, since the role of this part is very clear.
Based on the feedback received we will make a clean implementation in v2.

For userspace, we did not have time for a clean implementation in QEMU.
Instead, we just wrote a simple application (similar to the example in
iommufd.rst) and verified the basic workflow (bind/unbind, alloc/free
ioasid, attach/detach, map/unmap, multi-device group, etc.). We did
verify that the I/O page table mappings are established as expected, though
no DMA is conducted. We plan to have a clean implementation in QEMU and
provide a public link for reference when v2 is sent out.

[TODO out of this RFC]

The entire /dev/iommu project involves lots of tasks. It has to grow in
stages. Below is a rough list of TODO features. Most of them can be
developed in parallel after this skeleton is accepted. For more detail
please refer to the design proposal [2]:

1. Move more vfio device types to iommufd:
* device which does no-snoop DMA
* software mdev
* PPC device
* platform device

2. New vfio device type
* hardware mdev/subdev (with PASID)

3. vDPA adoption

4. User-managed I/O page table
* ioasid nesting (hardware)
* ioasid nesting (software)
* pasid virtualization
o pdev (arm/amd)
o pdev/mdev which doesn't support enqcmd (intel)
o pdev/mdev which supports enqcmd (intel)
* I/O page fault (stage-1)

5. Miscellaneous
* I/O page fault (stage-2), for on-demand paging
* IOMMU dirty bit, for hardware-assisted dirty page tracking
* shared I/O page table (mm, ept, etc.)
* vfio/vdpa shim to avoid code duplication for legacy uAPI
* hardware-assisted vIOMMU

[1] https://lore.kernel.org/linux-iommu/[email protected]/
[2] https://lore.kernel.org/kvm/BN9PR11MB5433B1E4AE5B0480369F97178C189@BN9PR11MB5433.namprd11.prod.outlook.com/

[Series Overview]
* Basic skeleton:
0001-iommu-iommufd-Add-dev-iommu-core.patch

* VFIO PCI creates device-centric interface:
0002-vfio-Add-device-class-for-dev-vfio-devices.patch
0003-vfio-Add-vfio_-un-register_device.patch
0004-iommu-Add-iommu_device_get_info-interface.patch
0005-vfio-pci-Register-device-to-dev-vfio-devices.patch

* Bind device fd with iommufd:
0006-iommu-Add-iommu_device_init-exit-_user_dma-interface.patch
0007-iommu-iommufd-Add-iommufd_-un-bind_device.patch
0008-vfio-pci-Add-VFIO_DEVICE_BIND_IOMMUFD.patch

* IOASID allocation:
0009-iommu-Add-page-size-and-address-width-attributes.patch
0010-iommu-iommufd-Add-IOMMU_DEVICE_GET_INFO.patch
0011-iommu-iommufd-Add-IOMMU_IOASID_ALLOC-FREE.patch
0012-iommu-iommufd-Add-IOMMU_CHECK_EXTENSION.patch

* IOASID [de]attach:
0013-iommu-Extend-iommu_at-de-tach_device-for-multiple-de.patch
0014-iommu-iommufd-Add-iommufd_device_-de-attach_ioasid.patch
0015-vfio-pci-Add-VFIO_DEVICE_-DE-ATTACH_IOASID.patch

* DMA (un)map:
0016-vfio-type1-Export-symbols-for-dma-un-map-code-sharin.patch
0017-iommu-iommufd-Report-iova-range-to-userspace.patch
0018-iommu-iommufd-Add-IOMMU_-UN-MAP_DMA-on-IOASID.patch

* Report the device info in vt-d driver to enable whole series:
0019-iommu-vt-d-Implement-device_info-iommu_ops-callback.patch

* Add doc:
0020-Doc-Add-documentation-for-dev-iommu.patch

Complete code can be found in:
https://github.com/luxis1999/dev-iommu/commits/dev-iommu-5.14-rfcv1

Thanks for your time!

Regards,
Yi Liu
---

Liu Yi L (15):
iommu/iommufd: Add /dev/iommu core
vfio: Add device class for /dev/vfio/devices
vfio: Add vfio_[un]register_device()
vfio/pci: Register device to /dev/vfio/devices
iommu/iommufd: Add iommufd_[un]bind_device()
vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
iommu/iommufd: Add IOMMU_CHECK_EXTENSION
iommu/iommufd: Add iommufd_device_[de]attach_ioasid()
vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID
vfio/type1: Export symbols for dma [un]map code sharing
iommu/iommufd: Report iova range to userspace
iommu/iommufd: Add IOMMU_[UN]MAP_DMA on IOASID
Doc: Add documentation for /dev/iommu

Lu Baolu (5):
iommu: Add iommu_device_get_info interface
iommu: Add iommu_device_init[exit]_user_dma interfaces
iommu: Add page size and address width attributes
iommu: Extend iommu_at[de]tach_device() for multiple devices group
iommu/vt-d: Implement device_info iommu_ops callback

Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/iommufd.rst | 183 ++++++
drivers/iommu/Kconfig | 1 +
drivers/iommu/Makefile | 1 +
drivers/iommu/intel/iommu.c | 35 +
drivers/iommu/iommu.c | 188 +++++-
drivers/iommu/iommufd/Kconfig | 11 +
drivers/iommu/iommufd/Makefile | 2 +
drivers/iommu/iommufd/iommufd.c | 832 ++++++++++++++++++++++++
drivers/vfio/pci/Kconfig | 1 +
drivers/vfio/pci/vfio_pci.c | 179 ++++-
drivers/vfio/pci/vfio_pci_private.h | 10 +
drivers/vfio/vfio.c | 366 ++++++++++-
drivers/vfio/vfio_iommu_type1.c | 246 ++++++-
include/linux/iommu.h | 35 +
include/linux/iommufd.h | 71 ++
include/linux/vfio.h | 27 +
include/uapi/linux/iommu.h | 162 +++++
include/uapi/linux/vfio.h | 56 ++
19 files changed, 2358 insertions(+), 49 deletions(-)
create mode 100644 Documentation/userspace-api/iommufd.rst
create mode 100644 drivers/iommu/iommufd/Kconfig
create mode 100644 drivers/iommu/iommufd/Makefile
create mode 100644 drivers/iommu/iommufd/iommufd.c
create mode 100644 include/linux/iommufd.h

--
2.25.1

2021-09-19 14:46:11

by Yi Liu

Subject: [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID

This patch adds an interface for userspace to attach a device to a
specified IOASID.

Note:
One device can only be attached to one IOASID in this version. This is
on par with what vfio provides today. In the future this restriction can
be relaxed when multiple I/O address spaces are supported per device.

Signed-off-by: Liu Yi L <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 82 +++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_private.h | 1 +
include/linux/iommufd.h | 1 +
include/uapi/linux/vfio.h | 26 +++++++++
4 files changed, 110 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 20006bb66430..5b1fda333122 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -557,6 +557,11 @@ static void vfio_pci_release(struct vfio_device *core_vdev)
if (vdev->videv) {
struct vfio_iommufd_device *videv = vdev->videv;

+ if (videv->ioasid != IOMMUFD_INVALID_IOASID) {
+ iommufd_device_detach_ioasid(videv->idev,
+ videv->ioasid);
+ videv->ioasid = IOMMUFD_INVALID_IOASID;
+ }
vdev->videv = NULL;
iommufd_unbind_device(videv->idev);
kfree(videv);
@@ -839,6 +844,7 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
}
videv->idev = idev;
videv->iommu_fd = bind_data.iommu_fd;
+ videv->ioasid = IOMMUFD_INVALID_IOASID;
/*
* A security context has been established. Unblock
* user access.
@@ -848,6 +854,82 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
vdev->videv = videv;
mutex_unlock(&vdev->videv_lock);

+ return 0;
+ } else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {
+ struct vfio_device_attach_ioasid attach;
+ unsigned long minsz;
+ struct vfio_iommufd_device *videv;
+ int ret = 0;
+
+ /* not allowed if the device is opened in legacy interface */
+ if (vfio_device_in_container(core_vdev))
+ return -ENOTTY;
+
+ minsz = offsetofend(struct vfio_device_attach_ioasid, ioasid);
+ if (copy_from_user(&attach, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (attach.argsz < minsz || attach.flags ||
+ attach.iommu_fd < 0 || attach.ioasid < 0)
+ return -EINVAL;
+
+ mutex_lock(&vdev->videv_lock);
+
+ videv = vdev->videv;
+ if (!videv || videv->iommu_fd != attach.iommu_fd) {
+ mutex_unlock(&vdev->videv_lock);
+ return -EINVAL;
+ }
+
+ /* Currently only allows one IOASID attach */
+ if (videv->ioasid != IOMMUFD_INVALID_IOASID) {
+ mutex_unlock(&vdev->videv_lock);
+ return -EBUSY;
+ }
+
+ ret = __pci_iommufd_device_attach_ioasid(vdev->pdev,
+ videv->idev,
+ attach.ioasid);
+ if (!ret)
+ videv->ioasid = attach.ioasid;
+ mutex_unlock(&vdev->videv_lock);
+
+ return ret;
+ } else if (cmd == VFIO_DEVICE_DETACH_IOASID) {
+ struct vfio_device_attach_ioasid attach;
+ unsigned long minsz;
+ struct vfio_iommufd_device *videv;
+
+ /* not allowed if the device is opened in legacy interface */
+ if (vfio_device_in_container(core_vdev))
+ return -ENOTTY;
+
+ minsz = offsetofend(struct vfio_device_attach_ioasid, ioasid);
+ if (copy_from_user(&attach, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (attach.argsz < minsz || attach.flags ||
+ attach.iommu_fd < 0 || attach.ioasid < 0)
+ return -EINVAL;
+
+ mutex_lock(&vdev->videv_lock);
+
+ videv = vdev->videv;
+ if (!videv || videv->iommu_fd != attach.iommu_fd) {
+ mutex_unlock(&vdev->videv_lock);
+ return -EINVAL;
+ }
+
+ if (videv->ioasid == IOMMUFD_INVALID_IOASID ||
+ videv->ioasid != attach.ioasid) {
+ mutex_unlock(&vdev->videv_lock);
+ return -EINVAL;
+ }
+
+ videv->ioasid = IOMMUFD_INVALID_IOASID;
+ iommufd_device_detach_ioasid(videv->idev, attach.ioasid);
+ mutex_unlock(&vdev->videv_lock);
+
return 0;
} else if (cmd == VFIO_DEVICE_GET_INFO) {
struct vfio_device_info info;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index bd784accac35..daa0f08ac835 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -103,6 +103,7 @@ struct vfio_pci_mmap_vma {
struct vfio_iommufd_device {
struct iommufd_device *idev;
int iommu_fd;
+ int ioasid;
};

struct vfio_pci_device {
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 01a4fe934143..36d8d2fd22bb 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -17,6 +17,7 @@

#define IOMMUFD_IOASID_MAX ((unsigned int)(0x7FFFFFFF))
#define IOMMUFD_IOASID_MIN 0
+#define IOMMUFD_INVALID_IOASID -1

#define IOMMUFD_DEVID_MAX ((unsigned int)(0x7FFFFFFF))
#define IOMMUFD_DEVID_MIN 0
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index c902abd60339..61493ab03038 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -220,6 +220,32 @@ struct vfio_device_iommu_bind_data {

#define VFIO_DEVICE_BIND_IOMMUFD _IO(VFIO_TYPE, VFIO_BASE + 19)

+/*
+ * VFIO_DEVICE_ATTACH_IOASID - _IOW(VFIO_TYPE, VFIO_BASE + 20,
+ * struct vfio_device_attach_ioasid)
+ *
+ * Attach a vfio device to the specified IOASID
+ *
+ * Multiple vfio devices can be attached to the same IOASID. One device can
+ * be attached to only one ioasid at this point.
+ *
+ * @argsz: user filled size of this data.
+ * @flags: reserved for future extension.
+ * @iommu_fd: iommufd where the ioasid comes from.
+ * @ioasid: target I/O address space.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_attach_ioasid {
+ __u32 argsz;
+ __u32 flags;
+ __s32 iommu_fd;
+ __s32 ioasid;
+};
+
+#define VFIO_DEVICE_ATTACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 20)
+#define VFIO_DEVICE_DETACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 21)
+
/**
* VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
* struct vfio_device_info)
--
2.25.1

2021-09-19 15:27:38

by Yi Liu

Subject: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices

This patch exposes the device-centric interface for vfio-pci devices. To
be compatible with existing users, vfio-pci exposes both the legacy group
interface and the device-centric interface.

As explained in the last patch, this change doesn't apply to devices which
cannot be forced to snoop cache by their upstream iommu. Such devices
are still expected to be opened via the legacy group interface.

When the device is opened via /dev/vfio/devices, vfio-pci should prevent
the user from accessing the assigned device because the device is still
attached to the default domain, which may allow user-initiated DMAs to
touch arbitrary locations. User access must be blocked until the device
is later bound to an iommufd (see patch 08). The binding acts as the
contract for putting the device in a security context which ensures that
user-initiated DMAs via this device cannot harm the rest of the system.

This patch introduces a vdev->block_access flag for this purpose. It's set
when the device is opened via /dev/vfio/devices and cleared after binding
to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
user access should be blocked or not.
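
The gating described above can be modelled in plain userspace C. The
sketch below uses C11 atomics in place of the kernel's atomic_t, and the
toy_* names are made up for illustration; only the -ENODEV return value is
taken from the diff in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_vdev {
	atomic_int block_access;  /* 1: user access blocked, 0: allowed */
};

/*
 * Open path: block access unless the device was opened through the
 * legacy group/container interface, which is already in a security context.
 */
void toy_open(struct toy_vdev *vdev, bool in_container)
{
	assert(vdev);
	atomic_store(&vdev->block_access, in_container ? 0 : 1);
}

/* Called once binding to an iommufd has succeeded. */
void toy_bind_done(struct toy_vdev *vdev)
{
	assert(vdev);
	atomic_store(&vdev->block_access, 0);
}

/* Stand-in for the read/write and mmap handlers. */
int toy_access(struct toy_vdev *vdev)
{
	assert(vdev);
	if (atomic_load(&vdev->block_access))
		return -ENODEV;
	return 0;
}
```

A device opened via the new interface therefore fails every toy_access()
call until toy_bind_done() runs, while the legacy path is never blocked.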

An alternative option is to use a dummy fops when the device is opened and
then switch to the real fops (replace_fops()) after binding. Input on which
option is better would be appreciated.

The legacy group interface doesn't have this problem. Its uAPI requires the
user to first put the device into a security context via the container/group
attaching process, before opening the device through the groupfd.

Signed-off-by: Liu Yi L <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 25 +++++++++++++++++++++++--
drivers/vfio/pci/vfio_pci_private.h | 1 +
drivers/vfio/vfio.c | 3 ++-
include/linux/vfio.h | 1 +
4 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 318864d52837..145addde983b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -572,6 +572,10 @@ static int vfio_pci_open(struct vfio_device *core_vdev)

vfio_spapr_pci_eeh_open(vdev->pdev);
vfio_pci_vf_token_user_add(vdev, 1);
+ if (!vfio_device_in_container(core_vdev))
+ atomic_set(&vdev->block_access, 1);
+ else
+ atomic_set(&vdev->block_access, 0);
}
vdev->refcnt++;
error:
@@ -1374,6 +1378,9 @@ static ssize_t vfio_pci_rw(struct vfio_pci_device *vdev, char __user *buf,
{
unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);

+ if (atomic_read(&vdev->block_access))
+ return -ENODEV;
+
if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
return -EINVAL;

@@ -1640,6 +1647,9 @@ static int vfio_pci_mmap(struct vfio_device *core_vdev, struct vm_area_struct *v
u64 phys_len, req_len, pgoff, req_start;
int ret;

+ if (atomic_read(&vdev->block_access))
+ return -ENODEV;
+
index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);

if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
@@ -1978,6 +1988,8 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
struct vfio_pci_device *vdev;
struct iommu_group *group;
int ret;
+ u32 flags;
+ bool snoop = false;

if (vfio_pci_is_denylisted(pdev))
return -EINVAL;
@@ -2046,9 +2058,18 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
vfio_pci_set_power_state(vdev, PCI_D3hot);
}

- ret = vfio_register_group_dev(&vdev->vdev);
- if (ret)
+ flags = VFIO_DEVNODE_GROUP;
+ ret = iommu_device_get_info(&pdev->dev,
+ IOMMU_DEV_INFO_FORCE_SNOOP, &snoop);
+ if (!ret && snoop)
+ flags |= VFIO_DEVNODE_NONGROUP;
+
+ ret = vfio_register_device(&vdev->vdev, flags);
+ if (ret) {
+ pr_debug("Failed to register device interface\n");
goto out_power;
+ }
+
dev_set_drvdata(&pdev->dev, vdev);
return 0;

diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 5a36272cecbf..f12012e30b53 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -143,6 +143,7 @@ struct vfio_pci_device {
struct mutex vma_lock;
struct list_head vma_list;
struct rw_semaphore memory_lock;
+ atomic_t block_access;
};

#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 1e87b25962f1..22851747e92c 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1789,10 +1789,11 @@ static int vfio_device_fops_open(struct inode *inode, struct file *filep)
return ret;
}

-static bool vfio_device_in_container(struct vfio_device *device)
+bool vfio_device_in_container(struct vfio_device *device)
{
return !!(device->group && device->group->container);
}
+EXPORT_SYMBOL_GPL(vfio_device_in_container);

static int vfio_device_fops_release(struct inode *inode, struct file *filep)
{
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 9448b751b663..fd0629acb948 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -81,6 +81,7 @@ enum vfio_iommu_notify_type {

extern int vfio_register_device(struct vfio_device *device, u32 flags);
extern void vfio_unregister_device(struct vfio_device *device);
+extern bool vfio_device_in_container(struct vfio_device *device);

/**
* struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
--
2.25.1

2021-09-19 15:39:25

by Yi Liu

Subject: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

Under the /dev/iommu model, iommufd provides the interface for I/O page
table management such as dma map/unmap. However, it cannot work
independently since the device is still owned by the device-passthrough
frameworks (VFIO, vDPA, etc.) and vice versa. A device-passthrough framework
should build a connection between its device and the iommufd to delegate
I/O page table management to iommufd.

This patch introduces iommufd_[un]bind_device() helpers for the device-
passthrough framework to build such a connection. The helper functions then
invoke iommu core (iommu_device_init/exit_user_dma()) to establish/exit the
security context for the bound device. Each successfully bound device is
internally tracked by an iommufd_device object. This object is also returned
to the caller for subsequent attaching operations on the device.

The caller should pass a user-provided cookie to mark the device in the
iommufd. Later this cookie will be used to represent the device in iommufd
uAPI, e.g. when querying device capabilities or handling per-device I/O
page faults. One alternative is to have iommufd allocate a device label
and return it to the user. Either way works, but the cookie is slightly
preferred per earlier discussion as it may allow the user to inject faults
slightly faster without ID->vRID lookup.
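
The duplicate check implied by the cookie scheme (neither a device nor a
cookie may be registered twice in one iommufd) can be sketched with a small
fixed-size table. This is a userspace toy, not the xarray-based kernel
code; the toy_* names, the table size and the -ENOSPC case are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAX_DEVS 8

struct toy_idev {
	const void *dev;      /* identity of the physical device */
	uint64_t dev_cookie;  /* user-provided label */
};

struct toy_ictx {
	struct toy_idev *slots[TOY_MAX_DEVS];
};

/*
 * Bind @dev with @cookie. On success stores the new object in *out and
 * returns 0; on failure returns a negative errno and leaves *out untouched.
 */
int toy_bind(struct toy_ictx *ictx, const void *dev, uint64_t cookie,
	     struct toy_idev **out)
{
	struct toy_idev *idev;
	int i, free_slot = -1;

	assert(ictx && dev && out);

	/* Reject a device or a cookie that is already registered. */
	for (i = 0; i < TOY_MAX_DEVS; i++) {
		struct toy_idev *cur = ictx->slots[i];

		if (!cur) {
			if (free_slot < 0)
				free_slot = i;
			continue;
		}
		if (cur->dev == dev || cur->dev_cookie == cookie)
			return -EBUSY;
	}
	if (free_slot < 0)
		return -ENOSPC;

	idev = calloc(1, sizeof(*idev));
	if (!idev)
		return -ENOMEM;

	idev->dev = dev;
	idev->dev_cookie = cookie;
	ictx->slots[free_slot] = idev;
	*out = idev;
	return 0;
}

/* Remove @idev from the table and free it. */
void toy_unbind(struct toy_ictx *ictx, struct toy_idev *idev)
{
	int i;

	assert(ictx && idev);
	for (i = 0; i < TOY_MAX_DEVS; i++)
		if (ictx->slots[i] == idev)
			ictx->slots[i] = NULL;
	free(idev);
}
```

Returning the errno separately from the object pointer keeps the failure
paths free of any pointer that must not be dereferenced or freed.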

iommufd_[un]bind_device() functions are only used for physical devices. Other
variants will be introduced in the future, e.g.:

- iommu_[un]bind_device_pasid() for mdev/subdev which requires pasid granular
DMA isolation;
- iommu_[un]bind_sw_mdev() for sw mdev which relies on software measures
instead of iommu to isolate DMA;

Signed-off-by: Liu Yi L <[email protected]>
---
drivers/iommu/iommufd/iommufd.c | 160 +++++++++++++++++++++++++++++++-
include/linux/iommufd.h | 38 ++++++++
2 files changed, 196 insertions(+), 2 deletions(-)
create mode 100644 include/linux/iommufd.h

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 710b7e62988b..e16ca21e4534 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -16,10 +16,30 @@
#include <linux/miscdevice.h>
#include <linux/mutex.h>
#include <linux/iommu.h>
+#include <linux/iommufd.h>
+#include <linux/xarray.h>
+#include <asm-generic/bug.h>

/* Per iommufd */
struct iommufd_ctx {
refcount_t refs;
+ struct mutex lock;
+ struct xarray device_xa; /* xarray of bound devices */
+};
+
+/*
+ * An iommufd_device object represents the binding relationship
+ * between iommufd and device. It is created for each successful
+ * binding request from a device driver. The bound device must be
+ * a physical device so far. Subdevices will be supported later
+ * (with additional PASID information). A user-assigned cookie
+ * is also recorded to mark the device in the /dev/iommu uAPI.
+ */
+struct iommufd_device {
+ unsigned int id;
+ struct iommufd_ctx *ictx;
+ struct device *dev; /* always be the physical device */
+ u64 dev_cookie;
};

static int iommufd_fops_open(struct inode *inode, struct file *filep)
@@ -32,15 +52,58 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)
return -ENOMEM;

refcount_set(&ictx->refs, 1);
+ mutex_init(&ictx->lock);
+ xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
filep->private_data = ictx;

return ret;
}

+static void iommufd_ctx_get(struct iommufd_ctx *ictx)
+{
+ refcount_inc(&ictx->refs);
+}
+
+static const struct file_operations iommufd_fops;
+
+/**
+ * iommufd_ctx_fdget - Acquires a reference to the internal iommufd context.
+ * @fd: [in] iommufd file descriptor.
+ *
+ * Returns a pointer to the iommufd context, otherwise NULL;
+ *
+ */
+static struct iommufd_ctx *iommufd_ctx_fdget(int fd)
+{
+ struct fd f = fdget(fd);
+ struct file *file = f.file;
+ struct iommufd_ctx *ictx = NULL;
+
+ if (!file)
+ return NULL;
+
+ /* not an iommufd: drop the fd reference taken by fdget() */
+ if (file->f_op != &iommufd_fops)
+ goto out;
+ ictx = file->private_data;
+ if (ictx)
+ iommufd_ctx_get(ictx);
+out:
+ fdput(f);
+ return ictx;
+}
+
+/**
+ * iommufd_ctx_put - Releases a reference to the internal iommufd context.
+ * @ictx: [in] Pointer to iommufd context.
+ *
+ */
static void iommufd_ctx_put(struct iommufd_ctx *ictx)
{
- if (refcount_dec_and_test(&ictx->refs))
- kfree(ictx);
+ if (!refcount_dec_and_test(&ictx->refs))
+ return;
+
+ WARN_ON(!xa_empty(&ictx->device_xa));
+ kfree(ictx);
}

static int iommufd_fops_release(struct inode *inode, struct file *filep)
@@ -86,6 +149,99 @@ static struct miscdevice iommu_misc_dev = {
.mode = 0666,
};

+/**
+ * iommufd_bind_device - Bind a physical device marked by a device
+ * cookie to an iommu fd.
+ * @fd: [in] iommufd file descriptor.
+ * @dev: [in] Pointer to a physical device struct.
+ * @dev_cookie: [in] A cookie to mark the device in /dev/iommu uAPI.
+ *
+ * A successful bind establishes a security context for the device
+ * and returns struct iommufd_device pointer. Otherwise returns
+ * error pointer.
+ *
+ */
+struct iommufd_device *iommufd_bind_device(int fd, struct device *dev,
+ u64 dev_cookie)
+{
+ struct iommufd_ctx *ictx;
+ struct iommufd_device *idev;
+ unsigned long index;
+ unsigned int id;
+ int ret;
+
+ ictx = iommufd_ctx_fdget(fd);
+ if (!ictx)
+ return ERR_PTR(-EINVAL);
+
+ mutex_lock(&ictx->lock);
+
+ /* check duplicate registration */
+ xa_for_each(&ictx->device_xa, index, idev) {
+ if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+ }
+
+ idev = kzalloc(sizeof(*idev), GFP_KERNEL);
+ if (!idev) {
+ ret = -ENOMEM;
+ goto out_unlock;
+ }
+
+ /* Establish the security context */
+ ret = iommu_device_init_user_dma(dev, (unsigned long)ictx);
+ if (ret)
+ goto out_free;
+
+ ret = xa_alloc(&ictx->device_xa, &id, idev,
+ XA_LIMIT(IOMMUFD_DEVID_MIN, IOMMUFD_DEVID_MAX),
+ GFP_KERNEL);
+ if (ret) {
+ /* keep idev valid: it is freed on the unwind path */
+ goto out_user_dma;
+ }
+
+ idev->ictx = ictx;
+ idev->dev = dev;
+ idev->dev_cookie = dev_cookie;
+ idev->id = id;
+ mutex_unlock(&ictx->lock);
+
+ return idev;
+out_user_dma:
+ iommu_device_exit_user_dma(dev);
+out_free:
+ kfree(idev);
+out_unlock:
+ mutex_unlock(&ictx->lock);
+ iommufd_ctx_put(ictx);
+
+ return ERR_PTR(ret);
+}
+EXPORT_SYMBOL_GPL(iommufd_bind_device);
+
+/**
+ * iommufd_unbind_device - Unbind a physical device from iommufd
+ *
+ * @idev: [in] Pointer to the internal iommufd_device struct.
+ *
+ */
+void iommufd_unbind_device(struct iommufd_device *idev)
+{
+ struct iommufd_ctx *ictx = idev->ictx;
+
+ mutex_lock(&ictx->lock);
+ xa_erase(&ictx->device_xa, idev->id);
+ mutex_unlock(&ictx->lock);
+ /* Exit the security context */
+ iommu_device_exit_user_dma(idev->dev);
+ kfree(idev);
+ iommufd_ctx_put(ictx);
+}
+EXPORT_SYMBOL_GPL(iommufd_unbind_device);
+
static int __init iommufd_init(void)
{
int ret;
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
new file mode 100644
index 000000000000..1603a13937e9
--- /dev/null
+++ b/include/linux/iommufd.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * IOMMUFD API definition
+ *
+ * Copyright (C) 2021 Intel Corporation
+ *
+ * Author: Liu Yi L <[email protected]>
+ */
+#ifndef __LINUX_IOMMUFD_H
+#define __LINUX_IOMMUFD_H
+
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/device.h>
+
+#define IOMMUFD_DEVID_MAX ((unsigned int)(0x7FFFFFFF))
+#define IOMMUFD_DEVID_MIN 0
+
+struct iommufd_device;
+
+#if IS_ENABLED(CONFIG_IOMMUFD)
+struct iommufd_device *
+iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie);
+void iommufd_unbind_device(struct iommufd_device *idev);
+
+#else /* !CONFIG_IOMMUFD */
+static inline struct iommufd_device *
+iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie)
+{
+ return ERR_PTR(-ENODEV);
+}
+
+static inline void iommufd_unbind_device(struct iommufd_device *idev)
+{
+}
+#endif /* CONFIG_IOMMUFD */
+#endif /* __LINUX_IOMMUFD_H */
--
2.25.1

2021-09-19 15:39:45

by Yi Liu

Subject: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

From: Lu Baolu <[email protected]>

This extends iommu core to manage security context for passthrough
devices. Please bear with a long explanation of how we reached this design
instead of managing it solely in iommufd like what vfio does today.

Devices which cannot be isolated from each other are organized into an
iommu group. When a device is assigned to the user space, the entire
group must be put in a security context so that user-initiated DMAs via
the assigned device cannot harm the rest of the system. No user access
should be granted on a device before the security context is established
for the group which the device belongs to.

Managing the security context must meet the criteria below:

1) The group is viable for user-initiated DMAs. This implies that the
devices in the group must be either bound to a device-passthrough
framework, or driver-less, or bound to a driver which is known to be safe
(does not do DMA).

2) The security context should only allow DMA to the user's memory and
devices in this group;

3) After the security context is established for the group, the group
viability must be continuously monitored before the user relinquishes
all devices belonging to the group. The viability might be broken e.g.
when a driver-less device is later bound to a driver which does DMA.

4) The security context should not be destroyed before user access
permission is withdrawn.
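
Criterion 1 boils down to a per-group predicate over the driver binding of
each device. A minimal sketch, assuming a made-up toy_binding
classification that the kernel does not expose in this form:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of how each device in a group is bound. */
enum toy_binding {
	TOY_DRIVERLESS,    /* no driver bound */
	TOY_PASSTHROUGH,   /* bound to a device-passthrough framework */
	TOY_SAFE_DRIVER,   /* bound to a driver known not to do DMA */
	TOY_OTHER_DRIVER,  /* bound to an ordinary driver which may do DMA */
};

/* A group is viable only if no member is owned by a DMA-capable driver. */
bool toy_group_viable(const enum toy_binding *devs, size_t n)
{
	size_t i;

	assert(devs || n == 0);
	for (i = 0; i < n; i++)
		if (devs[i] == TOY_OTHER_DRIVER)
			return false;
	return true;
}
```

Criterion 3 then amounts to re-evaluating this predicate whenever a driver
is bound to or unbound from any device in the group.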

Existing vfio introduces explicit container/group semantics in its uAPI
to meet the above requirements. A single security context (iommu domain)
is created per container. Attaching a group to a container moves the entire
group into the associated security context, and vice versa. The user can
open the device only after group attach. A group can be detached only
after all devices in the group are closed. Group viability is monitored
by listening to iommu group events.

Unlike vfio, iommufd adopts a device-centric design with all group
logistics hidden behind the fd. Binding a device to iommufd serves
as the contract to get security context established (and vice versa
for unbinding). One additional requirement in iommufd is to manage the
switch between multiple security contexts due to decoupled bind/attach:

1) Open a device in "/dev/vfio/devices" with user access blocked;

2) Bind the device to an iommufd with an initial security context
(an empty iommu domain which blocks dma) established for its
group, with user access unblocked;

3) Attach the device to a user-specified ioasid (shared by all devices
attached to this ioasid). Before attaching, the device should be first
detached from the initial context;

4) Detach the device from the ioasid and switch it back to the initial
security context;

5) Unbind the device from the iommufd, back to access blocked state and
move its group out of the initial security context if it's the last
unbound device in the group;

(multiple attach/detach could happen between 2 and 5).
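
The transitions in steps 1-5 can be read as a small per-device state machine. The following is a minimal userspace sketch in C, not code from this series: the dev_state enum and all model_* names are hypothetical, group handling is ignored, and it assumes that unbind is refused while the device is still attached.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-device states corresponding to steps 1-5 above. */
enum dev_state {
	DEV_OPENED_BLOCKED,	/* step 1 (and after step 5): no user access */
	DEV_BOUND_INITIAL,	/* steps 2 and 4: initial, DMA-blocking context */
	DEV_ATTACHED_IOASID,	/* step 3: attached to a user-specified ioasid */
};

struct model_dev {
	enum dev_state state;
	int ioasid;		/* meaningful only in DEV_ATTACHED_IOASID */
};

/* User access is unblocked only once the device is bound. */
static bool model_user_access_allowed(const struct model_dev *d)
{
	return d->state != DEV_OPENED_BLOCKED;
}

static int model_bind(struct model_dev *d)
{
	if (d->state != DEV_OPENED_BLOCKED)
		return -EBUSY;
	d->state = DEV_BOUND_INITIAL;
	return 0;
}

static int model_attach(struct model_dev *d, int ioasid)
{
	if (d->state == DEV_OPENED_BLOCKED)
		return -EINVAL;		/* must bind first */
	if (d->state == DEV_ATTACHED_IOASID)
		return -EBUSY;		/* one ioasid per device in this RFC */
	d->ioasid = ioasid;
	d->state = DEV_ATTACHED_IOASID;
	return 0;
}

static int model_detach(struct model_dev *d)
{
	if (d->state != DEV_ATTACHED_IOASID)
		return -EINVAL;
	/* Back to the initial context, never to the default domain. */
	d->state = DEV_BOUND_INITIAL;
	return 0;
}

static int model_unbind(struct model_dev *d)
{
	if (d->state != DEV_BOUND_INITIAL)
		return -EBUSY;		/* assumption: detach is required first */
	d->state = DEV_OPENED_BLOCKED;
	return 0;
}
```

The point of the model is that detach always lands in DEV_BOUND_INITIAL, so at no time between bind and unbind is the device in a context that permits DMA outside the user's control.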

However, the existing iommu core has a problem with the above transition.
Detach in step 3/4 makes the device/group re-attached to the default domain
automatically, which opens the door for user-initiated DMAs to attack
the rest of the system. The existing vfio doesn't have this problem as
it combines 2/3 in one step (so does 4/5).

Fixing this problem requires the iommu core to also participate in the
security context management. Following this direction we also move group
viability check into the iommu core, which allows iommufd to stay fully
device-centric w/o keeping any group knowledge (combined with the
extension to iommu_at[de]tach_device() in a later patch).

Basically two new interfaces are provided:

    int iommu_device_init_user_dma(struct device *dev,
                                   unsigned long owner);
    void iommu_device_exit_user_dma(struct device *dev);

iommufd calls them respectively when handling device binding/unbinding
requests.

The init_user_dma() for the 1st device in a group marks the entire group
for user-dma and establishes the initial security context (dma blocked)
according to the aforementioned criteria. As long as the group is marked
for user-dma, auto-reattaching to the default domain is disabled. Instead,
upon detaching, the group is moved back to the initial security context.

The caller also provides an owner id to mark the ownership, so that an
inadvertent attempt by another caller on the same device can be caught.
In this RFC iommufd will use the fd context pointer as the owner id.

The exit_user_dma() for the last device in the group clears the user-dma
mark and moves the group back to the default domain.
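
The owner-id and reference-counting behavior described above can be modeled in userspace as follows. This is a simplified sketch rather than the kernel code: struct model_group and the model_* functions are hypothetical names, locking and the viability check are omitted, and a NULL domain pointer stands in for the initial DMA-blocking context.

```c
#include <errno.h>
#include <stddef.h>

struct model_domain {
	int unused;
};

/* Hypothetical, stripped-down view of the fields the patch relies on. */
struct model_group {
	struct model_domain *default_domain;
	struct model_domain *domain;	/* NULL models "DMA blocked" */
	unsigned long owner_id;		/* 0 means not marked for user-dma */
	unsigned int owner_cnt;
};

static int model_init_user_dma(struct model_group *g, unsigned long owner)
{
	if (!owner)
		return -ENODEV;

	if (g->owner_id) {
		/* Already marked: only the same owner may take another ref. */
		if (g->owner_id != owner)
			return -EBUSY;
		g->owner_cnt++;
		return 0;
	}

	/* Refuse if some non-default domain is currently attached. */
	if (g->domain && g->domain != g->default_domain)
		return -EBUSY;

	g->owner_id = owner;
	g->owner_cnt = 1;
	g->domain = NULL;	/* leave the default domain: DMA is blocked */
	return 0;
}

static void model_exit_user_dma(struct model_group *g)
{
	if (!g->owner_id)
		return;

	if (--g->owner_cnt == 0) {
		/* Last device unbound: clear the mark, restore the default. */
		g->owner_id = 0;
		g->domain = g->default_domain;
	}
}
```

With two devices of one group bound by the same owner, the first exit leaves the group marked and DMA blocked; only the second exit clears the mark and returns the group to its default domain. A bind attempt with a different owner id in between fails with -EBUSY.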

Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Lu Baolu <[email protected]>
---
drivers/iommu/iommu.c | 145 +++++++++++++++++++++++++++++++++++++++++-
include/linux/iommu.h | 12 ++++
2 files changed, 154 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 5ea3a007fd7c..bffd84e978fb 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -45,6 +45,8 @@ struct iommu_group {
struct iommu_domain *default_domain;
struct iommu_domain *domain;
struct list_head entry;
+ unsigned long user_dma_owner_id;
+ refcount_t owner_cnt;
};

struct group_device {
@@ -86,6 +88,7 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
static ssize_t iommu_group_store_type(struct iommu_group *group,
const char *buf, size_t count);
+static bool iommu_group_user_dma_viable(struct iommu_group *group);

#define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \
struct iommu_group_attribute iommu_group_attr_##_name = \
@@ -275,7 +278,11 @@ int iommu_probe_device(struct device *dev)
*/
iommu_alloc_default_domain(group, dev);

- if (group->default_domain) {
+ /*
+ * If any device in the group has been initialized for user dma,
+ * avoid attaching the default domain.
+ */
+ if (group->default_domain && !group->user_dma_owner_id) {
ret = __iommu_attach_device(group->default_domain, dev);
if (ret) {
iommu_group_put(group);
@@ -1664,6 +1671,17 @@ static int iommu_bus_notifier(struct notifier_block *nb,
group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
break;
case BUS_NOTIFY_BOUND_DRIVER:
+ /*
+ * FIXME: Alternatively the attached drivers could generically
+ * indicate to the iommu layer that they are safe for keeping
+ * the iommu group user viable by calling some function around
+ * probe(). We could eliminate this gross BUG_ON() by denying
+ * probe to non-iommu-safe driver.
+ */
+ mutex_lock(&group->mutex);
+ if (group->user_dma_owner_id)
+ BUG_ON(!iommu_group_user_dma_viable(group));
+ mutex_unlock(&group->mutex);
group_action = IOMMU_GROUP_NOTIFY_BOUND_DRIVER;
break;
case BUS_NOTIFY_UNBIND_DRIVER:
@@ -2304,7 +2322,11 @@ static int __iommu_attach_group(struct iommu_domain *domain,
{
int ret;

- if (group->default_domain && group->domain != group->default_domain)
+ /*
+ * group->domain could be NULL when a domain is detached from the
+ * group but the default_domain is not re-attached.
+ */
+ if (group->domain && group->domain != group->default_domain)
return -EBUSY;

ret = __iommu_group_for_each_dev(group, domain,
@@ -2341,7 +2363,11 @@ static void __iommu_detach_group(struct iommu_domain *domain,
{
int ret;

- if (!group->default_domain) {
+ /*
+ * If any device in the group has been initialized for user dma,
+ * avoid re-attaching the default domain.
+ */
+ if (!group->default_domain || group->user_dma_owner_id) {
__iommu_group_for_each_dev(group, domain,
iommu_group_do_detach_device);
group->domain = NULL;
@@ -3276,3 +3302,116 @@ int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *dat
return ops->device_info(dev, attr, data);
}
EXPORT_SYMBOL_GPL(iommu_device_get_info);
+
+/*
+ * IOMMU core interfaces for iommufd.
+ */
+
+/*
+ * FIXME: We currently simply follow vfio policy to maintain the group's
+ * viability to user. Eventually, we should avoid below hard-coded list
+ * by letting drivers indicate to the iommu layer that they are safe for
+ * keeping the iommu group's user viability.
+ */
+static const char * const iommu_driver_allowed[] = {
+ "vfio-pci",
+ "pci-stub"
+};
+
+/*
+ * An iommu group is viable for use by userspace if all devices are in
+ * one of the following states:
+ * - driver-less
+ * - bound to an allowed driver
+ * - a PCI interconnect device
+ */
+static int device_user_dma_viable(struct device *dev, void *data)
+{
+ struct device_driver *drv = READ_ONCE(dev->driver);
+
+ if (!drv)
+ return 0;
+
+ if (dev_is_pci(dev)) {
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+ return 0;
+ }
+
+ return match_string(iommu_driver_allowed,
+ ARRAY_SIZE(iommu_driver_allowed),
+ drv->name) < 0;
+}
+
+static bool iommu_group_user_dma_viable(struct iommu_group *group)
+{
+ return !__iommu_group_for_each_dev(group, NULL, device_user_dma_viable);
+}
+
+static int iommu_group_init_user_dma(struct iommu_group *group,
+ unsigned long owner)
+{
+ if (group->user_dma_owner_id) {
+ if (group->user_dma_owner_id != owner)
+ return -EBUSY;
+
+ refcount_inc(&group->owner_cnt);
+ return 0;
+ }
+
+ if (group->domain && group->domain != group->default_domain)
+ return -EBUSY;
+
+ if (!iommu_group_user_dma_viable(group))
+ return -EINVAL;
+
+ group->user_dma_owner_id = owner;
+ refcount_set(&group->owner_cnt, 1);
+
+ /* default domain is unsafe for user-initiated dma */
+ if (group->domain == group->default_domain)
+ __iommu_detach_group(group->default_domain, group);
+
+ return 0;
+}
+
+int iommu_device_init_user_dma(struct device *dev, unsigned long owner)
+{
+ struct iommu_group *group = iommu_group_get(dev);
+ int ret;
+
+ if (!group || !owner)
+ return -ENODEV;
+
+ mutex_lock(&group->mutex);
+ ret = iommu_group_init_user_dma(group, owner);
+ mutex_unlock(&group->mutex);
+ iommu_group_put(group);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_device_init_user_dma);
+
+static void iommu_group_exit_user_dma(struct iommu_group *group)
+{
+ if (refcount_dec_and_test(&group->owner_cnt)) {
+ group->user_dma_owner_id = 0;
+ if (group->default_domain)
+ __iommu_attach_group(group->default_domain, group);
+ }
+}
+
+void iommu_device_exit_user_dma(struct device *dev)
+{
+ struct iommu_group *group = iommu_group_get(dev);
+
+ if (WARN_ON(!group || !group->user_dma_owner_id))
+ return;
+
+ mutex_lock(&group->mutex);
+ iommu_group_exit_user_dma(group);
+ mutex_unlock(&group->mutex);
+ iommu_group_put(group);
+}
+EXPORT_SYMBOL_GPL(iommu_device_exit_user_dma);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 52a6d33c82dc..943de6897f56 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -617,6 +617,9 @@ u32 iommu_sva_get_pasid(struct iommu_sva *handle);

int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *data);

+int iommu_device_init_user_dma(struct device *dev, unsigned long owner);
+void iommu_device_exit_user_dma(struct device *dev);
+
#else /* CONFIG_IOMMU_API */

struct iommu_ops {};
@@ -1018,6 +1021,15 @@ static inline int iommu_device_get_info(struct device *dev,
{
return -ENODEV;
}
+
+static inline int iommu_device_init_user_dma(struct device *dev, unsigned long owner)
+{
+ return -ENODEV;
+}
+
+static inline void iommu_device_exit_user_dma(struct device *dev)
+{
+}
#endif /* CONFIG_IOMMU_API */

/**
--
2.25.1

2021-09-19 15:41:22

by Yi Liu

Subject: [RFC 20/20] Doc: Add documentation for /dev/iommu

Document the /dev/iommu framework for users.

Open:
Do we want to document /dev/iommu in Documentation/userspace-api/iommu.rst?
The existing iommu.rst covers the vSVA interfaces; honestly, that doc may
need to be rewritten entirely.

Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Liu Yi L <[email protected]>
---
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/iommufd.rst | 183 ++++++++++++++++++++++++
2 files changed, 184 insertions(+)
create mode 100644 Documentation/userspace-api/iommufd.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 0b5eefed027e..54df5a278023 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -25,6 +25,7 @@ place where this information is gathered.
ebpf/index
ioctl/index
iommu
+ iommufd
media/index
sysfs-platform_profile

diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
new file mode 100644
index 000000000000..abffbb47dc02
--- /dev/null
+++ b/Documentation/userspace-api/iommufd.rst
@@ -0,0 +1,183 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _iommufd:
+
+===================
+IOMMU Userspace API
+===================
+
+Direct device access from userspace has been a critical feature in
+high performance computing and virtualization usages. Linux now
+includes multiple device-passthrough frameworks (e.g. VFIO and vDPA)
+to manage secure device access from the userspace. One critical
+task of those frameworks is to put the assigned device in a secure,
+IOMMU-protected context so the device is prevented from doing harm
+to the rest of the system.
+
+Currently those frameworks implement their own logic for managing
+I/O page tables to isolate user-initiated DMAs. This doesn't scale
+to support many new IOMMU features, such as PASID-granular DMA
+remapping, nested translation, I/O page fault, IOMMU dirty bit, etc.
+
+The /dev/iommu framework provides a unified interface for managing
+I/O page tables for passthrough devices. Existing passthrough
+frameworks are expected to use this interface instead of continuing
+their ad-hoc implementations.
+
+IOMMUFDs, IOASIDs, Devices and Groups
+-------------------------------------
+
+The core concepts in /dev/iommu are IOMMUFDs and IOASIDs. IOMMUFD (by
+opening /dev/iommu) is the container holding multiple I/O address
+spaces for a user, while IOASID is the fd-local software handle
+representing an I/O address space and associated with a single I/O
+page table. The user manages those address spaces via fd operations,
+e.g. by using vfio type1v2 mapping semantics to manage respective
+I/O page tables.
+
+IOASID is comparable to the container concept in VFIO. The latter
+is also associated with a single I/O address space. A main difference
+between them is that multiple IOASIDs in the same IOMMUFD can be
+nested together (not supported yet) to allow centralized accounting
+of locked pages, while multiple containers are disconnected thus
+duplicated accounting is incurred. Typically one IOMMUFD is
+sufficient for all intended IOMMU usages for a user.
+
+An I/O address space takes effect in the IOMMU only after it is
+attached by a device. One I/O address space can be attached by
+multiple devices. One device can be only attached to a single I/O
+address space at this point (on par with current vfio behavior).
+
+A device must be bound to an iommufd before the attach operation can
+be conducted. The binding operation builds the connection between
+the devicefd (opened via device-passthrough framework) and IOMMUFD.
+An IOMMU-protected security context is established when the binding
+operation is completed. The passthrough framework must block user
+access to the assigned device until bind() returns success.
+
+The entire /dev/iommu framework adopts a device-centric model w/o
+carrying any container/group legacy as current vfio does. However
+the group is the minimum granularity that must be used to ensure
+secure user access (refer to vfio.rst). This framework relies on
+the IOMMU core layer to map device-centric model into group-granular
+isolation.
+
+Managing I/O Address Spaces
+---------------------------
+
+When creating an I/O address space (by allocating IOASID), the user
+must specify the type of underlying I/O page table. Currently only
+one type (kernel-managed) is supported. In the future other types
+will be introduced, e.g. to support user-managed I/O page table or
+a shared I/O page table which is managed by another kernel sub-
+system (mm, ept, etc.). Kernel-managed I/O page table is currently
+managed via vfio type1v2 equivalent mapping semantics.
+
+The user also needs to specify the format of the I/O page table
+when allocating an IOASID. The format must be compatible with the
+attached devices (or more specifically with the IOMMU which serves
+the DMA from the attached devices). The user can query the device IOMMU
+format via IOMMUFD once a device is successfully bound. Attaching a
+device to an IOASID with an incompatible format is simply rejected.
+
+No-snoop DMA is not supported yet. This implies that an
+IOASID must be created in an enforce-snoop format and only devices
+which can be forced to snoop cache by the IOMMU are allowed to be
+attached to the IOASID. The user should check uAPI extensions and get
+device info via IOMMUFD to handle this restriction.
+
+Usage Example
+-------------
+
+Assume the user wants to access PCI device 0000:06:0d.0, which is
+exposed under the new /dev/vfio/devices directory by VFIO::
+
+ /* Open device-centric interface and /dev/iommu interface */
+ device_fd = open("/dev/vfio/devices/0000:06:0d.0", O_RDWR);
+ iommu_fd = open("/dev/iommu", O_RDWR);
+
+ /* Bind device to IOMMUFD */
+ bind_data = { .iommu_fd = iommu_fd, .dev_cookie = cookie };
+ ioctl(device_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind_data);
+
+ /* Query per-device IOMMU capability/format */
+ info = { .dev_cookie = cookie, };
+ ioctl(iommu_fd, IOMMU_DEVICE_GET_INFO, &info);
+
+ if (!(info.flags & IOMMU_DEVICE_INFO_ENFORCE_SNOOP)) {
+ if (!ioctl(iommu_fd, IOMMU_CHECK_EXTENSION,
+ EXT_DMA_NO_SNOOP))
+ /* No support of no-snoop DMA */
+ }
+
+ if (!ioctl(iommu_fd, IOMMU_CHECK_EXTENSION, EXT_MAP_TYPE1V2))
+ /* No support of vfio type1v2 mapping semantics */
+
+ /* Decide IOASID alloc fields based on info */
+ alloc_data = { .type = IOMMU_IOASID_TYPE_KERNEL,
+ .flags = IOMMU_IOASID_ENFORCE_SNOOP,
+ .addr_width = info.addr_width, };
+
+ /* Allocate IOASID */
+ gpa_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
+
+ /* Attach device to an IOASID */
+ at_data = { .iommu_fd = iommu_fd, .ioasid = gpa_ioasid };
+ ioctl(device_fd, VFIO_DEVICE_ATTACH_IOASID, &at_data);
+
+ /* Setup GPA mapping [0 - 1GB] */
+ dma_map = {
+ .ioasid = gpa_ioasid,
+ .data = {
+ .flags = R/W, /* permission */
+ .iova = 0, /* GPA */
+ .vaddr = 0x40000000, /* HVA */
+ .size = 1GB,
+ },
+ };
+ ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);
+
+ /* DMA */
+
+ /* Unmap GPA mapping [0 - 1GB] */
+ dma_unmap = {
+ .ioasid = gpa_ioasid,
+ .data = {
+ .iova = 0, /* GPA */
+ .size = 1GB,
+ },
+ };
+ ioctl(iommu_fd, IOMMU_UNMAP_DMA, &dma_unmap);
+
+ /* Detach device from an IOASID */
+ dt_data = { .iommu_fd = iommu_fd, .ioasid = gpa_ioasid };
+ ioctl(device_fd, VFIO_DEVICE_DETACH_IOASID, &dt_data);
+
+ /* Free IOASID */
+ ioctl(iommu_fd, IOMMU_IOASID_FREE, gpa_ioasid);
+
+ close(device_fd);
+ close(iommu_fd);
+
+API for device-passthrough frameworks
+-------------------------------------
+
+iommufd binding and IOASID attach/detach are initiated via the device-
+passthrough framework uAPI.
+
+When a binding operation is requested by the user, the passthrough
+framework should call iommufd_bind_device(). When the device fd is
+closed by the user, iommufd_unbind_device() should be called
+automatically::
+
+ struct iommufd_device *
+ iommufd_bind_device(int fd, struct device *dev,
+ u64 dev_cookie);
+ void iommufd_unbind_device(struct iommufd_device *idev);
+
+IOASID attach/detach operations are per iommufd_device which is
+returned by iommufd_bind_device()::
+
+ int iommufd_device_attach_ioasid(struct iommufd_device *idev,
+ int ioasid);
+ void iommufd_device_detach_ioasid(struct iommufd_device *idev,
+ int ioasid);
--
2.25.1
