2021-09-19 14:44:49

by Yi Liu

Subject: [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management

Linux now includes multiple device-passthrough frameworks (e.g. VFIO and
vDPA) to manage secure device access from userspace. One critical task
of those frameworks is to put the assigned device in a secure, IOMMU-
protected context so user-initiated DMAs are prevented from doing harm to
the rest of the system.

Currently those frameworks implement their own logic for managing I/O page
tables to isolate user-initiated DMAs. This doesn't scale to support many
new IOMMU features, such as PASID-granular DMA remapping, nested translation,
I/O page fault, IOMMU dirty bit, etc.

/dev/iommu is introduced as a unified interface for managing I/O address
spaces and DMA isolation for passthrough devices. It originated from the
upstream discussion of the vSVA enabling work [1].

This RFC aims to provide a basic skeleton for the above proposal, w/o adding
any new feature beyond what vfio type1 provides today. For an overview of
future extensions, please refer to the full design proposal [2].

The core concepts in /dev/iommu are iommufd and ioasid. iommufd (by opening
/dev/iommu) is the container holding multiple I/O address spaces, while
ioasid is the fd-local software handle representing an I/O address space and
associated with a single I/O page table. User manages those address spaces
through fd operations, e.g. by using vfio type1v2 mapping semantics to manage
respective I/O page tables.

An I/O address space takes effect in the iommu only after a device is
attached to it. Multiple devices can be attached to one I/O address space,
but one device can be attached to only a single I/O address space in this
RFC, to match vfio type1 behavior as the starting point.

A device must be bound to an iommufd before the attach operation can be
conducted. The binding operation builds the connection between the devicefd
(opened via the device-passthrough framework) and the iommufd. Most
importantly, the entire /dev/iommu framework adopts a device-centric model
w/o carrying any container/group legacy as current vfio does. This requires
the binding operation to also establish a security context which prevents
the bound device from accessing the rest of the system, as the contract for
vfio to grant user access to the assigned device. A detailed explanation of
this aspect can be found in patch 06.

Last, the format of an I/O page table must be compatible with the attached
devices (or more specifically with the IOMMU which serves the DMA from the
attached devices). The user is responsible for specifying the format when
allocating an IOASID, according to the one or multiple devices which will be
attached right after. The device IOMMU format can be queried via the iommufd
once a device is successfully bound to it. Attaching a device to an IOASID
with an incompatible format is simply rejected.

The skeleton is mostly implemented in iommufd, except that bind_iommufd/
ioasid_attach operations are initiated via device-passthrough framework
specific uAPIs. This RFC only changes vfio to work with iommufd. vDPA
support can be added at a later stage.

Basically iommufd provides the following uAPIs and helper functions:

- IOMMU_DEVICE_GET_INFO, for querying per-device iommu capability/format;
- IOMMU_IOASID_ALLOC/FREE, as the name stands;
- IOMMU_[UN]MAP_DMA, providing vfio type1v2 semantics for managing a
specific I/O page table;
- helper functions for vfio to bind_iommufd/attach_ioasid with devices;

vfio extensions include:
- A new interface for user to open a device w/o using container/group uAPI;
- VFIO_DEVICE_BIND_IOMMUFD, for binding a vfio device to an iommufd;
* unbind is automatically done when devicefd is closed;
- VFIO_DEVICE_[DE]ATTACH_IOASID, for attaching/detaching a vfio device
to/from an ioasid in the specified iommufd;

[TODO in RFC v2]

We did one temporary hack in v1 by reusing vfio_iommu_type1.c to implement
IOMMU_[UN]MAP_DMA. This leads to some dirty code in patch 16/17/18. We
estimated that almost 80% of the current type1 code is related to map/unmap.
It needs non-trivial effort to either duplicate it in iommufd or make it
shared by both vfio and iommufd. We hope this hack doesn't affect the
review of the overall skeleton, since the role of this part is very clear.
Based on the received feedback we will make a clean implementation in v2.

On the userspace side, our schedule didn't allow a clean implementation in
Qemu. Instead, we wrote a simple application (similar to the example in
iommufd.rst) and verified the basic work flow (bind/unbind, alloc/free
ioasid, attach/detach, map/unmap, multi-device group, etc.). We did verify
that the I/O page table mappings were established as expected, though no
DMA was conducted. We plan to have a clean implementation in Qemu and
provide a public link for reference when v2 is sent out.

[TODO out of this RFC]

The entire /dev/iommu project involves lots of tasks. It has to grow in
a staged approach. Below is a rough list of TODO features. Most of them
can be developed in parallel after this skeleton is accepted. For more
detail please refer to the design proposal [2]:

1. Move more vfio device types to iommufd:
* device which does no-snoop DMA
* software mdev
* PPC device
* platform device

2. New vfio device type
* hardware mdev/subdev (with PASID)

3. vDPA adoption

4. User-managed I/O page table
* ioasid nesting (hardware)
* ioasid nesting (software)
* pasid virtualization
o pdev (arm/amd)
o pdev/mdev which doesn't support enqcmd (intel)
o pdev/mdev which supports enqcmd (intel)
* I/O page fault (stage-1)

5. Miscellaneous
* I/O page fault (stage-2), for on-demand paging
* IOMMU dirty bit, for hardware-assisted dirty page tracking
* shared I/O page table (mm, ept, etc.)
* vfio/vdpa shim to avoid code duplication for legacy uAPI
* hardware-assisted vIOMMU

[1] https://lore.kernel.org/linux-iommu/[email protected]/
[2] https://lore.kernel.org/kvm/BN9PR11MB5433B1E4AE5B0480369F97178C189@BN9PR11MB5433.namprd11.prod.outlook.com/

[Series Overview]
* Basic skeleton:
0001-iommu-iommufd-Add-dev-iommu-core.patch

* VFIO PCI creates device-centric interface:
0002-vfio-Add-device-class-for-dev-vfio-devices.patch
0003-vfio-Add-vfio_-un-register_device.patch
0004-iommu-Add-iommu_device_get_info-interface.patch
0005-vfio-pci-Register-device-to-dev-vfio-devices.patch

* Bind device fd with iommufd:
0006-iommu-Add-iommu_device_init-exit-_user_dma-interface.patch
0007-iommu-iommufd-Add-iommufd_-un-bind_device.patch
0008-vfio-pci-Add-VFIO_DEVICE_BIND_IOMMUFD.patch

* IOASID allocation:
0009-iommu-Add-page-size-and-address-width-attributes.patch
0010-iommu-iommufd-Add-IOMMU_DEVICE_GET_INFO.patch
0011-iommu-iommufd-Add-IOMMU_IOASID_ALLOC-FREE.patch
0012-iommu-iommufd-Add-IOMMU_CHECK_EXTENSION.patch

* IOASID [de]attach:
0013-iommu-Extend-iommu_at-de-tach_device-for-multiple-de.patch
0014-iommu-iommufd-Add-iommufd_device_-de-attach_ioasid.patch
0015-vfio-pci-Add-VFIO_DEVICE_-DE-ATTACH_IOASID.patch

* DMA (un)map:
0016-vfio-type1-Export-symbols-for-dma-un-map-code-sharin.patch
0017-iommu-iommufd-Report-iova-range-to-userspace.patch
0018-iommu-iommufd-Add-IOMMU_-UN-MAP_DMA-on-IOASID.patch

* Report the device info in vt-d driver to enable whole series:
0019-iommu-vt-d-Implement-device_info-iommu_ops-callback.patch

* Add doc:
0020-Doc-Add-documentation-for-dev-iommu.patch

Complete code can be found in:
https://github.com/luxis1999/dev-iommu/commits/dev-iommu-5.14-rfcv1

Thanks for your time!

Regards,
Yi Liu
---

Liu Yi L (15):
iommu/iommufd: Add /dev/iommu core
vfio: Add device class for /dev/vfio/devices
vfio: Add vfio_[un]register_device()
vfio/pci: Register device to /dev/vfio/devices
iommu/iommufd: Add iommufd_[un]bind_device()
vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
iommu/iommufd: Add IOMMU_CHECK_EXTENSION
iommu/iommufd: Add iommufd_device_[de]attach_ioasid()
vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID
vfio/type1: Export symbols for dma [un]map code sharing
iommu/iommufd: Report iova range to userspace
iommu/iommufd: Add IOMMU_[UN]MAP_DMA on IOASID
Doc: Add documentation for /dev/iommu

Lu Baolu (5):
iommu: Add iommu_device_get_info interface
iommu: Add iommu_device_init[exit]_user_dma interfaces
iommu: Add page size and address width attributes
iommu: Extend iommu_at[de]tach_device() for multiple devices group
iommu/vt-d: Implement device_info iommu_ops callback

Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/iommufd.rst | 183 ++++++
drivers/iommu/Kconfig | 1 +
drivers/iommu/Makefile | 1 +
drivers/iommu/intel/iommu.c | 35 +
drivers/iommu/iommu.c | 188 +++++-
drivers/iommu/iommufd/Kconfig | 11 +
drivers/iommu/iommufd/Makefile | 2 +
drivers/iommu/iommufd/iommufd.c | 832 ++++++++++++++++++++++++
drivers/vfio/pci/Kconfig | 1 +
drivers/vfio/pci/vfio_pci.c | 179 ++++-
drivers/vfio/pci/vfio_pci_private.h | 10 +
drivers/vfio/vfio.c | 366 ++++++++++-
drivers/vfio/vfio_iommu_type1.c | 246 ++++++-
include/linux/iommu.h | 35 +
include/linux/iommufd.h | 71 ++
include/linux/vfio.h | 27 +
include/uapi/linux/iommu.h | 162 +++++
include/uapi/linux/vfio.h | 56 ++
19 files changed, 2358 insertions(+), 49 deletions(-)
create mode 100644 Documentation/userspace-api/iommufd.rst
create mode 100644 drivers/iommu/iommufd/Kconfig
create mode 100644 drivers/iommu/iommufd/Makefile
create mode 100644 drivers/iommu/iommufd/iommufd.c
create mode 100644 include/linux/iommufd.h

--
2.25.1


2021-09-19 14:46:11

by Yi Liu

Subject: [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID

This patch adds an interface for userspace to attach a device to a
specified IOASID.

Note:
One device can only be attached to one IOASID in this version. This is
on par with what vfio provides today. In the future this restriction can
be relaxed when multiple I/O address spaces are supported per device.

Signed-off-by: Liu Yi L <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 82 +++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_private.h | 1 +
include/linux/iommufd.h | 1 +
include/uapi/linux/vfio.h | 26 +++++++++
4 files changed, 110 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 20006bb66430..5b1fda333122 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -557,6 +557,11 @@ static void vfio_pci_release(struct vfio_device *core_vdev)
if (vdev->videv) {
struct vfio_iommufd_device *videv = vdev->videv;

+ if (videv->ioasid != IOMMUFD_INVALID_IOASID) {
+ iommufd_device_detach_ioasid(videv->idev,
+ videv->ioasid);
+ videv->ioasid = IOMMUFD_INVALID_IOASID;
+ }
vdev->videv = NULL;
iommufd_unbind_device(videv->idev);
kfree(videv);
@@ -839,6 +844,7 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
}
videv->idev = idev;
videv->iommu_fd = bind_data.iommu_fd;
+ videv->ioasid = IOMMUFD_INVALID_IOASID;
/*
* A security context has been established. Unblock
* user access.
@@ -848,6 +854,82 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
vdev->videv = videv;
mutex_unlock(&vdev->videv_lock);

+ return 0;
+ } else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {
+ struct vfio_device_attach_ioasid attach;
+ unsigned long minsz;
+ struct vfio_iommufd_device *videv;
+ int ret = 0;
+
+ /* not allowed if the device is opened in legacy interface */
+ if (vfio_device_in_container(core_vdev))
+ return -ENOTTY;
+
+ minsz = offsetofend(struct vfio_device_attach_ioasid, ioasid);
+ if (copy_from_user(&attach, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (attach.argsz < minsz || attach.flags ||
+ attach.iommu_fd < 0 || attach.ioasid < 0)
+ return -EINVAL;
+
+ mutex_lock(&vdev->videv_lock);
+
+ videv = vdev->videv;
+ if (!videv || videv->iommu_fd != attach.iommu_fd) {
+ mutex_unlock(&vdev->videv_lock);
+ return -EINVAL;
+ }
+
+ /* Currently only allows one IOASID attach */
+ if (videv->ioasid != IOMMUFD_INVALID_IOASID) {
+ mutex_unlock(&vdev->videv_lock);
+ return -EBUSY;
+ }
+
+ ret = __pci_iommufd_device_attach_ioasid(vdev->pdev,
+ videv->idev,
+ attach.ioasid);
+ if (!ret)
+ videv->ioasid = attach.ioasid;
+ mutex_unlock(&vdev->videv_lock);
+
+ return ret;
+ } else if (cmd == VFIO_DEVICE_DETACH_IOASID) {
+ struct vfio_device_attach_ioasid attach;
+ unsigned long minsz;
+ struct vfio_iommufd_device *videv;
+
+ /* not allowed if the device is opened in legacy interface */
+ if (vfio_device_in_container(core_vdev))
+ return -ENOTTY;
+
+ minsz = offsetofend(struct vfio_device_attach_ioasid, ioasid);
+ if (copy_from_user(&attach, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (attach.argsz < minsz || attach.flags ||
+ attach.iommu_fd < 0 || attach.ioasid < 0)
+ return -EINVAL;
+
+ mutex_lock(&vdev->videv_lock);
+
+ videv = vdev->videv;
+ if (!videv || videv->iommu_fd != attach.iommu_fd) {
+ mutex_unlock(&vdev->videv_lock);
+ return -EINVAL;
+ }
+
+ if (videv->ioasid == IOMMUFD_INVALID_IOASID ||
+ videv->ioasid != attach.ioasid) {
+ mutex_unlock(&vdev->videv_lock);
+ return -EINVAL;
+ }
+
+ videv->ioasid = IOMMUFD_INVALID_IOASID;
+ iommufd_device_detach_ioasid(videv->idev, attach.ioasid);
+ mutex_unlock(&vdev->videv_lock);
+
return 0;
} else if (cmd == VFIO_DEVICE_GET_INFO) {
struct vfio_device_info info;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index bd784accac35..daa0f08ac835 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -103,6 +103,7 @@ struct vfio_pci_mmap_vma {
struct vfio_iommufd_device {
struct iommufd_device *idev;
int iommu_fd;
+ int ioasid;
};

struct vfio_pci_device {
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 01a4fe934143..36d8d2fd22bb 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -17,6 +17,7 @@

#define IOMMUFD_IOASID_MAX ((unsigned int)(0x7FFFFFFF))
#define IOMMUFD_IOASID_MIN 0
+#define IOMMUFD_INVALID_IOASID -1

#define IOMMUFD_DEVID_MAX ((unsigned int)(0x7FFFFFFF))
#define IOMMUFD_DEVID_MIN 0
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index c902abd60339..61493ab03038 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -220,6 +220,32 @@ struct vfio_device_iommu_bind_data {

#define VFIO_DEVICE_BIND_IOMMUFD _IO(VFIO_TYPE, VFIO_BASE + 19)

+/*
+ * VFIO_DEVICE_ATTACH_IOASID - _IOW(VFIO_TYPE, VFIO_BASE + 20,
+ * struct vfio_device_attach_ioasid)
+ *
+ * Attach a vfio device to the specified IOASID
+ *
+ * Multiple vfio devices can be attached to the same IOASID. One device can
+ * be attached to only one ioasid at this point.
+ *
+ * @argsz: user filled size of this data.
+ * @flags: reserved for future extension.
+ * @iommu_fd: iommufd where the ioasid comes from.
+ * @ioasid: target I/O address space.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_attach_ioasid {
+ __u32 argsz;
+ __u32 flags;
+ __s32 iommu_fd;
+ __s32 ioasid;
+};
+
+#define VFIO_DEVICE_ATTACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 20)
+#define VFIO_DEVICE_DETACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 21)
+
/**
* VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
* struct vfio_device_info)
--
2.25.1

2021-09-19 15:27:38

by Yi Liu

Subject: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices

This patch exposes the device-centric interface for vfio-pci devices. To
be compatible with existing users, vfio-pci exposes both the legacy group
interface and the device-centric interface.

As explained in the last patch, this change doesn't apply to devices which
cannot be forced to snoop cache by their upstream iommu. Such devices
are still expected to be opened via the legacy group interface.

When the device is opened via /dev/vfio/devices, vfio-pci should prevent
the user from accessing the assigned device because the device is still
attached to the default domain which may allow user-initiated DMAs to
touch arbitrary places. User access must be blocked until the device
is later bound to an iommufd (see patch 08). The binding acts as the
contract for putting the device in a security context which ensures user-
initiated DMAs via this device cannot harm the rest of the system.

This patch introduces a vdev->block_access flag for this purpose. It's set
when the device is opened via /dev/vfio/devices and cleared after binding
to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
user access should be blocked or not.

An alternative option is to use a dummy fops when the device is opened and
then switch to the real fops (replace_fops()) after binding. We would
appreciate input on which option is better.

The legacy group interface doesn't have this problem. Its uAPI requires the
user to first put the device into a security context via container/group
attaching process, before opening the device through the groupfd.

Signed-off-by: Liu Yi L <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 25 +++++++++++++++++++++++--
drivers/vfio/pci/vfio_pci_private.h | 1 +
drivers/vfio/vfio.c | 3 ++-
include/linux/vfio.h | 1 +
4 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 318864d52837..145addde983b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -572,6 +572,10 @@ static int vfio_pci_open(struct vfio_device *core_vdev)

vfio_spapr_pci_eeh_open(vdev->pdev);
vfio_pci_vf_token_user_add(vdev, 1);
+ if (!vfio_device_in_container(core_vdev))
+ atomic_set(&vdev->block_access, 1);
+ else
+ atomic_set(&vdev->block_access, 0);
}
vdev->refcnt++;
error:
@@ -1374,6 +1378,9 @@ static ssize_t vfio_pci_rw(struct vfio_pci_device *vdev, char __user *buf,
{
unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);

+ if (atomic_read(&vdev->block_access))
+ return -ENODEV;
+
if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
return -EINVAL;

@@ -1640,6 +1647,9 @@ static int vfio_pci_mmap(struct vfio_device *core_vdev, struct vm_area_struct *v
u64 phys_len, req_len, pgoff, req_start;
int ret;

+ if (atomic_read(&vdev->block_access))
+ return -ENODEV;
+
index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);

if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
@@ -1978,6 +1988,8 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
struct vfio_pci_device *vdev;
struct iommu_group *group;
int ret;
+ u32 flags;
+ bool snoop = false;

if (vfio_pci_is_denylisted(pdev))
return -EINVAL;
@@ -2046,9 +2058,18 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
vfio_pci_set_power_state(vdev, PCI_D3hot);
}

- ret = vfio_register_group_dev(&vdev->vdev);
- if (ret)
+ flags = VFIO_DEVNODE_GROUP;
+ ret = iommu_device_get_info(&pdev->dev,
+ IOMMU_DEV_INFO_FORCE_SNOOP, &snoop);
+ if (!ret && snoop)
+ flags |= VFIO_DEVNODE_NONGROUP;
+
+ ret = vfio_register_device(&vdev->vdev, flags);
+ if (ret) {
+ pr_debug("Failed to register device interface\n");
goto out_power;
+ }
+
dev_set_drvdata(&pdev->dev, vdev);
return 0;

diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 5a36272cecbf..f12012e30b53 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -143,6 +143,7 @@ struct vfio_pci_device {
struct mutex vma_lock;
struct list_head vma_list;
struct rw_semaphore memory_lock;
+ atomic_t block_access;
};

#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 1e87b25962f1..22851747e92c 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1789,10 +1789,11 @@ static int vfio_device_fops_open(struct inode *inode, struct file *filep)
return ret;
}

-static bool vfio_device_in_container(struct vfio_device *device)
+bool vfio_device_in_container(struct vfio_device *device)
{
return !!(device->group && device->group->container);
}
+EXPORT_SYMBOL_GPL(vfio_device_in_container);

static int vfio_device_fops_release(struct inode *inode, struct file *filep)
{
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 9448b751b663..fd0629acb948 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -81,6 +81,7 @@ enum vfio_iommu_notify_type {

extern int vfio_register_device(struct vfio_device *device, u32 flags);
extern void vfio_unregister_device(struct vfio_device *device);
+extern bool vfio_device_in_container(struct vfio_device *device);

/**
* struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
--
2.25.1

2021-09-19 15:39:25

by Yi Liu

Subject: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

Under the /dev/iommu model, iommufd provides the interface for I/O page
table management such as dma map/unmap. However, it cannot work
independently, since the device is still owned by the device-passthrough
frameworks (VFIO, vDPA, etc.). A device-passthrough framework should
build a connection between its device and the iommufd to delegate the
I/O page table management affairs to iommufd.

This patch introduces iommufd_[un]bind_device() helpers for the device-
passthrough framework to build such connection. The helper functions then
invoke iommu core (iommu_device_init/exit_user_dma()) to establish/exit
security context for the bound device. Each successfully bound device is
internally tracked by an iommufd_device object. This object is returned
to the caller for subsequent attaching operations on the device as well.

The caller should pass a user-provided cookie to mark the device in the
iommufd. Later this cookie will be used to represent the device in iommufd
uAPI, e.g. when querying device capabilities or handling per-device I/O
page faults. One alternative is to have iommufd allocate a device label
and return it to the user. Either way works, but the cookie is slightly
preferred per earlier discussion, as it may allow the user to inject faults
slightly faster without an ID->vRID lookup.

iommufd_[un]bind_device() functions are only used for physical devices.
Other variants will be introduced in the future, e.g.:

- iommufd_[un]bind_device_pasid() for mdev/subdev which requires pasid-
granular DMA isolation;
- iommufd_[un]bind_sw_mdev() for sw mdev which relies on software measures
instead of the iommu to isolate DMA;

Signed-off-by: Liu Yi L <[email protected]>
---
drivers/iommu/iommufd/iommufd.c | 160 +++++++++++++++++++++++++++++++-
include/linux/iommufd.h | 38 ++++++++
2 files changed, 196 insertions(+), 2 deletions(-)
create mode 100644 include/linux/iommufd.h

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 710b7e62988b..e16ca21e4534 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -16,10 +16,30 @@
#include <linux/miscdevice.h>
#include <linux/mutex.h>
#include <linux/iommu.h>
+#include <linux/iommufd.h>
+#include <linux/xarray.h>
+#include <asm-generic/bug.h>

/* Per iommufd */
struct iommufd_ctx {
refcount_t refs;
+ struct mutex lock;
+ struct xarray device_xa; /* xarray of bound devices */
+};
+
+/*
+ * An iommufd_device object represents the binding relationship
+ * between an iommufd and a device. It is created per successful
+ * binding request from the device driver. The bound device must
+ * be a physical device so far. Subdevices will be supported later
+ * (with additional PASID information). A user-assigned cookie is
+ * also recorded to mark the device in the /dev/iommu uAPI.
+ */
+struct iommufd_device {
+ unsigned int id;
+ struct iommufd_ctx *ictx;
+ struct device *dev; /* always be the physical device */
+ u64 dev_cookie;
};

static int iommufd_fops_open(struct inode *inode, struct file *filep)
@@ -32,15 +52,58 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)
return -ENOMEM;

refcount_set(&ictx->refs, 1);
+ mutex_init(&ictx->lock);
+ xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
filep->private_data = ictx;

return ret;
}

+static void iommufd_ctx_get(struct iommufd_ctx *ictx)
+{
+ refcount_inc(&ictx->refs);
+}
+
+static const struct file_operations iommufd_fops;
+
+/**
+ * iommufd_ctx_fdget - Acquires a reference to the internal iommufd context.
+ * @fd: [in] iommufd file descriptor.
+ *
+ * Returns a pointer to the iommufd context on success, otherwise NULL.
+ *
+ */
+static struct iommufd_ctx *iommufd_ctx_fdget(int fd)
+{
+ struct fd f = fdget(fd);
+ struct file *file = f.file;
+ struct iommufd_ctx *ictx;
+
+ if (!file)
+ return NULL;
+
+ if (file->f_op != &iommufd_fops) {
+ fdput(f);
+ return NULL;
+ }
+
+ ictx = file->private_data;
+ if (ictx)
+ iommufd_ctx_get(ictx);
+ fdput(f);
+ return ictx;
+}
+
+/**
+ * iommufd_ctx_put - Releases a reference to the internal iommufd context.
+ * @ictx: [in] Pointer to iommufd context.
+ *
+ */
static void iommufd_ctx_put(struct iommufd_ctx *ictx)
{
- if (refcount_dec_and_test(&ictx->refs))
- kfree(ictx);
+ if (!refcount_dec_and_test(&ictx->refs))
+ return;
+
+ WARN_ON(!xa_empty(&ictx->device_xa));
+ kfree(ictx);
}

static int iommufd_fops_release(struct inode *inode, struct file *filep)
@@ -86,6 +149,99 @@ static struct miscdevice iommu_misc_dev = {
.mode = 0666,
};

+/**
+ * iommufd_bind_device - Bind a physical device marked by a device
+ * cookie to an iommu fd.
+ * @fd: [in] iommufd file descriptor.
+ * @dev: [in] Pointer to a physical device struct.
+ * @dev_cookie: [in] A cookie to mark the device in /dev/iommu uAPI.
+ *
+ * A successful bind establishes a security context for the device
+ * and returns struct iommufd_device pointer. Otherwise returns
+ * error pointer.
+ *
+ */
+struct iommufd_device *iommufd_bind_device(int fd, struct device *dev,
+ u64 dev_cookie)
+{
+ struct iommufd_ctx *ictx;
+ struct iommufd_device *idev;
+ unsigned long index;
+ unsigned int id;
+ int ret;
+
+ ictx = iommufd_ctx_fdget(fd);
+ if (!ictx)
+ return ERR_PTR(-EINVAL);
+
+ mutex_lock(&ictx->lock);
+
+ /* check duplicate registration */
+ xa_for_each(&ictx->device_xa, index, idev) {
+ if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+ }
+
+ idev = kzalloc(sizeof(*idev), GFP_KERNEL);
+ if (!idev) {
+ ret = -ENOMEM;
+ goto out_unlock;
+ }
+
+ /* Establish the security context */
+ ret = iommu_device_init_user_dma(dev, (unsigned long)ictx);
+ if (ret)
+ goto out_free;
+
+ ret = xa_alloc(&ictx->device_xa, &id, idev,
+ XA_LIMIT(IOMMUFD_DEVID_MIN, IOMMUFD_DEVID_MAX),
+ GFP_KERNEL);
+ if (ret)
+ goto out_user_dma;
+
+ idev->ictx = ictx;
+ idev->dev = dev;
+ idev->dev_cookie = dev_cookie;
+ idev->id = id;
+ mutex_unlock(&ictx->lock);
+
+ return idev;
+out_user_dma:
+ iommu_device_exit_user_dma(dev);
+out_free:
+ kfree(idev);
+out_unlock:
+ mutex_unlock(&ictx->lock);
+ iommufd_ctx_put(ictx);
+
+ return ERR_PTR(ret);
+}
+EXPORT_SYMBOL_GPL(iommufd_bind_device);
+
+/**
+ * iommufd_unbind_device - Unbind a physical device from iommufd
+ *
+ * @idev: [in] Pointer to the internal iommufd_device struct.
+ *
+ */
+void iommufd_unbind_device(struct iommufd_device *idev)
+{
+ struct iommufd_ctx *ictx = idev->ictx;
+
+ mutex_lock(&ictx->lock);
+ xa_erase(&ictx->device_xa, idev->id);
+ mutex_unlock(&ictx->lock);
+ /* Exit the security context */
+ iommu_device_exit_user_dma(idev->dev);
+ kfree(idev);
+ iommufd_ctx_put(ictx);
+}
+EXPORT_SYMBOL_GPL(iommufd_unbind_device);
+
static int __init iommufd_init(void)
{
int ret;
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
new file mode 100644
index 000000000000..1603a13937e9
--- /dev/null
+++ b/include/linux/iommufd.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * IOMMUFD API definition
+ *
+ * Copyright (C) 2021 Intel Corporation
+ *
+ * Author: Liu Yi L <[email protected]>
+ */
+#ifndef __LINUX_IOMMUFD_H
+#define __LINUX_IOMMUFD_H
+
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/device.h>
+
+#define IOMMUFD_DEVID_MAX ((unsigned int)(0x7FFFFFFF))
+#define IOMMUFD_DEVID_MIN 0
+
+struct iommufd_device;
+
+#if IS_ENABLED(CONFIG_IOMMUFD)
+struct iommufd_device *
+iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie);
+void iommufd_unbind_device(struct iommufd_device *idev);
+
+#else /* !CONFIG_IOMMUFD */
+static inline struct iommufd_device *
+iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie)
+{
+ return ERR_PTR(-ENODEV);
+}
+
+static inline void iommufd_unbind_device(struct iommufd_device *idev)
+{
+}
+#endif /* CONFIG_IOMMUFD */
+#endif /* __LINUX_IOMMUFD_H */
--
2.25.1

2021-09-19 15:39:45

by Yi Liu

Subject: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

From: Lu Baolu <[email protected]>

This extends the iommu core to manage security contexts for passthrough
devices. Please bear with a long explanation of how we reached this design
instead of managing it solely in iommufd like what vfio does today.

Devices which cannot be isolated from each other are organized into an
iommu group. When a device is assigned to the user space, the entire
group must be put in a security context so that user-initiated DMAs via
the assigned device cannot harm the rest of the system. No user access
should be granted on a device before the security context is established
for the group which the device belongs to.

Managing the security context must meet the criteria below:

1) The group is viable for user-initiated DMAs. This implies that the
devices in the group must be either bound to a device-passthrough
framework, or driver-less, or bound to a driver which is known to be safe
(does no DMA).

2) The security context should only allow DMA to the user's memory and
devices in this group;

3) After the security context is established for the group, the group
viability must be continuously monitored before the user relinquishes
all devices belonging to the group. The viability might be broken e.g.
when a driver-less device is later bound to a driver which does DMA.

4) The security context should not be destroyed before user access
permission is withdrawn.

Existing vfio introduces explicit container/group semantics in its uAPI
to meet the above requirements. A single security context (iommu domain)
is created per container. Attaching group to container moves the entire
group into the associated security context, and vice versa. The user can
open the device only after group attach. A group can be detached only
after all devices in the group are closed. Group viability is monitored
by listening to iommu group events.

Unlike vfio, iommufd adopts a device-centric design with all group
logistics hidden behind the fd. Binding a device to iommufd serves
as the contract to get security context established (and vice versa
for unbinding). One additional requirement in iommufd is to manage the
switch between multiple security contexts due to decoupled bind/attach:

1) Open a device in "/dev/vfio/devices" with user access blocked;

2) Bind the device to an iommufd with an initial security context
(an empty iommu domain which blocks dma) established for its
group, with user access unblocked;

3) Attach the device to a user-specified ioasid (shared by all devices
attached to this ioasid). Before attaching, the device is first
detached from the initial context;

4) Detach the device from the ioasid and switch it back to the initial
security context;

5) Unbind the device from the iommufd, back to access blocked state and
move its group out of the initial security context if it's the last
unbound device in the group;

(Multiple attach/detach cycles may happen between steps 2 and 5.)

However the existing iommu core has a problem with the above transition.
The detach in steps 3/4 automatically re-attaches the device/group to
the default domain, which opens the door for user-initiated DMAs to
attack the rest of the system. Existing vfio doesn't have this problem
as it combines steps 2/3 into one (and likewise 4/5).

Fixing this problem requires the iommu core to also participate in
security context management. Following this direction we also move the
group viability check into the iommu core, which allows iommufd to stay
fully device-centric w/o keeping any group knowledge (combined with the
extension to iommu_at[de]tach_device() in a later patch).

Basically two new interfaces are provided:

    int iommu_device_init_user_dma(struct device *dev,
                                   unsigned long owner);
    void iommu_device_exit_user_dma(struct device *dev);

iommufd calls them respectively when handling device binding/unbinding
requests.

The init_user_dma() for the first device in a group marks the entire
group for user-dma and establishes the initial security context (dma
blocked) according to the aforementioned criteria. As long as the group
is marked for user-dma, auto-reattaching to the default domain is
disabled. Instead, upon detach the group is moved back to the initial
security context.

The caller also provides an owner id to mark the ownership so that an
inadvertent attempt from another caller on the same device can be
caught. In this RFC iommufd uses the fd context pointer as the owner id.

The exit_user_dma() for the last device in the group clears the user-dma
mark and moves the group back to the default domain.

Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Lu Baolu <[email protected]>
---
drivers/iommu/iommu.c | 145 +++++++++++++++++++++++++++++++++++++++++-
include/linux/iommu.h | 12 ++++
2 files changed, 154 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 5ea3a007fd7c..bffd84e978fb 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -45,6 +45,8 @@ struct iommu_group {
struct iommu_domain *default_domain;
struct iommu_domain *domain;
struct list_head entry;
+ unsigned long user_dma_owner_id;
+ refcount_t owner_cnt;
};

struct group_device {
@@ -86,6 +88,7 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
static ssize_t iommu_group_store_type(struct iommu_group *group,
const char *buf, size_t count);
+static bool iommu_group_user_dma_viable(struct iommu_group *group);

#define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \
struct iommu_group_attribute iommu_group_attr_##_name = \
@@ -275,7 +278,11 @@ int iommu_probe_device(struct device *dev)
*/
iommu_alloc_default_domain(group, dev);

- if (group->default_domain) {
+ /*
+ * If any device in the group has been initialized for user dma,
+ * avoid attaching the default domain.
+ */
+ if (group->default_domain && !group->user_dma_owner_id) {
ret = __iommu_attach_device(group->default_domain, dev);
if (ret) {
iommu_group_put(group);
@@ -1664,6 +1671,17 @@ static int iommu_bus_notifier(struct notifier_block *nb,
group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
break;
case BUS_NOTIFY_BOUND_DRIVER:
+ /*
+ * FIXME: Alternatively the attached drivers could generically
+ * indicate to the iommu layer that they are safe for keeping
+ * the iommu group user viable by calling some function around
+ * probe(). We could eliminate this gross BUG_ON() by denying
+ * probe to non-iommu-safe drivers.
+ */
+ mutex_lock(&group->mutex);
+ if (group->user_dma_owner_id)
+ BUG_ON(!iommu_group_user_dma_viable(group));
+ mutex_unlock(&group->mutex);
group_action = IOMMU_GROUP_NOTIFY_BOUND_DRIVER;
break;
case BUS_NOTIFY_UNBIND_DRIVER:
@@ -2304,7 +2322,11 @@ static int __iommu_attach_group(struct iommu_domain *domain,
{
int ret;

- if (group->default_domain && group->domain != group->default_domain)
+ /*
+ * group->domain could be NULL when a domain is detached from the
+ * group but the default_domain is not re-attached.
+ */
+ if (group->domain && group->domain != group->default_domain)
return -EBUSY;

ret = __iommu_group_for_each_dev(group, domain,
@@ -2341,7 +2363,11 @@ static void __iommu_detach_group(struct iommu_domain *domain,
{
int ret;

- if (!group->default_domain) {
+ /*
+ * If any device in the group has been initialized for user dma,
+ * avoid re-attaching the default domain.
+ */
+ if (!group->default_domain || group->user_dma_owner_id) {
__iommu_group_for_each_dev(group, domain,
iommu_group_do_detach_device);
group->domain = NULL;
@@ -3276,3 +3302,116 @@ int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *dat
return ops->device_info(dev, attr, data);
}
EXPORT_SYMBOL_GPL(iommu_device_get_info);
+
+/*
+ * IOMMU core interfaces for iommufd.
+ */
+
+/*
+ * FIXME: We currently simply follow the vfio policy to maintain the
+ * group's viability to the user. Eventually, we should avoid the
+ * hard-coded list below by letting drivers indicate to the iommu layer
+ * that they are safe for keeping the iommu group's user viability.
+ */
+static const char * const iommu_driver_allowed[] = {
+ "vfio-pci",
+ "pci-stub"
+};
+
+/*
+ * An iommu group is viable for use by userspace if all devices are in
+ * one of the following states:
+ * - driver-less
+ * - bound to an allowed driver
+ * - a PCI interconnect device
+ */
+static int device_user_dma_viable(struct device *dev, void *data)
+{
+ struct device_driver *drv = READ_ONCE(dev->driver);
+
+ if (!drv)
+ return 0;
+
+ if (dev_is_pci(dev)) {
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+ return 0;
+ }
+
+ return match_string(iommu_driver_allowed,
+ ARRAY_SIZE(iommu_driver_allowed),
+ drv->name) < 0;
+}
+
+static bool iommu_group_user_dma_viable(struct iommu_group *group)
+{
+ return !__iommu_group_for_each_dev(group, NULL, device_user_dma_viable);
+}
+
+static int iommu_group_init_user_dma(struct iommu_group *group,
+ unsigned long owner)
+{
+ if (group->user_dma_owner_id) {
+ if (group->user_dma_owner_id != owner)
+ return -EBUSY;
+
+ refcount_inc(&group->owner_cnt);
+ return 0;
+ }
+
+ if (group->domain && group->domain != group->default_domain)
+ return -EBUSY;
+
+ if (!iommu_group_user_dma_viable(group))
+ return -EINVAL;
+
+ group->user_dma_owner_id = owner;
+ refcount_set(&group->owner_cnt, 1);
+
+ /* default domain is unsafe for user-initiated dma */
+ if (group->domain == group->default_domain)
+ __iommu_detach_group(group->default_domain, group);
+
+ return 0;
+}
+
+int iommu_device_init_user_dma(struct device *dev, unsigned long owner)
+{
+ struct iommu_group *group = iommu_group_get(dev);
+ int ret;
+
+ if (!group || !owner)
+ return -ENODEV;
+
+ mutex_lock(&group->mutex);
+ ret = iommu_group_init_user_dma(group, owner);
+ mutex_unlock(&group->mutex);
+ iommu_group_put(group);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_device_init_user_dma);
+
+static void iommu_group_exit_user_dma(struct iommu_group *group)
+{
+ if (refcount_dec_and_test(&group->owner_cnt)) {
+ group->user_dma_owner_id = 0;
+ if (group->default_domain)
+ __iommu_attach_group(group->default_domain, group);
+ }
+}
+
+void iommu_device_exit_user_dma(struct device *dev)
+{
+ struct iommu_group *group = iommu_group_get(dev);
+
+ if (WARN_ON(!group || !group->user_dma_owner_id))
+ return;
+
+ mutex_lock(&group->mutex);
+ iommu_group_exit_user_dma(group);
+ mutex_unlock(&group->mutex);
+ iommu_group_put(group);
+}
+EXPORT_SYMBOL_GPL(iommu_device_exit_user_dma);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 52a6d33c82dc..943de6897f56 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -617,6 +617,9 @@ u32 iommu_sva_get_pasid(struct iommu_sva *handle);

int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *data);

+int iommu_device_init_user_dma(struct device *dev, unsigned long owner);
+void iommu_device_exit_user_dma(struct device *dev);
+
#else /* CONFIG_IOMMU_API */

struct iommu_ops {};
@@ -1018,6 +1021,15 @@ static inline int iommu_device_get_info(struct device *dev,
{
return -ENODEV;
}
+
+static inline int iommu_device_init_user_dma(struct device *dev, unsigned long owner)
+{
+ return -ENODEV;
+}
+
+static inline void iommu_device_exit_user_dma(struct device *dev)
+{
+}
#endif /* CONFIG_IOMMU_API */

/**
--
2.25.1

2021-09-19 15:41:22

by Yi Liu

[permalink] [raw]
Subject: [RFC 20/20] Doc: Add documentation for /dev/iommu

Document the /dev/iommu framework for users.

Open:
Do we want to document /dev/iommu in Documentation/userspace-api/iommu.rst?
The existing iommu.rst covers the vSVA interfaces; honestly, that doc may
need to be rewritten entirely.

Signed-off-by: Kevin Tian <[email protected]>
Signed-off-by: Liu Yi L <[email protected]>
---
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/iommufd.rst | 183 ++++++++++++++++++++++++
2 files changed, 184 insertions(+)
create mode 100644 Documentation/userspace-api/iommufd.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 0b5eefed027e..54df5a278023 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -25,6 +25,7 @@ place where this information is gathered.
ebpf/index
ioctl/index
iommu
+ iommufd
media/index
sysfs-platform_profile

diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
new file mode 100644
index 000000000000..abffbb47dc02
--- /dev/null
+++ b/Documentation/userspace-api/iommufd.rst
@@ -0,0 +1,183 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. iommu:
+
+===================
+IOMMU Userspace API
+===================
+
+Direct device access from userspace has been a critical feature in
+high performance computing and virtualization use cases. Linux now
+includes multiple device-passthrough frameworks (e.g. VFIO and vDPA)
+to manage secure device access from the userspace. One critical
+task of those frameworks is to put the assigned device in a secure,
+IOMMU-protected context so the device is prevented from doing harm
+to the rest of the system.
+
+Currently those frameworks implement their own logic for managing
+I/O page tables to isolate user-initiated DMAs. This doesn't scale
+to support many new IOMMU features, such as PASID-granular DMA
+remapping, nested translation, I/O page fault, IOMMU dirty bit, etc.
+
+The /dev/iommu framework provides a unified interface for managing
+I/O page tables for passthrough devices. Existing passthrough
+frameworks are expected to use this interface instead of continuing
+their ad-hoc implementations.
+
+IOMMUFDs, IOASIDs, Devices and Groups
+-------------------------------------
+
+The core concepts in /dev/iommu are IOMMUFDs and IOASIDs. IOMMUFD (by
+opening /dev/iommu) is the container holding multiple I/O address
+spaces for a user, while IOASID is the fd-local software handle
+representing an I/O address space and associated with a single I/O
+page table. The user manages those address spaces through fd operations,
+e.g. by using vfio type1v2 mapping semantics to manage respective
+I/O page tables.
+
+IOASID is comparable to the container concept in VFIO. The latter
+is also associated with a single I/O address space. A main difference
+between them is that multiple IOASIDs in the same IOMMUFD can be
+nested together (not supported yet) to allow centralized accounting
+of locked pages, while multiple containers are disconnected and thus
+incur duplicated accounting. Typically one IOMMUFD is
+sufficient for all intended IOMMU usages for a user.
+
+An I/O address space takes effect in the IOMMU only after it is
+attached by a device. One I/O address space can be attached by
+multiple devices. One device can only be attached to a single I/O
+address space at this point (on par with current vfio behavior).
+
+A device must be bound to an iommufd before the attach operation can
+be conducted. The binding operation builds the connection between
+the devicefd (opened via device-passthrough framework) and IOMMUFD.
+An IOMMU-protected security context is established when the binding
+operation is completed. The passthrough framework must block user
+access to the assigned device until bind() returns success.
+
+The entire /dev/iommu framework adopts a device-centric model w/o
+carrying any container/group legacy as current vfio does. However
+the group is the minimum granularity that must be used to ensure
+secure user access (refer to vfio.rst). This framework relies on
+the IOMMU core layer to map device-centric model into group-granular
+isolation.
+
+Managing I/O Address Spaces
+---------------------------
+
+When creating an I/O address space (by allocating an IOASID), the user
+must specify the type of underlying I/O page table. Currently only
+one type (kernel-managed) is supported. In the future other types
+will be introduced, e.g. to support user-managed I/O page table or
+a shared I/O page table managed by another kernel subsystem
+(mm, ept, etc.). A kernel-managed I/O page table is currently
+operated via vfio type1v2-equivalent mapping semantics.
+
+The user also needs to specify the format of the I/O page table
+when allocating an IOASID. The format must be compatible with the
+attached devices (or more specifically to the IOMMU which serves
+the DMA from the attached devices). User can query the device IOMMU
+format via IOMMUFD once a device is successfully bound. Attaching a
+device to an IOASID with an incompatible format is simply rejected.
+
+No-snoop DMA is not supported yet. This implies that an IOASID
+must be created in an enforce-snoop format and that only devices
+which can be forced by the IOMMU to snoop the cache are allowed
+to be attached to an IOASID. The user should check the uAPI
+extension and get device info via IOMMUFD to handle this restriction.
+
+Usage Example
+-------------
+
+Assume the user wants to access PCI device 0000:06:0d.0, which is
+exposed under the new /dev/vfio/devices directory by VFIO:
+
+ /* Open device-centric interface and /dev/iommu interface */
+ device_fd = open("/dev/vfio/devices/0000:06:0d.0", O_RDWR);
+ iommu_fd = open("/dev/iommu", O_RDWR);
+
+ /* Bind device to IOMMUFD */
+ bind_data = { .iommu_fd = iommu_fd, .dev_cookie = cookie };
+ ioctl(device_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind_data);
+
+ /* Query per-device IOMMU capability/format */
+ info = { .dev_cookie = cookie, };
+ ioctl(iommu_fd, IOMMU_DEVICE_GET_INFO, &info);
+
+ if (!(info.flags & IOMMU_DEVICE_INFO_ENFORCE_SNOOP)) {
+ if (!ioctl(iommu_fd, IOMMU_CHECK_EXTENSION,
+ EXT_DMA_NO_SNOOP))
+ /* No support of no-snoop DMA */
+ }
+
+ if (!ioctl(iommu_fd, IOMMU_CHECK_EXTENSION, EXT_MAP_TYPE1V2))
+ /* No support of vfio type1v2 mapping semantics */
+
+ /* Decides IOASID alloc fields based on info */
+ alloc_data = { .type = IOMMU_IOASID_TYPE_KERNEL,
+ .flags = IOMMU_IOASID_ENFORCE_SNOOP,
+ .addr_width = info.addr_width, };
+
+ /* Allocate IOASID */
+ gpa_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
+
+ /* Attach device to an IOASID */
+ at_data = { .iommu_fd = iommu_fd, .ioasid = gpa_ioasid };
+ ioctl(device_fd, VFIO_DEVICE_ATTACH_IOASID, &at_data);
+
+ /* Setup GPA mapping [0 - 1GB] */
+ dma_map = {
+ .ioasid = gpa_ioasid,
+ .data = {
+ .flags = R/W, /* permission */
+ .iova = 0, /* GPA */
+ .vaddr = 0x40000000, /* HVA */
+ .size = 1GB,
+ },
+ };
+ ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);
+
+ /* DMA */
+
+ /* Unmap GPA mapping [0 - 1GB] */
+ dma_unmap = {
+ .ioasid = gpa_ioasid,
+ .data = {
+ .iova = 0, /* GPA */
+ .size = 1GB,
+ },
+ };
+ ioctl(iommu_fd, IOMMU_UNMAP_DMA, &dma_unmap);
+
+ /* Detach device from an IOASID */
+ dt_data = { .iommu_fd = iommu_fd, .ioasid = gpa_ioasid };
+ ioctl(device_fd, VFIO_DEVICE_DETACH_IOASID, &dt_data);
+
+ /* Free IOASID */
+ ioctl(iommu_fd, IOMMU_IOASID_FREE, gpa_ioasid);
+
+ close(device_fd);
+ close(iommu_fd);
+
+API for device-passthrough frameworks
+-------------------------------------
+
+iommufd binding and IOASID attach/detach are initiated via the device-
+passthrough framework uAPI.
+
+When a binding operation is requested by the user, the passthrough
+framework should call iommufd_bind_device(). When the device fd is
+closed by the user, iommufd_unbind_device() should be called
+automatically::
+
+ struct iommufd_device *
+ iommufd_bind_device(int fd, struct device *dev,
+ u64 dev_cookie);
+ void iommufd_unbind_device(struct iommufd_device *idev);
+
+IOASID attach/detach operations are per iommufd_device which is
+returned by iommufd_bind_device()::
+
+ int iommufd_device_attach_ioasid(struct iommufd_device *idev,
+ int ioasid);
+ void iommufd_device_detach_ioasid(struct iommufd_device *idev,
+ int ioasid);
--
2.25.1

2021-09-21 13:49:11

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management

On Sun, Sep 19, 2021 at 02:38:28PM +0800, Liu Yi L wrote:
> Linux now includes multiple device-passthrough frameworks (e.g. VFIO and
> vDPA) to manage secure device access from the userspace. One critical task
> of those frameworks is to put the assigned device in a secure, IOMMU-
> protected context so user-initiated DMAs are prevented from doing harm to
> the rest of the system.

Some bot will probably send this too, but it has compile warnings and
needs to be rebased to 5.15-rc1

drivers/iommu/iommufd/iommufd.c:269:6: warning: variable 'ret' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
if (refcount_read(&ioas->refs) > 1) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:277:9: note: uninitialized use occurs here
return ret;
^~~
drivers/iommu/iommufd/iommufd.c:269:2: note: remove the 'if' if its condition is always true
if (refcount_read(&ioas->refs) > 1) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:253:17: note: initialize the variable 'ret' to silence this warning
int ioasid, ret;
^
= 0
drivers/iommu/iommufd/iommufd.c:727:7: warning: variable 'ret' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:767:17: note: uninitialized use occurs here
return ERR_PTR(ret);
^~~
drivers/iommu/iommufd/iommufd.c:727:3: note: remove the 'if' if its condition is always false
if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:727:7: warning: variable 'ret' is used uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
^~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:767:17: note: uninitialized use occurs here
return ERR_PTR(ret);
^~~
drivers/iommu/iommufd/iommufd.c:727:7: note: remove the '||' if its condition is always false
if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
^~~~~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:717:9: note: initialize the variable 'ret' to silence this warning
int ret;
^
= 0

Jason

2021-09-21 16:41:13

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices

On Sun, Sep 19, 2021 at 02:38:33PM +0800, Liu Yi L wrote:
> This patch exposes the device-centric interface for vfio-pci devices. To
> be compatiable with existing users, vfio-pci exposes both legacy group
> interface and device-centric interface.
>
> As explained in last patch, this change doesn't apply to devices which
> cannot be forced to snoop cache by their upstream iommu. Such devices
> are still expected to be opened via the legacy group interface.
>
> When the device is opened via /dev/vfio/devices, vfio-pci should prevent
> the user from accessing the assigned device because the device is still
> attached to the default domain which may allow user-initiated DMAs to
> touch arbitrary place. The user access must be blocked until the device
> is later bound to an iommufd (see patch 08). The binding acts as the
> contract for putting the device in a security context which ensures user-
> initiated DMAs via this device cannot harm the rest of the system.
>
> This patch introduces a vdev->block_access flag for this purpose. It's set
> when the device is opened via /dev/vfio/devices and cleared after binding
> to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
> user access should be blocked or not.

This should not be in vfio_pci.

AFAIK there is no condition where a vfio driver can work without being
connected to some kind of iommu back end, so the core code should
handle this interlock globally. A vfio driver's ops should not be
callable until the iommu is connected.

The only vfio_pci patch in this series should be adding a new callback
op to take in an iommufd and register the pci_device as a iommufd
device.

Jason

2021-09-21 17:12:29

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Sun, Sep 19, 2021 at 02:38:34PM +0800, Liu Yi L wrote:
> From: Lu Baolu <[email protected]>
>
> This extends iommu core to manage security context for passthrough
> devices. Please bear a long explanation for how we reach this design
> instead of managing it solely in iommufd like what vfio does today.
>
> Devices which cannot be isolated from each other are organized into an
> iommu group. When a device is assigned to the user space, the entire
> group must be put in a security context so that user-initiated DMAs via
> the assigned device cannot harm the rest of the system. No user access
> should be granted on a device before the security context is established
> for the group which the device belongs to.

> Managing the security context must meet below criteria:
>
> 1) The group is viable for user-initiated DMAs. This implies that the
> devices in the group must be either bound to a device-passthrough

s/a/the same/

> framework, or driver-less, or bound to a driver which is known safe
> (not do DMA).
>
> 2) The security context should only allow DMA to the user's memory and
> devices in this group;
>
> 3) After the security context is established for the group, the group
> viability must be continuously monitored before the user relinquishes
> all devices belonging to the group. The viability might be broken e.g.
> when a driver-less device is later bound to a driver which does DMA.
>
> 4) The security context should not be destroyed before user access
> permission is withdrawn.
>
> Existing vfio introduces explicit container/group semantics in its uAPI
> to meet above requirements. A single security context (iommu domain)
> is created per container. Attaching group to container moves the entire
> group into the associated security context, and vice versa. The user can
> open the device only after group attach. A group can be detached only
> after all devices in the group are closed. Group viability is monitored
> by listening to iommu group events.
>
> Unlike vfio, iommufd adopts a device-centric design with all group
> logistics hidden behind the fd. Binding a device to iommufd serves
> as the contract to get security context established (and vice versa
> for unbinding). One additional requirement in iommufd is to manage the
> switch between multiple security contexts due to decoupled bind/attach:

This should be a precursor series that actually does clean things up
properly. There is no reason for vfio and iommufd to differ here, if
we are implementing this logic into the iommu layer then it should be
deleted from the VFIO layer, not left duplicated like this.

IIRC in VFIO the container is the IOAS and when the group goes to
create the device fd it should simply do the
iommu_device_init_user_dma() followed immediately by a call to bind
the container IOAS as your #3.

Then delete all the group viability stuff from vfio, relying on the
iommu to do it.

It should have full symmetry with the iommufd.

> @@ -1664,6 +1671,17 @@ static int iommu_bus_notifier(struct notifier_block *nb,
> group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
> break;
> case BUS_NOTIFY_BOUND_DRIVER:
> + /*
> + * FIXME: Alternatively the attached drivers could generically
> + * indicate to the iommu layer that they are safe for keeping
> + * the iommu group user viable by calling some function around
> + * probe(). We could eliminate this gross BUG_ON() by denying
> + * probe to non-iommu-safe driver.
> + */
> + mutex_lock(&group->mutex);
> + if (group->user_dma_owner_id)
> + BUG_ON(!iommu_group_user_dma_viable(group));
> + mutex_unlock(&group->mutex);

And the mini-series should fix this BUG_ON properly by interlocking
with the driver core to simply refuse to bind a driver under these
conditions instead of allowing userspace to crash the kernel.

That alone would be justification enough to merge this work.

> +
> +/*
> + * IOMMU core interfaces for iommufd.
> + */
> +
> +/*
> + * FIXME: We currently simply follow vifo policy to mantain the group's
> + * viability to user. Eventually, we should avoid below hard-coded list
> + * by letting drivers indicate to the iommu layer that they are safe for
> + * keeping the iommu group's user aviability.
> + */
> +static const char * const iommu_driver_allowed[] = {
> + "vfio-pci",
> + "pci-stub"
> +};

Yuk. This should be done with some callback in those drivers,
e.g. 'iommu_allow_user_dma()'.

Ie the basic flow would see the driver core doing some:

ret = iommu_doing_kernel_dma()
if (ret) do not bind
driver_bind
pci_stub_probe()
iommu_allow_user_dma()

And the various functions are manipulating some atomic.
0 = nothing happening
1 = kernel DMA
2 = user DMA

No BUG_ON.

Jason

2021-09-21 17:17:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

On Sun, Sep 19, 2021 at 02:38:35PM +0800, Liu Yi L wrote:

> +/*
> + * A iommufd_device object represents the binding relationship
> + * between iommufd and device. It is created per a successful
> + * binding request from device driver. The bound device must be
> + * a physical device so far. Subdevice will be supported later
> + * (with additional PASID information). An user-assigned cookie
> + * is also recorded to mark the device in the /dev/iommu uAPI.
> + */
> +struct iommufd_device {
> + unsigned int id;
> + struct iommufd_ctx *ictx;
> + struct device *dev; /* always be the physical device */
> + u64 dev_cookie;
> };
>
> static int iommufd_fops_open(struct inode *inode, struct file *filep)
> @@ -32,15 +52,58 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)
> return -ENOMEM;
>
> refcount_set(&ictx->refs, 1);
> + mutex_init(&ictx->lock);
> + xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
> filep->private_data = ictx;
>
> return ret;
> }
>
> +static void iommufd_ctx_get(struct iommufd_ctx *ictx)
> +{
> + refcount_inc(&ictx->refs);
> +}

See my earlier remarks about how to structure the lifetime logic, this
ref isn't necessary.

> +static const struct file_operations iommufd_fops;
> +
> +/**
> + * iommufd_ctx_fdget - Acquires a reference to the internal iommufd context.
> + * @fd: [in] iommufd file descriptor.
> + *
> + * Returns a pointer to the iommufd context, otherwise NULL;
> + *
> + */
> +static struct iommufd_ctx *iommufd_ctx_fdget(int fd)
> +{
> + struct fd f = fdget(fd);
> + struct file *file = f.file;
> + struct iommufd_ctx *ictx;
> +
> + if (!file)
> + return NULL;
> +
> + if (file->f_op != &iommufd_fops)
> + return NULL;

Leaks the fdget

> +
> + ictx = file->private_data;
> + if (ictx)
> + iommufd_ctx_get(ictx);

Use success oriented flow

> + fdput(f);
> + return ictx;
> +}

> + */
> +struct iommufd_device *iommufd_bind_device(int fd, struct device *dev,
> + u64 dev_cookie)
> +{
> + struct iommufd_ctx *ictx;
> + struct iommufd_device *idev;
> + unsigned long index;
> + unsigned int id;
> + int ret;
> +
> + ictx = iommufd_ctx_fdget(fd);
> + if (!ictx)
> + return ERR_PTR(-EINVAL);
> +
> + mutex_lock(&ictx->lock);
> +
> + /* check duplicate registration */
> + xa_for_each(&ictx->device_xa, index, idev) {
> + if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> + idev = ERR_PTR(-EBUSY);
> + goto out_unlock;
> + }

I can't think of a reason why this expensive check is needed.

> + }
> +
> + idev = kzalloc(sizeof(*idev), GFP_KERNEL);
> + if (!idev) {
> + ret = -ENOMEM;
> + goto out_unlock;
> + }
> +
> + /* Establish the security context */
> + ret = iommu_device_init_user_dma(dev, (unsigned long)ictx);
> + if (ret)
> + goto out_free;
> +
> + ret = xa_alloc(&ictx->device_xa, &id, idev,
> + XA_LIMIT(IOMMUFD_DEVID_MIN, IOMMUFD_DEVID_MAX),
> + GFP_KERNEL);

idev should be fully initialized before being placed in the xarray, so
this should be the last thing done.

Why not just use the standard xa_limit_32b instead of special single
use constants?

Jason

2021-09-21 19:31:08

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID

On Sun, Sep 19, 2021 at 02:38:43PM +0800, Liu Yi L wrote:
> This patch adds interface for userspace to attach device to specified
> IOASID.
>
> Note:
> One device can only be attached to one IOASID in this version. This is
> on par with what vfio provides today. In the future this restriction can
> be relaxed when multiple I/O address spaces are supported per device

?? In VFIO the container is the IOAS and the container can be shared
with multiple devices. This needs to start at about the same
functionality.

> + } else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {

This should be in the core code, right? There is nothing PCI specific
here.

Jason

2021-09-21 21:33:00

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices

On Tue, 21 Sep 2021 13:40:01 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Sun, Sep 19, 2021 at 02:38:33PM +0800, Liu Yi L wrote:
> > This patch exposes the device-centric interface for vfio-pci devices. To
> > be compatiable with existing users, vfio-pci exposes both legacy group
> > interface and device-centric interface.
> >
> > As explained in last patch, this change doesn't apply to devices which
> > cannot be forced to snoop cache by their upstream iommu. Such devices
> > are still expected to be opened via the legacy group interface.

This doesn't make much sense to me. The previous patch indicates
there's work to be done in updating the kvm-vfio contract to understand
DMA coherency, so you're trying to limit use cases to those where the
IOMMU enforces coherency, but there's QEMU work to be done to support
the iommufd uAPI at all. Isn't part of that work to understand how KVM
will be told about non-coherent devices rather than "meh, skip it in the
kernel"? Also let's not forget that vfio is not only for KVM.

> > When the device is opened via /dev/vfio/devices, vfio-pci should prevent
> > the user from accessing the assigned device because the device is still
> > attached to the default domain which may allow user-initiated DMAs to
> > touch arbitrary place. The user access must be blocked until the device
> > is later bound to an iommufd (see patch 08). The binding acts as the
> > contract for putting the device in a security context which ensures user-
> > initiated DMAs via this device cannot harm the rest of the system.
> >
> > This patch introduces a vdev->block_access flag for this purpose. It's set
> > when the device is opened via /dev/vfio/devices and cleared after binding
> > to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
> > user access should be blocked or not.
>
> This should not be in vfio_pci.
>
> AFAIK there is no condition where a vfio driver can work without being
> connected to some kind of iommu back end, so the core code should
> handle this interlock globally. A vfio driver's ops should not be
> callable until the iommu is connected.
>
> The only vfio_pci patch in this series should be adding a new callback
> op to take in an iommufd and register the pci_device as a iommufd
> device.

Couldn't the same argument be made that registering a $bus device as an
iommufd device is a common interface that shouldn't be the
responsibility of the vfio device driver? Is userspace opening the
non-group device anything more than a reservation of that device if
access is withheld until iommu isolation? I also don't really want to
predict how ioctls might evolve to guess whether only blocking .read,
.write, and .mmap callbacks are sufficient. Thanks,

Alex

2021-09-22 02:01:37

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices

On Tue, Sep 21, 2021 at 03:09:29PM -0600, Alex Williamson wrote:

> the iommufd uAPI at all. Isn't part of that work to understand how KVM
> will be told about non-coherent devices rather than "meh, skip it in the
> kernel"? Also let's not forget that vfio is not only for KVM.

vfio is not only for KVM, but AFAICT the wbinvd stuff is only for
KVM... But yes, I agree this should be sorted out at this stage

> > > When the device is opened via /dev/vfio/devices, vfio-pci should prevent
> > > the user from accessing the assigned device because the device is still
> > > attached to the default domain which may allow user-initiated DMAs to
> > > touch arbitrary place. The user access must be blocked until the device
> > > is later bound to an iommufd (see patch 08). The binding acts as the
> > > contract for putting the device in a security context which ensures user-
> > > initiated DMAs via this device cannot harm the rest of the system.
> > >
> > > This patch introduces a vdev->block_access flag for this purpose. It's set
> > > when the device is opened via /dev/vfio/devices and cleared after binding
> > > to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
> > > user access should be blocked or not.
> >
> > This should not be in vfio_pci.
> >
> > AFAIK there is no condition where a vfio driver can work without being
> > connected to some kind of iommu back end, so the core code should
> > handle this interlock globally. A vfio driver's ops should not be
> > callable until the iommu is connected.
> >
> > The only vfio_pci patch in this series should be adding a new callback
> > op to take in an iommufd and register the pci_device as a iommufd
> > device.
>
> Couldn't the same argument be made that registering a $bus device as an
> iommufd device is a common interface that shouldn't be the
> responsibility of the vfio device driver?

The driver needs enough involvement to signal what kind of IOMMU
connection it wants, e.g. attaching to a physical device will use the
iofd_attach_device() path, but attaching to a SW page table should use
a different API call. PASID should also be different.

Possibly a good arrangement is to have the core provide some generic
ioctl ops functions 'vfio_all_device_iommufd_bind' that everything
except mdev drivers can use, so the code isn't all duplicated.

> non-group device anything more than a reservation of that device if
> access is withheld until iommu isolation? I also don't really want to
> predict how ioctls might evolve to guess whether only blocking .read,
> .write, and .mmap callbacks are sufficient. Thanks,

This is why I said the entire fops should be blocked with a dummy fops,
so the core code keeps the vfio_device FD parked and userspace unable to
access the ops until device attachment, and thus IOMMU isolation, is
completed.

Simple and easy to reason about; a parked FD is very similar to a
closed FD.

Jason

2021-09-22 02:12:26

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, September 22, 2021 1:10 AM
>
> On Sun, Sep 19, 2021 at 02:38:34PM +0800, Liu Yi L wrote:
> > From: Lu Baolu <[email protected]>
> >
> > This extends iommu core to manage security context for passthrough
> > devices. Please bear a long explanation for how we reach this design
> > instead of managing it solely in iommufd like what vfio does today.
> >
> > Devices which cannot be isolated from each other are organized into an
> > iommu group. When a device is assigned to the user space, the entire
> > group must be put in a security context so that user-initiated DMAs via
> > the assigned device cannot harm the rest of the system. No user access
> > should be granted on a device before the security context is established
> > for the group which the device belongs to.
>
> > Managing the security context must meet below criteria:
> >
> > 1) The group is viable for user-initiated DMAs. This implies that the
> > devices in the group must be either bound to a device-passthrough
>
> s/a/the same/
>
> > framework, or driver-less, or bound to a driver which is known safe
> > (not do DMA).
> >
> > 2) The security context should only allow DMA to the user's memory and
> > devices in this group;
> >
> > 3) After the security context is established for the group, the group
> > viability must be continuously monitored before the user relinquishes
> > all devices belonging to the group. The viability might be broken e.g.
> > when a driver-less device is later bound to a driver which does DMA.
> >
> > 4) The security context should not be destroyed before user access
> > permission is withdrawn.
> >
> > Existing vfio introduces explicit container/group semantics in its uAPI
> > to meet above requirements. A single security context (iommu domain)
> > is created per container. Attaching group to container moves the entire
> > group into the associated security context, and vice versa. The user can
> > open the device only after group attach. A group can be detached only
> > after all devices in the group are closed. Group viability is monitored
> > by listening to iommu group events.
> >
> > Unlike vfio, iommufd adopts a device-centric design with all group
> > logistics hidden behind the fd. Binding a device to iommufd serves
> > as the contract to get security context established (and vice versa
> > for unbinding). One additional requirement in iommufd is to manage the
> > switch between multiple security contexts due to decoupled bind/attach:
>
> This should be a precursor series that actually does clean things up
> properly. There is no reason for vfio and iommufd to differ here, if
> we are implementing this logic into the iommu layer then it should be
> deleted from the VFIO layer, not left duplicated like this.

makes sense

>
> IIRC in VFIO the container is the IOAS and when the group goes to
> create the device fd it should simply do the
> iommu_device_init_user_dma() followed immediately by a call to bind
> the container IOAS as your #3.

a slight correction.

to meet vfio semantics we could do init_user_dma() at group attach
time and then bind to the container IOAS when the device fd
is created. This is because vfio requires the group to be in a security
context before the device is opened.

>
> Then delete all the group viability stuff from vfio, relying on the
> iommu to do it.
>
> It should have full symmetry with the iommufd.

agree

>
> > @@ -1664,6 +1671,17 @@ static int iommu_bus_notifier(struct notifier_block *nb,
> > group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
> > break;
> > case BUS_NOTIFY_BOUND_DRIVER:
> > + /*
> > + * FIXME: Alternatively the attached drivers could generically
> > + * indicate to the iommu layer that they are safe for keeping
> > + * the iommu group user viable by calling some function around
> > + * probe(). We could eliminate this gross BUG_ON() by denying
> > + * probe to non-iommu-safe driver.
> > + */
> > + mutex_lock(&group->mutex);
> > + if (group->user_dma_owner_id)
> > + BUG_ON(!iommu_group_user_dma_viable(group));
> > + mutex_unlock(&group->mutex);
>
> And the mini-series should fix this BUG_ON properly by interlocking
> with the driver core to simply refuse to bind a driver under these
> conditions instead of allowing userspace to crash the kernel.
>
> That alone would be justification enough to merge this work.

yes

>
> > +
> > +/*
> > + * IOMMU core interfaces for iommufd.
> > + */
> > +
> > +/*
> > + * FIXME: We currently simply follow vfio policy to maintain the group's
> > + * viability to user. Eventually, we should avoid below hard-coded list
> > + * by letting drivers indicate to the iommu layer that they are safe for
> > + * keeping the iommu group's user viability.
> > + */
> > +static const char * const iommu_driver_allowed[] = {
> > + "vfio-pci",
> > + "pci-stub"
> > +};
>
> Yuk. This should be done with some callback in those drivers
> 'iomm_allow_user_dma()"
>
> Ie the basic flow would see the driver core doing some:

Just to double confirm: is there a concern with having the driver core
call iommu functions?

>
> ret = iommu_doing_kernel_dma()
> if (ret) do not bind
> driver_bind
> pci_stub_probe()
> iommu_allow_user_dma()
>
> And the various functions are manipulating some atomic.
> 0 = nothing happening
> 1 = kernel DMA
> 2 = user DMA
>
> No BUG_ON.
>
> Jason

2021-09-22 02:12:51

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices

> From: Alex Williamson <[email protected]>
> Sent: Wednesday, September 22, 2021 5:09 AM
>
> On Tue, 21 Sep 2021 13:40:01 -0300
> Jason Gunthorpe <[email protected]> wrote:
>
> > On Sun, Sep 19, 2021 at 02:38:33PM +0800, Liu Yi L wrote:
> > > This patch exposes the device-centric interface for vfio-pci devices. To
> > > be compatible with existing users, vfio-pci exposes both legacy group
> > > interface and device-centric interface.
> > >
> > > As explained in last patch, this change doesn't apply to devices which
> > > cannot be forced to snoop cache by their upstream iommu. Such devices
> > > are still expected to be opened via the legacy group interface.
>
> This doesn't make much sense to me. The previous patch indicates
> there's work to be done in updating the kvm-vfio contract to understand
> DMA coherency, so you're trying to limit use cases to those where the
> IOMMU enforces coherency, but there's QEMU work to be done to support
> the iommufd uAPI at all. Isn't part of that work to understand how KVM
> will be told about non-coherent devices rather than "meh, skip it in the
> kernel"? Also let's not forget that vfio is not only for KVM.

The policy here is that VFIO will not expose such devices (no enforce-snoop)
in the new device hierarchy at all. In this case QEMU will fall back to the
group interface automatically and then rely on the existing contract to connect
vfio and QEMU. It doesn't need to care about whatever new contract
until such devices are exposed in the new interface.

yes, vfio is not only for KVM. But here it's more a task split based on staging
considerations. imo it's not necessary to further split the task into supporting
non-snoop devices for userspace drivers and then for kvm.

2021-09-22 02:26:14

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, September 22, 2021 5:59 AM
>
> On Tue, Sep 21, 2021 at 03:09:29PM -0600, Alex Williamson wrote:
>
> > the iommufd uAPI at all. Isn't part of that work to understand how KVM
> > will be told about non-coherent devices rather than "meh, skip it in the
> > kernel"? Also let's not forget that vfio is not only for KVM.
>
> vfio is not only for KVM, but AFAICT the wbinvd stuff is only for
> KVM... But yes, I agree this should be sorted out at this stage

If such devices are not even exposed in the new hierarchy at this stage,
I suppose sorting it out later should be fine?

>
> > > > When the device is opened via /dev/vfio/devices, vfio-pci should prevent
> > > > the user from accessing the assigned device because the device is still
> > > > attached to the default domain which may allow user-initiated DMAs to
> > > > touch arbitrary place. The user access must be blocked until the device
> > > > is later bound to an iommufd (see patch 08). The binding acts as the
> > > > contract for putting the device in a security context which ensures user-
> > > > initiated DMAs via this device cannot harm the rest of the system.
> > > >
> > > > This patch introduces a vdev->block_access flag for this purpose. It's set
> > > > when the device is opened via /dev/vfio/devices and cleared after binding
> > > > to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
> > > > user access should be blocked or not.
> > >
> > > This should not be in vfio_pci.
> > >
> > > AFAIK there is no condition where a vfio driver can work without being
> > > connected to some kind of iommu back end, so the core code should
> > > handle this interlock globally. A vfio driver's ops should not be
> > > callable until the iommu is connected.
> > >
> > > The only vfio_pci patch in this series should be adding a new callback
> > > op to take in an iommufd and register the pci_device as a iommufd
> > > device.
> >
> > Couldn't the same argument be made that registering a $bus device as an
> > iommufd device is a common interface that shouldn't be the
> > responsibility of the vfio device driver?
>
> The driver needs enough involvment to signal what kind of IOMMU
> connection it wants, eg attaching to a physical device will use the
> iofd_attach_device() path, but attaching to a SW page table should use
> a different API call. PASID should also be different.

Exactly

>
> Possibly a good arrangement is to have the core provide some generic
> ioctl ops functions 'vfio_all_device_iommufd_bind' that everything
> except mdev drivers can use so the code isn't all duplicated.

Could this be a future enhancement when we have multiple device
types supporting iommufd?

>
> > non-group device anything more than a reservation of that device if
> > access is withheld until iommu isolation? I also don't really want to
> > predict how ioctls might evolve to guess whether only blocking .read,
> > .write, and .mmap callbacks are sufficient. Thanks,
>
> This is why I said the entire fops should be blocked in a dummy fops
> so the core code keeps the vfio_device FD parked and userspace unable to
> access the ops until device attachment and thus IOMMU isolation is
> completed.
>
> Simple and easy to reason about, a parked FD is very similar to a
> closed FD.
>

This rationale makes sense. Just the open question of how to handle
exclusive open between the group and non-group interfaces still needs
some more clarification, especially about what a parked FD means
for the group interface (where parking is unnecessary since the
security context is already established before the device is opened)

Thanks
Kevin

2021-09-22 03:26:27

by Yi Liu

[permalink] [raw]
Subject: RE: [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management

> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, September 21, 2021 9:45 PM
>
> On Sun, Sep 19, 2021 at 02:38:28PM +0800, Liu Yi L wrote:
> > Linux now includes multiple device-passthrough frameworks (e.g. VFIO and
> > vDPA) to manage secure device access from the userspace. One critical task
> > of those frameworks is to put the assigned device in a secure, IOMMU-
> > protected context so user-initiated DMAs are prevented from doing harm to
> > the rest of the system.
>
> Some bot will probably send this too, but it has compile warnings and
> needs to be rebased to 5.15-rc1

thanks Jason, will fix the warnings. yeah, I was using 5.14 in the test, will
rebase to 5.15-rc# in the next version.

Regards,
Yi Liu

> drivers/iommu/iommufd/iommufd.c:269:6: warning: variable 'ret' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
> if (refcount_read(&ioas->refs) > 1) {
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:277:9: note: uninitialized use occurs here
> return ret;
> ^~~
> drivers/iommu/iommufd/iommufd.c:269:2: note: remove the 'if' if its condition is always true
> if (refcount_read(&ioas->refs) > 1) {
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:253:17: note: initialize the variable 'ret' to silence this warning
> int ioasid, ret;
> ^
> = 0
> drivers/iommu/iommufd/iommufd.c:727:7: warning: variable 'ret' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
> if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:767:17: note: uninitialized use occurs here
> return ERR_PTR(ret);
> ^~~
> drivers/iommu/iommufd/iommufd.c:727:3: note: remove the 'if' if its condition is always false
> if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:727:7: warning: variable 'ret' is used uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
> if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> ^~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:767:17: note: uninitialized use occurs here
> return ERR_PTR(ret);
> ^~~
> drivers/iommu/iommufd/iommufd.c:727:7: note: remove the '||' if its condition is always false
> if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> ^~~~~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:717:9: note: initialize the variable 'ret' to silence this warning
> int ret;
> ^
> = 0
>
> Jason

2021-09-22 04:00:06

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, September 22, 2021 2:04 AM
>
> On Sun, Sep 19, 2021 at 02:38:43PM +0800, Liu Yi L wrote:
> > This patch adds interface for userspace to attach device to specified
> > IOASID.
> >
> > Note:
> > One device can only be attached to one IOASID in this version. This is
> > on par with what vfio provides today. In the future this restriction can
> > be relaxed when multiple I/O address spaces are supported per device
>
> ?? In VFIO the container is the IOS and the container can be shared
> with multiple devices. This needs to start at about the same
> functionality.

a device can only be attached to one container. One container can be
shared by multiple devices.

a device can only be attached to one IOASID. One IOASID can be shared
by multiple devices.

it does start at the same functionality.

>
> > + } else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {
>
> This should be in the core code, right? There is nothing PCI specific
> here.
>

but if you insist on a pci-wrapper attach function, we still need something
here (e.g. with .attach_ioasid() callback)?

2021-09-22 12:41:34

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Wed, Sep 22, 2021 at 01:47:05AM +0000, Tian, Kevin wrote:

> > IIRC in VFIO the container is the IOAS and when the group goes to
> > create the device fd it should simply do the
> > iommu_device_init_user_dma() followed immediately by a call to bind
> > the container IOAS as your #3.
>
> a slight correction.
>
> to meet vfio semantics we could do init_user_dma() at group attach
> time and then call binding to container IOAS when the device fd
> is created. This is because vfio requires the group in a security context
> before the device is opened.

Is it? Until a device FD is opened the group fd is kind of idle, right?

> > Ie the basic flow would see the driver core doing some:
>
> Just double confirm. Is there concern on having the driver core to
> call iommu functions?

It is always an interesting question, but I'd say iommu is
foundational to Linux and if it needs driver core help it shouldn't
be any different from PM, pinctrl, or other subsystems that have
inserted themselves into the driver core.

Something kind of like the below.

If I recall, once it is done like this then the entire iommu notifier
infrastructure can be ripped out which is a lot of code.


diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 68ea1f949daa90..e39612c99c6123 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -566,6 +566,10 @@ static int really_probe(struct device *dev, struct device_driver *drv)
 		goto done;
 	}
 
+	ret = iommu_set_kernel_ownership(dev);
+	if (ret)
+		return ret;
+
 re_probe:
 	dev->driver = drv;
 
@@ -673,6 +677,7 @@ static int really_probe(struct device *dev, struct device_driver *drv)
 		dev->pm_domain->dismiss(dev);
 	pm_runtime_reinit(dev);
 	dev_pm_set_driver_flags(dev, 0);
+	iommu_release_kernel_ownership(dev);
 done:
 	return ret;
 }
@@ -1214,6 +1219,7 @@ static void __device_release_driver(struct device *dev, struct device *parent)
 		dev->pm_domain->dismiss(dev);
 	pm_runtime_reinit(dev);
 	dev_pm_set_driver_flags(dev, 0);
+	iommu_release_kernel_ownership(dev);
 
 	klist_remove(&dev->p->knode_driver);
 	device_pm_check_callbacks(dev);

2021-09-22 13:00:06

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID

On Wed, Sep 22, 2021 at 03:56:18AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, September 22, 2021 2:04 AM
> >
> > On Sun, Sep 19, 2021 at 02:38:43PM +0800, Liu Yi L wrote:
> > > This patch adds interface for userspace to attach device to specified
> > > IOASID.
> > >
> > > Note:
> > > One device can only be attached to one IOASID in this version. This is
> > > on par with what vfio provides today. In the future this restriction can
> > > be relaxed when multiple I/O address spaces are supported per device
> >
> > ?? In VFIO the container is the IOS and the container can be shared
> > with multiple devices. This needs to start at about the same
> > functionality.
>
> a device can be only attached to one container. One container can be
> shared by multiple devices.
>
> a device can be only attached to one IOASID. One IOASID can be shared
> by multiple devices.
>
> it does start at the same functionality.
>
> >
> > > + } else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {
> >
> > This should be in the core code, right? There is nothing PCI specific
> > here.
> >
>
> but if you insist on a pci-wrapper attach function, we still need something
> here (e.g. with .attach_ioasid() callback)?

I would like to stop adding ioctls to this switch; the core code
should decode the ioctl and call a per-ioctl op like every other
subsystem does.

If you do that then you could have an op

.attach_ioasid = vfio_full_device_attach,

And that is it for driver changes.

Every driver that uses type1 today should be updated to have the above
line and will work with iommufd. mdevs will not be updated and won't
work.

Jason

2021-09-22 13:58:36

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, September 22, 2021 8:40 PM
>
> On Wed, Sep 22, 2021 at 01:47:05AM +0000, Tian, Kevin wrote:
>
> > > IIRC in VFIO the container is the IOAS and when the group goes to
> > > create the device fd it should simply do the
> > > iommu_device_init_user_dma() followed immediately by a call to bind
> > > the container IOAS as your #3.
> >
> > a slight correction.
> >
> > to meet vfio semantics we could do init_user_dma() at group attach
> > time and then call binding to container IOAS when the device fd
> > is created. This is because vfio requires the group in a security context
> > before the device is opened.
>
> Is it? Until a device FD is opened the group fd is kind of idle, right?

yes, then there is no user-tangible difference between init_user_dma()
at group attach time vs. doing it when opening the device fd. But the latter
does require more change than the former, as it also needs the vfio iommu
driver to provide a .device_attach callback.

What's in my mind now is to keep the existing group attach sequence,
which further calls a group-version init_user_dma(). Then when the
device fd is created, just create an iommu_dev object and switch to
the normal fops.

>
> > > Ie the basic flow would see the driver core doing some:
> >
> > Just double confirm. Is there concern on having the driver core to
> > call iommu functions?
>
> It is always an interesting question, but I'd say iommu is
> foundantional to Linux and if it needs driver core help it shouldn't
> be any different from PM, pinctl, or other subsystems that have
> inserted themselves into the driver core.
>
> Something kind of like the below.
>
> If I recall, once it is done like this then the entire iommu notifier
> infrastructure can be ripped out which is a lot of code.

thanks for the guidance. will think more along this direction...

>
>
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 68ea1f949daa90..e39612c99c6123 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -566,6 +566,10 @@ static int really_probe(struct device *dev, struct device_driver *drv)
> goto done;
> }
>
> + ret = iommu_set_kernel_ownership(dev);
> + if (ret)
> + return ret;
> +
> re_probe:
> dev->driver = drv;
>
> @@ -673,6 +677,7 @@ static int really_probe(struct device *dev, struct device_driver *drv)
> dev->pm_domain->dismiss(dev);
> pm_runtime_reinit(dev);
> dev_pm_set_driver_flags(dev, 0);
> + iommu_release_kernel_ownership(dev);
> done:
> return ret;
> }
> @@ -1214,6 +1219,7 @@ static void __device_release_driver(struct device *dev, struct device *parent)
> dev->pm_domain->dismiss(dev);
> pm_runtime_reinit(dev);
> dev_pm_set_driver_flags(dev, 0);
> + iommu_release_kernel_ownership(dev);
>
> klist_remove(&dev->p->knode_driver);
> device_pm_check_callbacks(dev);

2021-09-22 14:19:29

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, September 22, 2021 8:59 PM
>
> On Wed, Sep 22, 2021 at 03:56:18AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Wednesday, September 22, 2021 2:04 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:43PM +0800, Liu Yi L wrote:
> > > > This patch adds interface for userspace to attach device to specified
> > > > IOASID.
> > > >
> > > > Note:
> > > > One device can only be attached to one IOASID in this version. This is
> > > > on par with what vfio provides today. In the future this restriction can
> > > > be relaxed when multiple I/O address spaces are supported per device
> > >
> > > ?? In VFIO the container is the IOS and the container can be shared
> > > with multiple devices. This needs to start at about the same
> > > functionality.
> >
> > a device can be only attached to one container. One container can be
> > shared by multiple devices.
> >
> > a device can be only attached to one IOASID. One IOASID can be shared
> > by multiple devices.
> >
> > it does start at the same functionality.
> >
> > >
> > > > + } else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {
> > >
> > > This should be in the core code, right? There is nothing PCI specific
> > > here.
> > >
> >
> > but if you insist on a pci-wrapper attach function, we still need something
> > here (e.g. with .attach_ioasid() callback)?
>
> I would like to stop adding ioctls to this switch, the core code
> should decode the ioctl and call an per-ioctl op like every other
> subsystem does..
>
> If you do that then you could have an op
>
> .attach_ioasid = vfio_full_device_attach,
>
> And that is it for driver changes.
>
> Every driver that use type1 today should be updated to have the above
> line and will work with iommufd. mdevs will not be updated and won't
> work.
>

will do.

2021-09-22 21:21:03

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices

On Wed, 22 Sep 2021 01:19:08 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Alex Williamson <[email protected]>
> > Sent: Wednesday, September 22, 2021 5:09 AM
> >
> > On Tue, 21 Sep 2021 13:40:01 -0300
> > Jason Gunthorpe <[email protected]> wrote:
> >
> > > On Sun, Sep 19, 2021 at 02:38:33PM +0800, Liu Yi L wrote:
> > > > This patch exposes the device-centric interface for vfio-pci devices. To
> > > > be compatible with existing users, vfio-pci exposes both legacy group
> > > > interface and device-centric interface.
> > > >
> > > > As explained in last patch, this change doesn't apply to devices which
> > > > cannot be forced to snoop cache by their upstream iommu. Such devices
> > > > are still expected to be opened via the legacy group interface.
> >
> > This doesn't make much sense to me. The previous patch indicates
> > there's work to be done in updating the kvm-vfio contract to understand
> > DMA coherency, so you're trying to limit use cases to those where the
> > IOMMU enforces coherency, but there's QEMU work to be done to support
> > the iommufd uAPI at all. Isn't part of that work to understand how KVM
> > will be told about non-coherent devices rather than "meh, skip it in the
> > kernel"? Also let's not forget that vfio is not only for KVM.
>
> The policy here is that VFIO will not expose such devices (no enforce-snoop)
> in the new device hierarchy at all. In this case QEMU will fall back to the
> group interface automatically and then rely on the existing contract to connect
> vfio and QEMU. It doesn't need to care about the whatever new contract
> until such devices are exposed in the new interface.
>
> yes, vfio is not only for KVM. But here it's more a task split based on staging
> consideration. imo it's not necessary to further split task into supporting
> non-snoop device for userspace driver and then for kvm.

Patch 10 introduces an iommufd interface for QEMU to learn whether the
IOMMU enforces DMA coherency, at that point QEMU could revert to the
legacy interface, or register the iommufd with KVM, or otherwise
establish non-coherent DMA with KVM as necessary. We're adding cruft
to the kernel here to enforce an unnecessary limitation.

If there are reasons the kernel can't support the device interface,
that's a valid reason not to present the interface, but this seems like
picking a specific gap that userspace is already able to detect from
this series at the expense of other use cases. Thanks,

Alex

2021-09-22 23:51:18

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices

> From: Alex Williamson <[email protected]>
> Sent: Thursday, September 23, 2021 5:17 AM
>
> On Wed, 22 Sep 2021 01:19:08 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Alex Williamson <[email protected]>
> > > Sent: Wednesday, September 22, 2021 5:09 AM
> > >
> > > On Tue, 21 Sep 2021 13:40:01 -0300
> > > Jason Gunthorpe <[email protected]> wrote:
> > >
> > > > On Sun, Sep 19, 2021 at 02:38:33PM +0800, Liu Yi L wrote:
> > > > > This patch exposes the device-centric interface for vfio-pci devices. To
> > > > > be compatible with existing users, vfio-pci exposes both legacy group
> > > > > interface and device-centric interface.
> > > > >
> > > > > As explained in last patch, this change doesn't apply to devices which
> > > > > cannot be forced to snoop cache by their upstream iommu. Such devices
> > > > > are still expected to be opened via the legacy group interface.
> > >
> > > This doesn't make much sense to me. The previous patch indicates
> > > there's work to be done in updating the kvm-vfio contract to understand
> > > DMA coherency, so you're trying to limit use cases to those where the
> > > IOMMU enforces coherency, but there's QEMU work to be done to
> support
> > > the iommufd uAPI at all. Isn't part of that work to understand how KVM
> > > will be told about non-coherent devices rather than "meh, skip it in the
> > > kernel"? Also let's not forget that vfio is not only for KVM.
> >
> > The policy here is that VFIO will not expose such devices (no enforce-snoop)
> > in the new device hierarchy at all. In this case QEMU will fall back to the
> > group interface automatically and then rely on the existing contract to
> connect
> > vfio and QEMU. It doesn't need to care about the whatever new contract
> > until such devices are exposed in the new interface.
> >
> > yes, vfio is not only for KVM. But here it's more a task split based on staging
> > consideration. imo it's not necessary to further split task into supporting
> > non-snoop device for userspace driver and then for kvm.
>
> Patch 10 introduces an iommufd interface for QEMU to learn whether the
> IOMMU enforces DMA coherency, at that point QEMU could revert to the
> legacy interface, or register the iommufd with KVM, or otherwise
> establish non-coherent DMA with KVM as necessary. We're adding cruft
> to the kernel here to enforce an unnecessary limitation.
>
> If there are reasons the kernel can't support the device interface,
> that's a valid reason not to present the interface, but this seems like
> picking a specific gap that userspace is already able to detect from
> this series at the expense of other use cases. Thanks,
>

I see your point now. Yes, I agree that the kernel cruft is an unnecessary
limitation here. The user should rely on the device/iommufd capability
to decide whether non-coherent DMA should go through the legacy or
the new interface.

Thanks
Kevin

2021-09-27 09:45:11

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, September 22, 2021 8:40 PM
>
> > > Ie the basic flow would see the driver core doing some:
> >
> > Just double confirm. Is there concern on having the driver core to
> > call iommu functions?
>
> It is always an interesting question, but I'd say iommu is
> foundantional to Linux and if it needs driver core help it shouldn't
> be any different from PM, pinctl, or other subsystems that have
> inserted themselves into the driver core.
>
> Something kind of like the below.
>
> If I recall, once it is done like this then the entire iommu notifier
> infrastructure can be ripped out which is a lot of code.

Currently vfio is the only user of this notifier mechanism. Now
three events are handled in vfio_iommu_group_notifier():

NOTIFY_ADD_DEVICE: this is basically for some sanity check. suppose
not required once we handle it cleanly in the iommu/driver core.

NOTIFY_BOUND_DRIVER: the BUG_ON() logic to be fixed by this change.

NOTIFY_UNBOUND_DRIVER: still needs some thoughts. Based on
the comments the group->unbound_list is used to avoid breaking
group viability check between vfio_unregister_group_dev() and
final dev/drv teardown. Within that small window the device is
not tracked by the vfio group but is still bound to a driver (e.g.
vfio-pci itself), while an external group user may hold a reference
to the group. Possibly it's not required now with the new mechanism,
as we rely on init/exit_user_dma() as the single switch to claim/
withdraw the group ownership. As long as exit_user_dma() is not
called until vfio_group_release(), the above small window is covered,
thus no need to maintain an unbound_list.

But anyway since this corner case is tricky, will think more in case
of any oversight.

>
>
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 68ea1f949daa90..e39612c99c6123 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -566,6 +566,10 @@ static int really_probe(struct device *dev, struct
> device_driver *drv)
> goto done;
> }
>
> + ret = iommu_set_kernel_ownership(dev);
> + if (ret)
> + return ret;
> +
> re_probe:
> dev->driver = drv;
>
> @@ -673,6 +677,7 @@ static int really_probe(struct device *dev, struct
> device_driver *drv)
> dev->pm_domain->dismiss(dev);
> pm_runtime_reinit(dev);
> dev_pm_set_driver_flags(dev, 0);
> + iommu_release_kernel_ownership(dev);
> done:
> return ret;
> }
> @@ -1214,6 +1219,7 @@ static void __device_release_driver(struct device
> *dev, struct device *parent)
> dev->pm_domain->dismiss(dev);
> pm_runtime_reinit(dev);
> dev_pm_set_driver_flags(dev, 0);
> + iommu_release_kernel_ownership(dev);
>
> klist_remove(&dev->p->knode_driver);
> device_pm_check_callbacks(dev);

I expanded the above into the conceptual draft below. Please help check
whether it matches your thought:

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 68ea1f9..826a651 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -566,6 +566,10 @@ static int really_probe(struct device *dev, struct device_driver *drv)
goto done;
}

+ ret = iommu_device_set_dma_hint(dev, drv->dma_hint);
+ if (ret)
+ return ret;
+
re_probe:
dev->driver = drv;

@@ -673,6 +677,7 @@ static int really_probe(struct device *dev, struct device_driver *drv)
dev->pm_domain->dismiss(dev);
pm_runtime_reinit(dev);
dev_pm_set_driver_flags(dev, 0);
+ iommu_device_clear_dma_hint(dev);
done:
return ret;
}
@@ -1214,6 +1219,7 @@ static void __device_release_driver(struct device *dev, struct device *parent)
dev->pm_domain->dismiss(dev);
pm_runtime_reinit(dev);
dev_pm_set_driver_flags(dev, 0);
+ iommu_device_clear_dma_hint(dev);

klist_remove(&dev->p->knode_driver);
device_pm_check_callbacks(dev);
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3303d70..b12f335 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1064,6 +1064,104 @@ void iommu_group_put(struct iommu_group *group)
}
EXPORT_SYMBOL_GPL(iommu_group_put);

+static int iommu_dev_viable(struct device *dev, void *data)
+{
+ enum dma_hint hint = *(enum dma_hint *)data;
+ struct device_driver *drv = READ_ONCE(dev->driver);
+
+ /* no conflict if the new device doesn't do DMA */
+ if (hint == DMA_FOR_NONE)
+ return 0;
+
+ /* no conflict if this device is driver-less, or doesn't do DMA */
+ if (!drv || (drv->dma_hint == DMA_FOR_NONE))
+ return 0;
+
+ /* kernel dma and user dma are exclusive */
+ if (hint != drv->dma_hint)
+ return -EINVAL;
+
+ /*
+ * devices in the group could be bound to different user-dma
+ * drivers (e.g. vfio-pci, vdpa, etc.), or even bound to the
+ * same driver but eventually opened via different mechanisms
+ * (e.g. vfio group vs. nongroup interfaces). We rely on
+ * iommu_{group/device}_init_user_dma() to ensure exclusive
+ * user-dma ownership (iommufd ctx, vfio container ctx, etc.)
+ * in such scenario.
+ */
+ return 0;
+}
+
+static int __iommu_group_viable(struct iommu_group *group, enum dma_hint hint)
+{
+ return __iommu_group_for_each_dev(group, &hint, iommu_dev_viable);
+}
+
+int iommu_device_set_dma_hint(struct device *dev, enum dma_hint hint)
+{
+ struct iommu_group *group;
+ int ret;
+
+ group = iommu_group_get(dev);
+ /* not an iommu-probed device */
+ if (!group)
+ return 0;
+
+ mutex_lock(&group->mutex);
+ ret = __iommu_group_viable(group, hint);
+ mutex_unlock(&group->mutex);
+
+ iommu_group_put(group);
+ return ret;
+}
+
+/* any housekeeping? */
+void iommu_device_clear_dma_hint(struct device *dev) {}
+
+/* claim group ownership for user-dma */
+int __iommu_group_init_user_dma(struct iommu_group *group,
+ unsigned long owner)
+{
+ int ret;
+
+ ret = __iommu_group_viable(group, DMA_FOR_USER);
+ if (ret)
+ goto out;
+
+ /* other logic for exclusive user_dma ownership and refcounting */
+out:
+ return ret;
+}
+
+int iommu_group_init_user_dma(struct iommu_group *group,
+ unsigned long owner)
+{
+ int ret;
+
+ mutex_lock(&group->mutex);
+ ret = __iommu_group_init_user_dma(group, owner);
+ mutex_unlock(&group->mutex);
+ return ret;
+}
+
+int iommu_device_init_user_dma(struct device *dev,
+ unsigned long owner)
+{
+ struct iommu_group *group = iommu_group_get(dev);
+ int ret;
+
+ if (!group)
+ return -ENODEV;
+
+ mutex_lock(&group->mutex);
+ ret = __iommu_group_init_user_dma(group, owner);
+ mutex_unlock(&group->mutex);
+ iommu_group_put(group);
+ return ret;
+}
+
/**
* iommu_group_register_notifier - Register a notifier for group changes
* @group: the group to watch
diff --git a/drivers/pci/pci-stub.c b/drivers/pci/pci-stub.c
index e408099..4568811 100644
--- a/drivers/pci/pci-stub.c
+++ b/drivers/pci/pci-stub.c
@@ -36,6 +36,9 @@ static int pci_stub_probe(struct pci_dev *dev, const struct pci_device_id *id)
.name = "pci-stub",
.id_table = NULL, /* only dynamic id's */
.probe = pci_stub_probe,
+ .driver = {
+ .dma_hint = DMA_FOR_NONE,
+ },
};

static int __init pci_stub_init(void)
diff --git a/drivers/vdpa/ifcvf/ifcvf_main.c b/drivers/vdpa/ifcvf/ifcvf_main.c
index dcd648e..a613b78 100644
--- a/drivers/vdpa/ifcvf/ifcvf_main.c
+++ b/drivers/vdpa/ifcvf/ifcvf_main.c
@@ -678,6 +678,9 @@ static void ifcvf_remove(struct pci_dev *pdev)
.id_table = ifcvf_pci_ids,
.probe = ifcvf_probe,
.remove = ifcvf_remove,
+ .driver = {
+ .dma_hint = DMA_FOR_USER,
+ },
};

module_pci_driver(ifcvf_driver);
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index a5ce92b..61c422d 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -193,6 +193,9 @@ static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
.remove = vfio_pci_remove,
.sriov_configure = vfio_pci_sriov_configure,
.err_handler = &vfio_pci_core_err_handlers,
+ .driver = {
+ .dma_hint = DMA_FOR_USER,
+ },
};

static void __init vfio_pci_fill_ids(void)
diff --git a/include/linux/device/driver.h b/include/linux/device/driver.h
index a498ebc..6bddfd2 100644
--- a/include/linux/device/driver.h
+++ b/include/linux/device/driver.h
@@ -48,6 +48,17 @@ enum probe_type {
};

/**
+ * enum dma_hint - device driver dma hint
+ * Device drivers may provide hints for whether dma is
+ * intended for a kernel driver, a user driver, or not required.
+ */
+enum dma_hint {
+ DMA_FOR_KERNEL,
+ DMA_FOR_USER,
+ DMA_FOR_NONE,
+};
+
+/**
* struct device_driver - The basic device driver structure
* @name: Name of the device driver.
* @bus: The bus which the device of this driver belongs to.
@@ -101,6 +112,7 @@ struct device_driver {

bool suppress_bind_attrs; /* disables bind/unbind via sysfs */
enum probe_type probe_type;
+ enum dma_hint dma_hint;

const struct of_device_id *of_match_table;
const struct acpi_device_id *acpi_match_table;

Thanks
Kevin


2021-09-27 11:36:01

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On 2021/9/27 17:42, Tian, Kevin wrote:
> +int iommu_device_set_dma_hint(struct device *dev, enum dma_hint hint)
> +{
> + struct iommu_group *group;
> + int ret;
> +
> + group = iommu_group_get(dev);
> + /* not an iommu-probed device */
> + if (!group)
> + return 0;
> +
> + mutex_lock(&group->mutex);
> + ret = __iommu_group_viable(group, hint);
> + mutex_unlock(&group->mutex);
> +
> + iommu_group_put(group);
> + return ret;
> +}

Conceptually, we could also move iommu_deferred_attach() from
iommu_dma_ops here to save unnecessary checks in the hot DMA API
paths?

Best regards,
baolu

2021-09-27 11:56:09

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Mon, Sep 27, 2021 at 09:42:58AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, September 22, 2021 8:40 PM
> >
> > > > Ie the basic flow would see the driver core doing some:
> > >
> > > Just double confirm. Is there concern on having the driver core to
> > > call iommu functions?
> >
> > It is always an interesting question, but I'd say iommu is
> > foundantional to Linux and if it needs driver core help it shouldn't
> > be any different from PM, pinctl, or other subsystems that have
> > inserted themselves into the driver core.
> >
> > Something kind of like the below.
> >
> > If I recall, once it is done like this then the entire iommu notifier
> > infrastructure can be ripped out which is a lot of code.
>
> Currently vfio is the only user of this notifier mechanism. Now
> three events are handled in vfio_iommu_group_notifier():
>
> NOTIFY_ADD_DEVICE: this is basically for some sanity check. suppose
> not required once we handle it cleanly in the iommu/driver core.
>
> NOTIFY_BOUND_DRIVER: the BUG_ON() logic to be fixed by this change.
>
> NOTIFY_UNBOUND_DRIVER: still needs some thoughts. Based on
> the comments the group->unbound_list is used to avoid breaking

I have a patch series to delete the unbound_list, the scenario you
describe is handled by the device_lock()

> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 68ea1f9..826a651 100644
> +++ b/drivers/base/dd.c
> @@ -566,6 +566,10 @@ static int really_probe(struct device *dev, struct device_driver *drv)
> goto done;
> }
>
> + ret = iommu_device_set_dma_hint(dev, drv->dma_hint);
> + if (ret)
> + return ret;

I think for such a narrow usage you should not change the struct
device_driver. Just have pci_stub call a function to flip back to user
mode.

> +static int iommu_dev_viable(struct device *dev, void *data)
> +{
> + enum dma_hint hint = *data;
> + struct device_driver *drv = READ_ONCE(dev->driver);

Especially since this isn't locked properly or safe.

Jason

2021-09-27 13:04:01

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

> From: Jason Gunthorpe <[email protected]>
> Sent: Monday, September 27, 2021 7:54 PM
>
> On Mon, Sep 27, 2021 at 09:42:58AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Wednesday, September 22, 2021 8:40 PM
> > >
> > > > > Ie the basic flow would see the driver core doing some:
> > > >
> > > > Just double confirm. Is there concern on having the driver core to
> > > > call iommu functions?
> > >
> > > It is always an interesting question, but I'd say iommu is
> > > foundantional to Linux and if it needs driver core help it shouldn't
> > > be any different from PM, pinctl, or other subsystems that have
> > > inserted themselves into the driver core.
> > >
> > > Something kind of like the below.
> > >
> > > If I recall, once it is done like this then the entire iommu notifier
> > > infrastructure can be ripped out which is a lot of code.
> >
> > Currently vfio is the only user of this notifier mechanism. Now
> > three events are handled in vfio_iommu_group_notifier():
> >
> > NOTIFY_ADD_DEVICE: this is basically for some sanity check. suppose
> > not required once we handle it cleanly in the iommu/driver core.
> >
> > NOTIFY_BOUND_DRIVER: the BUG_ON() logic to be fixed by this change.
> >
> > NOTIFY_UNBOUND_DRIVER: still needs some thoughts. Based on
> > the comments the group->unbound_list is used to avoid breaking
>
> I have a patch series to delete the unbound_list, the scenario you
> describe is handled by the device_lock()

that's great!

>
> > diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> > index 68ea1f9..826a651 100644
> > +++ b/drivers/base/dd.c
> > @@ -566,6 +566,10 @@ static int really_probe(struct device *dev, struct
> device_driver *drv)
> > goto done;
> > }
> >
> > + ret = iommu_device_set_dma_hint(dev, drv->dma_hint);
> > + if (ret)
> > + return ret;
>
> I think for such a narrow usage you should not change the struct
> device_driver. Just have pci_stub call a function to flip back to user
> mode.

Here we want to ensure that kernel dma should be blocked
if the group is already marked for user-dma. If we just blindly
do it for any driver at this point (as you commented earlier):

+ ret = iommu_set_kernel_ownership(dev);
+ if (ret)
+ return ret;

how would pci-stub reach its function to indicate that it doesn't
do dma and flip back?

Do you envision a simpler policy that no driver can be bound
to the group if it's already set for user-dma? what about vfio-pci
itself?

>
> > +static int iommu_dev_viable(struct device *dev, void *data)
> > +{
> > + enum dma_hint hint = *data;
> > + struct device_driver *drv = READ_ONCE(dev->driver);
>
> Especially since this isn't locked properly or safe.

I have the same worry when copying from vfio. Not sure how
vfio gets safe with this approach...

Thanks
Kevin

2021-09-27 13:11:17

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

> From: Lu Baolu <[email protected]>
> Sent: Monday, September 27, 2021 7:34 PM
>
> On 2021/9/27 17:42, Tian, Kevin wrote:
> > +int iommu_device_set_dma_hint(struct device *dev, enum dma_hint hint)
> > +{
> > + struct iommu_group *group;
> > + int ret;
> > +
> > + group = iommu_group_get(dev);
> > + /* not an iommu-probed device */
> > + if (!group)
> > + return 0;
> > +
> > + mutex_lock(&group->mutex);
> > + ret = __iommu_group_viable(group, hint);
> > + mutex_unlock(&group->mutex);
> > +
> > + iommu_group_put(group);
> > + return ret;
> > +}
>
> Conceptually, we could also move iommu_deferred_attach() from
> iommu_dma_ops here to save unnecessary checks in the hot DMA API
> paths?
>

Yes, it's possible. But just out of curiosity, why doesn't the iommu core
manage deferred_attach when receiving the BOUND_DRIVER event?
Is there some other implication that deferred attach cannot be done
at driver binding time?

Thanks
Kevin

2021-09-27 13:11:50

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Mon, Sep 27, 2021 at 01:00:08PM +0000, Tian, Kevin wrote:

> > I think for such a narrow usage you should not change the struct
> > device_driver. Just have pci_stub call a function to flip back to user
> > mode.
>
> Here we want to ensure that kernel dma should be blocked
> if the group is already marked for user-dma. If we just blindly
> do it for any driver at this point (as you commented earlier):
>
> + ret = iommu_set_kernel_ownership(dev);
> + if (ret)
> + return ret;
>
> how would pci-stub reach its function to indicate that it doesn't
> do dma and flip back?

> Do you envision a simpler policy that no driver can be bound
> to the group if it's already set for user-dma? what about vfio-pci
> itself?

Yes.. I'm not sure there is a good use case to allow the stub drivers
to load/unload while a VFIO is running. At least, not a strong enough
one to justify a global change to the driver core..

> > > +static int iommu_dev_viable(struct device *dev, void *data)
> > > +{
> > > + enum dma_hint hint = *data;
> > > + struct device_driver *drv = READ_ONCE(dev->driver);
> >
> > Especially since this isn't locked properly or safe.
>
> I have the same worry when copying from vfio. Not sure how
> vfio gets safe with this approach...

Fixing the locking in vfio_dev_viable is part of deleting the unbound
list. Once it properly uses the device_lock and doesn't race with the
driver core like this things are much better. Don't copy this stuff
into the iommu core without fixing it.

https://github.com/jgunthorpe/linux/commit/fa6abb318ccca114da12c0b5b123c99131ace926
https://github.com/jgunthorpe/linux/commit/45980bd90b023d1eea56df70d1c395bdf4cc7cf1

I can't remember if the above is contingent on some of the mdev
cleanups or not.. Have to get back to it.

Jason

2021-09-27 13:36:46

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

> From: Jason Gunthorpe
> Sent: Monday, September 27, 2021 9:10 PM
>
> On Mon, Sep 27, 2021 at 01:00:08PM +0000, Tian, Kevin wrote:
>
> > > I think for such a narrow usage you should not change the struct
> > > device_driver. Just have pci_stub call a function to flip back to user
> > > mode.
> >
> > Here we want to ensure that kernel dma should be blocked
> > if the group is already marked for user-dma. If we just blindly
> > do it for any driver at this point (as you commented earlier):
> >
> > + ret = iommu_set_kernel_ownership(dev);
> > + if (ret)
> > + return ret;
> >
> > how would pci-stub reach its function to indicate that it doesn't
> > do dma and flip back?
>
> > Do you envision a simpler policy that no driver can be bound
> > to the group if it's already set for user-dma? what about vfio-pci
> > itself?
>
> Yes.. I'm not sure there is a good use case to allow the stub drivers
> to load/unload while a VFIO is running. At least, not a strong enough
> one to justify a global change to the driver core..

I'm fine with not loading pci-stub. From the very 1st commit msg
looks pci-stub was introduced before vfio to prevent host driver
loading when doing device assignment with KVM. I'm not sure
whether other usages are built on pci-stub later, but in general it's
not good to position devices in a same group into different usages.

but I'm little worried that even vfio-pci itself cannot be bound now,
which implies that all devices in a group which are intended to be
used by the user must be bound to vfio-pci in a breath before the
user attempts to open any of them, i.e. late-binding and device-
hotplug is disallowed after the initial open. I'm not sure how
important such an usage would be, but it does cause user-tangible
semantics change.

Alex?

>
> > > > +static int iommu_dev_viable(struct device *dev, void *data)
> > > > +{
> > > > + enum dma_hint hint = *data;
> > > > + struct device_driver *drv = READ_ONCE(dev->driver);
> > >
> > > Especially since this isn't locked properly or safe.
> >
> > I have the same worry when copying from vfio. Not sure how
> > vfio gets safe with this approach...
>
> Fixing the locking in vfio_dev_viable is part of deleting the unbound
> list. Once it properly uses the device_lock and doesn't race with the
> driver core like this things are much better. Don't copy this stuff
> into the iommu core without fixing it.

sure. The above was just quickly-baked sample code to match your
thought.

>
> https://github.com/jgunthorpe/linux/commit/fa6abb318ccca114da12c0b5b123c99131ace926
> https://github.com/jgunthorpe/linux/commit/45980bd90b023d1eea56df70d1c395bdf4cc7cf1
>
> I can't remember if the above is contingent on some of the mdev
> cleanups or not.. Have to get back to it.
>

my home network has some problem accessing the above links. Will check
them tomorrow and follow the fix when working on the formal change in
the iommu core.

Thanks
Kevin

2021-09-27 14:42:30

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Mon, Sep 27, 2021 at 01:32:34PM +0000, Tian, Kevin wrote:

> but I'm little worried that even vfio-pci itself cannot be bound now,
> which implies that all devices in a group which are intended to be
> used by the user must be bound to vfio-pci in a breath before the
> user attempts to open any of them, i.e. late-binding and device-
> hotplug is disallowed after the initial open. I'm not sure how
> important such an usage would be, but it does cause user-tangible
> semantics change.

Oh, that's bad..

I guess your approach is the only way forward, it will have to be
extensively justified in the commit message for Greg et al.

Jason

2021-09-27 15:11:23

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Mon, Sep 27, 2021 at 09:42:58AM +0000, Tian, Kevin wrote:

> +static int iommu_dev_viable(struct device *dev, void *data)
> +{
> + enum dma_hint hint = *data;
> + struct device_driver *drv = READ_ONCE(dev->driver);
> +
> + /* no conflict if the new device doesn't do DMA */
> + if (hint == DMA_FOR_NONE)
> + return 0;
> +
> + /* no conflict if this device is driver-less, or doesn't do DMA */
> + if (!drv || (drv->dma_hint == DMA_FOR_NONE))
> + return 0;

While it is kind of clever to fetch this in the drv like this, the
locking just doesn't work right.

The group itself needs to have an atomic that encodes what state it is
in. You can read the initial state from the drv, under the
device_lock, and update the atomic state
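One way to model the atomic group state suggested here is a single word encoding the current owner type plus a count of devices bound in that mode, so a conflicting bind fails with one compare-and-swap instead of walking sibling devices. The encoding, names, and userspace stdatomic usage below are all illustrative, not a kernel implementation:

```c
#include <assert.h>
#include <stdatomic.h>

/* Owner type lives in the low 2 bits, the bind count above them. */
enum dma_owner { DMA_OWNER_NONE, DMA_OWNER_KERNEL, DMA_OWNER_USER };

#define STATE(owner, count)	(((unsigned int)(count) << 2) | (unsigned int)(owner))
#define STATE_OWNER(s)		((s) & 3u)
#define STATE_COUNT(s)		((s) >> 2)

static int group_claim_dma(atomic_uint *state, enum dma_owner want)
{
	unsigned int old = atomic_load(state);

	for (;;) {
		unsigned int cur = STATE_OWNER(old);

		if (cur != DMA_OWNER_NONE && cur != (unsigned int)want)
			return -1;	/* kernel vs. user dma conflict */
		if (atomic_compare_exchange_weak(state, &old,
				STATE(want, STATE_COUNT(old) + 1)))
			return 0;	/* claimed, or joined same mode */
	}
}

static void group_release_dma(atomic_uint *state)
{
	unsigned int old = atomic_load(state);

	for (;;) {
		unsigned int cnt = STATE_COUNT(old) - 1;
		unsigned int next = cnt ? STATE(STATE_OWNER(old), cnt) : 0;

		if (atomic_compare_exchange_weak(state, &old, next))
			return;
	}
}
```

The initial per-driver mode would still be read from the driver under the device_lock, as suggested; the atomic only tracks the group's aggregate state.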

Also, don't call it "hint", there is nothing hinty about this, it has
definitive functional impacts.

Greg will want to see a definite benefit from this extra global code,
so be sure to explain why the BUG_ON is bad, and how driver core
involvement is needed to fix it properly.

Jason

2021-09-27 19:21:06

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Mon, 27 Sep 2021 13:32:34 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Jason Gunthorpe
> > Sent: Monday, September 27, 2021 9:10 PM
> >
> > On Mon, Sep 27, 2021 at 01:00:08PM +0000, Tian, Kevin wrote:
> >
> > > > I think for such a narrow usage you should not change the struct
> > > > device_driver. Just have pci_stub call a function to flip back to user
> > > > mode.
> > >
> > > Here we want to ensure that kernel dma should be blocked
> > > if the group is already marked for user-dma. If we just blindly
> > > do it for any driver at this point (as you commented earlier):
> > >
> > > + ret = iommu_set_kernel_ownership(dev);
> > > + if (ret)
> > > + return ret;
> > >
> > > how would pci-stub reach its function to indicate that it doesn't
> > > do dma and flip back?
> >
> > > Do you envision a simpler policy that no driver can be bound
> > > to the group if it's already set for user-dma? what about vfio-pci
> > > itself?
> >
> > Yes.. I'm not sure there is a good use case to allow the stub drivers
> > to load/unload while a VFIO is running. At least, not a strong enough
> > one to justify a global change to the driver core..
>
> I'm fine with not loading pci-stub. From the very 1st commit msg
> looks pci-stub was introduced before vfio to prevent host driver
> loading when doing device assignment with KVM. I'm not sure
> whether other usages are built on pci-stub later, but in general it's
> not good to position devices in a same group into different usages.

IIRC, pci-stub was invented for legacy KVM device assignment because
KVM was never an actual device driver, it just latched onto and started
using the device. If there was an existing driver for the device then
KVM would fail to get device resources. Therefore the device needed to
be unbound from its standard host driver, but that left it susceptible
to driver loads usurping the device. Therefore pci-stub came along to
essentially claim the device on behalf of KVM.

With vfio, there are a couple use cases of pci-stub that can be
interesting. The first is that pci-stub is generally built into the
kernel, not as a module, which provides users the ability to specify a
list of ids for pci-stub to claim on the kernel command line with
higher priority than loadable modules. This can prevent default driver
bindings to devices until tools like driverctl or boot time scripting
gets a shot to load the user designated driver for a device.

The other use case, is that if a group is composed of multiple devices
and all those devices are bound to vfio drivers, then the user can gain
direct access to each of those devices. If we wanted to insert a
barrier to restrict user access to certain devices within a group, we'd
suggest binding those devices to pci-stub. Obviously within a group, it
may still be possible to manipulate the device via p2p DMA, but the
barrier is much higher and device, if not platform, specific to
manipulate such devices. An example use case might be a chipset
Ethernet controller grouped among system management function in a
multi-function root complex integrated endpoint.

> but I'm little worried that even vfio-pci itself cannot be bound now,
> which implies that all devices in a group which are intended to be
> used by the user must be bound to vfio-pci in a breath before the
> user attempts to open any of them, i.e. late-binding and device-
> hotplug is disallowed after the initial open. I'm not sure how
> important such an usage would be, but it does cause user-tangible
> semantics change.

Yep, a high potential to break userspace, especially as pci-stub has
been recommended for the cases noted above. I don't expect that tools
like libvirt manage unassigned devices within a group, but that
probably means that there are all sorts of ad-hoc user mechanisms
beyond simply assigning all the devices. Thanks,

Alex

2021-09-28 07:14:53

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

> From: Jason Gunthorpe <[email protected]>
> Sent: Monday, September 27, 2021 10:40 PM
>
> On Mon, Sep 27, 2021 at 01:32:34PM +0000, Tian, Kevin wrote:
>
> > but I'm little worried that even vfio-pci itself cannot be bound now,
> > which implies that all devices in a group which are intended to be
> > used by the user must be bound to vfio-pci in a breath before the
> > user attempts to open any of them, i.e. late-binding and device-
> > hotplug is disallowed after the initial open. I'm not sure how
> > important such an usage would be, but it does cause user-tangible
> > semantics change.
>
> Oh, that's bad..
>
> I guess your approach is the only way forward, it will have to be
> extensively justified in the commit message for Greg et al.
>

Just thought about another alternative. What about having the driver
core call iommu after call_driver_probe()?

call_driver_probe()
pci_stub_probe()
iommu_set_dma_mode(dev, DMA_NONE);
iommu_check_dma_mode(dev);

The default dma mode is DMA_KERNEL. The above allows a driver to opt
for DMA_NONE or DMA_USER w/o changing the device_driver
structure. Right after probe() is completed, we check whether the dma
mode of this device is allowed by the iommu core (based on the recorded
dma mode info of sibling devices in the same group). If not, then fail
the binding.
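The proposed flow can be sketched as a small userspace model: the device defaults to DMA_KERNEL, probe() may opt into another mode, and the driver core validates the result against the group's recorded mode right after probe() returns. The function and type names follow the proposal above but are hypothetical, not an existing kernel API:

```c
#include <assert.h>

enum dma_mode { DMA_KERNEL, DMA_USER, DMA_NONE };

struct dev_model {
	enum dma_mode mode;		/* this device's requested mode */
	enum dma_mode *group_mode;	/* mode recorded for the group */
};

static void iommu_set_dma_mode(struct dev_model *dev, enum dma_mode mode)
{
	dev->mode = mode;		/* called by probe() to opt out/in */
}

static int iommu_check_dma_mode(struct dev_model *dev)
{
	if (dev->mode == DMA_NONE)
		return 0;		/* no-DMA drivers never conflict */
	if (*dev->group_mode != dev->mode)
		return -1;		/* kernel vs. user dma conflict */
	return 0;
}

/* What the driver core would do around call_driver_probe(). */
static int driver_bind(struct dev_model *dev, enum dma_mode requested)
{
	dev->mode = DMA_KERNEL;			/* default before probe() */
	iommu_set_dma_mode(dev, requested);	/* probe() runs */
	return iommu_check_dma_mode(dev);	/* non-zero fails the bind */
}
```

This keeps the device_driver structure untouched while still letting pci-stub (DMA_NONE) bind into a user-owned group.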

Thanks
Kevin

2021-09-28 07:32:38

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

> From: Jason Gunthorpe <[email protected]>
> Sent: Monday, September 27, 2021 11:09 PM
>
> On Mon, Sep 27, 2021 at 09:42:58AM +0000, Tian, Kevin wrote:
>
> > +static int iommu_dev_viable(struct device *dev, void *data)
> > +{
> > + enum dma_hint hint = *data;
> > + struct device_driver *drv = READ_ONCE(dev->driver);
> > +
> > + /* no conflict if the new device doesn't do DMA */
> > + if (hint == DMA_FOR_NONE)
> > + return 0;
> > +
> > + /* no conflict if this device is driver-less, or doesn't do DMA */
> > + if (!drv || (drv->dma_hint == DMA_FOR_NONE))
> > + return 0;
>
> While it is kind of clever to fetch this in the drv like this, the
> locking just doesn't work right.
>
> The group itself needs to have an atomic that encodes what state it is
> in. You can read the initial state from the drv, under the
> device_lock, and update the atomic state

will do.

>
> Also, don't call it "hint", there is nothing hinty about this, it has
> definitive functional impacts.

possibly dma_mode (too broad?) or dma_usage

>
> Greg will want to see a definiate benefit from this extra global code,
> so be sure to explain about why the BUG_ON is bad, and how driver core
> involvement is needed to fix it properly.
>

Sure. And I plan to at least have the patches aligned in this loop first,
before rolling up to Greg. Better we all confirm it's the right approach
with all corner cases covered and then involve Greg to help judge
a clean driver core change.

Thanks
Kevin

2021-09-28 07:45:28

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

> From: Alex Williamson <[email protected]>
> Sent: Tuesday, September 28, 2021 3:20 AM
>
> On Mon, 27 Sep 2021 13:32:34 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Jason Gunthorpe
> > > Sent: Monday, September 27, 2021 9:10 PM
> > >
> > > On Mon, Sep 27, 2021 at 01:00:08PM +0000, Tian, Kevin wrote:
> > >
> > > > > I think for such a narrow usage you should not change the struct
> > > > > device_driver. Just have pci_stub call a function to flip back to user
> > > > > mode.
> > > >
> > > > Here we want to ensure that kernel dma should be blocked
> > > > if the group is already marked for user-dma. If we just blindly
> > > > do it for any driver at this point (as you commented earlier):
> > > >
> > > > + ret = iommu_set_kernel_ownership(dev);
> > > > + if (ret)
> > > > + return ret;
> > > >
> > > > how would pci-stub reach its function to indicate that it doesn't
> > > > do dma and flip back?
> > >
> > > > Do you envision a simpler policy that no driver can be bound
> > > > to the group if it's already set for user-dma? what about vfio-pci
> > > > itself?
> > >
> > > Yes.. I'm not sure there is a good use case to allow the stub drivers
> > > to load/unload while a VFIO is running. At least, not a strong enough
> > > one to justify a global change to the driver core..
> >
> > I'm fine with not loading pci-stub. From the very 1st commit msg
> > looks pci-stub was introduced before vfio to prevent host driver
> > loading when doing device assignment with KVM. I'm not sure
> > whether other usages are built on pci-stub later, but in general it's
> > not good to position devices in a same group into different usages.
>
> IIRC, pci-stub was invented for legacy KVM device assignment because
> KVM was never an actual device driver, it just latched onto and started
> using the device. If there was an existing driver for the device then
> KVM would fail to get device resources. Therefore the device needed to
> be unbound from its standard host driver, but that left it susceptible
> to driver loads usurping the device. Therefore pci-stub came along to
> essentially claim the device on behalf of KVM.
>
> With vfio, there are a couple use cases of pci-stub that can be
> interesting. The first is that pci-stub is generally built into the
> kernel, not as a module, which provides users the ability to specify a
> list of ids for pci-stub to claim on the kernel command line with
> higher priority than loadable modules. This can prevent default driver
> bindings to devices until tools like driverctl or boot time scripting
> gets a shot to load the user designated driver for a device.
>
> The other use case, is that if a group is composed of multiple devices
> and all those devices are bound to vfio drivers, then the user can gain
> direct access to each of those devices. If we wanted to insert a
> barrier to restrict user access to certain devices within a group, we'd
> suggest binding those devices to pci-stub. Obviously within a group, it
> may still be possible to manipulate the device via p2p DMA, but the
> barrier is much higher and device, if not platform, specific to
> manipulate such devices. An example use case might be a chipset
> Ethernet controller grouped among system management function in a
> multi-function root complex integrated endpoint.

Thanks for the background. It perfectly reflects how many tricky things
vfio has evolved to deal with, and we'll dig them out again in this
refactoring process with your help. :)

just a nit on the last example. If a system management function is
in such a group, isn't the right policy to disallow assigning any device
in this group? Even if the barrier is high, any chance of allowing the guest
to control a system management function is dangerous...

>
> > but I'm a little worried that even vfio-pci itself cannot be bound now,
> > which implies that all devices in a group which are intended to be
> > used by the user must be bound to vfio-pci in one breath before the
> > user attempts to open any of them, i.e. late-binding and device-
> > hotplug is disallowed after the initial open. I'm not sure how
> > important such a usage would be, but it does cause a user-tangible
> > semantics change.
>
> Yep, a high potential to break userspace, especially as pci-stub has
> been recommended for the cases noted above. I don't expect that tools
> like libvirt manage unassigned devices within a group, but that
> probably means that there are all sorts of ad-hoc user mechanisms
> beyond simply assigning all the devices. Thanks,
>

Thanks
Kevin

2021-09-28 11:55:32

by Jason Gunthorpe

Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Tue, Sep 28, 2021 at 07:13:01AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Monday, September 27, 2021 10:40 PM
> >
> > On Mon, Sep 27, 2021 at 01:32:34PM +0000, Tian, Kevin wrote:
> >
> > > but I'm a little worried that even vfio-pci itself cannot be bound now,
> > > which implies that all devices in a group which are intended to be
> > > used by the user must be bound to vfio-pci in one breath before the
> > > user attempts to open any of them, i.e. late-binding and device-
> > > hotplug is disallowed after the initial open. I'm not sure how
> > > important such a usage would be, but it does cause a user-tangible
> > > semantics change.
> >
> > Oh, that's bad..
> >
> > I guess your approach is the only way forward, it will have to be
> > extensively justified in the commit message for Greg et al.
> >
>
> Just thought about another alternative. What about having driver
> core to call iommu after call_driver_probe()?

Then the kernel is now already exposed to an insecure scenario, we
must not do probe if any user device is attached at all.

Jason

2021-09-28 12:01:14

by Jason Gunthorpe

Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Tue, Sep 28, 2021 at 07:30:41AM +0000, Tian, Kevin wrote:

> > Also, don't call it "hint", there is nothing hinty about this, it has
> > definitive functional impacts.
>
> possibly dma_mode (too broad?) or dma_usage

You just need a flag to specify if the driver manages DMA ownership
itself, or if it requires the driver core to setup kernel ownership

DMA_OWNER_KERNEL
DMA_OWNER_DRIVER_CONTROLLED

?

There is a bool 'suppress_bind_attrs' already, so it could be done like
this:

bool suppress_bind_attrs:1;

/* If set the driver must call iommu_XX as the first action in probe() */
bool suppress_dma_owner:1;

Which is pretty low cost.

Jason

2021-09-28 13:39:26

by Lu Baolu

Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

Hi Jason,

On 2021/9/28 19:57, Jason Gunthorpe wrote:
> On Tue, Sep 28, 2021 at 07:30:41AM +0000, Tian, Kevin wrote:
>
>>> Also, don't call it "hint", there is nothing hinty about this, it has
>>> definitive functional impacts.
>>
>> possibly dma_mode (too broad?) or dma_usage
>
> You just need a flag to specify if the driver manages DMA ownership
> itself, or if it requires the driver core to setup kernel ownership
>
> DMA_OWNER_KERNEL
> DMA_OWNER_DRIVER_CONTROLLED
>
> ?
>
> There is a bool 'suppress_bind_attrs' already, so it could be done like
> this:
>
> bool suppress_bind_attrs:1;
>
> /* If set the driver must call iommu_XX as the first action in probe() */
> bool suppress_dma_owner:1;
>
> Which is pretty low cost.

Yes. Pretty low cost to fix the BUG_ON() issue. Any kernel-DMA driver
binding is blocked if the device's iommu group has been put into user-
dma mode.

Another issue is, when putting a device into user-dma mode, all devices
belonging to the same iommu group shouldn't be bound to a kernel-dma
driver. Kevin's prototype checks this via READ_ONCE(dev->driver), which is
not lock-safe, as discussed below:

https://lore.kernel.org/linux-iommu/[email protected]/

Any guidance on this?

Best regards,
baolu

2021-09-28 14:09:36

by Jason Gunthorpe

Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Tue, Sep 28, 2021 at 09:35:05PM +0800, Lu Baolu wrote:
> Another issue is, when putting a device into user-dma mode, all devices
> belonging to the same iommu group shouldn't be bound with a kernel-dma
> driver. Kevin's prototype checks this by READ_ONCE(dev->driver). This is
> not lock safe as discussed below,
>
> https://lore.kernel.org/linux-iommu/[email protected]/
>
> Any guidance on this?

Something like this?


int iommu_set_device_dma_owner(struct device *dev, enum device_dma_owner mode,
                               struct file *user_owner)
{
        struct iommu_group *group = group_from_dev(dev);

        spin_lock(&group->dma_owner_lock);
        switch (mode) {
        case DMA_OWNER_KERNEL:
                if (group->dma_users[DMA_OWNER_USERSPACE]) {
                        spin_unlock(&group->dma_owner_lock);
                        return -EBUSY;
                }
                break;
        case DMA_OWNER_SHARED:
                break;
        case DMA_OWNER_USERSPACE:
                if (group->dma_users[DMA_OWNER_KERNEL]) {
                        spin_unlock(&group->dma_owner_lock);
                        return -EBUSY;
                }
                if (group->dma_owner_file != user_owner) {
                        if (group->dma_users[DMA_OWNER_USERSPACE]) {
                                spin_unlock(&group->dma_owner_lock);
                                return -EPERM;
                        }
                        get_file(user_owner);
                        group->dma_owner_file = user_owner;
                }
                break;
        default:
                spin_unlock(&group->dma_owner_lock);
                return -EINVAL;
        }
        group->dma_users[mode]++;
        spin_unlock(&group->dma_owner_lock);
        return 0;
}

void iommu_release_device_dma_owner(struct device *dev,
                                    enum device_dma_owner mode)
{
        struct iommu_group *group = group_from_dev(dev);

        spin_lock(&group->dma_owner_lock);
        if (WARN_ON(!group->dma_users[mode]))
                goto out_unlock;
        if (!--group->dma_users[mode] && mode == DMA_OWNER_USERSPACE) {
                fput(group->dma_owner_file);
                group->dma_owner_file = NULL;
        }
out_unlock:
        spin_unlock(&group->dma_owner_lock);
}


Where, the driver core does before probe:

iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)

pci_stub/etc does in their probe func:

iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)

And vfio/iommfd does when a struct vfio_device FD is attached:

iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE, group_file/iommu_file)

Jason

2021-09-28 16:29:32

by Alex Williamson

Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Tue, 28 Sep 2021 07:43:36 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Alex Williamson <[email protected]>
> > Sent: Tuesday, September 28, 2021 3:20 AM
> >
> > On Mon, 27 Sep 2021 13:32:34 +0000
> > "Tian, Kevin" <[email protected]> wrote:
> >
> > > > From: Jason Gunthorpe
> > > > Sent: Monday, September 27, 2021 9:10 PM
> > > >
> > > > On Mon, Sep 27, 2021 at 01:00:08PM +0000, Tian, Kevin wrote:
> > > >
> > > > > > I think for such a narrow usage you should not change the struct
> > > > > > device_driver. Just have pci_stub call a function to flip back to user
> > > > > > mode.
> > > > >
> > > > > Here we want to ensure that kernel dma should be blocked
> > > > > if the group is already marked for user-dma. If we just blindly
> > > > > do it for any driver at this point (as you commented earlier):
> > > > >
> > > > > + ret = iommu_set_kernel_ownership(dev);
> > > > > + if (ret)
> > > > > + return ret;
> > > > >
> > > > > how would pci-stub reach its function to indicate that it doesn't
> > > > > do dma and flip back?
> > > >
> > > > > Do you envision a simpler policy that no driver can be bound
> > > > > to the group if it's already set for user-dma? what about vfio-pci
> > > > > itself?
> > > >
> > > > Yes.. I'm not sure there is a good use case to allow the stub drivers
> > > > to load/unload while a VFIO is running. At least, not a strong enough
> > > > one to justify a global change to the driver core..
> > >
> > > I'm fine with not loading pci-stub. From the very 1st commit msg
> > > looks pci-stub was introduced before vfio to prevent host driver
> > > loading when doing device assignment with KVM. I'm not sure
> > > whether other usages are built on pci-stub later, but in general it's
> > > not good to position devices in a same group into different usages.
> >
> > IIRC, pci-stub was invented for legacy KVM device assignment because
> > KVM was never an actual device driver, it just latched onto and started
> > using the device. If there was an existing driver for the device then
> > KVM would fail to get device resources. Therefore the device needed to
> > be unbound from its standard host driver, but that left it susceptible
> > to driver loads usurping the device. Therefore pci-stub came along to
> > essentially claim the device on behalf of KVM.
> >
> > With vfio, there are a couple use cases of pci-stub that can be
> > interesting. The first is that pci-stub is generally built into the
> > kernel, not as a module, which provides users the ability to specify a
> > list of ids for pci-stub to claim on the kernel command line with
> > higher priority than loadable modules. This can prevent default driver
> > bindings to devices until tools like driverctl or boot time scripting
> > gets a shot to load the user designated driver for a device.
> >
> > The other use case, is that if a group is composed of multiple devices
> > and all those devices are bound to vfio drivers, then the user can gain
> > direct access to each of those devices. If we wanted to insert a
> > barrier to restrict user access to certain devices within a group, we'd
> > suggest binding those devices to pci-stub. Obviously within a group, it
> > may still be possible to manipulate the device via p2p DMA, but the
> > barrier is much higher and device, if not platform, specific to
> > manipulate such devices. An example use case might be a chipset
> > Ethernet controller grouped among system management function in a
> > multi-function root complex integrated endpoint.
>
> Thanks for the background. It perfectly reflects how many tricky things
> that vfio has evolved to deal with and we'll dig them out again in this
> refactoring process with your help. ????
>
> just a nit on the last example. If a system management function is
> in such group, isn't the right policy is to disallow assigning any device
> in this group? Even the barrier is high, any chance of allowing the guest
> to control a system management function is dangerous...

We can advise that it's a risk, but we generally refrain from making
such policy decisions. Ideally the chipset vendor avoids
configurations that require their users to choose between functionality
and security ;) Thanks,

Alex

2021-09-29 00:02:04

by Tian, Kevin

Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, September 28, 2021 7:55 PM
>
> On Tue, Sep 28, 2021 at 07:13:01AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Monday, September 27, 2021 10:40 PM
> > >
> > > On Mon, Sep 27, 2021 at 01:32:34PM +0000, Tian, Kevin wrote:
> > >
> > > > but I'm a little worried that even vfio-pci itself cannot be bound now,
> > > > which implies that all devices in a group which are intended to be
> > > > used by the user must be bound to vfio-pci in one breath before the
> > > > user attempts to open any of them, i.e. late-binding and device-
> > > > hotplug is disallowed after the initial open. I'm not sure how
> > > > important such a usage would be, but it does cause a user-tangible
> > > > semantics change.
> > >
> > > Oh, that's bad..
> > >
> > > I guess your approach is the only way forward, it will have to be
> > > extensively justified in the commit message for Greg et al.
> > >
> >
> > Just thought about another alternative. What about having driver
> > core to call iommu after call_driver_probe()?
>
> Then the kernel is now already exposed to an insecure scenario, we
> must not do probe if any user device is attached at all.
>

Originally I thought it's fine as long as the entire probe process
is not completed. Based on your comment, I feel your concern is
that there is no guarantee the driver won't do any iommu-related
work in its probe function, thus it's insecure?

Thanks
Kevin

2021-09-29 00:40:35

by Tian, Kevin

Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, September 28, 2021 10:07 PM
>
> On Tue, Sep 28, 2021 at 09:35:05PM +0800, Lu Baolu wrote:
> > Another issue is, when putting a device into user-dma mode, all devices
> > belonging to the same iommu group shouldn't be bound with a kernel-dma
> > driver. Kevin's prototype checks this by READ_ONCE(dev->driver). This is
> > not lock safe as discussed below,
> >
> > https://lore.kernel.org/linux-iommu/[email protected]/
> >
> > Any guidance on this?
>
> Something like this?
>
>

yes, with this group-level atomic state we don't need to loop over every
dev->driver respectively.

> int iommu_set_device_dma_owner(struct device *dev, enum device_dma_owner mode,
>                                struct file *user_owner)
> {
>         struct iommu_group *group = group_from_dev(dev);
>
>         spin_lock(&group->dma_owner_lock);
>         switch (mode) {
>         case DMA_OWNER_KERNEL:
>                 if (group->dma_users[DMA_OWNER_USERSPACE]) {
>                         spin_unlock(&group->dma_owner_lock);
>                         return -EBUSY;
>                 }
>                 break;
>         case DMA_OWNER_SHARED:
>                 break;
>         case DMA_OWNER_USERSPACE:
>                 if (group->dma_users[DMA_OWNER_KERNEL]) {
>                         spin_unlock(&group->dma_owner_lock);
>                         return -EBUSY;
>                 }
>                 if (group->dma_owner_file != user_owner) {
>                         if (group->dma_users[DMA_OWNER_USERSPACE]) {
>                                 spin_unlock(&group->dma_owner_lock);
>                                 return -EPERM;
>                         }
>                         get_file(user_owner);
>                         group->dma_owner_file = user_owner;
>                 }
>                 break;
>         default:
>                 spin_unlock(&group->dma_owner_lock);
>                 return -EINVAL;
>         }
>         group->dma_users[mode]++;
>         spin_unlock(&group->dma_owner_lock);
>         return 0;
> }
>
> void iommu_release_device_dma_owner(struct device *dev,
>                                     enum device_dma_owner mode)
> {
>         struct iommu_group *group = group_from_dev(dev);
>
>         spin_lock(&group->dma_owner_lock);
>         if (WARN_ON(!group->dma_users[mode]))
>                 goto out_unlock;
>         if (!--group->dma_users[mode] && mode == DMA_OWNER_USERSPACE) {
>                 fput(group->dma_owner_file);
>                 group->dma_owner_file = NULL;
>         }
> out_unlock:
>         spin_unlock(&group->dma_owner_lock);
> }
>
>
> Where, the driver core does before probe:
>
> iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
>
> pci_stub/etc does in their probe func:
>
> iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
>
> And vfio/iommfd does when a struct vfio_device FD is attached:
>
> iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE, group_file/iommu_file)
>

Just a nit. Per your comment in previous mail:

/* If set the driver must call iommu_XX as the first action in probe() */
bool suppress_dma_owner:1;

Following the above logic, userspace drivers won't call iommu_XX in probe().
Just want to double-confirm whether you see any issue with this
relaxed behavior. If no problem:

/* If set the driver must call iommu_XX as the first action in probe() or
* before it attempts to do DMA
*/
bool suppress_dma_owner:1;

Thanks
Kevin

2021-09-29 02:28:30

by Lu Baolu

Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On 9/28/21 10:07 PM, Jason Gunthorpe wrote:
> On Tue, Sep 28, 2021 at 09:35:05PM +0800, Lu Baolu wrote:
>> Another issue is, when putting a device into user-dma mode, all devices
>> belonging to the same iommu group shouldn't be bound with a kernel-dma
>> driver. Kevin's prototype checks this by READ_ONCE(dev->driver). This is
>> not lock safe as discussed below,
>>
>> https://lore.kernel.org/linux-iommu/[email protected]/
>>
>> Any guidance on this?
>
> Something like this?
>
>
> int iommu_set_device_dma_owner(struct device *dev, enum device_dma_owner mode,
>                                struct file *user_owner)
> {
>         struct iommu_group *group = group_from_dev(dev);
>
>         spin_lock(&group->dma_owner_lock);
>         switch (mode) {
>         case DMA_OWNER_KERNEL:
>                 if (group->dma_users[DMA_OWNER_USERSPACE]) {
>                         spin_unlock(&group->dma_owner_lock);
>                         return -EBUSY;
>                 }
>                 break;
>         case DMA_OWNER_SHARED:
>                 break;
>         case DMA_OWNER_USERSPACE:
>                 if (group->dma_users[DMA_OWNER_KERNEL]) {
>                         spin_unlock(&group->dma_owner_lock);
>                         return -EBUSY;
>                 }
>                 if (group->dma_owner_file != user_owner) {
>                         if (group->dma_users[DMA_OWNER_USERSPACE]) {
>                                 spin_unlock(&group->dma_owner_lock);
>                                 return -EPERM;
>                         }
>                         get_file(user_owner);
>                         group->dma_owner_file = user_owner;
>                 }
>                 break;
>         default:
>                 spin_unlock(&group->dma_owner_lock);
>                 return -EINVAL;
>         }
>         group->dma_users[mode]++;
>         spin_unlock(&group->dma_owner_lock);
>         return 0;
> }
>
> void iommu_release_device_dma_owner(struct device *dev,
>                                     enum device_dma_owner mode)
> {
>         struct iommu_group *group = group_from_dev(dev);
>
>         spin_lock(&group->dma_owner_lock);
>         if (WARN_ON(!group->dma_users[mode]))
>                 goto out_unlock;
>         if (!--group->dma_users[mode] && mode == DMA_OWNER_USERSPACE) {
>                 fput(group->dma_owner_file);
>                 group->dma_owner_file = NULL;
>         }
> out_unlock:
>         spin_unlock(&group->dma_owner_lock);
> }
>
>
> Where, the driver core does before probe:
>
> iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
>
> pci_stub/etc does in their probe func:
>
> iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
>
> And vfio/iommfd does when a struct vfio_device FD is attached:
>
> iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE, group_file/iommu_file)

Really good design. It also helps alleviating some pains elsewhere in
the iommu core.

Just a nit comment: we also need a DMA_OWNER_NONE, which would be set when
the driver core unbinds the driver from the device.

>
> Jason
>

Best regards,
baolu

2021-09-29 02:34:15

by Tian, Kevin

Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

> From: Lu Baolu <[email protected]>
> Sent: Wednesday, September 29, 2021 10:22 AM
>
> On 9/28/21 10:07 PM, Jason Gunthorpe wrote:
> > On Tue, Sep 28, 2021 at 09:35:05PM +0800, Lu Baolu wrote:
> >> Another issue is, when putting a device into user-dma mode, all devices
> >> belonging to the same iommu group shouldn't be bound with a kernel-dma
> >> driver. Kevin's prototype checks this by READ_ONCE(dev->driver). This is
> >> not lock safe as discussed below,
> >>
> >> https://lore.kernel.org/linux-iommu/[email protected]/
> >>
> >> Any guidance on this?
> >
> > Something like this?
> >
> >
> > int iommu_set_device_dma_owner(struct device *dev, enum device_dma_owner mode,
> >                                struct file *user_owner)
> > {
> >         struct iommu_group *group = group_from_dev(dev);
> >
> >         spin_lock(&group->dma_owner_lock);
> >         switch (mode) {
> >         case DMA_OWNER_KERNEL:
> >                 if (group->dma_users[DMA_OWNER_USERSPACE]) {
> >                         spin_unlock(&group->dma_owner_lock);
> >                         return -EBUSY;
> >                 }
> >                 break;
> >         case DMA_OWNER_SHARED:
> >                 break;
> >         case DMA_OWNER_USERSPACE:
> >                 if (group->dma_users[DMA_OWNER_KERNEL]) {
> >                         spin_unlock(&group->dma_owner_lock);
> >                         return -EBUSY;
> >                 }
> >                 if (group->dma_owner_file != user_owner) {
> >                         if (group->dma_users[DMA_OWNER_USERSPACE]) {
> >                                 spin_unlock(&group->dma_owner_lock);
> >                                 return -EPERM;
> >                         }
> >                         get_file(user_owner);
> >                         group->dma_owner_file = user_owner;
> >                 }
> >                 break;
> >         default:
> >                 spin_unlock(&group->dma_owner_lock);
> >                 return -EINVAL;
> >         }
> >         group->dma_users[mode]++;
> >         spin_unlock(&group->dma_owner_lock);
> >         return 0;
> > }
> >
> > void iommu_release_device_dma_owner(struct device *dev,
> >                                     enum device_dma_owner mode)
> > {
> >         struct iommu_group *group = group_from_dev(dev);
> >
> >         spin_lock(&group->dma_owner_lock);
> >         if (WARN_ON(!group->dma_users[mode]))
> >                 goto out_unlock;
> >         if (!--group->dma_users[mode] && mode == DMA_OWNER_USERSPACE) {
> >                 fput(group->dma_owner_file);
> >                 group->dma_owner_file = NULL;
> >         }
> > out_unlock:
> >         spin_unlock(&group->dma_owner_lock);
> > }
> >
> >
> > Where, the driver core does before probe:
> >
> > iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
> >
> > pci_stub/etc does in their probe func:
> >
> > iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
> >
> > And vfio/iommfd does when a struct vfio_device FD is attached:
> >
> > iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE, group_file/iommu_file)
>
> Really good design. It also helps alleviating some pains elsewhere in
> the iommu core.
>
> Just a nit comment, we also need DMA_OWNER_NONE which will be set when
> the driver core unbinds the driver from the device.
>

Not necessarily. NONE is simply the state where every dma_users[mode]
counter is zero.

2021-09-29 02:44:51

by Lu Baolu

Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On 9/29/21 10:29 AM, Tian, Kevin wrote:
>> From: Lu Baolu <[email protected]>
>> Sent: Wednesday, September 29, 2021 10:22 AM
>>
>> On 9/28/21 10:07 PM, Jason Gunthorpe wrote:
>>> On Tue, Sep 28, 2021 at 09:35:05PM +0800, Lu Baolu wrote:
>>>> Another issue is, when putting a device into user-dma mode, all devices
>>>> belonging to the same iommu group shouldn't be bound with a kernel-dma
>>>> driver. Kevin's prototype checks this by READ_ONCE(dev->driver). This is
>>>> not lock safe as discussed below,
>>>>
>>>> https://lore.kernel.org/linux-iommu/[email protected]/
>>>>
>>>> Any guidance on this?
>>>
>>> Something like this?
>>>
>>>
>>> int iommu_set_device_dma_owner(struct device *dev, enum device_dma_owner mode,
>>>                                struct file *user_owner)
>>> {
>>>         struct iommu_group *group = group_from_dev(dev);
>>>
>>>         spin_lock(&group->dma_owner_lock);
>>>         switch (mode) {
>>>         case DMA_OWNER_KERNEL:
>>>                 if (group->dma_users[DMA_OWNER_USERSPACE]) {
>>>                         spin_unlock(&group->dma_owner_lock);
>>>                         return -EBUSY;
>>>                 }
>>>                 break;
>>>         case DMA_OWNER_SHARED:
>>>                 break;
>>>         case DMA_OWNER_USERSPACE:
>>>                 if (group->dma_users[DMA_OWNER_KERNEL]) {
>>>                         spin_unlock(&group->dma_owner_lock);
>>>                         return -EBUSY;
>>>                 }
>>>                 if (group->dma_owner_file != user_owner) {
>>>                         if (group->dma_users[DMA_OWNER_USERSPACE]) {
>>>                                 spin_unlock(&group->dma_owner_lock);
>>>                                 return -EPERM;
>>>                         }
>>>                         get_file(user_owner);
>>>                         group->dma_owner_file = user_owner;
>>>                 }
>>>                 break;
>>>         default:
>>>                 spin_unlock(&group->dma_owner_lock);
>>>                 return -EINVAL;
>>>         }
>>>         group->dma_users[mode]++;
>>>         spin_unlock(&group->dma_owner_lock);
>>>         return 0;
>>> }
>>>
>>> void iommu_release_device_dma_owner(struct device *dev,
>>>                                     enum device_dma_owner mode)
>>> {
>>>         struct iommu_group *group = group_from_dev(dev);
>>>
>>>         spin_lock(&group->dma_owner_lock);
>>>         if (WARN_ON(!group->dma_users[mode]))
>>>                 goto out_unlock;
>>>         if (!--group->dma_users[mode] && mode == DMA_OWNER_USERSPACE) {
>>>                 fput(group->dma_owner_file);
>>>                 group->dma_owner_file = NULL;
>>>         }
>>> out_unlock:
>>>         spin_unlock(&group->dma_owner_lock);
>>> }
>>>
>>>
>>> Where, the driver core does before probe:
>>>
>>> iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
>>>
>>> pci_stub/etc does in their probe func:
>>>
>>> iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
>>>
>>> And vfio/iommfd does when a struct vfio_device FD is attached:
>>>
>>> iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE, group_file/iommu_file)
>>
>> Really good design. It also helps alleviating some pains elsewhere in
>> the iommu core.
>>
>> Just a nit comment, we also need DMA_OWNER_NONE which will be set when
>> the driver core unbinds the driver from the device.
>>
>
> Not necessarily. NONE is simply the state where every dma_users[mode]
> counter is zero.
>

Fair enough.

Best regards,
baolu

2021-09-29 05:19:15

by David Gibson

Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Sun, Sep 19, 2021 at 02:38:34PM +0800, Liu Yi L wrote:
> From: Lu Baolu <[email protected]>
>
> This extends iommu core to manage security context for passthrough
> devices. Please bear a long explanation for how we reach this design
> instead of managing it solely in iommufd like what vfio does today.
>
> Devices which cannot be isolated from each other are organized into an
> iommu group. When a device is assigned to the user space, the entire
> group must be put in a security context so that user-initiated DMAs via
> the assigned device cannot harm the rest of the system. No user access
> should be granted on a device before the security context is established
> for the group which the device belongs to.
>
> Managing the security context must meet below criteria:
>
> 1) The group is viable for user-initiated DMAs. This implies that the
> devices in the group must be either bound to a device-passthrough
> framework, or driver-less, or bound to a driver which is known safe
> (not do DMA).
>
> 2) The security context should only allow DMA to the user's memory and
> devices in this group;
>
> 3) After the security context is established for the group, the group
> viability must be continuously monitored before the user relinquishes
> all devices belonging to the group. The viability might be broken e.g.
> when a driver-less device is later bound to a driver which does DMA.
>
> 4) The security context should not be destroyed before user access
> permission is withdrawn.
>
> Existing vfio introduces explicit container/group semantics in its uAPI
> to meet above requirements. A single security context (iommu domain)
> is created per container. Attaching group to container moves the entire
> group into the associated security context, and vice versa. The user can
> open the device only after group attach. A group can be detached only
> after all devices in the group are closed. Group viability is monitored
> by listening to iommu group events.
>
> Unlike vfio, iommufd adopts a device-centric design with all group
> logistics hidden behind the fd. Binding a device to iommufd serves
> as the contract to get security context established (and vice versa
> for unbinding). One additional requirement in iommufd is to manage the
> switch between multiple security contexts due to decoupled bind/attach:
>
> 1) Open a device in "/dev/vfio/devices" with user access blocked;

Probably worth clarifying that (1) must happen for *all* devices in
the group before (2) happens for any device in the group.

> 2) Bind the device to an iommufd with an initial security context
> (an empty iommu domain which blocks dma) established for its
> group, with user access unblocked;
>
> 3) Attach the device to a user-specified ioasid (shared by all devices
> attached to this ioasid). Before attaching, the device should be first
> detached from the initial context;

So, this step can implicitly but observably change the behaviour for
other devices in the group as well. I don't love that kind of
difficult-to-predict side effect, which is why I'm *still* not totally
convinced by the device-centric model.

> 4) Detach the device from the ioasid and switch it back to the initial
> security context;

Same non-local side effect at this step, of course.

Btw, explicitly naming the "no DMA" context is probably a good idea,
rather than referring to the "initial security context" (it's
"initial" from the PoV of the iommufd, but not from the PoV of the
device fd which was likely bound to the default kernel context before
(2)).

> 5) Unbind the device from the iommufd, back to access blocked state and
> move its group out of the initial security context if it's the last
> unbound device in the group;

Maybe worth clarifying again that (5) must happen for all devices in
the group before rebinding any devices to regular kernel drivers.

>
> (multiple attach/detach could happen between 2 and 5).
>
> However existing iommu core has problem with above transition. Detach
> in step 3/4 makes the device/group re-attached to the default domain
> automatically, which opens the door for user-initiated DMAs to attack
> the rest of the system. The existing vfio doesn't have this problem as
> it combines 2/3 in one step (so does 4/5).
>
> Fixing this problem requires the iommu core to also participate in the
> security context management. Following this direction we also move group
> viability check into the iommu core, which allows iommufd to stay fully
> device-centric w/o keeping any group knowledge (combining with the
> extension to iommu_at[de]tach_device() in a latter patch).
>
> Basically two new interfaces are provided:
>
> int iommu_device_init_user_dma(struct device *dev, unsigned long owner);
> void iommu_device_exit_user_dma(struct device *dev);
>
> iommufd calls them respectively when handling device binding/unbinding
> requests.
>
> The init_user_dma() for the 1st device in a group marks the entire group
> for user-dma and establishes the initial security context (dma blocked)
> according to aforementioned criteria. As long as the group is marked for
> user-dma, auto-reattaching to default domain is disabled. Instead, upon
> detaching the group is moved back to the initial security context.
>
> The caller also provides an owner id to mark the ownership so inadvertent
> attempt from another caller on the same device can be captured. In this
> RFC iommufd will use the fd context pointer as the owner id.
>
> The exit_user_dma() for the last device in the group clears the user-dma
> mark and moves the group back to the default domain.
>
> Signed-off-by: Kevin Tian <[email protected]>
> Signed-off-by: Lu Baolu <[email protected]>
> ---
> drivers/iommu/iommu.c | 145 +++++++++++++++++++++++++++++++++++++++++-
> include/linux/iommu.h | 12 ++++
> 2 files changed, 154 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 5ea3a007fd7c..bffd84e978fb 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -45,6 +45,8 @@ struct iommu_group {
> struct iommu_domain *default_domain;
> struct iommu_domain *domain;
> struct list_head entry;
> + unsigned long user_dma_owner_id;

Using an opaque integer doesn't seem like a good idea. I think you
probably want a pointer to a suitable struct dma_owner or the like
(you could have one embedded in each iommufd struct, plus a global
static kernel_default_owner).

> + refcount_t owner_cnt;
> };
>
> struct group_device {
> @@ -86,6 +88,7 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
> static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
> static ssize_t iommu_group_store_type(struct iommu_group *group,
> const char *buf, size_t count);
> +static bool iommu_group_user_dma_viable(struct iommu_group *group);
>
> #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \
> struct iommu_group_attribute iommu_group_attr_##_name = \
> @@ -275,7 +278,11 @@ int iommu_probe_device(struct device *dev)
> */
> iommu_alloc_default_domain(group, dev);
>
> - if (group->default_domain) {
> + /*
> + * If any device in the group has been initialized for user dma,
> + * avoid attaching the default domain.
> + */
> + if (group->default_domain && !group->user_dma_owner_id) {
> ret = __iommu_attach_device(group->default_domain, dev);
> if (ret) {
> iommu_group_put(group);
> @@ -1664,6 +1671,17 @@ static int iommu_bus_notifier(struct notifier_block *nb,
> group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
> break;
> case BUS_NOTIFY_BOUND_DRIVER:
> + /*
> + * FIXME: Alternatively the attached drivers could generically
> + * indicate to the iommu layer that they are safe for keeping
> + * the iommu group user viable by calling some function around
> + * probe(). We could eliminate this gross BUG_ON() by denying
> + * probe to non-iommu-safe driver.
> + */
> + mutex_lock(&group->mutex);
> + if (group->user_dma_owner_id)
> + BUG_ON(!iommu_group_user_dma_viable(group));
> + mutex_unlock(&group->mutex);
> group_action = IOMMU_GROUP_NOTIFY_BOUND_DRIVER;
> break;
> case BUS_NOTIFY_UNBIND_DRIVER:
> @@ -2304,7 +2322,11 @@ static int __iommu_attach_group(struct iommu_domain *domain,
> {
> int ret;
>
> - if (group->default_domain && group->domain != group->default_domain)
> + /*
> + * group->domain could be NULL when a domain is detached from the
> + * group but the default_domain is not re-attached.
> + */
> + if (group->domain && group->domain != group->default_domain)
> return -EBUSY;
>
> ret = __iommu_group_for_each_dev(group, domain,
> @@ -2341,7 +2363,11 @@ static void __iommu_detach_group(struct iommu_domain *domain,
> {
> int ret;
>
> - if (!group->default_domain) {
> + /*
> + * If any device in the group has been initialized for user dma,
> + * avoid re-attaching the default domain.
> + */
> + if (!group->default_domain || group->user_dma_owner_id) {
> __iommu_group_for_each_dev(group, domain,
> iommu_group_do_detach_device);
> group->domain = NULL;
> @@ -3276,3 +3302,116 @@ int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *dat
> return ops->device_info(dev, attr, data);
> }
> EXPORT_SYMBOL_GPL(iommu_device_get_info);
> +
> +/*
> + * IOMMU core interfaces for iommufd.
> + */
> +
> +/*
> + * FIXME: We currently simply follow vfio policy to maintain the group's
> + * viability to user. Eventually, we should avoid the below hard-coded list
> + * by letting drivers indicate to the iommu layer that they are safe for
> + * keeping the iommu group's user availability.
> + */
> +static const char * const iommu_driver_allowed[] = {
> + "vfio-pci",
> + "pci-stub"
> +};
> +
> +/*
> + * An iommu group is viable for use by userspace if all devices are in
> + * one of the following states:
> + * - driver-less
> + * - bound to an allowed driver
> + * - a PCI interconnect device
> + */
> +static int device_user_dma_viable(struct device *dev, void *data)

I think this wants a "less friendly" more obviously local name.
Really the only safe way to call this is via
iommu_group_user_dma_viable(), which isn't obvious from this name.

> +{
> + struct device_driver *drv = READ_ONCE(dev->driver);
> +
> + if (!drv)
> + return 0;
> +
> + if (dev_is_pci(dev)) {
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> + return 0;
> + }
> +
> + return match_string(iommu_driver_allowed,
> + ARRAY_SIZE(iommu_driver_allowed),
> + drv->name) < 0;
> +}

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-09-29 06:50:08

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

> From: David Gibson <[email protected]>
> Sent: Wednesday, September 29, 2021 12:56 PM
>
> >
> > Unlike vfio, iommufd adopts a device-centric design with all group
> > logistics hidden behind the fd. Binding a device to iommufd serves
> > as the contract to get security context established (and vice versa
> > for unbinding). One additional requirement in iommufd is to manage the
> > switch between multiple security contexts due to decoupled bind/attach:
> >
> > 1) Open a device in "/dev/vfio/devices" with user access blocked;
>
> Probably worth clarifying that (1) must happen for *all* devices in
> the group before (2) happens for any device in the group.

No. User access is naturally blocked for other devices as long as they
are not opened yet.

>
> > 2) Bind the device to an iommufd with an initial security context
> > (an empty iommu domain which blocks dma) established for its
> > group, with user access unblocked;
> >
> > 3) Attach the device to a user-specified ioasid (shared by all devices
> > attached to this ioasid). Before attaching, the device should be first
> > detached from the initial context;
>
> So, this step can implicitly but observably change the behaviour for
> other devices in the group as well. I don't love that kind of
> difficult to predict side effect, which is why I'm *still* not totally
> convinced by the device-centric model.

Which side effect is predicted here? The user anyway needs to be
aware of such a group restriction regardless of whether it uses the
group or non-group interface.

> >
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 5ea3a007fd7c..bffd84e978fb 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -45,6 +45,8 @@ struct iommu_group {
> > struct iommu_domain *default_domain;
> > struct iommu_domain *domain;
> > struct list_head entry;
> > + unsigned long user_dma_owner_id;
>
> Using an opaque integer doesn't seem like a good idea. I think you
> probably want a pointer to a suitable struct dma_owner or the like
> (you could have one embedded in each iommufd struct, plus a global
> static kernel_default_owner).
>

For remaining comments you may want to look at the latest discussion
here:

https://lore.kernel.org/kvm/[email protected]/

It relies on a driver core change to manage group ownership gracefully.
No BUG_ON() is triggered any more for driver binding. There, an fd will
be passed in to mark the ownership.

Thanks
Kevin

2021-09-29 06:53:21

by David Gibson

[permalink] [raw]
Subject: Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

On Sun, Sep 19, 2021 at 02:38:35PM +0800, Liu Yi L wrote:
> Under the /dev/iommu model, iommufd provides the interface for I/O page
> tables management such as dma map/unmap. However, it cannot work
> independently since the device is still owned by the device-passthrough
> frameworks (VFIO, vDPA, etc.) and vice versa. Device-passthrough frameworks
> should build a connection between its device and the iommufd to delegate
> the I/O page table management affairs to iommufd.
>
> This patch introduces iommufd_[un]bind_device() helpers for the device-
> passthrough framework to build such connection. The helper functions then
> invoke iommu core (iommu_device_init/exit_user_dma()) to establish/exit
> security context for the bound device. Each successfully bound device is
> internally tracked by an iommufd_device object. This object is returned
> to the caller for subsequent attaching operations on the device as well.
>
> The caller should pass a user-provided cookie to mark the device in the
> iommufd. Later this cookie will be used to represent the device in iommufd
> uAPI, e.g. when querying device capabilities or handling per-device I/O
> page faults. One alternative is to have iommufd allocate a device label
> and return to the user. Either way works, but cookie is slightly preferred
> per earlier discussion as it may allow the user to inject faults slightly
> faster without ID->vRID lookup.
>
> iommu_[un]bind_device() functions are only used for physical devices. Other
> variants will be introduced in the future, e.g.:
>
> - iommu_[un]bind_device_pasid() for mdev/subdev which requires pasid granular
> DMA isolation;
> - iommu_[un]bind_sw_mdev() for sw mdev which relies on software measures
> instead of iommu to isolate DMA;
>
> Signed-off-by: Liu Yi L <[email protected]>
> ---
> drivers/iommu/iommufd/iommufd.c | 160 +++++++++++++++++++++++++++++++-
> include/linux/iommufd.h | 38 ++++++++
> 2 files changed, 196 insertions(+), 2 deletions(-)
> create mode 100644 include/linux/iommufd.h
>
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index 710b7e62988b..e16ca21e4534 100644
> --- a/drivers/iommu/iommufd/iommufd.c
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -16,10 +16,30 @@
> #include <linux/miscdevice.h>
> #include <linux/mutex.h>
> #include <linux/iommu.h>
> +#include <linux/iommufd.h>
> +#include <linux/xarray.h>
> +#include <asm-generic/bug.h>
>
> /* Per iommufd */
> struct iommufd_ctx {
> refcount_t refs;
> + struct mutex lock;
> + struct xarray device_xa; /* xarray of bound devices */
> +};
> +
> +/*
> + * A iommufd_device object represents the binding relationship
> + * between iommufd and device. It is created per a successful
> + * binding request from device driver. The bound device must be
> + * a physical device so far. Subdevice will be supported later
> + * (with additional PASID information). A user-assigned cookie
> + * is also recorded to mark the device in the /dev/iommu uAPI.
> + */
> +struct iommufd_device {
> + unsigned int id;
> + struct iommufd_ctx *ictx;
> + struct device *dev; /* always be the physical device */
> + u64 dev_cookie;

Why do you need both an 'id' and a 'dev_cookie'? Since they're both
unique, couldn't you just use the cookie directly as the index into
the xarray?

> };
>
> static int iommufd_fops_open(struct inode *inode, struct file *filep)
> @@ -32,15 +52,58 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)
> return -ENOMEM;
>
> refcount_set(&ictx->refs, 1);
> + mutex_init(&ictx->lock);
> + xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
> filep->private_data = ictx;
>
> return ret;
> }
>
> +static void iommufd_ctx_get(struct iommufd_ctx *ictx)
> +{
> + refcount_inc(&ictx->refs);
> +}
> +
> +static const struct file_operations iommufd_fops;
> +
> +/**
> + * iommufd_ctx_fdget - Acquires a reference to the internal iommufd context.
> + * @fd: [in] iommufd file descriptor.
> + *
> + * Returns a pointer to the iommufd context, otherwise NULL;
> + *
> + */
> +static struct iommufd_ctx *iommufd_ctx_fdget(int fd)
> +{
> + struct fd f = fdget(fd);
> + struct file *file = f.file;
> + struct iommufd_ctx *ictx;
> +
> + if (!file)
> + return NULL;
> +
> + if (file->f_op != &iommufd_fops)
> + return NULL;
> +
> + ictx = file->private_data;
> + if (ictx)
> + iommufd_ctx_get(ictx);
> + fdput(f);
> + return ictx;
> +}
> +
> +/**
> + * iommufd_ctx_put - Releases a reference to the internal iommufd context.
> + * @ictx: [in] Pointer to iommufd context.
> + *
> + */
> static void iommufd_ctx_put(struct iommufd_ctx *ictx)
> {
> - if (refcount_dec_and_test(&ictx->refs))
> - kfree(ictx);
> + if (!refcount_dec_and_test(&ictx->refs))
> + return;
> +
> + WARN_ON(!xa_empty(&ictx->device_xa));
> + kfree(ictx);
> }
>
> static int iommufd_fops_release(struct inode *inode, struct file *filep)
> @@ -86,6 +149,99 @@ static struct miscdevice iommu_misc_dev = {
> .mode = 0666,
> };
>
> +/**
> + * iommufd_bind_device - Bind a physical device marked by a device
> + * cookie to an iommu fd.
> + * @fd: [in] iommufd file descriptor.
> + * @dev: [in] Pointer to a physical device struct.
> + * @dev_cookie: [in] A cookie to mark the device in /dev/iommu uAPI.
> + *
> + * A successful bind establishes a security context for the device
> + * and returns struct iommufd_device pointer. Otherwise returns
> + * error pointer.
> + *
> + */
> +struct iommufd_device *iommufd_bind_device(int fd, struct device *dev,
> + u64 dev_cookie)
> +{
> + struct iommufd_ctx *ictx;
> + struct iommufd_device *idev;
> + unsigned long index;
> + unsigned int id;
> + int ret;
> +
> + ictx = iommufd_ctx_fdget(fd);
> + if (!ictx)
> + return ERR_PTR(-EINVAL);
> +
> + mutex_lock(&ictx->lock);
> +
> + /* check duplicate registration */
> + xa_for_each(&ictx->device_xa, index, idev) {
> + if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> + idev = ERR_PTR(-EBUSY);
> + goto out_unlock;
> + }
> + }
> +
> + idev = kzalloc(sizeof(*idev), GFP_KERNEL);
> + if (!idev) {
> + ret = -ENOMEM;
> + goto out_unlock;
> + }
> +
> + /* Establish the security context */
> + ret = iommu_device_init_user_dma(dev, (unsigned long)ictx);
> + if (ret)
> + goto out_free;
> +
> + ret = xa_alloc(&ictx->device_xa, &id, idev,
> + XA_LIMIT(IOMMUFD_DEVID_MIN, IOMMUFD_DEVID_MAX),
> + GFP_KERNEL);
> + if (ret) {
> + idev = ERR_PTR(ret);
> + goto out_user_dma;
> + }
> +
> + idev->ictx = ictx;
> + idev->dev = dev;
> + idev->dev_cookie = dev_cookie;
> + idev->id = id;
> + mutex_unlock(&ictx->lock);
> +
> + return idev;
> +out_user_dma:
> + iommu_device_exit_user_dma(idev->dev);
> +out_free:
> + kfree(idev);
> +out_unlock:
> + mutex_unlock(&ictx->lock);
> + iommufd_ctx_put(ictx);
> +
> + return ERR_PTR(ret);
> +}
> +EXPORT_SYMBOL_GPL(iommufd_bind_device);
> +
> +/**
> + * iommufd_unbind_device - Unbind a physical device from iommufd
> + *
> + * @idev: [in] Pointer to the internal iommufd_device struct.
> + *
> + */
> +void iommufd_unbind_device(struct iommufd_device *idev)
> +{
> + struct iommufd_ctx *ictx = idev->ictx;
> +
> + mutex_lock(&ictx->lock);
> + xa_erase(&ictx->device_xa, idev->id);
> + mutex_unlock(&ictx->lock);
> + /* Exit the security context */
> + iommu_device_exit_user_dma(idev->dev);
> + kfree(idev);
> + iommufd_ctx_put(ictx);
> +}
> +EXPORT_SYMBOL_GPL(iommufd_unbind_device);
> +
> static int __init iommufd_init(void)
> {
> int ret;
> diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
> new file mode 100644
> index 000000000000..1603a13937e9
> --- /dev/null
> +++ b/include/linux/iommufd.h
> @@ -0,0 +1,38 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * IOMMUFD API definition
> + *
> + * Copyright (C) 2021 Intel Corporation
> + *
> + * Author: Liu Yi L <[email protected]>
> + */
> +#ifndef __LINUX_IOMMUFD_H
> +#define __LINUX_IOMMUFD_H
> +
> +#include <linux/types.h>
> +#include <linux/errno.h>
> +#include <linux/err.h>
> +#include <linux/device.h>
> +
> +#define IOMMUFD_DEVID_MAX ((unsigned int)(0x7FFFFFFF))
> +#define IOMMUFD_DEVID_MIN 0
> +
> +struct iommufd_device;
> +
> +#if IS_ENABLED(CONFIG_IOMMUFD)
> +struct iommufd_device *
> +iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie);
> +void iommufd_unbind_device(struct iommufd_device *idev);
> +
> +#else /* !CONFIG_IOMMUFD */
> +static inline struct iommufd_device *
> +iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie)
> +{
> + return ERR_PTR(-ENODEV);
> +}
> +
> +static inline void iommufd_unbind_device(struct iommufd_device *idev)
> +{
> +}
> +#endif /* CONFIG_IOMMUFD */
> +#endif /* __LINUX_IOMMUFD_H */

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-09-29 07:14:44

by David Gibson

[permalink] [raw]
Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Wed, Sep 29, 2021 at 05:38:56AM +0000, Tian, Kevin wrote:
> > From: David Gibson <[email protected]>
> > Sent: Wednesday, September 29, 2021 12:56 PM
> >
> > >
> > > Unlike vfio, iommufd adopts a device-centric design with all group
> > > logistics hidden behind the fd. Binding a device to iommufd serves
> > > as the contract to get security context established (and vice versa
> > > for unbinding). One additional requirement in iommufd is to manage the
> > > switch between multiple security contexts due to decoupled bind/attach:
> > >
> > > 1) Open a device in "/dev/vfio/devices" with user access blocked;
> >
> > Probably worth clarifying that (1) must happen for *all* devices in
> > the group before (2) happens for any device in the group.
>
> No. User access is naturally blocked for other devices as long as they
> are not opened yet.

Uh... my point is that everything in the group has to be removed from
regular kernel drivers before we reach step (2). Is the plan that you
must do that before you can even open them? That's a reasonable
choice, but then I think you should show that step in this description
as well.

> > > 2) Bind the device to an iommufd with an initial security context
> > > (an empty iommu domain which blocks dma) established for its
> > > group, with user access unblocked;
> > >
> > > 3) Attach the device to a user-specified ioasid (shared by all devices
> > > attached to this ioasid). Before attaching, the device should be first
> > > detached from the initial context;
> >
> > So, this step can implicitly but observably change the behaviour for
> > other devices in the group as well. I don't love that kind of
> > difficult to predict side effect, which is why I'm *still* not totally
> > convinced by the device-centric model.
>
> which side-effect is predicted here? The user anyway needs to be
> aware of such group restriction regardless whether it uses group
> or nongroup interface.

Yes, exactly. And with a group interface it's obvious it has to
understand it. With the non-group interface, you can get to this
stage in ignorance of groups. It will even work as long as you are
lucky enough only to try with singleton-group devices. Then you try
it with two devices in the one group and doing (3) on device A will
implicitly change the DMA environment of device B.

(or at least, it will if they share a group because they don't have
distinguishable RIDs. That's not the only multi-device group case,
but it's one of them).

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-09-29 07:34:52

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

> From: David Gibson
> Sent: Wednesday, September 29, 2021 2:35 PM
>
> On Wed, Sep 29, 2021 at 05:38:56AM +0000, Tian, Kevin wrote:
> > > From: David Gibson <[email protected]>
> > > Sent: Wednesday, September 29, 2021 12:56 PM
> > >
> > > >
> > > > Unlike vfio, iommufd adopts a device-centric design with all group
> > > > logistics hidden behind the fd. Binding a device to iommufd serves
> > > > as the contract to get security context established (and vice versa
> > > > for unbinding). One additional requirement in iommufd is to manage
> the
> > > > switch between multiple security contexts due to decoupled
> bind/attach:
> > > >
> > > > 1) Open a device in "/dev/vfio/devices" with user access blocked;
> > >
> > > Probably worth clarifying that (1) must happen for *all* devices in
> > > the group before (2) happens for any device in the group.
> >
> > No. User access is naturally blocked for other devices as long as they
> > are not opened yet.
>
> Uh... my point is that everything in the group has to be removed from
> regular kernel drivers before we reach step (2). Is the plan that you
> must do that before you can even open them? That's a reasonable
> choice, but then I think you should show that step in this description
> as well.

Agree. I think the proposal below can meet the above requirement and
ensure it's not broken at any point while the group is operated by
userspace:

https://lore.kernel.org/kvm/[email protected]/

and definitely an updated description will be provided when sending out
the new proposal.

>
> > > > 2) Bind the device to an iommufd with an initial security context
> > > > (an empty iommu domain which blocks dma) established for its
> > > > group, with user access unblocked;
> > > >
> > > > 3) Attach the device to a user-specified ioasid (shared by all devices
> > > > attached to this ioasid). Before attaching, the device should be first
> > > > detached from the initial context;
> > >
> > > So, this step can implicitly but observably change the behaviour for
> > > other devices in the group as well. I don't love that kind of
> > > difficult to predict side effect, which is why I'm *still* not totally
> > > convinced by the device-centric model.
> >
> > which side-effect is predicted here? The user anyway needs to be
> > aware of such group restriction regardless whether it uses group
> > or nongroup interface.
>
> Yes, exactly. And with a group interface it's obvious it has to
> understand it. With the non-group interface, you can get to this
> stage in ignorance of groups. It will even work as long as you are
> lucky enough only to try with singleton-group devices. Then you try
> it with two devices in the one group and doing (3) on device A will
> implicitly change the DMA environment of device B.

For the non-group interface we can also document clearly in the uAPI
that the user must understand the group restriction; violating it will
result in failure when attaching devices in the same group to different
IOASes.

Thanks
Kevin

2021-09-29 12:27:59

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

On Wed, Sep 29, 2021 at 03:25:54PM +1000, David Gibson wrote:

> > +struct iommufd_device {
> > + unsigned int id;
> > + struct iommufd_ctx *ictx;
> > + struct device *dev; /* always be the physical device */
> > + u64 dev_cookie;
>
> Why do you need both an 'id' and a 'dev_cookie'? Since they're both
> unique, couldn't you just use the cookie directly as the index into
> the xarray?

ID is the kernel value in the xarray - xarray is much more efficient &
safe with small kernel controlled values.

dev_cookie is a user-assigned value that may not be unique. Its
purpose is to allow userspace to receive an event and go back to its
own structure. Most likely userspace will store a pointer here, but it
is also possible userspace won't use it at all.

It is a pretty normal pattern

Jason

2021-09-29 12:59:47

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Wed, Sep 29, 2021 at 04:35:19PM +1000, David Gibson wrote:

> Yes, exactly. And with a group interface it's obvious it has to
> understand it. With the non-group interface, you can get to this
> stage in ignorance of groups. It will even work as long as you are
> lucky enough only to try with singleton-group devices. Then you try
> it with two devices in the one group and doing (3) on device A will
> implicitly change the DMA environment of device B.

The security model here says this is fine.

This idea to put the iommu code in charge of security is quite clean.
As I said in the other mail, drivers attached to a 'struct device *'
tell the iommu layer what they are doing:

iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE, group_file/iommu_file)

And it decides if it is allowed.

If device A is allowed to go to userspace then security wise it is
deemed fine that B is impacted. That is what we have defined already
today.

This proposal does not free userspace from having to understand this!
The iommu_group sysfs is still there and still must be understood.

The *admin* is the one responsible for understanding the groups, not the
applications. The admin has no idea what a group FD is - they should
be looking at the sysfs and seeing the iommu_group directories.

Jason

2021-09-29 13:02:30

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Wed, Sep 29, 2021 at 12:38:35AM +0000, Tian, Kevin wrote:

> /* If set the driver must call iommu_XX as the first action in probe() or
> * before it attempts to do DMA
> */
> bool suppress_dma_owner:1;

It is not "attempts to do DMA" but more "operates the physical device
in any way"

Not having ownership means another entity could be using user space
DMA to manipulate the device state and attack the integrity of the
kernel's programming of the device.

Jason

2021-09-30 04:22:36

by David Gibson

[permalink] [raw]
Subject: Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

On Wed, Sep 29, 2021 at 09:24:57AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 03:25:54PM +1000, David Gibson wrote:
>
> > > +struct iommufd_device {
> > > + unsigned int id;
> > > + struct iommufd_ctx *ictx;
> > > + struct device *dev; /* always be the physical device */
> > > + u64 dev_cookie;
> >
> > Why do you need both an 'id' and a 'dev_cookie'? Since they're both
> > unique, couldn't you just use the cookie directly as the index into
> > the xarray?
>
> ID is the kernel value in the xarray - xarray is much more efficient &
> safe with small kernel controlled values.
>
> > dev_cookie is a user-assigned value that may not be unique. Its
> > purpose is to allow userspace to receive an event and go back to its
> > own structure. Most likely userspace will store a pointer here, but it
> > is also possible userspace won't use it at all.
>
> It is a pretty normal pattern

Hm, ok. Could you point me at an example?

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-09-30 04:22:39

by David Gibson

[permalink] [raw]
Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Wed, Sep 29, 2021 at 07:31:08AM +0000, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Wednesday, September 29, 2021 2:35 PM
> >
> > On Wed, Sep 29, 2021 at 05:38:56AM +0000, Tian, Kevin wrote:
> > > > From: David Gibson <[email protected]>
> > > > Sent: Wednesday, September 29, 2021 12:56 PM
> > > >
> > > > >
> > > > > Unlike vfio, iommufd adopts a device-centric design with all group
> > > > > logistics hidden behind the fd. Binding a device to iommufd serves
> > > > > as the contract to get security context established (and vice versa
> > > > > for unbinding). One additional requirement in iommufd is to manage
> > the
> > > > > switch between multiple security contexts due to decoupled
> > bind/attach:
> > > > >
> > > > > 1) Open a device in "/dev/vfio/devices" with user access blocked;
> > > >
> > > > Probably worth clarifying that (1) must happen for *all* devices in
> > > > the group before (2) happens for any device in the group.
> > >
> > > No. User access is naturally blocked for other devices as long as they
> > > are not opened yet.
> >
> > Uh... my point is that everything in the group has to be removed from
> > regular kernel drivers before we reach step (2). Is the plan that you
> > must do that before you can even open them? That's a reasonable
> > choice, but then I think you should show that step in this description
> > as well.
>
> Agree. I think below proposal can meet above requirement and ensure
> it's not broken in the whole process when the group is operated by the
> userspace:
>
> https://lore.kernel.org/kvm/[email protected]/
>
> and definitely an updated description will be provided when sending out
> the new proposal.
>
> >
> > > > > 2) Bind the device to an iommufd with an initial security context
> > > > > (an empty iommu domain which blocks dma) established for its
> > > > > group, with user access unblocked;
> > > > >
> > > > > 3) Attach the device to a user-specified ioasid (shared by all devices
> > > > > attached to this ioasid). Before attaching, the device should be first
> > > > > detached from the initial context;
> > > >
> > > > So, this step can implicitly but observably change the behaviour for
> > > > other devices in the group as well. I don't love that kind of
> > > > difficult to predict side effect, which is why I'm *still* not totally
> > > > convinced by the device-centric model.
> > >
> > > which side-effect is predicted here? The user anyway needs to be
> > > aware of such group restriction regardless whether it uses group
> > > or nongroup interface.
> >
> > Yes, exactly. And with a group interface it's obvious it has to
> > understand it. With the non-group interface, you can get to this
> > stage in ignorance of groups. It will even work as long as you are
> > lucky enough only to try with singleton-group devices. Then you try
> > it with two devices in the one group and doing (3) on device A will
> > implicitly change the DMA environment of device B.
>
> for non-group we can also document it obviously in uAPI that the user
> must understand group restriction and violating it will get failure
> when attaching to different IOAS's for devices in the same group.

Documenting limitations is always inferior to building them into the
actual API signatures. Sometimes it's the only option, but people
frequently don't read the docs, whereas they kind of have to look at
the API itself.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-09-30 04:28:45

by David Gibson

[permalink] [raw]
Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Wed, Sep 29, 2021 at 09:57:16AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 04:35:19PM +1000, David Gibson wrote:
>
> > Yes, exactly. And with a group interface it's obvious it has to
> > understand it. With the non-group interface, you can get to this
> > stage in ignorance of groups. It will even work as long as you are
> > lucky enough only to try with singleton-group devices. Then you try
> > it with two devices in the one group and doing (3) on device A will
> > implicitly change the DMA environment of device B.
>
> The security model here says this is fine.

I'm not making a statement about the security model; I'm making a
statement about the surprisingness of the programming interface. In
your program you have devices A & B; you perform an operation that
specifies only device A, and device B changes behaviour.

> This idea to put the iommu code in charge of security is quite clean.
> As I said in the other mail, drivers attached to a 'struct device *'
> tell the iommu layer what they are doing:
>
> iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
> iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
> iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE, group_file/iommu_file)
>
> And it decides if it is allowed.
>
> If device A is allowed to go to userspace then security wise it is
> deemed fine that B is impacted. That is what we have defined already
> today.
>
> This proposal does not free userspace from having to understand this!
> The iommu_group sysfs is still there and still must be understood.
>
> The *admin* is the one responsible for understanding the groups, not
> the applications. The admin has no idea what a group FD is - they should
> be looking at the sysfs and seeing the iommu_group directories.

Not just the admin. If an app is given two devices in the same group
to use *both* it must understand that and act accordingly.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-09-30 22:30:41

by Jason Gunthorpe

Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Thu, Sep 30, 2021 at 01:09:22PM +1000, David Gibson wrote:

> > The *admin* is the one responsible for understanding the groups, not
> > the applications. The admin has no idea what a group FD is - they should
> > be looking at the sysfs and seeing the iommu_group directories.
>
> Not just the admin. If an app is given two devices in the same group
> to use *both* it must understand that and act accordingly.

Yes, but this is true regardless of what the uAPI is, and for common
app cases where we have a single IO page table for all devices, the app
still doesn't need to care about groups, since it can just assign all
devices to the same IO page table and everything works out just fine.

For instance qemu without a vIOMMU does not need to care about
groups. It opens a single iommufd, creates a single IO page table that
maps the guest physical space and assigns every device to that IO page
table. No issue.

Only if qemu is creating a vIOMMU does it need to start to look at the
groups and ensure that the group becomes visible to the guest OS. Here
the group fd doesn't really help anything.

Jason


2021-10-01 04:22:46

by David Gibson

Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Thu, Sep 30, 2021 at 07:28:18PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 30, 2021 at 01:09:22PM +1000, David Gibson wrote:
>
> > > The *admin* is the one responsible for understanding the groups, not
> > > the applications. The admin has no idea what a group FD is - they should
> > > be looking at the sysfs and seeing the iommu_group directories.
> >
> > Not just the admin. If an app is given two devices in the same group
> > to use *both* it must understand that and act accordingly.
>
> Yes, but this is true regardless of what the uAPI is,

Yes, but formerly it was explicit and now it is implicit. Before we
said "attach this group to this container", which can reasonably be
expected to affect the whole group. Now we say "attach this device to
this IOAS" and it silently also affects other devices.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-01 13:10:14

by Jason Gunthorpe

Subject: Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

On Thu, Sep 30, 2021 at 01:10:29PM +1000, David Gibson wrote:
> On Wed, Sep 29, 2021 at 09:24:57AM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 29, 2021 at 03:25:54PM +1000, David Gibson wrote:
> >
> > > > +struct iommufd_device {
> > > > + unsigned int id;
> > > > + struct iommufd_ctx *ictx;
> > > > + struct device *dev; /* always be the physical device */
> > > > + u64 dev_cookie;
> > >
> > > Why do you need both an 'id' and a 'dev_cookie'? Since they're both
> > > unique, couldn't you just use the cookie directly as the index into
> > > the xarray?
> >
> > ID is the kernel value in the xarray - xarray is much more efficient &
> > safe with small kernel controlled values.
> >
> > dev_cookie is a user-assigned value that may not be unique. Its
> > purpose is to allow userspace to receive an event and go back to its
> > own structure. Most likely userspace will store a pointer here, but
> > it is also possible userspace won't use it.
> >
> > It is a pretty normal pattern
>
> Hm, ok. Could you point me at an example?

For instance user_data vs fd in io_uring

RDMA has many similar examples.

More or less any time you want to allow the kernel to asynchronously
return some information, providing a 64-bit user_data lets userspace
have an easier time dealing with it.

Jason


2021-10-07 01:36:41

by David Gibson

Subject: Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

On Fri, Oct 01, 2021 at 09:43:22AM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 30, 2021 at 01:10:29PM +1000, David Gibson wrote:
> > On Wed, Sep 29, 2021 at 09:24:57AM -0300, Jason Gunthorpe wrote:
> > > On Wed, Sep 29, 2021 at 03:25:54PM +1000, David Gibson wrote:
> > >
> > > > > +struct iommufd_device {
> > > > > + unsigned int id;
> > > > > + struct iommufd_ctx *ictx;
> > > > > + struct device *dev; /* always be the physical device */
> > > > > + u64 dev_cookie;
> > > >
> > > > Why do you need both an 'id' and a 'dev_cookie'? Since they're both
> > > > unique, couldn't you just use the cookie directly as the index into
> > > > the xarray?
> > >
> > > ID is the kernel value in the xarray - xarray is much more efficient &
> > > safe with small kernel controlled values.
> > >
> > > dev_cookie is a user-assigned value that may not be unique. Its
> > > purpose is to allow userspace to receive an event and go back to its
> > > own structure. Most likely userspace will store a pointer here, but
> > > it is also possible userspace won't use it.
> > >
> > > It is a pretty normal pattern
> >
> > Hm, ok. Could you point me at an example?
>
> For instance user_data vs fd in io_uring

Ok, but one of those is an fd, which is an existing type of handle.
Here we're introducing two different unique handles that aren't an
existing kernel concept.

> RDMA has many similar examples.
>
> More or less any time you want to allow the kernel to asynchronously
> return some information, providing a 64-bit user_data lets userspace
> have an easier time dealing with it.

I absolutely see the need for user_data. What I'm questioning is
having two different, user-visible unique handles, neither of which is
an fd.


That said... is there any strong reason why user_data needs to be
unique? I can imagine userspace applications where you don't care
which device the notification is coming from - or at least don't care
down to the same granularity that /dev/iommu is using. In which case
having the kernel provided unique handle and the
not-necessarily-unique user_data would make perfect sense.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-07 19:23:35

by Jason Gunthorpe

Subject: Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

On Thu, Oct 07, 2021 at 12:23:13PM +1100, David Gibson wrote:
> On Fri, Oct 01, 2021 at 09:43:22AM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 30, 2021 at 01:10:29PM +1000, David Gibson wrote:
> > > On Wed, Sep 29, 2021 at 09:24:57AM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Sep 29, 2021 at 03:25:54PM +1000, David Gibson wrote:
> > > >
> > > > > > +struct iommufd_device {
> > > > > > + unsigned int id;
> > > > > > + struct iommufd_ctx *ictx;
> > > > > > + struct device *dev; /* always be the physical device */
> > > > > > + u64 dev_cookie;
> > > > >
> > > > > Why do you need both an 'id' and a 'dev_cookie'? Since they're both
> > > > > unique, couldn't you just use the cookie directly as the index into
> > > > > the xarray?
> > > >
> > > > ID is the kernel value in the xarray - xarray is much more efficient &
> > > > safe with small kernel controlled values.
> > > >
> > > > dev_cookie is a user-assigned value that may not be unique. Its
> > > > purpose is to allow userspace to receive an event and go back to its
> > > > own structure. Most likely userspace will store a pointer here, but
> > > > it is also possible userspace won't use it.
> > > >
> > > > It is a pretty normal pattern
> > >
> > > Hm, ok. Could you point me at an example?
> >
> > For instance user_data vs fd in io_uring
>
> Ok, but one of those is an fd, which is an existing type of handle.
> Here we're introducing two different unique handles that aren't an
> existing kernel concept.

I'm not sure how that matters; the kernel has many handles - and we
get to make more of them. Look at the xarray/idr users in the kernel;
many of those are making userspace handles.

> That said... is there any strong reason why user_data needs to be
> unique? I can imagine userspace applications where you don't care
> which device the notification is coming from - or at least don't care
> down to the same granularity that /dev/iommu is using. In which case
> having the kernel provided unique handle and the
> not-necessarily-unique user_data would make perfect sense.

I don't think the user_data 64-bit value should be unique; it is just
transported from user to kernel and back again. It is *not* a handle,
it is a cookie.

Handles for the kernel/user boundary should come from xarrays that
have nice lookup properties - not from user-provided 64-bit values
that have to be stored in red-black trees.

Jason

2021-10-11 05:38:08

by David Gibson

Subject: Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

On Thu, Oct 07, 2021 at 08:35:03AM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 07, 2021 at 12:23:13PM +1100, David Gibson wrote:
> > On Fri, Oct 01, 2021 at 09:43:22AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 30, 2021 at 01:10:29PM +1000, David Gibson wrote:
> > > > On Wed, Sep 29, 2021 at 09:24:57AM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, Sep 29, 2021 at 03:25:54PM +1000, David Gibson wrote:
> > > > >
> > > > > > > +struct iommufd_device {
> > > > > > > + unsigned int id;
> > > > > > > + struct iommufd_ctx *ictx;
> > > > > > > + struct device *dev; /* always be the physical device */
> > > > > > > + u64 dev_cookie;
> > > > > >
> > > > > > Why do you need both an 'id' and a 'dev_cookie'? Since they're both
> > > > > > unique, couldn't you just use the cookie directly as the index into
> > > > > > the xarray?
> > > > >
> > > > > ID is the kernel value in the xarray - xarray is much more efficient &
> > > > > safe with small kernel controlled values.
> > > > >
> > > > > dev_cookie is a user-assigned value that may not be unique. Its
> > > > > purpose is to allow userspace to receive an event and go back to its
> > > > > own structure. Most likely userspace will store a pointer here, but
> > > > > it is also possible userspace won't use it.
> > > > >
> > > > > It is a pretty normal pattern
> > > >
> > > > Hm, ok. Could you point me at an example?
> > >
> > > For instance user_data vs fd in io_uring
> >
> > Ok, but one of those is an fd, which is an existing type of handle.
> > Here we're introducing two different unique handles that aren't an
> > existing kernel concept.
>
> I'm not sure how that matters, the kernel has many handles - and we
> get to make more of them.. Look at xarray/idr users in the kernel, many of
> those are making userspace handles.

Again, I'm commenting *just* on the fact that the current draft
introduces *two* handles for the same object. I have no objection to
either of the handles in isolation.

> > That said... is there any strong reason why user_data needs to be
> > unique? I can imagine userspace applications where you don't care
> > which device the notification is coming from - or at least don't care
> > down to the same granularity that /dev/iommu is using. In which case
> > having the kernel provided unique handle and the
> > not-necessarily-unique user_data would make perfect sense.
>
> I don't think the user_data 64-bit value should be unique; it is just
> transported from user to kernel and back again. It is *not* a handle,
> it is a cookie.
>
> Handles for the kernel/user boundary should come from xarrays that
> have nice lookup properties - not from user-provided 64-bit values
> that have to be stored in red-black trees.

Yes, I think that would make more sense.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-15 10:28:52

by Tian, Kevin

Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

Hi, Jason,

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, September 29, 2021 8:59 PM
>
> On Wed, Sep 29, 2021 at 12:38:35AM +0000, Tian, Kevin wrote:
>
> > /* If set the driver must call iommu_XX as the first action in probe() or
> > * before it attempts to do DMA
> > */
> > bool suppress_dma_owner:1;
>
> It is not "attempts to do DMA" but more "operates the physical device
> in any away"
>
> Not having ownership means another entity could be using user space
> DMA to manipulate the device state and attack the integrity of the
> kernel's programming of the device.
>

Does suppress_kernel_dma sound better than suppress_dma_owner?
We found the latter caused some confusion during internal code
review. This flag really means "don't claim kernel DMA ownership
during driver binding", but suppress_dma_owner sounds as if the
entire ownership mechanism is disabled...

Another thing is about DMA_OWNER_SHARED, which is set to indicate
no DMA at all. Thinking about it more, we feel that this flag is
meaningless. Its sole purpose is to signal compatibility with any
USER/KERNEL ownership, which is essentially the same semantics as a
device that is not bound to any driver. So we plan to remove it; then
pci-stub just needs a one-line change to set the suppress flag. But we
want to check with you first in case of any oversight.

Thanks
Kevin

2021-10-15 16:51:40

by Yi Liu

Subject: RE: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, September 22, 2021 1:15 AM
>
> On Sun, Sep 19, 2021 at 02:38:35PM +0800, Liu Yi L wrote:
>
> > +/*
> > + * An iommufd_device object represents the binding relationship
> > + * between iommufd and a device. It is created upon a successful
> > + * binding request from the device driver. The bound device must be
> > + * a physical device so far. Subdevices will be supported later
> > + * (with additional PASID information). A user-assigned cookie
> > + * is also recorded to mark the device in the /dev/iommu uAPI.
> > + */
> > +struct iommufd_device {
> > + unsigned int id;
> > + struct iommufd_ctx *ictx;
> > + struct device *dev; /* always be the physical device */
> > + u64 dev_cookie;
> > };
> >
> > static int iommufd_fops_open(struct inode *inode, struct file *filep)
> > @@ -32,15 +52,58 @@ static int iommufd_fops_open(struct inode *inode,
> struct file *filep)
> > return -ENOMEM;
> >
> > refcount_set(&ictx->refs, 1);
> > + mutex_init(&ictx->lock);
> > + xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
> > filep->private_data = ictx;
> >
> > return ret;
> > }
> >
> > +static void iommufd_ctx_get(struct iommufd_ctx *ictx)
> > +{
> > + refcount_inc(&ictx->refs);
> > +}
>
> See my earlier remarks about how to structure the lifetime logic, this
> ref isn't necessary.
>
> > +static const struct file_operations iommufd_fops;
> > +
> > +/**
> > + * iommufd_ctx_fdget - Acquires a reference to the internal iommufd
> context.
> > + * @fd: [in] iommufd file descriptor.
> > + *
> > + * Returns a pointer to the iommufd context, otherwise NULL;
> > + *
> > + */
> > +static struct iommufd_ctx *iommufd_ctx_fdget(int fd)
> > +{
> > + struct fd f = fdget(fd);
> > + struct file *file = f.file;
> > + struct iommufd_ctx *ictx;
> > +
> > + if (!file)
> > + return NULL;
> > +
> > + if (file->f_op != &iommufd_fops)
> > + return NULL;
>
> Leaks the fdget
>
> > +
> > + ictx = file->private_data;
> > + if (ictx)
> > + iommufd_ctx_get(ictx);
>
> Use success oriented flow
>
> > + fdput(f);
> > + return ictx;
> > +}
>
> > + */
> > +struct iommufd_device *iommufd_bind_device(int fd, struct device *dev,
> > + u64 dev_cookie)
> > +{
> > + struct iommufd_ctx *ictx;
> > + struct iommufd_device *idev;
> > + unsigned long index;
> > + unsigned int id;
> > + int ret;
> > +
> > + ictx = iommufd_ctx_fdget(fd);
> > + if (!ictx)
> > + return ERR_PTR(-EINVAL);
> > +
> > + mutex_lock(&ictx->lock);
> > +
> > + /* check duplicate registration */
> > + xa_for_each(&ictx->device_xa, index, idev) {
> > + if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> > + idev = ERR_PTR(-EBUSY);
> > + goto out_unlock;
> > + }
>
> I can't think of a reason why this expensive check is needed.
>
> > + }
> > +
> > + idev = kzalloc(sizeof(*idev), GFP_KERNEL);
> > + if (!idev) {
> > + ret = -ENOMEM;
> > + goto out_unlock;
> > + }
> > +
> > + /* Establish the security context */
> > + ret = iommu_device_init_user_dma(dev, (unsigned long)ictx);
> > + if (ret)
> > + goto out_free;
> > +
> > + ret = xa_alloc(&ictx->device_xa, &id, idev,
> > + XA_LIMIT(IOMMUFD_DEVID_MIN,
> IOMMUFD_DEVID_MAX),
> > + GFP_KERNEL);
>
> idev should be fully initialized before being placed in the xarray, so
> this should be the last thing done.

All good suggestions above, thanks for catching them.

> Why not just use the standard xa_limit_32b instead of special single
> use constants?

Yeah, we should use xa_limit_32b.

Regards,
Yi Liu

> Jason

2021-10-15 19:56:27

by Jason Gunthorpe

Subject: Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

On Fri, Oct 15, 2021 at 01:29:16AM +0000, Tian, Kevin wrote:
> Hi, Jason,
>
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, September 29, 2021 8:59 PM
> >
> > On Wed, Sep 29, 2021 at 12:38:35AM +0000, Tian, Kevin wrote:
> >
> > > /* If set the driver must call iommu_XX as the first action in probe() or
> > > * before it attempts to do DMA
> > > */
> > > bool suppress_dma_owner:1;
> >
> > It is not "attempts to do DMA" but more "operates the physical device
> > in any away"
> >
> > Not having ownership means another entity could be using user space
> > DMA to manipulate the device state and attack the integrity of the
> > kernel's programming of the device.
> >
>
> Does suppress_kernel_dma sound better than suppress_dma_owner?
> We found the latter caused some confusion during internal code
> review. This flag really means "don't claim kernel DMA ownership
> during driver binding", but suppress_dma_owner sounds as if the
> entire ownership mechanism is disabled...

If in doubt make it

suppress_iommu_whatever_the_api_is_that_isn't_called

> Another thing is about DMA_OWNER_SHARED, which is set to indicate
> no DMA at all. Thinking about it more, we feel that this flag is
> meaningless. Its sole purpose is to signal compatibility with any
> USER/KERNEL ownership, which is essentially the same semantics as a
> device that is not bound to any driver. So we plan to remove it; then
> pci-stub just needs a one-line change to set the suppress flag. But we
> want to check with you first in case of any oversight.

It sounds reasonable, but also makes it much harder to find the few
places that have this special relationship - ie we can't grep for
DMA_OWNER_SHARED anymore.

Jason

2021-10-18 03:50:31

by Tian, Kevin

Subject: RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, October 15, 2021 7:10 PM
>
> On Fri, Oct 15, 2021 at 01:29:16AM +0000, Tian, Kevin wrote:
> > Hi, Jason,
> >
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Wednesday, September 29, 2021 8:59 PM
> > >
> > > On Wed, Sep 29, 2021 at 12:38:35AM +0000, Tian, Kevin wrote:
> > >
> > > > /* If set the driver must call iommu_XX as the first action in probe() or
> > > > * before it attempts to do DMA
> > > > */
> > > > bool suppress_dma_owner:1;
> > >
> > > It is not "attempts to do DMA" but more "operates the physical device
> > > in any away"
> > >
> > > Not having ownership means another entity could be using user space
> > > DMA to manipulate the device state and attack the integrity of the
> > > kernel's programming of the device.
> > >
> >
> > Does suppress_kernel_dma sound better than suppress_dma_owner?
> > We found the latter caused some confusion during internal code
> > review. This flag really means "don't claim kernel DMA ownership
> > during driver binding", but suppress_dma_owner sounds as if the
> > entire ownership mechanism is disabled...
>
> If in doubt make it
>
> suppress_iommu_whatever_the_api_is_that_isn't_called

ok

>
> > Another thing is about DMA_OWNER_SHARED, which is set to indicate
> > no DMA at all. Thinking about it more, we feel that this flag is
> > meaningless. Its sole purpose is to signal compatibility with any
> > USER/KERNEL ownership, which is essentially the same semantics as a
> > device that is not bound to any driver. So we plan to remove it; then
> > pci-stub just needs a one-line change to set the suppress flag. But we
> > want to check with you first in case of any oversight.
>
> It sounds reasonable, but also makes it much harder to find the few
> places that have this special relationship - ie we can't grep for
> DMA_OWNER_SHARED anymore.
>

It's probably fine. People can just search for the suppress flag and
filter out drivers with DMA_OWNER_USER. Then the remaining set is the
drivers in the SHARED category. Less straightforward, but it should be
fine for a relatively small set of drivers.

Thanks
Kevin
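
[Editor's note] Kevin's audit can be sketched as a grep pipeline. The
driver files and the suppress_kernel_dma / DMA_OWNER_USER spellings
below are made up for illustration, following the discussion rather
than any final API:

```shell
# Fake driver sources standing in for the kernel tree.
mkdir -p /tmp/drv_audit
printf '.suppress_kernel_dma = true, /* claims DMA_OWNER_USER */\n' > /tmp/drv_audit/vfio_pci.c
printf '.suppress_kernel_dma = true, /* no DMA at all */\n' > /tmp/drv_audit/pci_stub.c
printf '/* ordinary driver, kernel DMA ownership claimed */\n' > /tmp/drv_audit/e1000.c

# Every driver that opted out of kernel DMA ownership:
grep -l suppress_kernel_dma /tmp/drv_audit/*.c

# Filtering out the user-DMA drivers leaves the "shared" category:
grep -l suppress_kernel_dma /tmp/drv_audit/*.c | xargs grep -L DMA_OWNER_USER
```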

2021-10-29 07:43:49

by David Gibson

Subject: Re: [RFC 20/20] Doc: Add documentation for /dev/iommu

On Sun, Sep 19, 2021 at 02:38:48PM +0800, Liu Yi L wrote:
> Document the /dev/iommu framework for user.
>
> Open:
> Do we want to document /dev/iommu in Documentation/userspace-api/iommu.rst?
> Existing iommu.rst is for the vSVA interfaces, honestly, may need to rewrite
> this doc entirely.
>
> Signed-off-by: Kevin Tian <[email protected]>
> Signed-off-by: Liu Yi L <[email protected]>
> ---
> Documentation/userspace-api/index.rst | 1 +
> Documentation/userspace-api/iommufd.rst | 183 ++++++++++++++++++++++++
> 2 files changed, 184 insertions(+)
> create mode 100644 Documentation/userspace-api/iommufd.rst
>
> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> index 0b5eefed027e..54df5a278023 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -25,6 +25,7 @@ place where this information is gathered.
> ebpf/index
> ioctl/index
> iommu
> + iommufd
> media/index
> sysfs-platform_profile
>
> diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
> new file mode 100644
> index 000000000000..abffbb47dc02
> --- /dev/null
> +++ b/Documentation/userspace-api/iommufd.rst
> @@ -0,0 +1,183 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. iommu:
> +
> +===================
> +IOMMU Userspace API
> +===================
> +
> +Direct device access from userspace has been a critical feature in
> +high performance computing and virtualization usages. Linux now
> +includes multiple device-passthrough frameworks (e.g. VFIO and vDPA)
> +to manage secure device access from the userspace. One critical
> +task of those frameworks is to put the assigned device in a secure,
> +IOMMU-protected context so the device is prevented from doing harm
> +to the rest of the system.
> +
> +Currently those frameworks implement their own logic for managing
> +I/O page tables to isolate user-initiated DMAs. This doesn't scale
> +to support many new IOMMU features, such as PASID-granular DMA
> +remapping, nested translation, I/O page fault, IOMMU dirty bit, etc.
> +
> +The /dev/iommu framework provides a unified interface for managing
> +I/O page tables for passthrough devices. Existing passthrough
> +frameworks are expected to use this interface instead of continuing
> +their ad-hoc implementations.
> +
> +IOMMUFDs, IOASIDs, Devices and Groups
> +-------------------------------------
> +
> +The core concepts in /dev/iommu are IOMMUFDs and IOASIDs. IOMMUFD (by
> +opening /dev/iommu) is the container holding multiple I/O address
> +spaces for a user, while IOASID is the fd-local software handle
> +representing an I/O address space and associated with a single I/O
> +page table. User manages those address spaces through fd operations,
> +e.g. by using vfio type1v2 mapping semantics to manage respective
> +I/O page tables.
> +
> +IOASID is comparable to the container concept in VFIO. The latter
> +is also associated with a single I/O address space. A main difference
> +between them is that multiple IOASIDs in the same IOMMUFD can be
> +nested together (not supported yet) to allow centralized accounting
> +of locked pages, while multiple containers are disconnected thus
> +duplicated accounting is incurred. Typically one IOMMUFD is
> +sufficient for all intended IOMMU usages for a user.
> +
> +An I/O address space takes effect in the IOMMU only after it is
> +attached by a device. One I/O address space can be attached by
> +multiple devices. One device can be only attached to a single I/O
> +address space at this point (on par with current vfio behavior).
> +
> +Device must be bound to an iommufd before the attach operation can
> +be conducted. The binding operation builds the connection between
> +the devicefd (opened via device-passthrough framework) and IOMMUFD.
> +An IOMMU-protected security context is established when the binding
> +operation is completed.

This can't be quite right. You can't establish a safe security
context until all devices in the group are bound, but you can only
bind them one at a time.

> The passthrough framework must block user
> +access to the assigned device until bind() returns success.
> +
> +The entire /dev/iommu framework adopts a device-centric model w/o
> +carrying any container/group legacy as current vfio does. However
> +the group is the minimum granularity that must be used to ensure
> +secure user access (refer to vfio.rst). This framework relies on
> +the IOMMU core layer to map device-centric model into group-granular
> +isolation.
> +
> +Managing I/O Address Spaces
> +---------------------------
> +
> +When creating an I/O address space (by allocating IOASID), the user
> +must specify the type of underlying I/O page table. Currently only
> +one type (kernel-managed) is supported. In the future other types
> +will be introduced, e.g. to support user-managed I/O page table or
> +a shared I/O page table which is managed by another kernel sub-
> +system (mm, ept, etc.). Kernel-managed I/O page table is currently
> +managed via vfio type1v2 equivalent mapping semantics.
> +
> +The user also needs to specify the format of the I/O page table
> +when allocating an IOASID.

This almost seems redundant with the previous paragraph. I think
maybe it's making a distinction between "type" and "format", but I
don't think it's very clear what the distinction is.

> The format must be compatible to the
> +attached devices (or more specifically to the IOMMU which serves
> +the DMA from the attached devices). User can query the device IOMMU
> +format via IOMMUFD once a device is successfully bound. Attaching a
> +device to an IOASID with incompatible format is simply rejected.
> +
> +Currently no-snoop DMA is not supported yet. This implies that
> +IOASID must be created in an enforce-snoop format and only devices
> +which can be forced to snoop cache by IOMMU are allowed to be
> +attached to IOASID. The user should check uAPI extension and get
> +device info via IOMMUFD to handle such restriction.
> +
> +Usage Example
> +-------------
> +
> +Assume user wants to access PCI device 0000:06:0d.0, which is
> +exposed under the new /dev/vfio/devices directory by VFIO:
> +
> + /* Open device-centric interface and /dev/iommu interface */
> + device_fd = open("/dev/vfio/devices/0000:06:0d.0", O_RDWR);
> + iommu_fd = open("/dev/iommu", O_RDWR);
> +
> + /* Bind device to IOMMUFD */
> + bind_data = { .iommu_fd = iommu_fd, .dev_cookie = cookie };
> + ioctl(device_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind_data);
> +
> + /* Query per-device IOMMU capability/format */
> + info = { .dev_cookie = cookie, };
> + ioctl(iommu_fd, IOMMU_DEVICE_GET_INFO, &info);
> +
> + if (!(info.flags & IOMMU_DEVICE_INFO_ENFORCE_SNOOP)) {
> + if (!ioctl(iommu_fd, IOMMU_CHECK_EXTENSION,
> + EXT_DMA_NO_SNOOP))
> + /* No support of no-snoop DMA */
> + }
> +
> + if (!ioctl(iommu_fd, IOMMU_CHECK_EXTENSION, EXT_MAP_TYPE1V2))
> + /* No support of vfio type1v2 mapping semantics */
> +
> + /* Decides IOASID alloc fields based on info */
> + alloc_data = { .type = IOMMU_IOASID_TYPE_KERNEL,
> + .flags = IOMMU_IOASID_ENFORCE_SNOOP,
> + .addr_width = info.addr_width, };
> +
> + /* Allocate IOASID */
> + gpa_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
> +
> + /* Attach device to an IOASID */
> + at_data = { .iommu_fd = iommu_fd; .ioasid = gpa_ioasid};
> + ioctl(device_fd, VFIO_DEVICE_ATTACH_IOASID, &at_data);
> +
> + /* Setup GPA mapping [0 - 1GB] */
> + dma_map = {
> + .ioasid = gpa_ioasid,
> + .data {
> + .flags = R/W /* permission */
> + .iova = 0, /* GPA */
> + .vaddr = 0x40000000, /* HVA */
> + .size = 1GB,
> + },
> + };
> + ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);
> +
> + /* DMA */
> +
> + /* Unmap GPA mapping [0 - 1GB] */
> + dma_unmap = {
> + .ioasid = gpa_ioasid,
> + .data {
> + .iova = 0, /* GPA */
> + .size = 1GB,
> + },
> + };
> + ioctl(iommu_fd, IOMMU_UNMAP_DMA, &dma_unmap);
> +
> + /* Detach device from an IOASID */
> + dt_data = { .iommu_fd = iommu_fd; .ioasid = gpa_ioasid};
> + ioctl(device_fd, VFIO_DEVICE_DETACH_IOASID, &dt_data);
> +
> + /* Free IOASID */
> + ioctl(iommu_fd, IOMMU_IOASID_FREE, gpa_ioasid);
> +
> + close(device_fd);
> + close(iommu_fd);
> +
> +API for device-passthrough frameworks
> +-------------------------------------
> +
> +iommufd binding and IOASID attach/detach are initiated via the device-
> +passthrough framework uAPI.
> +
> +When a binding operation is requested by the user, the passthrough
> +framework should call iommufd_bind_device(). When the device fd is
> +closed by the user, iommufd_unbind_device() should be called
> +automatically::
> +
> + struct iommufd_device *
> + iommufd_bind_device(int fd, struct device *dev,
> + u64 dev_cookie);
> + void iommufd_unbind_device(struct iommufd_device *idev);
> +
> +IOASID attach/detach operations are per iommufd_device which is
> +returned by iommufd_bind_device():
> +
> + int iommufd_device_attach_ioasid(struct iommufd_device *idev,
> + int ioasid);
> + void iommufd_device_detach_ioasid(struct iommufd_device *idev,
> + int ioasid);

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-10-29 12:48:23

by Jason Gunthorpe

Subject: Re: [RFC 20/20] Doc: Add documentation for /dev/iommu

On Fri, Oct 29, 2021 at 11:15:31AM +1100, David Gibson wrote:

> > +Device must be bound to an iommufd before the attach operation can
> > +be conducted. The binding operation builds the connection between
> > +the devicefd (opened via device-passthrough framework) and IOMMUFD.
> > +An IOMMU-protected security context is established when the binding
> > +operation is completed.
>
> This can't be quite right. You can't establish a safe security
> context until all devices in the group are bound, but you can only
> bind them one at a time.

When any device is bound, the entire group is implicitly adopted by
this iommufd and the whole group enters a safe-for-userspace state.

It is symmetrical with the kernel side, which is also device-focused:
when any struct device is bound to a kernel driver, the entire group
is implicitly adopted into kernel mode.

Lu should send a patch series soon that harmonizes how this works; it
is a very nice cleanup.

Jason