2023-10-26 02:54:15

by Baolu Lu

Subject: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

Hi folks,

This series implements the functionality of delivering IO page faults to
user space through the IOMMUFD framework for nested translation. Nested
translation is a hardware feature that supports two-stage translation
tables for IOMMU. The second-stage translation table is managed by the
host VMM, while the first-stage translation table is owned by user
space. This allows user space to control the IOMMU mappings for its
devices.

When an IO page fault occurs on the first-stage translation table, the
IOMMU hardware can deliver the page fault to user space through the
IOMMUFD framework. User space can then handle the page fault and respond
to the device top-down through the IOMMUFD. This allows user space to
implement its own IO page fault handling policies.

User space indicates its capability of handling IO page faults by
setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
hardware page table (HWPT). IOMMUFD will then set up its infrastructure
for page fault delivery. On a successful return of HWPT allocation, the
user can retrieve and respond to page faults by reading and writing to
the file descriptor (FD) returned in out_fault_fd.
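
As a rough illustration, the intended user-space flow looks like this (a
minimal sketch against this series' uAPI as posted; error handling and
the nested-translation setup details are elided):

	struct iommu_hwpt_alloc cmd = {
		.size = sizeof(cmd),
		.flags = IOMMU_HWPT_ALLOC_IOPF_CAPABLE,
		.dev_id = dev_id,
		.pt_id = parent_hwpt_id,
		/* ... hwpt_type/data_len/data_uptr for the stage-1 table ... */
	};
	struct iommu_hwpt_pgfault fault;

	ioctl(iommufd, IOMMU_HWPT_ALLOC, &cmd);

	/* Fault records arrive as fixed-size structs on the new fd. */
	read(cmd.out_fault_fd, &fault, sizeof(fault));
	/* ... handle it, then write an iommu_hwpt_page_response back ... */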

The iommu selftest framework has been updated to test the IO page fault
delivery and response functionality.

This series is based on the latest implementation of nested translation
under discussion [1] and the page fault handling framework refactoring in
the IOMMU core [2].

The series and related patches are available on GitHub: [3]

[1] https://lore.kernel.org/linux-iommu/[email protected]/
[2] https://lore.kernel.org/linux-iommu/[email protected]/
[3] https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2

Best regards,
baolu

Change log:
v2:
- Moved all iommu refactoring patches into a separate series that is
discussed in its own thread. The latest version [v6] is available at
https://lore.kernel.org/linux-iommu/[email protected]/
- We discussed the timeout for pending page fault messages and agreed
that we shouldn't apply any timeout policy to page fault handling in
user space.
https://lore.kernel.org/linux-iommu/20230616113232.GA84678@myrica/
- Jason suggested that we adopt a simple file descriptor interface for
reading and responding to I/O page requests, so that user space
applications can improve performance using io_uring.
https://lore.kernel.org/linux-iommu/[email protected]/

v1: https://lore.kernel.org/linux-iommu/[email protected]/

Lu Baolu (6):
iommu: Add iommu page fault cookie helpers
iommufd: Add iommu page fault uapi data
iommufd: Initializing and releasing IO page fault data
iommufd: Deliver fault messages to user space
iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_IOPF test support
iommufd/selftest: Add coverage for IOMMU_TEST_OP_TRIGGER_IOPF

include/linux/iommu.h | 9 +
drivers/iommu/iommu-priv.h | 15 +
drivers/iommu/iommufd/iommufd_private.h | 12 +
drivers/iommu/iommufd/iommufd_test.h | 8 +
include/uapi/linux/iommufd.h | 65 +++++
tools/testing/selftests/iommu/iommufd_utils.h | 66 ++++-
drivers/iommu/io-pgfault.c | 50 ++++
drivers/iommu/iommufd/device.c | 69 ++++-
drivers/iommu/iommufd/hw_pagetable.c | 260 +++++++++++++++++-
drivers/iommu/iommufd/selftest.c | 56 ++++
tools/testing/selftests/iommu/iommufd.c | 24 +-
.../selftests/iommu/iommufd_fail_nth.c | 2 +-
12 files changed, 620 insertions(+), 16 deletions(-)

--
2.34.1


2023-10-26 02:54:21

by Baolu Lu

Subject: [PATCH v2 1/6] iommu: Add iommu page fault cookie helpers

Add an xarray to iommu_fault_param as a placeholder for the per-{device,
pasid} fault cookie. iommufd will use it to store iommufd device pointers.
This allows iommufd to quickly retrieve the device object ID for a
given {device, pasid} pair in the hot path of I/O page fault delivery.

Otherwise, iommufd would have to maintain its own data structures to
map {device, pasid} pairs to object IDs, and then look up the object ID on
the critical path, which is not performance-friendly.

iommufd is expected to set the cookie when a fault-capable domain is
attached to the physical device or PASID, and to clear the cookie when
the domain is detached.
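
For illustration, the expected calling pattern is roughly the following
(a sketch only; the locking and error handling around attach/detach are
elided):

	/* On attach of a fault-capable domain: stash the consumer cookie. */
	curr = iopf_pasid_cookie_set(dev, pasid, idev);
	if (IS_ERR(curr))
		return PTR_ERR(curr);

	/* In the fault delivery hot path: recall the cookie. */
	idev = iopf_pasid_cookie_get(dev, pasid);

	/* On detach: clear the cookie. */
	iopf_pasid_cookie_set(dev, pasid, NULL);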

Signed-off-by: Lu Baolu <[email protected]>
---
include/linux/iommu.h | 3 +++
drivers/iommu/iommu-priv.h | 15 ++++++++++++
drivers/iommu/io-pgfault.c | 50 ++++++++++++++++++++++++++++++++++++++
3 files changed, 68 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 2ca3a3eda2e4..615d8a5f9dee 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -608,6 +608,8 @@ struct iommu_device {
* @dev: the device that owns this param
* @queue: IOPF queue
* @queue_list: index into queue->devices
+ * @pasid_cookie: per-pasid fault cookie used by fault message consumers.
+ * This array is self-protected by xa_lock().
* @partial: faults that are part of a Page Request Group for which the last
* request hasn't been submitted yet.
* @faults: holds the pending faults which needs response
@@ -619,6 +621,7 @@ struct iommu_fault_param {
struct device *dev;
struct iopf_queue *queue;
struct list_head queue_list;
+ struct xarray pasid_cookie;

struct list_head partial;
struct list_head faults;
diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
index 2024a2313348..0dc5ad81cbb6 100644
--- a/drivers/iommu/iommu-priv.h
+++ b/drivers/iommu/iommu-priv.h
@@ -27,4 +27,19 @@ void iommu_device_unregister_bus(struct iommu_device *iommu,
struct bus_type *bus,
struct notifier_block *nb);

+#ifdef CONFIG_IOMMU_IOPF
+void *iopf_pasid_cookie_set(struct device *dev, ioasid_t pasid, void *cookie);
+void *iopf_pasid_cookie_get(struct device *dev, ioasid_t pasid);
+#else
+static inline void *iopf_pasid_cookie_set(struct device *dev, ioasid_t pasid, void *cookie)
+{
+ return ERR_PTR(-ENODEV);
+}
+
+static inline void *iopf_pasid_cookie_get(struct device *dev, ioasid_t pasid)
+{
+ return ERR_PTR(-ENODEV);
+}
+#endif /* CONFIG_IOMMU_IOPF */
+
#endif /* __LINUX_IOMMU_PRIV_H */
diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
index b288c73f2b22..6fa029538deb 100644
--- a/drivers/iommu/io-pgfault.c
+++ b/drivers/iommu/io-pgfault.c
@@ -454,6 +454,7 @@ int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev)
mutex_init(&fault_param->lock);
INIT_LIST_HEAD(&fault_param->faults);
INIT_LIST_HEAD(&fault_param->partial);
+ xa_init(&fault_param->pasid_cookie);
fault_param->dev = dev;
fault_param->users = 1;
list_add(&fault_param->queue_list, &queue->devices);
@@ -575,3 +576,52 @@ void iopf_queue_free(struct iopf_queue *queue)
kfree(queue);
}
EXPORT_SYMBOL_GPL(iopf_queue_free);
+
+/**
+ * iopf_pasid_cookie_set - Set a per-{device, pasid} fault cookie
+ * @dev: the device to set the cookie
+ * @pasid: the pasid on this device
+ * @cookie: the opaque data
+ *
+ * Return the old cookie on success, or ERR_PTR on failure.
+ */
+void *iopf_pasid_cookie_set(struct device *dev, ioasid_t pasid, void *cookie)
+{
+ struct iommu_fault_param *iopf_param = iopf_get_dev_fault_param(dev);
+ void *curr;
+
+ if (!iopf_param)
+ return ERR_PTR(-ENODEV);
+
+ curr = xa_store(&iopf_param->pasid_cookie, pasid, cookie, GFP_KERNEL);
+ iopf_put_dev_fault_param(iopf_param);
+
+ return xa_is_err(curr) ? ERR_PTR(xa_err(curr)) : curr;
+}
+EXPORT_SYMBOL_NS_GPL(iopf_pasid_cookie_set, IOMMUFD_INTERNAL);
+
+/**
+ * iopf_pasid_cookie_get - Get the fault cookie for {device, pasid}
+ * @dev: the device where the cookie was set
+ * @pasid: the pasid on this device
+ *
+ * Return the cookie on success, or ERR_PTR on failure. Note that NULL is
+ * also a successful return.
+ */
+void *iopf_pasid_cookie_get(struct device *dev, ioasid_t pasid)
+{
+ struct iommu_fault_param *iopf_param = iopf_get_dev_fault_param(dev);
+ void *curr;
+
+ if (!iopf_param)
+ return ERR_PTR(-ENODEV);
+
+ xa_lock(&iopf_param->pasid_cookie);
+ curr = xa_load(&iopf_param->pasid_cookie, pasid);
+ xa_unlock(&iopf_param->pasid_cookie);
+
+ iopf_put_dev_fault_param(iopf_param);
+
+ return curr;
+}
+EXPORT_SYMBOL_NS_GPL(iopf_pasid_cookie_get, IOMMUFD_INTERNAL);
--
2.34.1

2023-10-26 02:54:28

by Baolu Lu

Subject: [PATCH v2 2/6] iommufd: Add iommu page fault uapi data

Allow the user to handle IO page faults generated by the IOMMU hardware
when walking an HWPT managed by the user. One example use case
is nested translation, where the first-stage page table is managed
by user space.

When allocating a user HWPT, the user can opt in to a flag named
IOMMU_HWPT_ALLOC_IOPF_CAPABLE, which indicates that the user is capable
of handling IO page faults generated for this HWPT.

On a successful return from hwpt allocation, the user can retrieve
and respond to page faults by reading from and writing to the fd returned
in out_fault_fd. The page fault and response data are encoded in the
formats defined by struct iommu_hwpt_pgfault and
struct iommu_hwpt_page_response.

struct iommu_hwpt_pgfault largely mirrors struct iommu_fault, with some
new members such as the fault data size and the object ID of the device
from which the page fault originated.
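
For example, a minimal user-space consumer of the fd protocol could look
like this (a sketch; fault_fd and hwpt_id come from IOMMU_HWPT_ALLOC, and
error handling is elided):

	struct iommu_hwpt_pgfault fault;
	struct iommu_hwpt_page_response resp = {};

	/* Each record on the fd is one fixed-size fault message. */
	read(fault_fd, &fault, sizeof(fault));

	/* ... resolve the fault in the first-stage page table ... */

	/* Echo the identifying fields back with a response code. */
	resp.size = sizeof(resp);
	resp.hwpt_id = hwpt_id;
	resp.dev_id = fault.dev_id;
	resp.pasid = fault.pasid;
	resp.grpid = fault.grpid;
	resp.code = 0;		/* 0: Successful */
	write(fault_fd, &resp, sizeof(resp));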

Signed-off-by: Lu Baolu <[email protected]>
---
include/uapi/linux/iommufd.h | 65 ++++++++++++++++++++++++++++++++++++
1 file changed, 65 insertions(+)

diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index f9b8b95b36b2..0f00f1dfcded 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -355,9 +355,17 @@ struct iommu_vfio_ioas {
* @IOMMU_HWPT_ALLOC_NEST_PARENT: If set, allocate a domain which can serve
* as the parent domain in the nesting
* configuration.
+ * @IOMMU_HWPT_ALLOC_IOPF_CAPABLE: User is capable of handling IO page faults.
+ * On successful return, user can retrieve
+ * faults by reading @out_fault_fd and respond
+ * to them by writing to it. The fault data is
+ * encoded in the format defined by
+ * iommu_hwpt_pgfault. The response data format
+ * is defined by iommu_hwpt_page_response.
*/
enum iommufd_hwpt_alloc_flags {
IOMMU_HWPT_ALLOC_NEST_PARENT = 1 << 0,
+ IOMMU_HWPT_ALLOC_IOPF_CAPABLE = 1 << 1,
};

/**
@@ -476,6 +484,7 @@ struct iommu_hwpt_alloc {
__u32 hwpt_type;
__u32 data_len;
__aligned_u64 data_uptr;
+ __u32 out_fault_fd;
};
#define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)

@@ -679,6 +688,62 @@ struct iommu_dev_data_arm_smmuv3 {
__u32 sid;
};

+/**
+ * struct iommu_hwpt_pgfault - iommu page fault data
+ * @size: sizeof(struct iommu_hwpt_pgfault)
+ * @flags: Combination of IOMMU_PGFAULT_FLAGS_ flags.
+ * - PASID_VALID: @pasid field is valid
+ * - LAST_PAGE: the last page fault in a group
+ * - PRIV_DATA: @private_data field is valid
+ * - RESP_NEEDS_PASID: the page response must have the same
+ * PASID value as the page request.
+ * @dev_id: id of the originated device
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @perm: requested page permissions (IOMMU_PGFAULT_PERM_* values)
+ * @addr: page address
+ * @private_data: device-specific private information
+ */
+struct iommu_hwpt_pgfault {
+ __u32 size;
+ __u32 flags;
+#define IOMMU_PGFAULT_FLAGS_PASID_VALID (1 << 0)
+#define IOMMU_PGFAULT_FLAGS_LAST_PAGE (1 << 1)
+#define IOMMU_PGFAULT_FLAGS_PRIV_DATA (1 << 2)
+#define IOMMU_PGFAULT_FLAGS_RESP_NEEDS_PASID (1 << 3)
+ __u32 dev_id;
+ __u32 pasid;
+ __u32 grpid;
+ __u32 perm;
+#define IOMMU_PGFAULT_PERM_READ (1 << 0)
+#define IOMMU_PGFAULT_PERM_WRITE (1 << 1)
+#define IOMMU_PGFAULT_PERM_EXEC (1 << 2)
+#define IOMMU_PGFAULT_PERM_PRIV (1 << 3)
+ __u64 addr;
+ __u64 private_data[2];
+};
+
+/**
+ * struct iommu_hwpt_page_response - IOMMU page fault response
+ * @size: sizeof(struct iommu_hwpt_page_response)
+ * @flags: Must be set to 0
+ * @hwpt_id: hwpt ID of target hardware page table for the response
+ * @dev_id: device ID of target device for the response
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @code: response code. The supported codes include:
+ * 0: Successful; 1: Response Failure; 2: Invalid Request.
+ */
+struct iommu_hwpt_page_response {
+ __u32 size;
+ __u32 flags;
+ __u32 hwpt_id;
+ __u32 dev_id;
+ __u32 pasid;
+ __u32 grpid;
+ __u32 code;
+};
+
/**
* struct iommu_set_dev_data - ioctl(IOMMU_SET_DEV_DATA)
* @size: sizeof(struct iommu_set_dev_data)
--
2.34.1

2023-10-26 02:54:31

by Baolu Lu

Subject: [PATCH v2 3/6] iommufd: Initializing and releasing IO page fault data

Add some housekeeping code for IO page fault delivery. Add a fault field
to the iommufd_hw_pagetable structure to store pending IO page faults and
other related data.

The fault field is allocated and initialized when an IOPF-capable user
HWPT is allocated, as indicated by the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag
being set in the allocation user data, and it exists until the HWPT is
destroyed. This also means that whether an HWPT is IOPF-capable can be
determined by checking its fault field.

When an IOPF-capable HWPT is attached to a device (and possibly to a PASID
of a device in the future), the iommufd device pointer is saved in the
cookie for the device's PASID. The cookie is cleared and all pending iopf
groups are discarded when the HWPT is detached from the device.

Signed-off-by: Lu Baolu <[email protected]>
---
include/linux/iommu.h | 6 +++
drivers/iommu/iommufd/iommufd_private.h | 10 ++++
drivers/iommu/iommufd/device.c | 69 +++++++++++++++++++++++--
drivers/iommu/iommufd/hw_pagetable.c | 56 +++++++++++++++++++-
4 files changed, 137 insertions(+), 4 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 615d8a5f9dee..600ca3842c8a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -130,6 +130,12 @@ struct iopf_group {
struct work_struct work;
struct device *dev;
struct iommu_domain *domain;
+
+ /*
+ * Used by iopf handlers, like iommufd, to hook the iopf group
+ * on its own lists.
+ */
+ struct list_head node;
};

/**
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 1bd412cff2d6..0dbaa2dc5b22 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -230,6 +230,15 @@ int iommufd_option_rlimit_mode(struct iommu_option *cmd,

int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);

+struct hw_pgtable_fault {
+ struct iommufd_ctx *ictx;
+ struct iommufd_hw_pagetable *hwpt;
+ /* Protect below iopf lists. */
+ struct mutex mutex;
+ struct list_head deliver;
+ struct list_head response;
+};
+
/*
* A HW pagetable is called an iommu_domain inside the kernel. This user object
* allows directly creating and inspecting the domains. Domains that have kernel
@@ -239,6 +248,7 @@ int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
struct iommufd_hw_pagetable {
struct iommufd_object obj;
struct iommu_domain *domain;
+ struct hw_pgtable_fault *fault;

void (*abort)(struct iommufd_object *obj);
void (*destroy)(struct iommufd_object *obj);
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 645ab5d290fe..0a8e03d5e7c5 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -456,6 +456,16 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
if (rc)
goto err_unlock;

+ if (hwpt->fault) {
+ void *curr;
+
+ curr = iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, idev);
+ if (IS_ERR(curr)) {
+ rc = PTR_ERR(curr);
+ goto err_unresv;
+ }
+ }
+
/*
* Only attach to the group once for the first device that is in the
* group. All the other devices will follow this attachment. The user
@@ -466,17 +476,20 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
if (list_empty(&idev->igroup->device_list)) {
rc = iommufd_group_setup_msi(idev->igroup, hwpt);
if (rc)
- goto err_unresv;
+ goto err_unset;

rc = iommu_attach_group(hwpt->domain, idev->igroup->group);
if (rc)
- goto err_unresv;
+ goto err_unset;
idev->igroup->hwpt = hwpt;
}
refcount_inc(&hwpt->obj.users);
list_add_tail(&idev->group_item, &idev->igroup->device_list);
mutex_unlock(&idev->igroup->lock);
return 0;
+err_unset:
+ if (hwpt->fault)
+ iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, NULL);
err_unresv:
iommufd_device_remove_rr(idev, hwpt);
err_unlock:
@@ -484,6 +497,30 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
return rc;
}

+/*
+ * Discard all pending page faults. Called when a hw pagetable is detached
+ * from a device. The iommu core guarantees that all page faults have been
+ * responded to, hence there's no need to respond to them again.
+ */
+static void iommufd_hw_pagetable_discard_iopf(struct iommufd_hw_pagetable *hwpt)
+{
+ struct iopf_group *group, *next;
+
+ if (!hwpt->fault)
+ return;
+
+ mutex_lock(&hwpt->fault->mutex);
+ list_for_each_entry_safe(group, next, &hwpt->fault->deliver, node) {
+ list_del(&group->node);
+ iopf_free_group(group);
+ }
+ list_for_each_entry_safe(group, next, &hwpt->fault->response, node) {
+ list_del(&group->node);
+ iopf_free_group(group);
+ }
+ mutex_unlock(&hwpt->fault->mutex);
+}
+
struct iommufd_hw_pagetable *
iommufd_hw_pagetable_detach(struct iommufd_device *idev)
{
@@ -491,6 +528,8 @@ iommufd_hw_pagetable_detach(struct iommufd_device *idev)

mutex_lock(&idev->igroup->lock);
list_del(&idev->group_item);
+ if (hwpt->fault)
+ iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, NULL);
if (list_empty(&idev->igroup->device_list)) {
iommu_detach_group(hwpt->domain, idev->igroup->group);
idev->igroup->hwpt = NULL;
@@ -498,6 +537,8 @@ iommufd_hw_pagetable_detach(struct iommufd_device *idev)
iommufd_device_remove_rr(idev, hwpt);
mutex_unlock(&idev->igroup->lock);

+ iommufd_hw_pagetable_discard_iopf(hwpt);
+
/* Caller must destroy hwpt */
return hwpt;
}
@@ -563,9 +604,24 @@ iommufd_device_do_replace(struct iommufd_device *idev,
if (rc)
goto err_unresv;

+ if (old_hwpt->fault) {
+ iommufd_hw_pagetable_discard_iopf(old_hwpt);
+ iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, NULL);
+ }
+
+ if (hwpt->fault) {
+ void *curr;
+
+ curr = iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, idev);
+ if (IS_ERR(curr)) {
+ rc = PTR_ERR(curr);
+ goto err_unresv;
+ }
+ }
+
rc = iommu_group_replace_domain(igroup->group, hwpt->domain);
if (rc)
- goto err_unresv;
+ goto err_unset;

if (iommufd_hw_pagetable_compare_ioas(old_hwpt, hwpt)) {
list_for_each_entry(cur, &igroup->device_list, group_item)
@@ -583,8 +639,15 @@ iommufd_device_do_replace(struct iommufd_device *idev,
&old_hwpt->obj.users));
mutex_unlock(&idev->igroup->lock);

+ iommufd_hw_pagetable_discard_iopf(old_hwpt);
+
/* Caller must destroy old_hwpt */
return old_hwpt;
+err_unset:
+ if (hwpt->fault)
+ iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, NULL);
+ if (old_hwpt->fault)
+ iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, idev);
err_unresv:
if (iommufd_hw_pagetable_compare_ioas(old_hwpt, hwpt)) {
list_for_each_entry(cur, &igroup->device_list, group_item)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 72c46de1396b..9f94c824cf86 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -38,9 +38,38 @@ static void iommufd_kernel_managed_hwpt_destroy(struct iommufd_object *obj)
refcount_dec(&hwpt->ioas->obj.users);
}

+static struct hw_pgtable_fault *hw_pagetable_fault_alloc(void)
+{
+ struct hw_pgtable_fault *fault;
+
+ fault = kzalloc(sizeof(*fault), GFP_KERNEL);
+ if (!fault)
+ return ERR_PTR(-ENOMEM);
+
+ INIT_LIST_HEAD(&fault->deliver);
+ INIT_LIST_HEAD(&fault->response);
+ mutex_init(&fault->mutex);
+
+ return fault;
+}
+
+static void hw_pagetable_fault_free(struct hw_pgtable_fault *fault)
+{
+ WARN_ON(!list_empty(&fault->deliver));
+ WARN_ON(!list_empty(&fault->response));
+
+ kfree(fault);
+}
+
void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
{
- container_of(obj, struct iommufd_hw_pagetable, obj)->destroy(obj);
+ struct iommufd_hw_pagetable *hwpt =
+ container_of(obj, struct iommufd_hw_pagetable, obj);
+
+ if (hwpt->fault)
+ hw_pagetable_fault_free(hwpt->fault);
+
+ hwpt->destroy(obj);
}

static void iommufd_user_managed_hwpt_abort(struct iommufd_object *obj)
@@ -289,6 +318,17 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx,
return ERR_PTR(rc);
}

+static int iommufd_hw_pagetable_iopf_handler(struct iopf_group *group)
+{
+ struct iommufd_hw_pagetable *hwpt = group->domain->fault_data;
+
+ mutex_lock(&hwpt->fault->mutex);
+ list_add_tail(&group->node, &hwpt->fault->deliver);
+ mutex_unlock(&hwpt->fault->mutex);
+
+ return 0;
+}
+
int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
{
struct iommufd_hw_pagetable *(*alloc_fn)(
@@ -364,6 +404,20 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
goto out_unlock;
}

+ if (cmd->flags & IOMMU_HWPT_ALLOC_IOPF_CAPABLE) {
+ hwpt->fault = hw_pagetable_fault_alloc();
+ if (IS_ERR(hwpt->fault)) {
+ rc = PTR_ERR(hwpt->fault);
+ hwpt->fault = NULL;
+ goto out_hwpt;
+ }
+
+ hwpt->fault->ictx = ucmd->ictx;
+ hwpt->fault->hwpt = hwpt;
+ hwpt->domain->iopf_handler = iommufd_hw_pagetable_iopf_handler;
+ hwpt->domain->fault_data = hwpt;
+ }
+
cmd->out_hwpt_id = hwpt->obj.id;
rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
if (rc)
--
2.34.1

2023-10-26 02:54:39

by Baolu Lu

Subject: [PATCH v2 4/6] iommufd: Deliver fault messages to user space

Add a file interface that provides a simple and efficient way for
userspace to handle page faults. The interface allows userspace
to read fault messages sequentially, and to report the handling
result by writing to the same file.

Userspace applications are encouraged to use io_uring to improve read
and write efficiency; a minimal sketch follows below.

With this done, allow userspace applications to allocate a hw page table
with the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag set.
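
For reference, a minimal liburing sketch of reading one fault record
(illustrative only; note that, as implemented in this version, read()
returns 0 when no fault is pending, so blocking until a fault arrives
would additionally require poll support in the fops):

	#include <liburing.h>

	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iommu_hwpt_pgfault fault;

	io_uring_queue_init(8, &ring, 0);

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fault_fd, &fault, sizeof(fault), 0);
	io_uring_submit(&ring);

	/* cqe->res holds the number of bytes read, or a negative errno. */
	io_uring_wait_cqe(&ring, &cqe);
	io_uring_cqe_seen(&ring, cqe);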

Signed-off-by: Lu Baolu <[email protected]>
---
drivers/iommu/iommufd/iommufd_private.h | 2 +
drivers/iommu/iommufd/hw_pagetable.c | 204 +++++++++++++++++++++++-
2 files changed, 205 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 0dbaa2dc5b22..ff063bc48150 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -237,6 +237,8 @@ struct hw_pgtable_fault {
struct mutex mutex;
struct list_head deliver;
struct list_head response;
+ struct file *fault_file;
+ int fault_fd;
};

/*
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 9f94c824cf86..f0aac1bb2d2d 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -3,6 +3,8 @@
* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
*/
#include <linux/iommu.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
#include <uapi/linux/iommufd.h>

#include "../iommu-priv.h"
@@ -38,9 +40,198 @@ static void iommufd_kernel_managed_hwpt_destroy(struct iommufd_object *obj)
refcount_dec(&hwpt->ioas->obj.users);
}

+static int iommufd_compose_fault_message(struct iommu_fault *fault,
+ struct iommu_hwpt_pgfault *hwpt_fault,
+ struct device *dev)
+{
+ struct iommufd_device *idev = iopf_pasid_cookie_get(dev, IOMMU_NO_PASID);
+
+ if (!idev)
+ return -ENODEV;
+
+ if (IS_ERR(idev))
+ return PTR_ERR(idev);
+
+ hwpt_fault->size = sizeof(*hwpt_fault);
+ hwpt_fault->flags = fault->prm.flags;
+ hwpt_fault->dev_id = idev->obj.id;
+ hwpt_fault->pasid = fault->prm.pasid;
+ hwpt_fault->grpid = fault->prm.grpid;
+ hwpt_fault->perm = fault->prm.perm;
+ hwpt_fault->addr = fault->prm.addr;
+ hwpt_fault->private_data[0] = fault->prm.private_data[0];
+ hwpt_fault->private_data[1] = fault->prm.private_data[1];
+
+ return 0;
+}
+
+static ssize_t hwpt_fault_fops_read(struct file *filep, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ size_t fault_size = sizeof(struct iommu_hwpt_pgfault);
+ struct hw_pgtable_fault *fault = filep->private_data;
+ struct iommu_hwpt_pgfault data;
+ struct iopf_group *group;
+ struct iopf_fault *iopf;
+ size_t done = 0;
+ int rc;
+
+ if (*ppos || count % fault_size)
+ return -ESPIPE;
+
+ mutex_lock(&fault->mutex);
+ while (!list_empty(&fault->deliver) && count > done) {
+ group = list_first_entry(&fault->deliver,
+ struct iopf_group, node);
+
+ if (list_count_nodes(&group->faults) * fault_size > count - done)
+ break;
+
+ list_for_each_entry(iopf, &group->faults, list) {
+ rc = iommufd_compose_fault_message(&iopf->fault,
+ &data, group->dev);
+ if (rc)
+ goto err_unlock;
+ rc = copy_to_user(buf + done, &data, fault_size) ? -EFAULT : 0;
+ if (rc)
+ goto err_unlock;
+ done += fault_size;
+ }
+
+ list_move_tail(&group->node, &fault->response);
+ }
+ mutex_unlock(&fault->mutex);
+
+ return done;
+err_unlock:
+ mutex_unlock(&fault->mutex);
+ return rc;
+}
+
+static ssize_t hwpt_fault_fops_write(struct file *filep,
+ const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ size_t response_size = sizeof(struct iommu_hwpt_page_response);
+ struct hw_pgtable_fault *fault = filep->private_data;
+ struct iommu_hwpt_page_response response;
+ struct iommufd_hw_pagetable *hwpt;
+ struct iopf_group *iter, *group;
+ struct iommufd_device *idev;
+ size_t done = 0;
+ int rc = 0;
+
+ if (*ppos || count % response_size)
+ return -ESPIPE;
+
+ mutex_lock(&fault->mutex);
+ while (!list_empty(&fault->response) && count > done) {
+ rc = copy_from_user(&response, buf + done, response_size) ? -EFAULT : 0;
+ if (rc)
+ break;
+
+ /* Get the device that this response targets. */
+ idev = container_of(iommufd_get_object(fault->ictx,
+ response.dev_id,
+ IOMMUFD_OBJ_DEVICE),
+ struct iommufd_device, obj);
+ if (IS_ERR(idev)) {
+ rc = PTR_ERR(idev);
+ break;
+ }
+
+ /*
+ * Get the hw page table that this response was generated for.
+ * It must match the one stored in the fault data.
+ */
+ hwpt = container_of(iommufd_get_object(fault->ictx,
+ response.hwpt_id,
+ IOMMUFD_OBJ_HW_PAGETABLE),
+ struct iommufd_hw_pagetable, obj);
+ if (IS_ERR(hwpt)) {
+ iommufd_put_object(&idev->obj);
+ rc = PTR_ERR(hwpt);
+ break;
+ }
+
+ if (hwpt != fault->hwpt) {
+ rc = -EINVAL;
+ goto put_obj;
+ }
+
+ group = NULL;
+ list_for_each_entry(iter, &fault->response, node) {
+ if (response.grpid != iter->last_fault.fault.prm.grpid)
+ continue;
+
+ if (idev->dev != iter->dev)
+ continue;
+
+ if ((iter->last_fault.fault.prm.flags &
+ IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) &&
+ response.pasid != iter->last_fault.fault.prm.pasid)
+ continue;
+
+ group = iter;
+ break;
+ }
+
+ if (!group) {
+ rc = -ENODEV;
+ goto put_obj;
+ }
+
+ rc = iopf_group_response(group, response.code);
+ if (rc)
+ goto put_obj;
+
+ list_del(&group->node);
+ iopf_free_group(group);
+ done += response_size;
+put_obj:
+ iommufd_put_object(&hwpt->obj);
+ iommufd_put_object(&idev->obj);
+ if (rc)
+ break;
+ }
+ mutex_unlock(&fault->mutex);
+
+ return (rc < 0) ? rc : done;
+}
+
+static const struct file_operations hwpt_fault_fops = {
+ .owner = THIS_MODULE,
+ .read = hwpt_fault_fops_read,
+ .write = hwpt_fault_fops_write,
+};
+
+static int hw_pagetable_get_fault_fd(struct hw_pgtable_fault *fault)
+{
+ struct file *filep;
+ int fdno;
+
+ fdno = get_unused_fd_flags(O_CLOEXEC);
+ if (fdno < 0)
+ return fdno;
+
+ filep = anon_inode_getfile("[iommufd-pgfault]", &hwpt_fault_fops,
+ fault, O_RDWR);
+ if (IS_ERR(filep)) {
+ put_unused_fd(fdno);
+ return PTR_ERR(filep);
+ }
+
+ fd_install(fdno, filep);
+ fault->fault_file = filep;
+ fault->fault_fd = fdno;
+
+ return 0;
+}
+
static struct hw_pgtable_fault *hw_pagetable_fault_alloc(void)
{
struct hw_pgtable_fault *fault;
+ int rc;

fault = kzalloc(sizeof(*fault), GFP_KERNEL);
if (!fault)
@@ -50,6 +241,12 @@ static struct hw_pgtable_fault *hw_pagetable_fault_alloc(void)
INIT_LIST_HEAD(&fault->response);
mutex_init(&fault->mutex);

+ rc = hw_pagetable_get_fault_fd(fault);
+ if (rc) {
+ kfree(fault);
+ return ERR_PTR(rc);
+ }
+
return fault;
}

@@ -58,6 +255,8 @@ static void hw_pagetable_fault_free(struct hw_pgtable_fault *fault)
WARN_ON(!list_empty(&fault->deliver));
WARN_ON(!list_empty(&fault->response));

+ fput(fault->fault_file);
+ put_unused_fd(fault->fault_fd);
kfree(fault);
}

@@ -347,7 +546,9 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
struct mutex *mutex;
int rc;

- if (cmd->flags & ~IOMMU_HWPT_ALLOC_NEST_PARENT || cmd->__reserved)
+ if ((cmd->flags & ~(IOMMU_HWPT_ALLOC_NEST_PARENT |
+ IOMMU_HWPT_ALLOC_IOPF_CAPABLE)) ||
+ cmd->__reserved)
return -EOPNOTSUPP;
if (!cmd->data_len && cmd->hwpt_type != IOMMU_HWPT_TYPE_DEFAULT)
return -EINVAL;
@@ -416,6 +617,7 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
hwpt->fault->hwpt = hwpt;
hwpt->domain->iopf_handler = iommufd_hw_pagetable_iopf_handler;
hwpt->domain->fault_data = hwpt;
+ cmd->out_fault_fd = hwpt->fault->fault_fd;
}

cmd->out_hwpt_id = hwpt->obj.id;
--
2.34.1

2023-10-26 02:54:48

by Baolu Lu

Subject: [PATCH v2 5/6] iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_IOPF test support

Extend the selftest mock device to support generating and responding to
an IOPF. Also add an ioctl interface that allows userspace applications
to trigger an IOPF on the mock device. This allows userspace to test
IOMMUFD's handling of IOPFs without relying on any real hardware.

Signed-off-by: Lu Baolu <[email protected]>
---
drivers/iommu/iommufd/iommufd_test.h | 8 ++++
drivers/iommu/iommufd/selftest.c | 56 ++++++++++++++++++++++++++++
2 files changed, 64 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index 65a363f5e81e..98951a2af4bd 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -21,6 +21,7 @@ enum {
IOMMU_TEST_OP_ACCESS_REPLACE_IOAS,
IOMMU_TEST_OP_MD_CHECK_IOTLB,
IOMMU_TEST_OP_DEV_CHECK_DATA,
+ IOMMU_TEST_OP_TRIGGER_IOPF,
};

enum {
@@ -109,6 +110,13 @@ struct iommu_test_cmd {
struct {
__u32 val;
} check_dev_data;
+ struct {
+ __u32 dev_id;
+ __u32 pasid;
+ __u32 grpid;
+ __u32 perm;
+ __u64 addr;
+ } trigger_iopf;
};
__u32 last;
};
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 117776d236dc..b2d2edc3d2d2 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -401,11 +401,21 @@ static void mock_domain_set_plaform_dma_ops(struct device *dev)
*/
}

+static struct iopf_queue *mock_iommu_iopf_queue;
+
static struct iommu_device mock_iommu_device = {
};

static struct iommu_device *mock_probe_device(struct device *dev)
{
+ int rc;
+
+ if (mock_iommu_iopf_queue) {
+ rc = iopf_queue_add_device(mock_iommu_iopf_queue, dev);
+ if (rc)
+ return ERR_PTR(-ENODEV);
+ }
+
return &mock_iommu_device;
}

@@ -431,6 +441,12 @@ static void mock_domain_unset_dev_user_data(struct device *dev)
mdev->dev_data = 0;
}

+static int mock_domain_page_response(struct device *dev, struct iopf_fault *evt,
+ struct iommu_page_response *msg)
+{
+ return 0;
+}
+
static const struct iommu_ops mock_ops = {
.owner = THIS_MODULE,
.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
@@ -443,6 +459,7 @@ static const struct iommu_ops mock_ops = {
.probe_device = mock_probe_device,
.set_dev_user_data = mock_domain_set_dev_user_data,
.unset_dev_user_data = mock_domain_unset_dev_user_data,
+ .page_response = mock_domain_page_response,
.default_domain_ops =
&(struct iommu_domain_ops){
.free = mock_domain_free,
@@ -542,6 +559,9 @@ static void mock_dev_release(struct device *dev)
{
struct mock_dev *mdev = container_of(dev, struct mock_dev, dev);

+ if (mock_iommu_iopf_queue)
+ iopf_queue_remove_device(mock_iommu_iopf_queue, dev);
+
atomic_dec(&mock_dev_num);
kfree(mdev);
}
@@ -1200,6 +1220,32 @@ static_assert((unsigned int)MOCK_ACCESS_RW_WRITE == IOMMUFD_ACCESS_RW_WRITE);
static_assert((unsigned int)MOCK_ACCESS_RW_SLOW_PATH ==
__IOMMUFD_ACCESS_RW_SLOW_PATH);

+static int iommufd_test_trigger_iopf(struct iommufd_ucmd *ucmd,
+ struct iommu_test_cmd *cmd)
+{
+ struct iopf_fault event = { };
+ struct iommufd_device *idev;
+ int rc;
+
+ idev = iommufd_get_device(ucmd, cmd->trigger_iopf.dev_id);
+ if (IS_ERR(idev))
+ return PTR_ERR(idev);
+
+ event.fault.prm.flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;
+ if (cmd->trigger_iopf.pasid != IOMMU_NO_PASID)
+ event.fault.prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
+ event.fault.type = IOMMU_FAULT_PAGE_REQ;
+ event.fault.prm.addr = cmd->trigger_iopf.addr;
+ event.fault.prm.pasid = cmd->trigger_iopf.pasid;
+ event.fault.prm.grpid = cmd->trigger_iopf.grpid;
+ event.fault.prm.perm = cmd->trigger_iopf.perm;
+
+ rc = iommu_report_device_fault(idev->dev, &event);
+ iommufd_put_object(&idev->obj);
+
+ return rc;
+}
+
void iommufd_selftest_destroy(struct iommufd_object *obj)
{
struct selftest_obj *sobj = container_of(obj, struct selftest_obj, obj);
@@ -1271,6 +1317,8 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
return -EINVAL;
iommufd_test_memory_limit = cmd->memory_limit.limit;
return 0;
+ case IOMMU_TEST_OP_TRIGGER_IOPF:
+ return iommufd_test_trigger_iopf(ucmd, cmd);
default:
return -EOPNOTSUPP;
}
@@ -1312,6 +1360,9 @@ int __init iommufd_test_init(void)
&iommufd_mock_bus_type.nb);
if (rc)
goto err_sysfs;
+
+ mock_iommu_iopf_queue = iopf_queue_alloc("mock-iopfq");
+
return 0;

err_sysfs:
@@ -1327,6 +1378,11 @@ int __init iommufd_test_init(void)

void iommufd_test_exit(void)
{
+ if (mock_iommu_iopf_queue) {
+ iopf_queue_free(mock_iommu_iopf_queue);
+ mock_iommu_iopf_queue = NULL;
+ }
+
iommu_device_sysfs_remove(&mock_iommu_device);
iommu_device_unregister_bus(&mock_iommu_device,
&iommufd_mock_bus_type.bus,
--
2.34.1

2023-10-26 02:54:56

by Baolu Lu

Subject: [PATCH v2 6/6] iommufd/selftest: Add coverage for IOMMU_TEST_OP_TRIGGER_IOPF

Extend the selftest tool to add coverage of testing IOPF handling. This
would include the following tests:

- Allocating and destroying an IOPF-capable HWPT.
- Attaching/detaching/replacing an IOPF-capable HWPT on a device.
- Triggering an IOPF on the mock device.
- Retrieving and responding to the IOPF through the IOPF FD.

Signed-off-by: Lu Baolu <[email protected]>
---
tools/testing/selftests/iommu/iommufd_utils.h | 66 +++++++++++++++++--
tools/testing/selftests/iommu/iommufd.c | 24 +++++--
.../selftests/iommu/iommufd_fail_nth.c | 2 +-
3 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index b75f168fca46..df22c02af997 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -103,8 +103,8 @@ static int _test_cmd_mock_domain_replace(int fd, __u32 stdev_id, __u32 pt_id,
pt_id, NULL))

static int _test_cmd_hwpt_alloc(int fd, __u32 device_id, __u32 pt_id,
- __u32 flags, __u32 *hwpt_id, __u32 hwpt_type,
- void *data, size_t data_len)
+ __u32 flags, __u32 *hwpt_id, __u32 *fault_fd,
+ __u32 hwpt_type, void *data, size_t data_len)
{
struct iommu_hwpt_alloc cmd = {
.size = sizeof(cmd),
@@ -122,28 +122,39 @@ static int _test_cmd_hwpt_alloc(int fd, __u32 device_id, __u32 pt_id,
return ret;
if (hwpt_id)
*hwpt_id = cmd.out_hwpt_id;
+ if (fault_fd)
+ *fault_fd = cmd.out_fault_fd;
+
return 0;
}

#define test_cmd_hwpt_alloc(device_id, pt_id, flags, hwpt_id) \
ASSERT_EQ(0, _test_cmd_hwpt_alloc(self->fd, device_id, pt_id, flags, \
- hwpt_id, IOMMU_HWPT_TYPE_DEFAULT, \
+ hwpt_id, NULL, \
+ IOMMU_HWPT_TYPE_DEFAULT, \
NULL, 0))
#define test_err_hwpt_alloc(_errno, device_id, pt_id, flags, hwpt_id) \
EXPECT_ERRNO(_errno, _test_cmd_hwpt_alloc(self->fd, device_id, pt_id, \
- flags, hwpt_id, \
+ flags, hwpt_id, NULL, \
IOMMU_HWPT_TYPE_DEFAULT, \
NULL, 0))

#define test_cmd_hwpt_alloc_nested(device_id, pt_id, flags, hwpt_id, \
hwpt_type, data, data_len) \
ASSERT_EQ(0, _test_cmd_hwpt_alloc(self->fd, device_id, pt_id, flags, \
- hwpt_id, hwpt_type, data, data_len))
+ hwpt_id, NULL, hwpt_type, data, \
+ data_len))
#define test_err_hwpt_alloc_nested(_errno, device_id, pt_id, flags, hwpt_id, \
hwpt_type, data, data_len) \
EXPECT_ERRNO(_errno, \
_test_cmd_hwpt_alloc(self->fd, device_id, pt_id, flags, \
- hwpt_id, hwpt_type, data, data_len))
+ hwpt_id, NULL, hwpt_type, data, \
+ data_len))
+#define test_cmd_hwpt_alloc_nested_iopf(device_id, pt_id, flags, hwpt_id, \
+ fault_fd, hwpt_type, data, data_len) \
+ ASSERT_EQ(0, _test_cmd_hwpt_alloc(self->fd, device_id, pt_id, flags, \
+ hwpt_id, fault_fd, hwpt_type, data, \
+ data_len))

#define test_cmd_hwpt_check_iotlb(hwpt_id, iotlb_id, expected) \
({ \
@@ -551,3 +562,46 @@ static int _test_cmd_unset_dev_data(int fd, __u32 device_id)
#define test_err_unset_dev_data(_errno, device_id) \
EXPECT_ERRNO(_errno, \
_test_cmd_unset_dev_data(self->fd, device_id))
+
+static int _test_cmd_trigger_iopf(int fd, __u32 device_id, __u32 fault_fd, __u32 hwpt_id)
+{
+ struct iommu_test_cmd trigger_iopf_cmd = {
+ .size = sizeof(trigger_iopf_cmd),
+ .op = IOMMU_TEST_OP_TRIGGER_IOPF,
+ .trigger_iopf = {
+ .dev_id = device_id,
+ .pasid = 0x1,
+ .grpid = 0x2,
+ .perm = IOMMU_PGFAULT_PERM_READ | IOMMU_PGFAULT_PERM_WRITE,
+ .addr = 0xdeadbeaf,
+ },
+ };
+ struct iommu_hwpt_page_response response = {
+ .size = sizeof(struct iommu_hwpt_page_response),
+ .hwpt_id = hwpt_id,
+ .dev_id = device_id,
+ .pasid = 0x1,
+ .grpid = 0x2,
+ .code = 0,
+ };
+ struct iommu_hwpt_pgfault fault = {};
+ ssize_t bytes;
+ int ret;
+
+ ret = ioctl(fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_TRIGGER_IOPF), &trigger_iopf_cmd);
+ if (ret)
+ return ret;
+
+ bytes = read(fault_fd, &fault, sizeof(fault));
+ if (bytes < 0)
+ return bytes;
+
+ bytes = write(fault_fd, &response, sizeof(response));
+ if (bytes < 0)
+ return bytes;
+
+ return 0;
+}
+
+#define test_cmd_trigger_iopf(device_id, fault_fd, hwpt_id) \
+ ASSERT_EQ(0, _test_cmd_trigger_iopf(self->fd, device_id, fault_fd, hwpt_id))
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 7cf06a4635d8..b30b82a72785 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -275,11 +275,12 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
.iotlb = IOMMU_TEST_IOTLB_DEFAULT,
};
struct iommu_hwpt_invalidate_selftest inv_reqs[2] = {0};
- uint32_t nested_hwpt_id[2] = {};
+ uint32_t nested_hwpt_id[3] = {};
uint32_t num_inv, driver_error;
uint32_t parent_hwpt_id = 0;
uint32_t parent_hwpt_id_not_work = 0;
uint32_t test_hwpt_id = 0;
+ uint32_t fault_fd;

if (self->device_id) {
/* Negative tests */
@@ -323,7 +324,7 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
IOMMU_HWPT_TYPE_SELFTEST,
&data, sizeof(data));

- /* Allocate two nested hwpts sharing one common parent hwpt */
+ /* Allocate nested hwpts sharing one common parent hwpt */
test_cmd_hwpt_alloc_nested(self->device_id, parent_hwpt_id,
0, &nested_hwpt_id[0],
IOMMU_HWPT_TYPE_SELFTEST,
@@ -332,6 +333,11 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
0, &nested_hwpt_id[1],
IOMMU_HWPT_TYPE_SELFTEST,
&data, sizeof(data));
+ test_cmd_hwpt_alloc_nested_iopf(self->device_id, parent_hwpt_id,
+ IOMMU_HWPT_ALLOC_IOPF_CAPABLE,
+ &nested_hwpt_id[2], &fault_fd,
+ IOMMU_HWPT_TYPE_SELFTEST,
+ &data, sizeof(data));
test_cmd_hwpt_check_iotlb_all(nested_hwpt_id[0],
IOMMU_TEST_IOTLB_DEFAULT);
test_cmd_hwpt_check_iotlb_all(nested_hwpt_id[1],
@@ -418,10 +424,20 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
_test_ioctl_destroy(self->fd, nested_hwpt_id[1]));
test_ioctl_destroy(nested_hwpt_id[0]);

- /* Detach from nested_hwpt_id[1] and destroy it */
- test_cmd_mock_domain_replace(self->stdev_id, parent_hwpt_id);
+ /* Switch from nested_hwpt_id[1] to nested_hwpt_id[2] */
+ test_cmd_mock_domain_replace(self->stdev_id,
+ nested_hwpt_id[2]);
+ EXPECT_ERRNO(EBUSY,
+ _test_ioctl_destroy(self->fd, nested_hwpt_id[2]));
test_ioctl_destroy(nested_hwpt_id[1]);

+ /* Trigger an IOPF on the device */
+ test_cmd_trigger_iopf(self->device_id, fault_fd, nested_hwpt_id[2]);
+
+ /* Detach from nested_hwpt_id[2] and destroy it */
+ test_cmd_mock_domain_replace(self->stdev_id, parent_hwpt_id);
+ test_ioctl_destroy(nested_hwpt_id[2]);
+
/* Detach from the parent hw_pagetable and destroy it */
test_cmd_mock_domain_replace(self->stdev_id, self->ioas_id);
test_ioctl_destroy(parent_hwpt_id);
diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c
index d3f47f262c04..2b7b582c17c4 100644
--- a/tools/testing/selftests/iommu/iommufd_fail_nth.c
+++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c
@@ -615,7 +615,7 @@ TEST_FAIL_NTH(basic_fail_nth, device)
if (_test_cmd_get_hw_info(self->fd, idev_id, &info, sizeof(info)))
return -1;

- if (_test_cmd_hwpt_alloc(self->fd, idev_id, ioas_id, 0, &hwpt_id,
+ if (_test_cmd_hwpt_alloc(self->fd, idev_id, ioas_id, 0, &hwpt_id, NULL,
IOMMU_HWPT_TYPE_DEFAULT, 0, 0))
return -1;

--
2.34.1

2023-11-02 12:47:55

by Jason Gunthorpe

Subject: Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> Hi folks,
>
> This series implements the functionality of delivering IO page faults to
> user space through the IOMMUFD framework for nested translation. Nested
> translation is a hardware feature that supports two-stage translation
> tables for IOMMU. The second-stage translation table is managed by the
> host VMM, while the first-stage translation table is owned by user
> space. This allows user space to control the IOMMU mappings for its
> devices.

Having now looked more closely at the ARM requirements, it seems we
will need generic events, not just page fault events, to have a
complete emulation.

So I'd like to see this generalized into a channel to carry any
events..

> User space indicates its capability of handling IO page faults by
> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> for page fault delivery. On a successful return of HWPT allocation, the
> user can retrieve and respond to page faults by reading and writing to
> the file descriptor (FD) returned in out_fault_fd.

This is the right way to approach it, and more broadly this shouldn't
be an iommufd specific thing. Kernel drivers will also need to create
fault capable PAGING iommu domains.

Jason

2023-11-07 08:36:08

by Tian, Kevin

Subject: RE: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, November 2, 2023 8:48 PM
>
> On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> > Hi folks,
> >
> > This series implements the functionality of delivering IO page faults to
> > user space through the IOMMUFD framework for nested translation.
> Nested
> > translation is a hardware feature that supports two-stage translation
> > tables for IOMMU. The second-stage translation table is managed by the
> > host VMM, while the first-stage translation table is owned by user
> > space. This allows user space to control the IOMMU mappings for its
> > devices.
>
> Having now looked more closely at the ARM requirements it seems we
> will need generic events, not just page fault events to have a
> complete emulation.

Can you elaborate?

>
> So I'd like to see this generalized into a channel to carry any
> events..
>
> > User space indicates its capability of handling IO page faults by
> > setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> > hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> > for page fault delivery. On a successful return of HWPT allocation, the
> > user can retrieve and respond to page faults by reading and writing to
> > the file descriptor (FD) returned in out_fault_fd.
>
> This is the right way to approach it, and more broadly this shouldn't
> be an iommufd specific thing. Kernel drivers will also need to create
> fault capable PAGING iommu domains.
>

Are you suggesting a common interface used by both iommufd and
kernel drivers?

but I didn't get the last piece. If those domains are created by kernel
drivers why would they require a uAPI for userspace to specify fault
capable?

2023-11-15 13:58:24

by Jason Gunthorpe

Subject: Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

On Wed, Nov 15, 2023 at 01:17:06PM +0800, Liu, Jing2 wrote:

> This is the right way to approach it,
>
> I learned that there was discussion about using io_uring to get the
> page fault without eventfd notification in [1]. I am new to io_uring
> and studying the man page of liburing, but I have questions about how
> QEMU can receive incoming page faults with good performance.
>
> Since neither QEMU nor the kernel knows when faults arrive, after QEMU
> submits one read task to io_uring, we want the kernel to block until a
> fault comes. But based on hwpt_fault_fops_read() in [patch v2 4/6], the
> read just returns 0 when there is no fault, so this round of read
> completes to the CQ, which is not what we want. So I'm wondering how
> the kernel can block the read until a fault comes. Does the fops
> callback need special work to

Implement a fops with poll support that triggers when a new event is
pushed and everything will be fine. There are many examples in the
kernel. The ones in the mlx5 vfio driver spring to mind as a scheme I
recently looked at.
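
A minimal sketch of such a poll hook on top of this series, assuming a
wait_queue_head_t is added to struct hw_pgtable_fault and woken by the
iopf handler when it queues a group:

static __poll_t hwpt_fault_fops_poll(struct file *filep,
				     struct poll_table_struct *wait)
{
	struct hw_pgtable_fault *fault = filep->private_data;
	__poll_t pollflags = 0;

	/*
	 * Register on the (assumed) wait queue; the iopf handler would
	 * wake_up_interruptible() it after queueing a group on ->deliver.
	 */
	poll_wait(filep, &fault->wait_queue, wait);
	mutex_lock(&fault->mutex);
	if (!list_empty(&fault->deliver))
		pollflags = EPOLLIN | EPOLLRDNORM;
	mutex_unlock(&fault->mutex);

	return pollflags;
}

With a .poll like this wired into hwpt_fault_fops, poll-driven io_uring
reads would block cleanly until a fault is queued.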

Jason

2023-11-16 01:44:17

by Liu, Jing2

Subject: Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

Hi Jason,

On 11/15/2023 9:58 PM, Jason Gunthorpe wrote:
> On Wed, Nov 15, 2023 at 01:17:06PM +0800, Liu, Jing2 wrote:
>
>> This is the right way to approach it,
>>
>> I learned that there was discussion about using io_uring to get the
>> page fault without eventfd notification in [1]. I am new to io_uring
>> and studying the man page of liburing, but I have questions about how
>> QEMU can receive incoming page faults with good performance.
>>
>> Since neither QEMU nor the kernel knows when faults arrive, after QEMU
>> submits one read task to io_uring, we want the kernel to block until a
>> fault comes. But based on hwpt_fault_fops_read() in [patch v2 4/6], the
>> read just returns 0 when there is no fault, so this round of read
>> completes to the CQ, which is not what we want. So I'm wondering how
>> the kernel can block the read until a fault comes. Does the fops
>> callback need special work to
> Implement a fops with poll support that triggers when a new event is
> pushed and everything will be fine.

Does userspace also need to set a POLL flag to make io_uring poll, or
does io_uring always try to poll?

> There are many examples in the
> kernel. The ones in the mlx5 vfio driver spring to mind as a scheme I
> recently looked at.

Thank you very much for guiding the way. We will study the example to
understand more.

BRs,

Jing

> Jason

2023-11-21 00:16:00

by Jason Gunthorpe

Subject: Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

On Thu, Nov 16, 2023 at 09:42:23AM +0800, Liu, Jing2 wrote:
> Hi Jason,
>
> On 11/15/2023 9:58 PM, Jason Gunthorpe wrote:
> > On Wed, Nov 15, 2023 at 01:17:06PM +0800, Liu, Jing2 wrote:
> >
> > > This is the right way to approach it,
> > >
> > > I learned that there was discussion about using io_uring to get the
> > > page fault without eventfd notification in [1]. I am new to io_uring
> > > and studying the man page of liburing, but I have questions about
> > > how QEMU can receive incoming page faults with good performance.
> > >
> > > Since neither QEMU nor the kernel knows when faults arrive, after
> > > QEMU submits one read task to io_uring, we want the kernel to block
> > > until a fault comes. But based on hwpt_fault_fops_read() in [patch
> > > v2 4/6], the read just returns 0 when there is no fault, so this
> > > round of read completes to the CQ, which is not what we want. So I'm
> > > wondering how the kernel can block the read until a fault comes.
> > > Does the fops callback need special work to
> > Implement a fops with poll support that triggers when a new event is
> > pushed and everything will be fine.
>
> Does userspace also need to set a POLL flag to make io_uring poll, or
> does io_uring always try to poll?

io_uring can trigger poll and use other approaches; it is flexible, and
the driver can scale this in different ways.

Jason

by Shameerali Kolothum Thodi

Subject: RE: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space



> -----Original Message-----
> From: Lu Baolu [mailto:[email protected]]
> Sent: 26 October 2023 03:49
> To: Jason Gunthorpe <[email protected]>; Kevin Tian <[email protected]>;
> Joerg Roedel <[email protected]>; Will Deacon <[email protected]>; Robin
> Murphy <[email protected]>; Jean-Philippe Brucker
> <[email protected]>; Nicolin Chen <[email protected]>; Yi Liu
> <[email protected]>; Jacob Pan <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Lu
> Baolu <[email protected]>
> Subject: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space
>
> Hi folks,
>
> This series implements the functionality of delivering IO page faults to
> user space through the IOMMUFD framework for nested translation. Nested
> translation is a hardware feature that supports two-stage translation
> tables for IOMMU. The second-stage translation table is managed by the
> host VMM, while the first-stage translation table is owned by user
> space. This allows user space to control the IOMMU mappings for its
> devices.
>
> When an IO page fault occurs on the first-stage translation table, the
> IOMMU hardware can deliver the page fault to user space through the
> IOMMUFD framework. User space can then handle the page fault and respond
> to the device top-down through the IOMMUFD. This allows user space to
> implement its own IO page fault handling policies.
>
> User space indicates its capability of handling IO page faults by
> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> for page fault delivery. On a successful return of HWPT allocation, the
> user can retrieve and respond to page faults by reading and writing to
> the file descriptor (FD) returned in out_fault_fd.
>
> The iommu selftest framework has been updated to test the IO page fault
> delivery and response functionality.
>
> This series is based on the latest implementation of nested translation
> under discussion [1] and the page fault handling framework refactoring in
> the IOMMU core [2].
>
> The series and related patches are available on GitHub: [3]
>
> [1] https://lore.kernel.org/linux-iommu/[email protected]/
> [2] https://lore.kernel.org/linux-iommu/[email protected]/
> [3] https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2

Hi Baolu,

Do you have a corresponding Qemu git to share? I could give it a spin on our ARM
platform. Please let me know.

Thanks,
Shameer

2023-11-30 03:44:33

by Baolu Lu

Subject: Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

On 2023/11/29 17:08, Shameerali Kolothum Thodi wrote:
>
>
>> -----Original Message-----
>> From: Lu Baolu [mailto:[email protected]]
>> Sent: 26 October 2023 03:49
>> To: Jason Gunthorpe <[email protected]>; Kevin Tian <[email protected]>;
>> Joerg Roedel <[email protected]>; Will Deacon <[email protected]>; Robin
>> Murphy <[email protected]>; Jean-Philippe Brucker
>> <[email protected]>; Nicolin Chen <[email protected]>; Yi Liu
>> <[email protected]>; Jacob Pan <[email protected]>
>> Cc: [email protected]; [email protected];
>> [email protected]; [email protected]; Lu
>> Baolu <[email protected]>
>> Subject: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space
>>
>> Hi folks,
>>
>> This series implements the functionality of delivering IO page faults to
>> user space through the IOMMUFD framework for nested translation. Nested
>> translation is a hardware feature that supports two-stage translation
>> tables for IOMMU. The second-stage translation table is managed by the
>> host VMM, while the first-stage translation table is owned by user
>> space. This allows user space to control the IOMMU mappings for its
>> devices.
>>
>> When an IO page fault occurs on the first-stage translation table, the
>> IOMMU hardware can deliver the page fault to user space through the
>> IOMMUFD framework. User space can then handle the page fault and
>> respond
>> to the device top-down through the IOMMUFD. This allows user space to
>> implement its own IO page fault handling policies.
>>
>> User space indicates its capability of handling IO page faults by
>> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
>> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
>> for page fault delivery. On a successful return of HWPT allocation, the
>> user can retrieve and respond to page faults by reading and writing to
>> the file descriptor (FD) returned in out_fault_fd.
>>
>> The iommu selftest framework has been updated to test the IO page fault
>> delivery and response functionality.
>>
>> This series is based on the latest implementation of nested translation
>> under discussion [1] and the page fault handling framework refactoring in
>> the IOMMU core [2].
>>
>> The series and related patches are available on GitHub: [3]
>>
>> [1] https://lore.kernel.org/linux-iommu/[email protected]/
>> [2] https://lore.kernel.org/linux-iommu/[email protected]/
>> [3] https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2
>
> Hi Baolu,

Hi Shameer,

>
> Do you have a corresponding Qemu git to share? I could give it a spin on our ARM
> platform. Please let me know.

This version of the series is tested with the iommufd selftest. We are in
the process of developing the QEMU code. I will provide the repo link
after we complete it.

Best regards,
baolu

2023-12-01 14:24:51

by Jason Gunthorpe

Subject: Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> Hi folks,
>
> This series implements the functionality of delivering IO page faults to
> user space through the IOMMUFD framework for nested translation. Nested
> translation is a hardware feature that supports two-stage translation
> tables for IOMMU. The second-stage translation table is managed by the
> host VMM, while the first-stage translation table is owned by user
> space. This allows user space to control the IOMMU mappings for its
> devices.
>
> When an IO page fault occurs on the first-stage translation table, the
> IOMMU hardware can deliver the page fault to user space through the
> IOMMUFD framework. User space can then handle the page fault and respond
> to the device top-down through the IOMMUFD. This allows user space to
> implement its own IO page fault handling policies.
>
> User space indicates its capability of handling IO page faults by
> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> for page fault delivery. On a successful return of HWPT allocation, the
> user can retrieve and respond to page faults by reading and writing to
> the file descriptor (FD) returned in out_fault_fd.

This is probably backwards: userspace should allocate the FD with a
dedicated ioctl and provide it during domain allocation.

If the userspace wants a fd per domain then it should do that. If it
wants to share fds between domains that should work too.
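
Purely as a hypothetical sketch of that direction (all names here are
invented, not part of this series):

	/* Hypothetical uAPI; names invented for illustration. */
	struct iommu_fault_alloc {
		__u32 size;
		__u32 flags;
		__s32 out_fault_fd;	/* read faults / write responses */
	};

The returned fd would then be passed into struct iommu_hwpt_alloc at
domain allocation time, so one fd can serve one or many domains.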

Jason

2023-12-01 14:38:26

by Jason Gunthorpe

Subject: Re: [PATCH v2 1/6] iommu: Add iommu page fault cookie helpers

On Thu, Oct 26, 2023 at 10:49:25AM +0800, Lu Baolu wrote:

> +void *iopf_pasid_cookie_get(struct device *dev, ioasid_t pasid)
> +{
> + struct iommu_fault_param *iopf_param = iopf_get_dev_fault_param(dev);
> + void *curr;
> +
> + if (!iopf_param)
> + return ERR_PTR(-ENODEV);
> +
> + xa_lock(&iopf_param->pasid_cookie);
> + curr = xa_load(&iopf_param->pasid_cookie, pasid);
> + xa_unlock(&iopf_param->pasid_cookie);

No need for this locking, the caller has to provide some kind of
locking to protect the returned pointer.

I'm not sure how this can work really..

What iommfd wants is to increment the device object refcount under
this xa_lock.

I'm not sure this is the right arrangement: Basically you want to
have a cookie per domain attachment for iopf domains that is forwarded
to the handler.

So maybe this entire thing is not quite right; instead of having a
generic iopf attached to the domain, the iopf should be supplied at
domain attach time? Something like:

iommu_domain_attach_iopf(struct iommu_domain *, struct device *,
ioasid_t pasid, struct iopf *, void *cookie);

The per-attach cookie would be passed to the iopf function
automatically by the infrastructure.

Detach would have the necessary locking to ensure that no handler is
running across detach

Then the cookie is logically placed in the API and properly protected
with natural locking we already need.

Jason
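
For illustration, Jason's proposal could be used from the iommufd side
roughly like the sketch below. All names other than the proposed
iommu_domain_attach_iopf() signature are hypothetical and do not exist in
this series; the point is only that the handler and a per-attachment
cookie travel together through the attach call, so the core can
synchronize detach against in-flight handlers:

struct iopf {
        int (*handler)(struct iopf_group *group, void *cookie);
};

static int iommufd_iopf_handler(struct iopf_group *group, void *cookie)
{
        /* The core hands back the cookie set at attach time, so no
         * xarray lookup is needed to find the iommufd device. */
        struct iommufd_device *idev = cookie;

        return iommufd_fault_deliver(idev, group);      /* hypothetical */
}

static struct iopf iommufd_iopf = {
        .handler = iommufd_iopf_handler,
};

static int iommufd_attach_iopf_capable(struct iommufd_hw_pagetable *hwpt,
                                       struct iommufd_device *idev)
{
        return iommu_domain_attach_iopf(hwpt->domain, idev->dev,
                                        IOMMU_NO_PASID, &iommufd_iopf,
                                        idev);
}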

2023-12-01 15:14:36

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] iommufd: Add iommu page fault uapi data

On Thu, Oct 26, 2023 at 10:49:26AM +0800, Lu Baolu wrote:

> + * @IOMMU_HWPT_ALLOC_IOPF_CAPABLE: User is capable of handling IO page faults.

This does not seem like the best name?

Probably like this given my remark in the cover letter:

--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -359,6 +359,7 @@ struct iommu_vfio_ioas {
enum iommufd_hwpt_alloc_flags {
IOMMU_HWPT_ALLOC_NEST_PARENT = 1 << 0,
IOMMU_HWPT_ALLOC_DIRTY_TRACKING = 1 << 1,
+ IOMMU_HWPT_IOPFD_FD_VALID = 1 << 2,
};

/**
@@ -440,6 +441,7 @@ struct iommu_hwpt_alloc {
__u32 data_type;
__u32 data_len;
__aligned_u64 data_uptr;
+ __s32 iopf_fd;
};
#define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)



> @@ -679,6 +688,62 @@ struct iommu_dev_data_arm_smmuv3 {
> __u32 sid;
> };
>
> +/**
> + * struct iommu_hwpt_pgfault - iommu page fault data
> + * @size: sizeof(struct iommu_hwpt_pgfault)
> + * @flags: Combination of IOMMU_PGFAULT_FLAGS_ flags.
> + * - PASID_VALID: @pasid field is valid
> + * - LAST_PAGE: the last page fault in a group
> + * - PRIV_DATA: @private_data field is valid
> + * - RESP_NEEDS_PASID: the page response must have the same
> + * PASID value as the page request.
> + * @dev_id: id of the originated device
> + * @pasid: Process Address Space ID
> + * @grpid: Page Request Group Index
> + * @perm: requested page permissions (IOMMU_PGFAULT_PERM_* values)
> + * @addr: page address
> + * @private_data: device-specific private information
> + */
> +struct iommu_hwpt_pgfault {
> + __u32 size;
> + __u32 flags;
> +#define IOMMU_PGFAULT_FLAGS_PASID_VALID (1 << 0)
> +#define IOMMU_PGFAULT_FLAGS_LAST_PAGE (1 << 1)
> +#define IOMMU_PGFAULT_FLAGS_PRIV_DATA (1 << 2)
> +#define IOMMU_PGFAULT_FLAGS_RESP_NEEDS_PASID (1 << 3)
> + __u32 dev_id;
> + __u32 pasid;
> + __u32 grpid;
> + __u32 perm;
> +#define IOMMU_PGFAULT_PERM_READ (1 << 0)
> +#define IOMMU_PGFAULT_PERM_WRITE (1 << 1)
> +#define IOMMU_PGFAULT_PERM_EXEC (1 << 2)
> +#define IOMMU_PGFAULT_PERM_PRIV (1 << 3)
> + __u64 addr;
> + __u64 private_data[2];
> +};

This mixed #define is not the style, these should be in enums,
possibly with kdocs

Use __aligned_u64 also

> +
> +/**
> + * struct iommu_hwpt_response - IOMMU page fault response
> + * @size: sizeof(struct iommu_hwpt_response)
> + * @flags: Must be set to 0
> + * @hwpt_id: hwpt ID of target hardware page table for the response
> + * @dev_id: device ID of target device for the response
> + * @pasid: Process Address Space ID
> + * @grpid: Page Request Group Index
> + * @code: response code. The supported codes include:
> + * 0: Successful; 1: Response Failure; 2: Invalid Request.
> + */
> +struct iommu_hwpt_page_response {
> + __u32 size;
> + __u32 flags;
> + __u32 hwpt_id;
> + __u32 dev_id;
> + __u32 pasid;
> + __u32 grpid;
> + __u32 code;
> +};

Is it OK to have the user pass in all this detailed information? Is it
a security problem if the user lies? Ie shouldn't we only ack page
faults we actually have outstanding?

IOW should iommu_hwpt_pgfault just have a 'response_cookie' generated
by the kernel that should be placed here? The kernel would keep track
of all this internal stuff?

Jason
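
For illustration, the uapi remarks above could land roughly as follows:
flags moved into kdoc'd enums, __aligned_u64 for the address, and a
kernel-generated cookie that user space echoes back in the response while
the kernel tracks the rest internally. This is a hypothetical layout, not
the final uapi:

/**
 * enum iommu_hwpt_pgfault_flags
 * @IOMMU_PGFAULT_FLAGS_PASID_VALID: @pasid field is valid
 * @IOMMU_PGFAULT_FLAGS_LAST_PAGE: the last page fault in a group
 */
enum iommu_hwpt_pgfault_flags {
        IOMMU_PGFAULT_FLAGS_PASID_VALID = 1 << 0,
        IOMMU_PGFAULT_FLAGS_LAST_PAGE = 1 << 1,
};

struct iommu_hwpt_pgfault {
        __u32 size;
        __u32 flags;
        __u32 dev_id;
        __u32 pasid;
        __u32 grpid;
        __u32 perm;
        __aligned_u64 addr;
        /* Allocated by the kernel (e.g. via xa_alloc()) when the fault
         * is queued; identifies the outstanding fault. */
        __u32 response_cookie;
        __u32 __reserved;
};

struct iommu_hwpt_page_response {
        __u32 size;
        __u32 flags;
        __u32 response_cookie;  /* echoed back from the fault message */
        __u32 code;
};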

2023-12-01 15:26:47

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] iommufd: Deliver fault messages to user space

On Thu, Oct 26, 2023 at 10:49:28AM +0800, Lu Baolu wrote:

> +static ssize_t hwpt_fault_fops_write(struct file *filep,
> + const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + size_t response_size = sizeof(struct iommu_hwpt_page_response);
> + struct hw_pgtable_fault *fault = filep->private_data;
> + struct iommu_hwpt_page_response response;
> + struct iommufd_hw_pagetable *hwpt;
> + struct iopf_group *iter, *group;
> + struct iommufd_device *idev;
> + size_t done = 0;
> + int rc = 0;
> +
> + if (*ppos || count % response_size)
> + return -ESPIPE;
> +
> + mutex_lock(&fault->mutex);
> + while (!list_empty(&fault->response) && count > done) {
> + rc = copy_from_user(&response, buf + done, response_size);
> + if (rc)
> + break;
> +
> + /* Get the device that this response targets at. */
> + idev = container_of(iommufd_get_object(fault->ictx,
> + response.dev_id,
> + IOMMUFD_OBJ_DEVICE),
> + struct iommufd_device, obj);
> + if (IS_ERR(idev)) {
> + rc = PTR_ERR(idev);
> + break;
> + }

See here, it might be better to have a per-fd list of outstanding
faults, and then the cookie would just index that list; you get
everything in one shot instead of having to do an xarray lookup and
then a linear list search

> +static const struct file_operations hwpt_fault_fops = {
> + .owner = THIS_MODULE,
> + .read = hwpt_fault_fops_read,
> + .write = hwpt_fault_fops_write,
> +};

nonseekable_open() behavior should be integrated into this

> +static int hw_pagetable_get_fault_fd(struct hw_pgtable_fault *fault)
> +{
> + struct file *filep;
> + int fdno;
> +
> + fdno = get_unused_fd_flags(O_CLOEXEC);
> + if (fdno < 0)
> + return fdno;
> +
> + filep = anon_inode_getfile("[iommufd-pgfault]", &hwpt_fault_fops,
> + fault, O_RDWR);
> + if (IS_ERR(filep)) {
> + put_unused_fd(fdno);
> + return PTR_ERR(filep);
> + }
> +
> + fd_install(fdno, filep);
> + fault->fault_file = filep;
> + fault->fault_fd = fdno;

fd_install must be the very last thing before returning success from a
system call because we cannot undo it.

There are other failure paths before here and the final return

Jason
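
Both remarks could be addressed roughly as in the sketch below. Since
anon inodes never call fops->open, the non-seekable file mode has to be
set on the file directly, which is all nonseekable_open() does anyway:

static int hw_pagetable_get_fault_fd(struct hw_pgtable_fault *fault)
{
        struct file *filep;
        int fdno;

        fdno = get_unused_fd_flags(O_CLOEXEC);
        if (fdno < 0)
                return fdno;

        filep = anon_inode_getfile("[iommufd-pgfault]", &hwpt_fault_fops,
                                   fault, O_RDWR);
        if (IS_ERR(filep)) {
                put_unused_fd(fdno);
                return PTR_ERR(filep);
        }

        /* anon_inode_getfile() does not call ->open(); apply the
         * nonseekable_open() semantics by hand. */
        filep->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);

        fault->fault_file = filep;
        fault->fault_fd = fdno;
        /* Deliberately no fd_install() here: the ioctl must call
         * fd_install(fault->fault_fd, fault->fault_file) as its very
         * last step, after all failure paths have been passed. */
        return 0;
}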

2023-12-04 15:08:09

by Joel Granados

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> Hi folks,
>
> This series implements the functionality of delivering IO page faults to
> user space through the IOMMUFD framework for nested translation. Nested
Does this mean the IOPF_CAPABLE HWPT needs to be parented by a HWPT
created with IOMMU_HWPT_ALLOC_NEST_PARENT set?

> translation is a hardware feature that supports two-stage translation
> tables for IOMMU. The second-stage translation table is managed by the
> host VMM, while the first-stage translation table is owned by user
> space. This allows user space to control the IOMMU mappings for its
> devices.
>
> When an IO page fault occurs on the first-stage translation table, the
> IOMMU hardware can deliver the page fault to user space through the
> IOMMUFD framework. User space can then handle the page fault and respond
> to the device top-down through the IOMMUFD. This allows user space to
> implement its own IO page fault handling policies.
>
> User space indicates its capability of handling IO page faults by
> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> for page fault delivery. On a successful return of HWPT allocation, the
> user can retrieve and respond to page faults by reading and writing to
> the file descriptor (FD) returned in out_fault_fd.
>
> The iommu selftest framework has been updated to test the IO page fault
> delivery and response functionality.
>
> This series is based on the latest implementation of nested translation
> under discussion [1] and the page fault handling framework refactoring in
> the IOMMU core [2].
>
> The series and related patches are available on GitHub: [3]
>
> [1] https://lore.kernel.org/linux-iommu/[email protected]/
> [2] https://lore.kernel.org/linux-iommu/[email protected]/
> [3] https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2
>
> Best regards,
> baolu
>
> Change log:
> v2:
> - Move all iommu refactoring patches into a sparated series and discuss
> it in a different thread. The latest patch series [v6] is available at
> https://lore.kernel.org/linux-iommu/[email protected]/
> - We discussed the timeout of the pending page fault messages. We
> agreed that we shouldn't apply any timeout policy for the page fault
> handling in user space.
> https://lore.kernel.org/linux-iommu/20230616113232.GA84678@myrica/
> - Jason suggested that we adopt a simple file descriptor interface for
> reading and responding to I/O page requests, so that user space
> applications can improve performance using io_uring.
> https://lore.kernel.org/linux-iommu/[email protected]/
>
> v1: https://lore.kernel.org/linux-iommu/[email protected]/
>
> Lu Baolu (6):
> iommu: Add iommu page fault cookie helpers
> iommufd: Add iommu page fault uapi data
> iommufd: Initializing and releasing IO page fault data
> iommufd: Deliver fault messages to user space
> iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_IOPF test support
> iommufd/selftest: Add coverage for IOMMU_TEST_OP_TRIGGER_IOPF
>
> include/linux/iommu.h | 9 +
> drivers/iommu/iommu-priv.h | 15 +
> drivers/iommu/iommufd/iommufd_private.h | 12 +
> drivers/iommu/iommufd/iommufd_test.h | 8 +
> include/uapi/linux/iommufd.h | 65 +++++
> tools/testing/selftests/iommu/iommufd_utils.h | 66 ++++-
> drivers/iommu/io-pgfault.c | 50 ++++
> drivers/iommu/iommufd/device.c | 69 ++++-
> drivers/iommu/iommufd/hw_pagetable.c | 260 +++++++++++++++++-
> drivers/iommu/iommufd/selftest.c | 56 ++++
> tools/testing/selftests/iommu/iommufd.c | 24 +-
> .../selftests/iommu/iommufd_fail_nth.c | 2 +-
> 12 files changed, 620 insertions(+), 16 deletions(-)
>
> --
> 2.34.1
>

--

Joel Granados



2023-12-04 15:32:23

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

On Mon, Dec 04, 2023 at 04:07:44PM +0100, Joel Granados wrote:
> On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> > Hi folks,
> >
> > This series implements the functionality of delivering IO page faults to
> > user space through the IOMMUFD framework for nested translation. Nested
> Does this mean the IOPF_CAPABLE HWPT needs to be parented by a HWPT
> created with IOMMU_HWPT_ALLOC_NEST_PARENT set?

I would expect no. Both a nested and un-nested configuration should
work.

Jason

2023-12-07 16:34:32

by Joel Granados

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] iommufd: Deliver fault messages to user space

On Thu, Oct 26, 2023 at 10:49:28AM +0800, Lu Baolu wrote:
> Add the file interface that provides a simple and efficient way for
> userspace to handle page faults. The file interface allows userspace
> to read fault messages sequentially, and to respond to the handling
> result by writing to the same file.
>
> Userspace applications are recommended to use io_uring to speed up read
> and write efficiency.
>
> With this done, allow userspace application to allocate a hw page table
> with IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag set.
>
> Signed-off-by: Lu Baolu <[email protected]>
> ---
> drivers/iommu/iommufd/iommufd_private.h | 2 +
> drivers/iommu/iommufd/hw_pagetable.c | 204 +++++++++++++++++++++++-
> 2 files changed, 205 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index 0dbaa2dc5b22..ff063bc48150 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -237,6 +237,8 @@ struct hw_pgtable_fault {
> struct mutex mutex;
> struct list_head deliver;
> struct list_head response;
> + struct file *fault_file;
> + int fault_fd;
> };
>
> /*
> diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
> index 9f94c824cf86..f0aac1bb2d2d 100644
> --- a/drivers/iommu/iommufd/hw_pagetable.c
> +++ b/drivers/iommu/iommufd/hw_pagetable.c
> @@ -3,6 +3,8 @@
> * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
> */
> #include <linux/iommu.h>
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> #include <uapi/linux/iommufd.h>
>
> #include "../iommu-priv.h"
> @@ -38,9 +40,198 @@ static void iommufd_kernel_managed_hwpt_destroy(struct iommufd_object *obj)
> refcount_dec(&hwpt->ioas->obj.users);
> }
>
> +static int iommufd_compose_fault_message(struct iommu_fault *fault,
> + struct iommu_hwpt_pgfault *hwpt_fault,
> + struct device *dev)
> +{
> + struct iommufd_device *idev = iopf_pasid_cookie_get(dev, IOMMU_NO_PASID);
> +
> + if (!idev)
> + return -ENODEV;
> +
> + if (IS_ERR(idev))
> + return PTR_ERR(idev);
> +
> + hwpt_fault->size = sizeof(*hwpt_fault);
> + hwpt_fault->flags = fault->prm.flags;
> + hwpt_fault->dev_id = idev->obj.id;
> + hwpt_fault->pasid = fault->prm.pasid;
> + hwpt_fault->grpid = fault->prm.grpid;
> + hwpt_fault->perm = fault->prm.perm;
> + hwpt_fault->addr = fault->prm.addr;
> + hwpt_fault->private_data[0] = fault->prm.private_data[0];
> + hwpt_fault->private_data[1] = fault->prm.private_data[1];
> +
> + return 0;
> +}
> +
> +static ssize_t hwpt_fault_fops_read(struct file *filep, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + size_t fault_size = sizeof(struct iommu_hwpt_pgfault);
> + struct hw_pgtable_fault *fault = filep->private_data;
> + struct iommu_hwpt_pgfault data;
> + struct iopf_group *group;
> + struct iopf_fault *iopf;
> + size_t done = 0;
> + int rc;
> +
> + if (*ppos || count % fault_size)
> + return -ESPIPE;
> +
> + mutex_lock(&fault->mutex);
> + while (!list_empty(&fault->deliver) && count > done) {
> + group = list_first_entry(&fault->deliver,
> + struct iopf_group, node);
> +
> + if (list_count_nodes(&group->faults) * fault_size > count - done)
> + break;
> +
> + list_for_each_entry(iopf, &group->faults, list) {
> + rc = iommufd_compose_fault_message(&iopf->fault,
> + &data, group->dev);
> + if (rc)
> + goto err_unlock;
> + rc = copy_to_user(buf + done, &data, fault_size);
> + if (rc)
> + goto err_unlock;
> + done += fault_size;
> + }
> +
> + list_move_tail(&group->node, &fault->response);
> + }
> + mutex_unlock(&fault->mutex);
> +
> + return done;
> +err_unlock:
> + mutex_unlock(&fault->mutex);
> + return rc;
> +}
> +
> +static ssize_t hwpt_fault_fops_write(struct file *filep,
> + const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + size_t response_size = sizeof(struct iommu_hwpt_page_response);
> + struct hw_pgtable_fault *fault = filep->private_data;
> + struct iommu_hwpt_page_response response;
> + struct iommufd_hw_pagetable *hwpt;
> + struct iopf_group *iter, *group;
> + struct iommufd_device *idev;
> + size_t done = 0;
> + int rc = 0;
> +
> + if (*ppos || count % response_size)
> + return -ESPIPE;
> +
> + mutex_lock(&fault->mutex);
> + while (!list_empty(&fault->response) && count > done) {
> + rc = copy_from_user(&response, buf + done, response_size);
> + if (rc)
> + break;
> +
> + /* Get the device that this response targets at. */
> + idev = container_of(iommufd_get_object(fault->ictx,
> + response.dev_id,
> + IOMMUFD_OBJ_DEVICE),
> + struct iommufd_device, obj);
> + if (IS_ERR(idev)) {
> + rc = PTR_ERR(idev);
> + break;
> + }
> +
> + /*
> + * Get the hw page table that this response was generated for.
> + * It must match the one stored in the fault data.
> + */
> + hwpt = container_of(iommufd_get_object(fault->ictx,
> + response.hwpt_id,
> + IOMMUFD_OBJ_HW_PAGETABLE),
> + struct iommufd_hw_pagetable, obj);
> + if (IS_ERR(hwpt)) {
> + iommufd_put_object(&idev->obj);
> + rc = PTR_ERR(hwpt);
> + break;
> + }
> +
> + if (hwpt != fault->hwpt) {
> + rc = -EINVAL;
> + goto put_obj;
> + }
> +
> + group = NULL;
> + list_for_each_entry(iter, &fault->response, node) {
> + if (response.grpid != iter->last_fault.fault.prm.grpid)
> + continue;
> +
> + if (idev->dev != iter->dev)
> + continue;
> +
> + if ((iter->last_fault.fault.prm.flags &
> + IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) &&
> + response.pasid != iter->last_fault.fault.prm.pasid)
> + continue;
> +
> + group = iter;
> + break;
> + }
> +
> + if (!group) {
> + rc = -ENODEV;
> + goto put_obj;
> + }
> +
> + rc = iopf_group_response(group, response.code);
> + if (rc)
> + goto put_obj;
> +
> + list_del(&group->node);
> + iopf_free_group(group);
> + done += response_size;
> +put_obj:
> + iommufd_put_object(&hwpt->obj);
> + iommufd_put_object(&idev->obj);
> + if (rc)
> + break;
> + }
> + mutex_unlock(&fault->mutex);
> +
> + return (rc < 0) ? rc : done;
> +}
> +
> +static const struct file_operations hwpt_fault_fops = {
> + .owner = THIS_MODULE,
> + .read = hwpt_fault_fops_read,
> + .write = hwpt_fault_fops_write,
> +};
> +
> +static int hw_pagetable_get_fault_fd(struct hw_pgtable_fault *fault)
> +{
> + struct file *filep;
> + int fdno;
> +
> + fdno = get_unused_fd_flags(O_CLOEXEC);
> + if (fdno < 0)
> + return fdno;
> +
> + filep = anon_inode_getfile("[iommufd-pgfault]", &hwpt_fault_fops,
> + fault, O_RDWR);
> + if (IS_ERR(filep)) {
> + put_unused_fd(fdno);
> + return PTR_ERR(filep);
> + }
> +
> + fd_install(fdno, filep);
> + fault->fault_file = filep;
> + fault->fault_fd = fdno;
> +
> + return 0;
> +}
> +
> static struct hw_pgtable_fault *hw_pagetable_fault_alloc(void)
> {
> struct hw_pgtable_fault *fault;
> + int rc;
>
> fault = kzalloc(sizeof(*fault), GFP_KERNEL);
> if (!fault)
> @@ -50,6 +241,12 @@ static struct hw_pgtable_fault *hw_pagetable_fault_alloc(void)
> INIT_LIST_HEAD(&fault->response);
> mutex_init(&fault->mutex);
>
> + rc = hw_pagetable_get_fault_fd(fault);
> + if (rc) {
> + kfree(fault);
> + return ERR_PTR(rc);
> + }
> +
> return fault;
> }
>
> @@ -58,6 +255,8 @@ static void hw_pagetable_fault_free(struct hw_pgtable_fault *fault)
> WARN_ON(!list_empty(&fault->deliver));
> WARN_ON(!list_empty(&fault->response));
>
> + fput(fault->fault_file);
> + put_unused_fd(fault->fault_fd);
I have been running your code and have run into some invalid memory in
this line. When `put_unused_fd` is called, the files of the current task
are accessed through `current->files`. In my case this is 0x0.

The reason for it being 0x0 is that `do_exit` calls `exit_files`, where
the task's files get set to NULL; this call is made in `do_exit` before we
execute `exit_task_work`.

`exit_task_work` is the call that eventually arrives here at `hw_pagetable_fault_free`.

The way I arrived at this state is the following:
1. Version of linux kernel that I'm using: commit 357b5abcba0477f7f1391dd0fa3a919a6f06bdf0 (HEAD, lubaolu/iommufd-io-pgfault-delivery-v2)
2. Version of qemu that I'm using: commit 577ef478780597d3f449feb01e853b93fa5c5530 (HEAD, yiliu/zhenzhong/wip/iommufd_nesting_rfcv1)
3. This error happens when my user space app is exiting (hence the call
to `do_exit`).
4. I call the IOMMU_HWPT_ALLOC ioctl with
.flags = IOMMU_HWPT_ALLOC_IOPF_CAPABLE and
.hwpt_type = IOMMU_HWPT_TYPE_DEFAULT
.pt_id = the default ioas id.

I have resolved this in a naive way by just not calling the
put_unused_fd function.

Have you run into this? Is this a path that you were expecting?
Also, please get back to me if you need more information about how I got
to this place. I have provided what I think is enough info, but I might
be missing something obvious.

Best

> kfree(fault);
> }
>
> @@ -347,7 +546,9 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
> struct mutex *mutex;
> int rc;
>
> - if (cmd->flags & ~IOMMU_HWPT_ALLOC_NEST_PARENT || cmd->__reserved)
> + if ((cmd->flags & ~(IOMMU_HWPT_ALLOC_NEST_PARENT |
> + IOMMU_HWPT_ALLOC_IOPF_CAPABLE)) ||
> + cmd->__reserved)
> return -EOPNOTSUPP;
> if (!cmd->data_len && cmd->hwpt_type != IOMMU_HWPT_TYPE_DEFAULT)
> return -EINVAL;
> @@ -416,6 +617,7 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
> hwpt->fault->hwpt = hwpt;
> hwpt->domain->iopf_handler = iommufd_hw_pagetable_iopf_handler;
> hwpt->domain->fault_data = hwpt;
> + cmd->out_fault_fd = hwpt->fault->fault_fd;
> }
>
> cmd->out_hwpt_id = hwpt->obj.id;
> --
> 2.34.1
>

--

Joel Granados



2023-12-07 17:17:54

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] iommufd: Deliver fault messages to user space

On Thu, Dec 07, 2023 at 05:34:10PM +0100, Joel Granados wrote:
> > @@ -58,6 +255,8 @@ static void hw_pagetable_fault_free(struct hw_pgtable_fault *fault)
> > WARN_ON(!list_empty(&fault->deliver));
> > WARN_ON(!list_empty(&fault->response));
> >
> > + fput(fault->fault_file);
> > + put_unused_fd(fault->fault_fd);

> I have resolved this in a naive way by just not calling the
> put_unused_fd function.

That is correct.

put_unused_fd() should only be called on error paths prior to the
syscall return.

The design of a FD must follow this pattern

syscall():
fdno = get_unused_fd_flags(O_CLOEXEC);
filep = [..]

// syscall MUST succeed after this statement:
fd_install(fdno, filep);
return 0;

err:
put_unused_fd(fdno)
return -ERRNO

Also the refcounting looks a little strange, the filep reference is
consumed by fd_install, so what is that fput pairing with in fault_free?

Jason

2023-12-08 05:16:14

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

On 12/4/23 11:07 PM, Joel Granados wrote:
> On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
>> Hi folks,
>>
>> This series implements the functionality of delivering IO page faults to
>> user space through the IOMMUFD framework for nested translation. Nested
> Does this mean the IOPF_CAPABLE HWPT needs to be parented by a HWPT
> created with IOMMU_HWPT_ALLOC_NEST_PARENT set?

No. It's generic; nested translation is simply a use case that is
currently feasible.

Best regards,
baolu

2023-12-08 05:52:41

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] iommufd: Deliver fault messages to user space

On 12/8/23 1:17 AM, Jason Gunthorpe wrote:
> On Thu, Dec 07, 2023 at 05:34:10PM +0100, Joel Granados wrote:
>>> @@ -58,6 +255,8 @@ static void hw_pagetable_fault_free(struct hw_pgtable_fault *fault)
>>> WARN_ON(!list_empty(&fault->deliver));
>>> WARN_ON(!list_empty(&fault->response));
>>>
>>> + fput(fault->fault_file);
>>> + put_unused_fd(fault->fault_fd);
>> I have resolved this in a naive way by just not calling the
>> put_unused_fd function.
> That is correct.
>
> put_unused_fd() should only be called on error paths prior to the
> syscall return.
>
> The design of a FD must follow this pattern
>
> syscall():
> fdno = get_unused_fd_flags(O_CLOEXEC);
> filep = [..]
>
> // syscall MUST succeed after this statement:
> fd_install(fdno, filep);
> return 0;
>
> err:
> put_unused_fd(fdno)
> return -ERRNO

Yes. Agreed.

>
> Also the refcounting looks a little strange, the filep reference is
> consumed by fd_install, so what is that fput pairing with in fault_free?

fput() pairs with get_unused_fd_flags()? fd_install() does not seem to
increase any reference.

Best regards,
baolu

2023-12-08 06:02:42

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

On 12/1/23 10:24 PM, Jason Gunthorpe wrote:
> On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
>> Hi folks,
>>
>> This series implements the functionality of delivering IO page faults to
>> user space through the IOMMUFD framework for nested translation. Nested
>> translation is a hardware feature that supports two-stage translation
>> tables for IOMMU. The second-stage translation table is managed by the
>> host VMM, while the first-stage translation table is owned by user
>> space. This allows user space to control the IOMMU mappings for its
>> devices.
>>
>> When an IO page fault occurs on the first-stage translation table, the
>> IOMMU hardware can deliver the page fault to user space through the
>> IOMMUFD framework. User space can then handle the page fault and respond
>> to the device top-down through the IOMMUFD. This allows user space to
>> implement its own IO page fault handling policies.
>>
>> User space indicates its capability of handling IO page faults by
>> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
>> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
>> for page fault delivery. On a successful return of HWPT allocation, the
>> user can retrieve and respond to page faults by reading and writing to
>> the file descriptor (FD) returned in out_fault_fd.
>
> This is probably backwards, userspace should allocate the FD with a
> dedicated ioctl and provide it during domain allocation.

Introducing a dedicated fault FD for fault handling seems promising. It
decouples the fault handling from any specific domain. I suppose we need
different fault FDs for recoverable faults (a.k.a. IO page faults) and
unrecoverable faults. Do I understand you correctly?


> If the userspace wants a fd per domain then it should do that. If it
> wants to share fds between domains that should work too.

Yes, it's more flexible. The fault message contains the hwpt obj id, so
user space can recognize the hwpt on which the fault happened.

Best regards,
baolu
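
For reference, a user-space consumer of the v2 interface could look
roughly like the sketch below, using the structs and flags from patch 2
(error handling trimmed). It is a plain read()/write() loop; the same
pattern maps directly onto io_uring for better throughput:

#include <unistd.h>
#include <linux/iommufd.h>      /* with this series applied */

/* Drain one batch of fault groups from the fault fd and respond. */
static int handle_faults(int fault_fd, __u32 hwpt_id)
{
        struct iommu_hwpt_pgfault faults[32];
        struct iommu_hwpt_page_response resp = {};
        ssize_t n;
        size_t i;

        /* The kernel only hands out whole fault groups, so read in bulk. */
        n = read(fault_fd, faults, sizeof(faults));
        if (n < 0)
                return -1;

        for (i = 0; i < (size_t)n / sizeof(faults[0]); i++) {
                struct iommu_hwpt_pgfault *f = &faults[i];

                /* ... resolve f->addr in the stage-1 page table ... */

                if (!(f->flags & IOMMU_PGFAULT_FLAGS_LAST_PAGE))
                        continue;       /* one response per group */

                resp.size = sizeof(resp);
                resp.hwpt_id = hwpt_id;
                resp.dev_id = f->dev_id;
                resp.pasid = f->pasid;
                resp.grpid = f->grpid;
                resp.code = 0;          /* 0 == success per the kdoc */
                if (write(fault_fd, &resp, sizeof(resp)) != sizeof(resp))
                        return -1;
        }
        return 0;
}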

2023-12-08 06:29:11

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v2 1/6] iommu: Add iommu page fault cookie helpers

On 12/1/23 10:38 PM, Jason Gunthorpe wrote:
> On Thu, Oct 26, 2023 at 10:49:25AM +0800, Lu Baolu wrote:
>
>> +void *iopf_pasid_cookie_get(struct device *dev, ioasid_t pasid)
>> +{
>> + struct iommu_fault_param *iopf_param = iopf_get_dev_fault_param(dev);
>> + void *curr;
>> +
>> + if (!iopf_param)
>> + return ERR_PTR(-ENODEV);
>> +
>> + xa_lock(&iopf_param->pasid_cookie);
>> + curr = xa_load(&iopf_param->pasid_cookie, pasid);
>> + xa_unlock(&iopf_param->pasid_cookie);
> No need for this locking, the caller has to provide some kind of
> locking to protect the returned pointer.
>
> I'm not sure how this can work really..
>
> What iommufd wants is to increment the device object refcount under
> this xa_lock.
>
> I'm not sure this is the right arrangement: Basically you want to
> have a cookie per domain attachment for iopf domains that is forwarded
> to the handler.
>
> So maybe this entire thing is not quite right, instead of having a
> generic iopf attached to the domain the iopf should be supplied at
> domain attach time? Something like:
>
> iommu_domain_attach_iopf(struct iommu_domain *, struct device *,
> ioasid_t pasid, struct iopf *, void *cookie);
>
> The per-attach cookie would be passed to the iopf function
> automatically by the infrastructure.
>
> Detach would have the necessary locking to ensure that no handler is
> running across detach
>
> Then the cookie is logically placed in the API and properly protected
> with natural locking we already need.

Great idea! In a subsequent series, we could arrange the enabling and
disabling of IOPF in this API, thereby eliminating the calls to
iommu_dev_enable/disable_feature(dev, IOMMU_DEV_FEAT_IOPF) from the
device drivers.

Best regards,
baolu

2023-12-08 06:39:58

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] iommufd: Add iommu page fault uapi data


On 12/1/23 11:14 PM, Jason Gunthorpe wrote:
> On Thu, Oct 26, 2023 at 10:49:26AM +0800, Lu Baolu wrote:
>
>> + * @IOMMU_HWPT_ALLOC_IOPF_CAPABLE: User is capable of handling IO page faults.
>
> This does not seem like the best name?
>
> Probably like this given my remark in the cover letter:
>
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -359,6 +359,7 @@ struct iommu_vfio_ioas {
> enum iommufd_hwpt_alloc_flags {
> IOMMU_HWPT_ALLOC_NEST_PARENT = 1 << 0,
> IOMMU_HWPT_ALLOC_DIRTY_TRACKING = 1 << 1,
> + IOMMU_HWPT_IOPFD_FD_VALID = 1 << 2,
> };
>
> /**
> @@ -440,6 +441,7 @@ struct iommu_hwpt_alloc {
> __u32 data_type;
> __u32 data_len;
> __aligned_u64 data_uptr;
> + __s32 iopf_fd;
> };
> #define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)

Yes. Agreed.

>> @@ -679,6 +688,62 @@ struct iommu_dev_data_arm_smmuv3 {
>> __u32 sid;
>> };
>>
>> +/**
>> + * struct iommu_hwpt_pgfault - iommu page fault data
>> + * @size: sizeof(struct iommu_hwpt_pgfault)
>> + * @flags: Combination of IOMMU_PGFAULT_FLAGS_ flags.
>> + * - PASID_VALID: @pasid field is valid
>> + * - LAST_PAGE: the last page fault in a group
>> + * - PRIV_DATA: @private_data field is valid
>> + * - RESP_NEEDS_PASID: the page response must have the same
>> + * PASID value as the page request.
>> + * @dev_id: id of the originated device
>> + * @pasid: Process Address Space ID
>> + * @grpid: Page Request Group Index
>> + * @perm: requested page permissions (IOMMU_PGFAULT_PERM_* values)
>> + * @addr: page address
>> + * @private_data: device-specific private information
>> + */
>> +struct iommu_hwpt_pgfault {
>> + __u32 size;
>> + __u32 flags;
>> +#define IOMMU_PGFAULT_FLAGS_PASID_VALID (1 << 0)
>> +#define IOMMU_PGFAULT_FLAGS_LAST_PAGE (1 << 1)
>> +#define IOMMU_PGFAULT_FLAGS_PRIV_DATA (1 << 2)
>> +#define IOMMU_PGFAULT_FLAGS_RESP_NEEDS_PASID (1 << 3)
>> + __u32 dev_id;
>> + __u32 pasid;
>> + __u32 grpid;
>> + __u32 perm;
>> +#define IOMMU_PGFAULT_PERM_READ (1 << 0)
>> +#define IOMMU_PGFAULT_PERM_WRITE (1 << 1)
>> +#define IOMMU_PGFAULT_PERM_EXEC (1 << 2)
>> +#define IOMMU_PGFAULT_PERM_PRIV (1 << 3)
>> + __u64 addr;
>> + __u64 private_data[2];
>> +};
>
> This mixed #define is not the style, these should be in enums,
> possibly with kdocs
>
> Use __aligned_u64 also

Sure.

>
>> +
>> +/**
>> + * struct iommu_hwpt_response - IOMMU page fault response
>> + * @size: sizeof(struct iommu_hwpt_response)
>> + * @flags: Must be set to 0
>> + * @hwpt_id: hwpt ID of target hardware page table for the response
>> + * @dev_id: device ID of target device for the response
>> + * @pasid: Process Address Space ID
>> + * @grpid: Page Request Group Index
>> + * @code: response code. The supported codes include:
>> + * 0: Successful; 1: Response Failure; 2: Invalid Request.
>> + */
>> +struct iommu_hwpt_page_response {
>> + __u32 size;
>> + __u32 flags;
>> + __u32 hwpt_id;
>> + __u32 dev_id;
>> + __u32 pasid;
>> + __u32 grpid;
>> + __u32 code;
>> +};
>
> Is it OK to have the user pass in all this detailed information? Is it
> a security problem if the user lies? Ie shouldn't we only ack page
> faults we actually have outstanding?
>
> IOW should iommu_hwpt_pgfault just have a 'response_cookie' generated
> by the kernel that should be placed here? The kernel would keep track
> of all this internal stuff?

The iommu core already keeps the outstanding faults that are awaiting
a response. So even if the user lies about a fault, the kernel does
not send a wrong response message to the device. {device_id, grpid,
code} is just enough from the user. It means the user wants to respond
to the @grpid fault from @device with the @code result.

Best regards,
baolu

2023-12-08 11:43:49

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] iommufd: Deliver fault messages to user space

On 2023/12/1 23:24, Jason Gunthorpe wrote:
> On Thu, Oct 26, 2023 at 10:49:28AM +0800, Lu Baolu wrote:
>
>> +static ssize_t hwpt_fault_fops_write(struct file *filep,
>> + const char __user *buf,
>> + size_t count, loff_t *ppos)
>> +{
>> + size_t response_size = sizeof(struct iommu_hwpt_page_response);
>> + struct hw_pgtable_fault *fault = filep->private_data;
>> + struct iommu_hwpt_page_response response;
>> + struct iommufd_hw_pagetable *hwpt;
>> + struct iopf_group *iter, *group;
>> + struct iommufd_device *idev;
>> + size_t done = 0;
>> + int rc = 0;
>> +
>> + if (*ppos || count % response_size)
>> + return -ESPIPE;
>> +
>> + mutex_lock(&fault->mutex);
>> + while (!list_empty(&fault->response) && count > done) {
>> + rc = copy_from_user(&response, buf + done, response_size);
>> + if (rc)
>> + break;
>> +
>> + /* Get the device that this response targets at. */
>> + idev = container_of(iommufd_get_object(fault->ictx,
>> + response.dev_id,
>> + IOMMUFD_OBJ_DEVICE),
>> + struct iommufd_device, obj);
>> + if (IS_ERR(idev)) {
>> + rc = PTR_ERR(idev);
>> + break;
>> + }
>
> See here it might be better to have a per-fd list of outstanding
> faults per-fd and then the cookie would just index that list, then you
> get everything in one shot instead of having to do a xarray looking
> and then a linear list search

Yours is more efficient. I will do it that way in the next version.

>
>> +static const struct file_operations hwpt_fault_fops = {
>> + .owner = THIS_MODULE,
>> + .read = hwpt_fault_fops_read,
>> + .write = hwpt_fault_fops_write,
>> +};
>
> nonseekable_open() behavior should be integrated into this

Sure.

>
>> +static int hw_pagetable_get_fault_fd(struct hw_pgtable_fault *fault)
>> +{
>> + struct file *filep;
>> + int fdno;
>> +
>> + fdno = get_unused_fd_flags(O_CLOEXEC);
>> + if (fdno < 0)
>> + return fdno;
>> +
>> + filep = anon_inode_getfile("[iommufd-pgfault]", &hwpt_fault_fops,
>> + fault, O_RDWR);
>> + if (IS_ERR(filep)) {
>> + put_unused_fd(fdno);
>> + return PTR_ERR(filep);
>> + }
>> +
>> + fd_install(fdno, filep);
>> + fault->fault_file = filep;
>> + fault->fault_fd = fdno;
>
> fd_install must be the very last thing before returning success from a
> system call because we cannot undo it.

Yes.

>
> There are other failure paths before here and the final return
>
> Jason

Best regards,
baolu

2023-12-08 13:41:56

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] iommufd: Deliver fault messages to user space

On Fri, Dec 08, 2023 at 01:47:35PM +0800, Baolu Lu wrote:
> On 12/8/23 1:17 AM, Jason Gunthorpe wrote:
> > On Thu, Dec 07, 2023 at 05:34:10PM +0100, Joel Granados wrote:
> > > > @@ -58,6 +255,8 @@ static void hw_pagetable_fault_free(struct hw_pgtable_fault *fault)
> > > > WARN_ON(!list_empty(&fault->deliver));
> > > > WARN_ON(!list_empty(&fault->response));
> > > > + fput(fault->fault_file);
> > > > + put_unused_fd(fault->fault_fd);
> > > I have resolved this in a naive way by just not calling the
> > > put_unused_fd function.
> > That is correct.
> >
> > put_unused_fd() should only be called on error paths prior to the
> > syscall return.
> >
> > The design of a FD must follow this pattern
> >
> > syscall():
> > fdno = get_unused_fd_flags(O_CLOEXEC);
> > filep = [..]
> > // syscall MUST succeed after this statement:
> > fd_install(fdno, filep);
> > return 0;
> >
> > err:
> > put_unused_fd(fdno)
> > return -ERRNO
>
> Yes. Agreed.
>
> >
> > Also the refcounting looks a little strange, the filep reference is
> > consumed by fd_install, so what is that fput pairing with in fault_free?
>
> fput() pairs with get_unused_fd_flags()? fd_install() does not seem to
> increase any reference.

fd_install() transfers the reference to the fd table and that
reference is put back by the close() system call.

Notice that instantly after fd_install() a concurrent user can free
the filep.

Jason
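
Concretely, if the hwpt wants to keep a long-lived fault->fault_file
pointer past fd_install(), it has to take an extra reference, since the
reference returned by anon_inode_getfile() is the one consumed by the fd
table. A sketch of that arrangement:

        filep = anon_inode_getfile("[iommufd-pgfault]", &hwpt_fault_fops,
                                   fault, O_RDWR);
        if (IS_ERR(filep))
                return PTR_ERR(filep);

        /* Extra reference for fault->fault_file, dropped by the fput()
         * in hw_pagetable_fault_free(); no put_unused_fd() there. */
        fault->fault_file = get_file(filep);

        /* ... at the very end of the ioctl, once nothing can fail: */
        fd_install(fdno, filep);        /* the fd table now owns the
                                         * original reference; close()
                                         * puts it back */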

2023-12-08 13:43:31

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

On Fri, Dec 08, 2023 at 01:57:26PM +0800, Baolu Lu wrote:
> On 12/1/23 10:24 PM, Jason Gunthorpe wrote:
> > On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> > > Hi folks,
> > >
> > > This series implements the functionality of delivering IO page faults to
> > > user space through the IOMMUFD framework for nested translation. Nested
> > > translation is a hardware feature that supports two-stage translation
> > > tables for IOMMU. The second-stage translation table is managed by the
> > > host VMM, while the first-stage translation table is owned by user
> > > space. This allows user space to control the IOMMU mappings for its
> > > devices.
> > >
> > > When an IO page fault occurs on the first-stage translation table, the
> > > IOMMU hardware can deliver the page fault to user space through the
> > > IOMMUFD framework. User space can then handle the page fault and respond
> > > to the device top-down through the IOMMUFD. This allows user space to
> > > implement its own IO page fault handling policies.
> > >
> > > User space indicates its capability of handling IO page faults by
> > > setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> > > hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> > > for page fault delivery. On a successful return of HWPT allocation, the
> > > user can retrieve and respond to page faults by reading and writing to
> > > the file descriptor (FD) returned in out_fault_fd.
> >
> > This is probably backwards, userspace should allocate the FD with a
> > dedicated ioctl and provide it during domain allocation.
>
> Introducing a dedicated fault FD for fault handling seems promising. It
> decouples the fault handling from any specific domain. I suppose we need
> different fault fd for recoverable faults (a.k.a. IO page fault) and
> unrecoverable faults. Do I understand you correctly?

I haven't thought that far ahead :) Once you have a generic fault FD
concept it can be sliced in different ways. If there is a technical
need to separate recoverable/unrecoverable then the FD flavour should
be specified during FD creation. Otherwise just let userspace do
whatever it wants.

Jason
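
If a flavour does end up being fixed at creation time, the dedicated
ioctl could look something like the sketch below. This is purely
illustrative; no such ioctl or flags exist in this series:

enum iommufd_fault_alloc_flags {
        IOMMUFD_FAULT_ALLOC_RECOVERABLE = 1 << 0,       /* IO page faults */
        IOMMUFD_FAULT_ALLOC_UNRECOVERABLE = 1 << 1,     /* error events */
};

struct iommu_fault_alloc {
        __u32 size;
        __u32 flags;            /* enum iommufd_fault_alloc_flags */
        __u32 out_fault_id;     /* object id to pass to IOMMU_HWPT_ALLOC */
        __u32 out_fault_fd;
};
#define IOMMU_FAULT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_FAULT_ALLOC)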

2023-12-12 13:10:43

by Joel Granados

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] iommufd: Initializing and releasing IO page fault data

On Thu, Oct 26, 2023 at 10:49:27AM +0800, Lu Baolu wrote:
> Add some housekeeping code for IO page fault delivery. Add a fault field
> in the iommufd_hw_pagetable structure to store pending IO page faults and
> other related data.
>
> The fault field is allocated and initialized when an IOPF-capable user
> HWPT is allocated. It is indicated by the IOMMU_HWPT_ALLOC_IOPF_CAPABLE
> flag being set in the allocation user data. The fault field exists until
> the HWPT is destroyed. This also means that you can determine whether a
> HWPT is IOPF-capable by checking the fault field.
>
> When an IOPF-capable HWPT is attached to a device (could also be a PASID of
> a device in the future), the iommufd device pointer is saved for the pasid
> of the device. The pointer is recalled and all pending iopf groups are
> discarded after the HWPT is detached from the device.
>
> Signed-off-by: Lu Baolu <[email protected]>
> ---
> include/linux/iommu.h | 6 +++
> drivers/iommu/iommufd/iommufd_private.h | 10 ++++
> drivers/iommu/iommufd/device.c | 69 +++++++++++++++++++++++--
> drivers/iommu/iommufd/hw_pagetable.c | 56 +++++++++++++++++++-
> 4 files changed, 137 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 615d8a5f9dee..600ca3842c8a 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -130,6 +130,12 @@ struct iopf_group {
> struct work_struct work;
> struct device *dev;
> struct iommu_domain *domain;
> +
> + /*
> + * Used by iopf handlers, like iommufd, to hook the iopf group
> + * on its own lists.
> + */
> + struct list_head node;
> };
>
> /**
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index 1bd412cff2d6..0dbaa2dc5b22 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -230,6 +230,15 @@ int iommufd_option_rlimit_mode(struct iommu_option *cmd,
>
> int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
>
> +struct hw_pgtable_fault {
> + struct iommufd_ctx *ictx;
> + struct iommufd_hw_pagetable *hwpt;
> + /* Protect below iopf lists. */
> + struct mutex mutex;
> + struct list_head deliver;
> + struct list_head response;
> +};
> +
> /*
> * A HW pagetable is called an iommu_domain inside the kernel. This user object
> * allows directly creating and inspecting the domains. Domains that have kernel
> @@ -239,6 +248,7 @@ int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
> struct iommufd_hw_pagetable {
> struct iommufd_object obj;
> struct iommu_domain *domain;
> + struct hw_pgtable_fault *fault;
>
> void (*abort)(struct iommufd_object *obj);
> void (*destroy)(struct iommufd_object *obj);
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index 645ab5d290fe..0a8e03d5e7c5 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -456,6 +456,16 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
> if (rc)
> goto err_unlock;
>
> + if (hwpt->fault) {
> + void *curr;
> +
> + curr = iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, idev);
I'm hitting an error here when I try to attach to a hwpt that I created
previously with the `IOMMU_HWPT_ALLOC_IOPF_CAPABLE` flag.

I get an -ENODEV from iopf_pasid_cookie_set which is triggered by
dev->iommu->fault_param being 0x0.

I looked around and I see that the fault param gets set in
iopf_queue_add_device which is called from iommu_dev_enable_feature
only. Furthermore iommu_dev_enable_feature is only called in idxd and
uacce drivers.

Questions:
1. Should iopf_queue_add_device get called from the
IOMMU_HWPT_ALLOC_IOPF_CAPABLE ioctl call? This makes sense to me as
this is where the device and the IOPF are related from user space.
2. This is not intended to work only with idxd and uacce. right?

Best
> + if (IS_ERR(curr)) {
> + rc = PTR_ERR(curr);
> + goto err_unresv;
> + }
> + }
> +
> /*
> * Only attach to the group once for the first device that is in the
> * group. All the other devices will follow this attachment. The user
> @@ -466,17 +476,20 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
> if (list_empty(&idev->igroup->device_list)) {
> rc = iommufd_group_setup_msi(idev->igroup, hwpt);
> if (rc)
> - goto err_unresv;
> + goto err_unset;
>
> rc = iommu_attach_group(hwpt->domain, idev->igroup->group);
> if (rc)
> - goto err_unresv;
> + goto err_unset;
> idev->igroup->hwpt = hwpt;
> }
> refcount_inc(&hwpt->obj.users);
> list_add_tail(&idev->group_item, &idev->igroup->device_list);
> mutex_unlock(&idev->igroup->lock);
> return 0;
> +err_unset:
> + if (hwpt->fault)
> + iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, NULL);
> err_unresv:
> iommufd_device_remove_rr(idev, hwpt);
> err_unlock:
> @@ -484,6 +497,30 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
> return rc;
> }
>
> +/*
> + * Discard all pending page faults. Called when a hw pagetable is detached
> + * from a device. The iommu core guarantees that all page faults have been
> + * responded, hence there's no need to respond it again.
> + */
> +static void iommufd_hw_pagetable_discard_iopf(struct iommufd_hw_pagetable *hwpt)
> +{
> + struct iopf_group *group, *next;
> +
> + if (!hwpt->fault)
> + return;
> +
> + mutex_lock(&hwpt->fault->mutex);
> + list_for_each_entry_safe(group, next, &hwpt->fault->deliver, node) {
> + list_del(&group->node);
> + iopf_free_group(group);
> + }
> + list_for_each_entry_safe(group, next, &hwpt->fault->response, node) {
> + list_del(&group->node);
> + iopf_free_group(group);
> + }
> + mutex_unlock(&hwpt->fault->mutex);
> +}
> +
> struct iommufd_hw_pagetable *
> iommufd_hw_pagetable_detach(struct iommufd_device *idev)
> {
> @@ -491,6 +528,8 @@ iommufd_hw_pagetable_detach(struct iommufd_device *idev)
>
> mutex_lock(&idev->igroup->lock);
> list_del(&idev->group_item);
> + if (hwpt->fault)
> + iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, NULL);
> if (list_empty(&idev->igroup->device_list)) {
> iommu_detach_group(hwpt->domain, idev->igroup->group);
> idev->igroup->hwpt = NULL;
> @@ -498,6 +537,8 @@ iommufd_hw_pagetable_detach(struct iommufd_device *idev)
> iommufd_device_remove_rr(idev, hwpt);
> mutex_unlock(&idev->igroup->lock);
>
> + iommufd_hw_pagetable_discard_iopf(hwpt);
> +
> /* Caller must destroy hwpt */
> return hwpt;
> }
> @@ -563,9 +604,24 @@ iommufd_device_do_replace(struct iommufd_device *idev,
> if (rc)
> goto err_unresv;
>
> + if (old_hwpt->fault) {
> + iommufd_hw_pagetable_discard_iopf(old_hwpt);
> + iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, NULL);
> + }
> +
> + if (hwpt->fault) {
> + void *curr;
> +
> + curr = iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, idev);
> + if (IS_ERR(curr)) {
> + rc = PTR_ERR(curr);
> + goto err_unresv;
> + }
> + }
> +
> rc = iommu_group_replace_domain(igroup->group, hwpt->domain);
> if (rc)
> - goto err_unresv;
> + goto err_unset;
>
> if (iommufd_hw_pagetable_compare_ioas(old_hwpt, hwpt)) {
> list_for_each_entry(cur, &igroup->device_list, group_item)
> @@ -583,8 +639,15 @@ iommufd_device_do_replace(struct iommufd_device *idev,
> &old_hwpt->obj.users));
> mutex_unlock(&idev->igroup->lock);
>
> + iommufd_hw_pagetable_discard_iopf(old_hwpt);
> +
> /* Caller must destroy old_hwpt */
> return old_hwpt;
> +err_unset:
> + if (hwpt->fault)
> + iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, NULL);
> + if (old_hwpt->fault)
> + iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, idev);
> err_unresv:
> if (iommufd_hw_pagetable_compare_ioas(old_hwpt, hwpt)) {
> list_for_each_entry(cur, &igroup->device_list, group_item)
> diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
> index 72c46de1396b..9f94c824cf86 100644
> --- a/drivers/iommu/iommufd/hw_pagetable.c
> +++ b/drivers/iommu/iommufd/hw_pagetable.c
> @@ -38,9 +38,38 @@ static void iommufd_kernel_managed_hwpt_destroy(struct iommufd_object *obj)
> refcount_dec(&hwpt->ioas->obj.users);
> }
>
> +static struct hw_pgtable_fault *hw_pagetable_fault_alloc(void)
> +{
> + struct hw_pgtable_fault *fault;
> +
> + fault = kzalloc(sizeof(*fault), GFP_KERNEL);
> + if (!fault)
> + return ERR_PTR(-ENOMEM);
> +
> + INIT_LIST_HEAD(&fault->deliver);
> + INIT_LIST_HEAD(&fault->response);
> + mutex_init(&fault->mutex);
> +
> + return fault;
> +}
> +
> +static void hw_pagetable_fault_free(struct hw_pgtable_fault *fault)
> +{
> + WARN_ON(!list_empty(&fault->deliver));
> + WARN_ON(!list_empty(&fault->response));
> +
> + kfree(fault);
> +}
> +
> void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
> {
> - container_of(obj, struct iommufd_hw_pagetable, obj)->destroy(obj);
> + struct iommufd_hw_pagetable *hwpt =
> + container_of(obj, struct iommufd_hw_pagetable, obj);
> +
> + if (hwpt->fault)
> + hw_pagetable_fault_free(hwpt->fault);
> +
> + hwpt->destroy(obj);
> }
>
> static void iommufd_user_managed_hwpt_abort(struct iommufd_object *obj)
> @@ -289,6 +318,17 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx,
> return ERR_PTR(rc);
> }
>
> +static int iommufd_hw_pagetable_iopf_handler(struct iopf_group *group)
> +{
> + struct iommufd_hw_pagetable *hwpt = group->domain->fault_data;
> +
> + mutex_lock(&hwpt->fault->mutex);
> + list_add_tail(&group->node, &hwpt->fault->deliver);
> + mutex_unlock(&hwpt->fault->mutex);
> +
> + return 0;
> +}
> +
> int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
> {
> struct iommufd_hw_pagetable *(*alloc_fn)(
> @@ -364,6 +404,20 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
> goto out_unlock;
> }
>
> + if (cmd->flags & IOMMU_HWPT_ALLOC_IOPF_CAPABLE) {
> + hwpt->fault = hw_pagetable_fault_alloc();
> + if (IS_ERR(hwpt->fault)) {
> + rc = PTR_ERR(hwpt->fault);
> + hwpt->fault = NULL;
> + goto out_hwpt;
> + }
> +
> + hwpt->fault->ictx = ucmd->ictx;
> + hwpt->fault->hwpt = hwpt;
> + hwpt->domain->iopf_handler = iommufd_hw_pagetable_iopf_handler;
> + hwpt->domain->fault_data = hwpt;
> + }
> +
> cmd->out_hwpt_id = hwpt->obj.id;
> rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
> if (rc)
> --
> 2.34.1
>

--

Joel Granados



2023-12-12 14:12:35

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] iommufd: Initializing and releasing IO page fault data

On Tue, Dec 12, 2023 at 02:10:08PM +0100, Joel Granados wrote:

> > diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> > index 645ab5d290fe..0a8e03d5e7c5 100644
> > --- a/drivers/iommu/iommufd/device.c
> > +++ b/drivers/iommu/iommufd/device.c
> > @@ -456,6 +456,16 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
> > if (rc)
> > goto err_unlock;
> >
> > + if (hwpt->fault) {
> > + void *curr;
> > +
> > + curr = iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, idev);
> I'm hitting an error here when I try to attach to a hwpt that I created
> previously with the `IOMMU_HWPT_ALLOC_IOPF_CAPABLE` flag.
>
> I get an -ENODEV from iopf_pasid_cookie_set which is triggered by
> dev->iommu->fault_param being 0x0.
>
> I looked around and I see that the fault param gets set in
> iopf_queue_add_device which is called from iommu_dev_enable_feature
> only. Furthermore iommu_dev_enable_feature is only called in idxd and
> uacce drivers.
>
> Questions:
> 1. Should iopf_queue_add_device get called from the
> IOMMU_HWPT_ALLOC_IOPF_CAPABLE ioctl call? This makes sense to me as
> this is where the device and the IOPF are related from user space.

It probably needs to call the set feature thing in the short term.

In the medium term I would like the drivers to manage the iopf based
on domain attachment not explicit feature asks

> 2. This is not intended to work only with idxd and uacce. right?

It should work everywhere, I suspect Intel Team didn't hit this
because they are testing IDXD SIOV? Can you guys also test it as a PF
assignment?

Jason
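
The short-term fix could be as small as having iommufd enable the
feature when an IOPF-capable hwpt is attached, roughly as sketched below
(the iopf_enabled field is hypothetical):

static int iommufd_fault_iopf_enable(struct iommufd_device *idev)
{
        int rc;

        /* Sets up dev->iommu->fault_param via iopf_queue_add_device(). */
        rc = iommu_dev_enable_feature(idev->dev, IOMMU_DEV_FEAT_IOPF);
        if (rc)
                return rc;

        idev->iopf_enabled = true;
        return 0;
}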

2023-12-13 02:09:53

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] iommufd: Initializing and releasing IO page fault data

On 12/12/23 10:12 PM, Jason Gunthorpe wrote:
> On Tue, Dec 12, 2023 at 02:10:08PM +0100, Joel Granados wrote:
>
>>> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
>>> index 645ab5d290fe..0a8e03d5e7c5 100644
>>> --- a/drivers/iommu/iommufd/device.c
>>> +++ b/drivers/iommu/iommufd/device.c
>>> @@ -456,6 +456,16 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
>>> if (rc)
>>> goto err_unlock;
>>>
>>> + if (hwpt->fault) {
>>> + void *curr;
>>> +
>>> + curr = iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, idev);
>> I'm hitting an error here when I try to attach to a hwpt that I created
>> previously with the `IOMMU_HWPT_ALLOC_IOPF_CAPABLE` flag.
>>
>> I get an -ENODEV from iopf_pasid_cookie_set which is triggered by
>> dev->iommu->fault_param being 0x0.
>>
>> I looked around and I see that the fault param gets set in
>> iopf_queue_add_device which is called from iommu_dev_enable_feature
>> only. Furthermore iommu_dev_enable_feature is only called in idxd and
>> uacce drivers.
>>
>> Questions:
>> 1. Should iopf_queue_add_device get called from the
>> IOMMU_HWPT_ALLOC_IOPF_CAPABLE ioctl call? This makes sense to me as
>> this is where the device and the IOPF are related from user space.
> It probably needs to call the set feature thing in the short term.
>
> In the medium term I would like the drivers to manage the iopf based
> on domain attachment not explicit feature asks

Yes, it's the same as my plan.

>
>> 2. This is not intended to work only with idxd and uacce. right?
> It should work everywhere, I suspect Intel Team didn't hit this
> because they are testing IDXD SIOV?

Yes.

> Can you guys also test it as a PF
> assignment?

For PF assignment, probably the driver (vfio-pci) needs to enable iopf.

Best regards,
baolu

2023-12-13 02:15:53

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v2 3/6] iommufd: Initializing and releasing IO page fault data

> From: Baolu Lu <[email protected]>
> Sent: Wednesday, December 13, 2023 10:05 AM
> >
> >> 2. This is not intended to work only with idxd and uacce. right?
> > It should work everywhere, I suspect Intel Team didn't hit this
> > because they are testing IDXD SIOV?
>
> Yes.
>
> > Can you guys also test it as a PF
> > assignment?
>
> For PF assignment, probably the driver (vfio-pci) needs to enable iopf.
>

We haven't merged anything for SIOV yet.

so the base of this series should be PCI functions (PF or VF) and vfio-pci
has to be extended with whatever is required to support iopf.

2023-12-13 13:20:22

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] iommufd: Initializing and releasing IO page fault data

On Wed, Dec 13, 2023 at 02:15:28AM +0000, Tian, Kevin wrote:
> > From: Baolu Lu <[email protected]>
> > Sent: Wednesday, December 13, 2023 10:05 AM
> > >
> > >> 2. This is not intended to work only with idxd and uacce. right?
> > > It should work everywhere, I suspect Intel Team didn't hit this
> > > because they are testing IDXD SIOV?
> >
> > Yes.
> >
> > > Can you guys also test it as a PF
> > > assignment?
> >
> > For PF assignment, probably the driver (vfio-pci) needs to enable iopf.
> >
>
> We haven't merged anything for SIOV yet.
>
> so the base of this series should be PCI functions (PF or VF) and vfio-pci
> has to be extended with whatever required to support iopf.

Right. I suggest you target full idxd device assignment to a guest
with working PRI/etc as a validation.

Jason

Subject: RE: [PATCH v2 4/6] iommufd: Deliver fault messages to user space



> -----Original Message-----
> From: Lu Baolu <[email protected]>
> Sent: Thursday, October 26, 2023 3:49 AM
> To: Jason Gunthorpe <[email protected]>; Kevin Tian <[email protected]>; Joerg
> Roedel <[email protected]>; Will Deacon <[email protected]>; Robin Murphy
> <[email protected]>; Jean-Philippe Brucker <[email protected]>;
> Nicolin Chen <[email protected]>; Yi Liu <[email protected]>; Jacob Pan
> <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Lu
> Baolu <[email protected]>
> Subject: [PATCH v2 4/6] iommufd: Deliver fault messages to user space
>
[...]

Hi,

> +static ssize_t hwpt_fault_fops_write(struct file *filep,
> + const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + size_t response_size = sizeof(struct iommu_hwpt_page_response);
> + struct hw_pgtable_fault *fault = filep->private_data;
> + struct iommu_hwpt_page_response response;
> + struct iommufd_hw_pagetable *hwpt;
> + struct iopf_group *iter, *group;
> + struct iommufd_device *idev;
> + size_t done = 0;
> + int rc = 0;
> +
> + if (*ppos || count % response_size)
> + return -ESPIPE;
> +
> + mutex_lock(&fault->mutex);
> + while (!list_empty(&fault->response) && count > done) {
> + rc = copy_from_user(&response, buf + done, response_size);
> + if (rc)
> + break;
> +
> + /* Get the device that this response targets at. */
> + idev = container_of(iommufd_get_object(fault->ictx,
> + response.dev_id,
> + IOMMUFD_OBJ_DEVICE),
> + struct iommufd_device, obj);
> + if (IS_ERR(idev)) {
> + rc = PTR_ERR(idev);
> + break;
> + }
> +
> + /*
> + * Get the hw page table that this response was generated for.
> + * It must match the one stored in the fault data.
> + */
> + hwpt = container_of(iommufd_get_object(fault->ictx,
> + response.hwpt_id,
> +
> IOMMUFD_OBJ_HW_PAGETABLE),
> + struct iommufd_hw_pagetable, obj);
> + if (IS_ERR(hwpt)) {
> + iommufd_put_object(&idev->obj);
> + rc = PTR_ERR(hwpt);
> + break;
> + }
> +
> + if (hwpt != fault->hwpt) {
> + rc = -EINVAL;
> + goto put_obj;
> + }
> +
> + group = NULL;
> + list_for_each_entry(iter, &fault->response, node) {
> + if (response.grpid != iter->last_fault.fault.prm.grpid)
> + continue;
> +
> + if (idev->dev != iter->dev)
> + continue;
> +
> + if ((iter->last_fault.fault.prm.flags &
> + IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) &&
> + response.pasid != iter->last_fault.fault.prm.pasid)
> + continue;

I am trying to get vSVA working with this series and got hit by the above check.
On ARM platforms, page responses to stall events (CMD_RESUME) do not have
an associated pasid. I think either we need to check here using
IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID, or remove the check,
as it will eventually be done in iommu_page_response().

Thanks,
Shameer
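
Shameer's first option would be a one-flag change in the matching loop
quoted above, keying on whether the response needs a PASID rather than
on whether the request carried one (sketch):

                if ((iter->last_fault.fault.prm.flags &
                     IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID) &&
                    response.pasid != iter->last_fault.fault.prm.pasid)
                        continue;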


2024-01-12 21:56:31

by Joel Granados

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> [...]
>
> The series and related patches are available on GitHub: [3]
>
> [1] https://lore.kernel.org/linux-iommu/[email protected]/
> [2] https://lore.kernel.org/linux-iommu/[email protected]/
> [3] https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2
I was working with this branch that included Yi Liu's
wip/iommufd_nesting branch. Now Yi Liu has updated his work in this post
https://lore.kernel.org/all/[email protected].
Is there an updated version of the page fault work that is rebased on
top of Liu's new version?

Thx in advance

Best

>
> [...]

--

Joel Granados



2024-01-14 13:13:36

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

On 2024/1/13 5:56, Joel Granados wrote:
> On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
>> [...]
>>
>> The series and related patches are available on GitHub: [3]
>>
>> [1]https://lore.kernel.org/linux-iommu/[email protected]/
>> [2]https://lore.kernel.org/linux-iommu/[email protected]/
>> [3]https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2
> I was working with this branch that included Yi Liu's
> wip/iommufd_nesting branch. Now Yi Liu has updated his work in this post
> https://lore.kernel.org/all/[email protected].
> Is there an updated version of the page fault work that is rebased on
> top of Liu's new version?

Yes. I am preparing the new version and will post it for discussion
after the merge window.

Best regards,
baolu

2024-01-14 17:19:22

by Joel Granados

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

On Sun, Jan 14, 2024 at 09:13:19PM +0800, Baolu Lu wrote:
> On 2024/1/13 5:56, Joel Granados wrote:
> > On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> >> [...]
> >>
> >> The series and related patches are available on GitHub: [3]
> >>
> >> [1]https://lore.kernel.org/linux-iommu/[email protected]/
> >> [2]https://lore.kernel.org/linux-iommu/[email protected]/
> >> [3]https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2
> > I was working with this branch that included Yi Liu's
> > wip/iommufd_nesting branch. Now Yi Liu has updated his work in this post
> > https://lore.kernel.org/all/[email protected].
> > Is there an updated version of the page fault work that is rebased on
> > top of Liu's new version?
>
> Yes. I am preparing the new version and will post it for discussion
> after the merge window.
Great to hear and thx for getting back to me.

I'll be on the lookout for your post. Would it be possible for you to
add me to the CC when you send it?

Best

--

Joel Granados



2024-01-15 01:31:02

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

On 1/15/24 1:18 AM, Joel Granados wrote:
> On Sun, Jan 14, 2024 at 09:13:19PM +0800, Baolu Lu wrote:
>> On 2024/1/13 5:56, Joel Granados wrote:
>>> On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
>>>> [...]
>>>>
>>>> The series and related patches are available on GitHub: [3]
>>>>
>>>> [1]https://lore.kernel.org/linux-iommu/[email protected]/
>>>> [2]https://lore.kernel.org/linux-iommu/[email protected]/
>>>> [3]https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2
>>> I was working with this branch that included Yi Liu's
>>> wip/iommufd_nesting branch. Now Yi Liu has updated his work in this post
>>> https://lore.kernel.org/all/[email protected].
>>> Is there an updated version of the page fault work that is rebased on
>>> top of Liu's new version?
>> Yes. I am preparing the new version and will post it for discussion
>> after the merge window.
> Great to hear and thx for getting back to me.
>
> I'll be on the lookout for your post. Would it be possible for you to
> add me to the CC when you send it?

Sure.

Best regards,
baolu

2024-01-15 16:47:43

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] iommufd: Deliver fault messages to user space

On Fri, Jan 12, 2024 at 05:46:13PM +0000, Shameerali Kolothum Thodi wrote:
>
> [...]
>
> Hi,
>
> > +static ssize_t hwpt_fault_fops_write(struct file *filep,
> > + const char __user *buf,
> > + size_t count, loff_t *ppos)
> > +{
> > + size_t response_size = sizeof(struct iommu_hwpt_page_response);
> > + struct hw_pgtable_fault *fault = filep->private_data;
> > + struct iommu_hwpt_page_response response;
> > + struct iommufd_hw_pagetable *hwpt;
> > + struct iopf_group *iter, *group;
> > + struct iommufd_device *idev;
> > + size_t done = 0;
> > + int rc = 0;
> > +
> > + if (*ppos || count % response_size)
> > + return -ESPIPE;
> > +
> > + mutex_lock(&fault->mutex);
> > + while (!list_empty(&fault->response) && count > done) {
> > + rc = copy_from_user(&response, buf + done, response_size);
> > + if (rc)
> > + break;
> > +
> > + /* Get the device that this response targets. */
> > + idev = container_of(iommufd_get_object(fault->ictx,
> > + response.dev_id,
> > + IOMMUFD_OBJ_DEVICE),
> > + struct iommufd_device, obj);
> > + if (IS_ERR(idev)) {
> > + rc = PTR_ERR(idev);
> > + break;
> > + }
> > +
> > + /*
> > + * Get the hw page table that this response was generated for.
> > + * It must match the one stored in the fault data.
> > + */
> > + hwpt = container_of(iommufd_get_object(fault->ictx,
> > + response.hwpt_id,
> > + IOMMUFD_OBJ_HW_PAGETABLE),
> > + struct iommufd_hw_pagetable, obj);
> > + if (IS_ERR(hwpt)) {
> > + iommufd_put_object(&idev->obj);
> > + rc = PTR_ERR(hwpt);
> > + break;
> > + }
> > +
> > + if (hwpt != fault->hwpt) {
> > + rc = -EINVAL;
> > + goto put_obj;
> > + }
> > +
> > + group = NULL;
> > + list_for_each_entry(iter, &fault->response, node) {
> > + if (response.grpid != iter->last_fault.fault.prm.grpid)
> > + continue;
> > +
> > + if (idev->dev != iter->dev)
> > + continue;
> > +
> > + if ((iter->last_fault.fault.prm.flags &
> > + IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) &&
> > + response.pasid != iter->last_fault.fault.prm.pasid)
> > + continue;
>
> I am trying to get vSVA working with this series and got hit by the above check.
> On ARM platforms, page responses to stall events (CMD_RESUME) do not have
> an associated pasid. I think either we need to check here using
> IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID, or remove the check, as it will
> eventually be done in iommu_page_response().

That doesn't sound right..

The PASID is the only information we have for userspace to identify
the domain that is being faulted. It cannot be optional on the request
side.

If it is valid when userspace does read() then it should be valid when
userspace does write() too.

It is the only way the kernel can actually match request and response
here.

So, I think you have a userspace issue to not provide the right
pasid??
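
For example, a minimal user-space loop can simply echo the identifiers
from each fault message back in its response. A sketch only: the
iommu_hwpt_pgfault read-side struct and the 'code' field/value are
assumptions here; iommu_hwpt_page_response and its dev_id/hwpt_id/grpid/
pasid fields are taken from the patch above.

	#include <string.h>
	#include <unistd.h>
	#include <linux/iommufd.h>	/* uapi header added by this series */

	static void serve_faults(int fault_fd)
	{
		struct iommu_hwpt_pgfault req;	/* assumed read-side record */

		while (read(fault_fd, &req, sizeof(req)) == sizeof(req)) {
			struct iommu_hwpt_page_response resp;

			/* ... resolve the fault described by req ... */

			memset(&resp, 0, sizeof(resp));
			resp.dev_id = req.dev_id;
			resp.hwpt_id = req.hwpt_id;
			resp.grpid = req.grpid;
			resp.pasid = req.pasid;		/* echo the PASID back */
			resp.code = 0;			/* assumed success code */
			if (write(fault_fd, &resp,
				  sizeof(resp)) != sizeof(resp))
				break;
		}
	}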

Jason

2024-01-15 17:44:13

by Shameerali Kolothum Thodi

[permalink] [raw]
Subject: RE: [PATCH v2 4/6] iommufd: Deliver fault messages to user space



> [...]
>
> On Fri, Jan 12, 2024 at 05:46:13PM +0000, Shameerali Kolothum Thodi wrote:
> >
> >
> > [...]
> >
> > Hi,
> >
> > > [...]
> >
> > I am trying to get vSVA working with this series and got hit by the
> > above check. On ARM platforms, page responses to stall events
> > (CMD_RESUME) do not have an associated pasid. I think either we need
> > to check here using IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID, or remove
> > the check, as it will eventually be done in iommu_page_response().
>
> That doesn't sound right..
>
> The PASID is the only information we have for userspace to identify
> the domain that is being faulted. It cannot be optional on the request
> side.
>

Yes, it is valid on the request side. But this is on the response side.

> If it is valid when userspace does read() then it should be valid when
> userspace does write() too.
>
> It is the only way the kernel can actually match request and response
> here.

The kernel currently checks the pasid only if IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID
is set.

https://lore.kernel.org/linux-iommu/[email protected]/
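
For reference, the matching logic that commit adds to iommu_page_response()
is roughly the following (a paraphrased sketch, not a verbatim quote of the
kernel source):

	list_for_each_entry(evt, &param->fault_param->faults, list) {
		prm = &evt->fault.prm;
		if (prm->grpid != msg->grpid)
			continue;

		/*
		 * If a PASID is required, match on the group ID plus the
		 * PASID; otherwise the group ID alone pairs request and
		 * response.
		 */
		needs_pasid = prm->flags &
			      IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID;
		if (needs_pasid &&
		    (!has_pasid || msg->pasid != prm->pasid))
			continue;

		pair = evt;
		break;
	}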

> So, I think you have a userspace issue to not provide the right
> pasid??

This is not just the ARM stall resume case; it applies to some PCI devices
as well, per the above commit log. So do we really need to track this in
userspace?

Thanks,
Shameer

2024-01-15 17:59:10

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] iommufd: Deliver fault messages to user space

On Mon, Jan 15, 2024 at 05:44:13PM +0000, Shameerali Kolothum Thodi wrote:

> > If it is valid when userspace does read() then it should be valid when
> > userspace does write() too.
> >
> > It is the only way the kernel can actually match request and response
> > here.
>
> The kernel currently checks the pasid only if IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID
> is set.
>
> https://lore.kernel.org/linux-iommu/[email protected]/
>
> > So, I think you have a userspace issue to not provide the right
> > pasid??
>
> This is not just the ARM stall resume case; it applies to some PCI devices
> as well, per the above commit log. So do we really need to track this in
> userspace?

Yes, these weird HW details should not leak into userspace.

The PASID is required on the read() side, so userspace should provide it
on the write() side too. It is trivial for userspace to do, and there is
no reason to accommodate anything else.

Alternatively I'm wondering if we should supply a serial number to
userspace so it can match the request/response instead of relying on
guessing based on pasid/grpid?
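
A sketch of how hwpt_fault_fops_write() could match on such a serial
number (purely hypothetical: no 'cookie' field exists in this series;
the kernel would stamp it into each fault message on read() and
userspace would echo it back):

	/* Match the response by its serial number alone. */
	group = NULL;
	list_for_each_entry(iter, &fault->response, node) {
		if (iter->cookie == response.cookie) {
			group = iter;
			break;
		}
	}
	if (!group) {
		rc = -ENOENT;
		goto put_obj;
	}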

Jason