2024-01-22 10:00:44

by Baolu Lu

[permalink] [raw]
Subject: [PATCH v3 0/8] IOMMUFD: Deliver IO page faults to user space

This series implements the functionality of delivering IO page faults to
user space through the IOMMUFD framework. One feasible use case is the
nested translation. Nested translation is a hardware feature that
supports two-stage translation tables for IOMMU. The second-stage
translation table is managed by the host VMM, while the first-stage
translation table is owned by user space. This allows user space to
control the IOMMU mappings for its devices.

When an IO page fault occurs on the first-stage translation table, the
IOMMU hardware can deliver the page fault to user space through the
IOMMUFD framework. User space can then handle the page fault and respond
to the device top-down through the IOMMUFD. This allows user space to
implement its own IO page fault handling policies.

User space application that is capable of handling IO page faults should
allocate a fault object, and bind the fault object to any domain that it
is willing to handle the fault generatd for them. On a successful return
of fault object allocation, the user can retrieve and respond to page
faults by reading or writing to the file descriptor (FD) returned.

The iommu selftest framework has been updated to test the IO page fault
delivery and response functionality.

This series is based on the page fault handling framework refactoring
in the IOMMU core [1].

The series and related patches are available on GitHub: [2]

[1] https://lore.kernel.org/linux-iommu/[email protected]/
[2] https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v3

Best regards,
baolu

Change log:

v3:
- Add iopf domain attach/detach/replace interfaces to manage the
reference counters of hwpt and device, ensuring that both can only be
destroyed after all outstanding IOPFs have been responded to.
- Relocate the fault handling file descriptor from hwpt to a fault
object to enable a single fault handling object to be utilized
across multiple domains.
- Miscellaneous cleanup and performance improvements.

v2: https://lore.kernel.org/linux-iommu/[email protected]/
- Move all iommu refactoring patches into a sparated series and discuss
it in a different thread. The latest patch series [v6] is available at
https://lore.kernel.org/linux-iommu/[email protected]/
- We discussed the timeout of the pending page fault messages. We
agreed that we shouldn't apply any timeout policy for the page fault
handling in user space.
https://lore.kernel.org/linux-iommu/20230616113232.GA84678@myrica/
- Jason suggested that we adopt a simple file descriptor interface for
reading and responding to I/O page requests, so that user space
applications can improve performance using io_uring.
https://lore.kernel.org/linux-iommu/[email protected]/

v1: https://lore.kernel.org/linux-iommu/[email protected]/

Lu Baolu (8):
iommu: Add iopf domain attach/detach/replace interface
iommu/sva: Use iopf domain attach/detach interface
iommufd: Add fault and response message definitions
iommufd: Add iommufd fault object
iommufd: Associate fault object with iommufd_hw_pgtable
iommufd: IOPF-capable hw page table attach/detach/replace
iommufd/selftest: Add IOPF support for mock device
iommufd/selftest: Add coverage for IOPF test

include/linux/iommu.h | 40 +-
drivers/iommu/iommufd/iommufd_private.h | 41 ++
drivers/iommu/iommufd/iommufd_test.h | 8 +
include/uapi/linux/iommufd.h | 91 ++++
tools/testing/selftests/iommu/iommufd_utils.h | 83 +++-
drivers/iommu/io-pgfault.c | 215 ++++++++--
drivers/iommu/iommu-sva.c | 48 ++-
drivers/iommu/iommufd/device.c | 16 +-
drivers/iommu/iommufd/fault.c | 391 ++++++++++++++++++
drivers/iommu/iommufd/hw_pagetable.c | 36 +-
drivers/iommu/iommufd/main.c | 6 +
drivers/iommu/iommufd/selftest.c | 63 +++
tools/testing/selftests/iommu/iommufd.c | 17 +
.../selftests/iommu/iommufd_fail_nth.c | 2 +-
drivers/iommu/iommufd/Makefile | 1 +
15 files changed, 1001 insertions(+), 57 deletions(-)
create mode 100644 drivers/iommu/iommufd/fault.c

--
2.34.1



2024-01-22 10:26:16

by Baolu Lu

[permalink] [raw]
Subject: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

When allocating a user iommufd_hw_pagetable, the user space is allowed to
associate a fault object with the hw_pagetable by specifying the fault
object ID in the page table allocation data and setting the
IOMMU_HWPT_FAULT_ID_VALID flag bit.

On a successful return of hwpt allocation, the user can retrieve and
respond to page faults by reading and writing the file interface of the
fault object.

Once a fault object has been associated with a hwpt, the hwpt is
iopf-capable, indicated by fault_capable set to true. Attaching,
detaching, or replacing an iopf-capable hwpt to an RID or PASID will
differ from those that are not iopf-capable. The implementation of these
will be introduced in the next patch.

Signed-off-by: Lu Baolu <[email protected]>
---
drivers/iommu/iommufd/iommufd_private.h | 11 ++++++++
include/uapi/linux/iommufd.h | 6 +++++
drivers/iommu/iommufd/fault.c | 14 ++++++++++
drivers/iommu/iommufd/hw_pagetable.c | 36 +++++++++++++++++++------
4 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 52d83e888bd0..2780bed0c6b1 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -293,6 +293,8 @@ int iommufd_check_iova_range(struct io_pagetable *iopt,
struct iommufd_hw_pagetable {
struct iommufd_object obj;
struct iommu_domain *domain;
+ struct iommufd_fault *fault;
+ bool fault_capable : 1;
};

struct iommufd_hwpt_paging {
@@ -446,8 +448,17 @@ struct iommufd_fault {
struct wait_queue_head wait_queue;
};

+static inline struct iommufd_fault *
+iommufd_get_fault(struct iommufd_ucmd *ucmd, u32 id)
+{
+ return container_of(iommufd_get_object(ucmd->ictx, id,
+ IOMMUFD_OBJ_FAULT),
+ struct iommufd_fault, obj);
+}
+
int iommufd_fault_alloc(struct iommufd_ucmd *ucmd);
void iommufd_fault_destroy(struct iommufd_object *obj);
+int iommufd_fault_iopf_handler(struct iopf_group *group);

#ifdef CONFIG_IOMMUFD_TEST
int iommufd_test(struct iommufd_ucmd *ucmd);
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index c32d62b02306..7481cdd57027 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -357,10 +357,13 @@ struct iommu_vfio_ioas {
* the parent HWPT in a nesting configuration.
* @IOMMU_HWPT_ALLOC_DIRTY_TRACKING: Dirty tracking support for device IOMMU is
* enforced on device attachment
+ * @IOMMU_HWPT_FAULT_ID_VALID: The fault_id field of hwpt allocation data is
+ * valid.
*/
enum iommufd_hwpt_alloc_flags {
IOMMU_HWPT_ALLOC_NEST_PARENT = 1 << 0,
IOMMU_HWPT_ALLOC_DIRTY_TRACKING = 1 << 1,
+ IOMMU_HWPT_FAULT_ID_VALID = 1 << 2,
};

/**
@@ -411,6 +414,8 @@ enum iommu_hwpt_data_type {
* @__reserved: Must be 0
* @data_type: One of enum iommu_hwpt_data_type
* @data_len: Length of the type specific data
+ * @fault_id: The ID of IOMMUFD_FAULT object. Valid only if flags field of
+ * IOMMU_HWPT_FAULT_ID_VALID is set.
* @data_uptr: User pointer to the type specific data
*
* Explicitly allocate a hardware page table object. This is the same object
@@ -441,6 +446,7 @@ struct iommu_hwpt_alloc {
__u32 __reserved;
__u32 data_type;
__u32 data_len;
+ __u32 fault_id;
__aligned_u64 data_uptr;
};
#define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)
diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
index 9844a85feeb2..e752d1c49dde 100644
--- a/drivers/iommu/iommufd/fault.c
+++ b/drivers/iommu/iommufd/fault.c
@@ -253,3 +253,17 @@ int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)

return rc;
}
+
+int iommufd_fault_iopf_handler(struct iopf_group *group)
+{
+ struct iommufd_hw_pagetable *hwpt = group->cookie->domain->fault_data;
+ struct iommufd_fault *fault = hwpt->fault;
+
+ mutex_lock(&fault->mutex);
+ list_add_tail(&group->node, &fault->deliver);
+ mutex_unlock(&fault->mutex);
+
+ wake_up_interruptible(&fault->wait_queue);
+
+ return 0;
+}
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 3f3f1fa1a0a9..2703d5aea4f5 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -8,6 +8,15 @@
#include "../iommu-priv.h"
#include "iommufd_private.h"

+static void __iommufd_hwpt_destroy(struct iommufd_hw_pagetable *hwpt)
+{
+ if (hwpt->domain)
+ iommu_domain_free(hwpt->domain);
+
+ if (hwpt->fault)
+ iommufd_put_object(hwpt->fault->ictx, &hwpt->fault->obj);
+}
+
void iommufd_hwpt_paging_destroy(struct iommufd_object *obj)
{
struct iommufd_hwpt_paging *hwpt_paging =
@@ -22,9 +31,7 @@ void iommufd_hwpt_paging_destroy(struct iommufd_object *obj)
hwpt_paging->common.domain);
}

- if (hwpt_paging->common.domain)
- iommu_domain_free(hwpt_paging->common.domain);
-
+ __iommufd_hwpt_destroy(&hwpt_paging->common);
refcount_dec(&hwpt_paging->ioas->obj.users);
}

@@ -49,9 +56,7 @@ void iommufd_hwpt_nested_destroy(struct iommufd_object *obj)
struct iommufd_hwpt_nested *hwpt_nested =
container_of(obj, struct iommufd_hwpt_nested, common.obj);

- if (hwpt_nested->common.domain)
- iommu_domain_free(hwpt_nested->common.domain);
-
+ __iommufd_hwpt_destroy(&hwpt_nested->common);
refcount_dec(&hwpt_nested->parent->common.obj.users);
}

@@ -213,7 +218,8 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
struct iommufd_hw_pagetable *hwpt;
int rc;

- if (flags || !user_data->len || !ops->domain_alloc_user)
+ if ((flags & ~IOMMU_HWPT_FAULT_ID_VALID) ||
+ !user_data->len || !ops->domain_alloc_user)
return ERR_PTR(-EOPNOTSUPP);
if (parent->auto_domain || !parent->nest_parent)
return ERR_PTR(-EINVAL);
@@ -227,7 +233,7 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
refcount_inc(&parent->common.obj.users);
hwpt_nested->parent = parent;

- hwpt->domain = ops->domain_alloc_user(idev->dev, flags,
+ hwpt->domain = ops->domain_alloc_user(idev->dev, 0,
parent->common.domain, user_data);
if (IS_ERR(hwpt->domain)) {
rc = PTR_ERR(hwpt->domain);
@@ -307,6 +313,20 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
goto out_put_pt;
}

+ if (cmd->flags & IOMMU_HWPT_FAULT_ID_VALID) {
+ struct iommufd_fault *fault;
+
+ fault = iommufd_get_fault(ucmd, cmd->fault_id);
+ if (IS_ERR(fault)) {
+ rc = PTR_ERR(fault);
+ goto out_hwpt;
+ }
+ hwpt->fault = fault;
+ hwpt->domain->iopf_handler = iommufd_fault_iopf_handler;
+ hwpt->domain->fault_data = hwpt;
+ hwpt->fault_capable = true;
+ }
+
cmd->out_hwpt_id = hwpt->obj.id;
rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
if (rc)
--
2.34.1


2024-01-22 12:41:38

by Baolu Lu

[permalink] [raw]
Subject: [PATCH v3 4/8] iommufd: Add iommufd fault object

An iommufd fault object provides an interface for delivering I/O page
faults to user space. These objects are created and destroyed by user
space, and they can be associated with or dissociated from hardware page
table objects during page table allocation or destruction.

User space interacts with the fault object through a file interface. This
interface offers a straightforward and efficient way for user space to
handle page faults. It allows user space to read fault messages
sequentially and respond to them by writing to the same file. The file
interface supports reading messages in poll mode, so it's recommended that
user space applications use io_uring to enhance read and write efficiency.

A fault object can be associated with any iopf-capable iommufd_hw_pgtable
during the pgtable's allocation. All I/O page faults triggered by devices
when accessing the I/O addresses of an iommufd_hw_pgtable are routed
through the fault object to user space. Similarly, user space's responses
to these page faults are routed back to the iommu device driver through
the same fault object.

Signed-off-by: Lu Baolu <[email protected]>
---
include/linux/iommu.h | 2 +
drivers/iommu/iommufd/iommufd_private.h | 23 +++
include/uapi/linux/iommufd.h | 18 ++
drivers/iommu/iommufd/device.c | 1 +
drivers/iommu/iommufd/fault.c | 255 ++++++++++++++++++++++++
drivers/iommu/iommufd/main.c | 6 +
drivers/iommu/iommufd/Makefile | 1 +
7 files changed, 306 insertions(+)
create mode 100644 drivers/iommu/iommufd/fault.c

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 511dc7b4bdb2..4372648ac22e 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -145,6 +145,8 @@ struct iopf_group {
/* The device's fault data parameter. */
struct iommu_fault_param *fault_param;
struct iopf_attach_cookie *cookie;
+ /* Used by handler provider to hook the group on its own lists. */
+ struct list_head node;
};

/**
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 991f864d1f9b..52d83e888bd0 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -128,6 +128,7 @@ enum iommufd_object_type {
IOMMUFD_OBJ_HWPT_NESTED,
IOMMUFD_OBJ_IOAS,
IOMMUFD_OBJ_ACCESS,
+ IOMMUFD_OBJ_FAULT,
#ifdef CONFIG_IOMMUFD_TEST
IOMMUFD_OBJ_SELFTEST,
#endif
@@ -395,6 +396,8 @@ struct iommufd_device {
/* always the physical device */
struct device *dev;
bool enforce_cache_coherency;
+ /* outstanding faults awaiting response indexed by fault group id */
+ struct xarray faults;
};

static inline struct iommufd_device *
@@ -426,6 +429,26 @@ void iopt_remove_access(struct io_pagetable *iopt,
u32 iopt_access_list_id);
void iommufd_access_destroy_object(struct iommufd_object *obj);

+/*
+ * An iommufd_fault object represents an interface to deliver I/O page faults
+ * to the user space. These objects are created/destroyed by the user space and
+ * associated with hardware page table objects during page-table allocation.
+ */
+struct iommufd_fault {
+ struct iommufd_object obj;
+ struct iommufd_ctx *ictx;
+
+ /* The lists of outstanding faults protected by below mutex. */
+ struct mutex mutex;
+ struct list_head deliver;
+ struct list_head response;
+
+ struct wait_queue_head wait_queue;
+};
+
+int iommufd_fault_alloc(struct iommufd_ucmd *ucmd);
+void iommufd_fault_destroy(struct iommufd_object *obj);
+
#ifdef CONFIG_IOMMUFD_TEST
int iommufd_test(struct iommufd_ucmd *ucmd);
void iommufd_selftest_destroy(struct iommufd_object *obj);
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index d59e839ae49e..c32d62b02306 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -50,6 +50,7 @@ enum {
IOMMUFD_CMD_HWPT_SET_DIRTY_TRACKING,
IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP,
IOMMUFD_CMD_HWPT_INVALIDATE,
+ IOMMUFD_CMD_FAULT_ALLOC,
};

/**
@@ -759,4 +760,21 @@ struct iommu_hwpt_page_response {
__u32 code;
__u64 addr;
};
+
+/**
+ * struct iommu_fault_alloc - ioctl(IOMMU_FAULT_ALLOC)
+ * @size: sizeof(struct iommu_fault_alloc)
+ * @flags: Must be 0
+ * @out_fault_id: The ID of the new FAULT
+ * @out_fault_fd: The fd of the new FAULT
+ *
+ * Explicitly allocate a fault handling object.
+ */
+struct iommu_fault_alloc {
+ __u32 size;
+ __u32 flags;
+ __u32 out_fault_id;
+ __u32 out_fault_fd;
+};
+#define IOMMU_FAULT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_FAULT_ALLOC)
#endif
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 873630c111c1..d70913ee8fdf 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -215,6 +215,7 @@ struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
refcount_inc(&idev->obj.users);
/* igroup refcount moves into iommufd_device */
idev->igroup = igroup;
+ xa_init(&idev->faults);

/*
* If the caller fails after this success it must call
diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
new file mode 100644
index 000000000000..9844a85feeb2
--- /dev/null
+++ b/drivers/iommu/iommufd/fault.c
@@ -0,0 +1,255 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2024 Intel Corporation
+ */
+#define pr_fmt(fmt) "iommufd: " fmt
+
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/iommufd.h>
+#include <linux/poll.h>
+#include <linux/anon_inodes.h>
+#include <uapi/linux/iommufd.h>
+
+#include "iommufd_private.h"
+
+static int device_add_fault(struct iopf_group *group)
+{
+ struct iommufd_device *idev = group->cookie->private;
+ void *curr;
+
+ curr = xa_cmpxchg(&idev->faults, group->last_fault.fault.prm.grpid,
+ NULL, group, GFP_KERNEL);
+
+ return curr ? xa_err(curr) : 0;
+}
+
+static void device_remove_fault(struct iopf_group *group)
+{
+ struct iommufd_device *idev = group->cookie->private;
+
+ xa_store(&idev->faults, group->last_fault.fault.prm.grpid,
+ NULL, GFP_KERNEL);
+}
+
+static struct iopf_group *device_get_fault(struct iommufd_device *idev,
+ unsigned long grpid)
+{
+ return xa_load(&idev->faults, grpid);
+}
+
+void iommufd_fault_destroy(struct iommufd_object *obj)
+{
+ struct iommufd_fault *fault = container_of(obj, struct iommufd_fault, obj);
+ struct iopf_group *group, *next;
+
+ mutex_lock(&fault->mutex);
+ list_for_each_entry_safe(group, next, &fault->deliver, node) {
+ list_del(&group->node);
+ iopf_group_response(group, IOMMU_PAGE_RESP_INVALID);
+ iopf_free_group(group);
+ }
+ list_for_each_entry_safe(group, next, &fault->response, node) {
+ list_del(&group->node);
+ device_remove_fault(group);
+ iopf_group_response(group, IOMMU_PAGE_RESP_INVALID);
+ iopf_free_group(group);
+ }
+ mutex_unlock(&fault->mutex);
+
+ mutex_destroy(&fault->mutex);
+}
+
+static void iommufd_compose_fault_message(struct iommu_fault *fault,
+ struct iommu_hwpt_pgfault *hwpt_fault,
+ struct iommufd_device *idev)
+{
+ hwpt_fault->size = sizeof(*hwpt_fault);
+ hwpt_fault->flags = fault->prm.flags;
+ hwpt_fault->dev_id = idev->obj.id;
+ hwpt_fault->pasid = fault->prm.pasid;
+ hwpt_fault->grpid = fault->prm.grpid;
+ hwpt_fault->perm = fault->prm.perm;
+ hwpt_fault->addr = fault->prm.addr;
+}
+
+static ssize_t iommufd_fault_fops_read(struct file *filep, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ size_t fault_size = sizeof(struct iommu_hwpt_pgfault);
+ struct iommufd_fault *fault = filep->private_data;
+ struct iommu_hwpt_pgfault data;
+ struct iommufd_device *idev;
+ struct iopf_group *group;
+ struct iopf_fault *iopf;
+ size_t done = 0;
+ int rc;
+
+ if (*ppos || count % fault_size)
+ return -ESPIPE;
+
+ mutex_lock(&fault->mutex);
+ while (!list_empty(&fault->deliver) && count > done) {
+ group = list_first_entry(&fault->deliver,
+ struct iopf_group, node);
+
+ if (list_count_nodes(&group->faults) * fault_size > count - done)
+ break;
+
+ idev = (struct iommufd_device *)group->cookie->private;
+ list_for_each_entry(iopf, &group->faults, list) {
+ iommufd_compose_fault_message(&iopf->fault, &data, idev);
+ rc = copy_to_user(buf + done, &data, fault_size);
+ if (rc)
+ goto err_unlock;
+ done += fault_size;
+ }
+
+ rc = device_add_fault(group);
+ if (rc)
+ goto err_unlock;
+
+ list_move_tail(&group->node, &fault->response);
+ }
+ mutex_unlock(&fault->mutex);
+
+ return done;
+err_unlock:
+ mutex_unlock(&fault->mutex);
+ return rc;
+}
+
+static ssize_t iommufd_fault_fops_write(struct file *filep, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ size_t response_size = sizeof(struct iommu_hwpt_page_response);
+ struct iommufd_fault *fault = filep->private_data;
+ struct iommu_hwpt_page_response response;
+ struct iommufd_device *idev;
+ struct iopf_group *group;
+ size_t done = 0;
+ int rc;
+
+ if (*ppos || count % response_size)
+ return -ESPIPE;
+
+ while (!list_empty(&fault->response) && count > done) {
+ rc = copy_from_user(&response, buf + done, response_size);
+ if (rc)
+ break;
+
+ idev = container_of(iommufd_get_object(fault->ictx,
+ response.dev_id,
+ IOMMUFD_OBJ_DEVICE),
+ struct iommufd_device, obj);
+ if (IS_ERR(idev))
+ break;
+
+ group = device_get_fault(idev, response.grpid);
+ if (!group ||
+ response.addr != group->last_fault.fault.prm.addr ||
+ ((group->last_fault.fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) &&
+ response.pasid != group->last_fault.fault.prm.pasid)) {
+ iommufd_put_object(fault->ictx, &idev->obj);
+ break;
+ }
+
+ iopf_group_response(group, response.code);
+
+ mutex_lock(&fault->mutex);
+ list_del(&group->node);
+ mutex_unlock(&fault->mutex);
+
+ device_remove_fault(group);
+ iopf_free_group(group);
+ done += response_size;
+
+ iommufd_put_object(fault->ictx, &idev->obj);
+ }
+
+ return done;
+}
+
+static __poll_t iommufd_fault_fops_poll(struct file *filep,
+ struct poll_table_struct *wait)
+{
+ struct iommufd_fault *fault = filep->private_data;
+ __poll_t pollflags = 0;
+
+ poll_wait(filep, &fault->wait_queue, wait);
+ mutex_lock(&fault->mutex);
+ if (!list_empty(&fault->deliver))
+ pollflags = EPOLLIN | EPOLLRDNORM;
+ mutex_unlock(&fault->mutex);
+
+ return pollflags;
+}
+
+static const struct file_operations iommufd_fault_fops = {
+ .owner = THIS_MODULE,
+ .open = nonseekable_open,
+ .read = iommufd_fault_fops_read,
+ .write = iommufd_fault_fops_write,
+ .poll = iommufd_fault_fops_poll,
+ .llseek = no_llseek,
+};
+
+static int get_fault_fd(struct iommufd_fault *fault)
+{
+ struct file *filep;
+ int fdno;
+
+ fdno = get_unused_fd_flags(O_CLOEXEC);
+ if (fdno < 0)
+ return fdno;
+
+ filep = anon_inode_getfile("[iommufd-pgfault]", &iommufd_fault_fops,
+ fault, O_RDWR);
+ if (IS_ERR(filep)) {
+ put_unused_fd(fdno);
+ return PTR_ERR(filep);
+ }
+
+ fd_install(fdno, filep);
+
+ return fdno;
+}
+
+int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
+{
+ struct iommu_fault_alloc *cmd = ucmd->cmd;
+ struct iommufd_fault *fault;
+ int rc;
+
+ if (cmd->flags)
+ return -EOPNOTSUPP;
+
+ fault = iommufd_object_alloc(ucmd->ictx, fault, IOMMUFD_OBJ_FAULT);
+ if (IS_ERR(fault))
+ return PTR_ERR(fault);
+
+ fault->ictx = ucmd->ictx;
+ INIT_LIST_HEAD(&fault->deliver);
+ INIT_LIST_HEAD(&fault->response);
+ mutex_init(&fault->mutex);
+ init_waitqueue_head(&fault->wait_queue);
+
+ rc = get_fault_fd(fault);
+ if (rc < 0)
+ goto out_abort;
+
+ cmd->out_fault_id = fault->obj.id;
+ cmd->out_fault_fd = rc;
+
+ rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+ if (rc)
+ goto out_abort;
+ iommufd_object_finalize(ucmd->ictx, &fault->obj);
+
+ return 0;
+out_abort:
+ iommufd_object_abort_and_destroy(ucmd->ictx, &fault->obj);
+
+ return rc;
+}
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 39b32932c61e..792db077d33e 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -332,6 +332,7 @@ union ucmd_buffer {
struct iommu_ioas_unmap unmap;
struct iommu_option option;
struct iommu_vfio_ioas vfio_ioas;
+ struct iommu_fault_alloc fault;
#ifdef CONFIG_IOMMUFD_TEST
struct iommu_test_cmd test;
#endif
@@ -381,6 +382,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
val64),
IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
__reserved),
+ IOCTL_OP(IOMMU_FAULT_ALLOC, iommufd_fault_alloc, struct iommu_fault_alloc,
+ out_fault_fd),
#ifdef CONFIG_IOMMUFD_TEST
IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
#endif
@@ -513,6 +516,9 @@ static const struct iommufd_object_ops iommufd_object_ops[] = {
.destroy = iommufd_hwpt_nested_destroy,
.abort = iommufd_hwpt_nested_abort,
},
+ [IOMMUFD_OBJ_FAULT] = {
+ .destroy = iommufd_fault_destroy,
+ },
#ifdef CONFIG_IOMMUFD_TEST
[IOMMUFD_OBJ_SELFTEST] = {
.destroy = iommufd_selftest_destroy,
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 34b446146961..b94a74366eed 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -6,6 +6,7 @@ iommufd-y := \
ioas.o \
main.o \
pages.o \
+ fault.o \
vfio_compat.o

iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o
--
2.34.1


2024-02-07 08:15:02

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

> From: Lu Baolu <[email protected]>
> Sent: Monday, January 22, 2024 3:39 PM
>
> +
> +int iommufd_fault_iopf_handler(struct iopf_group *group)
> +{
> + struct iommufd_hw_pagetable *hwpt = group->cookie->domain-
> >fault_data;
> + struct iommufd_fault *fault = hwpt->fault;
> +

why not directly using iommufd_fault as the fault_data?

2024-02-21 06:06:34

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

On 2024/2/7 16:14, Tian, Kevin wrote:
>> From: Lu Baolu <[email protected]>
>> Sent: Monday, January 22, 2024 3:39 PM
>>
>> +
>> +int iommufd_fault_iopf_handler(struct iopf_group *group)
>> +{
>> + struct iommufd_hw_pagetable *hwpt = group->cookie->domain-
>>> fault_data;
>> + struct iommufd_fault *fault = hwpt->fault;
>> +
>
> why not directly using iommufd_fault as the fault_data?

The relationship among these structures is:

iommufd_hwpt -> iommu_domain
^
|
v
iommufd_fault

It appears preferable to hook the hwpt instead of iommufd_fault to an
iommu_domain structure.

Best regards,
baolu


2024-03-02 02:36:42

by zhangfei

[permalink] [raw]
Subject: Re: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

On Mon, 22 Jan 2024 at 15:46, Lu Baolu <[email protected]> wrote:
>
> When allocating a user iommufd_hw_pagetable, the user space is allowed to
> associate a fault object with the hw_pagetable by specifying the fault
> object ID in the page table allocation data and setting the
> IOMMU_HWPT_FAULT_ID_VALID flag bit.
>
> On a successful return of hwpt allocation, the user can retrieve and
> respond to page faults by reading and writing the file interface of the
> fault object.
>
> Once a fault object has been associated with a hwpt, the hwpt is
> iopf-capable, indicated by fault_capable set to true. Attaching,
> detaching, or replacing an iopf-capable hwpt to an RID or PASID will
> differ from those that are not iopf-capable. The implementation of these
> will be introduced in the next patch.
>
> Signed-off-by: Lu Baolu <[email protected]>
> ---
> drivers/iommu/iommufd/iommufd_private.h | 11 ++++++++
> include/uapi/linux/iommufd.h | 6 +++++
> drivers/iommu/iommufd/fault.c | 14 ++++++++++
> drivers/iommu/iommufd/hw_pagetable.c | 36 +++++++++++++++++++------
> 4 files changed, 59 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index 52d83e888bd0..2780bed0c6b1 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -293,6 +293,8 @@ int iommufd_check_iova_range(struct io_pagetable *iopt,
> struct iommufd_hw_pagetable {
> struct iommufd_object obj;
> struct iommu_domain *domain;
> + struct iommufd_fault *fault;
> + bool fault_capable : 1;
> };
>
> struct iommufd_hwpt_paging {
> @@ -446,8 +448,17 @@ struct iommufd_fault {
> struct wait_queue_head wait_queue;
> };
>
> +static inline struct iommufd_fault *
> +iommufd_get_fault(struct iommufd_ucmd *ucmd, u32 id)
> +{
> + return container_of(iommufd_get_object(ucmd->ictx, id,
> + IOMMUFD_OBJ_FAULT),
> + struct iommufd_fault, obj);
> +}
> +
> int iommufd_fault_alloc(struct iommufd_ucmd *ucmd);
> void iommufd_fault_destroy(struct iommufd_object *obj);
> +int iommufd_fault_iopf_handler(struct iopf_group *group);
>
> #ifdef CONFIG_IOMMUFD_TEST
> int iommufd_test(struct iommufd_ucmd *ucmd);
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index c32d62b02306..7481cdd57027 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -357,10 +357,13 @@ struct iommu_vfio_ioas {
> * the parent HWPT in a nesting configuration.
> * @IOMMU_HWPT_ALLOC_DIRTY_TRACKING: Dirty tracking support for device IOMMU is
> * enforced on device attachment
> + * @IOMMU_HWPT_FAULT_ID_VALID: The fault_id field of hwpt allocation data is
> + * valid.
> */
> enum iommufd_hwpt_alloc_flags {
> IOMMU_HWPT_ALLOC_NEST_PARENT = 1 << 0,
> IOMMU_HWPT_ALLOC_DIRTY_TRACKING = 1 << 1,
> + IOMMU_HWPT_FAULT_ID_VALID = 1 << 2,
> };
>
> /**
> @@ -411,6 +414,8 @@ enum iommu_hwpt_data_type {
> * @__reserved: Must be 0
> * @data_type: One of enum iommu_hwpt_data_type
> * @data_len: Length of the type specific data
> + * @fault_id: The ID of IOMMUFD_FAULT object. Valid only if flags field of
> + * IOMMU_HWPT_FAULT_ID_VALID is set.
> * @data_uptr: User pointer to the type specific data
> *
> * Explicitly allocate a hardware page table object. This is the same object
> @@ -441,6 +446,7 @@ struct iommu_hwpt_alloc {
> __u32 __reserved;
> __u32 data_type;
> __u32 data_len;
> + __u32 fault_id;
> __aligned_u64 data_uptr;
> };
> #define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)
> diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
> index 9844a85feeb2..e752d1c49dde 100644
> --- a/drivers/iommu/iommufd/fault.c
> +++ b/drivers/iommu/iommufd/fault.c
> @@ -253,3 +253,17 @@ int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
>
> return rc;
> }
> +
> +int iommufd_fault_iopf_handler(struct iopf_group *group)
> +{
> + struct iommufd_hw_pagetable *hwpt = group->cookie->domain->fault_data;
> + struct iommufd_fault *fault = hwpt->fault;
> +
> + mutex_lock(&fault->mutex);
> + list_add_tail(&group->node, &fault->deliver);
> + mutex_unlock(&fault->mutex);
> +
> + wake_up_interruptible(&fault->wait_queue);
> +
> + return 0;
> +}
> diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
> index 3f3f1fa1a0a9..2703d5aea4f5 100644
> --- a/drivers/iommu/iommufd/hw_pagetable.c
> +++ b/drivers/iommu/iommufd/hw_pagetable.c
> @@ -8,6 +8,15 @@
> #include "../iommu-priv.h"
> #include "iommufd_private.h"
>
> +static void __iommufd_hwpt_destroy(struct iommufd_hw_pagetable *hwpt)
> +{
> + if (hwpt->domain)
> + iommu_domain_free(hwpt->domain);
> +
> + if (hwpt->fault)
> + iommufd_put_object(hwpt->fault->ictx, &hwpt->fault->obj);
> +}
> +
> void iommufd_hwpt_paging_destroy(struct iommufd_object *obj)
> {
> struct iommufd_hwpt_paging *hwpt_paging =
> @@ -22,9 +31,7 @@ void iommufd_hwpt_paging_destroy(struct iommufd_object *obj)
> hwpt_paging->common.domain);
> }
>
> - if (hwpt_paging->common.domain)
> - iommu_domain_free(hwpt_paging->common.domain);
> -
> + __iommufd_hwpt_destroy(&hwpt_paging->common);
> refcount_dec(&hwpt_paging->ioas->obj.users);
> }
>
> @@ -49,9 +56,7 @@ void iommufd_hwpt_nested_destroy(struct iommufd_object *obj)
> struct iommufd_hwpt_nested *hwpt_nested =
> container_of(obj, struct iommufd_hwpt_nested, common.obj);
>
> - if (hwpt_nested->common.domain)
> - iommu_domain_free(hwpt_nested->common.domain);
> -
> + __iommufd_hwpt_destroy(&hwpt_nested->common);
> refcount_dec(&hwpt_nested->parent->common.obj.users);
> }
>
> @@ -213,7 +218,8 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
> struct iommufd_hw_pagetable *hwpt;
> int rc;
>
> - if (flags || !user_data->len || !ops->domain_alloc_user)
> + if ((flags & ~IOMMU_HWPT_FAULT_ID_VALID) ||
> + !user_data->len || !ops->domain_alloc_user)
> return ERR_PTR(-EOPNOTSUPP);
> if (parent->auto_domain || !parent->nest_parent)
> return ERR_PTR(-EINVAL);
> @@ -227,7 +233,7 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
> refcount_inc(&parent->common.obj.users);
> hwpt_nested->parent = parent;
>
> - hwpt->domain = ops->domain_alloc_user(idev->dev, flags,
> + hwpt->domain = ops->domain_alloc_user(idev->dev, 0,
> parent->common.domain, user_data);

Why remove flags? typo or any reason?
arm_smmu_domain_alloc_user can not get flags from the user app.
User should set IOMMU_HWPT_FAULT_ID_VALID, right?

Thanks

2024-03-06 15:16:15

by zhangfei

[permalink] [raw]
Subject: Re: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

Hi, Baolu

On Sat, 2 Mar 2024 at 10:36, Zhangfei Gao <[email protected]> wrote:
>
> On Mon, 22 Jan 2024 at 15:46, Lu Baolu <[email protected]> wrote:
> >
> > When allocating a user iommufd_hw_pagetable, the user space is allowed to
> > associate a fault object with the hw_pagetable by specifying the fault
> > object ID in the page table allocation data and setting the
> > IOMMU_HWPT_FAULT_ID_VALID flag bit.
> >
> > On a successful return of hwpt allocation, the user can retrieve and
> > respond to page faults by reading and writing the file interface of the
> > fault object.
> >
> > Once a fault object has been associated with a hwpt, the hwpt is
> > iopf-capable, indicated by fault_capable set to true. Attaching,
> > detaching, or replacing an iopf-capable hwpt to an RID or PASID will
> > differ from those that are not iopf-capable. The implementation of these
> > will be introduced in the next patch.
> >
> > Signed-off-by: Lu Baolu <[email protected]>
> > ---
> > drivers/iommu/iommufd/iommufd_private.h | 11 ++++++++
> > include/uapi/linux/iommufd.h | 6 +++++
> > drivers/iommu/iommufd/fault.c | 14 ++++++++++
> > drivers/iommu/iommufd/hw_pagetable.c | 36 +++++++++++++++++++------
> > 4 files changed, 59 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> > index 52d83e888bd0..2780bed0c6b1 100644
> > --- a/drivers/iommu/iommufd/iommufd_private.h
> > +++ b/drivers/iommu/iommufd/iommufd_private.h
> > @@ -293,6 +293,8 @@ int iommufd_check_iova_range(struct io_pagetable *iopt,
> > struct iommufd_hw_pagetable {
> > struct iommufd_object obj;
> > struct iommu_domain *domain;
> > + struct iommufd_fault *fault;
> > + bool fault_capable : 1;
> > };
> >
> > struct iommufd_hwpt_paging {
> > @@ -446,8 +448,17 @@ struct iommufd_fault {
> > struct wait_queue_head wait_queue;
> > };
> >
> > +static inline struct iommufd_fault *
> > +iommufd_get_fault(struct iommufd_ucmd *ucmd, u32 id)
> > +{
> > + return container_of(iommufd_get_object(ucmd->ictx, id,
> > + IOMMUFD_OBJ_FAULT),
> > + struct iommufd_fault, obj);
> > +}
> > +
> > int iommufd_fault_alloc(struct iommufd_ucmd *ucmd);
> > void iommufd_fault_destroy(struct iommufd_object *obj);
> > +int iommufd_fault_iopf_handler(struct iopf_group *group);
> >
> > #ifdef CONFIG_IOMMUFD_TEST
> > int iommufd_test(struct iommufd_ucmd *ucmd);
> > diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> > index c32d62b02306..7481cdd57027 100644
> > --- a/include/uapi/linux/iommufd.h
> > +++ b/include/uapi/linux/iommufd.h
> > @@ -357,10 +357,13 @@ struct iommu_vfio_ioas {
> > * the parent HWPT in a nesting configuration.
> > * @IOMMU_HWPT_ALLOC_DIRTY_TRACKING: Dirty tracking support for device IOMMU is
> > * enforced on device attachment
> > + * @IOMMU_HWPT_FAULT_ID_VALID: The fault_id field of hwpt allocation data is
> > + * valid.
> > */
> > enum iommufd_hwpt_alloc_flags {
> > IOMMU_HWPT_ALLOC_NEST_PARENT = 1 << 0,
> > IOMMU_HWPT_ALLOC_DIRTY_TRACKING = 1 << 1,
> > + IOMMU_HWPT_FAULT_ID_VALID = 1 << 2,
> > };
> >
> > /**
> > @@ -411,6 +414,8 @@ enum iommu_hwpt_data_type {
> > * @__reserved: Must be 0
> > * @data_type: One of enum iommu_hwpt_data_type
> > * @data_len: Length of the type specific data
> > + * @fault_id: The ID of IOMMUFD_FAULT object. Valid only if flags field of
> > + * IOMMU_HWPT_FAULT_ID_VALID is set.
> > * @data_uptr: User pointer to the type specific data
> > *
> > * Explicitly allocate a hardware page table object. This is the same object
> > @@ -441,6 +446,7 @@ struct iommu_hwpt_alloc {
> > __u32 __reserved;
> > __u32 data_type;
> > __u32 data_len;
> > + __u32 fault_id;
> > __aligned_u64 data_uptr;
> > };
> > #define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)
> > diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
> > index 9844a85feeb2..e752d1c49dde 100644
> > --- a/drivers/iommu/iommufd/fault.c
> > +++ b/drivers/iommu/iommufd/fault.c
> > @@ -253,3 +253,17 @@ int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
> >
> > return rc;
> > }
> > +
> > +int iommufd_fault_iopf_handler(struct iopf_group *group)
> > +{
> > + struct iommufd_hw_pagetable *hwpt = group->cookie->domain->fault_data;
> > + struct iommufd_fault *fault = hwpt->fault;
> > +
> > + mutex_lock(&fault->mutex);
> > + list_add_tail(&group->node, &fault->deliver);
> > + mutex_unlock(&fault->mutex);
> > +
> > + wake_up_interruptible(&fault->wait_queue);
> > +
> > + return 0;
> > +}
> > diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
> > index 3f3f1fa1a0a9..2703d5aea4f5 100644
> > --- a/drivers/iommu/iommufd/hw_pagetable.c
> > +++ b/drivers/iommu/iommufd/hw_pagetable.c
> > @@ -8,6 +8,15 @@
> > #include "../iommu-priv.h"
> > #include "iommufd_private.h"
> >
> > +static void __iommufd_hwpt_destroy(struct iommufd_hw_pagetable *hwpt)
> > +{
> > + if (hwpt->domain)
> > + iommu_domain_free(hwpt->domain);
> > +
> > + if (hwpt->fault)
> > + iommufd_put_object(hwpt->fault->ictx, &hwpt->fault->obj);
> > +}
> > +
> > void iommufd_hwpt_paging_destroy(struct iommufd_object *obj)
> > {
> > struct iommufd_hwpt_paging *hwpt_paging =
> > @@ -22,9 +31,7 @@ void iommufd_hwpt_paging_destroy(struct iommufd_object *obj)
> > hwpt_paging->common.domain);
> > }
> >
> > - if (hwpt_paging->common.domain)
> > - iommu_domain_free(hwpt_paging->common.domain);
> > -
> > + __iommufd_hwpt_destroy(&hwpt_paging->common);
> > refcount_dec(&hwpt_paging->ioas->obj.users);
> > }
> >
> > @@ -49,9 +56,7 @@ void iommufd_hwpt_nested_destroy(struct iommufd_object *obj)
> > struct iommufd_hwpt_nested *hwpt_nested =
> > container_of(obj, struct iommufd_hwpt_nested, common.obj);
> >
> > - if (hwpt_nested->common.domain)
> > - iommu_domain_free(hwpt_nested->common.domain);
> > -
> > + __iommufd_hwpt_destroy(&hwpt_nested->common);
> > refcount_dec(&hwpt_nested->parent->common.obj.users);
> > }
> >
> > @@ -213,7 +218,8 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
> > struct iommufd_hw_pagetable *hwpt;
> > int rc;
> >
> > - if (flags || !user_data->len || !ops->domain_alloc_user)
> > + if ((flags & ~IOMMU_HWPT_FAULT_ID_VALID) ||
> > + !user_data->len || !ops->domain_alloc_user)
> > return ERR_PTR(-EOPNOTSUPP);
> > if (parent->auto_domain || !parent->nest_parent)
> > return ERR_PTR(-EINVAL);
> > @@ -227,7 +233,7 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
> > refcount_inc(&parent->common.obj.users);
> > hwpt_nested->parent = parent;
> >
> > - hwpt->domain = ops->domain_alloc_user(idev->dev, flags,
> > + hwpt->domain = ops->domain_alloc_user(idev->dev, 0,
> > parent->common.domain, user_data);
>
> Why remove flags? typo or any reason?
> arm_smmu_domain_alloc_user can not get flags from the user app.
> User should set IOMMU_HWPT_FAULT_ID_VALID, right?
>

Double checked, this does not send flags, 0 is OK,
Only domain_alloc_user in iommufd_hwpt_paging_alloc requires flags.

In my debug, I need this patch, otherwise NULL pointer errors happen
since SVA is not set.

Subject: [PATCH] iommufd: enable/disable IOMMU_DEV_FEAT_SVA when
iopf_enable/disable

When iopf_enable, IOMMU_DEV_FEAT_SVA has to be enabled as well,
which will call iopf_queue_add_device(master->smmu->evtq.iopf, dev).
Otherwise NULL pointer will happen since fault_param is NULL.

And disable IOMMU_DEV_FEAT_SVA in iopf_disable accordingly.

Signed-off-by: Zhangfei Gao <[email protected]>
---
drivers/iommu/iommufd/fault.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
index a4a49f3cd4c2..89837d83e757 100644
--- a/drivers/iommu/iommufd/fault.c
+++ b/drivers/iommu/iommufd/fault.c
@@ -289,6 +289,12 @@ static int iommufd_fault_iopf_enable(struct
iommufd_device *idev)
if (ret)
return ret;

+ ret = iommu_dev_enable_feature(idev->dev, IOMMU_DEV_FEAT_SVA);
+ if (ret) {
+ iommu_dev_disable_feature(idev->dev, IOMMU_DEV_FEAT_IOPF);
+ return ret;
+ }
+
idev->iopf_enabled = true;

return 0;
@@ -299,6 +305,7 @@ static void iommufd_fault_iopf_disable(struct
iommufd_device *idev)
if (!idev->iopf_enabled)
return;

+ iommu_dev_disable_feature(idev->dev, IOMMU_DEV_FEAT_SVA);
iommu_dev_disable_feature(idev->dev, IOMMU_DEV_FEAT_IOPF);
idev->iopf_enabled = false;
}


And iommufd_fault_domain_replace_dev
if (iopf_enabled_originally && !hwpt->fault_capable)
iommufd_fault_iopf_disable(idev);
will cause iopf enable/ disable / enable, is this required, a bit confused.

Thanks

2024-03-06 16:04:57

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

On Wed, Mar 06, 2024 at 11:15:50PM +0800, Zhangfei Gao wrote:
>
> Double checked, this does not send flags, 0 is OK,
> Only domain_alloc_user in iommufd_hwpt_paging_alloc requires flags.
>
> In my debug, I need this patch, otherwise NULL pointer errors happen
> since SVA is not set.

This is some driver bug, we need to get rid of these
iommu_dev_enable_feature() requirements.

Attaching domains with properties should be entirely sufficient.

Jason

2024-03-07 02:57:18

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

On 2024/3/7 0:01, Jason Gunthorpe wrote:
> On Wed, Mar 06, 2024 at 11:15:50PM +0800, Zhangfei Gao wrote:
>> Double checked, this does not send flags, 0 is OK,
>> Only domain_alloc_user in iommufd_hwpt_paging_alloc requires flags.
>>
>> In my debug, I need this patch, otherwise NULL pointer errors happen
>> since SVA is not set.
> This is some driver bug, we need to get rid of these
> iommu_dev_enable_feature() requirements.

Yes. Especially iopf should be independent of SVA.

The problem in the arm smmu v3 driver is that enabling iopf is actually
done in the enabling SVA path, while the enabling iopf path does nothing
except for some checks. It doesn't matter if iopf is tied with SVA, but
when it comes to delivering iopf to user space, we need to decouple it.

Best regards,
baolu

2024-03-08 17:22:03

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

On Thu, Mar 07, 2024 at 09:54:53AM +0800, Baolu Lu wrote:
> On 2024/3/7 0:01, Jason Gunthorpe wrote:
> > On Wed, Mar 06, 2024 at 11:15:50PM +0800, Zhangfei Gao wrote:
> > > Double checked, this does not send flags, 0 is OK,
> > > Only domain_alloc_user in iommufd_hwpt_paging_alloc requires flags.
> > >
> > > In my debug, I need this patch, otherwise NULL pointer errors happen
> > > since SVA is not set.
> > This is some driver bug, we need to get rid of these
> > iommu_dev_enable_feature() requirements.
>
> Yes. Especially iopf should be independent of SVA.
>
> The problem in the arm smmu v3 driver is that enabling iopf is actually
> done in the enabling SVA path, while the enabling iopf path does nothing
> except for some checks. It doesn't matter if iopf is tied with SVA, but
> when it comes to delivering iopf to user space, we need to decouple it.

Yes. Each driver is going to need to get a to a NOP for the feature
things then we can remove them.

I did half the ARM driver in my part 2, the iopf bit still needs
doing there.

Jason

2024-03-08 18:04:22

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 4/8] iommufd: Add iommufd fault object

On Mon, Jan 22, 2024 at 03:38:59PM +0800, Lu Baolu wrote:
> --- /dev/null
> +++ b/drivers/iommu/iommufd/fault.c
> @@ -0,0 +1,255 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (C) 2024 Intel Corporation
> + */
> +#define pr_fmt(fmt) "iommufd: " fmt
> +
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/iommufd.h>
> +#include <linux/poll.h>
> +#include <linux/anon_inodes.h>
> +#include <uapi/linux/iommufd.h>
> +
> +#include "iommufd_private.h"
> +
> +static int device_add_fault(struct iopf_group *group)
> +{
> + struct iommufd_device *idev = group->cookie->private;
> + void *curr;
> +
> + curr = xa_cmpxchg(&idev->faults, group->last_fault.fault.prm.grpid,
> + NULL, group, GFP_KERNEL);
> +
> + return curr ? xa_err(curr) : 0;
> +}
> +
> +static void device_remove_fault(struct iopf_group *group)
> +{
> + struct iommufd_device *idev = group->cookie->private;
> +
> + xa_store(&idev->faults, group->last_fault.fault.prm.grpid,
> + NULL, GFP_KERNEL);

xa_erase ?

Is grpid OK to use this way? Doesn't it come from the originating
device?

> +void iommufd_fault_destroy(struct iommufd_object *obj)
> +{
> + struct iommufd_fault *fault = container_of(obj, struct iommufd_fault, obj);
> + struct iopf_group *group, *next;
> +
> + mutex_lock(&fault->mutex);
> + list_for_each_entry_safe(group, next, &fault->deliver, node) {
> + list_del(&group->node);
> + iopf_group_response(group, IOMMU_PAGE_RESP_INVALID);
> + iopf_free_group(group);
> + }
> + list_for_each_entry_safe(group, next, &fault->response, node) {
> + list_del(&group->node);
> + device_remove_fault(group);
> + iopf_group_response(group, IOMMU_PAGE_RESP_INVALID);
> + iopf_free_group(group);
> + }
> + mutex_unlock(&fault->mutex);
> +
> + mutex_destroy(&fault->mutex);

It is really weird to lock a mutex we are about to destroy? At this
point the refcount on the iommufd_object is zero so there had better
not be any other threads with access to this pointer!

> +static ssize_t iommufd_fault_fops_read(struct file *filep, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + size_t fault_size = sizeof(struct iommu_hwpt_pgfault);
> + struct iommufd_fault *fault = filep->private_data;
> + struct iommu_hwpt_pgfault data;
> + struct iommufd_device *idev;
> + struct iopf_group *group;
> + struct iopf_fault *iopf;
> + size_t done = 0;
> + int rc;
> +
> + if (*ppos || count % fault_size)
> + return -ESPIPE;
> +
> + mutex_lock(&fault->mutex);
> + while (!list_empty(&fault->deliver) && count > done) {
> + group = list_first_entry(&fault->deliver,
> + struct iopf_group, node);
> +
> + if (list_count_nodes(&group->faults) * fault_size > count - done)
> + break;
> +
> + idev = (struct iommufd_device *)group->cookie->private;
> + list_for_each_entry(iopf, &group->faults, list) {
> + iommufd_compose_fault_message(&iopf->fault, &data, idev);
> + rc = copy_to_user(buf + done, &data, fault_size);
> + if (rc)
> + goto err_unlock;
> + done += fault_size;
> + }
> +
> + rc = device_add_fault(group);

See I wonder if this should be some xa_alloc or something instead of
trying to use the grpid?

> + while (!list_empty(&fault->response) && count > done) {
> + rc = copy_from_user(&response, buf + done, response_size);
> + if (rc)
> + break;
> +
> + idev = container_of(iommufd_get_object(fault->ictx,
> + response.dev_id,
> + IOMMUFD_OBJ_DEVICE),
> + struct iommufd_device, obj);

It seems unfortunate we do this lookup for every iteration of the loop,
I would lift it up and cache it..

> + if (IS_ERR(idev))
> + break;
> +
> + group = device_get_fault(idev, response.grpid);

This looks locked wrong, it should hold the fault mutex here and we
should probably do xchng to zero it at the same time we fetch it.

> + if (!group ||
> + response.addr != group->last_fault.fault.prm.addr ||
> + ((group->last_fault.fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) &&
> + response.pasid != group->last_fault.fault.prm.pasid)) {

Why? If grpid is unique then just trust it.

> + iommufd_put_object(fault->ictx, &idev->obj);
> + break;
> + }
> +
> + iopf_group_response(group, response.code);
> +
> + mutex_lock(&fault->mutex);
> + list_del(&group->node);
> + mutex_unlock(&fault->mutex);
> +
> + device_remove_fault(group);
> + iopf_free_group(group);
> + done += response_size;
> +
> + iommufd_put_object(fault->ictx, &idev->obj);
> + }
> +
> + return done;
> +}
> +
> +static __poll_t iommufd_fault_fops_poll(struct file *filep,
> + struct poll_table_struct *wait)
> +{
> + struct iommufd_fault *fault = filep->private_data;
> + __poll_t pollflags = 0;
> +
> + poll_wait(filep, &fault->wait_queue, wait);
> + mutex_lock(&fault->mutex);
> + if (!list_empty(&fault->deliver))
> + pollflags = EPOLLIN | EPOLLRDNORM;
> + mutex_unlock(&fault->mutex);
> +
> + return pollflags;
> +}
> +
> +static const struct file_operations iommufd_fault_fops = {
> + .owner = THIS_MODULE,
> + .open = nonseekable_open,
> + .read = iommufd_fault_fops_read,
> + .write = iommufd_fault_fops_write,
> + .poll = iommufd_fault_fops_poll,
> + .llseek = no_llseek,
> +};

No release? That seems wrong..

> +static int get_fault_fd(struct iommufd_fault *fault)
> +{
> + struct file *filep;
> + int fdno;
> +
> + fdno = get_unused_fd_flags(O_CLOEXEC);
> + if (fdno < 0)
> + return fdno;
> +
> + filep = anon_inode_getfile("[iommufd-pgfault]", &iommufd_fault_fops,
> + fault, O_RDWR);
^^^^^^

See here you stick the iommufd_object into the FD but we don't
refcount it...

And iommufd_fault_destroy() doesn't do anything about this, so it just
UAFs the fault memory in the FD.

You need to refcount the iommufd_object here and add a release
function for the fd to put it back

*and* this FD needs to take a reference on the base iommufd_ctx fd too
as we can't tolerate them being destroyed out of sequence.

> +int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
> +{
> + struct iommu_fault_alloc *cmd = ucmd->cmd;
> + struct iommufd_fault *fault;
> + int rc;
> +
> + if (cmd->flags)
> + return -EOPNOTSUPP;
> +
> + fault = iommufd_object_alloc(ucmd->ictx, fault, IOMMUFD_OBJ_FAULT);
> + if (IS_ERR(fault))
> + return PTR_ERR(fault);
> +
> + fault->ictx = ucmd->ictx;
> + INIT_LIST_HEAD(&fault->deliver);
> + INIT_LIST_HEAD(&fault->response);
> + mutex_init(&fault->mutex);
> + init_waitqueue_head(&fault->wait_queue);
> +
> + rc = get_fault_fd(fault);
> + if (rc < 0)
> + goto out_abort;

And this is ordered wrong, fd_install has to happen after return to
user space is guarenteed as it cannot be undone.

> + cmd->out_fault_id = fault->obj.id;
> + cmd->out_fault_fd = rc;
> +
> + rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
> + if (rc)
> + goto out_abort;
> + iommufd_object_finalize(ucmd->ictx, &fault->obj);

fd_install() goes here

> + return 0;
> +out_abort:
> + iommufd_object_abort_and_destroy(ucmd->ictx, &fault->obj);
> +
> + return rc;
> +}

Jason

2024-03-08 19:05:51

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

On Mon, Jan 22, 2024 at 03:39:00PM +0800, Lu Baolu wrote:

> @@ -411,6 +414,8 @@ enum iommu_hwpt_data_type {
> * @__reserved: Must be 0
> * @data_type: One of enum iommu_hwpt_data_type
> * @data_len: Length of the type specific data
> + * @fault_id: The ID of IOMMUFD_FAULT object. Valid only if flags field of
> + * IOMMU_HWPT_FAULT_ID_VALID is set.
> * @data_uptr: User pointer to the type specific data
> *
> * Explicitly allocate a hardware page table object. This is the same object
> @@ -441,6 +446,7 @@ struct iommu_hwpt_alloc {
> __u32 __reserved;
> __u32 data_type;
> __u32 data_len;
> + __u32 fault_id;
> __aligned_u64 data_uptr;
> };

?? We can't add fault_id in the middle of the struct??

> + if (cmd->flags & IOMMU_HWPT_FAULT_ID_VALID) {
> + struct iommufd_fault *fault;
> +
> + fault = iommufd_get_fault(ucmd, cmd->fault_id);
> + if (IS_ERR(fault)) {
> + rc = PTR_ERR(fault);
> + goto out_hwpt;
> + }
> + hwpt->fault = fault;
> + hwpt->domain->iopf_handler = iommufd_fault_iopf_handler;
> + hwpt->domain->fault_data = hwpt;
> + hwpt->fault_capable = true;

I wonder if there should be an iommu API to make a domain fault
capable?

Jason

2024-03-15 01:47:05

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v3 4/8] iommufd: Add iommufd fault object

On 3/9/24 2:03 AM, Jason Gunthorpe wrote:
> On Mon, Jan 22, 2024 at 03:38:59PM +0800, Lu Baolu wrote:
>> --- /dev/null
>> +++ b/drivers/iommu/iommufd/fault.c
>> @@ -0,0 +1,255 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/* Copyright (C) 2024 Intel Corporation
>> + */
>> +#define pr_fmt(fmt) "iommufd: " fmt
>> +
>> +#include <linux/file.h>
>> +#include <linux/fs.h>
>> +#include <linux/module.h>
>> +#include <linux/mutex.h>
>> +#include <linux/iommufd.h>
>> +#include <linux/poll.h>
>> +#include <linux/anon_inodes.h>
>> +#include <uapi/linux/iommufd.h>
>> +
>> +#include "iommufd_private.h"
>> +
>> +static int device_add_fault(struct iopf_group *group)
>> +{
>> + struct iommufd_device *idev = group->cookie->private;
>> + void *curr;
>> +
>> + curr = xa_cmpxchg(&idev->faults, group->last_fault.fault.prm.grpid,
>> + NULL, group, GFP_KERNEL);
>> +
>> + return curr ? xa_err(curr) : 0;
>> +}
>> +
>> +static void device_remove_fault(struct iopf_group *group)
>> +{
>> + struct iommufd_device *idev = group->cookie->private;
>> +
>> + xa_store(&idev->faults, group->last_fault.fault.prm.grpid,
>> + NULL, GFP_KERNEL);
>
> xa_erase ?

Yes. Sure.

> Is grpid OK to use this way? Doesn't it come from the originating
> device?

The group ID is generated by the hardware. Here, we use it as an index
in the fault array to ensure it can be quickly retrieved in the page
fault response path.

>> +void iommufd_fault_destroy(struct iommufd_object *obj)
>> +{
>> + struct iommufd_fault *fault = container_of(obj, struct iommufd_fault, obj);
>> + struct iopf_group *group, *next;
>> +
>> + mutex_lock(&fault->mutex);
>> + list_for_each_entry_safe(group, next, &fault->deliver, node) {
>> + list_del(&group->node);
>> + iopf_group_response(group, IOMMU_PAGE_RESP_INVALID);
>> + iopf_free_group(group);
>> + }
>> + list_for_each_entry_safe(group, next, &fault->response, node) {
>> + list_del(&group->node);
>> + device_remove_fault(group);
>> + iopf_group_response(group, IOMMU_PAGE_RESP_INVALID);
>> + iopf_free_group(group);
>> + }
>> + mutex_unlock(&fault->mutex);
>> +
>> + mutex_destroy(&fault->mutex);
>
> It is really weird to lock a mutex we are about to destroy? At this
> point the refcount on the iommufd_object is zero so there had better
> not be any other threads with access to this pointer!

You are right. I will remove the lock/unlock and add a comment instead.

>
>> +static ssize_t iommufd_fault_fops_read(struct file *filep, char __user *buf,
>> + size_t count, loff_t *ppos)
>> +{
>> + size_t fault_size = sizeof(struct iommu_hwpt_pgfault);
>> + struct iommufd_fault *fault = filep->private_data;
>> + struct iommu_hwpt_pgfault data;
>> + struct iommufd_device *idev;
>> + struct iopf_group *group;
>> + struct iopf_fault *iopf;
>> + size_t done = 0;
>> + int rc;
>> +
>> + if (*ppos || count % fault_size)
>> + return -ESPIPE;
>> +
>> + mutex_lock(&fault->mutex);
>> + while (!list_empty(&fault->deliver) && count > done) {
>> + group = list_first_entry(&fault->deliver,
>> + struct iopf_group, node);
>> +
>> + if (list_count_nodes(&group->faults) * fault_size > count - done)
>> + break;
>> +
>> + idev = (struct iommufd_device *)group->cookie->private;
>> + list_for_each_entry(iopf, &group->faults, list) {
>> + iommufd_compose_fault_message(&iopf->fault, &data, idev);
>> + rc = copy_to_user(buf + done, &data, fault_size);
>> + if (rc)
>> + goto err_unlock;
>> + done += fault_size;
>> + }
>> +
>> + rc = device_add_fault(group);
>
> See I wonder if this should be some xa_alloc or something instead of
> trying to use the grpid?

So this magic number will be passed to user space in the fault message.
And the user will then include this number in its response message. The
response message is valid only when the magic number matches. Do I get
you correctly?

>
>> + while (!list_empty(&fault->response) && count > done) {
>> + rc = copy_from_user(&response, buf + done, response_size);
>> + if (rc)
>> + break;
>> +
>> + idev = container_of(iommufd_get_object(fault->ictx,
>> + response.dev_id,
>> + IOMMUFD_OBJ_DEVICE),
>> + struct iommufd_device, obj);
>
> It seems unfortunate we do this lookup for every iteration of the loop,
> I would lift it up and cache it..

Okay.

>
>> + if (IS_ERR(idev))
>> + break;
>> +
>> + group = device_get_fault(idev, response.grpid);
>
> This looks locked wrong, it should hold the fault mutex here and we
> should probably do xchng to zero it at the same time we fetch it.

Okay, will fix it.

>
>> + if (!group ||
>> + response.addr != group->last_fault.fault.prm.addr ||
>> + ((group->last_fault.fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) &&
>> + response.pasid != group->last_fault.fault.prm.pasid)) {
>
> Why? If grpid is unique then just trust it.

I was just considering that we should add some sanity check here to
ensure the response message is composed in the right way.

>
>> + iommufd_put_object(fault->ictx, &idev->obj);
>> + break;
>> + }
>> +
>> + iopf_group_response(group, response.code);
>> +
>> + mutex_lock(&fault->mutex);
>> + list_del(&group->node);
>> + mutex_unlock(&fault->mutex);
>> +
>> + device_remove_fault(group);
>> + iopf_free_group(group);
>> + done += response_size;
>> +
>> + iommufd_put_object(fault->ictx, &idev->obj);
>> + }
>> +
>> + return done;
>> +}
>> +
>> +static __poll_t iommufd_fault_fops_poll(struct file *filep,
>> + struct poll_table_struct *wait)
>> +{
>> + struct iommufd_fault *fault = filep->private_data;
>> + __poll_t pollflags = 0;
>> +
>> + poll_wait(filep, &fault->wait_queue, wait);
>> + mutex_lock(&fault->mutex);
>> + if (!list_empty(&fault->deliver))
>> + pollflags = EPOLLIN | EPOLLRDNORM;
>> + mutex_unlock(&fault->mutex);
>> +
>> + return pollflags;
>> +}
>> +
>> +static const struct file_operations iommufd_fault_fops = {
>> + .owner = THIS_MODULE,
>> + .open = nonseekable_open,
>> + .read = iommufd_fault_fops_read,
>> + .write = iommufd_fault_fops_write,
>> + .poll = iommufd_fault_fops_poll,
>> + .llseek = no_llseek,
>> +};
>
> No release? That seems wrong..

Yes. Will fix it.

>
>> +static int get_fault_fd(struct iommufd_fault *fault)
>> +{
>> + struct file *filep;
>> + int fdno;
>> +
>> + fdno = get_unused_fd_flags(O_CLOEXEC);
>> + if (fdno < 0)
>> + return fdno;
>> +
>> + filep = anon_inode_getfile("[iommufd-pgfault]", &iommufd_fault_fops,
>> + fault, O_RDWR);
> ^^^^^^
>
> See here you stick the iommufd_object into the FD but we don't
> refcount it...
>
> And iommufd_fault_destroy() doesn't do anything about this, so it just
> UAFs the fault memory in the FD.
>
> You need to refcount the iommufd_object here and add a release
> function for the fd to put it back
>
> *and* this FD needs to take a reference on the base iommufd_ctx fd too
> as we can't tolerate them being destroyed out of sequence.

Good catch. I will fix it.

>
>> +int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
>> +{
>> + struct iommu_fault_alloc *cmd = ucmd->cmd;
>> + struct iommufd_fault *fault;
>> + int rc;
>> +
>> + if (cmd->flags)
>> + return -EOPNOTSUPP;
>> +
>> + fault = iommufd_object_alloc(ucmd->ictx, fault, IOMMUFD_OBJ_FAULT);
>> + if (IS_ERR(fault))
>> + return PTR_ERR(fault);
>> +
>> + fault->ictx = ucmd->ictx;
>> + INIT_LIST_HEAD(&fault->deliver);
>> + INIT_LIST_HEAD(&fault->response);
>> + mutex_init(&fault->mutex);
>> + init_waitqueue_head(&fault->wait_queue);
>> +
>> + rc = get_fault_fd(fault);
>> + if (rc < 0)
>> + goto out_abort;
>
> And this is ordered wrong, fd_install has to happen after return to
> user space is guarenteed as it cannot be undone.
>
>> + cmd->out_fault_id = fault->obj.id;
>> + cmd->out_fault_fd = rc;
>> +
>> + rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
>> + if (rc)
>> + goto out_abort;
>> + iommufd_object_finalize(ucmd->ictx, &fault->obj);
>
> fd_install() goes here

Yes. Sure.

>
>> + return 0;
>> +out_abort:
>> + iommufd_object_abort_and_destroy(ucmd->ictx, &fault->obj);
>> +
>> + return rc;
>> +}
>
> Jason

Best regards,
baolu

2024-03-15 05:22:38

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

On 3/9/24 3:05 AM, Jason Gunthorpe wrote:
> On Mon, Jan 22, 2024 at 03:39:00PM +0800, Lu Baolu wrote:
>
>> @@ -411,6 +414,8 @@ enum iommu_hwpt_data_type {
>> * @__reserved: Must be 0
>> * @data_type: One of enum iommu_hwpt_data_type
>> * @data_len: Length of the type specific data
>> + * @fault_id: The ID of IOMMUFD_FAULT object. Valid only if flags field of
>> + * IOMMU_HWPT_FAULT_ID_VALID is set.
>> * @data_uptr: User pointer to the type specific data
>> *
>> * Explicitly allocate a hardware page table object. This is the same object
>> @@ -441,6 +446,7 @@ struct iommu_hwpt_alloc {
>> __u32 __reserved;
>> __u32 data_type;
>> __u32 data_len;
>> + __u32 fault_id;
>> __aligned_u64 data_uptr;
>> };
>
> ?? We can't add fault_id in the middle of the struct??

Yes. I should add the new field at the end.

By the way, with a __u32 added, this data structure is not 64-byte-
aligned anymore. Do we need to add another unused u32 entry, or just let
the compiler handle it?

>
>> + if (cmd->flags & IOMMU_HWPT_FAULT_ID_VALID) {
>> + struct iommufd_fault *fault;
>> +
>> + fault = iommufd_get_fault(ucmd, cmd->fault_id);
>> + if (IS_ERR(fault)) {
>> + rc = PTR_ERR(fault);
>> + goto out_hwpt;
>> + }
>> + hwpt->fault = fault;
>> + hwpt->domain->iopf_handler = iommufd_fault_iopf_handler;
>> + hwpt->domain->fault_data = hwpt;
>> + hwpt->fault_capable = true;
>
> I wonder if there should be an iommu API to make a domain fault
> capable?

The iommu core identifies a fault-capable domain by checking its
domain->iopf_handler. Anyway, what's the difference between a fault or
non-fault capable domain from iommu core's point of view?

Best regards,
baolu

Subject: RE: [PATCH v3 4/8] iommufd: Add iommufd fault object



> -----Original Message-----
> From: Lu Baolu <[email protected]>
> Sent: Monday, January 22, 2024 7:39 AM
> To: Jason Gunthorpe <[email protected]>; Kevin Tian <[email protected]>; Joerg
> Roedel <[email protected]>; Will Deacon <[email protected]>; Robin Murphy
> <[email protected]>; Jean-Philippe Brucker <[email protected]>;
> Nicolin Chen <[email protected]>; Yi Liu <[email protected]>; Jacob Pan
> <[email protected]>; Joel Granados <[email protected]>
> Cc: [email protected]; [email protected]; linux-
> [email protected]; Lu Baolu <[email protected]>
> Subject: [PATCH v3 4/8] iommufd: Add iommufd fault object
>
> An iommufd fault object provides an interface for delivering I/O page
> faults to user space. These objects are created and destroyed by user
> space, and they can be associated with or dissociated from hardware page
> table objects during page table allocation or destruction.
>
> User space interacts with the fault object through a file interface. This
> interface offers a straightforward and efficient way for user space to
> handle page faults. It allows user space to read fault messages
> sequentially and respond to them by writing to the same file. The file
> interface supports reading messages in poll mode, so it's recommended that
> user space applications use io_uring to enhance read and write efficiency.
>
> A fault object can be associated with any iopf-capable iommufd_hw_pgtable
> during the pgtable's allocation. All I/O page faults triggered by devices
> when accessing the I/O addresses of an iommufd_hw_pgtable are routed
> through the fault object to user space. Similarly, user space's responses
> to these page faults are routed back to the iommu device driver through
> the same fault object.
>
> Signed-off-by: Lu Baolu <[email protected]>

[...]

> +static __poll_t iommufd_fault_fops_poll(struct file *filep,
> + struct poll_table_struct *wait)
> +{
> + struct iommufd_fault *fault = filep->private_data;
> + __poll_t pollflags = 0;
> +
> + poll_wait(filep, &fault->wait_queue, wait);
> + mutex_lock(&fault->mutex);
> + if (!list_empty(&fault->deliver))
> + pollflags = EPOLLIN | EPOLLRDNORM;
> + mutex_unlock(&fault->mutex);
> +
> + return pollflags;
> +}
> +
> +static const struct file_operations iommufd_fault_fops = {
> + .owner = THIS_MODULE,
> + .open = nonseekable_open,
> + .read = iommufd_fault_fops_read,
> + .write = iommufd_fault_fops_write,
> + .poll = iommufd_fault_fops_poll,
> + .llseek = no_llseek,
> +};

Hi

I am trying to enable Qemu vSVA support on ARM with this series.
I am using io_uring APIs with the fault fd to handle the page fault
in the Qemu.

Please find the implementation here[1]. This is still a work in progress
and is based on Nicolin's latest nested Qemu branch.

And I am running into a problem when we have the poll interface added
for the fault fd in kernel.

What I have noticed is that,
-read interface works fine and I can receive struct tiommu_hwpt_pgfault data.
-But once Guest handles the page faults and returns the page response,
the write to fault fd never reaches the kernel. The sequence is like below,

sqe = io_uring_get_sqe(ring);
io_uring_prep_write(sqe, hwpt->fault_fd, resp, sizeof(*resp), 0);
io_uring_sqe_set_data(sqe, resp);
io_uring_submit(ring);
ret = io_uring_wait_cqe(ring, &cqe);
....
Please find the function here[2]

The above cqe wait never returns and hardware times out without receiving
page response. My understanding of io_uring default op is that it tries to
issue an sqe as non-blocking first. But it looks like the above write sequence
ends up in kernel poll_wait() as well.Not sure how we can avoid that for
write.

All works fine if I comment out the poll for the fault_fd from the kernel.
But then of course Qemu ends up repeatedly reading the ring Queue for
any pending page fault.

It might be something I am missing in my understanding of io_uring APIs.
Just thought of checking with you if you have any Qemu implementation
using io_uring APIs to test this.

Also appreciate any pointers in resolving this.

Thanks,
Shameer
[1] https://github.com/hisilicon/qemu/tree/iommufd_vsmmu-02292024-vsva-wip
[2] https://github.com/hisilicon/qemu/blob/2b984fb5c692a03e6f5463d005670d2e2a2c7304/hw/arm/smmuv3.c#L1310


2024-03-22 17:07:41

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

On Fri, Mar 15, 2024 at 09:16:43AM +0800, Baolu Lu wrote:
> On 3/9/24 3:05 AM, Jason Gunthorpe wrote:
> > On Mon, Jan 22, 2024 at 03:39:00PM +0800, Lu Baolu wrote:
> >
> > > @@ -411,6 +414,8 @@ enum iommu_hwpt_data_type {
> > > * @__reserved: Must be 0
> > > * @data_type: One of enum iommu_hwpt_data_type
> > > * @data_len: Length of the type specific data
> > > + * @fault_id: The ID of IOMMUFD_FAULT object. Valid only if flags field of
> > > + * IOMMU_HWPT_FAULT_ID_VALID is set.
> > > * @data_uptr: User pointer to the type specific data
> > > *
> > > * Explicitly allocate a hardware page table object. This is the same object
> > > @@ -441,6 +446,7 @@ struct iommu_hwpt_alloc {
> > > __u32 __reserved;
> > > __u32 data_type;
> > > __u32 data_len;
> > > + __u32 fault_id;
> > > __aligned_u64 data_uptr;
> > > };
> >
> > ?? We can't add fault_id in the middle of the struct??
>
> Yes. I should add the new field at the end.
>
> By the way, with a __u32 added, this data structure is not 64-byte-
> aligned anymore. Do we need to add another unused u32 entry, or just let
> the compiler handle it?

Yes, add a reserved u32 to ensure the structs is always without
implicit padding.

> >
> > > + if (cmd->flags & IOMMU_HWPT_FAULT_ID_VALID) {
> > > + struct iommufd_fault *fault;
> > > +
> > > + fault = iommufd_get_fault(ucmd, cmd->fault_id);
> > > + if (IS_ERR(fault)) {
> > > + rc = PTR_ERR(fault);
> > > + goto out_hwpt;
> > > + }
> > > + hwpt->fault = fault;
> > > + hwpt->domain->iopf_handler = iommufd_fault_iopf_handler;
> > > + hwpt->domain->fault_data = hwpt;
> > > + hwpt->fault_capable = true;
> >
> > I wonder if there should be an iommu API to make a domain fault
> > capable?
>
> The iommu core identifies a fault-capable domain by checking its
> domain->iopf_handler. Anyway, what's the difference between a fault or
> non-fault capable domain from iommu core's point of view?

From the core? Nothing. I'm just wondering from an API perspective if
we should have a little inline to indicate it.

Jason

2024-03-22 17:11:19

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 4/8] iommufd: Add iommufd fault object

On Fri, Mar 15, 2024 at 09:46:06AM +0800, Baolu Lu wrote:
> On 3/9/24 2:03 AM, Jason Gunthorpe wrote:
> > On Mon, Jan 22, 2024 at 03:38:59PM +0800, Lu Baolu wrote:
> > > --- /dev/null
> > > +++ b/drivers/iommu/iommufd/fault.c
> > > @@ -0,0 +1,255 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/* Copyright (C) 2024 Intel Corporation
> > > + */
> > > +#define pr_fmt(fmt) "iommufd: " fmt
> > > +
> > > +#include <linux/file.h>
> > > +#include <linux/fs.h>
> > > +#include <linux/module.h>
> > > +#include <linux/mutex.h>
> > > +#include <linux/iommufd.h>
> > > +#include <linux/poll.h>
> > > +#include <linux/anon_inodes.h>
> > > +#include <uapi/linux/iommufd.h>
> > > +
> > > +#include "iommufd_private.h"
> > > +
> > > +static int device_add_fault(struct iopf_group *group)
> > > +{
> > > + struct iommufd_device *idev = group->cookie->private;
> > > + void *curr;
> > > +
> > > + curr = xa_cmpxchg(&idev->faults, group->last_fault.fault.prm.grpid,
> > > + NULL, group, GFP_KERNEL);
> > > +
> > > + return curr ? xa_err(curr) : 0;
> > > +}
> > > +
> > > +static void device_remove_fault(struct iopf_group *group)
> > > +{
> > > + struct iommufd_device *idev = group->cookie->private;
> > > +
> > > + xa_store(&idev->faults, group->last_fault.fault.prm.grpid,
> > > + NULL, GFP_KERNEL);
> >
> > xa_erase ?
>
> Yes. Sure.
>
> > Is grpid OK to use this way? Doesn't it come from the originating
> > device?
>
> The group ID is generated by the hardware. Here, we use it as an index
> in the fault array to ensure it can be quickly retrieved in the page
> fault response path.

I'm nervous about this, we are trusting HW outside the kernel to
provide unique grp id's which are integral to how the kernel
operates..

> > > +static ssize_t iommufd_fault_fops_read(struct file *filep, char __user *buf,
> > > + size_t count, loff_t *ppos)
> > > +{
> > > + size_t fault_size = sizeof(struct iommu_hwpt_pgfault);
> > > + struct iommufd_fault *fault = filep->private_data;
> > > + struct iommu_hwpt_pgfault data;
> > > + struct iommufd_device *idev;
> > > + struct iopf_group *group;
> > > + struct iopf_fault *iopf;
> > > + size_t done = 0;
> > > + int rc;
> > > +
> > > + if (*ppos || count % fault_size)
> > > + return -ESPIPE;
> > > +
> > > + mutex_lock(&fault->mutex);
> > > + while (!list_empty(&fault->deliver) && count > done) {
> > > + group = list_first_entry(&fault->deliver,
> > > + struct iopf_group, node);
> > > +
> > > + if (list_count_nodes(&group->faults) * fault_size > count - done)
> > > + break;
> > > +
> > > + idev = (struct iommufd_device *)group->cookie->private;
> > > + list_for_each_entry(iopf, &group->faults, list) {
> > > + iommufd_compose_fault_message(&iopf->fault, &data, idev);
> > > + rc = copy_to_user(buf + done, &data, fault_size);
> > > + if (rc)
> > > + goto err_unlock;
> > > + done += fault_size;
> > > + }
> > > +
> > > + rc = device_add_fault(group);
> >
> > See I wonder if this should be some xa_alloc or something instead of
> > trying to use the grpid?
>
> So this magic number will be passed to user space in the fault message.
> And the user will then include this number in its response message. The
> response message is valid only when the magic number matches. Do I get
> you correctly?

Yes, then it is simple xa_alloc() and xa_load() without any other
searching and we don't have to rely on the grpid to be correctly
formed by the PCI device.

But I don't know about performance xa_alloc() is pretty fast but
trusting the grpid would be faster..

IMHO from a uapi perspective we should have a definate "cookie" that
gets echo'd back. If the kernel uses xa_alloc or grpid to build that
cookie it doesn't matter to the uAPI.

Jason

2024-03-22 17:22:43

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 4/8] iommufd: Add iommufd fault object

On Wed, Mar 20, 2024 at 04:18:05PM +0000, Shameerali Kolothum Thodi wrote:
>
> What I have noticed is that,
> -read interface works fine and I can receive struct tiommu_hwpt_pgfault data.
> -But once Guest handles the page faults and returns the page response,
> the write to fault fd never reaches the kernel. The sequence is like below,
>
> sqe = io_uring_get_sqe(ring);
> io_uring_prep_write(sqe, hwpt->fault_fd, resp, sizeof(*resp), 0);
> io_uring_sqe_set_data(sqe, resp);
> io_uring_submit(ring);
> ret = io_uring_wait_cqe(ring, &cqe);
> ....
> Please find the function here[2]
>
> The above cqe wait never returns and hardware times out without receiving
> page response. My understanding of io_uring default op is that it tries to
> issue an sqe as non-blocking first. But it looks like the above write sequence
> ends up in kernel poll_wait() as well.Not sure how we can avoid that for
> write.

Ah, right, it is because poll can't be choosy about read/write, it has
to work equally for both directions. iommufd_fault_fops_poll() never
returns EPOLLOUT

It should just always return EPOLLOUT because we don't have any queue
to manage.

Jason

2024-03-25 12:46:44

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v3 4/8] iommufd: Add iommufd fault object

On 3/25/24 11:26 AM, Baolu Lu wrote:
> On 3/23/24 1:22 AM, Jason Gunthorpe wrote:
>> On Wed, Mar 20, 2024 at 04:18:05PM +0000, Shameerali Kolothum Thodi
>> wrote:
>>> What I have noticed is that,
>>> -read interface works fine and I can receive struct
>>> tiommu_hwpt_pgfault data.
>>> -But once Guest handles the page faults and returns the page response,
>>>   the write to fault fd never reaches the kernel. The sequence is
>>> like below,
>>>    sqe = io_uring_get_sqe(ring);
>>>    io_uring_prep_write(sqe, hwpt->fault_fd, resp, sizeof(*resp), 0);
>>>    io_uring_sqe_set_data(sqe, resp);
>>>    io_uring_submit(ring);
>>>    ret = io_uring_wait_cqe(ring, &cqe);
>>>    ....
>>> Please find the function here[2]
>>>
>>> The above cqe wait never returns and hardware times out without
>>> receiving
>>> page response. My understanding of io_uring default op is that it
>>> tries to
>>> issue an sqe as non-blocking first. But it looks like the above write
>>> sequence
>>> ends up in kernel poll_wait() as well.Not sure how we can avoid that for
>>> write.
>> Ah, right, it is because poll can't be choosy about read/write, it has
>> to work equally for both directions. iommufd_fault_fops_poll() never
>> returns EPOLLOUT
>>
>> It should just always return EPOLLOUT because we don't have any queue
>> to manage.
>
> Are you suggesting the poll file operation to be like below?
>
> static __poll_t iommufd_fault_fops_poll(struct file *filep,
>                                         struct poll_table_struct *wait)
> {
>         struct iommufd_fault *fault = filep->private_data;
>         __poll_t pollflags = EPOLLOUT;
>
>         poll_wait(filep, &fault->wait_queue, wait);
>         mutex_lock(&fault->mutex);
>         if (!list_empty(&fault->deliver))
>                 pollflags = EPOLLIN | EPOLLRDNORM;

If I understood it correctly, here it should be,

pollflags |= EPOLLIN | EPOLLRDNORM;

>         mutex_unlock(&fault->mutex);
>
>         return pollflags;
> }

Best regards,
baolu

2024-03-25 12:54:16

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

On 2024/3/23 1:06, Jason Gunthorpe wrote:
> On Fri, Mar 15, 2024 at 09:16:43AM +0800, Baolu Lu wrote:
>> On 3/9/24 3:05 AM, Jason Gunthorpe wrote:
>>> On Mon, Jan 22, 2024 at 03:39:00PM +0800, Lu Baolu wrote:
>>>
>>>> @@ -411,6 +414,8 @@ enum iommu_hwpt_data_type {
>>>> * @__reserved: Must be 0
>>>> * @data_type: One of enum iommu_hwpt_data_type
>>>> * @data_len: Length of the type specific data
>>>> + * @fault_id: The ID of IOMMUFD_FAULT object. Valid only if flags field of
>>>> + * IOMMU_HWPT_FAULT_ID_VALID is set.
>>>> * @data_uptr: User pointer to the type specific data
>>>> *
>>>> * Explicitly allocate a hardware page table object. This is the same object
>>>> @@ -441,6 +446,7 @@ struct iommu_hwpt_alloc {
>>>> __u32 __reserved;
>>>> __u32 data_type;
>>>> __u32 data_len;
>>>> + __u32 fault_id;
>>>> __aligned_u64 data_uptr;
>>>> };
>>>
>>> ?? We can't add fault_id in the middle of the struct??
>>
>> Yes. I should add the new field at the end.
>>
>> By the way, with a __u32 added, this data structure is not 64-byte-
>> aligned anymore. Do we need to add another unused u32 entry, or just let
>> the compiler handle it?
>
> Yes, add a reserved u32 to ensure the structs is always without
> implicit padding.

Sure.

>
>>>
>>>> + if (cmd->flags & IOMMU_HWPT_FAULT_ID_VALID) {
>>>> + struct iommufd_fault *fault;
>>>> +
>>>> + fault = iommufd_get_fault(ucmd, cmd->fault_id);
>>>> + if (IS_ERR(fault)) {
>>>> + rc = PTR_ERR(fault);
>>>> + goto out_hwpt;
>>>> + }
>>>> + hwpt->fault = fault;
>>>> + hwpt->domain->iopf_handler = iommufd_fault_iopf_handler;
>>>> + hwpt->domain->fault_data = hwpt;
>>>> + hwpt->fault_capable = true;
>>>
>>> I wonder if there should be an iommu API to make a domain fault
>>> capable?
>>
>> The iommu core identifies a fault-capable domain by checking its
>> domain->iopf_handler. Anyway, what's the difference between a fault or
>> non-fault capable domain from iommu core's point of view?
>
> From the core? Nothing. I'm just wondering from an API perspective if
> we should have a little inline to indicate it.

I have no objection if there's a consumer for it.

Best regards,
baolu


2024-03-25 12:54:44

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v3 4/8] iommufd: Add iommufd fault object

On 2024/3/23 1:09, Jason Gunthorpe wrote:
> On Fri, Mar 15, 2024 at 09:46:06AM +0800, Baolu Lu wrote:
>> On 3/9/24 2:03 AM, Jason Gunthorpe wrote:
>>> On Mon, Jan 22, 2024 at 03:38:59PM +0800, Lu Baolu wrote:
>>>> --- /dev/null
>>>> +++ b/drivers/iommu/iommufd/fault.c
>>>> @@ -0,0 +1,255 @@
>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>> +/* Copyright (C) 2024 Intel Corporation
>>>> + */
>>>> +#define pr_fmt(fmt) "iommufd: " fmt
>>>> +
>>>> +#include <linux/file.h>
>>>> +#include <linux/fs.h>
>>>> +#include <linux/module.h>
>>>> +#include <linux/mutex.h>
>>>> +#include <linux/iommufd.h>
>>>> +#include <linux/poll.h>
>>>> +#include <linux/anon_inodes.h>
>>>> +#include <uapi/linux/iommufd.h>
>>>> +
>>>> +#include "iommufd_private.h"
>>>> +
>>>> +static int device_add_fault(struct iopf_group *group)
>>>> +{
>>>> + struct iommufd_device *idev = group->cookie->private;
>>>> + void *curr;
>>>> +
>>>> + curr = xa_cmpxchg(&idev->faults, group->last_fault.fault.prm.grpid,
>>>> + NULL, group, GFP_KERNEL);
>>>> +
>>>> + return curr ? xa_err(curr) : 0;
>>>> +}
>>>> +
>>>> +static void device_remove_fault(struct iopf_group *group)
>>>> +{
>>>> + struct iommufd_device *idev = group->cookie->private;
>>>> +
>>>> + xa_store(&idev->faults, group->last_fault.fault.prm.grpid,
>>>> + NULL, GFP_KERNEL);
>>>
>>> xa_erase ?
>>
>> Yes. Sure.
>>
>>> Is grpid OK to use this way? Doesn't it come from the originating
>>> device?
>>
>> The group ID is generated by the hardware. Here, we use it as an index
>> in the fault array to ensure it can be quickly retrieved in the page
>> fault response path.
>
> I'm nervous about this, we are trusting HW outside the kernel to
> provide unique grp id's which are integral to how the kernel
> operates..

Agreed.

>
>>>> +static ssize_t iommufd_fault_fops_read(struct file *filep, char __user *buf,
>>>> + size_t count, loff_t *ppos)
>>>> +{
>>>> + size_t fault_size = sizeof(struct iommu_hwpt_pgfault);
>>>> + struct iommufd_fault *fault = filep->private_data;
>>>> + struct iommu_hwpt_pgfault data;
>>>> + struct iommufd_device *idev;
>>>> + struct iopf_group *group;
>>>> + struct iopf_fault *iopf;
>>>> + size_t done = 0;
>>>> + int rc;
>>>> +
>>>> + if (*ppos || count % fault_size)
>>>> + return -ESPIPE;
>>>> +
>>>> + mutex_lock(&fault->mutex);
>>>> + while (!list_empty(&fault->deliver) && count > done) {
>>>> + group = list_first_entry(&fault->deliver,
>>>> + struct iopf_group, node);
>>>> +
>>>> + if (list_count_nodes(&group->faults) * fault_size > count - done)
>>>> + break;
>>>> +
>>>> + idev = (struct iommufd_device *)group->cookie->private;
>>>> + list_for_each_entry(iopf, &group->faults, list) {
>>>> + iommufd_compose_fault_message(&iopf->fault, &data, idev);
>>>> + rc = copy_to_user(buf + done, &data, fault_size);
>>>> + if (rc)
>>>> + goto err_unlock;
>>>> + done += fault_size;
>>>> + }
>>>> +
>>>> + rc = device_add_fault(group);
>>>
>>> See I wonder if this should be some xa_alloc or something instead of
>>> trying to use the grpid?
>>
>> So this magic number will be passed to user space in the fault message.
>> And the user will then include this number in its response message. The
>> response message is valid only when the magic number matches. Do I get
>> you correctly?
>
> Yes, then it is simple xa_alloc() and xa_load() without any other
> searching and we don't have to rely on the grpid to be correctly
> formed by the PCI device.
>
> But I don't know about performance xa_alloc() is pretty fast but
> trusting the grpid would be faster..
>
> IMHO from a uapi perspective we should have a definate "cookie" that
> gets echo'd back. If the kernel uses xa_alloc or grpid to build that
> cookie it doesn't matter to the uAPI.

Okay, I will head in this direction.

Best regards,
baolu


2024-03-25 18:14:51

by Baolu Lu

[permalink] [raw]
Subject: Re: [PATCH v3 4/8] iommufd: Add iommufd fault object

On 3/23/24 1:22 AM, Jason Gunthorpe wrote:
> On Wed, Mar 20, 2024 at 04:18:05PM +0000, Shameerali Kolothum Thodi wrote:
>> What I have noticed is that,
>> -read interface works fine and I can receive struct tiommu_hwpt_pgfault data.
>> -But once Guest handles the page faults and returns the page response,
>> the write to fault fd never reaches the kernel. The sequence is like below,
>>
>> sqe = io_uring_get_sqe(ring);
>> io_uring_prep_write(sqe, hwpt->fault_fd, resp, sizeof(*resp), 0);
>> io_uring_sqe_set_data(sqe, resp);
>> io_uring_submit(ring);
>> ret = io_uring_wait_cqe(ring, &cqe);
>> ....
>> Please find the function here[2]
>>
>> The above cqe wait never returns and hardware times out without receiving
>> page response. My understanding of io_uring default op is that it tries to
>> issue an sqe as non-blocking first. But it looks like the above write sequence
>> ends up in kernel poll_wait() as well.Not sure how we can avoid that for
>> write.
> Ah, right, it is because poll can't be choosy about read/write, it has
> to work equally for both directions. iommufd_fault_fops_poll() never
> returns EPOLLOUT
>
> It should just always return EPOLLOUT because we don't have any queue
> to manage.

Are you suggesting the poll file operation to be like below?

static __poll_t iommufd_fault_fops_poll(struct file *filep,
struct poll_table_struct *wait)
{
struct iommufd_fault *fault = filep->private_data;
__poll_t pollflags = EPOLLOUT;

poll_wait(filep, &fault->wait_queue, wait);
mutex_lock(&fault->mutex);
if (!list_empty(&fault->deliver))
pollflags = EPOLLIN | EPOLLRDNORM;
mutex_unlock(&fault->mutex);

return pollflags;
}

The diff is,

diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
index ede16702d433..a33f8aa92575 100644
--- a/drivers/iommu/iommufd/fault.c
+++ b/drivers/iommu/iommufd/fault.c
@@ -175,7 +175,7 @@ static __poll_t iommufd_fault_fops_poll(struct file
*filep,
struct poll_table_struct *wait)
{
struct iommufd_fault *fault = filep->private_data;
- __poll_t pollflags = 0;
+ __poll_t pollflags = EPOLLOUT;

poll_wait(filep, &fault->wait_queue, wait);
mutex_lock(&fault->mutex);


I was originally thinking that poll file operation is specifically
designed for polling on read events associated with IOMMU faults.

Best regards,
baolu