2024-04-13 03:48:25

by Nicolin Chen

Subject: [PATCH RFCv1 00/14] Add Tegra241 (Grace) CMDQV Support (part 2/2)

This is an experimental RFC series for VIOMMU infrastructure, using NVIDIA
Tegra241 (Grace) CMDQV as a test instance.

VIOMMU obj is used to represent a virtual interface (iommu) backed by an
underlying IOMMU's HW-accelerated feature for virtualization: for example,
NVIDIA's VINTF (v-interface for CMDQV) and AMD's vIOMMU.

VQUEUE obj is used to represent a virtual command queue (buffer) backed by
an underlying IOMMU command queue, passed through for VMs to use directly:
for example, NVIDIA's Virtual Command Queue and AMD's Command Buffer.

NVIDIA's CMDQV requires a pair of physical and virtual device Stream IDs
to process ATC invalidation commands issued by the ARM SMMU. So, the
set/unset_dev_id ops and a matching ioctl are introduced for VIOMMU.

Also, a passthrough queue has a pair of head and tail pointers/indexes in
the real HW registers, which should be mmap'd to user space for the
hypervisor to map into the VM's MMIO region directly. Thus, iommufd needs
an mmap op too.
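
As a rough illustration, the hypervisor side of that mmap could look like
the sketch below, where viommu_id would come from a prior vIOMMU allocation
and encodes vm_pgoff (a minimal sketch of the proposed UAPI; error handling
trimmed):

    /* Map a passthrough queue's MMIO page via the iommufd fd */
    long psz = sysconf(_SC_PAGESIZE);
    void *page = mmap(NULL, psz, PROT_READ | PROT_WRITE, MAP_SHARED,
                      iommufd, (off_t)viommu_id * psz /* vm_pgoff */);
    if (page == MAP_FAILED)
        return -errno;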

Some todos/opens:
1. Add selftest coverage for the new ioctls
2. The mmap needs a way to get the viommu_id. Currently it's taken from
vma->vm_pgoff, which might not be ideal.
3. This series is only verified with a single passthrough device that's
behind a physical ARM SMMU. So, devices behind two or more IOMMUs might
need some additional support (and verification).
4. Requesting comments from AMD folks on supporting AMD's vIOMMU feature.

This series is on Github (for review and reference only):
https://github.com/nicolinc/iommufd/commits/vcmdq_user_space-rfc-v1

Real HW tests were conducted with this QEMU branch:
https://github.com/nicolinc/qemu/commits/wip/iommufd_vcmdq/

Thanks

Nicolin Chen (14):
iommufd: Move iommufd_object to public iommufd header
iommufd: Swap _iommufd_object_alloc and __iommufd_object_alloc
iommufd: Prepare for viommu structures and functions
iommufd: Add struct iommufd_viommu and iommufd_viommu_ops
iommufd: Add IOMMUFD_OBJ_VIOMMU and IOMMUFD_CMD_VIOMMU_ALLOC
iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
iommufd: Add viommu set/unset_dev_id ops
iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl
iommufd/selftest: Add IOMMU_VIOMMU_SET_DEV_ID test coverage
iommufd/selftest: Add IOMMU_TEST_OP_MV_CHECK_DEV_ID
iommufd: Add struct iommufd_vqueue and its related viommu ops
iommufd: Add IOMMUFD_OBJ_VQUEUE and IOMMUFD_CMD_VQUEUE_ALLOC
iommufd: Add mmap infrastructure
iommu/tegra241-cmdqv: Add user-space use support

drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 ++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 19 ++
.../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 284 +++++++++++++++++-
drivers/iommu/iommufd/Makefile | 3 +-
drivers/iommu/iommufd/device.c | 11 +
drivers/iommu/iommufd/hw_pagetable.c | 4 +-
drivers/iommu/iommufd/iommufd_private.h | 71 +++--
drivers/iommu/iommufd/iommufd_test.h | 5 +
drivers/iommu/iommufd/main.c | 69 ++++-
drivers/iommu/iommufd/selftest.c | 100 ++++++
drivers/iommu/iommufd/viommu.c | 235 +++++++++++++++
include/linux/iommu.h | 16 +
include/linux/iommufd.h | 100 ++++++
include/uapi/linux/iommufd.h | 98 ++++++
tools/testing/selftests/iommu/iommufd.c | 44 +++
tools/testing/selftests/iommu/iommufd_utils.h | 71 +++++
16 files changed, 1103 insertions(+), 46 deletions(-)
create mode 100644 drivers/iommu/iommufd/viommu.c

--
2.43.0



2024-04-13 03:48:27

by Nicolin Chen

Subject: [PATCH RFCv1 03/14] iommufd: Prepare for viommu structures and functions

The following changes will introduce a new iommufd_viommu and its related
structures and functions. These new core-level structures will be embedded
in the driver-level structures, so a helper that allocates the bundle of
core and driver structures will be nice to have.

As a preparatory change, introduce a viommu.c file and put a new structure
allocator in it. Since some of the object allocation will be similar, also
move the common part to a new level-3 allocator ___iommufd_object_alloc().
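
For instance, the new structure allocator will be instantiated by a later
patch as viommu_struct_alloc(viommu), which expands to roughly the level-2
allocator below (shown only to illustrate the pattern):

    struct iommufd_viommu *_iommufd_viommu_alloc(size_t size)
    {
        struct iommufd_object *obj;

        /* size covers the driver struct that embeds the core struct */
        if (WARN_ON(size < sizeof(struct iommufd_viommu)))
            return NULL;
        obj = ___iommufd_object_alloc(size);
        if (IS_ERR(obj))
            return NULL;
        return container_of(obj, struct iommufd_viommu, obj);
    }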

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/Makefile | 3 ++-
drivers/iommu/iommufd/iommufd_private.h | 20 ++++++++++++++++++++
drivers/iommu/iommufd/main.c | 9 +++------
drivers/iommu/iommufd/viommu.c | 19 +++++++++++++++++++
4 files changed, 44 insertions(+), 7 deletions(-)
create mode 100644 drivers/iommu/iommufd/viommu.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 34b446146961..76c2bc41af40 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -6,7 +6,8 @@ iommufd-y := \
ioas.o \
main.o \
pages.o \
- vfio_compat.o
+ vfio_compat.o \
+ viommu.o

iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 3acbc67dd5f0..eccc565ed38e 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -156,6 +156,26 @@ void iommufd_object_abort_and_destroy(struct iommufd_ctx *ictx,
void iommufd_object_finalize(struct iommufd_ctx *ictx,
struct iommufd_object *obj);

+static inline struct iommufd_object *___iommufd_object_alloc(size_t size)
+{
+ struct iommufd_object *obj;
+
+ obj = kzalloc(size, GFP_KERNEL_ACCOUNT);
+ if (!obj)
+ return ERR_PTR(-ENOMEM);
+
+ /* Starts out bias'd by 1 until it is removed from the xarray */
+ refcount_set(&obj->shortterm_users, 1);
+ refcount_set(&obj->users, 1);
+
+ /*
+ * The allocation of an obj->id needs an ictx, so it has to be done
+ * after this ___iommufd_object_alloc() helper returns.
+ */
+
+ return obj;
+}
+
enum {
REMOVE_WAIT_SHORTTERM = 1,
};
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index a51ab766e183..5187942b375d 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -36,13 +36,10 @@ struct iommufd_object *__iommufd_object_alloc(struct iommufd_ctx *ictx,
struct iommufd_object *obj;
int rc;

- obj = kzalloc(size, GFP_KERNEL_ACCOUNT);
- if (!obj)
- return ERR_PTR(-ENOMEM);
+ obj = ___iommufd_object_alloc(size);
+ if (IS_ERR(obj))
+ return obj;
obj->type = type;
- /* Starts out bias'd by 1 until it is removed from the xarray */
- refcount_set(&obj->shortterm_users, 1);
- refcount_set(&obj->users, 1);

/*
* Reserve an ID in the xarray but do not publish the pointer yet since
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
new file mode 100644
index 000000000000..f77d6972d552
--- /dev/null
+++ b/drivers/iommu/iommufd/viommu.c
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES
+ */
+
+#include <linux/iommufd.h>
+
+#include "iommufd_private.h"
+
+#define viommu_struct_alloc(name) \
+ struct iommufd_##name *_iommufd_##name##_alloc(size_t size) \
+ { \
+ struct iommufd_object *obj; \
+ if (WARN_ON(size < sizeof(struct iommufd_##name))) \
+ return NULL; \
+ obj = ___iommufd_object_alloc(size); \
+ if (IS_ERR(obj)) \
+ return NULL; \
+ return container_of(obj, struct iommufd_##name, obj); \
+ }
--
2.43.0


2024-04-13 03:49:15

by Nicolin Chen

Subject: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

Add a new iommufd_viommu core structure to represent a vIOMMU instance in
the user space, typically backed by a HW-accelerated feature of an IOMMU,
e.g. NVIDIA CMDQ-Virtualization (an ARM SMMUv3 extension) and AMD Hardware
Accelerated Virtualized IOMMU (vIOMMU).

Define a new iommufd_viommu_ops to hold the free op and any future op.

A driver should embed this core structure in its driver viommu structure
and call the new iommufd_viommu_alloc() helper to allocate a core/driver
structure bundle and fill its core viommu->ops:
struct my_driver_viommu {
struct iommufd_viommu core;
....
};

static const struct iommufd_viommu_ops my_driver_viommu_ops = {
.free = my_driver_viommu_free,
};

struct my_driver_viommu *my_viommu =
iommufd_viommu_alloc(my_driver_viommu, core);
my_viommu->core.ops = &my_driver_viommu_ops;

Lastly, add viommu_alloc to iommu_ops so iommufd can link to the driver to
allocate a viommu object.

Any future viommu op will be added to the new struct iommufd_viommu_ops.

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/viommu.c | 2 ++
include/linux/iommu.h | 15 ++++++++++++++
include/linux/iommufd.h | 38 ++++++++++++++++++++++++++++++++++
3 files changed, 55 insertions(+)

diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index f77d6972d552..3886b1dd1f13 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -17,3 +17,5 @@
return NULL; \
return container_of(obj, struct iommufd_##name, obj); \
}
+
+viommu_struct_alloc(viommu);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 2e925b5eba53..4d4461146288 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -145,6 +145,8 @@ struct iopf_queue {
struct mutex lock;
};

+struct iommufd_viommu;
+
/* iommu fault flags */
#define IOMMU_FAULT_READ 0x0
#define IOMMU_FAULT_WRITE 0x1
@@ -527,6 +529,14 @@ static inline int __iommu_copy_struct_from_user_array(
* @of_xlate: add OF master IDs to iommu grouping
* @is_attach_deferred: Check if domain attach should be deferred from iommu
* driver init to device driver init (default no)
+ * @viommu_alloc: Allocate an iommufd_viommu as a user space IOMMU instance,
+ * associated to a nested parent @domain, for iommu-specific
+ * hardware acceleration. The @viommu_type must be defined in
+ * the include/uapi/linux/iommufd.h header.
+ * It is suggested to call iommufd_viommu_alloc() helper for
+ * a bundled allocation of the core and the driver structure,
* and a driver in general should assign an iommufd_viommu_ops
+ * to the core structure's viommu->ops.
* @dev_enable/disable_feat: per device entries to enable/disable
* iommu specific features.
* @page_response: handle page request response
@@ -570,6 +580,11 @@ struct iommu_ops {
int (*of_xlate)(struct device *dev, const struct of_phandle_args *args);
bool (*is_attach_deferred)(struct device *dev);

+ /* User space instance allocation by the iommu driver */
+ struct iommufd_viommu *(*viommu_alloc)(struct device *dev,
+ unsigned int viommu_type,
+ struct iommu_domain *domain);
+
/* Per device IOMMU features */
int (*dev_enable_feat)(struct device *dev, enum iommu_dev_features f);
int (*dev_disable_feat)(struct device *dev, enum iommu_dev_features f);
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index a0cb08a4b653..650acfac307a 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -16,6 +16,7 @@ struct iommufd_device;
struct page;
struct iommufd_ctx;
struct iommufd_access;
+struct iommufd_hwpt_paging;
struct file;
struct iommu_group;

@@ -77,6 +78,26 @@ void iommufd_access_detach(struct iommufd_access *access);

void iommufd_ctx_get(struct iommufd_ctx *ictx);

+struct iommufd_viommu {
+ struct iommufd_object obj;
+ struct iommufd_ctx *ictx;
+ struct iommu_device *iommu_dev;
+ struct iommufd_hwpt_paging *hwpt;
+
+ const struct iommufd_viommu_ops *ops;
+
+ unsigned int type;
+};
+
+/**
+ * struct iommufd_viommu_ops - viommu specific operations
+ * @free: Free all driver-specific parts of an iommufd_viommu. The memory
+ * of the entire viommu will be freed by the iommufd core
+ */
+struct iommufd_viommu_ops {
+ void (*free)(struct iommufd_viommu *viommu);
+};
+
#if IS_ENABLED(CONFIG_IOMMUFD)
struct iommufd_ctx *iommufd_ctx_from_file(struct file *file);
struct iommufd_ctx *iommufd_ctx_from_fd(int fd);
@@ -93,6 +114,8 @@ int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
int iommufd_vfio_compat_ioas_get_id(struct iommufd_ctx *ictx, u32 *out_ioas_id);
int iommufd_vfio_compat_ioas_create(struct iommufd_ctx *ictx);
int iommufd_vfio_compat_set_no_iommu(struct iommufd_ctx *ictx);
+
+struct iommufd_viommu *_iommufd_viommu_alloc(size_t size);
#else /* !CONFIG_IOMMUFD */
static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
{
@@ -133,5 +156,20 @@ static inline int iommufd_vfio_compat_set_no_iommu(struct iommufd_ctx *ictx)
{
return -EOPNOTSUPP;
}
+
+static inline struct iommufd_viommu *_iommufd_viommu_alloc(size_t size)
+{
+ return NULL;
+}
#endif /* CONFIG_IOMMUFD */
+
+/*
+ * Helpers for IOMMU driver to allocate driver structures that will be freed by
+ * the iommufd core. A driver is only responsible for its own struct cleanup.
+ */
+#define iommufd_viommu_alloc(drv_struct, member) \
+ container_of(_iommufd_viommu_alloc(sizeof(struct drv_struct) + \
+ BUILD_BUG_ON_ZERO(offsetof( \
+ struct drv_struct, member))), \
+ struct drv_struct, member)
#endif
--
2.43.0


2024-04-13 03:49:25

by Nicolin Chen

Subject: [PATCH RFCv1 05/14] iommufd: Add IOMMUFD_OBJ_VIOMMU and IOMMUFD_CMD_VIOMMU_ALLOC

Corresponding to the new iommufd_viommu core structure that represents a
vIOMMU instance in the user space for HW-accelerated features, add a new
IOMMUFD_OBJ_VIOMMU and its ioctl for user space to allocate it.
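
For example, user space would allocate a vIOMMU object roughly as below,
where iommufd, dev_id and hwpt_id (a nested parent HWPT) are assumed to
come from earlier IOMMUFD calls (a sketch; error handling trimmed):

    struct iommu_viommu_alloc cmd = {
        .size = sizeof(cmd),
        .type = IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
        .dev_id = dev_id,   /* device backing the vIOMMU */
        .hwpt_id = hwpt_id, /* must be a nested parent HWPT */
    };

    if (ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &cmd))
        return -errno;
    viommu_id = cmd.out_viommu_id;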

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/iommufd_private.h | 3 +
drivers/iommu/iommufd/main.c | 6 ++
drivers/iommu/iommufd/viommu.c | 83 +++++++++++++++++++++++++
include/linux/iommufd.h | 1 +
include/uapi/linux/iommufd.h | 30 +++++++++
5 files changed, 123 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index eccc565ed38e..ae90b4493109 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -424,6 +424,9 @@ void iopt_remove_access(struct io_pagetable *iopt,
u32 iopt_access_list_id);
void iommufd_access_destroy_object(struct iommufd_object *obj);

+int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd);
+void iommufd_viommu_destroy(struct iommufd_object *obj);
+
#ifdef CONFIG_IOMMUFD_TEST
int iommufd_test(struct iommufd_ucmd *ucmd);
void iommufd_selftest_destroy(struct iommufd_object *obj);
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 5187942b375d..9de7e3e63ce4 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -323,6 +323,7 @@ union ucmd_buffer {
struct iommu_hwpt_set_dirty_tracking set_dirty_tracking;
struct iommu_ioas_alloc alloc;
struct iommu_ioas_allow_iovas allow_iovas;
+ struct iommu_viommu_alloc viommu;
struct iommu_ioas_copy ioas_copy;
struct iommu_ioas_iova_ranges iova_ranges;
struct iommu_ioas_map map;
@@ -378,6 +379,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
val64),
IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
__reserved),
+ IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
+ struct iommu_viommu_alloc, out_viommu_id),
#ifdef CONFIG_IOMMUFD_TEST
IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
#endif
@@ -510,6 +513,9 @@ static const struct iommufd_object_ops iommufd_object_ops[] = {
.destroy = iommufd_hwpt_nested_destroy,
.abort = iommufd_hwpt_nested_abort,
},
+ [IOMMUFD_OBJ_VIOMMU] = {
+ .destroy = iommufd_viommu_destroy,
+ },
#ifdef CONFIG_IOMMUFD_TEST
[IOMMUFD_OBJ_SELFTEST] = {
.destroy = iommufd_selftest_destroy,
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index 3886b1dd1f13..079e0ff79942 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -19,3 +19,86 @@
}

viommu_struct_alloc(viommu);
+
+void iommufd_viommu_destroy(struct iommufd_object *obj)
+{
+ struct iommufd_viommu *viommu =
+ container_of(obj, struct iommufd_viommu, obj);
+
+ if (viommu->ops && viommu->ops->free)
+ viommu->ops->free(viommu);
+ refcount_dec(&viommu->hwpt->common.obj.users);
+}
+
+int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
+{
+ struct iommu_viommu_alloc *cmd = ucmd->cmd;
+ struct iommufd_hwpt_paging *hwpt_paging;
+ struct iommu_device *iommu_dev;
+ struct iommufd_viommu *viommu;
+ struct iommufd_device *idev;
+ int rc;
+
+ if (cmd->flags)
+ return -EOPNOTSUPP;
+
+ idev = iommufd_get_device(ucmd, cmd->dev_id);
+ if (IS_ERR(idev))
+ return PTR_ERR(idev);
+ iommu_dev = idev->dev->iommu->iommu_dev;
+
+ if (!iommu_dev->ops->viommu_alloc) {
+ rc = -EOPNOTSUPP;
+ goto out_put_idev;
+ }
+
+ hwpt_paging = iommufd_get_hwpt_paging(ucmd, cmd->hwpt_id);
+ if (IS_ERR(hwpt_paging)) {
+ rc = PTR_ERR(hwpt_paging);
+ goto out_put_idev;
+ }
+
+ if (!hwpt_paging->nest_parent) {
+ rc = -EINVAL;
+ goto out_put_hwpt;
+ }
+
+ viommu = iommu_dev->ops->viommu_alloc(idev->dev, cmd->type,
+ hwpt_paging->common.domain);
+ if (IS_ERR(viommu)) {
+ rc = PTR_ERR(viommu);
+ goto out_put_hwpt;
+ }
+
+ /* iommufd_object_finalize will store the viommu->obj.id */
+ rc = xa_alloc(&ucmd->ictx->objects, &viommu->obj.id, XA_ZERO_ENTRY,
+ xa_limit_31b, GFP_KERNEL_ACCOUNT);
+ if (rc)
+ goto out_free;
+
+ viommu->obj.type = IOMMUFD_OBJ_VIOMMU;
+ viommu->type = cmd->type;
+
+ viommu->ictx = ucmd->ictx;
+ viommu->hwpt = hwpt_paging;
+ viommu->iommu_dev = idev->dev->iommu->iommu_dev;
+ cmd->out_viommu_id = viommu->obj.id;
+ rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+ if (rc)
+ goto out_erase_xa;
+ iommufd_object_finalize(ucmd->ictx, &viommu->obj);
+ refcount_inc(&viommu->hwpt->common.obj.users);
+ goto out_put_hwpt;
+
+out_erase_xa:
+ xa_erase(&ucmd->ictx->objects, viommu->obj.id);
+out_free:
+ if (viommu->ops && viommu->ops->free)
+ viommu->ops->free(viommu);
+ kfree(viommu);
+out_put_hwpt:
+ iommufd_put_object(ucmd->ictx, &hwpt_paging->common.obj);
+out_put_idev:
+ iommufd_put_object(ucmd->ictx, &idev->obj);
+ return rc;
+}
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 650acfac307a..dec10c6bb261 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -28,6 +28,7 @@ enum iommufd_object_type {
IOMMUFD_OBJ_HWPT_NESTED,
IOMMUFD_OBJ_IOAS,
IOMMUFD_OBJ_ACCESS,
+ IOMMUFD_OBJ_VIOMMU,
#ifdef CONFIG_IOMMUFD_TEST
IOMMUFD_OBJ_SELFTEST,
#endif
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 1dfeaa2e649e..2b0825d69846 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -50,6 +50,7 @@ enum {
IOMMUFD_CMD_HWPT_SET_DIRTY_TRACKING,
IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP,
IOMMUFD_CMD_HWPT_INVALIDATE,
+ IOMMUFD_CMD_VIOMMU_ALLOC,
};

/**
@@ -692,4 +693,33 @@ struct iommu_hwpt_invalidate {
__u32 __reserved;
};
#define IOMMU_HWPT_INVALIDATE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_INVALIDATE)
+
+/**
+ * enum iommu_viommu_type - VIOMMU Type
+ * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension for SMMUv3
+ */
+enum iommu_viommu_type {
+ IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
+};
+
+/**
+ * struct iommu_viommu_alloc - ioctl(IOMMU_VIOMMU_ALLOC)
+ * @size: sizeof(struct iommu_viommu_alloc)
+ * @flags: Must be 0
+ * @type: Type of the VIOMMU object. Must be defined in enum iommu_viommu_type
+ * @dev_id: The device to allocate this virtual IOMMU for
+ * @hwpt_id: ID of a nested parent HWPT
+ * @out_viommu_id: Output virtual IOMMU ID for the allocated object
+ *
+ * Allocate a virtual IOMMU object that holds a (shared) nested parent HWPT
+ */
+struct iommu_viommu_alloc {
+ __u32 size;
+ __u32 flags;
+ __u32 type;
+ __u32 dev_id;
+ __u32 hwpt_id;
+ __u32 out_viommu_id;
+};
+#define IOMMU_VIOMMU_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_ALLOC)
#endif
--
2.43.0


2024-04-13 03:49:39

by Nicolin Chen

Subject: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

Introduce a new ioctl to set a per-viommu device virtual id that should be
linked to the physical device id (or just a struct device pointer).

Since a viommu (user space IOMMU instance) can have multiple devices, while
it's also not ideal to confine a device to a single user space IOMMU
instance, the relationship shouldn't be a plain 1:1 mapping. Add two
xarrays, one in each structure, to bind them bidirectionally.
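
For example, a VMM would link a device to its virtual ID roughly as below,
where dev_id is the device's iommufd ID and virtual_id is whatever the VM
sees, e.g. a virtual Stream ID (a sketch; error handling trimmed):

    struct iommu_viommu_set_dev_id cmd = {
        .size = sizeof(cmd),
        .viommu_id = viommu_id,
        .dev_id = dev_id,    /* physical device's iommufd ID */
        .id = virtual_id,    /* e.g. the vSID seen by the guest */
    };

    if (ioctl(iommufd, IOMMU_VIOMMU_SET_DEV_ID, &cmd))
        return -errno;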

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/device.c | 11 +++++
drivers/iommu/iommufd/iommufd_private.h | 10 +++++
drivers/iommu/iommufd/main.c | 2 +
drivers/iommu/iommufd/viommu.c | 58 +++++++++++++++++++++++++
include/linux/iommufd.h | 2 +
include/uapi/linux/iommufd.h | 20 +++++++++
6 files changed, 103 insertions(+)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 873630c111c1..68086f3074b6 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -135,7 +135,16 @@ void iommufd_device_destroy(struct iommufd_object *obj)
{
struct iommufd_device *idev =
container_of(obj, struct iommufd_device, obj);
+ struct iommufd_viommu *viommu;
+ unsigned long index;

+ xa_for_each(&idev->viommus, index, viommu) {
+ if (viommu->ops->unset_dev_id)
+ viommu->ops->unset_dev_id(viommu, idev->dev);
+ xa_erase(&viommu->idevs, idev->obj.id);
+ xa_erase(&idev->viommus, index);
+ }
+ xa_destroy(&idev->viommus);
iommu_device_release_dma_owner(idev->dev);
iommufd_put_group(idev->igroup);
if (!iommufd_selftest_is_mock_dev(idev->dev))
@@ -216,6 +225,8 @@ struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
/* igroup refcount moves into iommufd_device */
idev->igroup = igroup;

+ xa_init_flags(&idev->viommus, XA_FLAGS_ALLOC1);
+
/*
* If the caller fails after this success it must call
* iommufd_unbind_device() which is safe since we hold this refcount.
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index ae90b4493109..9ba8f4ecc221 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -392,6 +392,7 @@ struct iommufd_device {
struct list_head group_item;
/* always the physical device */
struct device *dev;
+ struct xarray viommus;
bool enforce_cache_coherency;
};

@@ -424,8 +425,17 @@ void iopt_remove_access(struct io_pagetable *iopt,
u32 iopt_access_list_id);
void iommufd_access_destroy_object(struct iommufd_object *obj);

+static inline struct iommufd_viommu *
+iommufd_get_viommu(struct iommufd_ucmd *ucmd, u32 id)
+{
+ return container_of(iommufd_get_object(ucmd->ictx, id,
+ IOMMUFD_OBJ_VIOMMU),
+ struct iommufd_viommu, obj);
+}
+
int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd);
void iommufd_viommu_destroy(struct iommufd_object *obj);
+int iommufd_viommu_set_device_id(struct iommufd_ucmd *ucmd);

#ifdef CONFIG_IOMMUFD_TEST
int iommufd_test(struct iommufd_ucmd *ucmd);
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 9de7e3e63ce4..16efc3346a2a 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -381,6 +381,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
__reserved),
IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
struct iommu_viommu_alloc, out_viommu_id),
+ IOCTL_OP(IOMMU_VIOMMU_SET_DEV_ID, iommufd_viommu_set_device_id,
+ struct iommu_viommu_set_dev_id, id),
#ifdef CONFIG_IOMMUFD_TEST
IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
#endif
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index 079e0ff79942..71baca0c75de 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -69,6 +69,7 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
rc = PTR_ERR(viommu);
goto out_put_hwpt;
}
+ xa_init_flags(&viommu->idevs, XA_FLAGS_ALLOC1);

/* iommufd_object_finalize will store the viommu->obj.id */
rc = xa_alloc(&ucmd->ictx->objects, &viommu->obj.id, XA_ZERO_ENTRY,
@@ -102,3 +103,60 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
iommufd_put_object(ucmd->ictx, &idev->obj);
return rc;
}
+
+int iommufd_viommu_set_device_id(struct iommufd_ucmd *ucmd)
+{
+ struct iommu_viommu_set_dev_id *cmd = ucmd->cmd;
+ unsigned int dev_id, viommu_id;
+ struct iommufd_viommu *viommu;
+ struct iommufd_device *idev;
+ int rc;
+
+ idev = iommufd_get_device(ucmd, cmd->dev_id);
+ if (IS_ERR(idev))
+ return PTR_ERR(idev);
+ dev_id = idev->obj.id;
+
+ viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
+ if (IS_ERR(viommu)) {
+ rc = PTR_ERR(viommu);
+ goto out_put_idev;
+ }
+
+ if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) {
+ rc = -EINVAL;
+ goto out_put_viommu;
+ }
+
+ if (!viommu->ops->set_dev_id || !viommu->ops->unset_dev_id) {
+ rc = -EOPNOTSUPP;
+ goto out_put_viommu;
+ }
+
+ rc = xa_alloc(&idev->viommus, &viommu_id, viommu,
+ XA_LIMIT(viommu->obj.id, viommu->obj.id),
+ GFP_KERNEL_ACCOUNT);
+ if (rc)
+ goto out_put_viommu;
+
+ rc = xa_alloc(&viommu->idevs, &dev_id, idev,
+ XA_LIMIT(dev_id, dev_id), GFP_KERNEL_ACCOUNT);
+ if (rc)
+ goto out_xa_erase_viommu;
+
+ rc = viommu->ops->set_dev_id(viommu, idev->dev, cmd->id);
+ if (rc)
+ goto out_xa_erase_idev;
+
+ goto out_put_viommu;
+
+out_xa_erase_idev:
+ xa_erase(&viommu->idevs, idev->obj.id);
+out_xa_erase_viommu:
+ xa_erase(&idev->viommus, viommu->obj.id);
+out_put_viommu:
+ iommufd_put_object(ucmd->ictx, &viommu->obj);
+out_put_idev:
+ iommufd_put_object(ucmd->ictx, &idev->obj);
+ return rc;
+}
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index ca6ac8a1ffd0..2be302b82f47 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -10,6 +10,7 @@
#include <linux/errno.h>
#include <linux/err.h>
#include <linux/refcount.h>
+#include <linux/xarray.h>

struct device;
struct iommufd_device;
@@ -88,6 +89,7 @@ struct iommufd_viommu {
const struct iommufd_viommu_ops *ops;

unsigned int type;
+ struct xarray idevs;
};

/**
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2b0825d69846..eaa192de63d3 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -51,6 +51,7 @@ enum {
IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP,
IOMMUFD_CMD_HWPT_INVALIDATE,
IOMMUFD_CMD_VIOMMU_ALLOC,
+ IOMMUFD_CMD_VIOMMU_SET_DEV_ID,
};

/**
@@ -722,4 +723,23 @@ struct iommu_viommu_alloc {
__u32 out_viommu_id;
};
#define IOMMU_VIOMMU_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_ALLOC)
+
+/**
+ * struct iommu_viommu_set_dev_id - ioctl(IOMMU_VIOMMU_SET_DEV_ID)
+ * @size: sizeof(struct iommu_viommu_set_dev_id)
+ * @viommu_id: viommu ID to associate with the device to store its virtual ID
+ * @dev_id: device ID to set a device virtual ID
+ * @__reserved: Must be 0
+ * @id: Device virtual ID
+ *
+ * Set a viommu-specific virtual ID of a device
+ */
+struct iommu_viommu_set_dev_id {
+ __u32 size;
+ __u32 viommu_id;
+ __u32 dev_id;
+ __u32 __reserved;
+ __aligned_u64 id;
+};
+#define IOMMU_VIOMMU_SET_DEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_SET_DEV_ID)
#endif
--
2.43.0


2024-04-13 03:49:41

by Nicolin Chen

Subject: [PATCH RFCv1 10/14] iommufd/selftest: Add IOMMU_TEST_OP_MV_CHECK_DEV_ID

This allows verifying the ids xarray of struct mock_viommu, and confirming
the correctness of an IOMMU_VIOMMU_SET_DEV_ID call.

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/iommufd_test.h | 5 +++
drivers/iommu/iommufd/selftest.c | 32 +++++++++++++++++++
tools/testing/selftests/iommu/iommufd.c | 7 ++++
tools/testing/selftests/iommu/iommufd_utils.h | 24 ++++++++++++++
4 files changed, 68 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index e854d3f67205..91af979a0c23 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -22,6 +22,7 @@ enum {
IOMMU_TEST_OP_MOCK_DOMAIN_FLAGS,
IOMMU_TEST_OP_DIRTY,
IOMMU_TEST_OP_MD_CHECK_IOTLB,
+ IOMMU_TEST_OP_MV_CHECK_DEVID,
};

enum {
@@ -127,6 +128,10 @@ struct iommu_test_cmd {
__u32 id;
__u32 iotlb;
} check_iotlb;
+ struct {
+ __u32 idev_id;
+ __u32 dev_id;
+ } check_dev_id;
};
__u32 last;
};
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 4caed9304065..b7d0ce3d3659 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -999,6 +999,34 @@ static int iommufd_test_md_check_iotlb(struct iommufd_ucmd *ucmd,
return rc;
}

+static int iommufd_test_mv_check_dev_id(struct iommufd_ucmd *ucmd,
+ u32 viommu_id, unsigned int idev_id,
+ u32 dev_id)
+{
+ struct iommufd_device *idev;
+ struct mock_viommu *mv;
+ int rc = 0;
+
+ mv = container_of(iommufd_get_viommu(ucmd, viommu_id),
+ struct mock_viommu, core);
+ if (IS_ERR(mv))
+ return PTR_ERR(mv);
+
+ idev = iommufd_get_device(ucmd, idev_id);
+ if (IS_ERR(idev)) {
+ rc = PTR_ERR(idev);
+ goto out_put_viommu;
+ }
+
+ if (idev->dev != xa_load(&mv->ids, dev_id))
+ rc = -EINVAL;
+
+ iommufd_put_object(ucmd->ictx, &idev->obj);
+out_put_viommu:
+ iommufd_put_object(ucmd->ictx, &mv->core.obj);
+ return rc;
+}
+
struct selftest_access {
struct iommufd_access *access;
struct file *file;
@@ -1484,6 +1512,10 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
return iommufd_test_md_check_iotlb(ucmd, cmd->id,
cmd->check_iotlb.id,
cmd->check_iotlb.iotlb);
+ case IOMMU_TEST_OP_MV_CHECK_DEVID:
+ return iommufd_test_mv_check_dev_id(ucmd, cmd->id,
+ cmd->check_dev_id.idev_id,
+ cmd->check_dev_id.dev_id);
case IOMMU_TEST_OP_CREATE_ACCESS:
return iommufd_test_create_access(ucmd, cmd->id,
cmd->create_access.flags);
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 378fbf00730e..3af5932c227c 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -288,14 +288,21 @@ TEST_F(iommufd_ioas, viommu)
test_err_viommu_set_dev_id(EINVAL, dev_id, viommu_id, 1ULL << 32);
test_cmd_viommu_set_dev_id(dev_id, viommu_id, 0x99);
test_err_viommu_set_dev_id(EBUSY, dev_id, viommu_id, 0x99);
+ test_cmd_viommu_check_dev_id(viommu_id, dev_id, 0x99);

test_cmd_mock_domain(self->ioas_id, &stdev2, &hwpt2, &device2);
test_err_viommu_set_dev_id(EBUSY, device2, viommu_id, 0x99);
test_cmd_viommu_set_dev_id(device2, viommu_id, 0xaa);
test_err_viommu_set_dev_id(EBUSY, device2, viommu_id, 0xaa);
+ test_cmd_viommu_check_dev_id(viommu_id, dev_id, 0x99);
+ test_cmd_viommu_check_dev_id(viommu_id, device2, 0xaa);
+
test_ioctl_destroy(stdev2);
+ test_cmd_viommu_check_dev_id(viommu_id, dev_id, 0x99);
+ test_err_viommu_check_dev_id(ENOENT, viommu_id, device2, 0xaa);

test_ioctl_destroy(viommu_id);
+ test_err_viommu_check_dev_id(ENOENT, viommu_id, dev_id, 0x99);
test_ioctl_destroy(hwpt_id);
} else {
test_err_viommu_alloc(ENOENT, dev_id, hwpt_id, &viommu_id);
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 81e9184fd1d5..e926fa289baa 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -731,3 +731,27 @@ static int _test_cmd_viommu_set_dev_id(int fd, __u32 device_id,
EXPECT_ERRNO(_errno, \
_test_cmd_viommu_set_dev_id(self->fd, device_id, \
viommu_id, virtual_id))
+
+static int _test_cmd_viommu_check_dev_id(int fd, __u32 viommu_id,
+ __u32 device_id, __u64 virtual_id)
+{
+ struct iommu_test_cmd test_cmd = {
+ .size = sizeof(test_cmd),
+ .op = IOMMU_TEST_OP_MV_CHECK_DEVID,
+ .id = viommu_id,
+ .check_dev_id = {
+ .idev_id = device_id,
+ .dev_id = virtual_id,
+ },
+ };
+ return ioctl(fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MV_CHECK_DEVID),
+ &test_cmd);
+}
+#define test_cmd_viommu_check_dev_id(viommu_id, device_id, expected) \
+ ASSERT_EQ(0, _test_cmd_viommu_check_dev_id(self->fd, viommu_id, \
+ device_id, expected))
+
+#define test_err_viommu_check_dev_id(_errno, viommu_id, device_id, expected) \
+ EXPECT_ERRNO(_errno, \
+ _test_cmd_viommu_check_dev_id(self->fd, viommu_id, \
+ device_id, expected))
--
2.43.0


2024-04-13 03:49:58

by Nicolin Chen

Subject: [PATCH RFCv1 02/14] iommufd: Swap _iommufd_object_alloc and __iommufd_object_alloc

Currently, the object allocation function calls:
level-0: iommufd_object_alloc()
level-1: ___iommufd_object_alloc()
level-2: _iommufd_object_alloc()

So the level-1 and level-2 look inverted.

As a following change will add another level-3 helper, make the hierarchy clear:
level-0: iommufd_object_alloc()
level-1: _iommufd_object_alloc()
level-2: __iommufd_object_alloc()
level-3: ___iommufd_object_alloc()

Swap the names of the level-1 and level-2 functions.

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/hw_pagetable.c | 4 ++--
drivers/iommu/iommufd/iommufd_private.h | 12 ++++++------
drivers/iommu/iommufd/main.c | 6 +++---
3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 33d142f8057d..111b8154cce8 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -115,7 +115,7 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
if (flags & ~valid_flags)
return ERR_PTR(-EOPNOTSUPP);

- hwpt_paging = __iommufd_object_alloc(
+ hwpt_paging = _iommufd_object_alloc(
ictx, hwpt_paging, IOMMUFD_OBJ_HWPT_PAGING, common.obj);
if (IS_ERR(hwpt_paging))
return ERR_CAST(hwpt_paging);
@@ -218,7 +218,7 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
if (parent->auto_domain || !parent->nest_parent)
return ERR_PTR(-EINVAL);

- hwpt_nested = __iommufd_object_alloc(
+ hwpt_nested = _iommufd_object_alloc(
ictx, hwpt_nested, IOMMUFD_OBJ_HWPT_NESTED, common.obj);
if (IS_ERR(hwpt_nested))
return ERR_CAST(hwpt_nested);
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 3ea0a093ee50..3acbc67dd5f0 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -200,12 +200,12 @@ iommufd_object_put_and_try_destroy(struct iommufd_ctx *ictx,
iommufd_object_remove(ictx, obj, obj->id, 0);
}

-struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
- size_t size,
- enum iommufd_object_type type);
+struct iommufd_object *__iommufd_object_alloc(struct iommufd_ctx *ictx,
+ size_t size,
+ enum iommufd_object_type type);

-#define __iommufd_object_alloc(ictx, ptr, type, obj) \
- container_of(_iommufd_object_alloc( \
+#define _iommufd_object_alloc(ictx, ptr, type, obj) \
+ container_of(__iommufd_object_alloc( \
ictx, \
sizeof(*(ptr)) + BUILD_BUG_ON_ZERO( \
offsetof(typeof(*(ptr)), \
@@ -214,7 +214,7 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
typeof(*(ptr)), obj)

#define iommufd_object_alloc(ictx, ptr, type) \
- __iommufd_object_alloc(ictx, ptr, type, obj)
+ _iommufd_object_alloc(ictx, ptr, type, obj)

/*
* The IO Address Space (IOAS) pagetable is a virtual page table backed by the
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 39b32932c61e..a51ab766e183 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -29,9 +29,9 @@ struct iommufd_object_ops {
static const struct iommufd_object_ops iommufd_object_ops[];
static struct miscdevice vfio_misc_dev;

-struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
- size_t size,
- enum iommufd_object_type type)
+struct iommufd_object *__iommufd_object_alloc(struct iommufd_ctx *ictx,
+ size_t size,
+ enum iommufd_object_type type)
{
struct iommufd_object *obj;
int rc;
--
2.43.0


2024-04-13 03:50:33

by Nicolin Chen

Subject: [PATCH RFCv1 11/14] iommufd: Add struct iommufd_vqueue and its related viommu ops

Introduce a new core structure and its allocator iommufd_vqueue_alloc().

This can be used for a viommu to allocate a HW-accelerated queue, e.g.
NVIDIA's virtual command queue and AMD vIOMMU's command buffer.

Also add a pair of viommu ops for iommufd to forward user space ioctls
to IOMMU drivers.
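
Mirroring the viommu example in an earlier patch, a driver would embed the
core structure and implement the new op along these lines (my_driver_vqueue
and its setup details are hypothetical):

    struct my_driver_vqueue {
        struct iommufd_vqueue core;
        ....
    };

    static struct iommufd_vqueue *
    my_vqueue_alloc(struct iommufd_viommu *viommu,
                    const struct iommu_user_data *user_data)
    {
        struct my_driver_vqueue *my_vqueue =
            iommufd_vqueue_alloc(my_driver_vqueue, core);

        if (!my_vqueue)
            return ERR_PTR(-ENOMEM);
        /* ... copy user_data and set up the HW queue ... */
        return &my_vqueue->core;
    }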

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/viommu.c | 1 +
include/linux/iommu.h | 1 +
include/linux/iommufd.h | 27 +++++++++++++++++++++++++++
3 files changed, 29 insertions(+)

diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index 71baca0c75de..8894878c1e73 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -19,6 +19,7 @@
}

viommu_struct_alloc(viommu);
+viommu_struct_alloc(vqueue);

void iommufd_viommu_destroy(struct iommufd_object *obj)
{
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 4d4461146288..475f41f2d41f 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -146,6 +146,7 @@ struct iopf_queue {
};

struct iommufd_viommu;
+struct iommufd_vqueue;

/* iommu fault flags */
#define IOMMU_FAULT_READ 0x0
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 2be302b82f47..5b97b04aa145 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -20,6 +20,7 @@ struct iommufd_access;
struct iommufd_hwpt_paging;
struct file;
struct iommu_group;
+struct iommu_user_data;

enum iommufd_object_type {
IOMMUFD_OBJ_NONE,
@@ -97,12 +98,27 @@ struct iommufd_viommu {
* @free: Free all driver-specific parts of an iommufd_viommu. The memory
* of the entire viommu will be freed by the iommufd core
* @set/unset_dev_id: set/unset a user space virtual id for a device
+ * @vqueue_alloc: Allocate an iommufd_vqueue as a user space command queue for a
+ * @viommu instance. Queue specific @user_data must be defined in
+ * the include/uapi/linux/iommufd.h header.
+ * @vqueue_free: Free all driver-specific parts of an iommufd_vqueue. The memory
* of the iommufd_vqueue will be freed by the iommufd core
*/
struct iommufd_viommu_ops {
void (*free)(struct iommufd_viommu *viommu);
int (*set_dev_id)(struct iommufd_viommu *viommu,
struct device *dev, u64 dev_id);
void (*unset_dev_id)(struct iommufd_viommu *viommu, struct device *dev);
+ struct iommufd_vqueue *(*vqueue_alloc)(
+ struct iommufd_viommu *viommu,
+ const struct iommu_user_data *user_data);
+ void (*vqueue_free)(struct iommufd_vqueue *vqueue);
+};
+
+struct iommufd_vqueue {
+ struct iommufd_object obj;
+ struct iommufd_ctx *ictx;
+ struct iommufd_viommu *viommu;
};

#if IS_ENABLED(CONFIG_IOMMUFD)
@@ -123,6 +139,7 @@ int iommufd_vfio_compat_ioas_create(struct iommufd_ctx *ictx);
int iommufd_vfio_compat_set_no_iommu(struct iommufd_ctx *ictx);

struct iommufd_viommu *_iommufd_viommu_alloc(size_t size);
+struct iommufd_vqueue *_iommufd_vqueue_alloc(size_t size);
#else /* !CONFIG_IOMMUFD */
static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
{
@@ -168,6 +185,11 @@ static inline struct iommufd_viommu *_iommufd_viommu_alloc(size_t size)
{
return NULL;
}
+
+static inline struct iommufd_vqueue *_iommufd_vqueue_alloc(size_t size)
+{
+ return NULL;
+}
#endif /* CONFIG_IOMMUFD */

/*
@@ -179,4 +201,9 @@ static inline struct iommufd_viommu *_iommufd_viommu_alloc(size_t size)
BUILD_BUG_ON_ZERO(offsetof( \
struct drv_struct, member))), \
struct drv_struct, member)
+#define iommufd_vqueue_alloc(drv_struct, member) \
+ container_of(_iommufd_vqueue_alloc(sizeof(struct drv_struct) + \
+ BUILD_BUG_ON_ZERO(offsetof( \
+ struct drv_struct, member))), \
+ struct drv_struct, member)
#endif
--
2.43.0


2024-04-13 03:50:42

by Nicolin Chen

Subject: [PATCH RFCv1 13/14] iommufd: Add mmap infrastructure

Add an mmap infrastructure for sharing kernel pages with user space. This
allows passing through HW resources (VCMDQ MMIO pages, for example) to a
user space VMM and the guest OS. Use vma->vm_pgoff as the carrier of a
viommu_id.
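
On the driver side, the new op only needs to hand back the PFN of its HW
page, roughly as below (a sketch; my_driver_viommu and its mmio_phys field
are hypothetical):

    static unsigned long
    my_viommu_get_mmap_pfn(struct iommufd_viommu *viommu, size_t pgsize)
    {
        struct my_driver_viommu *my_viommu =
            container_of(viommu, struct my_driver_viommu, core);

        /* The core has already limited pgsize to PAGE_SIZE */
        return my_viommu->mmio_phys >> PAGE_SHIFT;
    }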

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/main.c | 40 ++++++++++++++++++++++++++++++++++++
include/linux/iommufd.h | 4 ++++
2 files changed, 44 insertions(+)

diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 96ef81530809..5b401c80cca8 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -16,6 +16,7 @@
#include <linux/mutex.h>
#include <linux/bug.h>
#include <uapi/linux/iommufd.h>
+#include <linux/iommu.h>
#include <linux/iommufd.h>

#include "io_pagetable.h"
@@ -427,11 +428,50 @@ static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
return ret;
}

+static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+ struct iommufd_ctx *ictx = filp->private_data;
+ size_t size = vma->vm_end - vma->vm_start;
+ u32 viommu_id = (u32)vma->vm_pgoff;
+ struct iommufd_viommu *viommu;
+ unsigned long pfn;
+ int rc;
+
+ if (size > PAGE_SIZE)
+ return -EINVAL;
+
+ viommu = container_of(iommufd_get_object(ictx, viommu_id,
+ IOMMUFD_OBJ_VIOMMU),
+ struct iommufd_viommu, obj);
+ if (IS_ERR(viommu))
+ return PTR_ERR(viommu);
+
+ if (!viommu->ops->get_mmap_pfn) {
+ rc = -EOPNOTSUPP;
+ goto out_put_viommu;
+ }
+
+ pfn = viommu->ops->get_mmap_pfn(viommu, size);
+ if (!pfn) {
+ rc = -ENOMEM;
+ goto out_put_viommu;
+ }
+
+ vma->vm_pgoff = 0;
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
+ rc = remap_pfn_range(vma, vma->vm_start, pfn, size, vma->vm_page_prot);
+out_put_viommu:
+ iommufd_put_object(ictx, &viommu->obj);
+ return rc;
+}
+
static const struct file_operations iommufd_fops = {
.owner = THIS_MODULE,
.open = iommufd_fops_open,
.release = iommufd_fops_release,
.unlocked_ioctl = iommufd_fops_ioctl,
+ .mmap = iommufd_fops_mmap,
};

/**
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 707b6d4b20a3..4a9c6b281d35 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -104,6 +104,8 @@ struct iommufd_viommu {
* the include/uapi/linux/iommufd.h header.
* @vqueue_free: Free all driver-specific parts of an iommufd_vqueue. The memory
* of the iommufd_vqueue will be freed by the iommufd core
+ * @get_mmap_pfn: Return the PFN of a viommu given a finite size, for user space
+ * to mmap the page(s).
*/
struct iommufd_viommu_ops {
void (*free)(struct iommufd_viommu *viommu);
@@ -114,6 +116,8 @@ struct iommufd_viommu_ops {
struct iommufd_viommu *viommu,
const struct iommu_user_data *user_data);
void (*vqueue_free)(struct iommufd_vqueue *vqueue);
+ unsigned long (*get_mmap_pfn)(struct iommufd_viommu *viommu,
+ size_t pgsize);
};

struct iommufd_vqueue {
--
2.43.0


2024-04-13 03:50:51

by Nicolin Chen

Subject: [PATCH RFCv1 09/14] iommufd/selftest: Add IOMMU_VIOMMU_SET_DEV_ID test coverage

Add an xarray that maps virtual ids to mock_devs, to cover the new
IOMMU_VIOMMU_SET_DEV_ID ioctl.

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/selftest.c | 50 +++++++++++++++++++
tools/testing/selftests/iommu/iommufd.c | 14 ++++++
tools/testing/selftests/iommu/iommufd_utils.h | 21 ++++++++
3 files changed, 85 insertions(+)

diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index e2fc2ec23093..4caed9304065 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -139,6 +139,7 @@ struct mock_dev {
struct device dev;
unsigned long flags;
int id;
+ unsigned int id_user;
};

struct selftest_obj {
@@ -516,6 +517,53 @@ static struct iommu_device *mock_probe_device(struct device *dev)

struct mock_viommu {
struct iommufd_viommu core;
+ struct xarray ids;
+};
+
+static void mock_viommu_free(struct iommufd_viommu *viommu)
+{
+ struct mock_viommu *mv = container_of(viommu, struct mock_viommu, core);
+ struct device *dev;
+ unsigned long index;
+
+ xa_for_each(&mv->ids, index, dev)
+ xa_erase(&mv->ids, index);
+ xa_destroy(&mv->ids);
+}
+
+static int mock_viommu_set_dev_id(struct iommufd_viommu *viommu,
+ struct device *dev, u64 dev_id)
+{
+ struct mock_viommu *mv = container_of(viommu, struct mock_viommu, core);
+ struct mock_dev *mdev = container_of(dev, struct mock_dev, dev);
+ u32 id = (u32)dev_id;
+ int rc;
+
+ if (dev_id > UINT_MAX)
+ return -EINVAL;
+ if (mdev->id_user > 0)
+ return -EBUSY;
+ rc = xa_alloc(&mv->ids, &id, dev, XA_LIMIT(id, id), GFP_KERNEL);
+ if (rc)
+ return rc;
+ mdev->id_user = (unsigned int)dev_id;
+ return 0;
+}
+
+static void mock_viommu_unset_dev_id(struct iommufd_viommu *viommu,
+ struct device *dev)
+{
+ struct mock_viommu *mv = container_of(viommu, struct mock_viommu, core);
+ struct mock_dev *mdev = container_of(dev, struct mock_dev, dev);
+
+ WARN_ON(dev != xa_erase(&mv->ids, mdev->id_user));
+ mdev->id_user = 0;
+}
+
+static const struct iommufd_viommu_ops mock_viommu_ops = {
+ .free = mock_viommu_free,
+ .set_dev_id = mock_viommu_set_dev_id,
+ .unset_dev_id = mock_viommu_unset_dev_id,
};

static struct iommufd_viommu *mock_viommu_alloc(struct device *dev,
@@ -527,6 +575,8 @@ static struct iommufd_viommu *mock_viommu_alloc(struct device *dev,
mv = iommufd_viommu_alloc(mock_viommu, core);
if (!mv)
return ERR_PTR(-ENOMEM);
+ mv->core.ops = &mock_viommu_ops;
+ xa_init_flags(&mv->ids, XA_FLAGS_ALLOC1);

return &mv->core;
}
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index dd7b7c7a06c6..378fbf00730e 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -271,6 +271,9 @@ TEST_F(iommufd_ioas, viommu)
uint32_t dev_id = self->device_id;
uint32_t viommu_id = 0;
uint32_t hwpt_id = 0;
+ uint32_t device2;
+ uint32_t stdev2;
+ uint32_t hwpt2;

if (dev_id) {
test_err_viommu_alloc(ENOENT, dev_id, hwpt_id, &viommu_id);
@@ -282,10 +285,21 @@ TEST_F(iommufd_ioas, viommu)
IOMMU_HWPT_ALLOC_NEST_PARENT,
&hwpt_id);
test_cmd_viommu_alloc(dev_id, hwpt_id, &viommu_id);
+ test_err_viommu_set_dev_id(EINVAL, dev_id, viommu_id, 1ULL << 32);
+ test_cmd_viommu_set_dev_id(dev_id, viommu_id, 0x99);
+ test_err_viommu_set_dev_id(EBUSY, dev_id, viommu_id, 0x99);
+
+ test_cmd_mock_domain(self->ioas_id, &stdev2, &hwpt2, &device2);
+ test_err_viommu_set_dev_id(EBUSY, device2, viommu_id, 0x99);
+ test_cmd_viommu_set_dev_id(device2, viommu_id, 0xaa);
+ test_err_viommu_set_dev_id(EBUSY, device2, viommu_id, 0xaa);
+ test_ioctl_destroy(stdev2);
+
test_ioctl_destroy(viommu_id);
test_ioctl_destroy(hwpt_id);
} else {
test_err_viommu_alloc(ENOENT, dev_id, hwpt_id, &viommu_id);
+ test_err_viommu_set_dev_id(ENOENT, dev_id, viommu_id, 99);
}
}

diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 037c84189d50..81e9184fd1d5 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -710,3 +710,24 @@ static int _test_cmd_viommu_alloc(int fd, __u32 device_id, __u32 hwpt_id,
#define test_err_viommu_alloc(_errno, device_id, hwpt_id, viommu_id) \
EXPECT_ERRNO(_errno, _test_cmd_viommu_alloc(self->fd, device_id, \
hwpt_id, 0, viommu_id))
+
+static int _test_cmd_viommu_set_dev_id(int fd, __u32 device_id,
+ __u32 viommu_id, __u64 virtual_id)
+{
+ struct iommu_viommu_set_dev_id cmd = {
+ .size = sizeof(cmd),
+ .dev_id = device_id,
+ .viommu_id = viommu_id,
+ .id = virtual_id,
+ };
+
+ return ioctl(fd, IOMMU_VIOMMU_SET_DEV_ID, &cmd);
+}
+
+#define test_cmd_viommu_set_dev_id(device_id, viommu_id, virtual_id) \
+ ASSERT_EQ(0, _test_cmd_viommu_set_dev_id(self->fd, device_id, \
+ viommu_id, virtual_id))
+#define test_err_viommu_set_dev_id(_errno, device_id, viommu_id, virtual_id) \
+ EXPECT_ERRNO(_errno, \
+ _test_cmd_viommu_set_dev_id(self->fd, device_id, \
+ viommu_id, virtual_id))
--
2.43.0


2024-04-13 03:51:03

by Nicolin Chen

Subject: [PATCH RFCv1 12/14] iommufd: Add IOMMUFD_OBJ_VQUEUE and IOMMUFD_CMD_VQUEUE_ALLOC

Introduce a new IOMMUFD_OBJ_VQUEUE to represent a virtual command queue
instance. And add a new ioctl for user space to allocate it.

As an initial version, add IOMMU_VQUEUE_DATA_TEGRA241_CMDQV to the enum
iommu_vqueue_data_type and the corresponding iommu_vqueue_tegra241_cmdqv
data structure.
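
For example, a VMM would allocate a VCMDQ-backed queue roughly as below,
where viommu_id, q_log2size and q_base_ipa are placeholders for values the
VMM tracks (a sketch; error handling trimmed):

    struct iommu_vqueue_tegra241_cmdqv data = {
        .vcmdq_id = 0,                  /* logical queue ID in the VINTF */
        .vcmdq_log2size = q_log2size,   /* queue size = 1 << q_log2size */
        .vcmdq_base = q_base_ipa,       /* guest PA of the queue memory */
    };
    struct iommu_vqueue_alloc cmd = {
        .size = sizeof(cmd),
        .viommu_id = viommu_id,
        .data_type = IOMMU_VQUEUE_DATA_TEGRA241_CMDQV,
        .data_len = sizeof(data),
        .data_uptr = (__u64)(uintptr_t)&data,
    };

    if (ioctl(iommufd, IOMMU_VQUEUE_ALLOC, &cmd))
        return -errno;
    vqueue_id = cmd.out_vqueue_id;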

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/iommufd_private.h | 2 +
drivers/iommu/iommufd/main.c | 6 +++
drivers/iommu/iommufd/viommu.c | 72 +++++++++++++++++++++++++
include/linux/iommufd.h | 1 +
include/uapi/linux/iommufd.h | 48 +++++++++++++++++
5 files changed, 129 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 9ba8f4ecc221..c994f2db3052 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -436,6 +436,8 @@ iommufd_get_viommu(struct iommufd_ucmd *ucmd, u32 id)
int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd);
void iommufd_viommu_destroy(struct iommufd_object *obj);
int iommufd_viommu_set_device_id(struct iommufd_ucmd *ucmd);
+int iommufd_vqueue_alloc_ioctl(struct iommufd_ucmd *ucmd);
+void iommufd_vqueue_destroy(struct iommufd_object *obj);

#ifdef CONFIG_IOMMUFD_TEST
int iommufd_test(struct iommufd_ucmd *ucmd);
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 16efc3346a2a..96ef81530809 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -324,6 +324,7 @@ union ucmd_buffer {
struct iommu_ioas_alloc alloc;
struct iommu_ioas_allow_iovas allow_iovas;
struct iommu_viommu_alloc viommu;
+ struct iommu_vqueue_alloc vqueue;
struct iommu_ioas_copy ioas_copy;
struct iommu_ioas_iova_ranges iova_ranges;
struct iommu_ioas_map map;
@@ -383,6 +384,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
struct iommu_viommu_alloc, out_viommu_id),
IOCTL_OP(IOMMU_VIOMMU_SET_DEV_ID, iommufd_viommu_set_device_id,
struct iommu_viommu_set_dev_id, id),
+ IOCTL_OP(IOMMU_VQUEUE_ALLOC, iommufd_vqueue_alloc_ioctl,
+ struct iommu_vqueue_alloc, data_uptr),
#ifdef CONFIG_IOMMUFD_TEST
IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
#endif
@@ -518,6 +521,9 @@ static const struct iommufd_object_ops iommufd_object_ops[] = {
[IOMMUFD_OBJ_VIOMMU] = {
.destroy = iommufd_viommu_destroy,
},
+ [IOMMUFD_OBJ_VQUEUE] = {
+ .destroy = iommufd_vqueue_destroy,
+ },
#ifdef CONFIG_IOMMUFD_TEST
[IOMMUFD_OBJ_SELFTEST] = {
.destroy = iommufd_selftest_destroy,
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index 8894878c1e73..56c9ea818bfa 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -161,3 +161,75 @@ int iommufd_viommu_set_device_id(struct iommufd_ucmd *ucmd)
iommufd_put_object(ucmd->ictx, &idev->obj);
return rc;
}
+
+void iommufd_vqueue_destroy(struct iommufd_object *obj)
+{
+ struct iommufd_vqueue *vqueue =
+ container_of(obj, struct iommufd_vqueue, obj);
+ struct iommufd_viommu *viommu = vqueue->viommu;
+
+ if (viommu->ops->vqueue_free)
+ viommu->ops->vqueue_free(vqueue);
+ refcount_dec(&viommu->obj.users);
+}
+
+int iommufd_vqueue_alloc_ioctl(struct iommufd_ucmd *ucmd)
+{
+ struct iommu_vqueue_alloc *cmd = ucmd->cmd;
+ const struct iommu_user_data user_data = {
+ .type = cmd->data_type,
+ .uptr = u64_to_user_ptr(cmd->data_uptr),
+ .len = cmd->data_len,
+ };
+ struct iommufd_vqueue *vqueue;
+ struct iommufd_viommu *viommu;
+ int rc;
+
+ if (cmd->flags)
+ return -EOPNOTSUPP;
+ if (!cmd->data_len)
+ return -EINVAL;
+
+ viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
+ if (IS_ERR(viommu))
+ return PTR_ERR(viommu);
+
+ if (!viommu->ops || !viommu->ops->vqueue_alloc) {
+ rc = -EOPNOTSUPP;
+ goto out_put_viommu;
+ }
+
+ vqueue = viommu->ops->vqueue_alloc(
+ viommu, user_data.len ? &user_data : NULL);
+ if (IS_ERR(vqueue)) {
+ rc = PTR_ERR(vqueue);
+ goto out_put_viommu;
+ }
+
+ /* iommufd_object_finalize will store the vqueue->obj.id */
+ rc = xa_alloc(&ucmd->ictx->objects, &vqueue->obj.id, XA_ZERO_ENTRY,
+ xa_limit_31b, GFP_KERNEL_ACCOUNT);
+ if (rc)
+ goto out_free;
+
+ vqueue->obj.type = IOMMUFD_OBJ_VQUEUE;
+
+ vqueue->ictx = ucmd->ictx;
+ vqueue->viommu = viommu;
+ cmd->out_vqueue_id = vqueue->obj.id;
+ rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+ if (rc)
+ goto out_erase_xa;
+ iommufd_object_finalize(ucmd->ictx, &vqueue->obj);
+ refcount_inc(&viommu->obj.users);
+ goto out_put_viommu;
+
+out_erase_xa:
+ xa_erase(&ucmd->ictx->objects, vqueue->obj.id);
+out_free:
+ if (viommu->ops->vqueue_free)
+ viommu->ops->vqueue_free(vqueue);
+out_put_viommu:
+ iommufd_put_object(ucmd->ictx, &viommu->obj);
+ return rc;
+}
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 5b97b04aa145..707b6d4b20a3 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -31,6 +31,7 @@ enum iommufd_object_type {
IOMMUFD_OBJ_IOAS,
IOMMUFD_OBJ_ACCESS,
IOMMUFD_OBJ_VIOMMU,
+ IOMMUFD_OBJ_VQUEUE,
#ifdef CONFIG_IOMMUFD_TEST
IOMMUFD_OBJ_SELFTEST,
#endif
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index eaa192de63d3..95abe3100e3b 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -52,6 +52,7 @@ enum {
IOMMUFD_CMD_HWPT_INVALIDATE,
IOMMUFD_CMD_VIOMMU_ALLOC,
IOMMUFD_CMD_VIOMMU_SET_DEV_ID,
+ IOMMUFD_CMD_VQUEUE_ALLOC,
};

/**
@@ -742,4 +743,51 @@ struct iommu_viommu_set_dev_id {
__aligned_u64 id;
};
#define IOMMU_VIOMMU_SET_DEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_SET_DEV_ID)
+
+/**
+ * struct iommu_vqueue_tegra241_cmdqv - NVIDIA Tegra241's Virtual Command Queue
+ * for its CMDQV Extension for ARM SMMUv3
+ * (IOMMU_VQUEUE_DATA_TEGRA241_CMDQV)
+ * @vcmdq_id: logical ID of a virtual command queue in the VIOMMU instance
+ * @vcmdq_log2size: (1 << @vcmdq_log2size) will be the size of the vcmdq
+ * @vcmdq_base: guest physical address (IPA) to the vcmdq base address
+ */
+struct iommu_vqueue_tegra241_cmdqv {
+ __u32 vcmdq_id;
+ __u32 vcmdq_log2size;
+ __aligned_u64 vcmdq_base;
+};
+
+/**
+ * enum iommu_vqueue_data_type - VQUEUE Data Type
+ * @IOMMU_VQUEUE_DATA_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension for SMMUv3
+ */
+enum iommu_vqueue_data_type {
+ IOMMU_VQUEUE_DATA_TEGRA241_CMDQV,
+};
+
+/**
+ * struct iommu_vqueue_alloc - ioctl(IOMMU_VQUEUE_ALLOC)
+ * @size: sizeof(struct iommu_vqueue_alloc)
+ * @flags: Must be 0
+ * @viommu_id: viommu ID to associate the virtual queue with
+ * @out_vqueue_id: The ID of the new virtual queue
+ * @data_type: One of enum iommu_vqueue_data_type
+ * @data_len: Length of the type specific data
+ * @data_uptr: User pointer to the type specific data
+ *
+ * Allocate a virtual queue object for a driver-specific HW-accelerated queue
+ */
+
+struct iommu_vqueue_alloc {
+ __u32 size;
+ __u32 flags;
+ __u32 viommu_id;
+ __u32 out_vqueue_id;
+ __u32 data_type;
+ __u32 data_len;
+ __aligned_u64 data_uptr;
+};
+#define IOMMU_VQUEUE_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VQUEUE_ALLOC)
+
#endif
--
2.43.0


2024-04-13 03:51:11

by Nicolin Chen

Subject: [PATCH RFCv1 14/14] iommu/tegra241-cmdqv: Add user-space use support

Add support for the virtualization use case via the VIOMMU infrastructure.

This basically allows a VMM to allocate VINTFs (as viommu objects) and
assign VCMDQs to them. A VINTF's MMIO page0 can be mmap'd to user space
for the VM to access directly, without a VMEXIT and the corresponding
hypercall.

As an initial version, the number of VCMDQs per VINTF is fixed to two.
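
Putting the series together for this driver, the expected VMM flow is
roughly the following (a pseudo-code sketch; the helpers are hypothetical
wrappers around the new ioctls):

    /* 1. Allocate a VINTF-backed vIOMMU on the nested parent HWPT */
    viommu_id = viommu_alloc(dev_id, s2_hwpt_id,
                             IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV);
    /* 2. Link the physical SID to the SID the guest will use */
    viommu_set_dev_id(viommu_id, dev_id, vsid);
    /* 3. Allocate up to two VCMDQs in the VINTF */
    vqueue_id = vqueue_alloc(viommu_id, &tegra241_cmdqv_data);
    /* 4. mmap the VINTF MMIO page0 for direct guest access */
    page0 = mmap(NULL, psz, PROT_READ | PROT_WRITE, MAP_SHARED,
                 iommufd, (off_t)viommu_id * psz);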

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 ++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 19 ++
.../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 284 +++++++++++++++++-
3 files changed, 317 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9af6659ea488..367df65d1e2a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -17,6 +17,7 @@
#include <linux/err.h>
#include <linux/interrupt.h>
#include <linux/io-pgtable.h>
+#include <linux/iommufd.h>
#include <linux/iopoll.h>
#include <linux/module.h>
#include <linux/msi.h>
@@ -3077,6 +3078,23 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
arm_smmu_sva_remove_dev_pasid(domain, dev, pasid);
}

+static struct iommufd_viommu *arm_smmu_viommu_alloc(struct device *dev,
+ unsigned int viommu_type,
+ struct iommu_domain *domain)
+{
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain_devices(domain);
+ struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+
+ if (!master || !master->smmu)
+ return ERR_PTR(-ENODEV);
+
+ if (master->smmu->tegra241_cmdqv &&
+ viommu_type == IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV)
+ return tegra241_cmdqv_viommu_alloc(
+ master->smmu->tegra241_cmdqv, smmu_domain);
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
static struct iommu_ops arm_smmu_ops = {
.identity_domain = &arm_smmu_identity_domain,
.blocked_domain = &arm_smmu_blocked_domain,
@@ -3091,6 +3109,7 @@ static struct iommu_ops arm_smmu_ops = {
.remove_dev_pasid = arm_smmu_remove_dev_pasid,
.dev_enable_feat = arm_smmu_dev_enable_feature,
.dev_disable_feat = arm_smmu_dev_disable_feature,
+ .viommu_alloc = arm_smmu_viommu_alloc,
.page_response = arm_smmu_page_response,
.def_domain_type = arm_smmu_def_domain_type,
.pgsize_bitmap = -1UL, /* Restricted during device attach */
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index fdc3d570cf43..8ee3f10086b9 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -701,6 +701,7 @@ struct arm_smmu_device {

struct arm_smmu_stream {
u32 id;
+ u32 cmdqv_sid_slot;
struct arm_smmu_master *master;
struct rb_node node;
};
@@ -776,6 +777,14 @@ int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
unsigned long prod_off, unsigned long cons_off,
size_t dwords, const char *name);

+static inline phys_addr_t
+arm_smmu_domain_ipa_to_pa(struct arm_smmu_domain *smmu_domain, u64 ipa)
+{
+ if (WARN_ON_ONCE(smmu_domain->stage != ARM_SMMU_DOMAIN_S2))
+ return 0;
+ return iommu_iova_to_phys(&smmu_domain->domain, ipa);
+}
+
#ifdef CONFIG_ARM_SMMU_V3_SVA
bool arm_smmu_sva_supported(struct arm_smmu_device *smmu);
bool arm_smmu_master_sva_supported(struct arm_smmu_master *master);
@@ -838,6 +847,9 @@ tegra241_cmdqv_acpi_probe(struct arm_smmu_device *smmu, int id);
int tegra241_cmdqv_device_reset(struct arm_smmu_device *smmu);
struct arm_smmu_cmdq *tegra241_cmdqv_get_cmdq(struct arm_smmu_device *smmu,
u64 *cmds, int n);
+struct iommufd_viommu *
+tegra241_cmdqv_viommu_alloc(struct tegra241_cmdqv *cmdqv,
+ struct arm_smmu_domain *smmu_domain);
#else /* CONFIG_TEGRA241_CMDQV */
static inline struct tegra241_cmdqv *
tegra241_cmdqv_acpi_probe(struct arm_smmu_device *smmu, int id)
@@ -855,6 +867,13 @@ tegra241_cmdqv_get_cmdq(struct arm_smmu_device *smmu, u64 *cmds, int n)
{
return NULL;
}
+
+static inline struct iommufd_viommu *
+tegra241_cmdqv_viommu_alloc(struct tegra241_cmdqv *cmdqv,
+ struct arm_smmu_domain *smmu_domain)
+{
+ return ERR_PTR(-ENODEV);
+}
#endif /* CONFIG_TEGRA241_CMDQV */

#endif /* _ARM_SMMU_V3_H */
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
index 7aeaf810980c..18e250ec60dc 100644
--- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
+++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
@@ -8,7 +8,12 @@
#include <linux/dma-mapping.h>
#include <linux/interrupt.h>
#include <linux/iommu.h>
+#include <linux/iommufd.h>
#include <linux/iopoll.h>
+#include <linux/kvm_host.h>
+#include <linux/platform_device.h>
+#include <linux/vfio.h>
+#include <uapi/linux/iommufd.h>

#include <acpi/acpixf.h>

@@ -28,8 +33,10 @@
#define CMDQV_EN BIT(0)

#define TEGRA241_CMDQV_PARAM 0x0004
+#define CMDQV_NUM_SID_PER_VM_LOG2 GENMASK(15, 12)
#define CMDQV_NUM_VINTF_LOG2 GENMASK(11, 8)
#define CMDQV_NUM_VCMDQ_LOG2 GENMASK(7, 4)
+#define CMDQV_VER GENMASK(3, 0)

#define TEGRA241_CMDQV_STATUS 0x0008
#define CMDQV_ENABLED BIT(0)
@@ -47,6 +54,14 @@
/* VINTF config regs */
#define TEGRA241_VINTF(v) (0x1000 + 0x100*(v))

+#define TEGRA241_VINTFi_CONFIG(i) (TEGRA241_VINTF(i) + TEGRA241_VINTF_CONFIG)
+#define TEGRA241_VINTFi_STATUS(i) (TEGRA241_VINTF(i) + TEGRA241_VINTF_STATUS)
+#define TEGRA241_VINTFi_SID_MATCH(i, s) (TEGRA241_VINTF(i) + TEGRA241_VINTF_SID_MATCH(s))
+#define TEGRA241_VINTFi_SID_REPLACE(i, s) \
+ (TEGRA241_VINTF(i) + TEGRA241_VINTF_SID_REPLACE(s))
+#define TEGRA241_VINTFi_CMDQ_ERR_MAP(i, m) \
+ (TEGRA241_VINTF(i) + TEGRA241_VINTF_CMDQ_ERR_MAP(m))
+
#define TEGRA241_VINTF_CONFIG 0x0000
#define VINTF_HYP_OWN BIT(17)
#define VINTF_VMID GENMASK(16, 1)
@@ -55,6 +70,10 @@
#define TEGRA241_VINTF_STATUS 0x0004
#define VINTF_STATUS GENMASK(3, 1)
#define VINTF_ENABLED BIT(0)
+#define VINTF_VI_NUM_LVCMDQ GENMASK(23, 16)
+
+#define TEGRA241_VINTF_SID_MATCH(s) (0x0040 + 0x4*(s))
+#define TEGRA241_VINTF_SID_REPLACE(s) (0x0080 + 0x4*(s))

#define TEGRA241_VINTF_CMDQ_ERR_MAP(m) (0x00C0 + 0x4*(m))

@@ -236,6 +255,7 @@ MODULE_PARM_DESC(bypass_vcmdq,

/**
* struct tegra241_vcmdq - Virtual Command Queue
+ * @core: Embedded iommufd_vqueue structure
* @idx: Global index in the CMDQV HW
* @lidx: Local index in the VINTF
* @cmdqv: CMDQV HW pointer
@@ -246,6 +266,8 @@ MODULE_PARM_DESC(bypass_vcmdq,
* @page1: MMIO Page1 base address
*/
struct tegra241_vcmdq {
+ struct iommufd_vqueue core;
+
u16 idx;
u16 lidx;

@@ -257,19 +279,27 @@ struct tegra241_vcmdq {
void __iomem *page0;
void __iomem *page1;
};
+#define vqueue_to_vcmdq(v) container_of(v, struct tegra241_vcmdq, core)

/**
* struct tegra241_vintf - Virtual Interface
+ * @core: Embedded iommufd_viommu structure
* @idx: Global index in the CMDQV HW
+ * @vmid: VMID for configuration
* @enabled: Enabled or not
* @hyp_own: Owned by hypervisor (in-kernel)
* @error: Status error or not
* @cmdqv: CMDQV HW pointer
* @vcmdqs: List of VCMDQ pointers
* @base: MMIO base address
+ * @s2_domain: Stage-2 SMMU domain
+ * @sid_slots: Stream ID Slot allocator
*/
struct tegra241_vintf {
+ struct iommufd_viommu core;
+
u16 idx;
+ u16 vmid;

bool enabled;
bool hyp_own;
@@ -279,13 +309,19 @@ struct tegra241_vintf {
struct tegra241_vcmdq **vcmdqs;

void __iomem *base;
+ struct arm_smmu_domain *s2_domain;
+
+#define TEGRA241_VINTF_MAX_SID_SLOT 15
+ struct ida sid_slots;
};
+#define viommu_to_vintf(v) container_of(v, struct tegra241_vintf, core)

/**
* struct tegra241_cmdqv - CMDQ-V for SMMUv3
* @smmu: SMMUv3 pointer
* @dev: Device pointer
* @base: MMIO base address
+ * @base_pfn: Page frame number of @base, for mmap
* @irq: IRQ number
* @num_vintfs: Total number of VINTFs
* @num_vcmdqs: Total number of VCMDQs
@@ -299,6 +335,7 @@ struct tegra241_cmdqv {

struct device *dev;
void __iomem *base;
+ unsigned long base_pfn;
int irq;

/* CMDQV Hardware Params */
@@ -446,15 +483,11 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
vcmdq_dbg("deinited\n");
}

-static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
+static void _tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
{
struct tegra241_cmdqv *cmdqv = vcmdq->cmdqv;
struct tegra241_vintf *vintf = vcmdq->vintf;
u32 regval;
- int ret;
-
- /* Configure and enable the vcmdq */
- tegra241_vcmdq_hw_deinit(vcmdq);

regval = FIELD_PREP(CMDQV_CMDQ_ALLOC_VINTF, vintf->idx);
regval |= FIELD_PREP(CMDQV_CMDQ_ALLOC_LVCMDQ, vcmdq->lidx);
@@ -462,7 +495,15 @@ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
cmdqv_writel_relaxed(regval, CMDQ_ALLOC(vcmdq->idx));

vcmdq_page1_writeq_relaxed(vcmdq->cmdq.q.q_base, BASE);
+}
+
+static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
+{
+ int ret;

+ /* Configure and enable the vcmdq */
+ tegra241_vcmdq_hw_deinit(vcmdq);
+ _tegra241_vcmdq_hw_init(vcmdq);
ret = vcmdq_write_config(VCMDQ_EN);
if (ret) {
vcmdq_err("GERRORN=0x%X\n", vcmdq_page0_readl_relaxed(GERRORN));
@@ -723,6 +764,8 @@ tegra241_cmdqv_find_resource(struct arm_smmu_device *smmu, int id)

acpi_dev_free_resource_list(&resource_list);

+ cmdqv->base_pfn = rentry->res->start >> PAGE_SHIFT;
+
INIT_LIST_HEAD(&resource_list);

ret = acpi_dev_get_resources(adev, &resource_list,
@@ -843,3 +886,234 @@ tegra241_cmdqv_acpi_probe(struct arm_smmu_device *smmu, int id)

return cmdqv;
}
+
+static int tegra241_vcmdq_hw_init_user(struct tegra241_vcmdq *vcmdq)
+{
+ /* Configure the vcmdq only; User space does the enabling */
+ _tegra241_vcmdq_hw_init(vcmdq);
+
+ vcmdq_dbg("inited at host PA 0x%llx size 0x%lx\n",
+ vcmdq->cmdq.q.q_base & VCMDQ_ADDR,
+ 1UL << (vcmdq->cmdq.q.q_base & VCMDQ_LOG2SIZE));
+ return 0;
+}
+
+static struct iommufd_vqueue *
+tegra241_cmdqv_vqueue_alloc(struct iommufd_viommu *viommu,
+ const struct iommu_user_data *user_data)
+{
+ struct tegra241_vintf *vintf = viommu_to_vintf(viommu);
+ struct tegra241_cmdqv *cmdqv = vintf->cmdqv;
+ struct iommu_vqueue_tegra241_cmdqv arg;
+ struct tegra241_vcmdq *vcmdq;
+ phys_addr_t q_base;
+ int ret;
+
+ ret = iommu_copy_struct_from_user(&arg, user_data,
+ IOMMU_VQUEUE_DATA_TEGRA241_CMDQV,
+ vcmdq_base);
+ if (ret)
+ return ERR_PTR(ret);
+
+ if (!arg.vcmdq_base || arg.vcmdq_base & ~VCMDQ_ADDR)
+ return ERR_PTR(-EINVAL);
+ if (!arg.vcmdq_log2size || arg.vcmdq_log2size > VCMDQ_LOG2SIZE)
+ return ERR_PTR(-EINVAL);
+ if (arg.vcmdq_id >= cmdqv->num_vcmdqs_per_vintf)
+ return ERR_PTR(-EINVAL);
+ q_base = arm_smmu_domain_ipa_to_pa(vintf->s2_domain, arg.vcmdq_base);
+ if (!q_base)
+ return ERR_PTR(-EINVAL);
+
+ if (vintf->vcmdqs[arg.vcmdq_id]) {
+ vcmdq = vintf->vcmdqs[arg.vcmdq_id];
+
+ /* deinit the previous setting as a reset, before re-init */
+ tegra241_vcmdq_hw_deinit(vcmdq);
+
+ vcmdq->cmdq.q.q_base = q_base & VCMDQ_ADDR;
+ vcmdq->cmdq.q.q_base |= arg.vcmdq_log2size;
+ tegra241_vcmdq_hw_init_user(vcmdq);
+
+ return &vcmdq->core;
+ }
+
+ vcmdq = iommufd_vqueue_alloc(tegra241_vcmdq, core);
+ if (!vcmdq)
+ return ERR_PTR(-ENOMEM);
+
+ ret = tegra241_vintf_lvcmdq_init(vintf, arg.vcmdq_id, vcmdq);
+ if (ret)
+ goto free_vcmdq;
+
+ vcmdq->cmdq.q.q_base = q_base & VCMDQ_ADDR;
+ vcmdq->cmdq.q.q_base |= arg.vcmdq_log2size;
+
+ ret = tegra241_vcmdq_hw_init_user(vcmdq);
+ if (ret)
+ goto deinit_lvcmdq;
+ vintf->vcmdqs[arg.vcmdq_id] = vcmdq;
+ vcmdq_dbg("allocated\n");
+
+ return &vcmdq->core;
+deinit_lvcmdq:
+ tegra241_vintf_lvcmdq_deinit(vcmdq);
+free_vcmdq:
+ kfree(vcmdq);
+ return ERR_PTR(ret);
+}
+
+static void tegra241_cmdqv_vqueue_free(struct iommufd_vqueue *vqueue)
+{
+ struct tegra241_vcmdq *vcmdq = vqueue_to_vcmdq(vqueue);
+
+ tegra241_vcmdq_hw_deinit(vcmdq);
+ tegra241_vintf_lvcmdq_deinit(vcmdq);
+ vcmdq->vintf->vcmdqs[vcmdq->lidx] = NULL;
+ vcmdq_dbg("deallocated\n");
+
+ /* IOMMUFD core frees the memory of vcmdq and vqueue */
+}
+
+static void tegra241_cmdqv_viommu_free(struct iommufd_viommu *viommu)
+{
+ struct tegra241_vintf *vintf = viommu_to_vintf(viommu);
+ struct tegra241_cmdqv *cmdqv = vintf->cmdqv;
+ int qidx;
+
+ /* Just in case, make sure all LVCMDQs are disabled */
+ for (qidx = 0; qidx < cmdqv->num_vcmdqs_per_vintf; qidx++)
+ if (vintf->vcmdqs[qidx])
+ tegra241_cmdqv_vqueue_free(&vintf->vcmdqs[qidx]->core);
+
+ vintf_write_config(0);
+
+ ida_destroy(&vintf->sid_slots);
+ kfree(vintf->vcmdqs);
+
+ ida_free(&cmdqv->vintf_ids, vintf->idx);
+ cmdqv->vintfs[vintf->idx] = NULL;
+ vintf_dbg("deallocated with vmid (%d)\n", vintf->vmid);
+
+ /* IOMMUFD core frees the memory of vintf and viommu */
+}
+
+static int tegra241_cmdqv_viommu_set_dev_id(struct iommufd_viommu *viommu,
+ struct device *dev, u64 dev_id)
+{
+ struct tegra241_vintf *vintf =
+ container_of(viommu, struct tegra241_vintf, core);
+ struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+ struct arm_smmu_stream *stream = &master->streams[0];
+ int slot;
+
+ WARN_ON_ONCE(master->num_streams != 1);
+
+ /* Find an empty slot of SID_MATCH and SID_REPLACE */
+ slot = ida_alloc_max(&vintf->sid_slots,
+ TEGRA241_VINTF_MAX_SID_SLOT, GFP_KERNEL);
+ if (slot < 0)
+ return slot;
+
+ vintf_writel_relaxed(stream->id, SID_REPLACE(slot));
+ vintf_writel_relaxed(dev_id << 1 | 0x1, SID_MATCH(slot));
+ stream->cmdqv_sid_slot = slot;
+ vintf_dbg("allocated a slot (%d) for pSID=%x, vSID=%x\n",
+ slot, stream->id, (u32)dev_id);
+
+ return 0;
+}
+
+static void tegra241_cmdqv_viommu_unset_dev_id(struct iommufd_viommu *viommu,
+ struct device *dev)
+{
+ struct tegra241_vintf *vintf =
+ container_of(viommu, struct tegra241_vintf, core);
+ struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+ struct arm_smmu_stream *stream = &master->streams[0];
+ int slot = stream->cmdqv_sid_slot;
+
+ vintf_writel_relaxed(0, SID_REPLACE(slot));
+ vintf_writel_relaxed(0, SID_MATCH(slot));
+ ida_free(&vintf->sid_slots, slot);
+ vintf_dbg("deallocated a slot (%d) for pSID=%x\n", slot, stream->id);
+}
+
+static unsigned long tegra241_cmdqv_get_mmap_pfn(struct iommufd_viommu *viommu,
+ size_t pgsize)
+{
+ struct tegra241_vintf *vintf =
+ container_of(viommu, struct tegra241_vintf, core);
+ struct tegra241_cmdqv *cmdqv = vintf->cmdqv;
+
+ return cmdqv->base_pfn + TEGRA241_VINTFi_PAGE0(vintf->idx) / PAGE_SIZE;
+}
+
+static const struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = {
+ .free = tegra241_cmdqv_viommu_free,
+ .set_dev_id = tegra241_cmdqv_viommu_set_dev_id,
+ .unset_dev_id = tegra241_cmdqv_viommu_unset_dev_id,
+ .vqueue_alloc = tegra241_cmdqv_vqueue_alloc,
+ .vqueue_free = tegra241_cmdqv_vqueue_free,
+ .get_mmap_pfn = tegra241_cmdqv_get_mmap_pfn,
+};
+
+struct iommufd_viommu *
+tegra241_cmdqv_viommu_alloc(struct tegra241_cmdqv *cmdqv,
+ struct arm_smmu_domain *smmu_domain)
+{
+ struct tegra241_vintf *vintf;
+ int idx, ret;
+ u32 regval;
+
+ if (!smmu_domain || smmu_domain->stage != ARM_SMMU_DOMAIN_S2)
+ return ERR_PTR(-EINVAL);
+
+ vintf = iommufd_viommu_alloc(tegra241_vintf, core);
+ if (!vintf)
+ return ERR_PTR(-ENOMEM);
+ vintf->core.ops = &tegra241_cmdqv_viommu_ops;
+
+ ret = ida_alloc_range(&cmdqv->vintf_ids, 1,
+ cmdqv->num_vintfs - 1, GFP_KERNEL);
+ if (ret < 0) {
+ dev_err(cmdqv->dev, "no more available vintf\n");
+ goto out_free;
+ }
+ idx = ret;
+ cmdqv->vintfs[idx] = vintf;
+
+ vintf->idx = idx;
+ vintf->cmdqv = cmdqv;
+ vintf->vmid = smmu_domain->vmid;
+ vintf->s2_domain = smmu_domain;
+ vintf->base = cmdqv->base + TEGRA241_VINTF(idx);
+
+ ret = vintf_write_config(0);
+ if (ret)
+ goto out_ida_free;
+ regval = FIELD_PREP(VINTF_VMID, vintf->vmid) |
+ FIELD_PREP(VINTF_EN, 1);
+ ret = vintf_write_config(regval);
+ if (ret)
+ goto out_ida_free;
+
+ vintf->vcmdqs = kcalloc(cmdqv->num_vcmdqs_per_vintf,
+ sizeof(*vintf->vcmdqs), GFP_KERNEL);
+ if (!vintf->vcmdqs) {
+ ret = -ENOMEM;
+ goto out_ida_free;
+ }
+
+ ida_init(&vintf->sid_slots);
+
+ vintf_dbg("allocated with vmid (%d)\n", vintf->vmid);
+
+ return &vintf->core;
+
+out_ida_free:
+ ida_free(&cmdqv->vintf_ids, vintf->idx);
+out_free:
+ kfree(vintf);
+ return ERR_PTR(ret);
+}
--
2.43.0


2024-05-12 14:28:12

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 05/14] iommufd: Add IOMMUFD_OBJ_VIOMMU and IOMMUFD_CMD_VIOMMU_ALLOC

On Fri, Apr 12, 2024 at 08:47:02PM -0700, Nicolin Chen wrote:

> +int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
> +{
> + struct iommu_viommu_alloc *cmd = ucmd->cmd;
> + struct iommufd_hwpt_paging *hwpt_paging;
> + struct iommu_device *iommu_dev;
> + struct iommufd_viommu *viommu;
> + struct iommufd_device *idev;
> + int rc;
> +
> + if (cmd->flags)
> + return -EOPNOTSUPP;
> +
> + idev = iommufd_get_device(ucmd, cmd->dev_id);
> + if (IS_ERR(idev))
> + return PTR_ERR(idev);
> + iommu_dev = idev->dev->iommu->iommu_dev;
> +
> + if (!iommu_dev->ops->viommu_alloc) {
> + rc = -EOPNOTSUPP;
> + goto out_put_idev;
> + }
> +
> + hwpt_paging = iommufd_get_hwpt_paging(ucmd, cmd->hwpt_id);
> + if (IS_ERR(hwpt_paging)) {
> + rc = PTR_ERR(hwpt_paging);
> + goto out_put_idev;
> + }
> +
> + if (!hwpt_paging->nest_parent) {
> + rc = -EINVAL;
> + goto out_put_hwpt;
> + }
> +
> + viommu = iommu_dev->ops->viommu_alloc(idev->dev, cmd->type,
> + hwpt_paging->common.domain);
> + if (IS_ERR(viommu)) {
> + rc = PTR_ERR(viommu);
> + goto out_put_hwpt;
> + }

Ah, you did already include the S2. So should it be
domain->viommu_alloc() then?

> +
> + /* iommufd_object_finalize will store the viommu->obj.id */
> + rc = xa_alloc(&ucmd->ictx->objects, &viommu->obj.id, XA_ZERO_ENTRY,
> + xa_limit_31b, GFP_KERNEL_ACCOUNT);
> + if (rc)
> + goto out_free;
> +
> + viommu->obj.type = IOMMUFD_OBJ_VIOMMU;

See my other notes, lets try not to open code this.

> + viommu->type = cmd->type;
> +
> + viommu->ictx = ucmd->ictx;
> + viommu->hwpt = hwpt_paging;
> + viommu->iommu_dev = idev->dev->iommu->iommu_dev;
> + cmd->out_viommu_id = viommu->obj.id;
> + rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
> + if (rc)
> + goto out_erase_xa;
> + iommufd_object_finalize(ucmd->ictx, &viommu->obj);
> + refcount_inc(&viommu->hwpt->common.obj.users);
> + goto out_put_hwpt;
> +
> +out_erase_xa:
> + xa_erase(&ucmd->ictx->objects, viommu->obj.id);
> +out_free:
> + if (viommu->ops && viommu->ops->free)
> + viommu->ops->free(viommu);
> + kfree(viommu);

This really should use the abort flow. The driver free callback has to
be in the object release..

> +
> +/**
> + * enum iommu_viommu_type - VIOMMU Type
> + * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension for SMMUv3
> + */
> +enum iommu_viommu_type {
> + IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
> +};

At least the 241 line should be in a following patch

> +/**
> + * struct iommu_viommu_alloc - ioctl(IOMMU_VIOMMU_ALLOC)
> + * @size: sizeof(struct iommu_viommu_alloc)
> + * @flags: Must be 0
> + * @type: Type of the VIOMMU object. Must be defined in enum iommu_viommu_type
> + * @dev_id: The device to allocate this virtual IOMMU for
> + * @hwpt_id: ID of a nested parent HWPT
> + * @out_viommu_id: Output virtual IOMMU ID for the allocated object
> + *
> + * Allocate a virtual IOMMU object that holds a (shared) nested parent HWPT
> + */
> +struct iommu_viommu_alloc {
> + __u32 size;
> + __u32 flags;
> + __u32 type;
> + __u32 dev_id;
> + __u32 hwpt_id;
> + __u32 out_viommu_id;
> +};

This seems fine.

Let's have a following patch to change the hwpt_alloc to accept the
viommu as a hwpt as a uAPI change as well.

The more I think about how this needs to work the more sure I am that
we need to do that.

ARM will need a fairly tricky set of things to manage the VMID
lifecycle. In BTM mode the VMID must come from the KVM. For vcmdq the
VMID is needed to create the queue/viommu. For AMD the S2 is needed to
create the VIOMMU in the first place.

So, to make this all work perfectly we need approx the following
- S2 sharing across instances in ARM - meaning the VMID is allocated
at attach not domain alloc
- S2 hwpt is refcounted by the VIOMMU in the iommufd layer
- VIOMMU is refcounted by every nesting child in the iommufd layer
- The nesting child holds a pointer to both the S2 and the VIOMMU
(viommu optional)
- When the nesting child attaches to a device the STE will source the
VMID from the VIOMMU if present otherwise from the S2
- "RID" attach (ie naked S2) will have to be done with a Nesting
Child using a vSTE that indicates Identity. Then the attach logic
will have enough information to get the VMID from the VIOMMU
- In full VIOMMU mode the S2 will never get a VMID of its own, it
will always use the VIOMMU. Life cycle is simple, the VMID is freed
when the VIOMMU is freed. That can't happen until all Nesting
Children are freed. That can't happen until all Nesting Children
are detached from devices. Detaching removes the HW touch of the VMID.

At this point you don't need the full generality, but let's please get
ready and get the viommu pointer available in all the right spots and
we can keep the current logic to borrow the VMID from the S2 for the
VIOMMU.
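
To put the relationships above in struct form, a rough sketch (all
names and fields here are assumptions, not a final layout) could be:

struct iommufd_viommu {
	struct iommufd_object obj;
	struct iommufd_hwpt_paging *hwpt; /* refcounted S2 parent */
	/* the VMID lives here; freed only when the viommu is freed */
};

struct iommufd_hwpt_nested {
	struct iommufd_hw_pagetable common;
	struct iommufd_hwpt_paging *parent; /* the S2 */
	struct iommufd_viommu *viommu;      /* optional; refcounted */
};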

AMD folks, please consider if this works for you as well.

Jason

2024-05-12 14:28:19

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

On Fri, Apr 12, 2024 at 08:47:01PM -0700, Nicolin Chen wrote:
> Add a new iommufd_viommu core structure to represent a vIOMMU instance in
> the user space, typically backed by a HW-accelerated feature of an IOMMU,
> e.g. NVIDIA CMDQ-Virtualization (an ARM SMMUv3 extension) and AMD Hardware
> Accelerated Virtualized IOMMU (vIOMMU).

I expect this will also be the only way to pass in an associated KVM,
userspace would supply the kvm when creating the viommu.

The tricky bit of this flow is how to manage the S2. It is necessary
that the S2 be linked to the viommu:

1) ARM BTM requires the VMID to be shared with KVM
2) AMD and others need the S2 translation because some of the HW
acceleration is done inside the guest address space

I haven't looked closely at AMD but presumably the VIOMMU create will
have to install the S2 into a DID or something?

So we need the S2 to exist before the VIOMMU is created, but the
drivers are going to need some more fixing before that will fully
work.

Does the nesting domain create need the viommu as well (in place of
the S2 hwpt)? That feels sort of natural.

There is still a lot of fixing before everything can work fully, but
do we need to make some preparations here in the uapi? Like starting
to thread the S2 through it as I described?

Kevin, does Intel foresee any viommu needs on current/future Intel HW?
I assume you are thinking about invalidation queue bypass like
everyone else. I think it is an essential feature for vSVA.

> A driver should embed this core structure in its driver viommu structure
> and call the new iommufd_viommu_alloc() helper to allocate a core/driver
> structure bundle and fill its core viommu->ops:
> struct my_driver_viommu {
> struct iommufd_viommu core;
> ....
> };
>
> static const struct iommufd_viommu_ops my_driver_viommu_ops = {
> .free = my_driver_viommu_free,
> };
>
> struct my_driver_viommu *my_viommu =
> iommufd_viommu_alloc(my_driver_viommu, core);

Why don't we have an ictx here anyhow? The caller has it? Just pass it
down and then it is normal:

my_viommu = iommufd_object_alloc_elm(ictx, my_viommu, IOMMUFD_OBJ_HWPT_VIOMMU, core.obj);

And abort works properly for error handling.
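
For illustration, such a wrapper might look roughly like this (a
sketch only; the macro shape and the IOMMUFD_OBJ_VIOMMU spelling are
assumptions):

/* The core object must be the first member so error pointers pass
 * through container_of() unchanged */
#define iommufd_viommu_alloc(ictx, drv_struct, member)                 \
	container_of(iommufd_object_alloc_elm(ictx,                    \
			     sizeof(struct drv_struct) +               \
				     BUILD_BUG_ON_ZERO(offsetof(       \
					     struct drv_struct,        \
					     member.obj) != 0),        \
			     IOMMUFD_OBJ_VIOMMU),                      \
		     struct drv_struct, member.obj)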

Jason

2024-05-12 14:28:32

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 03/14] iommufd: Prepare for viommu structures and functions

On Fri, Apr 12, 2024 at 08:47:00PM -0700, Nicolin Chen wrote:

> +static inline struct iommufd_object *___iommufd_object_alloc(size_t size)
> +{
> + struct iommufd_object *obj;
> +
> + obj = kzalloc(size, GFP_KERNEL_ACCOUNT);
> + if (!obj)
> + return ERR_PTR(-ENOMEM);
> +
> + /* Starts out bias'd by 1 until it is removed from the xarray */
> + refcount_set(&obj->shortterm_users, 1);
> + refcount_set(&obj->users, 1);
> +
> + /*
> + * The allocation of an obj->id needs an ictx, so it has to be done
> + * after this ___iommufd_object_alloc() callback.
> + */
> +
> + return obj;
> +}

It is probably cleaner to just make the existing allocation work with
a NULL ictx for this case? Then we can use the existing alloc
functions.

> +#define viommu_struct_alloc(name) \
> + struct iommufd_##name *_iommufd_##name##_alloc(size_t size) \
> + { \
> + struct iommufd_object *obj; \
> + if (WARN_ON(size < sizeof(struct iommufd_##name))) \
> + return NULL; \

Then here you'd just use the existing container_of based flow with the
driver sub struct name passed as the 'obj'

Jason

2024-05-12 14:28:50

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 02/14] iommufd: Swap _iommufd_object_alloc and __iommufd_object_alloc

On Fri, Apr 12, 2024 at 08:46:59PM -0700, Nicolin Chen wrote:
> Currently, the object allocation function calls:
> level-0: iommufd_object_alloc()
> level-1: ___iommufd_object_alloc()
> level-2: _iommufd_object_alloc()

Let's give __iommufd_object_alloc() a better name then

It is a less general version of iommufd_object_alloc(), maybe
iommufd_object_alloc_elm() ?

Jason

2024-05-12 14:58:52

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Fri, Apr 12, 2024 at 08:47:05PM -0700, Nicolin Chen wrote:
> Introduce a new ioctl to set a per-viommu device virtual id that should be
> linked to the physical device id (or just a struct device pointer).
>
> Since a viommu (user space IOMMU instance) can have multiple devices while
> it's not ideal to confine a device to one single user space IOMMU instance
> either, these two shouldn't just do a 1:1 mapping. Add two xarrays in their
> structures to bind them bidirectionally.

Since I would like to retain the dev_id, I think this is probably
better done with an allocated struct per dev-id:

struct dev_id {
struct iommufd_device *idev;
struct iommufd_viommu *viommu;
u64 vdev_id;
u64 driver_private; // Ie the driver can store the pSID here
struct list_head idev_entry;
};

> @@ -135,7 +135,16 @@ void iommufd_device_destroy(struct iommufd_object *obj)
> {
> struct iommufd_device *idev =
> container_of(obj, struct iommufd_device, obj);
> + struct iommufd_viommu *viommu;
> + unsigned long index;
>
> + xa_for_each(&idev->viommus, index, viommu) {
> + if (viommu->ops->unset_dev_id)
> + viommu->ops->unset_dev_id(viommu, idev->dev);
> + xa_erase(&viommu->idevs, idev->obj.id);
> + xa_erase(&idev->viommus, index);
> + }

Then this turns into list_for_each(idev->viommu_vdevid_list)
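
i.e. something roughly like this (field names are hypothetical):

	struct dev_id *cur, *next;

	list_for_each_entry_safe(cur, next, &idev->viommu_vdevid_list,
				 idev_entry) {
		if (cur->viommu->ops->unset_dev_id)
			cur->viommu->ops->unset_dev_id(cur->viommu,
						       idev->dev);
		xa_erase(&cur->viommu->vdev_ids,
			 (unsigned long)cur->vdev_id);
		list_del(&cur->idev_entry);
		kfree(cur);
	}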

> +int iommufd_viommu_set_device_id(struct iommufd_ucmd *ucmd)
> +{
> + struct iommu_viommu_set_dev_id *cmd = ucmd->cmd;
> + unsigned int dev_id, viommu_id;
> + struct iommufd_viommu *viommu;
> + struct iommufd_device *idev;
> + int rc;
> +
> + idev = iommufd_get_device(ucmd, cmd->dev_id);
> + if (IS_ERR(idev))
> + return PTR_ERR(idev);
> + dev_id = idev->obj.id;
> +
> + viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
> + if (IS_ERR(viommu)) {
> + rc = PTR_ERR(viommu);
> + goto out_put_idev;
> + }
> +
> + if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) {
> + rc = -EINVAL;
> + goto out_put_viommu;
> + }
> +
> + if (!viommu->ops->set_dev_id || !viommu->ops->unset_dev_id) {
> + rc = -EOPNOTSUPP;
> + goto out_put_viommu;
> + }
> +
> + rc = xa_alloc(&idev->viommus, &viommu_id, viommu,
> + XA_LIMIT(viommu->obj.id, viommu->obj.id),
> + GFP_KERNEL_ACCOUNT);
> + if (rc)
> + goto out_put_viommu;
> +
> + rc = xa_alloc(&viommu->idevs, &dev_id, idev,
> + XA_LIMIT(dev_id, dev_id), GFP_KERNEL_ACCOUNT);
> + if (rc)
> + goto out_xa_erase_viommu;

Both of these are API mis-uses; you don't want an allocating xarray,
you just want to use xa_cmpxchg.

Put an xarray in the viommu object and fill it with pointers to the
struct dev_id thing above

The driver can piggyback on this xarray too if it wants, so we only
need one.

xa_cmpxchg to install the user-requested vdev_id only if the vdev_id is
already unused.
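
Something along these lines (a sketch; the vdev_ids xarray and the
dev_id pointer are assumed names):

	void *old;

	old = xa_cmpxchg(&viommu->vdev_ids, (unsigned long)cmd->vdev_id,
			 NULL, dev_id, GFP_KERNEL_ACCOUNT);
	if (xa_is_err(old))
		return xa_err(old);
	if (old)
		return -EBUSY; /* vdev_id already in use */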

> @@ -51,6 +51,7 @@ enum {
> IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP,
> IOMMUFD_CMD_HWPT_INVALIDATE,
> IOMMUFD_CMD_VIOMMU_ALLOC,
> + IOMMUFD_CMD_VIOMMU_SET_DEV_ID,
> };

We almost certainly will need a REMOVE_DEV_ID so userspace can have
sensible error handling.

> +
> +/**
> + * struct iommu_viommu_set_dev_id - ioctl(IOMMU_VIOMMU_SET_DEV_ID)
> + * @size: sizeof(struct iommu_viommu_set_dev_id)
> + * @viommu_id: viommu ID to associate with the device to store its virtual ID
> + * @dev_id: device ID to set a device virtual ID
> + * @__reserved: Must be 0
> + * @id: Device virtual ID
> + *
> + * Set a viommu-specific virtual ID of a device
> + */
> +struct iommu_viommu_set_dev_id {
> + __u32 size;
> + __u32 viommu_id;
> + __u32 dev_id;
> + __u32 __reserved;
> + __aligned_u64 id;

I think I'd consistently call this vdev_id throughout the API

Jason

2024-05-12 15:02:46

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 12/14] iommufd: Add IOMMUFD_OBJ_VQUEUE and IOMMUFD_CMD_VQUEUE_ALLOC

On Fri, Apr 12, 2024 at 08:47:09PM -0700, Nicolin Chen wrote:

> +/**
> + * struct iommu_vqueue_alloc - ioctl(IOMMU_VQUEUE_ALLOC)
> + * @size: sizeof(struct iommu_vqueue_alloc)
> + * @flags: Must be 0
> + * @viommu_id: viommu ID to associate the virtual queue with
> + * @out_vqueue_id: The ID of the new virtual queue
> + * @data_type: One of enum iommu_vqueue_data_type
> + * @data_len: Length of the type specific data
> + * @data_uptr: User pointer to the type specific data
> + *
> + * Allocate a virtual queue object for a driver-specific HW-accelerated queue
> + */
> +
> +struct iommu_vqueue_alloc {
> + __u32 size;
> + __u32 flags;
> + __u32 viommu_id;
> + __u32 out_vqueue_id;
> + __u32 data_type;
> + __u32 data_len;
> + __aligned_u64 data_uptr;

Some of the iommus will want an IPA here not a user pointer. I think
it is fine API wise, we'd just add a flag to indicate data_uptr is an
IPA.
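
e.g. a hypothetical future flag, not something this series needs:

#define IOMMU_VQUEUE_FLAGS_DATA_IS_IPA (1 << 0)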

Jason

2024-05-12 15:19:35

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 13/14] iommufd: Add mmap infrastructure

On Fri, Apr 12, 2024 at 08:47:10PM -0700, Nicolin Chen wrote:
> Add support for sharing kernel pages with user space. This allows
> passing through HW resources (VCMDQ MMIO pages, for example) to a user
> space VMM and guest OS. Use vma->vm_pgoff as the carrier of a viommu_id.
>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/iommu/iommufd/main.c | 40 ++++++++++++++++++++++++++++++++++++
> include/linux/iommufd.h | 4 ++++
> 2 files changed, 44 insertions(+)
>
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index 96ef81530809..5b401c80cca8 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -16,6 +16,7 @@
> #include <linux/mutex.h>
> #include <linux/bug.h>
> #include <uapi/linux/iommufd.h>
> +#include <linux/iommu.h>
> #include <linux/iommufd.h>
>
> #include "io_pagetable.h"
> @@ -427,11 +428,50 @@ static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
> return ret;
> }
>
> +static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma)
> +{
> + struct iommufd_ctx *ictx = filp->private_data;
> + size_t size = vma->vm_end - vma->vm_start;
> + u32 viommu_id = (u32)vma->vm_pgoff;

Don't do mmaps this way, it doesn't scale well for future things.

The pgoff/length should *always* come from the kernel as some
'mmap_offset' output. I usually call this the mmap cookie.

In this case, have the mmap cookie for the tegra doorbell returned in
the viommu's driver data struct; then userspace just passes the opaque
cookie to mmap to get the correct tegra doorbell.

The core code has some simple xarray/maple tree to allocate cookies
and dispatch them to the correct mmap callback. Usually I'd say to
provide a mmap callback pointer when allocating the cookie.

Also look at the RDMA code around mmap; there is a bunch of VMA
validations needed. Ie we must insist on VM_SHARED and check
permissions, etc.
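
A rough sketch of cookie-based dispatch (every name here is
hypothetical):

struct iommufd_mmap_cookie {
	unsigned long pgoff;	/* handed to userspace as mmap_offset */
	size_t length;
	int (*mmap)(struct iommufd_ctx *ictx, struct vm_area_struct *vma);
};

static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct iommufd_ctx *ictx = filp->private_data;
	struct iommufd_mmap_cookie *cookie;

	/* Insist on shared mappings and forbid exec */
	if (!(vma->vm_flags & VM_SHARED))
		return -EINVAL;
	if (vma->vm_flags & VM_EXEC)
		return -EPERM;

	cookie = xa_load(&ictx->mmap_cookies, vma->vm_pgoff);
	if (!cookie || vma->vm_end - vma->vm_start != cookie->length)
		return -ENXIO;
	return cookie->mmap(ictx, vma);
}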

Jason

2024-05-13 02:30:15

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 02/14] iommufd: Swap _iommufd_object_alloc and __iommufd_object_alloc

On Sun, May 12, 2024 at 10:26:44AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 12, 2024 at 08:46:59PM -0700, Nicolin Chen wrote:
> > Currently, the object allocation function calls:
> > level-0: iommufd_object_alloc()
> > level-1: ___iommufd_object_alloc()
> > level-2: _iommufd_object_alloc()
>
> Let's give __iommufd_object_alloc() a better name then
>
> It is a less general version of iommufd_object_alloc(), maybe
> iommufd_object_alloc_elm() ?

With the level-3 allocator, something like the followings?

level-0: iommufd_object_alloc()
level-1: __iommufd_object_alloc()
level-2: iommufd_object_alloc_elm()
level-3: __iommufd_object_alloc_elm()

In this case, this patch will be:
"iommufd: Rename _iommufd_object_alloc to iommufd_object_alloc_elm"

Thanks
Nicolin

2024-05-13 02:35:53

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 03/14] iommufd: Prepare for viommu structures and functions

On Sun, May 12, 2024 at 10:42:12AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 12, 2024 at 08:47:00PM -0700, Nicolin Chen wrote:
>
> > +static inline struct iommufd_object *___iommufd_object_alloc(size_t size)
> > +{
> > + struct iommufd_object *obj;
> > +
> > + obj = kzalloc(size, GFP_KERNEL_ACCOUNT);
> > + if (!obj)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + /* Starts out bias'd by 1 until it is removed from the xarray */
> > + refcount_set(&obj->shortterm_users, 1);
> > + refcount_set(&obj->users, 1);
> > +
> > + /*
> > + * The allocation of an obj->id needs an ictx, so it has to be done
> > + * after this ___iommufd_object_alloc() callback.
> > + */
> > +
> > + return obj;
> > +}
>
> It is probably cleaner to just make the existing allocation work with
> a NULL ictx for this case? Then we can use the existing alloc
> functions.
>
> > +#define viommu_struct_alloc(name) \
> > + struct iommufd_##name *_iommufd_##name##_alloc(size_t size) \
> > + { \
> > + struct iommufd_object *obj; \
> > + if (WARN_ON(size < sizeof(struct iommufd_##name))) \
> > + return NULL; \
>
> Then here you'd just use the existing container_of based flow with the
> driver sub struct name passed as the 'obj'

Yes. And then we can probably unwrap this macro too since it's
no longer that verbose to duplicate in viommu/vqueue allocators.

Thanks
Nicolin

2024-05-13 03:34:34

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

On Sun, May 12, 2024 at 11:03:53AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 12, 2024 at 08:47:01PM -0700, Nicolin Chen wrote:
> > Add a new iommufd_viommu core structure to represent a vIOMMU instance in
> > the user space, typically backed by a HW-accelerated feature of an IOMMU,
> > e.g. NVIDIA CMDQ-Virtualization (an ARM SMMUv3 extension) and AMD Hardware
> > Accelerated Virtualized IOMMU (vIOMMU).
>
> I expect this will also be the only way to pass in an associated KVM,
> userspace would supply the kvm when creating the viommu.
>
> The tricky bit of this flow is how to manage the S2. It is necessary
> that the S2 be linked to the viommu:
>
> 1) ARM BTM requires the VMID to be shared with KVM
> 2) AMD and others need the S2 translation because some of the HW
> acceleration is done inside the guest address space
>
> I haven't looked closely at AMD but presumably the VIOMMU create will
> have to install the S2 into a DID or something?
>
> So we need the S2 to exist before the VIOMMU is created, but the
> drivers are going to need some more fixing before that will fully
> work.
>
> Does the nesting domain create need the viommu as well (in place of
> the S2 hwpt)? That feels sort of natural.

Yes, I had a similar thought initially: each viommu is backed by
a nested IOMMU HW, and a special HW accelerator like VCMDQ could
be treated as an extension on top of that. It might not be as
straightforward as the current design having vintf <-> viommu and
vcmdq <-> vqueue though...

In that case, we can then support viommu_cache_invalidate, which
is quite natural for SMMUv3. Yet, I recall Kevin said that VT-d
doesn't want or need that.

> There is still a lot of fixing before everything can work fully, but
> do we need to make some preparations here in the uapi? Like starting
> to thread the S2 through it as I described?
>
> Kevin, does Intel foresee any viommu needs on current/future Intel HW?
> I assume you are thinking about invalidation queue bypass like
> everyone else. I think it is an essential feature for vSVA.
>
> > A driver should embed this core structure in its driver viommu structure
> > and call the new iommufd_viommu_alloc() helper to allocate a core/driver
> > structure bundle and fill its core viommu->ops:
> > struct my_driver_viommu {
> > struct iommufd_viommu core;
> > ....
> > };
> >
> > static const struct iommufd_viommu_ops my_driver_viommu_ops = {
> > .free = my_driver_viommu_free,
> > };
> >
> > struct my_driver_viommu *my_viommu =
> > iommufd_viommu_alloc(my_driver_viommu, core);
>
> Why don't we have an ictx here anyhow? The caller has it? Just pass it
> down and then it is normal:
>
> my_viommu = iommufd_object_alloc_elm(ictx, my_viommu, IOMMUFD_OBJ_HWPT_VIOMMU, core.obj);

Oh, in that case, we probably don't need a level-3 obj allocator
that was previously missing an ictx to allocate an obj->id.

Thanks
Nicolin

2024-05-13 03:44:58

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 02/14] iommufd: Swap _iommufd_object_alloc and __iommufd_object_alloc

On Sun, May 12, 2024 at 07:29:54PM -0700, Nicolin Chen wrote:
> On Sun, May 12, 2024 at 10:26:44AM -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 12, 2024 at 08:46:59PM -0700, Nicolin Chen wrote:
> > > Currently, the object allocation function calls:
> > > level-0: iommufd_object_alloc()
> > > level-1: ___iommufd_object_alloc()
> > > level-2: _iommufd_object_alloc()
> >
> > Let's give __iommufd_object_alloc() a better name then
> >
> > It is a less general version of iommufd_object_alloc(), maybe
> > iommufd_object_alloc_elm() ?
>
> With the level-3 allocator, something like the followings?
>
> level-0: iommufd_object_alloc()
> level-1: __iommufd_object_alloc()
> level-2: iommufd_object_alloc_elm()
> level-3: __iommufd_object_alloc_elm()
>
> In this case, this patch will be:
> "iommufd: Rename _iommufd_object_alloc to iommufd_object_alloc_elm"

After reading your comments in PATCH-4, seems that we don't need
the level-3 allocator, if we pass an ictx pointer throughout the
core and driver. I will try with this first.

Thanks
Nicolin

2024-05-13 04:34:35

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 05/14] iommufd: Add IOMMUFD_OBJ_VIOMMU and IOMMUFD_CMD_VIOMMU_ALLOC

On Sun, May 12, 2024 at 11:27:45AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 12, 2024 at 08:47:02PM -0700, Nicolin Chen wrote:
> > + viommu = iommu_dev->ops->viommu_alloc(idev->dev, cmd->type,
> > + hwpt_paging->common.domain);
> > + if (IS_ERR(viommu)) {
> > + rc = PTR_ERR(viommu);
> > + goto out_put_hwpt;
> > + }
>
> Ah, you did already include the S2. So should it be
> domain->viommu_alloc() then?

We can do that. In that case, the VIOMMU_ALLOC ioctl should be
simply be per S2 HWPT too, vs. per IDEV.

> > +
> > + /* iommufd_object_finalize will store the viommu->obj.id */
> > + rc = xa_alloc(&ucmd->ictx->objects, &viommu->obj.id, XA_ZERO_ENTRY,
> > + xa_limit_31b, GFP_KERNEL_ACCOUNT);
> > + if (rc)
> > + goto out_free;
> > +
> > + viommu->obj.type = IOMMUFD_OBJ_VIOMMU;
>
> See my other notes, lets try not to open code this.

Ack.

> > + viommu->type = cmd->type;
> > +
> > + viommu->ictx = ucmd->ictx;
> > + viommu->hwpt = hwpt_paging;
> > + viommu->iommu_dev = idev->dev->iommu->iommu_dev;
> > + cmd->out_viommu_id = viommu->obj.id;
> > + rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
> > + if (rc)
> > + goto out_erase_xa;
> > + iommufd_object_finalize(ucmd->ictx, &viommu->obj);
> > + refcount_inc(&viommu->hwpt->common.obj.users);
> > + goto out_put_hwpt;
> > +
> > +out_erase_xa:
> > + xa_erase(&ucmd->ictx->objects, viommu->obj.id);
> > +out_free:
> > + if (viommu->ops && viommu->ops->free)
> > + viommu->ops->free(viommu);
> > + kfree(viommu);
>
> This really should use the abort flow. The driver free callback has to
> be in the object release..

Yea, with the original object allocator, we probably can do abort().

> > +
> > +/**
> > + * enum iommu_viommu_type - VIOMMU Type
> > + * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension for SMMUv3
> > + */
> > +enum iommu_viommu_type {
> > + IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
> > +};
>
> At least the 241 line should be in a following patch

It's for the "enum iommu_viommu_type" mentioned in the following
structure. Yi told me that you don't like an empty enum, and he
did something like this in HWPT_INVALIDATE series:
https://lore.kernel.org/linux-iommu/[email protected]/

> > +/**
> > + * struct iommu_viommu_alloc - ioctl(IOMMU_VIOMMU_ALLOC)
> > + * @size: sizeof(struct iommu_viommu_alloc)
> > + * @flags: Must be 0
> > + * @type: Type of the VIOMMU object. Must be defined in enum iommu_viommu_type
> > + * @dev_id: The device to allocate this virtual IOMMU for
> > + * @hwpt_id: ID of a nested parent HWPT
> > + * @out_viommu_id: Output virtual IOMMU ID for the allocated object
> > + *
> > > + * Allocate a virtual IOMMU object that holds a (shared) nested parent HWPT
> > + */
> > +struct iommu_viommu_alloc {
> > + __u32 size;
> > + __u32 flags;
> > + __u32 type;
> > + __u32 dev_id;
> > + __u32 hwpt_id;
> > + __u32 out_viommu_id;
> > +};
>
> This seems fine.
>
> Let's have a following patch to change the hwpt_alloc to accept the
> viommu as a hwpt as a uAPI change as well.
>
> The more I think about how this needs to work the more sure I am that
> we need to do that.
>
> ARM will need a fairly tricky set of things to manage the VMID
> lifecycle. In BTM mode the VMID must come from the KVM. For vcmdq the
> VMID is needed to create the queue/viommu. For AMD the S2 is needed to
> create the VIOMMU in the first place.
>
> So, to make this all work perfectly we need approx the following
> - S2 sharing across instances in ARM - meaning the VMID is allocated
> at attach not domain alloc
> - S2 hwpt is refcounted by the VIOMMU in the iommufd layer
> - VIOMMU is refcounted by every nesting child in the iommufd layer
> - The nesting child holds a pointer to both the S2 and the VIOMMU
> (viommu optional)
> - When the nesting child attaches to a device the STE will source the
> VMID from the VIOMMU if present otherwise from the S2
> - "RID" attach (ie naked S2) will have to be done with a Nesting
> Child using a vSTE that indicates Identity. Then the attach logic
> will have enough information to get the VMID from the VIOMMU

What is this RID attach (naked S2) case? S1DSS_BYPASS + SVA?

> - In full VIOMMU mode the S2 will never get a VMID of its own, it
> will always use the VIOMMU. Life cycle is simple, the VMID is freed
> when the VIOMMU is freed. That can't happen until all Nesting
> Children are freed. That can't happen until all Nesting Children
> are detached from devices. Detaching removes the HW touch of the VMID.

So, each VM will have one S2 HWPT/domain/iopt, but each VM can
have multiple VIOMMU instances sharing that single S2 HWPT, and
each VIOMMU instance (in the SMMU driver at least) holds a vmid.

This seems to be a quite clear big picture now!

> At this point you don't need the full generality, but let's please get
> ready and get the viommu pointer available in all the right spots and
> we can keep the current logic to borrow the VMID from the S2 for the
> VIOMMU.

Yea. Will try as much as I can.

Thanks
Nicolin

2024-05-13 04:41:50

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 12/14] iommufd: Add IOMMUFD_OBJ_VQUEUE and IOMMUFD_CMD_VQUEUE_ALLOC

On Sun, May 12, 2024 at 12:02:21PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 12, 2024 at 08:47:09PM -0700, Nicolin Chen wrote:
>
> > +/**
> > + * struct iommu_vqueue_alloc - ioctl(IOMMU_VQUEUE_ALLOC)
> > + * @size: sizeof(struct iommu_vqueue_alloc)
> > + * @flags: Must be 0
> > + * @viommu_id: viommu ID to associate the virtual queue with
> > + * @out_vqueue_id: The ID of the new virtual queue
> > + * @data_type: One of enum iommu_vqueue_data_type
> > + * @data_len: Length of the type specific data
> > + * @data_uptr: User pointer to the type specific data
> > + *
> > + * Allocate a virtual queue object for a driver-specific HW-accelerated queue
> > + */
> > +
> > +struct iommu_vqueue_alloc {
> > + __u32 size;
> > + __u32 flags;
> > + __u32 viommu_id;
> > + __u32 out_vqueue_id;
> > + __u32 data_type;
> > + __u32 data_len;
> > + __aligned_u64 data_uptr;
>
> Some of the iommus will want an IPA here not a user pointer. I think
> it is fine API wise, we'd just add a flag to indicate data_uptr is an
> IPA.

Ack.

Nicolin

2024-05-13 04:44:19

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 13/14] iommufd: Add mmap infrastructure

On Sun, May 12, 2024 at 12:19:12PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 12, 2024 at 08:47:10PM -0700, Nicolin Chen wrote:
> > Add support for sharing kernel pages with user space. This allows
> > passing through HW resources (VCMDQ MMIO pages, for example) to a user
> > space VMM and guest OS. Use vma->vm_pgoff as the carrier of a viommu_id.
> >
> > Signed-off-by: Nicolin Chen <[email protected]>
> > ---
> > drivers/iommu/iommufd/main.c | 40 ++++++++++++++++++++++++++++++++++++
> > include/linux/iommufd.h | 4 ++++
> > 2 files changed, 44 insertions(+)
> >
> > diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> > index 96ef81530809..5b401c80cca8 100644
> > --- a/drivers/iommu/iommufd/main.c
> > +++ b/drivers/iommu/iommufd/main.c
> > @@ -16,6 +16,7 @@
> > #include <linux/mutex.h>
> > #include <linux/bug.h>
> > #include <uapi/linux/iommufd.h>
> > +#include <linux/iommu.h>
> > #include <linux/iommufd.h>
> >
> > #include "io_pagetable.h"
> > @@ -427,11 +428,50 @@ static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
> > return ret;
> > }
> >
> > +static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma)
> > +{
> > + struct iommufd_ctx *ictx = filp->private_data;
> > + size_t size = vma->vm_end - vma->vm_start;
> > + u32 viommu_id = (u32)vma->vm_pgoff;
>
> Don't do mmaps this way, it doesn't scale well for future things.
>
> The pgoff/length should *always* come from the kernel as some
> 'mmap_offset' output. I usually call this the mmap cookie.
>
> In this case, have the mmap cookie for the tegra doorbell returned in
> the viommu's driver data struct; then userspace just passes the opaque
> cookie to mmap to get the correct tegra doorbell.
>
> The core code has some simple xarray/maple tree to allocate cookies
> and dispatch them to the correct mmap callback. Usually I'd say to
> provide a mmap callback pointer when allocating the cookie.
>
> Also look at the RDMA code around mmap; there is a bunch of VMA
> validations needed. Ie we must insist on VM_SHARED and check
> permissions, etc.

Yea, the vm_pgoff as a carrier is a bit of a hack, as mentioned
in the cover-letter. Let me revisit the whole thing and study
the RDMA code also.

Thanks
Nicolin

2024-05-13 05:24:38

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Sun, May 12, 2024 at 11:58:27AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 12, 2024 at 08:47:05PM -0700, Nicolin Chen wrote:
> > Introduce a new ioctl to set a per-viommu device virtual id that should be
> > linked to the physical device id (or just a struct device pointer).
> >
> > Since a viommu (user space IOMMU instance) can have multiple devices while
> > it's not ideal to confine a device to one single user space IOMMU instance
> > either, these two shouldn't just do a 1:1 mapping. Add two xarrays in their
> > structures to bind them bidirectionally.
>
> Since I would like to retain the dev_id, I think this is probably
> better done with an allocated struct per dev-id:
>
> struct dev_id {
> struct iommufd_device *idev;
> struct iommufd_viommu *viommu;
> u64 vdev_id;
> u64 driver_private; // Ie the driver can store the pSID here
> struct list_head idev_entry;
> };

Oh, I needed a better solution to store the HW slot number of a
VINTF configuration that links a pSID and a vSID, which I hacked
into struct arm_smmu_stream for now. So, with this struct dev_id,
likely I can tie it to this structure.

For a driver, the pSID is known from a device pointer, though it's
likely of no use to the iommufd core?

Also, I will address the other comments in your reply.

Thanks
Nicolin

2024-05-13 22:31:15

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 02/14] iommufd: Swap _iommufd_object_alloc and __iommufd_object_alloc

On Sun, May 12, 2024 at 07:29:37PM -0700, Nicolin Chen wrote:
> On Sun, May 12, 2024 at 10:26:44AM -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 12, 2024 at 08:46:59PM -0700, Nicolin Chen wrote:
> > > Currently, the object allocation function calls:
> > > level-0: iommufd_object_alloc()
> > > level-1: ___iommufd_object_alloc()
> > > level-2: _iommufd_object_alloc()
> >
> > Let's give __iommufd_object_alloc() a better name then
> >
> > It is a less general version of iommufd_object_alloc(), maybe
> > iommufd_object_alloc_elm() ?
>
> With the level-3 allocator, something like the followings?
>
> level-0: iommufd_object_alloc()
> level-1: __iommufd_object_alloc()
> level-2: iommufd_object_alloc_elm()
> level-3: __iommufd_object_alloc_elm()
>
> In this case, this patch will be:
> "iommufd: Rename _iommufd_object_alloc to iommufd_object_alloc_elm"

Yes

Jason

2024-05-13 22:36:40

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 12/14] iommufd: Add IOMMUFD_OBJ_VQUEUE and IOMMUFD_CMD_VQUEUE_ALLOC

On Sun, May 12, 2024 at 09:41:19PM -0700, Nicolin Chen wrote:
> On Sun, May 12, 2024 at 12:02:21PM -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 12, 2024 at 08:47:09PM -0700, Nicolin Chen wrote:
> >
> > > +/**
> > > + * struct iommu_vqueue_alloc - ioctl(IOMMU_VQUEUE_ALLOC)
> > > + * @size: sizeof(struct iommu_vqueue_alloc)
> > > + * @flags: Must be 0
> > > + * @viommu_id: viommu ID to associate the virtual queue with
> > > + * @out_vqueue_id: The ID of the new virtual queue
> > > + * @data_type: One of enum iommu_vqueue_data_type
> > > + * @data_len: Length of the type specific data
> > > + * @data_uptr: User pointer to the type specific data
> > > + *
> > > + * Allocate a virtual queue object for a driver-specific HW-accelerated queue
> > > + */
> > > +
> > > +struct iommu_vqueue_alloc {
> > > + __u32 size;
> > > + __u32 flags;
> > > + __u32 viommu_id;
> > > + __u32 out_vqueue_id;
> > > + __u32 data_type;
> > > + __u32 data_len;
> > > + __aligned_u64 data_uptr;
> >
> > Some of the iommus will want an IPA here not a user pointer. I think
> > it is fine API wise, we'd just add a flag to indicate data_uptr is an
> > IPA.
>
> Ack.

To be clear, add a flag someday, no change needed

Jason

2024-05-14 15:39:26

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 05/14] iommufd: Add IOMMUFD_OBJ_VIOMMU and IOMMUFD_CMD_VIOMMU_ALLOC

> > > +
> > > +/**
> > > + * enum iommu_viommu_type - VIOMMU Type
> > > + * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension for SMMUv3
> > > + */
> > > +enum iommu_viommu_type {
> > > + IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
> > > +};
> >
> > At least the 241 line should be in a following patch
>
> It's for the "enum iommu_viommu_type" mentioned in the following
> structure. Yi told me that you don't like an empty enum, and he
> did something like this in HWPT_INVALIDATE series:
> https://lore.kernel.org/linux-iommu/[email protected]/

I suspect 0 should be reserved as a non-set value for some
basic sanity in all these driver type enums.

Jason

> > So, to make this all work perfectly we need approx the following
> > - S2 sharing across instances in ARM - meaning the VMID is allocated
> > at attach not domain alloc
> > - S2 hwpt is refcounted by the VIOMMU in the iommufd layer
> > - VIOMMU is refcounted by every nesting child in the iommufd layer
> > - The nesting child holds a pointer to both the S2 and the VIOMMU
> > (viommu optional)
> > - When the nesting child attaches to a device the STE will source the
> > VMID from the VIOMMU if present otherwise from the S2
> > - "RID" attach (ie naked S2) will have to be done with a Nesting
> > Child using a vSTE that indicates Identity. Then the attach logic
> > will have enough information to get the VMID from the VIOMMU
>
> What is this RID attach (naked S2) case? S1DSS_BYPASS + SVA?

No, when the guest installs a vSTE that simply says bypass with no CD
table pointer. That should result in a pSTE that is the S2 with no CD
pointer.

I was originally thinking that the VMM would simply directly attach
the S2 HWPT in this case, but given the above issue with the VMID lifetime
it makes more sense to 'attach' the viommu which holds the correct
VMID.

The issue with direct attach the S2 HWPT is the VMID lifetime, as it
would have to borrow the VMID from the viommu but then the lifetime
becomes more complex as it has to live beyond VIOMMU destruction. Not
unsolvable but it seems easier to just avoid it entirely.

> > - In full VIOMMU mode the S2 will never get a VMID of its own, it
> > will always use the VIOMMU. Life cycle is simple, the VMID is freed
> > when the VIOMMU is freed. That can't happen until all Nesting
> > Children are freed. That can't happen until all Nesting Children
> > are detached from devices. Detaching removes the HW touch of the VMID.
>
> So, each VM will have one S2 HWPT/domain/iopt, but each VM can
> have multiple VIOMMU instances sharing that single S2 HWPT, and
> each VIOMMU instance (in the SMMU driver at least) holds a vmid.

Yes, right. We really want to share the S2 across instances in the end
and I have made the VMID per-instance along with the per-instance
ASID. So the above sounds like it could work

Jason

2024-05-14 15:56:31

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

On Sun, May 12, 2024 at 08:34:02PM -0700, Nicolin Chen wrote:
> On Sun, May 12, 2024 at 11:03:53AM -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 12, 2024 at 08:47:01PM -0700, Nicolin Chen wrote:
> > > Add a new iommufd_viommu core structure to represent a vIOMMU instance in
> > > the user space, typically backed by a HW-accelerated feature of an IOMMU,
> > > e.g. NVIDIA CMDQ-Virtualization (an ARM SMMUv3 extension) and AMD Hardware
> > > Accelerated Virtualized IOMMU (vIOMMU).
> >
> > I expect this will also be the only way to pass in an associated KVM,
> > userspace would supply the kvm when creating the viommu.
> >
> > The tricky bit of this flow is how to manage the S2. It is necessary
> > that the S2 be linked to the viommu:
> >
> > 1) ARM BTM requires the VMID to be shared with KVM
> > 2) AMD and others need the S2 translation because some of the HW
> > acceleration is done inside the guest address space
> >
> > I haven't looked closely at AMD but presumably the VIOMMU create will
> > have to install the S2 into a DID or something?
> >
> > So we need the S2 to exist before the VIOMMU is created, but the
> > drivers are going to need some more fixing before that will fully
> > work.
> >
> > Does the nesting domain create need the viommu as well (in place of
> > the S2 hwpt)? That feels sort of natural.
>
> Yes, I had a similar thought initially: each viommu is backed by
> a nested IOMMU HW, and a special HW accelerator like VCMDQ could
> be treated as an extension on top of that. It might not be as
> straightforward as the current design having vintf <-> viommu and
> vcmdq <-> vqueue though...

vqueue should be considered a sub object of the viommu and hold a
refcount on the viommu object for its lifetime.

> In that case, we can then support viommu_cache_invalidate, which
> is quite natural for SMMUv3. Yet, I recall Kevin said that VT-d
> doesn't want or need that.

Right, Intel currently doesn't need it, but I feel like everyone will
need this eventually as the fast invalidation path is quite important.

Jason

2024-05-15 01:20:38

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 05/14] iommufd: Add IOMMUFD_OBJ_VIOMMU and IOMMUFD_CMD_VIOMMU_ALLOC

On Tue, May 14, 2024 at 12:38:57PM -0300, Jason Gunthorpe wrote:
> > > > +
> > > > +/**
> > > > + * enum iommu_viommu_type - VIOMMU Type
> > > > + * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension for SMMUv3
> > > > + */
> > > > +enum iommu_viommu_type {
> > > > + IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
> > > > +};
> > >
> > > At least the 241 line should be in a following patch
> >
> > It's for the "enum iommu_viommu_type" mentioned in the following
> > structure. Yi told me that you don't like an empty enum, and he
> > did something like this in HWPT_INVALIDATE series:
> > https://lore.kernel.org/linux-iommu/[email protected]/
>
> I suspect 0 should be reserved as a non-set value for some
> basic sanity in all these driver type enums.

We have an IOMMU_HWPT_DATA_NONE for HWPT_ALLOC to be compatible
with an S2 hwpt, since it doesn't need any data.

Maybe we can have an IOMMU_VIOMMU_TYPE_DEFAULT to be 0, for
an IOMMU driver (e.g. VT-d) that doesn't need to handle nor
be aware of any viommu object?
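
i.e. roughly (a sketch):

enum iommu_viommu_type {
	IOMMU_VIOMMU_TYPE_DEFAULT = 0,
	IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV = 1,
};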

So, a VMM can have a unified "attach-to-viommu" practice with
different IOMMUs, vs. some still doing "attach-to-s2"?

> > > So, to make this all work perfectly we need approx the following
> > > - S2 sharing across instances in ARM - meaning the VMID is allocated
> > > at attach not domain alloc
> > > - S2 hwpt is refcounted by the VIOMMU in the iommufd layer
> > > - VIOMMU is refcounted by every nesting child in the iommufd layer
> > > - The nesting child holds a pointer to both the S2 and the VIOMMU
> > > (viommu optional)
> > > - When the nesting child attaches to a device the STE will source the
> > > VMID from the VIOMMU if present otherwise from the S2
> > > - "RID" attach (ie naked S2) will have to be done with a Nesting
> > > Child using a vSTE that indicates Identity. Then the attach logic
> > > will have enough information to get the VMID from the VIOMMU
> >
> > What is this RID attach (naked S2) case? S1DSS_BYPASS + SVA?
>
> No, when the guest installs a vSTE that simply says bypass with no CD
> table pointer. That should result in a pSTE that is the S2 with no CD
> pointer.
>
> I was originally thinking that the VMM would simply directly attach
> the S2 HWPT in this case, but given the above issue with the VMID lifetime
> it makes more sense to 'attach' the viommu which holds the correct
> VMID.
>
> The issue with direct attach the S2 HWPT is the VMID lifetime, as it
> would have to borrow the VMID from the viommu but then the lifetime
> becomes more complex as it has to live beyond VIOMMU destruction. Not
> unsolvable but it seems easier to just avoid it entirely.

That makes a lot of sense. I'd need to go through the QEMU code and
see how we will accommodate these two more naturally: likely
the QEMU core should allocate an S2 HWPT for a VM, while the
viommu code should allocate a VIOMMU for each instance.

Thanks
Nicolin

2024-05-21 18:06:17

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 05/14] iommufd: Add IOMMUFD_OBJ_VIOMMU and IOMMUFD_CMD_VIOMMU_ALLOC

On Tue, May 14, 2024 at 06:20:06PM -0700, Nicolin Chen wrote:
> > I suspect 0 should be reserved as a non-set value for some
> > basic sanity in all these driver type enums.
>
> We have an IOMMU_HWPT_DATA_NONE for HWPT_ALLOC to be compatible
> with an S2 hwpt, since it doesn't need any data.
>
> Maybe we can have an IOMMU_VIOMMU_TYPE_DEFAULT to be 0, for
> an IOMMU driver (e.g. VT-d) that doesn't need to handle nor
> be aware of any viommu object?

Seems like a good practice, and perhaps userspace will find value in a
generic viommu object that is always present.

> That makes a lot of sense. I'd need to go through the QEMU code and
> see how we will accommodate these two more naturally: likely
> the QEMU core should allocate an S2 HWPT for a VM, while the
> viommu code should allocate a VIOMMU for each instance.

I'd suggest that core qemu should allocate the S2 IOAS and pass that
to the qemu viommu driver

The qemu viommu driver should create the hwpt and then the viommu and
perhaps return the viommu or hwpt back to the core code.

If the vSTE flow above is used for identity then the qemu viommu
> driver would also have to go and create vSTEs for identity and attach
them to all devices before the VM starts up. Then when the OS
activates the SMMU it would have to mirror the real vSTE from guest
memory to the kernel.

Not sure there is value in having the core qemu code directly access
the hwpt/viommu?

Jason

2024-05-21 18:30:31

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Thu, May 16, 2024 at 10:14:38PM -0700, Nicolin Chen wrote:
> I implemented it with a small tweak, to align with viommu_alloc
> and vqueue_alloc:
>
> // core
> struct iommufd_vdev_id {
> struct iommufd_viommu *viommu;
> struct device *dev;
> u64 vdev_id;
> struct list_head idev_item;
> };
>
> // driver
> struct my_driver_vdev_id {
> struct iommufd_vdev_id core;
> unsigned int private_attrs;
> };
>
> static struct iommufd_vdev_id *
> my_driver_set_vdev_id(struct iommufd_viommu *viommu,
> struct device *dev, u64 id)
> {
> struct my_driver_vdev_id *my_vdev_id;
>
> my_vdev_id = kzalloc(sizeof(*my_vdev_id), GFP_KERNEL);
> .... /* set private_attrs */
> return &my_vdev_id->core;
> }
>
> static void my_driver_unset_vdev_id(struct iommufd_vdev_id *vdev_id)
> {
> struct my_driver_vdev_id *my_vdev_id =
> container_of(vdev_id, struct my_driver_vdev_id, core);
> .... /* unset private_attrs */
> }
>
> Please let me know if you like it inverted as you wrote above.

Seems like a reasonable direction

> I also moved xa_cmpxchg to the driver, as I feel that the vdev_id
> here is very driver/iommu specific. There can be some complication
> if iommufd core handles this u64 vdev_id: most likely we will use
> this u64 vdev_id to index the xarray that takes an unsigned-long
> xa_index for a fast vdev_id-to-pdev_id lookup, while only a driver
> knows whether u64 vdev_id is compatible with unsigned long or not.

This seems like the most general part.. The core code should just have
a check like:

if (vdevid >= ULONG_MAX) return -EINVAL;

And if someone wants to make 32 bit kernels support a 64bit vdevid
then they can sort it out someday :) I think this is not a big issue
as all iommus seem to have some kind of radix lookup for vdevid and
want it to be small.

Matthew has been talking about support for 64bit indexes in
maple/xarray or something for a bit so it might sort itself out.
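
As a sketch of that core-side path (the vdev_ids xarray and the helper
name are illustrative, building on the structs above):

static int iommufd_viommu_store_vdev_id(struct iommufd_viommu *viommu,
                                        struct iommufd_vdev_id *vdev_id)
{
        void *old;

        /* xa_index is an unsigned long; reject ids that don't fit
         * (a no-op on 64-bit) */
        if (vdev_id->vdev_id > ULONG_MAX)
                return -EINVAL;

        old = xa_cmpxchg(&viommu->vdev_ids, vdev_id->vdev_id, NULL,
                         vdev_id, GFP_KERNEL);
        if (xa_is_err(old))
                return xa_err(old);
        return old ? -EBUSY : 0; /* -EBUSY: vdev_id already in use */
}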

> And, we have a list_head in the structure idev, so a device unbind
> will for-each the list and unset all the vdev_ids in it, meanwhile
> the viommu stays. I wonder if we need to add another list_head in
> the structure viommu, so a viommu tear down will for-each its list
> and unset all the vdev_ids on its side while a device (idev) stays.
> I don't see a use case of that though..any thought?

I think you need to support viommu teardown, at least from a
correctness perspective. The API permits it. But isn't it just
list_del everything in the xarray and that will clean it up enough?
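
Roughly, as a sketch reusing the illustrative vdev_ids xarray from
above:

static void iommufd_viommu_destroy_vdev_ids(struct iommufd_viommu *viommu)
{
        struct iommufd_vdev_id *vdev_id;
        unsigned long index;

        xa_for_each(&viommu->vdev_ids, index, vdev_id) {
                xa_erase(&viommu->vdev_ids, index);
                list_del(&vdev_id->idev_item);       /* unlink from the idev */
                viommu->ops->unset_vdev_id(vdev_id); /* driver frees its struct */
        }
}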

Jason

2024-05-22 00:14:16

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 05/14] iommufd: Add IOMMUFD_OBJ_VIOMMU and IOMMUFD_CMD_VIOMMU_ALLOC

On Tue, May 21, 2024 at 03:05:55PM -0300, Jason Gunthorpe wrote:
> On Tue, May 14, 2024 at 06:20:06PM -0700, Nicolin Chen wrote:
> > > I suspect 0 should be reserved as a non-set value for some
> > > basic sanity in all these driver type enums.
> >
> > We have an IOMMU_HWPT_DATA_NONE for HWPT_ALLOC to compatible
> > with an S2 hwpt, since it doesn't need a data.
> >
> > Maybe we can have an IOMMU_VIOMMU_TYPE_DEFAULT to be 0, for
> > an IOMMU driver (e.g. VT-d) that doesn't need to handle nor
> > be aware of any viommu object?
>
> Seems like a good practice, and perhaps userspace will find value in a
> generic viommu object that is always present.

Yea. VMM is always allowed to create a viommu to wrap an S2
HWPT. Then, I assume iommufd in this case should allocate a
viommu object if !domain_ops->viommu_alloc.

> > That makes a lot sense. I'd need to go through QEMU code and
> > see how we will accommodate these two more naturally: likely
> > the QEMU core should allocate an S2 HWPT for a VM, while the
> > viommu code should allocate a VIOMMU for each instance.
>
> I'd suggest that core qemu should allocate the S2 IOAS and pass that
> to the qemu viommu driver
>
> The qemu viommu driver should create the hwpt and then the viommu and
> perhaps return the viommu or hwpt back to the core code.
>
> If the vSTE flow above is used for identity then the qemu viommu
> driver would also have to go an create vSTEs for identity and attach
> them to all devices before the VM starts up. Then when the OS
> activates the SMMU it would have to mirror the real vSTE from guest
> memory to the kernel.

The entire flow makes sense to me.

> Not sure there is value in having the core qemu code directly access
> the hwpt/viommu?

I think so, though there might be some complication here.

On one side, it may not be straightforward for a qemu viommu
driver to hold a shared S2 hwpt, as the driver is typically
per instance, though I think it can keep the viommu to its own.
So passing the S2 hwpt back to the qemu core and tying it to the
iommufd handle (ictx) makes sense.

On the other side, there can be some future HW potentially
supporting two+ kinds of IO page tables, so a VM may have two+
S2 hwpts? Then the core would hold a list of S2 hwpts and the
viommu driver would need to try-and-allocate a viommu against the
list..

Thanks
Nicolin

2024-05-22 08:40:19

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 00/14] Add Tegra241 (Grace) CMDQV Support (part 2/2)

> From: Nicolin Chen <[email protected]>
> Sent: Saturday, April 13, 2024 11:47 AM
>
> This is an experimental RFC series for VIOMMU infrastructure, using NVIDIA
> Tegra241 (Grace) CMDQV as a test instance.
>
> VIOMMU obj is used to represent a virtual interface (iommu) backed by an
> underlying IOMMU's HW-accelerated feature for virtualizaion: for example,
> NVIDIA's VINTF (v-interface for CMDQV) and AMD"s vIOMMU.
>
> VQUEUE obj is used to represent a virtual command queue (buffer) backed
> by
> an underlying IOMMU command queue to passthrough for VMs to use
> directly:
> for example, NVIDIA's Virtual Command Queue and AMD's Command Buffer.
>

is VCMDQ more accurate? AMD also supports fault queue passthrough,
so VQUEUE sounds broader than a cmd queue...

2024-05-22 08:59:06

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, May 14, 2024 11:56 PM
>
> On Sun, May 12, 2024 at 08:34:02PM -0700, Nicolin Chen wrote:
> > On Sun, May 12, 2024 at 11:03:53AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Apr 12, 2024 at 08:47:01PM -0700, Nicolin Chen wrote:
> > > > Add a new iommufd_viommu core structure to represent a vIOMMU
> instance in
> > > > the user space, typically backed by a HW-accelerated feature of an
> IOMMU,
> > > > e.g. NVIDIA CMDQ-Virtualization (an ARM SMMUv3 extension) and
> AMD Hardware
> > > > Accelerated Virtualized IOMMU (vIOMMU).
> > >
> > > I expect this will also be the only way to pass in an associated KVM,
> > > userspace would supply the kvm when creating the viommu.
> > >
> > > The tricky bit of this flow is how to manage the S2. It is necessary
> > > that the S2 be linked to the viommu:
> > >
> > > 1) ARM BTM requires the VMID to be shared with KVM
> > > 2) AMD and others need the S2 translation because some of the HW
> > > acceleration is done inside the guest address space
> > >
> > > I haven't looked closely at AMD but presumably the VIOMMU create will
> > > have to install the S2 into a DID or something?
> > >
> > > So we need the S2 to exist before the VIOMMU is created, but the
> > > drivers are going to need some more fixing before that will fully
> > > work.

Can you elaborate on this point? VIOMMU is a dummy container when
it's created, and the association to the S2 becomes relevant only when
a VQUEUE is created inside and linked to a device? Then there should be
a window in between allowing the userspace to configure the S2.

I'm not arguing against setting the S2 up before vIOMMU creation. Just
want to better understand the rationale here.

> > >
> > > Does the nesting domain create need the viommu as well (in place of
> > > the S2 hwpt)? That feels sort of natural.
> >
> > Yes, I had a similar thought initially: each viommu is backed by
> > a nested IOMMU HW, and a special HW accelerator like VCMDQ could
> > be treated as an extension on top of that. It might not be very
> > straightforward like the current design having vintf<->viommu and
> > vcmdq <-> vqueue though...
>
> vqueue should be considered a sub object of the viommu and hold a
> refcount on the viommu object for its lifetime.
>
> > In that case, we can then support viommu_cache_invalidate, which
> > is quite natural for SMMUv3. Yet, I recall Kevin said that VT-d
> > doesn't want or need that.
>
> Right, Intel currently doesn't need it, but I feel like everyone will
> need this eventually as the fast invalidation path is quite important.
>

yes, there is no need but I don't see any harm in preparing for such
an extension on VT-d. Logically it's clearer, e.g. if we decide to move
device TLB invalidation to a separate uAPI then vIOMMU is certainly
a clearer object to carry it. And hardware extensions really look like
optimizations on software implementations.

And we do need to make a decision now, given that if we make vIOMMU
a generic object for all vendors it may have a potential impact on
the user page fault support which Baolu is working on. The so-called
fault object will be contained in the vIOMMU, which is software managed
on VT-d/SMMU but passed through on AMD. And probably we don't
need another handle mechanism in the attach path, supposing the
vIOMMU object already contains the necessary information to find the
iommufd_object for a reported fault.

Baolu, your thoughts?

2024-05-22 13:41:21

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

On Wed, May 22, 2024 at 08:58:34AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Tuesday, May 14, 2024 11:56 PM
> >
> > On Sun, May 12, 2024 at 08:34:02PM -0700, Nicolin Chen wrote:
> > > On Sun, May 12, 2024 at 11:03:53AM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Apr 12, 2024 at 08:47:01PM -0700, Nicolin Chen wrote:
> > > > > Add a new iommufd_viommu core structure to represent a vIOMMU
> > instance in
> > > > > the user space, typically backed by a HW-accelerated feature of an
> > IOMMU,
> > > > > e.g. NVIDIA CMDQ-Virtualization (an ARM SMMUv3 extension) and
> > AMD Hardware
> > > > > Accelerated Virtualized IOMMU (vIOMMU).
> > > >
> > > > I expect this will also be the only way to pass in an associated KVM,
> > > > userspace would supply the kvm when creating the viommu.
> > > >
> > > > The tricky bit of this flow is how to manage the S2. It is necessary
> > > > that the S2 be linked to the viommu:
> > > >
> > > > 1) ARM BTM requires the VMID to be shared with KVM
> > > > 2) AMD and others need the S2 translation because some of the HW
> > > > acceleration is done inside the guest address space
> > > >
> > > > I haven't looked closely at AMD but presumably the VIOMMU create will
> > > > have to install the S2 into a DID or something?
> > > >
> > > > So we need the S2 to exist before the VIOMMU is created, but the
> > > > drivers are going to need some more fixing before that will fully
> > > > work.
>
> Can you elaborate on this point? VIOMMU is a dummy container when
> it's created and the association to S2 comes relevant only until when
> VQUEUE is created inside and linked to a device?

VIOMMU contains:
- A nesting parent
- A KVM
- Any global per-VM data the driver needs
* In ARM case this is VMID, sometimes shared with KVM
 * In the AMD case this will allocate memory in the
   "viommu backing storage memory"

Objects can be created on top of a VIOMMU:
- A nested HWPT (iommu_hwpt_alloc::pt_id can be a viommu)
- A vqueue (ARM/AMD)
- Other AMD virtualized objects (EventLog, PPRLog)

It is desirable to keep the VIOMMU linked to only a single nesting
parent that never changes. Given it seems to be a small ask to
allocate the nesting parent before the VIOMMU, providing it at VIOMMU
creation time looks like it will simplify the drivers because they can
rely on it always existing and never changing.

I think this lends itself to a logical layered VMM design..

- If VFIO is being used get an iommufd
- Allocate an IOAS for the entire guest physical address space
- Determine the vIOMMU driver to use
- Allocate a HWPT for use by all the vIOMMU instances
- Allocate a VIOMMU per vIOMMU instance

On ARM the S2 is not divorced from the VIOMMU, ARM requires a single
VMID, shared with KVM, and localized to a single VM for some of the
bypass features (vBTM, vCMDQ). So to attach a S2 you actually have to
attach the VIOMMU to pick up the correct VMID.

I imagine something like this:
hwpt_alloc(deva, nesting_parent=true) = shared_s2
viommu_alloc(deva, shared_s2) = viommu1
viommu_alloc(devb, shared_s2) = viommu2
hwpt_alloc(deva, viommu1, vste) = deva_vste
hwpt_alloc(devb, viommu2, vste) = devb_vste
attach(deva, deva_vste)
attach(devb, devb_vste)
attach(devc, shared_s2)

The driver will then know it should program three different VMIDs for
the same S2 page table, which matches the ARM expectation for
VMID. That is to say we'd pass in the viommu as the pt_id for the
iommu_hwpt_alloc. The viommu would imply both the S2 page table and
any meta information like VMID the driver needs.
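
In driver terms, that could mean the viommu, rather than the S2
domain, owns the VMID; a sketch with illustrative names:

/*
 * Sketch only: the viommu owns the VMID, so multiple viommus can
 * share one S2 page table while each installs its own VMID into
 * the STEs it produces.
 */
struct arm_smmu_viommu {
        struct iommufd_viommu core;
        struct arm_smmu_domain *s2_parent; /* the shared S2 page table */
        u16 vmid;                          /* unique per vSMMUv3 instance */
};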

Both AMD and the vCMDQ thing need to translate some PFNs through the
S2 and program them elsewhere, this is manually done by SW, and there
are three choices I guess:
- Have the VMM do it and provide void __user * to the driver
- Have the driver do it through the S2 directly and track
S2 invalidations
- Have the driver open an access on the IOAS and use the access unmap

Not sure which is the best..
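
For the first option, the driver side could be as simple as pinning
the VMM-provided VA (a sketch; the helper name is illustrative), which
avoids tracking S2 invalidations at the cost of a second translation
path:

/* The caller provides a pages[] array big enough for the whole queue */
static int vqueue_pin_pages(void __user *uptr, size_t size,
                            struct page **pages)
{
        int npages = PAGE_ALIGN(size) >> PAGE_SHIFT;
        int ret;

        ret = pin_user_pages_fast((unsigned long)uptr, npages,
                                  FOLL_WRITE | FOLL_LONGTERM, pages);
        if (ret == npages)
                return 0;
        if (ret > 0)
                unpin_user_pages(pages, ret);
        return ret < 0 ? ret : -EFAULT;
}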

> > Right, Intel currently doesn't need it, but I feel like everyone will
> > need this eventually as the fast invalidation path is quite important.
>
> yes, there is no need but I don't see any harm of preparing for such
> extension on VT-d. Logically it's clearer, e.g. if we decide to move
> device TLB invalidation to a separate uAPI then vIOMMU is certainly
> a clearer object to carry it. and hardware extensions really looks like
> optimization on software implementations.
>
> and we do need make a decision now, given if we make vIOMMU as
> a generic object for all vendors it may have potential impact on
> the user page fault support which Baolu is working on.

> the so-called
> fault object will be contained in vIOMMU, which is software managed
> on VT-d/SMMU but passed through on AMD.

Hmm, given we currently have no known hardware entanglement between
PRI and VIOMMU it does seem OK for PRI to just exist separately for
now. If someone needs them linked someday we can add a viommu_id to
the create pri queue command.

> And probably we don't need another handle mechanism in the attach
> path, suppose the vIOMMU object already contains necessary
> information to find out iommufd_object for a reported fault.

The viommu might be useful to have the kernel return the vRID instead
of the dev_id in the fault messages. I'm not sure how valuable this
is..

Jason

2024-05-22 14:03:33

by Lu Baolu

[permalink] [raw]
Subject: Re: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

On 2024/5/22 16:58, Tian, Kevin wrote:
>> From: Jason Gunthorpe<[email protected]>
>> Sent: Tuesday, May 14, 2024 11:56 PM
>>
>> On Sun, May 12, 2024 at 08:34:02PM -0700, Nicolin Chen wrote:
>>> On Sun, May 12, 2024 at 11:03:53AM -0300, Jason Gunthorpe wrote:
>>>> On Fri, Apr 12, 2024 at 08:47:01PM -0700, Nicolin Chen wrote:
>>>>> Add a new iommufd_viommu core structure to represent a vIOMMU
>> instance in
>>>>> the user space, typically backed by a HW-accelerated feature of an
>> IOMMU,
>>>>> e.g. NVIDIA CMDQ-Virtualization (an ARM SMMUv3 extension) and
>> AMD Hardware
>>>>> Accelerated Virtualized IOMMU (vIOMMU).
>>>> I expect this will also be the only way to pass in an associated KVM,
>>>> userspace would supply the kvm when creating the viommu.
>>>>
>>>> The tricky bit of this flow is how to manage the S2. It is necessary
>>>> that the S2 be linked to the viommu:
>>>>
>>>> 1) ARM BTM requires the VMID to be shared with KVM
>>>> 2) AMD and others need the S2 translation because some of the HW
>>>> acceleration is done inside the guest address space
>>>>
>>>> I haven't looked closely at AMD but presumably the VIOMMU create will
>>>> have to install the S2 into a DID or something?
>>>>
>>>> So we need the S2 to exist before the VIOMMU is created, but the
>>>> drivers are going to need some more fixing before that will fully
>>>> work.
> Can you elaborate on this point? VIOMMU is a dummy container when
> it's created and the association to S2 comes relevant only until when
> VQUEUE is created inside and linked to a device? then there should be
> a window in between allowing the userspace to configure S2.
>
> Not saying against setting S2 up before vIOMMU creation. Just want
> to better understand the rationale here.
>
>>>> Does the nesting domain create need the viommu as well (in place of
>>>> the S2 hwpt)? That feels sort of natural.
>>> Yes, I had a similar thought initially: each viommu is backed by
>>> a nested IOMMU HW, and a special HW accelerator like VCMDQ could
>>> be treated as an extension on top of that. It might not be very
>>> straightforward like the current design having vintf<->viommu and
>>> vcmdq <-> vqueue though...
>> vqueue should be considered a sub object of the viommu and hold a
>> refcount on the viommu object for its lifetime.
>>
>>> In that case, we can then support viommu_cache_invalidate, which
>>> is quite natural for SMMUv3. Yet, I recall Kevin said that VT-d
>>> doesn't want or need that.
>> Right, Intel currently doesn't need it, but I feel like everyone will
>> need this eventually as the fast invalidation path is quite important.
>>
> yes, there is no need but I don't see any harm of preparing for such
> extension on VT-d. Logically it's clearer, e.g. if we decide to move
> device TLB invalidation to a separate uAPI then vIOMMU is certainly
> a clearer object to carry it. and hardware extensions really looks like
> optimization on software implementations.
>
> and we do need make a decision now, given if we make vIOMMU as
> a generic object for all vendors it may have potential impact on
> the user page fault support which Baolu is working on. the so-called
> fault object will be contained in vIOMMU, which is software managed
> on VT-d/SMMU but passed through on AMD. And probably we don't
> need another handle mechanism in the attach path, suppose the
> vIOMMU object already contains necessary information to find out
> iommufd_object for a reported fault.

Yes, if the vIOMMU object tracks all iommufd devices that it manages.

Best regards,
baolu

2024-05-22 16:46:42

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 05/14] iommufd: Add IOMMUFD_OBJ_VIOMMU and IOMMUFD_CMD_VIOMMU_ALLOC

On Tue, May 21, 2024 at 05:13:50PM -0700, Nicolin Chen wrote:
> Yea. VMM is always allowed to create a viommu to wrap an S2
> HWPT. Then, I assume iommufd in this case should allocate a
> viommu object if !domain_ops->viommu_alloc.

Yeah

> On one side, it may not be straightforward for a qemu viommu
> driver to hold a shared S2 hwpt, as the driver is typically
> per instance, though I think it can keep viommu to its own.
> So passing the S2 hwpt back to qemu core and tie to iommufd
> handler (ictx) makes sense.

Yes, qemu will need some per-driver-type but not per-instance storage
to make this work. Ie the ARM per-driver-type shared storage would
hold the ARM specific list of S2 hwpts.

> On the other side, there can be some future HW potentially
> supporting two+ kinds of IO page tables so a VM may have two+
> S2 hwpts? Then the core would hold a list of S2 hwpts and the
> viommu driver would need to try-n-allocate viommu against the
> list..

Yes, it is supported in the API. Userspace should try to create
viommus with all the S2 hwpts available and build a new one if it
can't, just like hwpt attachment to a device.
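
In VMM pseudo-code (viommu_alloc() and hwpt_alloc_nesting_parent()
are illustrative wrappers around the alloc ioctls, not real APIs):

static int vmm_viommu_for_instance(int iommufd, uint32_t dev_id,
                                   uint32_t *s2_hwpts, int num_s2)
{
        int i, viommu_id;

        for (i = 0; i < num_s2; i++) {
                viommu_id = viommu_alloc(iommufd, dev_id, s2_hwpts[i]);
                if (viommu_id >= 0)
                        return viommu_id; /* found a compatible S2 */
        }
        /* No compatible S2 hwpt: build a new nesting parent and retry */
        return viommu_alloc(iommufd, dev_id,
                            hwpt_alloc_nesting_parent(iommufd, dev_id));
}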

Jason

2024-05-22 16:48:36

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 00/14] Add Tegra241 (Grace) CMDQV Support (part 2/2)

On Wed, May 22, 2024 at 08:40:00AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <[email protected]>
> > Sent: Saturday, April 13, 2024 11:47 AM
> >
> > This is an experimental RFC series for VIOMMU infrastructure, using NVIDIA
> > Tegra241 (Grace) CMDQV as a test instance.
> >
> > VIOMMU obj is used to represent a virtual interface (iommu) backed by an
> > underlying IOMMU's HW-accelerated feature for virtualizaion: for example,
> > NVIDIA's VINTF (v-interface for CMDQV) and AMD"s vIOMMU.
> >
> > VQUEUE obj is used to represent a virtual command queue (buffer) backed
> > by
> > an underlying IOMMU command queue to passthrough for VMs to use
> > directly:
> > for example, NVIDIA's Virtual Command Queue and AMD's Command Buffer.
> >
>
> is VCMDQ more accurate? AMD also supports fault queue passthrough
> then VQUEUE sounds broader than a cmd queue...

Is there a reason VQUEUE couldn't handle the fault/etc queues too? The
only difference is direction, there is still a doorbell/etc.

Jason

2024-05-22 19:48:00

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 00/14] Add Tegra241 (Grace) CMDQV Support (part 2/2)

On Wed, May 22, 2024 at 01:48:18PM -0300, Jason Gunthorpe wrote:
> On Wed, May 22, 2024 at 08:40:00AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <[email protected]>
> > > Sent: Saturday, April 13, 2024 11:47 AM
> > >
> > > This is an experimental RFC series for VIOMMU infrastructure, using NVIDIA
> > > Tegra241 (Grace) CMDQV as a test instance.
> > >
> > > VIOMMU obj is used to represent a virtual interface (iommu) backed by an
> > > underlying IOMMU's HW-accelerated feature for virtualizaion: for example,
> > > NVIDIA's VINTF (v-interface for CMDQV) and AMD"s vIOMMU.
> > >
> > > VQUEUE obj is used to represent a virtual command queue (buffer) backed
> > > by
> > > an underlying IOMMU command queue to passthrough for VMs to use
> > > directly:
> > > for example, NVIDIA's Virtual Command Queue and AMD's Command Buffer.
> > >
> >
> > is VCMDQ more accurate? AMD also supports fault queue passthrough
> > then VQUEUE sounds broader than a cmd queue...
>
> Is there a reason VQUEUE couldn't handle the fault/etc queues too? The
> only difference is direction, there is still a doorbell/etc.

Yea, SMMU also has an Event Queue and a PRI queue. Though I haven't
had time to sit down and look at Baolu's work closely, the uAPI
seems to be a unified one for all IOMMUs. And though I have no
intention of arguing against that design, maybe there could be
an alternative in a somewhat HW-specific language, as we do for
invalidation? Or is it not worth it?

Thanks
Nicolin

2024-05-22 23:29:36

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 00/14] Add Tegra241 (Grace) CMDQV Support (part 2/2)

On Wed, May 22, 2024 at 12:47:19PM -0700, Nicolin Chen wrote:
> On Wed, May 22, 2024 at 01:48:18PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 22, 2024 at 08:40:00AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <[email protected]>
> > > > Sent: Saturday, April 13, 2024 11:47 AM
> > > >
> > > > This is an experimental RFC series for VIOMMU infrastructure, using NVIDIA
> > > > Tegra241 (Grace) CMDQV as a test instance.
> > > >
> > > > VIOMMU obj is used to represent a virtual interface (iommu) backed by an
> > > > underlying IOMMU's HW-accelerated feature for virtualizaion: for example,
> > > > NVIDIA's VINTF (v-interface for CMDQV) and AMD"s vIOMMU.
> > > >
> > > > VQUEUE obj is used to represent a virtual command queue (buffer) backed
> > > > by
> > > > an underlying IOMMU command queue to passthrough for VMs to use
> > > > directly:
> > > > for example, NVIDIA's Virtual Command Queue and AMD's Command Buffer.
> > > >
> > >
> > > is VCMDQ more accurate? AMD also supports fault queue passthrough
> > > then VQUEUE sounds broader than a cmd queue...
> >
> > Is there a reason VQUEUE couldn't handle the fault/etc queues too? The
> > only difference is direction, there is still a doorbell/etc.
>
> Yea, SMMU also has Event Queue and PRI queue. Though I haven't
> got time to sit down to look at Baolu's work closely, the uAPI
> seems to be a unified one for all IOMMUs. And though I have no
> intention to be against that design, yet maybe there could be
> an alternative in a somewhat HW specific language as we do for
> invalidation? Or not worth it?

I was thinking it's not worth it; I expect the gain here is to do as AMD
has done and make the HW dma the queues directly to guest memory.

IMHO the primary issue with the queues is DOS, as having any shared
queue across VMs is dangerous in that way. Allowing each VIOMMU to
have its own private queue and own flow control helps with that.

Jason

2024-05-22 23:44:06

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 00/14] Add Tegra241 (Grace) CMDQV Support (part 2/2)

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, May 23, 2024 7:29 AM
>
> On Wed, May 22, 2024 at 12:47:19PM -0700, Nicolin Chen wrote:
> > On Wed, May 22, 2024 at 01:48:18PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 22, 2024 at 08:40:00AM +0000, Tian, Kevin wrote:
> > > > > From: Nicolin Chen <[email protected]>
> > > > > Sent: Saturday, April 13, 2024 11:47 AM
> > > > >
> > > > > This is an experimental RFC series for VIOMMU infrastructure, using
> NVIDIA
> > > > > Tegra241 (Grace) CMDQV as a test instance.
> > > > >
> > > > > VIOMMU obj is used to represent a virtual interface (iommu) backed
> by an
> > > > > underlying IOMMU's HW-accelerated feature for virtualizaion: for
> example,
> > > > > NVIDIA's VINTF (v-interface for CMDQV) and AMD"s vIOMMU.
> > > > >
> > > > > VQUEUE obj is used to represent a virtual command queue (buffer)
> backed
> > > > > by
> > > > > an underlying IOMMU command queue to passthrough for VMs to
> use
> > > > > directly:
> > > > > for example, NVIDIA's Virtual Command Queue and AMD's Command
> Buffer.
> > > > >
> > > >
> > > > is VCMDQ more accurate? AMD also supports fault queue passthrough
> > > > then VQUEUE sounds broader than a cmd queue...
> > >
> > > Is there a reason VQUEUE couldn't handle the fault/etc queues too? The
> > > only difference is direction, there is still a doorbell/etc.

No reason. The description made it specific to a cmd queue, which
gave me the impression that we may want to create a separate
fault queue.

> >
> > Yea, SMMU also has Event Queue and PRI queue. Though I haven't
> > got time to sit down to look at Baolu's work closely, the uAPI
> > seems to be a unified one for all IOMMUs. And though I have no
> > intention to be against that design, yet maybe there could be
> > an alternative in a somewhat HW specific language as we do for
> > invalidation? Or not worth it?
>
> I was thinking not worth it, I expect a gain here is to do as AMD has
> done and make the HW dma the queues directly to guest memory.
>
> IMHO the primary issue with the queues is DOS, as having any shared
> queue across VMs is dangerous in that way. Allowing each VIOMMU to
> have its own private queue and own flow control helps with that.
>

and also a shorter delivery path with less data copying?

2024-05-23 01:44:02

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, May 22, 2024 9:39 PM
>
> On Wed, May 22, 2024 at 08:58:34AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Tuesday, May 14, 2024 11:56 PM
> > >
> > > > > So we need the S2 to exist before the VIOMMU is created, but the
> > > > > drivers are going to need some more fixing before that will fully
> > > > > work.
> >
> > Can you elaborate on this point? VIOMMU is a dummy container when
> > it's created and the association to S2 comes relevant only until when
> > VQUEUE is created inside and linked to a device?
>
> VIOMMU contains:
> - A nesting parent
> - A KVM
> - Any global per-VM data the driver needs
> * In ARM case this is VMID, sometimes shared with KVM

In which case is it not shared with KVM? I had the impression that
VMID always comes from KVM in this VCMDQ usage.

> * In AMD case this is will allocate memory in the
> "viommu backing storage memory"
>
> Objects can be created on top of a VIOMMU:
> - A nested HWPT (iommu_hwpt_alloc::pt_id can be a viommu)
> - A vqueue (ARM/AMD)
> - Other AMD virtualized objects (EventLog, PPRLog)
>
> It is desirable to keep the VIOMMU linked to only a single nesting
> parent that never changes. Given it seems to be a small ask to
> allocate the nesting parent before the VIOMMU providing it at VIOMMU
> creation time looks like it will simplify the drivers because they can
> rely on it always existing and never changing.

yes, this makes sense.

>
> I think this lends itself to a logical layered VMM design..
>
> - If VFIO is being used get an iommufd
> - Allocate an IOAS for the entire guest physical
> - Determine the vIOMMU driver to use
> - Allocate a HWPT for use by all the vIOMMU instances
> - Allocate a VIOMMU per vIOMMU instance
>
> On ARM the S2 is not divorced from the VIOMMU, ARM requires a single
> VMID, shared with KVM, and localized to a single VM for some of the
> bypass features (vBTM, vCMDQ). So to attach a S2 you actually have to
> attach the VIOMMU to pick up the correct VMID.
>
> I imagine something like this:
> hwpt_alloc(deva, nesting_parent=true) = shared_s2
> viommu_alloc(deva, shared_s2) = viommu1
> viommu_alloc(devb, shared_s2) = viommu2
> hwpt_alloc(deva, viommu1, vste) = deva_vste
> hwpt_alloc(devb, viommu2, vste) = devb_vste
> attach(deva, deva_vste)
> attach(devb, devb_vste)
> attach(devc, shared_s2)

I wonder whether we want to make viommu the 1st-class citizen
for any nested hwpt, if it is desirable to enable it even for VT-d which
lacks a hw viommu concept at the moment.

>
> The driver will then know it should program three different VMIDs for
> the same S2 page table, which matches the ARM expectation for
> VMID. That is to say we'd pass in the viommu as the pt_id for the
> iommu_hwpt_alloc. The viommu would imply both the S2 page table and
> any meta information like VMID the driver needs.

Can you elaborate on the aspect about "three different VMIDs"? They are
all for the same VM, hence sharing the same VMID per the earlier
description. This is also echoed in patch14:

tegra241_cmdqv_viommu_alloc()
vintf->vmid = smmu_domain->vmid;

>
> Both AMD and the vCMDQ thing need to translate some PFNs through the
> S2 and program them elsewhere, this is manually done by SW, and there
> are three choices I guess:
> - Have the VMM do it and provide void __user * to the driver

this sounds redundant with what the S2 already provides

> - Have the driver do it through the S2 directly and track
> S2 invalidations

this makes more sense to me. Just like the driver already needs to track
S2 invalidations to flush any nested cache related to the affected S2 range.

> - Have the driver open an access on the IOAS and use the access unmap
>

it requires adding more iommufd awareness into the iommu driver. I'm
inclined to do that only when strictly necessary.

> Not sure which is the best..

>
> > > Right, Intel currently doesn't need it, but I feel like everyone will
> > > need this eventually as the fast invalidation path is quite important.
> >
> > yes, there is no need but I don't see any harm of preparing for such
> > extension on VT-d. Logically it's clearer, e.g. if we decide to move
> > device TLB invalidation to a separate uAPI then vIOMMU is certainly
> > a clearer object to carry it. and hardware extensions really looks like
> > optimization on software implementations.
> >
> > and we do need make a decision now, given if we make vIOMMU as
> > a generic object for all vendors it may have potential impact on
> > the user page fault support which Baolu is working on.
>
> > the so-called
> > fault object will be contained in vIOMMU, which is software managed
> > on VT-d/SMMU but passed through on AMD.
>
> Hmm, given we currently have no known hardware entanglement between
> PRI and VIOMMU it does seem OK for PRI to just exist seperate for

Isn't AMD vPPRLog for directly sending PRI requests into the guest?

> now. If someone needs them linked someday we can add a viommu_id to
> the create pri queue command.

I'm more worried about the potential conflict between the vqueue
object here and the fault queue object in Baolu's series, if we want
to introduce the vIOMMU concept to platforms which lack the hw
support.

In that case both objects are software constructs to deliver faults.

2024-05-23 03:09:43

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 00/14] Add Tegra241 (Grace) CMDQV Support (part 2/2)

On Wed, May 22, 2024 at 11:43:51PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Thursday, May 23, 2024 7:29 AM
> > On Wed, May 22, 2024 at 12:47:19PM -0700, Nicolin Chen wrote:
> > > Yea, SMMU also has Event Queue and PRI queue. Though I haven't
> > > got time to sit down to look at Baolu's work closely, the uAPI
> > > seems to be a unified one for all IOMMUs. And though I have no
> > > intention to be against that design, yet maybe there could be
> > > an alternative in a somewhat HW specific language as we do for
> > > invalidation? Or not worth it?
> >
> > I was thinking not worth it, I expect a gain here is to do as AMD has
> > done and make the HW dma the queues directly to guest memory.
> >
> > IMHO the primary issue with the queues is DOS, as having any shared
> > queue across VMs is dangerous in that way. Allowing each VIOMMU to
> > have its own private queue and own flow control helps with that.
> >
>
> and also shorter delivering path with less data copy?

Should I interpret that as a yes to fault reporting via VQUEUE?

We only have AMD that can HW dma the events to the guest queue
memory. All the others need a backward translation of (at least) a
physical dev ID to a virtual dev ID. This is now doable in the
kernel with the ongoing vdev_id design, by the way. So the kernel
could then write the guest memory directly to report events?

Thanks
Nicolin

2024-05-23 04:02:21

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

On Thu, May 23, 2024 at 01:43:45AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, May 22, 2024 9:39 PM
> > VIOMMU contains:
> > - A nesting parent
> > - A KVM
> > - Any global per-VM data the driver needs
> > * In ARM case this is VMID, sometimes shared with KVM
>
> In which case is it not shared with KVM? I had the impression that
> VMID always comes from KVM in this VCMDQ usage.

Not actually. I guess that shared VMID is for BTM.

> > On ARM the S2 is not divorced from the VIOMMU, ARM requires a single
> > VMID, shared with KVM, and localized to a single VM for some of the
> > bypass features (vBTM, vCMDQ). So to attach a S2 you actually have to
> > attach the VIOMMU to pick up the correct VMID.
> >
> > I imagine something like this:
> > hwpt_alloc(deva, nesting_parent=true) = shared_s2
> > viommu_alloc(deva, shared_s2) = viommu1
> > viommu_alloc(devb, shared_s2) = viommu2
> > hwpt_alloc(deva, viommu1, vste) = deva_vste
> > hwpt_alloc(devb, viommu2, vste) = devb_vste
> > attach(deva, deva_vste)
> > attach(devb, devb_vste)
> > attach(devc, shared_s2)
>
> I wonder whether we want to make viommu as the 1st-class citizen
> for any nested hwpt if it is desirable to enable it even for VT-d which
> lacks of a hw viommu concept at the moment.

I think Jason is just using SMMU as an example here.

Also FWIW, I am trying a core-allocated core-managed viommu for
IOMMU_VIOMMU_TYPE_DEFAULT. So the VT-d driver doesn't need to hold
a viommu while the VMM could still allocate one if it wants. And the
VIOMMU interface can provide some helpers if a driver wants some
info from the core-managed viommu: a virtual dev ID to physical
dev ID (returning a device pointer) translation, for example. And
we can add more after we brainstorm.

Sample change:
@@ -623,6 +625,18 @@ struct iommu_ops {
+ * @viommu_alloc: Allocate an iommufd_viommu associating to a nested parent
+ * @domain as a user space IOMMU instance for HW-accelerated
+ * features from the physical IOMMU behind the @dev. The
+ * @viommu_type must be defined in include/uapi/linux/iommufd.h
+ * It is suggested to call iommufd_viommu_alloc() helper for
+ * a bundled allocation of the core and the driver structures,
+ * using the given @ictx pointer.
+ * @default_viommu_ops: Driver can choose to use a default core-allocated core-
+ * managed viommu object by providing a default viommu ops.
+ * Otherwise, i.e. for a driver-managed viommu, viommu_ops
+ * should be passed in via iommufd_viommu_alloc() helper in
+ * its own viommu_alloc op.

[..]

+int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
+{
...
+ if (cmd->type == IOMMU_VIOMMU_TYPE_DEFAULT) {
+ viommu = __iommufd_viommu_alloc(
+ ucmd->ictx, sizeof(*viommu),
+ domain->ops->default_viommu_ops);
+ } else {
+ if (!domain->ops->viommu_alloc) {
+ rc = -EOPNOTSUPP;
+ goto out_put_hwpt;
+ }
+
+ viommu = domain->ops->viommu_alloc(domain, idev->dev,
+ ucmd->ictx, cmd->type);
+ }

[..]
// Helper:
+struct device *
+iommufd_viommu_find_device(struct iommufd_viommu *viommu, u64 id);
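
That helper could be a thin wrapper over a per-viommu xarray of
vdev_ids, as a sketch:

struct device *
iommufd_viommu_find_device(struct iommufd_viommu *viommu, u64 id)
{
        struct iommufd_vdev_id *vdev_id;

        if (id > ULONG_MAX)
                return NULL;
        vdev_id = xa_load(&viommu->vdev_ids, id);
        return vdev_id ? vdev_id->dev : NULL;
}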

> > The driver will then know it should program three different VMIDs for
> > the same S2 page table, which matches the ARM expectation for
> > VMID. That is to say we'd pass in the viommu as the pt_id for the
> > iommu_hwpt_alloc. The viommu would imply both the S2 page table and
> > any meta information like VMID the driver needs.
>
> Can you elaborate the aspect about "three different VMIDs"? They are
> all for the same VM hence sharing the same VMID per the earlier
> description. This is also echo-ed in patch14:
>
> tegra241_cmdqv_viommu_alloc()
> vintf->vmid = smmu_domain->vmid;

The design in the series is still the old one, using a 1:1 relationship
between a viommu and an S2 domain. I think the "three" is from
his SMMU example above? Leaving it to Jason to reply though.

> > now. If someone needs them linked someday we can add a viommu_id to
> > the create pri queue command.
>
> I'm more worried about the potential conflict between the vqueue
> object here and the fault queue object in Baolu's series, if we want
> to introduce vIOMMU concept to platforms which lack of the hw
> support.

I actually see one argument about whether we should use a vqueue
vs. Baolu's fault queue object, and a counter argument about whether
we should also use a vqueue for viommu invalidation vs. an array-
based invalidation request like what we similarly have for HWPT...

Thanks
Nicolin

2024-05-23 05:40:43

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

> From: Nicolin Chen <[email protected]>
> Sent: Thursday, May 23, 2024 12:02 PM
>
> On Thu, May 23, 2024 at 01:43:45AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Wednesday, May 22, 2024 9:39 PM
> > > VIOMMU contains:
> > > - A nesting parent
> > > - A KVM
> > > - Any global per-VM data the driver needs
> > > * In ARM case this is VMID, sometimes shared with KVM
> >
> > In which case is it not shared with KVM? I had the impression that
> > VMID always comes from KVM in this VCMDQ usage.
>
> Not actually. I guess that shared VMID is for BTM.

yes. now that is clear to me after reading more context.

>
> > > On ARM the S2 is not divorced from the VIOMMU, ARM requires a single
> > > VMID, shared with KVM, and localized to a single VM for some of the
> > > bypass features (vBTM, vCMDQ). So to attach a S2 you actually have

but then this description should be only about vBTM. vCMDQ has
its own VMID policy.

> > > attach the VIOMMU to pick up the correct VMID.
> > >
> > > I imagine something like this:
> > > hwpt_alloc(deva, nesting_parent=true) = shared_s2
> > > viommu_alloc(deva, shared_s2) = viommu1
> > > viommu_alloc(devb, shared_s2) = viommu2
> > > hwpt_alloc(deva, viommu1, vste) = deva_vste
> > > hwpt_alloc(devb, viommu2, vste) = devb_vste
> > > attach(deva, deva_vste)
> > > attach(devb, devb_vste)
> > > attach(devc, shared_s2)
> >
> > I wonder whether we want to make viommu as the 1st-class citizen
> > for any nested hwpt if it is desirable to enable it even for VT-d which
> > lacks of a hw viommu concept at the moment.
>
> I think Jason is completely using SMMU as an example here.
>
> Also FWIW, I am trying a core-allocated core-managed viommu for
> IOMMU_VIOMMU_TYPE_DEFAULT. So VT-d driver doesn't need to hold
> a viommu while VMM could still allocate one if it wants. And the
> VIOMMU interface can provide some helpers if driver wants some
> info from the core-managed viommu: a virtual dev ID to physical
> dev ID (returning device pointer) translation for example. And
> we can add more after we brain storm.

that'd be nice.

> > > The driver will then know it should program three different VMIDs for
> > > the same S2 page table, which matches the ARM expectation for
> > > VMID. That is to say we'd pass in the viommu as the pt_id for the
> > > iommu_hwpt_alloc. The viommu would imply both the S2 page table and
> > > any meta information like VMID the driver needs.
> >
> > Can you elaborate the aspect about "three different VMIDs"? They are
> > all for the same VM hence sharing the same VMID per the earlier
> > description. This is also echo-ed in patch14:
> >
> > tegra241_cmdqv_viommu_alloc()
> > vintf->vmid = smmu_domain->vmid;
>
> The design in the series is still old using a 1:1 relationship
> between a viommu and an S2 domain. I think the "three" is from
> his SMMU example above? Leaving it to Jason to reply though.

I'd expect some detailed explanation in the commit msg about how
VCMDQ actually works. It's unclear to me whether this "three
different VMIDs" thing comes from the hw requirement or certain
software design choices.

2024-05-23 06:57:33

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 12/14] iommufd: Add IOMMUFD_OBJ_VQUEUE and IOMMUFD_CMD_VQUEUE_ALLOC

> From: Jason Gunthorpe <[email protected]>
> Sent: Sunday, May 12, 2024 11:02 PM
>
> On Fri, Apr 12, 2024 at 08:47:09PM -0700, Nicolin Chen wrote:
>
> > +/**
> > + * struct iommu_vqueue_alloc - ioctl(IOMMU_VQUEUE_ALLOC)
> > + * @size: sizeof(struct iommu_vqueue_alloc)
> > + * @flags: Must be 0
> > + * @viommu_id: viommu ID to associate the virtual queue with
> > + * @out_vqueue_id: The ID of the new virtual queue
> > + * @data_type: One of enum iommu_vqueue_data_type
> > + * @data_len: Length of the type specific data
> > + * @data_uptr: User pointer to the type specific data
> > + *
> > + * Allocate a virtual queue object for driver-specific HW-accelerated
> queue
> > + */
> > +
> > +struct iommu_vqueue_alloc {
> > + __u32 size;
> > + __u32 flags;
> > + __u32 viommu_id;
> > + __u32 out_vqueue_id;
> > + __u32 data_type;
> > + __u32 data_len;
> > + __aligned_u64 data_uptr;
>
> Some of the iommus will want an IPA here not a user pointer. I think
> it is fine API wise, we'd just add a flag to indicate data_uptr is an
> IPA.
>

Presumably each driver will create its own type data, which can
include an IPA field such as vcmdq_base in this patch?


2024-05-23 07:05:53

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 12/14] iommufd: Add IOMMUFD_OBJ_VQUEUE and IOMMUFD_CMD_VQUEUE_ALLOC

> From: Nicolin Chen <[email protected]>
> Sent: Saturday, April 13, 2024 11:47 AM
> +
> +/**
> + * struct iommu_vqueue_tegra241_cmdqv - NVIDIA Tegra241's Virtual
> Command Queue
> + * for its CMDQV Extension for ARM SMMUv3
> + * (IOMMU_VQUEUE_DATA_TEGRA241_CMDQV)
> + * @vcmdq_id: logical ID of a virtual command queue in the VIOMMU
> instance
> + * @vcmdq_log2size: (1 << @vcmdq_log2size) will be the size of the vcmdq
> + * @vcmdq_base: guest physical address (IPA) to the vcmdq base address
> + */
> +struct iommu_vqueue_tegra241_cmdqv {
> + __u32 vcmdq_id;
> + __u32 vcmdq_log2size;
> + __aligned_u64 vcmdq_base;
> +};

Is there a restriction that log2size cannot exceed 12?

According to patch14:

+ q_base = arm_smmu_domain_ipa_to_pa(vintf->s2_domain, arg.vcmdq_base);
+ vcmdq->cmdq.q.q_base = q_base & VCMDQ_ADDR;
+ vcmdq->cmdq.q.q_base |= arg.vcmdq_log2size;

It only converts the IPA to HPA for the starting page, assuming
contiguous host pages backing a guest cmd queue that spans
multiple pages. But what guarantees that?
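
One way to make the assumption explicit would be to walk the whole
queue through the S2, as a sketch (the helper name and error codes are
illustrative; the series only translates the first page):

static int vcmdq_ipa_to_pa(struct iommu_domain *s2_domain, u64 ipa,
                           size_t size, phys_addr_t *pa)
{
        phys_addr_t base = iommu_iova_to_phys(s2_domain, ipa);
        size_t off;

        if (!base)
                return -ENXIO;
        for (off = PAGE_SIZE; off < size; off += PAGE_SIZE) {
                /* Every page must be contiguous with the first one */
                if (iommu_iova_to_phys(s2_domain, ipa + off) != base + off)
                        return -EFAULT;
        }
        *pa = base;
        return 0;
}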

2024-05-23 12:48:24

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 00/14] Add Tegra241 (Grace) CMDQV Support (part 2/2)

On Wed, May 22, 2024 at 08:09:12PM -0700, Nicolin Chen wrote:
> On Wed, May 22, 2024 at 11:43:51PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Thursday, May 23, 2024 7:29 AM
> > > On Wed, May 22, 2024 at 12:47:19PM -0700, Nicolin Chen wrote:
> > > > Yea, SMMU also has Event Queue and PRI queue. Though I haven't
> > > > got time to sit down to look at Baolu's work closely, the uAPI
> > > > seems to be a unified one for all IOMMUs. And though I have no
> > > > intention to be against that design, yet maybe there could be
> > > > an alternative in a somewhat HW specific language as we do for
> > > > invalidation? Or not worth it?
> > >
> > > I was thinking not worth it, I expect a gain here is to do as AMD has
> > > done and make the HW dma the queues directly to guest memory.
> > >
> > > IMHO the primary issue with the queues is DOS, as having any shared
> > > queue across VMs is dangerous in that way. Allowing each VIOMMU to
> > > have its own private queue and own flow control helps with that.
> > >
> >
> > and also shorter delivering path with less data copy?
>
> Should I interpret that as a yes for fault report via VQUEUE?
>
> We only have AMD that can HW dma the events to the guest queue
> memory. Others all need a backward translation of (at least) a
> physical dev ID to a virtual dev ID. This is now doable in the
> kernel by the ongoing vdev_id design by the way. So kernel then
> can write the guest memory directly to report events?

I don't think we should get into kernel doing direct access at this
point, lets focus on basic functionality before we get to
microoptimizations like that.

So long as the API could support doing something like that it could be
done after benchmarking/etc.

Jason

2024-05-23 12:59:56

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

On Thu, May 23, 2024 at 01:43:45AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, May 22, 2024 9:39 PM
> >
> > On Wed, May 22, 2024 at 08:58:34AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <[email protected]>
> > > > Sent: Tuesday, May 14, 2024 11:56 PM
> > > >
> > > > > > So we need the S2 to exist before the VIOMMU is created, but the
> > > > > > drivers are going to need some more fixing before that will fully
> > > > > > work.
> > >
> > > Can you elaborate on this point? VIOMMU is a dummy container when
> > > it's created and the association to S2 comes relevant only until when
> > > VQUEUE is created inside and linked to a device?
> >
> > VIOMMU contains:
> > - A nesting parent
> > - A KVM
> > - Any global per-VM data the driver needs
> > * In ARM case this is VMID, sometimes shared with KVM
>
> In which case is it not shared with KVM? I had the impression that
> VMID always comes from KVM in this VCMDQ usage.

Not quite, only vBTM needs it to be shared with KVM because the CPU
will forward the KVM VMID to the SMMU during invalidations.

Everything else in the nesting space (including vCMDQ) just needs the
VMID to be unique to the VM since it scopes the ASID that is stored in
the guest tables.

For non-nesting cases (ie no viommu) the VMID can be unique to the S2.

> > On ARM the S2 is not divorced from the VIOMMU, ARM requires a single
> > VMID, shared with KVM, and localized to a single VM for some of the
> > bypass features (vBTM, vCMDQ). So to attach a S2 you actually have to
> > attach the VIOMMU to pick up the correct VMID.
> >
> > I imagine something like this:
> > hwpt_alloc(deva, nesting_parent=true) = shared_s2
> > viommu_alloc(deva, shared_s2) = viommu1
> > viommu_alloc(devb, shared_s2) = viommu2
> > hwpt_alloc(deva, viommu1, vste) = deva_vste
> > hwpt_alloc(devb, viommu2, vste) = devb_vste
> > attach(deva, deva_vste)
> > attach(devb, devb_vste)
> > attach(devc, shared_s2)
>
> I wonder whether we want to make viommu as the 1st-class citizen
> for any nested hwpt if it is desirable to enable it even for VT-d which
> lacks of a hw viommu concept at the moment.

I think we may as well code it like that, yes. It is easy to get the
S2 out of the viommu and feed that into the intel driver.

> > The driver will then know it should program three different VMIDs for
> > the same S2 page table, which matches the ARM expectation for
> > VMID. That is to say we'd pass in the viommu as the pt_id for the
> > iommu_hwpt_alloc. The viommu would imply both the S2 page table and
> > any meta information like VMID the driver needs.
>
> Can you elaborate the aspect about "three different VMIDs"?

In SMMUv3 the cache is tagged by (VMID,ASID) where ASID is completely
controlled by the guest.

Every time the guest observes a SMMUv3 instance it is allowed to
create its own private ASID number space for that instance. The guest
could re-use ASID #1 on two instances.

So every SMMUv3 instance plugged into the guest needs to have its own
unique VMID so that the overlapping ASIDs are disambiguated. The above
would create a VM where:

deva -> vSMMUv3 #1
devb -> vSMMUv3 #2
devc -> No IOMMU

> tegra241_cmdqv_viommu_alloc()
> vintf->vmid = smmu_domain->vmid;

And same here: the 'vintf' is processing ASIDs directly from the guest,
so it needs to understand the VMID they are scoped under. Every vSMMUv3
instance in the guest must get a VMID, vintf, and vCMDQ all of its
own. The above needs revising in a viommu world.
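
I.e. roughly this, once the viommu object carries the VMID (sketch):

tegra241_cmdqv_viommu_alloc()
        vintf->vmid = viommu->vmid; /* one VMID per vSMMUv3 instance */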

> > Both AMD and the vCMDQ thing need to translate some PFNs through the
> > S2 and program them elsewhere, this is manually done by SW, and there
> > are three choices I guess:
> > - Have the VMM do it and provide void __user * to the driver
>
> this sounds redundant to what S2 already provides

Yes, but we don't have to track invalidations here..

> > - Have the driver do it through the S2 directly and track
> > S2 invalidations
>
> this makes more sense to me. Just like the driver already needs to track
> S2 invalidations to flush any nested cache related to the affected S2 range.

I'm a bit worried about 'track invalidations' though..

> > - Have the driver open an access on the IOAS and use the access unmap
>
> it requires adding more iommufd awareness into the iommu driver. I'm
> inclined to do it only at minimal necessity.

Yes, it is certainly annoying because of the modular/builtin problem.

> > Hmm, given we currently have no known hardware entanglement between
> > PRI and VIOMMU it does seem OK for PRI to just exist seperate for
>
> Isn't AMD vPPRLog for directly sending PRI request into the guest?

I think it is, but that would be a vQUEUE on the VIOMMU, not adding a
VIOMMU to Lu's patches, which is what I meant.

> > now. If someone needs them linked someday we can add a viommu_id to
> > the create pri queue command.
>
> I'm more worried about the potential conflict between the vqueue
> object here and the fault queue object in Baolu's series, if we want
> to introduce vIOMMU concept to platforms which lack of the hw
> support.

I assume the vPPRLog will steal all the PRI before it reaches the
kernel, so once this is turned on Lu's path won't see anything.

I don't know if AMD can turn vPPRLog on individually or if it is a
whole package once they switch on VIOMMU..

Jason

2024-05-24 02:16:49

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, May 23, 2024 8:59 PM
> On Thu, May 23, 2024 at 01:43:45AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Wednesday, May 22, 2024 9:39 PM
> > >
> > > The driver will then know it should program three different VMIDs for
> > > the same S2 page table, which matches the ARM expectation for
> > > VMID. That is to say we'd pass in the viommu as the pt_id for the
> > > iommu_hwpt_alloc. The viommu would imply both the S2 page table and
> > > any meta information like VMID the driver needs.
> >
> > Can you elaborate the aspect about "three different VMIDs"?
>
> In SMMUv3 the cache is tagged by (VMID,ASID) where ASID is completely
> controlled by the guest.
>
> Every time the guest observes a SMMUv3 instance it is allowed to
> creates its own private ASID number space for that instance. The guest
> could re-use ASID #1 on two instances.
>
> So every SMMUv3 instance plugged into the guest needs to have its own
> unique VMID so that the overlapping ASID's are disambiguate. The above
> would create a VM where:
>
> deva -> vSMMUv3 #1
> devb -> vSMMUv3 #2
> devc -> No IOMMU

This assumes that each vSMMUv3 instance has only one ASID space,
i.e. the guest cannot create multiple VMIDs itself?

> > > Hmm, given we currently have no known hardware entanglement
> between
> > > PRI and VIOMMU it does seem OK for PRI to just exist seperate for
> >
> > Isn't AMD vPPRLog for directly sending PRI request into the guest?
>
> I think it is, but that would be a vQUEUE on the VIOMMU not adding a
> VIOMMU to Lu's patches, which is what I ment.
>
> > > now. If someone needs them linked someday we can add a viommu_id to
> > > the create pri queue command.
> >
> > I'm more worried about the potential conflict between the vqueue
> > object here and the fault queue object in Baolu's series, if we want
> > to introduce vIOMMU concept to platforms which lack of the hw
> > support.
>
> I assume the vPPRLog will steal all the PRI before it reaches the
> kernel, so once this is turned on Lu's path won't see anything.
>

Okay, then we expect this vqueue object to be used only for HW
acceleration, while software-based fault logging is still routed via
Baolu's work.

2024-05-24 02:37:24

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, May 23, 2024 8:59 PM
>
> > > now. If someone needs them linked someday we can add a viommu_id to
> > > the create pri queue command.
> >
> > I'm more worried about the potential conflict between the vqueue
> > object here and the fault queue object in Baolu's series, if we want
> > to introduce vIOMMU concept to platforms which lack of the hw
> > support.
>
> I assume the vPPRLog will steal all the PRI before it reaches the
> kernel, so once this is turned on Lu's path won't see anything.
>
> I don't know if AMD can turn vPPRLog on individually or if it is a
> whole package once they switch on VIOMMU..
>

their spec says:
"
This feature is supported if MMIO Offset 0030h[vIommuSup]=1. When
enabled, DTE[vImuEn]=1 indicates the guest owning this device is a
vIOMMU enabled guest, and IOMMU should write event and peripheral
page requests associated with this device into the guest data structures.
"

so it's configured per device.

2024-05-24 04:42:36

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 12/14] iommufd: Add IOMMUFD_OBJ_VQUEUE and IOMMUFD_CMD_VQUEUE_ALLOC

On Thu, May 23, 2024 at 06:57:15AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Sunday, May 12, 2024 11:02 PM
> >
> > On Fri, Apr 12, 2024 at 08:47:09PM -0700, Nicolin Chen wrote:
> >
> > > +/**
> > > + * struct iommu_vqueue_alloc - ioctl(IOMMU_VQUEUE_ALLOC)
> > > + * @size: sizeof(struct iommu_vqueue_alloc)
> > > + * @flags: Must be 0
> > > + * @viommu_id: viommu ID to associate the virtual queue with
> > > + * @out_vqueue_id: The ID of the new virtual queue
> > > + * @data_type: One of enum iommu_vqueue_data_type
> > > + * @data_len: Length of the type specific data
> > > + * @data_uptr: User pointer to the type specific data
> > > + *
> > > + * Allocate a virtual queue object for driver-specific HW-accelerated
> > queue
> > > + */
> > > +
> > > +struct iommu_vqueue_alloc {
> > > + __u32 size;
> > > + __u32 flags;
> > > + __u32 viommu_id;
> > > + __u32 out_vqueue_id;
> > > + __u32 data_type;
> > > + __u32 data_len;
> > > + __aligned_u64 data_uptr;
> >
> > Some of the iommus will want an IPA here not a user pointer. I think
> > it is fine API wise, we'd just add a flag to indicate data_uptr is an
> > IPA.
> >
>
> Presumably each driver will create its own type data which can
> include any IPA field as vcmdq_base in this patch?

You mean putting a base address field in this common vqueue
structure? Hmm, I think we could, assuming every queue must
have a set of base and size info at the guest level.

Thanks
Nicolin

2024-05-24 05:27:03

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 12/14] iommufd: Add IOMMUFD_OBJ_VQUEUE and IOMMUFD_CMD_VQUEUE_ALLOC

> From: Nicolin Chen <[email protected]>
> Sent: Friday, May 24, 2024 12:42 PM
>
> On Thu, May 23, 2024 at 06:57:15AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Sunday, May 12, 2024 11:02 PM
> > >
> > > On Fri, Apr 12, 2024 at 08:47:09PM -0700, Nicolin Chen wrote:
> > >
> > > > +/**
> > > > + * struct iommu_vqueue_alloc - ioctl(IOMMU_VQUEUE_ALLOC)
> > > > + * @size: sizeof(struct iommu_vqueue_alloc)
> > > > + * @flags: Must be 0
> > > > + * @viommu_id: viommu ID to associate the virtual queue with
> > > > + * @out_vqueue_id: The ID of the new virtual queue
> > > > + * @data_type: One of enum iommu_vqueue_data_type
> > > > + * @data_len: Length of the type specific data
> > > > + * @data_uptr: User pointer to the type specific data
> > > > + *
> > > > + * Allocate a virtual queue object for driver-specific HW-accelerated
> > > queue
> > > > + */
> > > > +
> > > > +struct iommu_vqueue_alloc {
> > > > + __u32 size;
> > > > + __u32 flags;
> > > > + __u32 viommu_id;
> > > > + __u32 out_vqueue_id;
> > > > + __u32 data_type;
> > > > + __u32 data_len;
> > > > + __aligned_u64 data_uptr;
> > >
> > > Some of the iommus will want an IPA here not a user pointer. I think
> > > it is fine API wise, we'd just add a flag to indicate data_uptr is an
> > > IPA.
> > >
> >
> > Presumably each driver will create its own type data which can
> > include any IPA field as vcmdq_base in this patch?
>
> You mean putting a base address field in this common vqueue
> structure? Hmm, I think we could, assuming every queue must
> have a set of base and size info in the guest level.
>

We could, but my original point was just that it's confusing to
change the meaning of data_uptr. Let's keep it as driver-defined
data, with any driver-specific IPA field defined in that data.

Of course if we think the base address is common enough it can
be moved to the common vqueue structure too.

2024-05-24 05:40:49

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Thu, May 23, 2024 at 06:42:56AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <[email protected]>
> > Sent: Saturday, April 13, 2024 11:47 AM
> >
> > Introduce a new ioctl to set a per-viommu device virtual id that should be
> > linked to the physical device id (or just a struct device pointer).
> >
> > Since a viommu (user space IOMMU instance) can have multiple devices
>
> this is true...
>
> > while
> > it's not ideal to confine a device to one single user space IOMMU instance
> > either, these two shouldn't just do a 1:1 mapping. Add two xarrays in
>
> ...but why would one device link to multiple viommu instances?

That's a suggestion from Jason, IIRC, to avoid limiting a device
to a single viommu, though I can't find the source at the moment.

Jason, would you mind shedding some light here?

> Or is it referring to Tegra194 as arm-smmu-nvidia.c tries to support?

Not really. It's an SMMUv2 driver, which is not in our plan for
virtualization at this moment. And that driver is essentially just
a different "compatible" string for a unique SMMUv2 implementation.

> btw there is a check in the following code:
>
> + if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) {
> + rc = -EINVAL;
> + goto out_put_viommu;
> + }
>
> I vaguely remember an earlier discussion about multiple vSMMU instances
> following the physical SMMU topology, but don't quite recall the exact
> reason.
>
> What is the actual technical obstacle prohibiting one to put multiple
> VCMDQ's from different SMMU's into one vIOMMU instance?

Because a VCMDQ passthrough involves a direct mmap of a HW MMIO
page to the guest-level MMIO region. The MMIO page provides
read/write access to the queue's head and tail indexes.

With a single pSMMU and a single vSMMU, it's simply 1:1 mapping.

With a multi-pSMMU and a single vSMMU, the single vSMMU will see
one guest-level MMIO region backed by multiple physical pages.
Since we are talking about MMIO, technically it cannot select the
corresponding MMIO page for each device, not to mention we don't
really want the VMM to be involved, i.e. no VM Exits, when using
VCMDQ.

So, there must be some kind of multi-instanced carrier to hold
those MMIO pages, attaching devices behind different pSMMUs to
their corresponding carriers. And today we have VIOMMU as that
carrier.
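
For illustration, the VMM side could be as simple as the sketch
below; the mmap offset encoding that selects a viommu is purely an
assumption here, not the final API:

#include <sys/mman.h>

/* map one viommu's VCMDQ MMIO page from the iommufd fd */
static void *map_vcmdq_page(int iommufd_fd, unsigned long viommu_id)
{
	return mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
		    iommufd_fd, (off_t)(viommu_id * 4096));
}

The VMM then installs the returned page 1:1 into the guest's vSMMU
MMIO region, so the guest reads/writes the head/tail indexes without
any VM Exit.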

One step back: even without the VCMDQ feature, a multi-pSMMU setup
will have multiple viommus (with our latest design) being added
to a viommu list of a single vSMMU. Yet, the vSMMU in this case
always traps the regular SMMU CMDQ, so it can do viommu selection
or even broadcast (if it has to).

Thanks
Nicolin

2024-05-24 06:03:42

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 12/14] iommufd: Add IOMMUFD_OBJ_VQUEUE and IOMMUFD_CMD_VQUEUE_ALLOC

On Fri, May 24, 2024 at 05:26:50AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <[email protected]>
> > Sent: Friday, May 24, 2024 12:42 PM
> >
> > On Thu, May 23, 2024 at 06:57:15AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <[email protected]>
> > > > Sent: Sunday, May 12, 2024 11:02 PM
> > > >
> > > > On Fri, Apr 12, 2024 at 08:47:09PM -0700, Nicolin Chen wrote:
> > > >
> > > > > +/**
> > > > > + * struct iommu_vqueue_alloc - ioctl(IOMMU_VQUEUE_ALLOC)
> > > > > + * @size: sizeof(struct iommu_vqueue_alloc)
> > > > > + * @flags: Must be 0
> > > > > + * @viommu_id: viommu ID to associate the virtual queue with
> > > > > + * @out_vqueue_id: The ID of the new virtual queue
> > > > > + * @data_type: One of enum iommu_vqueue_data_type
> > > > > + * @data_len: Length of the type specific data
> > > > > + * @data_uptr: User pointer to the type specific data
> > > > > + *
> > > > > + * Allocate a virtual queue object for driver-specific HW-accelerated
> > > > queue
> > > > > + */
> > > > > +
> > > > > +struct iommu_vqueue_alloc {
> > > > > + __u32 size;
> > > > > + __u32 flags;
> > > > > + __u32 viommu_id;
> > > > > + __u32 out_vqueue_id;
> > > > > + __u32 data_type;
> > > > > + __u32 data_len;
> > > > > + __aligned_u64 data_uptr;
> > > >
> > > > Some of the iommus will want an IPA here not a user pointer. I think
> > > > it is fine API wise, we'd just add a flag to indicate data_uptr is an
> > > > IPA.
> > > >
> > >
> > > Presumably each driver will create its own type data which can
> > > include any IPA field as vcmdq_base in this patch?
> >
> > You mean putting a base address field in this common vqueue
> > structure? Hmm, I think we could, assuming every queue must
> > have a set of base and size info in the guest level.
> >
>
> We could, but my original point was just that it's confusing to
> change the meaning of data_uptr. Let's stick it to be driver defined
> data and any driver-specific IPA field is defined in that data.

Oh, that's copied from IOMMU_HWPT_ALLOC:
* @data_uptr: User pointer to the type specific data

And it actually makes sense to me to be "type specific" since a
driver (AMD VIOMMU for example) might have two different vqueue
types (invalidation queue vs. fault queue)?

Thanks
Nicolin

2024-05-24 07:14:09

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

> From: Nicolin Chen <[email protected]>
> Sent: Friday, May 24, 2024 1:40 PM
>
> On Thu, May 23, 2024 at 06:42:56AM +0000, Tian, Kevin wrote:
> > btw there is a check in the following code:
> >
> > + if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) {
> > + rc = -EINVAL;
> > + goto out_put_viommu;
> > + }
> >
> > I vaguely remember an earlier discussion about multiple vSMMU instances
> > following the physical SMMU topology, but don't quite recall the exact
> > reason.
> >
> > What is the actual technical obstacle prohibiting one to put multiple
> > VCMDQ's from different SMMU's into one vIOMMU instance?
>
> Because a VCMDQ passthrough involves a direct mmap of a HW MMIO
> page to the guest-level MMIO region. The MMIO page provides the
> read/write of queue's head and tail indexes.
>
> With a single pSMMU and a single vSMMU, it's simply 1:1 mapping.
>
> With a multi-pSMMU and a single vSMMU, the single vSMMU will see
> one guest-level MMIO region backed by multiple physical pages.
> Since we are talking about MMIO, technically it cannot select the
> corresponding MMIO page to the device, not to mention we don't
> really want VMM to involve, i.e. no VM Exits, when using VCMDQ.

can a vSMMU report support for multiple CMDQs, so that there are
several MMIO regions, each mapped to a different backend VCMDQ?

but I guess even if that's supported there is still a problem
describing the association between assigned devices and the CMDQs
of the single vIOMMU instance. On bare metal a multi-CMDQ feature
is more for load-balancing, so there is no fixed association
between CMDQ and device. But a fixed association does exist if we
mix multiple VCMDQs in a single vIOMMU instance, and the spec may
lack a way to describe it.

>
> So, there must be some kind of multi-instanced carriers to hold
> those MMIO pages, by attaching devices behind different pSMMUs to
> corresponding carriers. And today we have VIOMMU as the carrier.
>
> One step back, even without VCMDQ feature, a multi-pSMMU setup
> will have multiple viommus (with our latest design) being added
> to a viommu list of a single vSMMU's. Yet, vSMMU in this case
> always traps regular SMMU CMDQ, so it can do viommu selection
> or even broadcast (if it has to).
>

I'm curious to learn the real reason for that design. Is it because
you want to do certain load-balancing between viommus, or due to
other reasons in the kernel smmuv3 driver which e.g. cannot support
a viommu spanning multiple pSMMUs?

2024-05-24 13:04:07

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

On Fri, May 24, 2024 at 02:16:34AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Thursday, May 23, 2024 8:59 PM
> > On Thu, May 23, 2024 at 01:43:45AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <[email protected]>
> > > > Sent: Wednesday, May 22, 2024 9:39 PM
> > > >
> > > > The driver will then know it should program three different VMIDs for
> > > > the same S2 page table, which matches the ARM expectation for
> > > > VMID. That is to say we'd pass in the viommu as the pt_id for the
> > > > iommu_hwpt_alloc. The viommu would imply both the S2 page table and
> > > > any meta information like VMID the driver needs.
> > >
> > > Can you elaborate the aspect about "three different VMIDs"?
> >
> > In SMMUv3 the cache is tagged by (VMID,ASID) where ASID is completely
> > controlled by the guest.
> >
> > Every time the guest observes a SMMUv3 instance it is allowed to
> > creates its own private ASID number space for that instance. The guest
> > could re-use ASID #1 on two instances.
> >
> > So every SMMUv3 instance plugged into the guest needs to have its own
> > unique VMID so that the overlapping ASID's are disambiguate. The above
> > would create a VM where:
> >
> > deva -> vSMMUv3 #1
> > devb -> vSMMUv3 #2
> > devc -> No IOMMU
>
> This assumes that each vSMMUv3 instance has only one ASID space
> i.e. the guest cannot create multiple VMID's itself?

Right, the vSMMUv3 in the guest has no VMID support at all.

> > I assume the vPPRLog will steal all the PRI before it reaches the
> > kernel, so once this is turned on Lu's path won't see anything.
>
> Okay, then we expect this vqueue object only for HW acceleration
> while software-based fault logging is still routed via Baolu's work.

Yeah, I think it mirrors the invalidation that way: we have a SW
invalidation path and a HW path. The PRI is standards-based so it
is easier to make a generic API for it.

Jason

2024-05-24 13:20:37

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <[email protected]>
> > Sent: Friday, May 24, 2024 1:40 PM
> >
> > On Thu, May 23, 2024 at 06:42:56AM +0000, Tian, Kevin wrote:
> > > btw there is a check in the following code:
> > >
> > > + if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) {
> > > + rc = -EINVAL;
> > > + goto out_put_viommu;
> > > + }
> > >
> > > I vaguely remember an earlier discussion about multiple vSMMU instances
> > > following the physical SMMU topology, but don't quite recall the exact
> > > reason.
> > >
> > > What is the actual technical obstacle prohibiting one to put multiple
> > > VCMDQ's from different SMMU's into one vIOMMU instance?
> >
> > Because a VCMDQ passthrough involves a direct mmap of a HW MMIO
> > page to the guest-level MMIO region. The MMIO page provides the
> > read/write of queue's head and tail indexes.
> >
> > With a single pSMMU and a single vSMMU, it's simply 1:1 mapping.
> >
> > With a multi-pSMMU and a single vSMMU, the single vSMMU will see
> > one guest-level MMIO region backed by multiple physical pages.
> > Since we are talking about MMIO, technically it cannot select the
> > corresponding MMIO page to the device, not to mention we don't
> > really want VMM to involve, i.e. no VM Exits, when using VCMDQ.
>
> can a vSMMU report to support multiple CMDQ's then there are
> several MMIO regions each mapped to a different backend VCMDQ?

This is something I want to support in the API, regardless of
whether vCMDQ uses it or not.

> but I guess even if it's supported there is still a problem describing
> the association between assigned devices and the CMDQ's of the
> single vIOMMU instance.

CMDQ should be linked to the VIOMMU instance only and not
per-device; I think that is a very important property of the system.

This means if there are multiple pSMMU instances then we need a
matching set of vSMMU instances so that the devices are grouped on the
proper vSMMU instance so they are compatible with the vCMDQ.

> I'm curious to learn the real reason of that design. Is it because you
> want to do certain load-balance between viommu's or due to other
> reasons in the kernel smmuv3 driver which e.g. cannot support a
> viommu spanning multiple pSMMU?

Yeah, there is no concept of support for an SMMUv3 instance where
its command queues can only work on a subset of devices.

My expectation was that VIOMMU would be 1:1 with physical iommu
instances; I think AMD needs this too?

Jason

2024-05-24 13:23:40

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Thu, May 23, 2024 at 10:40:07PM -0700, Nicolin Chen wrote:
> On Thu, May 23, 2024 at 06:42:56AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <[email protected]>
> > > Sent: Saturday, April 13, 2024 11:47 AM
> > >
> > > Introduce a new ioctl to set a per-viommu device virtual id that should be
> > > linked to the physical device id (or just a struct device pointer).
> > >
> > > Since a viommu (user space IOMMU instance) can have multiple devices
> >
> > this is true...
> >
> > > while
> > > it's not ideal to confine a device to one single user space IOMMU instance
> > > either, these two shouldn't just do a 1:1 mapping. Add two xarrays in
> >
> > ...but why would one device link to multiple viommu instances?
>
> That's a suggestion from Jason, IIRC, to avoid limiting a device
> to a single viommu, though I can't find out the source at this
> moment.
>
> Jason, would you mind shed some light here?

We could probably do it either way, it just doesn't seem like there is
a strong reason to limit a device to a single viommu even if in
practice that is how everyone will use it.

Jason

2024-05-27 01:09:07

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, May 24, 2024 9:19 PM
>
> On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote:
> > I'm curious to learn the real reason of that design. Is it because you
> > want to do certain load-balance between viommu's or due to other
> > reasons in the kernel smmuv3 driver which e.g. cannot support a
> > viommu spanning multiple pSMMU?
>
> Yeah, there is no concept of support for a SMMUv3 instance where it's
> command Q's can only work on a subset of devices.
>
> My expectation was that VIOMMU would be 1:1 with physical iommu
> instances, I think AMD needs this too??
>

Yes, this part is clear now regarding VCMDQ.

But Nicolin said:

"
One step back, even without VCMDQ feature, a multi-pSMMU setup
will have multiple viommus (with our latest design) being added
to a viommu list of a single vSMMU's. Yet, vSMMU in this case
always traps regular SMMU CMDQ, so it can do viommu selection
or even broadcast (if it has to).
"

I don't think there is an arch limitation mandating that?

2024-05-28 20:30:57

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Mon, May 27, 2024 at 01:08:43AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Friday, May 24, 2024 9:19 PM
> >
> > On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote:
> > > I'm curious to learn the real reason of that design. Is it because you
> > > want to do certain load-balance between viommu's or due to other
> > > reasons in the kernel smmuv3 driver which e.g. cannot support a
> > > viommu spanning multiple pSMMU?
> >
> > Yeah, there is no concept of support for a SMMUv3 instance where it's
> > command Q's can only work on a subset of devices.
> >
> > My expectation was that VIOMMU would be 1:1 with physical iommu
> > instances, I think AMD needs this too??
> >
>
> Yes this part is clear now regarding to VCMDQ.
>
> But Nicolin said:
>
> "
> One step back, even without VCMDQ feature, a multi-pSMMU setup
> will have multiple viommus (with our latest design) being added
> to a viommu list of a single vSMMU's. Yet, vSMMU in this case
> always traps regular SMMU CMDQ, so it can do viommu selection
> or even broadcast (if it has to).
> "
>
> I don't think there is an arch limitation mandating that?

What I mean is a regular nested SMMU case. Without VCMDQ, a
regular vSMMU on a multi-pSMMU setup will look like this (e.g.
three devices behind different SMMUs):
       |<---- guest ---->|<--------- VMM -------->|<- kernel ->|
       |-- dev0 --|-- viommu0 (s2_hwpt0) --|-- pSMMU0 --|
vSMMU--|-- dev1 --|-- viommu1 (s2_hwpt0) --|-- pSMMU1 --|
       |-- dev2 --|-- viommu2 (s2_hwpt0) --|-- pSMMU2 --|

When trapping a device cache invalidation, it is straightforward:
decipher the virtual device ID to pick the viommu that the device
is attached to. So, only one pSMMU would receive the user
invalidation request.

When doing iotlb invalidation, a command may or may not contain
an ASID (a domain ID, a nested domain in this case):
a) if a command doesn't have an ASID, the VMM needs to broadcast
the command to all viommus (i.e. pSMMUs)
b) if a command has an ASID, the VMM needs to maintain an S1 HWPT
list up front, linking an ASID to each HWPT entry added to the
list by deciphering the vSTE and its linked CD. Then it needs to
go through the S1 list with the ASID in the command to find all
corresponding HWPTs to issue/broadcast the command to (see the
sketch below).
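
To make the dispatch concrete, here is a rough sketch in C; every
type and helper name below is made up for illustration (this is
not the actual QEMU code):

struct cmd { unsigned int op, vsid, asid; };
enum { CMD_ATC_INV, CMD_TLBI_NH_ASID /* , ... */ };

struct viommu;				/* one per pSMMU */
struct s1_hwpt { unsigned int asid; struct viommu *viommu; };
struct vsmmu {
	struct viommu **viommus; unsigned int num_viommus;
	struct s1_hwpt *hwpts; unsigned int num_hwpts; /* from vSTE/CD */
};

void viommu_invalidate(struct viommu *viommu, const struct cmd *cmd);
struct viommu *vdev_to_viommu(struct vsmmu *vs, unsigned int vsid);

static void dispatch_cmd(struct vsmmu *vs, const struct cmd *cmd)
{
	unsigned int i;

	switch (cmd->op) {
	case CMD_ATC_INV:	/* device cache: vSID picks one viommu */
		viommu_invalidate(vdev_to_viommu(vs, cmd->vsid), cmd);
		break;
	case CMD_TLBI_NH_ASID:	/* case b): walk the S1 HWPT list */
		for (i = 0; i < vs->num_hwpts; i++)
			if (vs->hwpts[i].asid == cmd->asid)
				viommu_invalidate(vs->hwpts[i].viommu, cmd);
		break;
	default:		/* case a): no ASID, broadcast to all */
		for (i = 0; i < vs->num_viommus; i++)
			viommu_invalidate(vs->viommus[i], cmd);
	}
}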

Thanks
Nicolin

2024-05-28 20:34:11

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Tue, May 28, 2024 at 01:22:46PM -0700, Nicolin Chen wrote:
> On Mon, May 27, 2024 at 01:08:43AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Friday, May 24, 2024 9:19 PM
> > >
> > > On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote:
> > > > I'm curious to learn the real reason of that design. Is it because you
> > > > want to do certain load-balance between viommu's or due to other
> > > > reasons in the kernel smmuv3 driver which e.g. cannot support a
> > > > viommu spanning multiple pSMMU?
> > >
> > > Yeah, there is no concept of support for a SMMUv3 instance where it's
> > > command Q's can only work on a subset of devices.
> > >
> > > My expectation was that VIOMMU would be 1:1 with physical iommu
> > > instances, I think AMD needs this too??
> > >
> >
> > Yes this part is clear now regarding to VCMDQ.
> >
> > But Nicolin said:
> >
> > "
> > One step back, even without VCMDQ feature, a multi-pSMMU setup
> > will have multiple viommus (with our latest design) being added
> > to a viommu list of a single vSMMU's. Yet, vSMMU in this case
> > always traps regular SMMU CMDQ, so it can do viommu selection
> > or even broadcast (if it has to).
> > "
> >
> > I don't think there is an arch limitation mandating that?
>
> What I mean is for regular vSMMU. Without VCMDQ, a regular vSMMU
> on a multi-pSMMU setup will look like (e.g. three devices behind
> different SMMUs):
> |<------ VMM ------->|<------ kernel ------>|
> |-- viommu0 --|-- pSMMU0 --|
> vSMMU--|-- viommu1 --|-- pSMMU1 --|--s2_hwpt
> |-- viommu2 --|-- pSMMU2 --|
>
> And device would attach to:
> |<---- guest ---->|<--- VMM --->|<- kernel ->|
> |-- dev0 --|-- viommu0 --|-- pSMMU0 --|
> vSMMU--|-- dev1 --|-- viommu1 --|-- pSMMU1 --|
> |-- dev2 --|-- viommu2 --|-- pSMMU2 --|

I accidentally sent a duplicated one. Please ignore this reply
and check the other one. Thanks!

2024-05-28 21:07:00

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Mon, May 27, 2024 at 01:08:43AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Friday, May 24, 2024 9:19 PM
> >
> > On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote:
> > > I'm curious to learn the real reason of that design. Is it because you
> > > want to do certain load-balance between viommu's or due to other
> > > reasons in the kernel smmuv3 driver which e.g. cannot support a
> > > viommu spanning multiple pSMMU?
> >
> > Yeah, there is no concept of support for a SMMUv3 instance where it's
> > command Q's can only work on a subset of devices.
> >
> > My expectation was that VIOMMU would be 1:1 with physical iommu
> > instances, I think AMD needs this too??
> >
>
> Yes this part is clear now regarding to VCMDQ.
>
> But Nicolin said:
>
> "
> One step back, even without VCMDQ feature, a multi-pSMMU setup
> will have multiple viommus (with our latest design) being added
> to a viommu list of a single vSMMU's. Yet, vSMMU in this case
> always traps regular SMMU CMDQ, so it can do viommu selection
> or even broadcast (if it has to).
> "
>
> I don't think there is an arch limitation mandating that?

What I mean is a regular vSMMU case. Without VCMDQ, a regular
vSMMU on a multi-pSMMU setup will look like this (e.g. three
devices behind different SMMUs):
       |<------ VMM ------->|<------ kernel ------>|
       |-- viommu0 --|-- pSMMU0 --|
vSMMU--|-- viommu1 --|-- pSMMU1 --|--s2_hwpt
       |-- viommu2 --|-- pSMMU2 --|

And device would attach to:
       |<---- guest ---->|<--- VMM --->|<- kernel ->|
       |-- dev0 --|-- viommu0 --|-- pSMMU0 --|
vSMMU--|-- dev1 --|-- viommu1 --|-- pSMMU1 --|
       |-- dev2 --|-- viommu2 --|-- pSMMU2 --|

When trapping a device cache invalidation, it is straightforward:
decipher the virtual device ID to pick the viommu that the device
is attached to.

When doing iotlb invalidation, a command may or may not contain
an ASID (a domain ID, a nested domain in this case):
a) if a command doesn't have an ASID, the VMM needs to broadcast
the command to all viommus (i.e. pSMMUs)
b) if a command has an ASID, the VMM needs to maintain an S1 HWPT
list up front, linking an ASID to each HWPT entry added to the
list by deciphering the vSTE and its linked CD. Then it needs to
go through the S1 list with the ASID in the command to find all
corresponding HWPTs to issue/broadcast the command to.

Thanks
Nicolin

2024-05-29 02:58:29

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, May 29, 2024 4:23 AM
>
> On Mon, May 27, 2024 at 01:08:43AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Friday, May 24, 2024 9:19 PM
> > >
> > > On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote:
> > > > I'm curious to learn the real reason of that design. Is it because you
> > > > want to do certain load-balance between viommu's or due to other
> > > > reasons in the kernel smmuv3 driver which e.g. cannot support a
> > > > viommu spanning multiple pSMMU?
> > >
> > > Yeah, there is no concept of support for a SMMUv3 instance where it's
> > > command Q's can only work on a subset of devices.
> > >
> > > My expectation was that VIOMMU would be 1:1 with physical iommu
> > > instances, I think AMD needs this too??
> > >
> >
> > Yes this part is clear now regarding to VCMDQ.
> >
> > But Nicolin said:
> >
> > "
> > One step back, even without VCMDQ feature, a multi-pSMMU setup
> > will have multiple viommus (with our latest design) being added
> > to a viommu list of a single vSMMU's. Yet, vSMMU in this case
> > always traps regular SMMU CMDQ, so it can do viommu selection
> > or even broadcast (if it has to).
> > "
> >
> > I don't think there is an arch limitation mandating that?
>
> What I mean is for regular vSMMU. Without VCMDQ, a regular vSMMU
> on a multi-pSMMU setup will look like (e.g. three devices behind
> different SMMUs):
> |<------ VMM ------->|<------ kernel ------>|
> |-- viommu0 --|-- pSMMU0 --|
> vSMMU--|-- viommu1 --|-- pSMMU1 --|--s2_hwpt
> |-- viommu2 --|-- pSMMU2 --|
>
> And device would attach to:
> |<---- guest ---->|<--- VMM --->|<- kernel ->|
> |-- dev0 --|-- viommu0 --|-- pSMMU0 --|
> vSMMU--|-- dev1 --|-- viommu1 --|-- pSMMU1 --|
> |-- dev2 --|-- viommu2 --|-- pSMMU2 --|
>
> When trapping a device cache invalidation: it is straightforward
> by deciphering the virtual device ID to pick the viommu that the
> device is attached to.

I understand how the above works.

My question is why that option was chosen instead of going with a
1:1 mapping between vSMMU and viommu, i.e. letting the kernel
figure out which pSMMU an invalidation cmd should be sent to, as
is done for VT-d virtualization.

I want to know whether doing so is simply to be compatible with
what VCMDQ requires, or due to another untold reason.

>
> When doing iotlb invalidation, a command may or may not contain
> an ASID (a domain ID, and nested domain in this case):
> a) if a command doesn't have an ASID, VMM needs to broadcast the
> command to all viommus (i.e. pSMMUs)
> b) if a command has an ASID, VMM needs to initially maintain an
> S1 HWPT list by linking an ASID when adding an HWPT entry to
> the list, by deciphering vSTE and its linked CD. Then it needs
> to go through the S1 list with the ASID in the command, and to
> find all corresponding HWPTs to issue/broadcast the command.
>
> Thanks
> Nicolin

2024-05-29 03:22:13

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <[email protected]>
> > Sent: Wednesday, May 29, 2024 4:23 AM

> > What I mean is for regular vSMMU. Without VCMDQ, a regular vSMMU
> > on a multi-pSMMU setup will look like (e.g. three devices behind
> > different SMMUs):
> > |<------ VMM ------->|<------ kernel ------>|
> > |-- viommu0 --|-- pSMMU0 --|
> > vSMMU--|-- viommu1 --|-- pSMMU1 --|--s2_hwpt
> > |-- viommu2 --|-- pSMMU2 --|
> >
> > And device would attach to:
> > |<---- guest ---->|<--- VMM --->|<- kernel ->|
> > |-- dev0 --|-- viommu0 --|-- pSMMU0 --|
> > vSMMU--|-- dev1 --|-- viommu1 --|-- pSMMU1 --|
> > |-- dev2 --|-- viommu2 --|-- pSMMU2 --|
> >
> > When trapping a device cache invalidation: it is straightforward
> > by deciphering the virtual device ID to pick the viommu that the
> > device is attached to.
>
> I understand how above works.
>
> My question is why that option is chosen instead of going with 1:1
> mapping between vSMMU and viommu i.e. letting the kernel to
> figure out which pSMMU should be sent an invalidation cmd to, as
> how VT-d is virtualized.
>
> I want to know whether doing so is simply to be compatible with
> what VCMDQ requires, or due to another untold reason.

Because we use viommu as a VMID holder for SMMU. So a pSMMU must
have its own viommu to store its VMID for a shared s2_hwpt:
       |-- viommu0 (VMIDx) --|-- pSMMU0 --|
vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
       |-- viommu2 (VMIDz) --|-- pSMMU2 --|

Thanks
Nicolin

2024-05-30 00:28:57

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, May 29, 2024 11:21 AM
>
> On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <[email protected]>
> > > Sent: Wednesday, May 29, 2024 4:23 AM
>
> > > What I mean is for regular vSMMU. Without VCMDQ, a regular vSMMU
> > > on a multi-pSMMU setup will look like (e.g. three devices behind
> > > different SMMUs):
> > > |<------ VMM ------->|<------ kernel ------>|
> > > |-- viommu0 --|-- pSMMU0 --|
> > > vSMMU--|-- viommu1 --|-- pSMMU1 --|--s2_hwpt
> > > |-- viommu2 --|-- pSMMU2 --|
> > >
> > > And device would attach to:
> > > |<---- guest ---->|<--- VMM --->|<- kernel ->|
> > > |-- dev0 --|-- viommu0 --|-- pSMMU0 --|
> > > vSMMU--|-- dev1 --|-- viommu1 --|-- pSMMU1 --|
> > > |-- dev2 --|-- viommu2 --|-- pSMMU2 --|
> > >
> > > When trapping a device cache invalidation: it is straightforward
> > > by deciphering the virtual device ID to pick the viommu that the
> > > device is attached to.
> >
> > I understand how above works.
> >
> > My question is why that option is chosen instead of going with 1:1
> > mapping between vSMMU and viommu i.e. letting the kernel to
> > figure out which pSMMU should be sent an invalidation cmd to, as
> > how VT-d is virtualized.
> >
> > I want to know whether doing so is simply to be compatible with
> > what VCMDQ requires, or due to another untold reason.
>
> Because we use viommu as a VMID holder for SMMU. So a pSMMU must
> have its own viommu to store its VMID for a shared s2_hwpt:
> |-- viommu0 (VMIDx) --|-- pSMMU0 --|
> vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
> |-- viommu2 (VMIDz) --|-- pSMMU2 --|
>

there are other options, e.g. you can have one viommu holding
multiple VMIDs, each associated with a pSMMU.

so it's really an implementation choice that you want the same
viommu topology scheme w/ or w/o VCMDQ.

we all know there are some constraints to copying the physical
topology, e.g. hotplugging a device or migration. for VCMDQ it's
clearly an acceptable tradeoff for performance. w/o VCMDQ I'm not
sure whether you want to keep more flexibility for the user.

2024-05-30 00:59:02

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <[email protected]>
> > Sent: Wednesday, May 29, 2024 11:21 AM
> > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > > My question is why that option is chosen instead of going with 1:1
> > > mapping between vSMMU and viommu i.e. letting the kernel to
> > > figure out which pSMMU should be sent an invalidation cmd to, as
> > > how VT-d is virtualized.
> > >
> > > I want to know whether doing so is simply to be compatible with
> > > what VCMDQ requires, or due to another untold reason.
> >
> > Because we use viommu as a VMID holder for SMMU. So a pSMMU must
> > have its own viommu to store its VMID for a shared s2_hwpt:
> > |-- viommu0 (VMIDx) --|-- pSMMU0 --|
> > vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
> > |-- viommu2 (VMIDz) --|-- pSMMU2 --|
> >
>
> there are other options, e.g. you can have one viommu holding multiple
> VMIDs each associating to a pSMMU.

Well, possibly. But everything previously in a viommu would have
to become a list (sized by the number of VMIDs) at the kernel
level: not only a VMID list, but also 2D virtual ID lists,
something like:

struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup

And a driver in this case would have a hard time getting a
complete view of a viommu object, since it's a core object shared
by all kernel-level IOMMU instances. If a driver wants to extend a
viommu object for some additional feature, e.g. VINTF in this
series, it would likely have to create another per-driver object
and, again, another list of this kind of objects in struct viommu.

At the end of the day, we have to duplicate it one way or another
for the number of physical IOMMUs. And it seems cleaner to do that
with multiple viommu objects, roughly contrasted in the sketch
below.
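
A rough contrast of the two layouts, with hypothetical field names
(only a sketch, not the actual structures in this series):

/* option taken here: one viommu object per pIOMMU */
struct viommu {
	struct iommu_device *iommu_dev;	/* the single backing pIOMMU */
	unsigned int vmid;		/* its one VMID */
	struct xarray vdev_ids;		/* one vID -> pID lookup */
	/* per-driver extensions, e.g. VINTF, embed naturally here */
};

/* alternative: one viommu spanning pIOMMUs -- everything turns
 * into a per-pIOMMU list inside the core object */
struct viommu_spanning {
	unsigned int num_piommus;
	unsigned int *vmids;		/* one VMID per pIOMMU */
	struct xarray *vdev_ids;	/* per-pIOMMU vID -> pID lookups */
	struct list_head driver_objs;	/* per-driver per-pIOMMU objects */
};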

> so it's really an implementation choice that you want a same viommu
> topology scheme w/ or w/o VCMDQ.
>
> we all know there are some constraints of copying the physical topology,
> e.g. hotplugging a device or migration. for VCMDQ it's clearly an
> acceptable tradeoff for performance. w/o VCMDQ I'm not sure whether
> you want to keep more flexibility for the user.

Oh. With regular nested SMMU, there is only one virtual SMMU in
the guest VM. No need to copy the physical topology. The VMM just
needs to allocate three viommus and add them to a list of its own.

Thanks
Nicolin

2024-05-30 03:05:19

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

> From: Nicolin Chen <[email protected]>
> Sent: Thursday, May 30, 2024 8:59 AM
>
> On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <[email protected]>
> > > Sent: Wednesday, May 29, 2024 11:21 AM
> > > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > > > My question is why that option is chosen instead of going with 1:1
> > > > mapping between vSMMU and viommu i.e. letting the kernel to
> > > > figure out which pSMMU should be sent an invalidation cmd to, as
> > > > how VT-d is virtualized.
> > > >
> > > > I want to know whether doing so is simply to be compatible with
> > > > what VCMDQ requires, or due to another untold reason.
> > >
> > > Because we use viommu as a VMID holder for SMMU. So a pSMMU must
> > > have its own viommu to store its VMID for a shared s2_hwpt:
> > > |-- viommu0 (VMIDx) --|-- pSMMU0 --|
> > > vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
> > > |-- viommu2 (VMIDz) --|-- pSMMU2 --|
> > >
> >
> > there are other options, e.g. you can have one viommu holding multiple
> > VMIDs each associating to a pSMMU.
>
> Well, possibly. But everything previously in a viommu would have
> to be a list (for number of VMIDs) in the kernel level: not only
> a VMID list, but also a 2D virtual ID lists, something like:
>
> struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup

ah, does it mean that the dev ID space on ARM is local to each SMMU?

It's not the case on x86 platforms, where the RID is platform-wide.

>
> And a driver in this case would be difficult to get a complete
> concept of a viommu object since it's core object and shared by
> all kernel-level IOMMU instances. If a driver wants to extend a
> viommu object for some additional feature, e.g. VINTF in this
> series, it would likely have to create another per-driver object
> and again another list of this kind of objects in struct viommu.
>
> At the end of day, we have to duplicate it one way or another for
> the amount of physical IOMMUs. And it seems to cleaner by doing
> it with multiple viommu objects.
>
> > so it's really an implementation choice that you want a same viommu
> > topology scheme w/ or w/o VCMDQ.
> >
> > we all know there are some constraints of copying the physical topology,
> > e.g. hotplugging a device or migration. for VCMDQ it's clearly an
> > acceptable tradeoff for performance. w/o VCMDQ I'm not sure whether
> > you want to keep more flexibility for the user.
>
> Oh. With regular nested SMMU, there is only one virtual SMMU in
> the guest VM. No need of copying physical topology. Just the VMM
> needs to allocate three viommus to add them to a list of its own.
>

Okay, I missed this point. Then the number of viommus is really
about the contract between the VMM and the kernel, which is
invisible to the guest or the admin who launches Qemu.

but wait, isn't there another problem - does the VMM have the
permission to enumerate the topology of physical SMMUs? Or perhaps
the VMM only needs to know the number of relevant SMMUs for the
assigned devices, and such info can be indirectly composed by
extending GET_HW_INFO...

2024-05-30 04:26:50

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Thu, May 30, 2024 at 03:05:05AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <[email protected]>
> > Sent: Thursday, May 30, 2024 8:59 AM
> > On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <[email protected]>
> > > > Sent: Wednesday, May 29, 2024 11:21 AM
> > > > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > > > > My question is why that option is chosen instead of going with 1:1
> > > > > mapping between vSMMU and viommu i.e. letting the kernel to
> > > > > figure out which pSMMU should be sent an invalidation cmd to, as
> > > > > how VT-d is virtualized.
> > > > >
> > > > > I want to know whether doing so is simply to be compatible with
> > > > > what VCMDQ requires, or due to another untold reason.
> > > >
> > > > Because we use viommu as a VMID holder for SMMU. So a pSMMU must
> > > > have its own viommu to store its VMID for a shared s2_hwpt:
> > > > |-- viommu0 (VMIDx) --|-- pSMMU0 --|
> > > > vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
> > > > |-- viommu2 (VMIDz) --|-- pSMMU2 --|
> > > >
> > >
> > > there are other options, e.g. you can have one viommu holding multiple
> > > VMIDs each associating to a pSMMU.
> >
> > Well, possibly. But everything previously in a viommu would have
> > to be a list (for number of VMIDs) in the kernel level: not only
> > a VMID list, but also a 2D virtual ID lists, something like:
> >
> > struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup
>
> ah does it mean that dev ID space on ARM is local to SMMU?
>
> It's not the case on x86 platforms where the RID is platform-wise.

Actually, I had a second thought after I replied. Yea, we only
support PCI devices at this moment, so their RIDs come from BDF
numbers, and should then be platform-wide, as you pointed out:
          |<-------VMM-------->|<--------------- kernel ------------>|
vRID_a --|                       |-- VMIDx / SMMU0 -- pRID_A
vRID_b --| => vSMMU => viommu => |-- VMIDy / SMMU1 -- pRID_B
vRID_c --|                       |-- VMIDz / SMMU2 -- pRID_C

# x/y/z can be identical, while a/b/c and A/B/C must be unique.
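
Such a platform-wide lookup could be sketched like this (types and
names are hypothetical; xa_load() is the regular xarray lookup):

struct vdev_id_entry {
	struct viommu *viommu;	/* selects VMIDx / SMMUn */
	u32 prid;		/* pRID_A/B/C */
};

static u32 vrid_to_prid(struct xarray *vdev_ids, u32 vrid,
			struct viommu **viommu)
{
	struct vdev_id_entry *e = xa_load(vdev_ids, vrid);

	*viommu = e->viommu;	/* error handling omitted */
	return e->prid;
}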

So a single lookup list can likely be enough. That still can't
avoid the list of per-pIOMMU objects for some driver features,
though. So I think having a per-pIOMMU viommu also adds a bit
of flexibility for the kernel.

Overall, indeed an implementation choice, as you mentioned :)

> > And a driver in this case would be difficult to get a complete
> > concept of a viommu object since it's core object and shared by
> > all kernel-level IOMMU instances. If a driver wants to extend a
> > viommu object for some additional feature, e.g. VINTF in this
> > series, it would likely have to create another per-driver object
> > and again another list of this kind of objects in struct viommu.
> >
> > At the end of day, we have to duplicate it one way or another for
> > the amount of physical IOMMUs. And it seems to cleaner by doing
> > it with multiple viommu objects.
> >
> > > so it's really an implementation choice that you want a same viommu
> > > topology scheme w/ or w/o VCMDQ.
> > >
> > > we all know there are some constraints of copying the physical topology,
> > > e.g. hotplugging a device or migration. for VCMDQ it's clearly an
> > > acceptable tradeoff for performance. w/o VCMDQ I'm not sure whether
> > > you want to keep more flexibility for the user. ????
> >
> > Oh. With regular nested SMMU, there is only one virtual SMMU in
> > the guest VM. No need of copying physical topology. Just the VMM
> > needs to allocate three viommus to add them to a list of its own.
> >
>
> Okay I missed this point. Then the number of viommus is really about
> the contract between the vmm and the kernel, which is invisible to
> the guest or the admin who launches the Qemu.

Yes. Everything should be behind the scenes, since the VMM can
trap and select, unlike VCMDQ, which does direct MMIO to the HW
without a chance of VM Exits.

> but wait, isn't there another problem - does the VMM have the
> permission to enumerate the topology of physical SMMUs? Or probably
> the VMM only needs to know the number of relevant SMMUs for
> assigned devices and such info can be indirectly composed by extending GET_HW_INFO...

I think the VMM already has some kind of permission to read the
number of IOMMUs from "/sys/class/iommu"?

That being said, in a regular nesting case w/o VCMDQ, there is
no need to instantiate as many vSMMUs as there are pSMMUs, unless
at least one device behind each pSMMU attaches.

So, the current implementation allocates an s2_hwpt/viommu only
on demand, when the first device from a physical SMMU attaches,
meaning there'll be only one viommu in the list until the first
device from another pSMMU attaches. And the actual attach logic
always tries attaching a device to the existing viommus in the
list, and only allocates a new viommu (or s2_hwpt) when all of
them fail, roughly as sketched below. FWIW, the host kernel/driver
has full control over these attaches.
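
In pseudo-C, with hypothetical names, the attach flow is roughly:

static struct viommu *vsmmu_attach_dev(struct vsmmu *vs,
				       struct device *dev)
{
	struct viommu *viommu;

	/* try every existing viommu in the list first... */
	list_for_each_entry(viommu, &vs->viommu_list, node)
		if (!viommu_attach(viommu, dev))
			return viommu;	/* dev sits behind the same pSMMU */

	/* ...allocate a new one (sharing the S2) only when all fail */
	viommu = viommu_alloc(vs, vs->s2_hwpt);
	list_add_tail(&viommu->node, &vs->viommu_list);
	return viommu_attach(viommu, dev) ? NULL : viommu;
}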

Thanks
Nicolin

2024-06-01 23:05:38

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Wed, May 29, 2024 at 05:58:39PM -0700, Nicolin Chen wrote:
> On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <[email protected]>
> > > Sent: Wednesday, May 29, 2024 11:21 AM
> > > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > > > My question is why that option is chosen instead of going with 1:1
> > > > mapping between vSMMU and viommu i.e. letting the kernel to
> > > > figure out which pSMMU should be sent an invalidation cmd to, as
> > > > how VT-d is virtualized.
> > > >
> > > > I want to know whether doing so is simply to be compatible with
> > > > what VCMDQ requires, or due to another untold reason.
> > >
> > > Because we use viommu as a VMID holder for SMMU. So a pSMMU must
> > > have its own viommu to store its VMID for a shared s2_hwpt:
> > > |-- viommu0 (VMIDx) --|-- pSMMU0 --|
> > > vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
> > > |-- viommu2 (VMIDz) --|-- pSMMU2 --|
> > >
> >
> > there are other options, e.g. you can have one viommu holding multiple
> > VMIDs each associating to a pSMMU.
>
> Well, possibly. But everything previously in a viommu would have
> to be a list (for number of VMIDs) in the kernel level: not only
> a VMID list, but also a 2D virtual ID lists, something like:
>
> struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup

I feel it makes the most sense that ARM (and maybe everyone) just
has a viommu per piommu.

The main argument against is we haven't made it efficient for the VMM
to support multiple piommus. It has to do a system call per piommu
each time it processes the cmdq.

But, on the other hand, if you care about invalidation efficiency it
is kind of silly not to expose the piommus to the guest so that the
invalidation scope can be appropriately narrowed. Replicating all ASID
invalidations to all piommus doesn't make a lot of sense if the guest
can know that only one piommu actually needs invalidation.

> And a driver in this case would be difficult to get a complete
> concept of a viommu object since it's core object and shared by
> all kernel-level IOMMU instances. If a driver wants to extend a
> viommu object for some additional feature, e.g. VINTF in this
> series, it would likely have to create another per-driver object
> and again another list of this kind of objects in struct viommu.

Right, we need some kind of per-piommu object because we have
per-piommu data.

> Oh. With regular nested SMMU, there is only one virtual SMMU in
> the guest VM. No need of copying physical topology. Just the VMM
> needs to allocate three viommus to add them to a list of its own.

I understand the appeal of doing this has been to minimize qemu
changes in its ACPI parts; if we tackle that instead, maybe we
should just not implement viommu-to-multiple-piommu. It is somewhat
complicated.

Jason

2024-06-03 03:26:19

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Sat, Jun 01, 2024 at 06:45:01PM -0300, Jason Gunthorpe wrote:
> On Wed, May 29, 2024 at 05:58:39PM -0700, Nicolin Chen wrote:
> > On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <[email protected]>
> > > > Sent: Wednesday, May 29, 2024 11:21 AM
> > > > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > > > > My question is why that option is chosen instead of going with 1:1
> > > > > mapping between vSMMU and viommu i.e. letting the kernel to
> > > > > figure out which pSMMU should be sent an invalidation cmd to, as
> > > > > how VT-d is virtualized.
> > > > >
> > > > > I want to know whether doing so is simply to be compatible with
> > > > > what VCMDQ requires, or due to another untold reason.
> > > >
> > > > Because we use viommu as a VMID holder for SMMU. So a pSMMU must
> > > > have its own viommu to store its VMID for a shared s2_hwpt:
> > > > |-- viommu0 (VMIDx) --|-- pSMMU0 --|
> > > > vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
> > > > |-- viommu2 (VMIDz) --|-- pSMMU2 --|
> > > >
> > >
> > > there are other options, e.g. you can have one viommu holding multiple
> > > VMIDs each associating to a pSMMU.
> >
> > Well, possibly. But everything previously in a viommu would have
> > to be a list (for number of VMIDs) in the kernel level: not only
> > a VMID list, but also a 2D virtual ID lists, something like:
> >
> > struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup
>
> I feel it makes most sense that ARM (and maybe everyone) just have a
> viommu per piommu.
>
> The main argument against is we haven't made it efficient for the VMM
> to support multiple piommus. It has to do a system call per piommu
> each time it processes the cmdq.
>
> But, on the other hand, if you care about invalidation efficiency it
> is kind of silly not to expose the piommus to the guest so that the
> invalidation scope can be appropriately narrowed. Replicating all ASID
> invalidations to all piommus doesn't make a lot of sense if the guest
> can know that only one piommu actually needs invalidation.

Yea, that'd be pretty slow, though a broadcast would still be
inevitable when an invalidation only has an address range w/o
an ASID, CMD_TLBI_NH_VAA for example.

In fact, there should always be a dispatcher (vs. broadcast):
- in the one-viommu-per-pIOMMU case (#1), it's in the VMM
- in the one-viommu-per-vIOMMU case (#2), it's in the kernel

One of them has to take on the role of burning some CPU cycles
for-eaching the hwpt list to identify the iommu to forward to.
Design #1 simply makes the kernel side easier.

Design #2, on the other hand, would not only require the lists
and new objects that we just discussed, but also a pair of
VIOMMU_SET/UNSET_HWPT_ID ioctls. Though that also makes sense,
as we chose IOMMU_VIOMMU_INVALIDATE over IOMMU_DEV_INVALIDATE
by adding VIOMMU_SET/UNSET_VDEV_ID?

> > And a driver in this case would be difficult to get a complete
> > concept of a viommu object since it's core object and shared by
> > all kernel-level IOMMU instances. If a driver wants to extend a
> > viommu object for some additional feature, e.g. VINTF in this
> > series, it would likely have to create another per-driver object
> > and again another list of this kind of objects in struct viommu.
>
> Right, we need some kind of per-piommu object because we have
> per-piommu data.
>
> > Oh. With regular nested SMMU, there is only one virtual SMMU in
> > the guest VM. No need of copying physical topology. Just the VMM
> > needs to allocate three viommus to add them to a list of its own.
>
> I understand the appeal of doing this has been to minimize qemu
> changes in its ACPI parts if we tackle that instead maybe we should
> just not implement viommu to multiple piommu. It is somewhat
> complicated.

Would you please clarify that suggestion "not implement viommu
to multiple piommu"?

For regular nesting (SMMU), we are still doing one vSMMU in the
VMM, though the VCMDQ case would be an exception.

Thanks
Nicolin

2024-06-06 18:24:46

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote:

> > I understand the appeal of doing this has been to minimize qemu
> > changes in its ACPI parts if we tackle that instead maybe we should
> > just not implement viommu to multiple piommu. It is somewhat
> > complicated.
>
> Would you please clarify that suggestion "not implement viommu
> to multiple piommu"?
>
> For regular nesting (SMMU), we are still doing one vSMMU in the
> VMM, though VCMDQ case would be an exception....

This is what I mean, always do multiple vSMMU if there are multiple
physical pSMMUs. Don't replicate any virtual commands across pSMMUs.

Jason

2024-06-06 18:45:30

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Thu, Jun 06, 2024 at 03:24:23PM -0300, Jason Gunthorpe wrote:
> On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote:
>
> > > I understand the appeal of doing this has been to minimize qemu
> > > changes in its ACPI parts if we tackle that instead maybe we should
> > > just not implement viommu to multiple piommu. It is somewhat
> > > complicated.
> >
> > Would you please clarify that suggestion "not implement viommu
> > to multiple piommu"?
> >
> > For regular nesting (SMMU), we are still doing one vSMMU in the
> > VMM, though VCMDQ case would be an exception....
>
> This is what I mean, always do multiple vSMMU if there are multiple
> physical pSMMUs. Don't replicate any virtual commands across pSMMUs.

Thanks for clarifying. That also means you'd prefer putting the
command dispatcher in the VMM, which is what we have at this moment.
Nicolin

2024-06-07 00:27:22

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Thu, Jun 06, 2024 at 11:44:58AM -0700, Nicolin Chen wrote:
> On Thu, Jun 06, 2024 at 03:24:23PM -0300, Jason Gunthorpe wrote:
> > On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote:
> >
> > > > I understand the appeal of doing this has been to minimize qemu
> > > > changes in its ACPI parts if we tackle that instead maybe we should
> > > > just not implement viommu to multiple piommu. It is somewhat
> > > > complicated.
> > >
> > > Would you please clarify that suggestion "not implement viommu
> > > to multiple piommu"?
> > >
> > > For regular nesting (SMMU), we are still doing one vSMMU in the
> > > VMM, though VCMDQ case would be an exception....
> >
> > This is what I mean, always do multiple vSMMU if there are multiple
> > physical pSMMUs. Don't replicate any virtual commands across pSMMUs.
>
> Thanks for clarifying. That also means you'd prefer putting the
> command dispatcher in VMM, which is what we have at this moment.

Unless someone knows a reason why we should strive hard to have only a
single vSMMU and accept some invalidation inefficiency?

Jason

2024-06-07 00:37:10

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, June 7, 2024 8:27 AM
>
> On Thu, Jun 06, 2024 at 11:44:58AM -0700, Nicolin Chen wrote:
> > On Thu, Jun 06, 2024 at 03:24:23PM -0300, Jason Gunthorpe wrote:
> > > On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote:
> > >
> > > > > I understand the appeal of doing this has been to minimize qemu
> > > > > changes in its ACPI parts if we tackle that instead maybe we should
> > > > > just not implement viommu to multiple piommu. It is somewhat
> > > > > complicated.
> > > >
> > > > Would you please clarify that suggestion "not implement viommu
> > > > to multiple piommu"?
> > > >
> > > > For regular nesting (SMMU), we are still doing one vSMMU in the
> > > > VMM, though VCMDQ case would be an exception....
> > >
> > > This is what I mean, always do multiple vSMMU if there are multiple
> > > physical pSMMUs. Don't replicate any virtual commands across pSMMUs.
> >
> > Thanks for clarifying. That also means you'd prefer putting the
> > command dispatcher in VMM, which is what we have at this moment.
>
> Unless someone knows a reason why we should strive hard to have only a
> single vSMMU and accept some invalidation inefficiency?
>

migration? a single vSMMU provides better compatibility between
src/dest...

2024-06-07 14:55:19

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Fri, Jun 07, 2024 at 12:36:46AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Friday, June 7, 2024 8:27 AM
> >
> > On Thu, Jun 06, 2024 at 11:44:58AM -0700, Nicolin Chen wrote:
> > > On Thu, Jun 06, 2024 at 03:24:23PM -0300, Jason Gunthorpe wrote:
> > > > On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote:
> > > >
> > > > > > I understand the appeal of doing this has been to minimize qemu
> > > > > > changes in its ACPI parts if we tackle that instead maybe we should
> > > > > > just not implement viommu to multiple piommu. It is somewhat
> > > > > > complicated.
> > > > >
> > > > > Would you please clarify that suggestion "not implement viommu
> > > > > to multiple piommu"?
> > > > >
> > > > > For regular nesting (SMMU), we are still doing one vSMMU in the
> > > > > VMM, though VCMDQ case would be an exception....
> > > >
> > > > This is what I mean, always do multiple vSMMU if there are multiple
> > > > physical pSMMUs. Don't replicate any virtual commands across pSMMUs.
> > >
> > > Thanks for clarifying. That also means you'd prefer putting the
> > > command dispatcher in VMM, which is what we have at this moment.
> >
> > Unless someone knows a reason why we should strive hard to have only a
> > single vSMMU and accept some invalidation inefficiency?
> >
>
> migration? a single vSMMU provides better compatibility between
> src/dest...

Maybe, though I think we can safely split a single pSMMU into multiple
vSMMUs using the IOMMUFD vIOMMU interface. So if your machine model
has two vSMMUs and your physical HW has fewer, we can still make that
work.

IOTLB efficiency will suffer though when splitting 1p -> 2v while
invalidation performance will suffer when joining 2p -> 1v.

Jason

2024-06-07 21:19:55

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Fri, Jun 07, 2024 at 11:49:17AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 07, 2024 at 12:36:46AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Friday, June 7, 2024 8:27 AM
> > >
> > > On Thu, Jun 06, 2024 at 11:44:58AM -0700, Nicolin Chen wrote:
> > > > On Thu, Jun 06, 2024 at 03:24:23PM -0300, Jason Gunthorpe wrote:
> > > > > On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote:
> > > > >
> > > > > > > I understand the appeal of doing this has been to minimize qemu
> > > > > > > changes in its ACPI parts if we tackle that instead maybe we should
> > > > > > > just not implement viommu to multiple piommu. It is somewhat
> > > > > > > complicated.
> > > > > >
> > > > > > Would you please clarify that suggestion "not implement viommu
> > > > > > to multiple piommu"?
> > > > > >
> > > > > > For regular nesting (SMMU), we are still doing one vSMMU in the
> > > > > > VMM, though VCMDQ case would be an exception....
> > > > >
> > > > > This is what I mean, always do multiple vSMMU if there are multiple
> > > > > physical pSMMUs. Don't replicate any virtual commands across pSMMUs.
> > > >
> > > > Thanks for clarifying. That also means you'd prefer putting the
> > > > command dispatcher in VMM, which is what we have at this moment.
> > >
> > > Unless someone knows a reason why we should strive hard to have only a
> > > single vSMMU and accept some invalidation inefficiency?
> > >
> >
> > migration? a single vSMMU provides better compatibility between
> > src/dest...
>
> Maybe, though I think we can safely split a single pSMMU into multiple
> vSMMUs using the IOMMUFD vIOMMU interface. So if your machine model
> has two vSMMUs and your physical HW has less we can still make that
> work.
>
> IOTLB efficiency will suffer though when splitting 1p -> 2v while
> invalidation performance will suffer when joining 2p -> 1v.

I think the invalidation efficiency is actually solvable. So,
basically viommu_invalidate would receive a whole batch of cmds
and dispatch them to different pSMMUs (nested_domains/devices).
We already have a vdev_id table for devices; we'd just need a
new vasid table for nested_domains. Right?
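
Just to illustrate the dispatch idea (a user-space toy model; every
name below is made up for this discussion, nothing is from the series
itself):

#include <stdio.h>

#define NUM_PSMMU 2
#define MAX_VASID 16

/* vasid -> pSMMU index; -1 means "not seen yet" */
static int vasid_table[MAX_VASID] = { [0 ... MAX_VASID - 1] = -1 };

struct cmd {
        int asid;
        const char *op;
};

static void viommu_dispatch(const struct cmd *cmds, int n)
{
        for (int i = 0; i < n; i++) {
                int asid = cmds[i].asid;
                int p = (asid >= 0 && asid < MAX_VASID) ?
                        vasid_table[asid] : -1;

                if (p >= 0) {
                        printf("pSMMU%d <- %s (asid=%d)\n",
                               p, cmds[i].op, asid);
                } else {
                        /* unknown ASID: fall back to a broadcast */
                        for (int q = 0; q < NUM_PSMMU; q++)
                                printf("pSMMU%d <- %s (asid=%d, broadcast)\n",
                                       q, cmds[i].op, asid);
                }
        }
}

int main(void)
{
        struct cmd batch[] = {
                { 1, "CMD_TLBI_NH_ASID" },
                { 5, "CMD_TLBI_NH_VA" },
        };

        vasid_table[1] = 0;     /* asid 1 lives behind pSMMU0 */
        vasid_table[5] = 1;     /* asid 5 lives behind pSMMU1 */
        viommu_dispatch(batch, 2);
        return 0;
}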

The immediate benefit is that VMMs won't need to duplicate each
other's dispatcher pieces, and it likely helps migration, as Kevin
pointed out.

With that being said, it would make the kernel design a bit more
complicated. And the VMM still has to separate the commands for
passthrough devices (HW iotlb) from commands for emulated devices
(emulated iotlb), unless we further split the topology at the VM
level to have a dedicated vSMMU for all passthrough devices --
then VMM could just forward its entire cmdq to the kernel without
deciphering every command (likely?).

Thanks
Nicolin

2024-06-10 12:12:38

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Fri, Jun 07, 2024 at 02:19:21PM -0700, Nicolin Chen wrote:

> > IOTLB efficiency will suffer though when splitting 1p -> 2v while
> > invalidation performance will suffer when joining 2p -> 1v.
>
> I think the invalidation efficiency is actually solvable. So,
> basically viommu_invalidate would receive a whole batch of cmds
> and dispatch them to different pSMMUs (nested_domains/devices).
> We already have a vdev_id table for devices; we'd just need a
> new vasid table for nested_domains. Right?

You can't know the ASID usage of the hypervisor from the VM, unless
you also inspect the CD table memory in the guest. That seems like
something we should try hard to avoid.

> With that being said, it would make the kernel design a bit more
> complicated. And the VMM still has to separate the commands for
> passthrough devices (HW iotlb) from commands for emulated devices
> (emulated iotlb), unless we further split the topology at the VM
> level to have a dedicated vSMMU for all passthrough devices --
> then VMM could just forward its entire cmdq to the kernel without
> deciphering every command (likely?).

I would not include the emulated devices in a shared SMMU.. For the
same reason, we should try hard to avoid inspecting the page table
memory.

If a viommu is needed for emulated then virtio-iommu may be more
appropriate..

That said I'm sure someone will want to do this, so as long as it is
possible in the VMM, as slow as it may be, then it is fine.

Jason

2024-06-10 20:06:43

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Mon, Jun 10, 2024 at 09:04:46AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 07, 2024 at 02:19:21PM -0700, Nicolin Chen wrote:
>
> > > IOTLB efficiency will suffer though when splitting 1p -> 2v while
> > > invalidation performance will suffer when joining 2p -> 1v.
> >
> > I think the invalidation efficiency is actually solvable. So,
> > basically viommu_invalidate would receive a whole batch of cmds
> > and dispatch them to different pSMMUs (nested_domains/devices).
> > We already have a vdev_id table for devices; we'd just need a
> > new vasid table for nested_domains. Right?
>
> You can't know the ASID usage of the hypervisor from the VM, unless
> you also inspect the CD table memory in the guest. That seems like
> something we should try hard to avoid.

Actually, even now as we put a dispatcher in VMM, VMM still does
decode the CD table to link ASID to s1_hwpt. Otherwise, it could
only broadcast a TLBI cmd to all pSMMUs.

Doing it the other way, by moving it to the kernel, we'd just need
a pair of new ioctls to use when the VMM traps CFGI_CD cmds, so the
kernel driver, instead of the VMM's user driver, manages the links
between ASIDs and nested domains. Either a master ASID or SVA ASIDs
can be linked to the same nested_domain that's allocated per vSTE.
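
Hypothetically, the uapi for such a pair could look something like
the below (every name here is invented for the sake of discussion;
none of it exists in this series):

#include <linux/types.h>

/*
 * Invented sketch: link a guest ASID, trapped from a CFGI_CD command,
 * to the nested domain (s1_hwpt) it belongs to, so the kernel can
 * route per-ASID invalidations to the right pSMMU.
 */
struct iommu_viommu_set_vasid {
        __u32 size;
        __u32 viommu_id;
        __u32 vasid;    /* guest ASID from the trapped CFGI_CD */
        __u32 hwpt_id;  /* nested domain to link the ASID to */
};

struct iommu_viommu_unset_vasid {
        __u32 size;
        __u32 viommu_id;
        __u32 vasid;
};

plus a matching pair of SET/UNSET command codes.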

> > With that being said, it would make the kernel design a bit more
> > complicated. And the VMM still has to separate the commands for
> > passthrough devices (HW iotlb) from commands for emulated devices
> > (emulated iotlb), unless we further split the topology at the VM
> > level to have a dedicated vSMMU for all passthrough devices --
> > then VMM could just forward its entire cmdq to the kernel without
> > deciphering every command (likely?).
>
> I would not include the emulated devices in a shared SMMU.. For the
> same reason, we should try hard to avoid inspecting the page table
> memory.

I wouldn't like the idea of attaching emulated devices to a shared
vSMMU. Yet, mind elaborating why this would inspect the page table
memory? Or do you mean we should avoid VMM inspecting user tables?

> If a viommu is needed for emulated then virtio-iommu may be more
> appropriate..
>
> That said I'm sure someone will want to do this, so as long as it is
> possible in the VMM, as slow as it may be, then it is fine.

Eric hasn't replied to my previous query regarding how to design
this, but I'd guess the same. And it looks like Intel is doing so for
emulated devices, since there is only one intel_iommu instance in a VM.

Thanks
Nicolin

2024-06-10 22:01:28

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Mon, Jun 10, 2024 at 01:01:32PM -0700, Nicolin Chen wrote:
> On Mon, Jun 10, 2024 at 09:04:46AM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 07, 2024 at 02:19:21PM -0700, Nicolin Chen wrote:
> >
> > > > IOTLB efficiency will suffer though when splitting 1p -> 2v while
> > > > invalidation performance will suffer when joining 2p -> 1v.
> > >
> > > I think the invalidation efficiency is actually solvable. So,
> > > basically viommu_invalidate would receive a whole batch of cmds
> > > and dispatch them to different pSMMUs (nested_domains/devices).
> > > We already have a vdev_id table for devices; we'd just need a
> > > new vasid table for nested_domains. Right?
> >
> > You can't know the ASID usage of the hypervisor from the VM, unless
> > you also inspect the CD table memory in the guest. That seems like
> > something we should try hard to avoid.
>
> Actually, even now as we put a dispatcher in VMM, VMM still does
> decode the CD table to link ASID to s1_hwpt. Otherwise, it could
> only broadcast a TLBI cmd to all pSMMUs.

No, there should be no CD table decoding and no linking ASID to
anything by the VMM.

The ARM architecture is clean, the ASID can remain private to the VM,
there is no reason for the VMM to understand it.

The s1_hwpt is derived only from the vSTE and nothing more. It would
be fine for all the devices to have their own s1_hwpts with their own
vSTE's inside it.

> > > With that being said, it would make the kernel design a bit more
> > > complicated. And the VMM still has to separate the commands for
> > > passthrough devices (HW iotlb) from commands for emulated devices
> > > (emulated iotlb), unless we further split the topology at the VM
> > > level to have a dedicated vSMMU for all passthrough devices --
> > > then VMM could just forward its entire cmdq to the kernel without
> > > deciphering every command (likely?).
> >
> > I would not include the emulated devices in a shared SMMU.. For the
> > same reason, we should try hard to avoid inspecting the page table
> > memory.
>
> I wouldn't like the idea of attaching emulated devices to a shared
> vSMMU. Yet, mind elaborating why this would inspect the page table
> memory? Or do you mean we should avoid VMM inspecting user tables?

Emulated devices can't use the HW page table walker in the SMMU since
they won't get a clean CD linkage they can use. They have to manually
walk the page tables and convert them into an IOAS. It is a big PITA,
best to be avoided.
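
As a rough illustration of that pain, here is a deliberately
over-simplified, single-level toy (made-up names; a real walker must
handle the multi-level VMSAv8 format, block entries, permissions and
invalidation races):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define PTE_VALID 1ull

/* toy "guest table": index = IOVA page number, entry = PA | flags */
static uint64_t guest_pt[8];

/* stand-in for mapping into the emulated device's IOAS */
static void ioas_map(uint64_t iova, uint64_t pa, uint64_t len)
{
        printf("IOAS map: iova=0x%llx pa=0x%llx len=0x%llx\n",
               (unsigned long long)iova, (unsigned long long)pa,
               (unsigned long long)len);
}

static void shadow_walk(void)
{
        for (uint64_t i = 0; i < 8; i++) {
                uint64_t pte = guest_pt[i];

                if (pte & PTE_VALID)
                        ioas_map(i << PAGE_SHIFT, pte & ~0xfffull,
                                 1ull << PAGE_SHIFT);
        }
}

int main(void)
{
        guest_pt[2] = 0x40000000ull | PTE_VALID;
        guest_pt[5] = 0x40001000ull | PTE_VALID;
        shadow_walk();
        return 0;
}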

> > If a viommu is needed for emulated then virtio-iommu may be more
> > appropriate..
> >
> > That said I'm sure someone will want to do this, so as long as it is
> > possible in the VMM, as slow as it may be, then it is fine.
>
> Eric hasn't replied to my previous query regarding how to design
> this, but I'd guess the same. And it looks like Intel is doing so for
> emulated devices, since there is only one intel_iommu instance in a VM.

Yes, Intel has long had this code to build VFIO containers from page
table walks.

Jason

2024-06-10 23:05:05

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Mon, Jun 10, 2024 at 07:01:10PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 10, 2024 at 01:01:32PM -0700, Nicolin Chen wrote:
> > On Mon, Jun 10, 2024 at 09:04:46AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 07, 2024 at 02:19:21PM -0700, Nicolin Chen wrote:
> > >
> > > > > IOTLB efficiency will suffer though when splitting 1p -> 2v while
> > > > > invalidation performance will suffer when joining 2p -> 1v.
> > > >
> > > > I think the invalidation efficiency is actually solvable. So,
> > > > basically viommu_invalidate would receive a whole batch of cmds
> > > > and dispatch them to different pSMMUs (nested_domains/devices).
> > > > We already have a vdev_id table for devices; we'd just need a
> > > > new vasid table for nested_domains. Right?
> > >
> > > You can't know the ASID usage of the hypervisor from the VM, unless
> > > you also inspect the CD table memory in the guest. That seems like
> > > something we should try hard to avoid.
> >
> > Actually, even now as we put a dispatcher in VMM, VMM still does
> > decode the CD table to link ASID to s1_hwpt. Otherwise, it could
> > only broadcast a TLBI cmd to all pSMMUs.
>
> No, there should be no CD table decoding and no linking ASID to
> anything by the VMM.
>
> The ARM architecture is clean, the ASID can remain private to the VM,
> there is no reason for the VMM to understand it.

But a guest-level TLBI command usually has only ASID available to
know which pSMMU to dispatch the command. Without an ASID lookup
table, how could VMM then dispatch a command to the corresponding
pSMMU?

> The s1_hwpt is derived only from the vSTE and nothing more. It would
> be fine for all the devices to have their own s1_hwpts with their own
> vSTE's inside it.
>
> > > > With that being said, it would make the kernel design a bit more
> > > > complicated. And the VMM still has to separate the commands for
> > > > passthrough devices (HW iotlb) from commands for emulated devices
> > > > (emulated iotlb), unless we further split the topology at the VM
> > > > level to have a dedicated vSMMU for all passthrough devices --
> > > > then VMM could just forward its entire cmdq to the kernel without
> > > > deciphering every command (likely?).
> > >
> > > I would not include the emulated devices in a shared SMMU.. For the
> > > same reason, we should try hard to avoid inspecting the page table
> > > memory.
> >
> > I wouldn't like the idea of attaching emulated devices to a shared
> > vSMMU. Yet, mind elaborating why this would inspect the page table
> > memory? Or do you mean we should avoid VMM inspecting user tables?
>
> Emulated devices can't use the HW page table walker in the SMMU since
> they won't get a clean CD linkage they can use. They have to manually
> walk the page tables and convert them into an IOAS. It is a big PITA,
> best to be avoided.

Oh.. the mainline QEMU smmu code already has that:
https://github.com/qemu/qemu/blob/master/hw/arm/smmu-common.c#L522

Surely, it's a pathway that passthrough devices shouldn't run into.

Thanks
Nicolin

2024-06-11 00:29:45

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Mon, Jun 10, 2024 at 04:04:30PM -0700, Nicolin Chen wrote:

> > > Actually, even now as we put a dispatcher in VMM, VMM still does
> > > decode the CD table to link ASID to s1_hwpt. Otherwise, it could
> > > only broadcast a TLBI cmd to all pSMMUs.
> >
> > No, there should be no CD table decoding and no linking ASID to
> > anything by the VMM.
> >
> > The ARM architecture is clean, the ASID can remain private to the VM,
> > there is no reason for the VMM to understand it.
>
> But a guest-level TLBI command usually has only ASID available to
> know which pSMMU to dispatch the command. Without an ASID lookup
> table, how could VMM then dispatch a command to the corresponding
> pSMMU?

It can broadcast. The ARM architecture does not expect a N:1 mapping
of SMMUs. This is why I think it is not such a good idea..

Yes the VMM could walk the CD tables too and build up a bitmap of what
ASIDs are being used by what pSMMUs, and that would be fine for the
VMM to do, but I wouldn't necessarily recommend it :)
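
To illustrate the bookkeeping (a made-up toy, not a recommendation):

#include <stdint.h>
#include <stdio.h>

#define NUM_PSMMU 2

/* one bit per ASID (< 64 for the toy), per pSMMU */
static uint64_t asid_used[NUM_PSMMU];

/* VMM traps a CFGI_CD for a device; its vdev_id tells us the pSMMU */
static void trap_cfgi_cd(int psmmu, int asid)
{
        asid_used[psmmu] |= 1ull << asid;
}

/* a per-ASID TLBI then goes only to pSMMUs that saw that ASID */
static void tlbi_nh_asid(int asid)
{
        for (int p = 0; p < NUM_PSMMU; p++)
                if (asid_used[p] & (1ull << asid))
                        printf("TLBI ASID %d -> pSMMU%d\n", asid, p);
}

int main(void)
{
        trap_cfgi_cd(0, 3);     /* asid 3 appears behind pSMMU0 */
        tlbi_nh_asid(3);        /* dispatched to pSMMU0 only */
        tlbi_nh_asid(7);        /* never seen: nothing is sent */
        return 0;
}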

Jason

2024-06-11 00:44:41

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Mon, Jun 10, 2024 at 09:28:39PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 10, 2024 at 04:04:30PM -0700, Nicolin Chen wrote:
>
> > > > Actually, even now as we put a dispatcher in VMM, VMM still does
> > > > decode the CD table to link ASID to s1_hwpt. Otherwise, it could
> > > > only broadcast a TLBI cmd to all pSMMUs.
> > >
> > > No, there should be no CD table decoding and no linking ASID to
> > > anything by the VMM.
> > >
> > > The ARM architecture is clean, the ASID can remain private to the VM,
> > > there is no reason for the VMM to understand it.
> >
> > But a guest-level TLBI command usually has only ASID available to
> > know which pSMMU to dispatch the command. Without an ASID lookup
> > table, how could VMM then dispatch a command to the corresponding
> > pSMMU?
>
> It can broadcast. The ARM architecture does not expect a N:1 mapping
> of SMMUs. This is why I think it is not such a good idea..

Hmm, I thought we had agreed that we shouldn't broadcast
a TLBI (except global NH_ALL/VAA) for invalidation performance?

> Yes the VMM could walk the CD tables too and build up a bitmap of what
> ASIDs are being used by what pSMMUs, and that would be fine for the
> VMM to do, but I wouldn't necessarily recommend it :)

CD table walkthrough would be always done only by VMM, while the
lookup table could be created/maintained by the kernel. I feel a
vasid table could make sense since we maintain the vdev_id table
in the kernel space too.

Anyway, it is still an implementation choice, as Kevin remarked.

Thanks
Nicolin

2024-06-11 12:18:34

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Mon, Jun 10, 2024 at 05:44:16PM -0700, Nicolin Chen wrote:
> On Mon, Jun 10, 2024 at 09:28:39PM -0300, Jason Gunthorpe wrote:
> > On Mon, Jun 10, 2024 at 04:04:30PM -0700, Nicolin Chen wrote:
> >
> > > > > Actually, even now as we put a dispatcher in VMM, VMM still does
> > > > > decode the CD table to link ASID to s1_hwpt. Otherwise, it could
> > > > > only broadcast a TLBI cmd to all pSMMUs.
> > > >
> > > > No, there should be no CD table decoding and no linking ASID to
> > > > anything by the VMM.
> > > >
> > > > The ARM architecture is clean, the ASID can remain private to the VM,
> > > > there is no reason for the VMM to understand it.
> > >
> > > But a guest-level TLBI command usually has only ASID available to
> > > know which pSMMU to dispatch the command. Without an ASID lookup
> > > table, how could VMM then dispatch a command to the corresponding
> > > pSMMU?
> >
> > It can broadcast. The ARM architecture does not expect a N:1 mapping
> > of SMMUs. This is why I think it is not such a good idea..
>
> Hmm, I thought we had agreed that we shouldn't broadcast
> a TLBI (except global NH_ALL/VAA) for invalidation performance?

I wouldn't say agree, there are just lots of different trade offs to
be made here. You can reduce broadcast by parsing the CD table from
the VMM. You can reduce broadcast with multiple vSMMUs.

VMM needs to pick a design. I favour multiple vSMMUs.

> CD table walkthrough would be always done only by VMM, while the
> lookup table could be created/maintained by the kernel. I feel a
> vasid table could make sense since we maintain the vdev_id table
> in the kernel space too.

I'm not convinced we should put such a micro optimization in the
kernel. If the VMM cares about performance then split the vSMMU,
otherwise let's just put all the mess in the VMM and give it the tools
to manage the invalidation distribution.

In the long run things like vCMDQ are going to force the choice to
multi-smmu, which is why I don't see too much value with investing in
optimizing the single vSMMU case. The optimization can be done later
if someone does have a use case.

Jason

2024-06-11 19:22:15

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Tue, Jun 11, 2024 at 09:17:56AM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 10, 2024 at 05:44:16PM -0700, Nicolin Chen wrote:
> > On Mon, Jun 10, 2024 at 09:28:39PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jun 10, 2024 at 04:04:30PM -0700, Nicolin Chen wrote:
> > >
> > > > > > Actually, even now as we put a dispatcher in VMM, VMM still does
> > > > > > decode the CD table to link ASID to s1_hwpt. Otherwise, it could
> > > > > > only broadcast a TLBI cmd to all pSMMUs.
> > > > >
> > > > > No, there should be no CD table decoding and no linking ASID to
> > > > > anything by the VMM.
> > > > >
> > > > > The ARM architecture is clean, the ASID can remain private to the VM,
> > > > > there is no reason for the VMM to understand it.
> > > >
> > > > But a guest-level TLBI command usually has only ASID available to
> > > > know which pSMMU to dispatch the command. Without an ASID lookup
> > > > table, how could VMM then dispatch a command to the corresponding
> > > > pSMMU?
> > >
> > > It can broadcast. The ARM architecture does not expect a N:1 mapping
> > > of SMMUs. This is why I think it is not such a good idea..
> >
> > Hmm, I thought we had agreed that we shouldn't broadcast
> > a TLBI (except global NH_ALL/VAA) for invalidation performance?
>
> I wouldn't say agree, there are just lots of different trade offs to
> be made here. You can reduce broadcast by parsing the CD table from
> the VMM. You can reduce broadcast with multiple vSMMUs.
>
> VMM needs to pick a design. I favour multiple vSMMUs.

Yea, having multiple vSMMUs for nesting too seems to be the cleaner
design. The thing is that we have to put a certain amount of
complexity in the VMM either way, and it should be more efficient to
have it at the boot stage (creating multiple vSMMUs/PCIs and IORT
nodes) vs. at runtime (trapping and distributing every command).
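
For instance (an illustrative layout only, not a concrete QEMU
configuration):

  vSMMU0 -- vPCI root 0 -- passthrough devices behind pSMMU0
  vSMMU1 -- vPCI root 1 -- passthrough devices behind pSMMU1

with the IORT mapping each vPCI root to its own vSMMU node, so every
guest queue drains to exactly one pSMMU.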

> > CD table walkthrough would be always done only by VMM, while the
> > lookup table could be created/maintained by the kernel. I feel a
> > vasid table could make sense since we maintain the vdev_id table
> > in the kernel space too.
>
> I'm not convinced we should put such a micro optimization in the
> kernel. If the VMM cares about performance then split the vSMMU,
> otherwise let's just put all the mess in the VMM and give it the tools
> to manage the invalidation distribution.
>
> In the long run things like vCMDQ are going to force the choice to
> multi-smmu, which is why I don't see too much value with investing in
> optimizing the single vSMMU case. The optimization can be done later
> if someone does have a use case.

I see. That makes sense to me.

Nicolin