2023-03-09 10:55:05

by Nicolin Chen

Subject: [PATCH v1 00/14] Add Nested Translation Support for SMMUv3

Hi all,

This series of patches adds nested translation support for ARM SMMUv3.

Eric Auger made a huge effort previously with the VFIO uAPIs, and sent
his v16 a year ago. Now, the nested translation should follow the new
IOMMUFD uAPIs design. So, most of the key features are ported from the
previous VFIO solution, and then rebuilt on top of the IOMMUFD nesting
infrastructure.

This series is rebased on top of the Intel VT-d nesting changes, so as
to reduce merge conflicts at the uapi header updates.

The essential parts in the driver to support a nested translation are
->hw_info, ->domain_alloc_user and ->invalidate_cache_user ops. So this
series fundamentally adds these three functions in the SMMUv3 driver,
along with several preparations and cleanups for them.

One unique requirement for SMMUv3 nested translation support is the MSI
doorbell address translation, which is a 2-stage translation too. And,
to work with the ITS driver, an msi_cookie needs to be set up on the
kernel-managed domain, the stage-2 domain of the nesting setup. And the
same msi_cookie will be fetched, via iommu_get_domain_for_dev(), in the
iommu core to allocate and create IOVA mappings for the MSI doorbell
page(s). However, with the nesting design, the device is attached to a
user-managed domain, the stage-1 domain. So both the setup and fetching
of the msi_cookie would not work at the level of stage-2 domain. Thus,
on both sides, the msi_cookie setup and fetching require a redirection
of the domain pointer. It's easy to do so in iommufd core, but needs a
new op in the iommu core and driver.
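
The redirection described above can be sketched in plain C. This is only a
minimal illustration with mock types: struct mock_domain and the mock_*()
helpers are hypothetical stand-ins, not the kernel's iommu_domain or any API
added by this series. It shows the setup side and the fetch side resolving to
the same kernel-managed (stage-2) domain even though the device is attached
to the stage-1 domain.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a domain; not the real kernel definition. */
struct mock_domain {
	int has_msi_cookie;         /* set once the MSI window is configured */
	struct mock_domain *parent; /* stage-2 domain of a nested stage-1, else NULL */
};

/* Hypothetical helper: resolve the kernel-managed domain owning the cookie. */
static struct mock_domain *mock_get_unmanaged_domain(struct mock_domain *attached)
{
	if (!attached)
		return NULL;
	return attached->parent ? attached->parent : attached;
}

/* Setup side: install the cookie on the kernel-managed domain. */
static void mock_setup_msi(struct mock_domain *attached)
{
	struct mock_domain *d = mock_get_unmanaged_domain(attached);

	if (d)
		d->has_msi_cookie = 1;
}

/* Fetch side: look the cookie up through the same redirection. */
static int mock_msi_cookie_present(struct mock_domain *attached)
{
	struct mock_domain *d = mock_get_unmanaged_domain(attached);

	return d && d->has_msi_cookie;
}
```

With a stage-1 domain whose parent is a stage-2 domain, mock_setup_msi() on
the stage-1 pointer marks only the stage-2 domain, and the fetch side finds
it there.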

You can also find this series on GitHub:
https://github.com/nicolinc/iommufd/commits/iommufd_nesting

The kernel branch is tested with this QEMU branch:
https://github.com/nicolinc/qemu/commits/wip/iommufd_rfcv3+nesting+smmuv3

Thanks!
Nicolin Chen

Eric Auger (2):
iommu/arm-smmu-v3: Unset corresponding STE fields when s2_cfg is NULL
iommu/arm-smmu-v3: Add STRTAB_STE_0_CFG_NESTED for 2-stage translation

Nicolin Chen (12):
iommu: Add iommu_get_unmanaged_domain helper
iommufd: Add nesting related data structures for ARM SMMUv3
iommufd/device: Setup MSI on kernel-managed domains
iommu/arm-smmu-v3: Add arm_smmu_hw_info
iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Prepare for nested domain support
iommu/arm-smmu-v3: Implement arm_smmu_get_unmanaged_domain
iommu/arm-smmu-v3: Pass in user_cfg to arm_smmu_domain_finalise
iommu/arm-smmu-v3: Add arm_smmu_domain_alloc_user
iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED type of allocations
iommu/arm-smmu-v3: Add CMDQ_OP_TLBI_NH_VAA and CMDQ_OP_TLBI_NH_ALL
iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 267 ++++++++++++++++----
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 7 +-
drivers/iommu/dma-iommu.c | 5 +-
drivers/iommu/iommu-priv.h | 15 ++
drivers/iommu/iommufd/device.c | 5 +-
drivers/iommu/iommufd/hw_pagetable.c | 4 +
drivers/iommu/iommufd/main.c | 1 +
include/linux/iommu.h | 2 +
include/uapi/linux/iommufd.h | 64 +++++
9 files changed, 323 insertions(+), 47 deletions(-)

--
2.39.2



2023-03-09 10:55:11

by Nicolin Chen

Subject: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

Add the following data structures for corresponding ioctls:
iommu_hwpt_arm_smmuv3 => IOMMUFD_CMD_HWPT_ALLOC
iommu_hwpt_invalidate_arm_smmuv3 => IOMMUFD_CMD_HWPT_INVALIDATE

Also, add IOMMU_HW_INFO_TYPE_ARM_SMMUV3 and IOMMU_PGTBL_TYPE_ARM_SMMUV3_S1
to the header and corresponding type/size arrays.
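
As a rough cross-check of the two layouts added below, here is a userspace
mirror using fixed-width types in place of __u8/__u32/__u64. The *_mirror
names and mock_layouts_have_no_padding() are made up for this sketch, and the
range member assumes struct iommu_iova_range is a pair of 64-bit start/last
fields. Under those assumptions both structs pack without implicit padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of the proposed struct iommu_hwpt_arm_smmuv3 (illustrative only). */
struct hwpt_arm_smmuv3_mirror {
	uint64_t flags;
	uint32_t s2vmid;
	uint32_t reserved;
	uint64_t s1ctxptr;
	uint64_t s1cdmax;
	uint64_t s1fmt;
	uint64_t s1dss;
};

/* Assumed shape of struct iommu_iova_range: two 64-bit fields. */
struct mock_iova_range {
	uint64_t start;
	uint64_t last;
};

/* Mirror of the proposed struct iommu_hwpt_invalidate_arm_smmuv3. */
struct hwpt_invalidate_arm_smmuv3_mirror {
	uint64_t flags;
	uint8_t opcode;
	uint8_t padding[3];
	uint32_t asid;
	uint32_t ssid;
	uint32_t granule_size;
	struct mock_iova_range range;
};

/* Returns 1 when each size equals the sum of its member sizes. */
static int mock_layouts_have_no_padding(void)
{
	return sizeof(struct hwpt_arm_smmuv3_mirror) == 48 &&
	       sizeof(struct hwpt_invalidate_arm_smmuv3_mirror) == 40;
}
```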

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/hw_pagetable.c | 4 +++
drivers/iommu/iommufd/main.c | 1 +
include/uapi/linux/iommufd.h | 50 ++++++++++++++++++++++++++++
3 files changed, 55 insertions(+)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 8f9985bddeeb..5e798b2f9a3a 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -173,6 +173,7 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
static const size_t iommufd_hwpt_alloc_data_size[] = {
[IOMMU_HWPT_TYPE_DEFAULT] = 0,
[IOMMU_HWPT_TYPE_VTD_S1] = sizeof(struct iommu_hwpt_intel_vtd),
+ [IOMMU_HWPT_TYPE_ARM_SMMUV3] = sizeof(struct iommu_hwpt_arm_smmuv3),
};

/*
@@ -183,6 +184,8 @@ const u64 iommufd_hwpt_type_bitmaps[] = {
[IOMMU_HW_INFO_TYPE_DEFAULT] = BIT_ULL(IOMMU_HWPT_TYPE_DEFAULT),
[IOMMU_HW_INFO_TYPE_INTEL_VTD] = BIT_ULL(IOMMU_HWPT_TYPE_DEFAULT) |
BIT_ULL(IOMMU_HWPT_TYPE_VTD_S1),
+ [IOMMU_HW_INFO_TYPE_ARM_SMMUV3] = BIT_ULL(IOMMU_HWPT_TYPE_DEFAULT) |
+ BIT_ULL(IOMMU_HWPT_TYPE_ARM_SMMUV3),
};

/* Return true if type is supported, otherwise false */
@@ -329,6 +332,7 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
*/
static const size_t iommufd_hwpt_invalidate_info_size[] = {
[IOMMU_HWPT_TYPE_VTD_S1] = sizeof(struct iommu_hwpt_invalidate_intel_vtd),
+ [IOMMU_HWPT_TYPE_ARM_SMMUV3] = sizeof(struct iommu_hwpt_invalidate_arm_smmuv3),
};

int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd)
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 514db4c26927..0b0097af7c86 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -280,6 +280,7 @@ union ucmd_buffer {
* path.
*/
struct iommu_hwpt_invalidate_intel_vtd vtd;
+ struct iommu_hwpt_invalidate_arm_smmuv3 smmuv3;
};

struct iommufd_ioctl_op {
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2a6c326391b2..0d5551b1b2be 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -352,10 +352,13 @@ struct iommu_vfio_ioas {
* enum iommu_hwpt_type - IOMMU HWPT Type
* @IOMMU_HWPT_TYPE_DEFAULT: default
* @IOMMU_HWPT_TYPE_VTD_S1: Intel VT-d stage-1 page table
+ * @IOMMU_HWPT_TYPE_ARM_SMMUV3: ARM SMMUv3 stage-1 Context Descriptor
+ * table
*/
enum iommu_hwpt_type {
IOMMU_HWPT_TYPE_DEFAULT,
IOMMU_HWPT_TYPE_VTD_S1,
+ IOMMU_HWPT_TYPE_ARM_SMMUV3,
};

/**
@@ -411,6 +414,28 @@ struct iommu_hwpt_intel_vtd {
__u32 __reserved;
};

+/**
+ * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 specific page table data
+ *
+ * @flags: page table entry attributes
+ * @s2vmid: Virtual machine identifier
+ * @s1ctxptr: Stage-1 context descriptor pointer
+ * @s1cdmax: Number of CDs pointed to by s1ContextPtr
+ * @s1fmt: Stage-1 Format
+ * @s1dss: Default substream
+ */
+struct iommu_hwpt_arm_smmuv3 {
+#define IOMMU_SMMUV3_FLAG_S2 (1 << 0) /* if unset, stage-1 */
+#define IOMMU_SMMUV3_FLAG_VMID (1 << 1) /* vmid override */
+ __u64 flags;
+ __u32 s2vmid;
+ __u32 __reserved;
+ __u64 s1ctxptr;
+ __u64 s1cdmax;
+ __u64 s1fmt;
+ __u64 s1dss;
+};
+
/**
* struct iommu_hwpt_alloc - ioctl(IOMMU_HWPT_ALLOC)
* @size: sizeof(struct iommu_hwpt_alloc)
@@ -446,6 +471,8 @@ struct iommu_hwpt_intel_vtd {
* +------------------------------+-------------------------------------+-----------+
* | IOMMU_HWPT_TYPE_VTD_S1 | struct iommu_hwpt_intel_vtd | HWPT |
* +------------------------------+-------------------------------------+-----------+
+ * | IOMMU_HWPT_TYPE_ARM_SMMUV3 | struct iommu_hwpt_arm_smmuv3 | IOAS/HWPT |
+ * +------------------------------+-------------------------------------+-----------+
*/
struct iommu_hwpt_alloc {
__u32 size;
@@ -463,10 +490,12 @@ struct iommu_hwpt_alloc {
/**
* enum iommu_hw_info_type - IOMMU Hardware Info Types
* @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
+ * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
*/
enum iommu_hw_info_type {
IOMMU_HW_INFO_TYPE_DEFAULT,
IOMMU_HW_INFO_TYPE_INTEL_VTD,
+ IOMMU_HW_INFO_TYPE_ARM_SMMUV3,
};

/**
@@ -591,6 +620,25 @@ struct iommu_hwpt_invalidate_intel_vtd {
__u64 nb_granules;
};

+/**
+ * struct iommu_hwpt_invalidate_arm_smmuv3 - ARM SMMUv3 cache invalidation info
+ * @flags: boolean attributes of cache invalidation command
+ * @opcode: opcode of cache invalidation command
+ * @ssid: SubStream ID
+ * @granule_size: page/block size of the mapping in bytes
+ * @range: IOVA range to invalidate
+ */
+struct iommu_hwpt_invalidate_arm_smmuv3 {
+#define IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF (1 << 0)
+ __u64 flags;
+ __u8 opcode;
+ __u8 padding[3];
+ __u32 asid;
+ __u32 ssid;
+ __u32 granule_size;
+ struct iommu_iova_range range;
+};
+
/**
* struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
* @size: sizeof(struct iommu_hwpt_invalidate)
@@ -609,6 +657,8 @@ struct iommu_hwpt_invalidate_intel_vtd {
* +------------------------------+----------------------------------------+
* | IOMMU_HWPT_TYPE_VTD_S1 | struct iommu_hwpt_invalidate_intel_vtd |
* +------------------------------+----------------------------------------+
+ * | IOMMU_HWPT_TYPE_ARM_SMMUV3 | struct iommu_hwpt_invalidate_arm_smmuv3|
+ * +------------------------------+----------------------------------------+
*/
struct iommu_hwpt_invalidate {
__u32 size;
--
2.39.2


2023-03-09 10:55:18

by Nicolin Chen

Subject: [PATCH v1 03/14] iommufd/device: Setup MSI on kernel-managed domains

IOMMU_RESV_SW_MSI only applies to kernel-managed domains, so it should be
set up only on a kernel-managed domain. If the domain being attached is a
user-managed domain, redirect the hwpt to hwpt->parent to do it correctly.

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/iommufd/device.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index f95b558f5e95..a3e7d2889164 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -350,7 +350,8 @@ static int iommufd_group_setup_msi(struct iommufd_group *igroup,
* call iommu_get_msi_cookie() on its behalf. This is necessary to setup
* the MSI window so iommu_dma_prepare_msi() can install pages into our
* domain after request_irq(). If it is not done interrupts will not
- * work on this domain.
+ * work on this domain. And the msi_cookie should be always set into the
+ * kernel-managed (parent) domain.
*
* FIXME: This is conceptually broken for iommufd since we want to allow
* userspace to change the domains, eg switch from an identity IOAS to a
@@ -358,6 +359,8 @@ static int iommufd_group_setup_msi(struct iommufd_group *igroup,
* matches what the IRQ layer actually expects in a newly created
* domain.
*/
+ if (hwpt->parent)
+ hwpt = hwpt->parent;
if (sw_msi_start != PHYS_ADDR_MAX && !hwpt->msi_cookie) {
rc = iommu_get_msi_cookie(hwpt->domain, sw_msi_start);
if (rc)
--
2.39.2


2023-03-09 10:55:20

by Nicolin Chen

Subject: [PATCH v1 05/14] iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED

IOMMUFD designs two iommu_domain pointers to represent two stages. The S1
iommu_domain (IOMMU_DOMAIN_NESTED type) represents the Context Descriptor
table in the user space. The S2 iommu_domain (IOMMU_DOMAIN_UNMANAGED type)
represents the translation table in the kernel, owned by a hypervisor.

So ARM_SMMU_DOMAIN_NESTED no longer has a use case. Drop it, and use the
type IOMMU_DOMAIN_NESTED instead.

Also drop the unused arm_smmu_enable_nesting(). A following patch will
configure the correct smmu_domain->stage.

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 ------------------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 -
2 files changed, 19 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index c1aac695ae0d..c5616145e2a3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1279,7 +1279,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
s1_cfg = &smmu_domain->s1_cfg;
break;
case ARM_SMMU_DOMAIN_S2:
- case ARM_SMMU_DOMAIN_NESTED:
s2_cfg = &smmu_domain->s2_cfg;
break;
default:
@@ -2220,7 +2219,6 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
fmt = ARM_64_LPAE_S1;
finalise_stage_fn = arm_smmu_domain_finalise_s1;
break;
- case ARM_SMMU_DOMAIN_NESTED:
case ARM_SMMU_DOMAIN_S2:
ias = smmu->ias;
oas = smmu->oas;
@@ -2747,21 +2745,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
return group;
}

-static int arm_smmu_enable_nesting(struct iommu_domain *domain)
-{
- struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
- int ret = 0;
-
- mutex_lock(&smmu_domain->init_mutex);
- if (smmu_domain->smmu)
- ret = -EPERM;
- else
- smmu_domain->stage = ARM_SMMU_DOMAIN_NESTED;
- mutex_unlock(&smmu_domain->init_mutex);
-
- return ret;
-}
-
static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
{
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2890,7 +2873,6 @@ static struct iommu_ops arm_smmu_ops = {
.flush_iotlb_all = arm_smmu_flush_iotlb_all,
.iotlb_sync = arm_smmu_iotlb_sync,
.iova_to_phys = arm_smmu_iova_to_phys,
- .enable_nesting = arm_smmu_enable_nesting,
.free = arm_smmu_domain_free,
}
};
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index ba2b4562f4b2..233bfc377267 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -704,7 +704,6 @@ struct arm_smmu_master {
enum arm_smmu_domain_stage {
ARM_SMMU_DOMAIN_S1 = 0,
ARM_SMMU_DOMAIN_S2,
- ARM_SMMU_DOMAIN_NESTED,
ARM_SMMU_DOMAIN_BYPASS,
};

--
2.39.2


2023-03-09 10:55:27

by Nicolin Chen

Subject: [PATCH v1 04/14] iommu/arm-smmu-v3: Add arm_smmu_hw_info

This is used to forward the host IDR values to the user space, so the
hypervisor and the guest VM can learn about the underlying hardware's
capabilities.
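
The register walk can be sketched outside the kernel as follows. This is a
mock only: mock_readl() stands in for readl_relaxed(), the register block is
an ordinary byte buffer rather than MMIO, and struct mock_hw_info merely
mirrors the shape of the info struct added by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_IDR_COUNT 6 /* IDR0..IDR5, matching the loop in this patch */

struct mock_hw_info {
	uint32_t flags;
	uint32_t reserved;
	uint32_t idr[MOCK_IDR_COUNT];
};

/* Stand-in for readl_relaxed(): a plain 32-bit load at a byte offset. */
static uint32_t mock_readl(const uint8_t *base, size_t off)
{
	uint32_t v;

	memcpy(&v, base + off, sizeof(v));
	return v;
}

/* Copy the six consecutive 32-bit ID registers into the info struct. */
static void mock_fill_hw_info(const uint8_t *regs, struct mock_hw_info *info)
{
	assert(regs && info);
	memset(info, 0, sizeof(*info));
	for (int i = 0; i < MOCK_IDR_COUNT; i++)
		info->idr[i] = mock_readl(regs, 0x4 * (size_t)i);
}
```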

Also, set the driver_type to IOMMU_HW_INFO_TYPE_ARM_SMMUV3 to pass the
corresponding type sanity check in the core.

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 25 +++++++++++++++++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 ++
include/uapi/linux/iommufd.h | 14 ++++++++++++
3 files changed, 41 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f2425b0f0cd6..c1aac695ae0d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2005,6 +2005,29 @@ static bool arm_smmu_capable(struct device *dev, enum iommu_cap cap)
}
}

+static void *arm_smmu_hw_info(struct device *dev, u32 *length)
+{
+ struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+ struct iommu_hw_info_smmuv3 *info;
+ void *base_idr;
+ int i;
+
+ if (!master || !master->smmu)
+ return ERR_PTR(-ENODEV);
+
+ info = kzalloc(sizeof(*info), GFP_KERNEL);
+ if (!info)
+ return ERR_PTR(-ENOMEM);
+
+ base_idr = master->smmu->base + ARM_SMMU_IDR0;
+ for (i = 0; i <= 5; i++)
+ info->idr[i] = readl_relaxed(base_idr + 0x4 * i);
+
+ *length = sizeof(*info);
+
+ return info;
+}
+
static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
{
struct arm_smmu_domain *smmu_domain;
@@ -2845,6 +2868,7 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)

static struct iommu_ops arm_smmu_ops = {
.capable = arm_smmu_capable,
+ .hw_info = arm_smmu_hw_info,
.domain_alloc = arm_smmu_domain_alloc,
.probe_device = arm_smmu_probe_device,
.release_device = arm_smmu_release_device,
@@ -2857,6 +2881,7 @@ static struct iommu_ops arm_smmu_ops = {
.page_response = arm_smmu_page_response,
.def_domain_type = arm_smmu_def_domain_type,
.pgsize_bitmap = -1UL, /* Restricted during device attach */
+ .driver_type = IOMMU_HW_INFO_TYPE_ARM_SMMUV3,
.owner = THIS_MODULE,
.default_domain_ops = &(const struct iommu_domain_ops) {
.attach_dev = arm_smmu_attach_dev,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 8d772ea8a583..ba2b4562f4b2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -14,6 +14,8 @@
#include <linux/mmzone.h>
#include <linux/sizes.h>

+#include <uapi/linux/iommufd.h>
+
/* MMIO registers */
#define ARM_SMMU_IDR0 0x0
#define IDR0_ST_LVL GENMASK(28, 27)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 0d5551b1b2be..c7a37915b49c 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -519,6 +519,20 @@ struct iommu_hw_info_vtd {
__aligned_u64 ecap_reg;
};

+/**
+ * struct iommu_hw_info_smmuv3 - ARM SMMUv3 device info
+ *
+ * @flags: Must be set to 0
+ * @__reserved: Must be 0
+ * @idr: Implemented features for the SMMU Non-secure programming interface.
+ * Please refer to the chapters from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
+ */
+struct iommu_hw_info_smmuv3 {
+ __u32 flags;
+ __u32 __reserved;
+ __u32 idr[6];
+};
+
/**
* struct iommu_hw_info - ioctl(IOMMU_DEVICE_GET_HW_INFO)
* @size: sizeof(struct iommu_hw_info)
--
2.39.2


2023-03-09 10:55:31

by Nicolin Chen

Subject: [PATCH v1 08/14] iommu/arm-smmu-v3: Prepare for nested domain support

In a nested translation setup, the device is attached to a stage-1 domain
that represents the guest-level Context Descriptor table. A Stream Table
Entry for a 2-stage translation needs both the stage-1 Context Descriptor
table info and the stage-2 Translation table information, i.e. a pair of
s1_cfg and s2_cfg.

Add an "s2" pointer in struct arm_smmu_domain, so a nested stage-1 domain
can simply navigate its stage-2 domain for the s2_cfg pointer. Also, add
a to_s2_cfg() helper for this purpose, and use it at proper places.
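
A usage-oriented sketch of the lookup, using mock types rather than the
driver's structures (every mock_* name here is hypothetical). The point it
tries to illustrate is that a standalone stage-1 domain yields no stage-2
config, so any caller that needs a VMID has to handle a NULL result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum mock_stage { MOCK_S1, MOCK_S2, MOCK_BYPASS };

struct mock_s2_cfg {
	uint16_t vmid;
};

struct mock_smmu_domain {
	enum mock_stage stage;
	struct mock_s2_cfg s2_cfg;   /* valid only for a stage-2 domain */
	struct mock_smmu_domain *s2; /* set only on a nested stage-1 domain */
};

/* Same decision table as to_s2_cfg(), on the mock types. */
static struct mock_s2_cfg *mock_to_s2_cfg(struct mock_smmu_domain *d)
{
	if (!d)
		return NULL;

	switch (d->stage) {
	case MOCK_S1:
		return d->s2 ? &d->s2->s2_cfg : NULL;
	case MOCK_S2:
		return &d->s2_cfg;
	default:
		return NULL;
	}
}

/* Returns 0 and fills *vmid, or -1 when the domain has no stage-2 config. */
static int mock_domain_vmid(struct mock_smmu_domain *d, uint16_t *vmid)
{
	struct mock_s2_cfg *cfg = mock_to_s2_cfg(d);

	if (!cfg)
		return -1;
	*vmid = cfg->vmid;
	return 0;
}
```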

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 25 +++++++++++++++++++--
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 21d819979865..fee5977feef3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -100,6 +100,24 @@ static void parse_driver_options(struct arm_smmu_device *smmu)
} while (arm_smmu_options[++i].opt);
}

+static struct arm_smmu_s2_cfg *to_s2_cfg(struct arm_smmu_domain *smmu_domain)
+{
+ if (!smmu_domain)
+ return NULL;
+
+ switch (smmu_domain->stage) {
+ case ARM_SMMU_DOMAIN_S1:
+ if (smmu_domain->s2)
+ return &smmu_domain->s2->s2_cfg;
+ return NULL;
+ case ARM_SMMU_DOMAIN_S2:
+ return &smmu_domain->s2_cfg;
+ case ARM_SMMU_DOMAIN_BYPASS:
+ default:
+ return NULL;
+ }
+}
+
/* Low-level queue manipulation functions */
static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
{
@@ -1277,6 +1295,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
switch (smmu_domain->stage) {
case ARM_SMMU_DOMAIN_S1:
s1_cfg = &smmu_domain->s1_cfg;
+ s2_cfg = to_s2_cfg(smmu_domain);
break;
case ARM_SMMU_DOMAIN_S2:
s2_cfg = &smmu_domain->s2_cfg;
@@ -1846,6 +1865,7 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid,
static void arm_smmu_tlb_inv_context(void *cookie)
{
struct arm_smmu_domain *smmu_domain = cookie;
+ struct arm_smmu_s2_cfg *s2_cfg = to_s2_cfg(smmu_domain);
struct arm_smmu_device *smmu = smmu_domain->smmu;
struct arm_smmu_cmdq_ent cmd;

@@ -1860,7 +1880,7 @@ static void arm_smmu_tlb_inv_context(void *cookie)
arm_smmu_tlb_inv_asid(smmu, smmu_domain->s1_cfg.cd.asid);
} else {
cmd.opcode = CMDQ_OP_TLBI_S12_VMALL;
- cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid;
+ cmd.tlbi.vmid = s2_cfg->vmid;
arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
}
arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
@@ -1931,6 +1951,7 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
size_t granule, bool leaf,
struct arm_smmu_domain *smmu_domain)
{
+ struct arm_smmu_s2_cfg *s2_cfg = to_s2_cfg(smmu_domain);
struct arm_smmu_cmdq_ent cmd = {
.tlbi = {
.leaf = leaf,
@@ -1943,7 +1964,7 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
cmd.tlbi.asid = smmu_domain->s1_cfg.cd.asid;
} else {
cmd.opcode = CMDQ_OP_TLBI_S2_IPA;
- cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid;
+ cmd.tlbi.vmid = s2_cfg->vmid;
}
__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 1a93eeb993ea..6cf516852721 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -709,6 +709,7 @@ enum arm_smmu_domain_stage {
};

struct arm_smmu_domain {
+ struct arm_smmu_domain *s2;
struct arm_smmu_device *smmu;
struct mutex init_mutex; /* Protects smmu pointer */

--
2.39.2


2023-03-09 10:55:34

by Nicolin Chen

Subject: [PATCH v1 09/14] iommu/arm-smmu-v3: Implement arm_smmu_get_unmanaged_domain

In a 1-stage translation setup, a device is attached to an iommu_domain
(ARM_SMMU_DOMAIN_S1) that is IOMMU_DOMAIN_UNMANAGED type.

In a 2-stage translation setup, a device is attached to an iommu_domain
(ARM_SMMU_DOMAIN_S1) that is IOMMU_DOMAIN_NESTED type, which must have
a valid "s2" pointer for an iommu_domain (ARM_SMMU_DOMAIN_S2) that is
IOMMU_DOMAIN_UNMANAGED type.

Add a function to return the correct iommu_domain pointer accordingly.

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index fee5977feef3..18ab5d516cf2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2082,6 +2082,17 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
return &smmu_domain->domain;
}

+static struct iommu_domain *arm_smmu_get_unmanaged_domain(struct device *dev)
+{
+ struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+ struct arm_smmu_domain *smmu_domain = master->domain;
+
+ if (smmu_domain->s2)
+ return &smmu_domain->s2->domain;
+
+ return &smmu_domain->domain;
+}
+
static int arm_smmu_bitmap_alloc(unsigned long *map, int span)
{
int idx, size = 1 << span;
@@ -2878,6 +2889,7 @@ static struct iommu_ops arm_smmu_ops = {
.capable = arm_smmu_capable,
.hw_info = arm_smmu_hw_info,
.domain_alloc = arm_smmu_domain_alloc,
+ .get_unmanaged_domain = arm_smmu_get_unmanaged_domain,
.probe_device = arm_smmu_probe_device,
.release_device = arm_smmu_release_device,
.device_group = arm_smmu_device_group,
--
2.39.2


2023-03-09 10:55:38

by Nicolin Chen

Subject: [PATCH v1 06/14] iommu/arm-smmu-v3: Unset corresponding STE fields when s2_cfg is NULL

From: Eric Auger <[email protected]>

Although the spec does not seem to mention this, on some implementations,
when the STE configuration switches from an S1+S2 cfg to an S1 only one,
a C_BAD_STE error would happen if dst[3] (S2TTB) is not reset.

Explicitly reset the two higher 64b fields, dst[2] and dst[3], to prevent
that.

Note that this is not a bug at this moment, since a 2-stage translation
setup is not yet enabled, until the following patches add its support.
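
A small mock of the behaviour, assuming an STE modelled as eight 64-bit words.
The field packing below is illustrative only and is not the real STE layout;
mock_write_ste_s2() and struct mock_s2_cfg are hypothetical names. It shows
that a switch from S1+S2 to S1-only leaves no stale stage-2 state behind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_STE_DWORDS 8

struct mock_s2_cfg {
	uint16_t vmid;
	uint64_t vttbr;
};

/* Write the stage-2 words of a mock STE, clearing them when s2 is NULL. */
static void mock_write_ste_s2(uint64_t dst[MOCK_STE_DWORDS],
			      const struct mock_s2_cfg *s2)
{
	assert(dst);

	if (s2) {
		dst[2] = s2->vmid;  /* illustrative packing, not the real layout */
		dst[3] = s2->vttbr;
	} else {
		/* Without this branch the previous S2 values would remain. */
		dst[2] = 0;
		dst[3] = 0;
	}
}
```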

Reported-by: Shameer Kolothum <[email protected]>
Signed-off-by: Eric Auger <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index c5616145e2a3..29e36448d23b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1361,6 +1361,9 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
dst[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);

val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
+ } else {
+ dst[2] = 0;
+ dst[3] = 0;
}

if (master->ats_enabled)
--
2.39.2


2023-03-09 10:55:46

by Nicolin Chen

Subject: [PATCH v1 07/14] iommu/arm-smmu-v3: Add STRTAB_STE_0_CFG_NESTED for 2-stage translation

From: Eric Auger <[email protected]>

The value of the STRTAB_STE_0_CFG field can be 0b111 as the configuration
for a 2-stage translation, meaning that both S1 and S2 are valid. Add it
and mark the ste_live accordingly.
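
For reference, the four non-abort values can be seen as a bitwise composition.
The sketch below reflects my reading of the encoding (bit 2 set for a stream
that is not aborted, bit 0 for stage-1 translation, bit 1 for stage-2
translation); the MOCK_* names and helpers are hypothetical, while the numeric
values match the defines in arm-smmu-v3.h.

```c
#include <assert.h>

/* Same numeric values as the STRTAB_STE_0_CFG_* defines. */
#define MOCK_STE_CFG_BYPASS   4u
#define MOCK_STE_CFG_S1_TRANS 5u
#define MOCK_STE_CFG_S2_TRANS 6u
#define MOCK_STE_CFG_NESTED   7u

/* Compose the 3-bit config value from the enabled stages. */
static unsigned int mock_ste_cfg(int has_s1, int has_s2)
{
	return 4u | (has_s1 ? 1u : 0u) | (has_s2 ? 2u : 0u);
}

/* Mirrors the ste_live switch: any translating configuration is live. */
static int mock_ste_is_live(unsigned int cfg)
{
	assert(cfg <= 7u);
	return cfg == MOCK_STE_CFG_S1_TRANS ||
	       cfg == MOCK_STE_CFG_S2_TRANS ||
	       cfg == MOCK_STE_CFG_NESTED;
}
```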

Signed-off-by: Eric Auger <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 1 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
2 files changed, 2 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 29e36448d23b..21d819979865 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1292,6 +1292,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
break;
case STRTAB_STE_0_CFG_S1_TRANS:
case STRTAB_STE_0_CFG_S2_TRANS:
+ case STRTAB_STE_0_CFG_NESTED:
ste_live = true;
break;
case STRTAB_STE_0_CFG_ABORT:
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 233bfc377267..1a93eeb993ea 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -208,6 +208,7 @@
#define STRTAB_STE_0_CFG_BYPASS 4
#define STRTAB_STE_0_CFG_S1_TRANS 5
#define STRTAB_STE_0_CFG_S2_TRANS 6
+#define STRTAB_STE_0_CFG_NESTED 7

#define STRTAB_STE_0_S1FMT GENMASK_ULL(5, 4)
#define STRTAB_STE_0_S1FMT_LINEAR 0
--
2.39.2


2023-03-09 10:55:49

by Nicolin Chen

Subject: [PATCH v1 11/14] iommu/arm-smmu-v3: Add arm_smmu_domain_alloc_user

The arm_smmu_domain_alloc_user callback function is used for userspace to
allocate iommu_domains, such as standalone stage-1 domain, nested stage-1
domain, and nested stage-2 domain. The input user_data is in the type of
struct iommu_hwpt_arm_smmuv3 that contains the configurations of a nested
stage-1 or a nested stage-2 iommu_domain. A NULL user_data simply opts
for a standalone stage-1 domain allocation.

Add a common helper function, __arm_smmu_domain_alloc(), to support that.

Since ops->domain_alloc_user has a valid dev pointer, the master pointer
is available when calling __arm_smmu_domain_alloc() in this case, meaning
that arm_smmu_domain_finalise() can be done at the allocation stage. This
allows IOMMUFD to initialize the hw_pagetable for the domain.

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 95 ++++++++++++++-------
1 file changed, 65 insertions(+), 30 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 2d29f7320570..5ff74edfbd68 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2053,36 +2053,6 @@ static void *arm_smmu_hw_info(struct device *dev, u32 *length)
return info;
}

-static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
-{
- struct arm_smmu_domain *smmu_domain;
-
- if (type == IOMMU_DOMAIN_SVA)
- return arm_smmu_sva_domain_alloc();
-
- if (type != IOMMU_DOMAIN_UNMANAGED &&
- type != IOMMU_DOMAIN_DMA &&
- type != IOMMU_DOMAIN_DMA_FQ &&
- type != IOMMU_DOMAIN_IDENTITY)
- return NULL;
-
- /*
- * Allocate the domain and initialise some of its data structures.
- * We can't really do anything meaningful until we've added a
- * master.
- */
- smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
- if (!smmu_domain)
- return NULL;
-
- mutex_init(&smmu_domain->init_mutex);
- INIT_LIST_HEAD(&smmu_domain->devices);
- spin_lock_init(&smmu_domain->devices_lock);
- INIT_LIST_HEAD(&smmu_domain->mmu_notifiers);
-
- return &smmu_domain->domain;
-}
-
static struct iommu_domain *arm_smmu_get_unmanaged_domain(struct device *dev)
{
struct arm_smmu_master *master = dev_iommu_priv_get(dev);
@@ -2893,10 +2863,75 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
arm_smmu_sva_remove_dev_pasid(domain, dev, pasid);
}

+static struct iommu_domain *
+__arm_smmu_domain_alloc(unsigned type,
+ struct arm_smmu_domain *s2,
+ struct arm_smmu_master *master,
+ const struct iommu_hwpt_arm_smmuv3 *user_cfg)
+{
+ struct arm_smmu_domain *smmu_domain;
+ struct iommu_domain *domain;
+ int ret = 0;
+
+ if (type == IOMMU_DOMAIN_SVA)
+ return arm_smmu_sva_domain_alloc();
+
+ if (type != IOMMU_DOMAIN_UNMANAGED &&
+ type != IOMMU_DOMAIN_DMA &&
+ type != IOMMU_DOMAIN_DMA_FQ &&
+ type != IOMMU_DOMAIN_IDENTITY)
+ return NULL;
+
+ /*
+ * Allocate the domain and initialise some of its data structures.
+ * We can't really finalise the domain unless a master is given.
+ */
+ smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
+ if (!smmu_domain)
+ return NULL;
+ domain = &smmu_domain->domain;
+
+ domain->type = type;
+ domain->ops = arm_smmu_ops.default_domain_ops;
+
+ mutex_init(&smmu_domain->init_mutex);
+ INIT_LIST_HEAD(&smmu_domain->devices);
+ spin_lock_init(&smmu_domain->devices_lock);
+ INIT_LIST_HEAD(&smmu_domain->mmu_notifiers);
+
+ if (master) {
+ smmu_domain->smmu = master->smmu;
+ ret = arm_smmu_domain_finalise(domain, master, user_cfg);
+ if (ret) {
+ kfree(smmu_domain);
+ return NULL;
+ }
+ }
+
+ return domain;
+}
+
+static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
+{
+ return __arm_smmu_domain_alloc(type, NULL, NULL, NULL);
+}
+
+static struct iommu_domain *
+arm_smmu_domain_alloc_user(struct device *dev, struct iommu_domain *parent,
+ const void *user_data)
+{
+ const struct iommu_hwpt_arm_smmuv3 *user_cfg = user_data;
+ struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+ unsigned type = IOMMU_DOMAIN_UNMANAGED;
+
+ return __arm_smmu_domain_alloc(type, NULL, master, user_cfg);
+}
+
static struct iommu_ops arm_smmu_ops = {
.capable = arm_smmu_capable,
.hw_info = arm_smmu_hw_info,
.domain_alloc = arm_smmu_domain_alloc,
+ .domain_alloc_user = arm_smmu_domain_alloc_user,
.get_unmanaged_domain = arm_smmu_get_unmanaged_domain,
.probe_device = arm_smmu_probe_device,
.release_device = arm_smmu_release_device,
--
2.39.2


2023-03-09 10:55:54

by Nicolin Chen

Subject: [PATCH v1 10/14] iommu/arm-smmu-v3: Pass in user_cfg to arm_smmu_domain_finalise

The struct iommu_hwpt_arm_smmuv3 contains the userspace Stream Table Entry
info (for ARM_SMMU_DOMAIN_S1) and an "S2" flag (for ARM_SMMU_DOMAIN_S2).

Pass in a valid user_cfg pointer, so arm_smmu_domain_finalise() can handle
both types of user domain finalizations.
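
The stage selection this patch ends up with can be sketched as below. The
feature bits, enum and function names are hypothetical stand-ins for
ARM_SMMU_FEAT_TRANS_S1/S2 and smmu_domain->stage, the identity-domain early
return is left out, and the caller is assumed to start from stage-1 (the
zero-initialised default).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_FLAG_S2 (1u << 0) /* mirrors IOMMU_SMMUV3_FLAG_S2 */
#define MOCK_FEAT_S1 (1u << 0) /* hypothetical feature bits */
#define MOCK_FEAT_S2 (1u << 1)

enum mock_stage { MOCK_STAGE_S1 = 0, MOCK_STAGE_S2 };

struct mock_user_cfg {
	uint64_t flags;
};

/* Returns 0 and updates *stage, or -EINVAL if S2 is requested but absent. */
static int mock_pick_stage(uint32_t features, const struct mock_user_cfg *cfg,
			   enum mock_stage *stage)
{
	int want_s2 = cfg && (cfg->flags & MOCK_FLAG_S2);

	assert(stage);

	if (want_s2 && !(features & MOCK_FEAT_S2))
		return -EINVAL;
	if (want_s2)
		*stage = MOCK_STAGE_S2;

	/* Restrict the stage to what the hardware can actually support. */
	if (!(features & MOCK_FEAT_S1))
		*stage = MOCK_STAGE_S2;
	if (!(features & MOCK_FEAT_S2))
		*stage = MOCK_STAGE_S1;

	return 0;
}
```

A NULL cfg corresponds to the arm_smmu_attach_dev() path, which passes NULL
and keeps the pre-existing behaviour.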

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 18ab5d516cf2..2d29f7320570 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -26,6 +26,7 @@
#include <linux/pci.h>
#include <linux/pci-ats.h>
#include <linux/platform_device.h>
+#include <uapi/linux/iommufd.h>

#include "arm-smmu-v3.h"
#include "../../dma-iommu.h"
@@ -2223,7 +2224,8 @@ static int arm_smmu_domain_finalise_s2(struct arm_smmu_domain *smmu_domain,
}

static int arm_smmu_domain_finalise(struct iommu_domain *domain,
- struct arm_smmu_master *master)
+ struct arm_smmu_master *master,
+ const struct iommu_hwpt_arm_smmuv3 *user_cfg)
{
int ret;
unsigned long ias, oas;
@@ -2235,12 +2237,18 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
struct io_pgtable_cfg *);
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
struct arm_smmu_device *smmu = smmu_domain->smmu;
+ bool user_cfg_s2 = user_cfg && (user_cfg->flags & IOMMU_SMMUV3_FLAG_S2);

if (domain->type == IOMMU_DOMAIN_IDENTITY) {
smmu_domain->stage = ARM_SMMU_DOMAIN_BYPASS;
return 0;
}

+ if (user_cfg_s2 && !(smmu->features & ARM_SMMU_FEAT_TRANS_S2))
+ return -EINVAL;
+ if (user_cfg_s2)
+ smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
+
/* Restrict the stage to what we can actually support */
if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1))
smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
@@ -2484,7 +2492,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)

if (!smmu_domain->smmu) {
smmu_domain->smmu = smmu;
- ret = arm_smmu_domain_finalise(domain, master);
+ ret = arm_smmu_domain_finalise(domain, master, NULL);
if (ret) {
smmu_domain->smmu = NULL;
goto out_unlock;
--
2.39.2
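
The stage selection that this patch adds to arm_smmu_domain_finalise() can be sketched outside the kernel as below. This is only a rough model under stated assumptions: FEAT_TRANS_S1, FEAT_TRANS_S2, USER_FLAG_S2, struct user_cfg and select_stage() are hypothetical stand-ins for the driver's feature bits, IOMMU_SMMUV3_FLAG_S2 and struct iommu_hwpt_arm_smmuv3, and the default stage is assumed to be stage 1.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's feature bits and the uAPI flag. */
#define FEAT_TRANS_S1	(1u << 0)
#define FEAT_TRANS_S2	(1u << 1)
#define USER_FLAG_S2	(1ull << 0)

enum stage { STAGE_S1, STAGE_S2 };

/* Simplified model of the user configuration: only the flags matter here. */
struct user_cfg {
	uint64_t flags;
};

/*
 * Pick the translation stage. A NULL cfg models the kernel-internal caller
 * (arm_smmu_attach_dev() passes NULL in the patch).
 */
static int select_stage(uint32_t features, const struct user_cfg *cfg,
			enum stage *stage)
{
	int want_s2 = cfg && (cfg->flags & USER_FLAG_S2);

	assert(stage);

	/* An explicit stage-2 request fails if the hardware lacks stage 2. */
	if (want_s2 && !(features & FEAT_TRANS_S2))
		return -EINVAL;

	*stage = want_s2 ? STAGE_S2 : STAGE_S1;

	/* Restrict the stage to what the hardware can actually support. */
	if (!(features & FEAT_TRANS_S1))
		*stage = STAGE_S2;
	if (!(features & FEAT_TRANS_S2))
		*stage = STAGE_S1;

	return 0;
}
```

With both stages implemented, a NULL cfg yields stage 1 and a cfg carrying the S2 flag yields stage 2; the S2 flag on stage-1-only hardware returns -EINVAL.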


2023-03-09 10:55:58

by Nicolin Chen

Subject: [PATCH v1 12/14] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED type of allocations

Add domain allocation support for IOMMU_DOMAIN_NESTED type. This includes
the "finalise" part to log in the user space Stream Table Entry info.

Co-developed-by: Eric Auger <[email protected]>
Signed-off-by: Eric Auger <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 38 +++++++++++++++++++--
1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5ff74edfbd68..1f318b5e0921 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2214,6 +2214,19 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
return 0;
}

+ if (domain->type == IOMMU_DOMAIN_NESTED) {
+ if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1) ||
+ !(smmu->features & ARM_SMMU_FEAT_TRANS_S2)) {
+ dev_dbg(smmu->dev, "does not implement two stages\n");
+ return -EINVAL;
+ }
+ smmu_domain->stage = ARM_SMMU_DOMAIN_S1;
+ smmu_domain->s1_cfg.s1fmt = user_cfg->s1fmt;
+ smmu_domain->s1_cfg.s1cdmax = user_cfg->s1cdmax;
+ smmu_domain->s1_cfg.cdcfg.cdtab_dma = user_cfg->s1ctxptr;
+ return 0;
+ }
+
if (user_cfg_s2 && !(smmu->features & ARM_SMMU_FEAT_TRANS_S2))
return -EINVAL;
if (user_cfg_s2)
@@ -2863,6 +2876,11 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
arm_smmu_sva_remove_dev_pasid(domain, dev, pasid);
}

+static const struct iommu_domain_ops arm_smmu_nested_domain_ops = {
+ .attach_dev = arm_smmu_attach_dev,
+ .free = arm_smmu_domain_free,
+};
+
static struct iommu_domain *
__arm_smmu_domain_alloc(unsigned type,
struct arm_smmu_domain *s2,
@@ -2877,11 +2895,15 @@ __arm_smmu_domain_alloc(unsigned type,
return arm_smmu_sva_domain_alloc();

if (type != IOMMU_DOMAIN_UNMANAGED &&
+ type != IOMMU_DOMAIN_NESTED &&
type != IOMMU_DOMAIN_DMA &&
type != IOMMU_DOMAIN_DMA_FQ &&
type != IOMMU_DOMAIN_IDENTITY)
return NULL;

+ if (s2 && s2->stage != ARM_SMMU_DOMAIN_S2)
+ return NULL;
+
/*
* Allocate the domain and initialise some of its data structures.
* We can't really finalise the domain unless a master is given.
@@ -2889,10 +2911,14 @@ __arm_smmu_domain_alloc(unsigned type,
smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
if (!smmu_domain)
return NULL;
+ smmu_domain->s2 = s2;
domain = &smmu_domain->domain;

domain->type = type;
- domain->ops = arm_smmu_ops.default_domain_ops;
+ if (s2)
+ domain->ops = &arm_smmu_nested_domain_ops;
+ else
+ domain->ops = arm_smmu_ops.default_domain_ops;

mutex_init(&smmu_domain->init_mutex);
INIT_LIST_HEAD(&smmu_domain->devices);
@@ -2923,8 +2949,16 @@ arm_smmu_domain_alloc_user(struct device *dev, struct iommu_domain *parent,
const struct iommu_hwpt_arm_smmuv3 *user_cfg = user_data;
struct arm_smmu_master *master = dev_iommu_priv_get(dev);
unsigned type = IOMMU_DOMAIN_UNMANAGED;
+ struct arm_smmu_domain *s2 = NULL;
+
+ if (parent) {
+ if (parent->ops != arm_smmu_ops.default_domain_ops)
+ return NULL;
+ type = IOMMU_DOMAIN_NESTED;
+ s2 = to_smmu_domain(parent);
+ }

- return __arm_smmu_domain_alloc(type, NULL, master, user_cfg);
+ return __arm_smmu_domain_alloc(type, s2, master, user_cfg);
}

static struct iommu_ops arm_smmu_ops = {
--
2.39.2


2023-03-09 10:56:19

by Nicolin Chen

Subject: [PATCH v1 13/14] iommu/arm-smmu-v3: Add CMDQ_OP_TLBI_NH_VAA and CMDQ_OP_TLBI_NH_ALL

With a nested translation setup, a stage-1 Context Descriptor table can be
managed by a guest OS in the user space. So, the kernel driver should not
assume that the guest OS will use a user space device driver that doesn't
support TLBI_NH_VAA and TLBI_NH_ALL commands.

Add them in the arm_smmu_cmdq_build_cmd(), to prepare for support of these
two TLBI invalidation requests from the guest level.

Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 4 ++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 ++
2 files changed, 6 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 1f318b5e0921..ac63185ae268 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -277,6 +277,9 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
/* Cover the entire SID range */
cmd[1] |= FIELD_PREP(CMDQ_CFGI_1_RANGE, 31);
break;
+ case CMDQ_OP_TLBI_NH_VAA:
+ ent->tlbi.asid = 0;
+ fallthrough;
case CMDQ_OP_TLBI_NH_VA:
cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
fallthrough;
@@ -301,6 +304,7 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
case CMDQ_OP_TLBI_NH_ASID:
cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
fallthrough;
+ case CMDQ_OP_TLBI_NH_ALL:
case CMDQ_OP_TLBI_S12_VMALL:
cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
break;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 6cf516852721..6181d6cd8b51 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -454,8 +454,10 @@ struct arm_smmu_cmdq_ent {
};
} cfgi;

+ #define CMDQ_OP_TLBI_NH_ALL 0x10
#define CMDQ_OP_TLBI_NH_ASID 0x11
#define CMDQ_OP_TLBI_NH_VA 0x12
+ #define CMDQ_OP_TLBI_NH_VAA 0x13
#define CMDQ_OP_TLBI_EL2_ALL 0x20
#define CMDQ_OP_TLBI_EL2_ASID 0x21
#define CMDQ_OP_TLBI_EL2_VA 0x22
--
2.39.2


2023-03-09 10:56:22

by Nicolin Chen

Subject: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

Add arm_smmu_cache_invalidate_user() function for user space to invalidate
TLB entries and Context Descriptors, since either an IO page table entry
or a Context Descriptor in the user space is still cached by the hardware.

The input user_data is defined in "struct iommu_hwpt_invalidate_arm_smmuv3"
that contains the essential data for corresponding invalidation commands.

Co-developed-by: Eric Auger <[email protected]>
Signed-off-by: Eric Auger <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 56 +++++++++++++++++++++
1 file changed, 56 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index ac63185ae268..7d73eab5e7f4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2880,9 +2880,65 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
arm_smmu_sva_remove_dev_pasid(domain, dev, pasid);
}

+static void arm_smmu_cache_invalidate_user(struct iommu_domain *domain,
+ void *user_data)
+{
+ struct iommu_hwpt_invalidate_arm_smmuv3 *inv_info = user_data;
+ struct arm_smmu_cmdq_ent cmd = { .opcode = inv_info->opcode };
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+ struct arm_smmu_device *smmu = smmu_domain->smmu;
+ size_t granule_size = inv_info->granule_size;
+ unsigned long iova = 0;
+ size_t size = 0;
+ int ssid = 0;
+
+ if (!smmu || !smmu_domain->s2 || domain->type != IOMMU_DOMAIN_NESTED)
+ return;
+
+ switch (inv_info->opcode) {
+ case CMDQ_OP_CFGI_CD:
+ case CMDQ_OP_CFGI_CD_ALL:
+ return arm_smmu_sync_cd(smmu_domain, inv_info->ssid, true);
+ case CMDQ_OP_TLBI_NH_VA:
+ cmd.tlbi.asid = inv_info->asid;
+ fallthrough;
+ case CMDQ_OP_TLBI_NH_VAA:
+ if (!granule_size || !(granule_size & smmu->pgsize_bitmap) ||
+ granule_size & ~(1ULL << __ffs(granule_size)))
+ return;
+
+ iova = inv_info->range.start;
+ size = inv_info->range.last - inv_info->range.start + 1;
+ if (!size)
+ return;
+
+ cmd.tlbi.vmid = smmu_domain->s2->s2_cfg.vmid;
+ cmd.tlbi.leaf = inv_info->flags & IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF;
+ __arm_smmu_tlb_inv_range(&cmd, iova, size, granule_size, smmu_domain);
+ break;
+ case CMDQ_OP_TLBI_NH_ASID:
+ cmd.tlbi.asid = inv_info->asid;
+ fallthrough;
+ case CMDQ_OP_TLBI_NH_ALL:
+ cmd.tlbi.vmid = smmu_domain->s2->s2_cfg.vmid;
+ arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
+ break;
+ case CMDQ_OP_ATC_INV:
+ ssid = inv_info->ssid;
+ iova = inv_info->range.start;
+ size = inv_info->range.last - inv_info->range.start + 1;
+ break;
+ default:
+ return;
+ }
+
+ arm_smmu_atc_inv_domain(smmu_domain, ssid, iova, size);
+}
+
static const struct iommu_domain_ops arm_smmu_nested_domain_ops = {
.attach_dev = arm_smmu_attach_dev,
.free = arm_smmu_domain_free,
+ .cache_invalidate_user = arm_smmu_cache_invalidate_user,
};

static struct iommu_domain *
--
2.39.2
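
The input checks that arm_smmu_cache_invalidate_user() applies before issuing a range invalidation can be modelled in isolation as below. This is a minimal sketch, not the driver code: granule_is_valid() and range_size() are hypothetical helper names, and rejecting an inverted range (last < start) is an extra assumption that the patch itself does not make; the patch only bails out when the computed size is zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if granule is exactly one power of two and that page size is
 * present in the supported page-size bitmap.
 */
static bool granule_is_valid(uint64_t granule, uint64_t pgsize_bitmap)
{
	if (!granule)
		return false;
	if (granule & (granule - 1))	/* more than one bit set */
		return false;
	return (granule & pgsize_bitmap) != 0;
}

/*
 * Compute the size in bytes of the inclusive range [start, last].
 * Returns false if the range is inverted or if the size wraps to zero,
 * which happens when the range covers the whole 64-bit space.
 */
static bool range_size(uint64_t start, uint64_t last, uint64_t *size)
{
	assert(size);
	if (last < start)
		return false;
	*size = last - start + 1;
	return *size != 0;
}
```

The single-bit test `granule & (granule - 1)` is equivalent to the patch's `granule_size & ~(1ULL << __ffs(granule_size))` for non-zero input.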


2023-03-09 13:04:10

by Robin Murphy

Subject: Re: [PATCH v1 04/14] iommu/arm-smmu-v3: Add arm_smmu_hw_info

On 2023-03-09 10:53, Nicolin Chen wrote:
> This is used to forward the host IDR values to the user space, so the
> hypervisor and the guest VM can learn about the underlying hardware's
> capabilities.
>
> Also, set the driver_type to IOMMU_HW_INFO_TYPE_ARM_SMMUV3 to pass the
> corresponding type sanity in the core.
>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 25 +++++++++++++++++++++
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 ++
> include/uapi/linux/iommufd.h | 14 ++++++++++++
> 3 files changed, 41 insertions(+)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index f2425b0f0cd6..c1aac695ae0d 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2005,6 +2005,29 @@ static bool arm_smmu_capable(struct device *dev, enum iommu_cap cap)
> }
> }
>
> +static void *arm_smmu_hw_info(struct device *dev, u32 *length)
> +{
> + struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> + struct iommu_hw_info_smmuv3 *info;
> + void *base_idr;
> + int i;
> +
> + if (!master || !master->smmu)
> + return ERR_PTR(-ENODEV);
> +
> + info = kzalloc(sizeof(*info), GFP_KERNEL);
> + if (!info)
> + return ERR_PTR(-ENOMEM);
> +
> + base_idr = master->smmu->base + ARM_SMMU_IDR0;
> + for (i = 0; i <= 5; i++)
> + info->idr[i] = readl_relaxed(base_idr + 0x4 * i);

You need to take firmware overrides etc. into account here. In
particular, features like BTM may need to be hidden to work around
errata either in the system integration or the SMMU itself. It isn't
reasonable to expect every VMM to be aware of every erratum and
workaround, and there may even be workarounds where we need to go out of
our way to prevent guests from trying to use certain features in order
to maintain correctness at S2.

In general this should probably follow the same principle as KVM, where
we only expose sanitised feature registers representing the
functionality the host understands. Code written today is almost
guaranteed to be running on hardware released in 2030, at least *somewhere*.

Thanks,
Robin.
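
One possible shape for the sanitising step suggested above is sketched here, purely as an illustration. struct idr_policy, sanitise_idr() and idr_policy_hide() are hypothetical names, and plain bit masking is an assumption: multi-bit fields such as address sizes would likely need a "lowest safe value" rule rather than a mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_IDR 6

/*
 * Hypothetical per-register masks of the bits the host driver understands
 * and is willing to expose. Real values would have to come from the
 * driver's feature probing, firmware overrides and errata handling.
 */
struct idr_policy {
	uint32_t allow_mask[NUM_IDR];
};

/* Copy raw ID register values, keeping only the explicitly allowed bits. */
static void sanitise_idr(const uint32_t raw[NUM_IDR],
			 const struct idr_policy *policy,
			 uint32_t out[NUM_IDR])
{
	size_t i;

	assert(raw && policy && out);
	for (i = 0; i < NUM_IDR; i++)
		out[i] = raw[i] & policy->allow_mask[i];
}

/* Hide feature bits in one register, for example to work around an erratum. */
static void idr_policy_hide(struct idr_policy *policy, size_t reg,
			    uint32_t bits)
{
	assert(policy && reg < NUM_IDR);
	policy->allow_mask[reg] &= ~bits;
}
```

The point of the allow-list direction is that bits the host does not know about default to hidden, so newer hardware does not leak unvetted features to guests.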

> +
> + *length = sizeof(*info);
> +
> + return info;
> +}
> +
> static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
> {
> struct arm_smmu_domain *smmu_domain;
> @@ -2845,6 +2868,7 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
>
> static struct iommu_ops arm_smmu_ops = {
> .capable = arm_smmu_capable,
> + .hw_info = arm_smmu_hw_info,
> .domain_alloc = arm_smmu_domain_alloc,
> .probe_device = arm_smmu_probe_device,
> .release_device = arm_smmu_release_device,
> @@ -2857,6 +2881,7 @@ static struct iommu_ops arm_smmu_ops = {
> .page_response = arm_smmu_page_response,
> .def_domain_type = arm_smmu_def_domain_type,
> .pgsize_bitmap = -1UL, /* Restricted during device attach */
> + .driver_type = IOMMU_HW_INFO_TYPE_ARM_SMMUV3,
> .owner = THIS_MODULE,
> .default_domain_ops = &(const struct iommu_domain_ops) {
> .attach_dev = arm_smmu_attach_dev,
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 8d772ea8a583..ba2b4562f4b2 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -14,6 +14,8 @@
> #include <linux/mmzone.h>
> #include <linux/sizes.h>
>
> +#include <uapi/linux/iommufd.h>
> +
> /* MMIO registers */
> #define ARM_SMMU_IDR0 0x0
> #define IDR0_ST_LVL GENMASK(28, 27)
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 0d5551b1b2be..c7a37915b49c 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -519,6 +519,20 @@ struct iommu_hw_info_vtd {
> __aligned_u64 ecap_reg;
> };
>
> +/**
> + * struct iommu_hw_info_smmuv3 - ARM SMMUv3 device info
> + *
> + * @flags: Must be set to 0
> + * @__reserved: Must be 0
> + * @idr: Implemented features for the SMMU Non-secure programming interface.
> + * Please refer to the chapters from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
> + */
> +struct iommu_hw_info_smmuv3 {
> + __u32 flags;
> + __u32 __reserved;
> + __u32 idr[6];
> +};
> +
> /**
> * struct iommu_hw_info - ioctl(IOMMU_DEVICE_GET_HW_INFO)
> * @size: sizeof(struct iommu_hw_info)

2023-03-09 13:13:25

by Robin Murphy

Subject: Re: [PATCH v1 06/14] iommu/arm-smmu-v3: Unset corresponding STE fields when s2_cfg is NULL

On 2023-03-09 10:53, Nicolin Chen wrote:
> From: Eric Auger <[email protected]>
>
> Despite the spec does not seem to mention this, on some implementations,
> when the STE configuration switches from an S1+S2 cfg to an S1 only one,
> a C_BAD_STE error would happen if dst[3] (S2TTB) is not reset.

Can you provide more details, since it's not clear whether this is a
hardware erratum workaround or a bodge around the driver itself doing
something wrong like not doing a proper break-before-make transition of
the STE. The architecture explicitly states that all the STE.S2* fields
except S2VMID and potentially S2S are ignored when Stage 2 is bypassed.

Thanks,
Robin.

> Explicitly reset those two higher 64b fields, to prevent that.
>
> Note that this is not a bug at this moment, since a 2-stage translation
> setup is not yet enabled, until the following patches add its support.
>
> Reported-by: Shameer Kolothum <[email protected]>
> Signed-off-by: Eric Auger <[email protected]>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index c5616145e2a3..29e36448d23b 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1361,6 +1361,9 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> dst[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
>
> val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
> + } else {
> + dst[2] = 0;
> + dst[3] = 0;
> }
>
> if (master->ats_enabled)

2023-03-09 13:20:53

by Robin Murphy

Subject: Re: [PATCH v1 12/14] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED type of allocations

On 2023-03-09 10:53, Nicolin Chen wrote:
> Add domain allocation support for IOMMU_DOMAIN_NESTED type. This includes
> the "finalise" part to log in the user space Stream Table Entry info.
>
> Co-developed-by: Eric Auger <[email protected]>
> Signed-off-by: Eric Auger <[email protected]>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 38 +++++++++++++++++++--
> 1 file changed, 36 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 5ff74edfbd68..1f318b5e0921 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2214,6 +2214,19 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
> return 0;
> }
>
> + if (domain->type == IOMMU_DOMAIN_NESTED) {
> + if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1) ||
> + !(smmu->features & ARM_SMMU_FEAT_TRANS_S2)) {
> + dev_dbg(smmu->dev, "does not implement two stages\n");
> + return -EINVAL;
> + }
> + smmu_domain->stage = ARM_SMMU_DOMAIN_S1;
> + smmu_domain->s1_cfg.s1fmt = user_cfg->s1fmt;
> + smmu_domain->s1_cfg.s1cdmax = user_cfg->s1cdmax;
> + smmu_domain->s1_cfg.cdcfg.cdtab_dma = user_cfg->s1ctxptr;
> + return 0;

How's that going to work? If the caller's asked for something we can't
provide, returning something else and hoping it fails later is not
sensible, we should just fail right here. It's even more worrying if
there's a chance it *won't* fail later, and a guest ends up with
"nested" translation giving it full access to host PA space :/

Thanks,
Robin.

> + }
> +
> if (user_cfg_s2 && !(smmu->features & ARM_SMMU_FEAT_TRANS_S2))
> return -EINVAL;
> if (user_cfg_s2)
> @@ -2863,6 +2876,11 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
> arm_smmu_sva_remove_dev_pasid(domain, dev, pasid);
> }
>
> +static const struct iommu_domain_ops arm_smmu_nested_domain_ops = {
> + .attach_dev = arm_smmu_attach_dev,
> + .free = arm_smmu_domain_free,
> +};
> +
> static struct iommu_domain *
> __arm_smmu_domain_alloc(unsigned type,
> struct arm_smmu_domain *s2,
> @@ -2877,11 +2895,15 @@ __arm_smmu_domain_alloc(unsigned type,
> return arm_smmu_sva_domain_alloc();
>
> if (type != IOMMU_DOMAIN_UNMANAGED &&
> + type != IOMMU_DOMAIN_NESTED &&
> type != IOMMU_DOMAIN_DMA &&
> type != IOMMU_DOMAIN_DMA_FQ &&
> type != IOMMU_DOMAIN_IDENTITY)
> return NULL;
>
> + if (s2 && s2->stage != ARM_SMMU_DOMAIN_S2)
> + return NULL;
> +
> /*
> * Allocate the domain and initialise some of its data structures.
> * We can't really finalise the domain unless a master is given.
> @@ -2889,10 +2911,14 @@ __arm_smmu_domain_alloc(unsigned type,
> smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
> if (!smmu_domain)
> return NULL;
> + smmu_domain->s2 = s2;
> domain = &smmu_domain->domain;
>
> domain->type = type;
> - domain->ops = arm_smmu_ops.default_domain_ops;
> + if (s2)
> + domain->ops = &arm_smmu_nested_domain_ops;
> + else
> + domain->ops = arm_smmu_ops.default_domain_ops;
>
> mutex_init(&smmu_domain->init_mutex);
> INIT_LIST_HEAD(&smmu_domain->devices);
> @@ -2923,8 +2949,16 @@ arm_smmu_domain_alloc_user(struct device *dev, struct iommu_domain *parent,
> const struct iommu_hwpt_arm_smmuv3 *user_cfg = user_data;
> struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> unsigned type = IOMMU_DOMAIN_UNMANAGED;
> + struct arm_smmu_domain *s2 = NULL;
> +
> + if (parent) {
> + if (parent->ops != arm_smmu_ops.default_domain_ops)
> + return NULL;
> + type = IOMMU_DOMAIN_NESTED;
> + s2 = to_smmu_domain(parent);
> + }
>
> - return __arm_smmu_domain_alloc(type, NULL, master, user_cfg);
> + return __arm_smmu_domain_alloc(type, s2, master, user_cfg);
> }
>
> static struct iommu_ops arm_smmu_ops = {

2023-03-09 13:42:48

by Jean-Philippe Brucker

Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

Hi Nicolin,

On Thu, Mar 09, 2023 at 02:53:38AM -0800, Nicolin Chen wrote:
> Add the following data structures for corresponding ioctls:
> iommu_hwpt_arm_smmuv3 => IOMMUFD_CMD_HWPT_ALLOC
> iommu_hwpt_invalidate_arm_smmuv3 => IOMMUFD_CMD_HWPT_INVALIDATE
>
> Also, add IOMMU_HW_INFO_TYPE_ARM_SMMUV3 and IOMMU_PGTBL_TYPE_ARM_SMMUV3_S1
> to the header and corresponding type/size arrays.
>
> Signed-off-by: Nicolin Chen <[email protected]>

> +/**
> + * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 specific page table data
> + *
> + * @flags: page table entry attributes
> + * @s2vmid: Virtual machine identifier
> + * @s1ctxptr: Stage-1 context descriptor pointer
> + * @s1cdmax: Number of CDs pointed to by s1ContextPtr
> + * @s1fmt: Stage-1 Format
> + * @s1dss: Default substream
> + */
> +struct iommu_hwpt_arm_smmuv3 {
> +#define IOMMU_SMMUV3_FLAG_S2 (1 << 0) /* if unset, stage-1 */

I don't understand the purpose of this flag, since the structure only
provides stage-1 configuration fields

> +#define IOMMU_SMMUV3_FLAG_VMID (1 << 1) /* vmid override */

Doesn't this break isolation? The VMID space is global for the SMMU, so
the guest could access cached mappings of another device

> + __u64 flags;
> + __u32 s2vmid;
> + __u32 __reserved;
> + __u64 s1ctxptr;
> + __u64 s1cdmax;
> + __u64 s1fmt;
> + __u64 s1dss;
> +};
> +


> +/**
> + * struct iommu_hwpt_invalidate_arm_smmuv3 - ARM SMMUv3 cache invalidation info
> + * @flags: boolean attributes of cache invalidation command
> + * @opcode: opcode of cache invalidation command
> + * @ssid: SubStream ID
> + * @granule_size: page/block size of the mapping in bytes
> + * @range: IOVA range to invalidate
> + */
> +struct iommu_hwpt_invalidate_arm_smmuv3 {
> +#define IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF (1 << 0)
> + __u64 flags;
> + __u8 opcode;
> + __u8 padding[3];
> + __u32 asid;
> + __u32 ssid;
> + __u32 granule_size;
> + struct iommu_iova_range range;
> +};

Although we can keep the alloc and hardware info separate for each IOMMU
architecture, we should try to come up with common invalidation methods.

It matters because things like vSVA, or just efficient dynamic mappings,
will require optimal invalidation latency. A paravirtual interface like
vhost-iommu can help with that, as the host kernel will handle guest
invalidations directly instead of doing a round-trip to host userspace
(and we'll likely want the same path for PRI.)

Supporting HW page tables for a common PV IOMMU does require some
architecture-specific knowledge, but invalidation messages contain roughly
the same information on all architectures. The PV IOMMU won't include
command opcodes for each possible architecture if one generic command does
the same job.

Ideally I'd like something like this for vhost-iommu:

* slow path through userspace for complex requests like attach-table and
probe, where the VMM can decode arch-specific information and translate
them to iommufd and vhost-iommu ioctls to update the configuration.

* fast path within the kernel for performance-critical requests like
invalidate, page request and response. It would be absurd for the
vhost-iommu driver to translate generic invalidation requests from the
guest into arch-specific commands with special opcodes, when the next
step is calling the IOMMU driver which does that for free.

During previous discussions we came up with generic invalidations that
could fit both Arm and x86 [1][2]. The only difference was the ASID
(called archid/id in those proposals) which VT-d didn't need. Could we try
to build on that?

[1] https://elixir.bootlin.com/linux/v5.17/source/include/uapi/linux/iommu.h#L161
[2] https://lists.oasis-open.org/archives/virtio-dev/202102/msg00014.html

Thanks,
Jean


2023-03-09 13:44:47

by Robin Murphy

Subject: Re: [PATCH v1 13/14] iommu/arm-smmu-v3: Add CMDQ_OP_TLBI_NH_VAA and CMDQ_OP_TLBI_NH_ALL

On 2023-03-09 10:53, Nicolin Chen wrote:
> With a nested translation setup, a stage-1 Context Descriptor table can be
> managed by a guest OS in the user space. So, the kernel driver should not
> assume that the guest OS will use a user space device driver that doesn't
> support TLBI_NH_VAA and TLBI_NH_ALL commands.
>
> Add them in the arm_smmu_cmdq_build_cmd(), to prepare for support of these
> two TLBI invalidation requests from the guest level.
>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 4 ++++
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 ++
> 2 files changed, 6 insertions(+)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 1f318b5e0921..ac63185ae268 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -277,6 +277,9 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
> /* Cover the entire SID range */
> cmd[1] |= FIELD_PREP(CMDQ_CFGI_1_RANGE, 31);
> break;
> + case CMDQ_OP_TLBI_NH_VAA:
> + ent->tlbi.asid = 0;

This is backwards - NH_VA is a superset of NH_VAA (not to mention that
quietly modifying the input argument is ugly; in fact it might be nice
if ent was const here).

Please follow the existing pattern, and decouple NH_VA from EL2_VA if
necessary.

Thanks,
Robin.
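
The ordering asked for above (NH_VA adds the ASID and then falls through to the NH_VAA fields, with a const input) might look roughly like the following standalone sketch. The opcode values are the ones from the patch; the field positions, struct tlbi_ent and build_tlbi_cmd() are simplified assumptions for illustration rather than the driver's definitions, and the EL2_VA sharing in the real switch is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Opcodes as defined in the patch under review. */
#define CMDQ_OP_TLBI_NH_VA	0x12
#define CMDQ_OP_TLBI_NH_VAA	0x13

/* Illustrative field positions; assumptions, not the driver header. */
#define TLBI_0_VMID_SHIFT	32
#define TLBI_0_ASID_SHIFT	48
#define TLBI_1_LEAF		(1ULL << 0)
#define TLBI_1_VA_MASK		(~0xfffULL)

/* Simplified model of the TLBI part of a command queue entry. */
struct tlbi_ent {
	uint8_t  opcode;
	uint16_t asid;
	uint16_t vmid;
	bool     leaf;
	uint64_t addr;
};

/* Build a two-word command; ent is const, so the caller's data is untouched. */
static int build_tlbi_cmd(uint64_t cmd[2], const struct tlbi_ent *ent)
{
	assert(cmd && ent);
	cmd[0] = ent->opcode;
	cmd[1] = 0;

	switch (ent->opcode) {
	case CMDQ_OP_TLBI_NH_VA:
		/* Only the by-ASID variant encodes an ASID. */
		cmd[0] |= (uint64_t)ent->asid << TLBI_0_ASID_SHIFT;
		/* fall through */
	case CMDQ_OP_TLBI_NH_VAA:
		/* Everything else is common to both variants. */
		cmd[0] |= (uint64_t)ent->vmid << TLBI_0_VMID_SHIFT;
		cmd[1] |= ent->leaf ? TLBI_1_LEAF : 0;
		cmd[1] |= ent->addr & TLBI_1_VA_MASK;
		break;
	default:
		return -1;	/* opcode not handled by this sketch */
	}
	return 0;
}
```

Because the ASID is only ever added, never cleared, the NH_VAA path needs no write to the input entry.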

> + fallthrough;
> case CMDQ_OP_TLBI_NH_VA:
> cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
> fallthrough;
> @@ -301,6 +304,7 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
> case CMDQ_OP_TLBI_NH_ASID:
> cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
> fallthrough;
> + case CMDQ_OP_TLBI_NH_ALL:
> case CMDQ_OP_TLBI_S12_VMALL:
> cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
> break;
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 6cf516852721..6181d6cd8b51 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -454,8 +454,10 @@ struct arm_smmu_cmdq_ent {
> };
> } cfgi;
>
> + #define CMDQ_OP_TLBI_NH_ALL 0x10
> #define CMDQ_OP_TLBI_NH_ASID 0x11
> #define CMDQ_OP_TLBI_NH_VA 0x12
> + #define CMDQ_OP_TLBI_NH_VAA 0x13
> #define CMDQ_OP_TLBI_EL2_ALL 0x20
> #define CMDQ_OP_TLBI_EL2_ASID 0x21
> #define CMDQ_OP_TLBI_EL2_VA 0x22

2023-03-09 14:28:26

by Robin Murphy

Subject: Re: [PATCH v1 12/14] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED type of allocations

On 2023-03-09 13:20, Robin Murphy wrote:
> On 2023-03-09 10:53, Nicolin Chen wrote:
>> Add domain allocation support for IOMMU_DOMAIN_NESTED type. This includes
>> the "finalise" part to log in the user space Stream Table Entry info.
>>
>> Co-developed-by: Eric Auger <[email protected]>
>> Signed-off-by: Eric Auger <[email protected]>
>> Signed-off-by: Nicolin Chen <[email protected]>
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 38 +++++++++++++++++++--
>>   1 file changed, 36 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 5ff74edfbd68..1f318b5e0921 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -2214,6 +2214,19 @@ static int arm_smmu_domain_finalise(struct
>> iommu_domain *domain,
>>           return 0;
>>       }
>> +    if (domain->type == IOMMU_DOMAIN_NESTED) {
>> +        if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1) ||
>> +            !(smmu->features & ARM_SMMU_FEAT_TRANS_S2)) {
>> +            dev_dbg(smmu->dev, "does not implement two stages\n");
>> +            return -EINVAL;
>> +        }
>> +        smmu_domain->stage = ARM_SMMU_DOMAIN_S1;
>> +        smmu_domain->s1_cfg.s1fmt = user_cfg->s1fmt;
>> +        smmu_domain->s1_cfg.s1cdmax = user_cfg->s1cdmax;
>> +        smmu_domain->s1_cfg.cdcfg.cdtab_dma = user_cfg->s1ctxptr;
>> +        return 0;
>
> How's that going to work? If the caller's asked for something we can't
> provide, returning something else and hoping it fails later is not
> sensible, we should just fail right here. It's even more worrying if
> there's a chance it *won't* fail later, and a guest ends up with
> "nested" translation giving it full access to host PA space :/

Oops, apologies - in part thanks to the confusing indentation, I managed
to miss the early return and misread this all being under the if
condition for nesting not being supported. Sorry for the confusion :(

Thanks,
Robin.

>> +    }
>> +
>>       if (user_cfg_s2 && !(smmu->features & ARM_SMMU_FEAT_TRANS_S2))
>>           return -EINVAL;
>>       if (user_cfg_s2)
>> @@ -2863,6 +2876,11 @@ static void arm_smmu_remove_dev_pasid(struct
>> device *dev, ioasid_t pasid)
>>       arm_smmu_sva_remove_dev_pasid(domain, dev, pasid);
>>   }
>> +static const struct iommu_domain_ops arm_smmu_nested_domain_ops = {
>> +    .attach_dev        = arm_smmu_attach_dev,
>> +    .free            = arm_smmu_domain_free,
>> +};
>> +
>>   static struct iommu_domain *
>>   __arm_smmu_domain_alloc(unsigned type,
>>               struct arm_smmu_domain *s2,
>> @@ -2877,11 +2895,15 @@ __arm_smmu_domain_alloc(unsigned type,
>>           return arm_smmu_sva_domain_alloc();
>>       if (type != IOMMU_DOMAIN_UNMANAGED &&
>> +        type != IOMMU_DOMAIN_NESTED &&
>>           type != IOMMU_DOMAIN_DMA &&
>>           type != IOMMU_DOMAIN_DMA_FQ &&
>>           type != IOMMU_DOMAIN_IDENTITY)
>>           return NULL;
>> +    if (s2 && s2->stage != ARM_SMMU_DOMAIN_S2)
>> +        return NULL;
>> +
>>       /*
>>        * Allocate the domain and initialise some of its data structures.
>>        * We can't really finalise the domain unless a master is given.
>> @@ -2889,10 +2911,14 @@ __arm_smmu_domain_alloc(unsigned type,
>>       smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
>>       if (!smmu_domain)
>>           return NULL;
>> +    smmu_domain->s2 = s2;
>>       domain = &smmu_domain->domain;
>>       domain->type = type;
>> -    domain->ops = arm_smmu_ops.default_domain_ops;
>> +    if (s2)
>> +        domain->ops = &arm_smmu_nested_domain_ops;
>> +    else
>> +        domain->ops = arm_smmu_ops.default_domain_ops;
>>       mutex_init(&smmu_domain->init_mutex);
>>       INIT_LIST_HEAD(&smmu_domain->devices);
>> @@ -2923,8 +2949,16 @@ arm_smmu_domain_alloc_user(struct device *dev,
>> struct iommu_domain *parent,
>>       const struct iommu_hwpt_arm_smmuv3 *user_cfg = user_data;
>>       struct arm_smmu_master *master = dev_iommu_priv_get(dev);
>>       unsigned type = IOMMU_DOMAIN_UNMANAGED;
>> +    struct arm_smmu_domain *s2 = NULL;
>> +
>> +    if (parent) {
>> +        if (parent->ops != arm_smmu_ops.default_domain_ops)
>> +            return NULL;
>> +        type = IOMMU_DOMAIN_NESTED;
>> +        s2 = to_smmu_domain(parent);
>> +    }
>> -    return __arm_smmu_domain_alloc(type, NULL, master, user_cfg);
>> +    return __arm_smmu_domain_alloc(type, s2, master, user_cfg);
>>   }
>>   static struct iommu_ops arm_smmu_ops = {
>

2023-03-09 14:51:28

by Jason Gunthorpe

Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Thu, Mar 09, 2023 at 01:42:17PM +0000, Jean-Philippe Brucker wrote:

> Although we can keep the alloc and hardware info separate for each IOMMU
> architecture, we should try to come up with common invalidation methods.

The invalidation language is tightly linked to the actual cache block
and cache tag in the IOMMU HW design. Generality will lose or
obfuscate the necessary specificity that is required for creating real
vIOMMUs.

Further, invalidation is a fast path, it is crazy to take a vIOMMU of
a real HW receiving a native invalidation request, mangle it to some
obfuscated kernel version and then de-mangle it again in the kernel
driver. IMHO ideally qemu will simply point the invalidation at the
WQE in the SW vIOMMU command queue and invoke the ioctl. (Nicolin, we
should check more into this)

The purpose of these interfaces is to support high performance full
functionality vIOMMUs of the real HW, we should not lose sight of
that goal.

We are actually planning to go further and expose direct invalidation
work queues complete with HW doorbells to userspace. This obviously
must be in native HW format.

Nicolin, I think we should tweak the uAPI here so that the
invalidation opaque data has a format tagged on its own, instead of
re-using the HWPT tag. I.e. you can have an ARM SMMUv3 invalidate type
tag and also a virtio-viommu invalidate type tag.

This will allow Jean to put the invalidation decoding in the iommu
drivers if we think that is the right direction. virtio can
standardize it as a "HW format".
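To make the idea of a separately tagged invalidation format concrete, here is a minimal userspace sketch. Every name in it (inv_data_type, inv_request, inv_dispatch and the two handlers) is hypothetical and not part of the proposed uAPI; the handlers only check the payload length so that the dispatch itself stays visible.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical invalidation format tags, independent of any HWPT type. */
enum inv_data_type {
	INV_DATA_ARM_SMMUV3 = 1,
	INV_DATA_VIRTIO_IOMMU = 2,
};

/* Hypothetical ioctl payload: a format tag plus an opaque blob. */
struct inv_request {
	uint32_t data_type;
	uint32_t data_len;
	const void *data;
};

/* Stand-in handler: accepts exactly one 16-byte SMMUv3 command. */
static int inv_arm_smmuv3(const void *data, uint32_t len)
{
	(void)data;
	return len == 16 ? 0 : -EINVAL;
}

/* Stand-in handler: accepts any non-empty virtio-format request. */
static int inv_virtio_iommu(const void *data, uint32_t len)
{
	(void)data;
	return len ? 0 : -EINVAL;
}

/* Route the opaque data to the decoder selected by its own tag. */
static int inv_dispatch(const struct inv_request *req)
{
	if (!req || !req->data)
		return -EINVAL;

	switch (req->data_type) {
	case INV_DATA_ARM_SMMUV3:
		return inv_arm_smmuv3(req->data, req->data_len);
	case INV_DATA_VIRTIO_IOMMU:
		return inv_virtio_iommu(req->data, req->data_len);
	default:
		return -EOPNOTSUPP;
	}
}
```

With a tag of its own, a driver could accept more than one invalidation format for the same HWPT type, which is what would leave room for a virtio "HW format" later.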

> Ideally I'd like something like this for vhost-iommu:
>
> * slow path through userspace for complex requests like attach-table and
> probe, where the VMM can decode arch-specific information and translate
> them to iommufd and vhost-iommu ioctls to update the configuration.
>
> * fast path within the kernel for performance-critical requests like
> invalidate, page request and response. It would be absurd for the
> vhost-iommu driver to translate generic invalidation requests from the
> guest into arch-specific commands with special opcodes, when the next
> step is calling the IOMMU driver which does that for free.

Someone has to do the conversion. If you don't think virtio should do
it then I'd be OK to add a type tag for virtio format invalidation and
put it in the IOMMU driver.

But given virtio overall already has to know *a lot* about how the HW
it is wrapping works I don't think it is necessarily absurd for virtio
to do the conversion. I'd like to evaluate this in patches in context
with how much other unique HW code ends up in kernel-side vhost-iommu.

However, I don't know the rationale for virtio-viommu, it seems like a
strange direction to me. All the iommu drivers have native command
queues. ARM and AMD are both supporting native command queues directly
in the guest, complete with a direct guest MMIO doorbell ring.

If someone wants to optimize this I'd think the way to do it is to use
virtio like techniques to put SW command queue processing in the
kernel iommu driver and continue to use the HW native interface in the
VM.

What benefit comes from replacing the HW native interface with virtio?
Especially on ARM where the native interface is pretty clean?

> During previous discussions we came up with generic invalidations that
> could fit both Arm and x86 [1][2]. The only difference was the ASID
> (called archid/id in those proposals) which VT-d didn't need. Could we try
> to build on that?

IMHO this was just unioning all the different invalidation types
together. It makes sense for something like virtio but it is
illogical/obfuscated as a user/kernel interface since it still
requires a userspace HW driver to understand what subset of the
invalidations are used on the actual HW.

Jason

2023-03-09 14:51:58

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On 2023-03-09 10:53, Nicolin Chen wrote:
> Add arm_smmu_cache_invalidate_user() function for user space to invalidate
> TLB entries and Context Descriptors, since either an IO page table entry
> or a Context Descriptor in the user space is still cached by the hardware.
>
> The input user_data is defined in "struct iommu_hwpt_invalidate_arm_smmuv3"
> that contains the essential data for corresponding invalidation commands.
>
> Co-developed-by: Eric Auger <[email protected]>
> Signed-off-by: Eric Auger <[email protected]>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 56 +++++++++++++++++++++
> 1 file changed, 56 insertions(+)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index ac63185ae268..7d73eab5e7f4 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2880,9 +2880,65 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
> arm_smmu_sva_remove_dev_pasid(domain, dev, pasid);
> }
>
> +static void arm_smmu_cache_invalidate_user(struct iommu_domain *domain,
> + void *user_data)
> +{
> + struct iommu_hwpt_invalidate_arm_smmuv3 *inv_info = user_data;
> + struct arm_smmu_cmdq_ent cmd = { .opcode = inv_info->opcode };
> + struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> + struct arm_smmu_device *smmu = smmu_domain->smmu;
> + size_t granule_size = inv_info->granule_size;
> + unsigned long iova = 0;
> + size_t size = 0;
> + int ssid = 0;
> +
> + if (!smmu || !smmu_domain->s2 || domain->type != IOMMU_DOMAIN_NESTED)
> + return;
> +
> + switch (inv_info->opcode) {
> + case CMDQ_OP_CFGI_CD:
> + case CMDQ_OP_CFGI_CD_ALL:
> + return arm_smmu_sync_cd(smmu_domain, inv_info->ssid, true);

Since we let the guest choose its own S1Fmt (and S1CDMax, yet not
S1DSS?), how can we assume leaf = true here?

> + case CMDQ_OP_TLBI_NH_VA:
> + cmd.tlbi.asid = inv_info->asid;
> + fallthrough;
> + case CMDQ_OP_TLBI_NH_VAA:
> + if (!granule_size || !(granule_size & smmu->pgsize_bitmap) ||

Non-range invalidations with TG=0 are perfectly legal, and should not be
ignored.

> + granule_size & ~(1ULL << __ffs(granule_size)))

If that's intended to mean is_power_of_2(), please just use is_power_of_2().
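As a quick check that the open-coded expression really is a power-of-two test, the following userspace sketch compares the two forms. It assumes GCC or Clang, since __builtin_ctzll() stands in for the kernel's __ffs(); is_power_of_2() is re-implemented here with the same logic as the helper in <linux/log2.h>, and has_extra_bits() and count_mismatches() are names made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same logic as the kernel helper in <linux/log2.h>. */
static bool is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * The open-coded test from the patch: true when n has any bit set
 * besides its lowest set bit. Only valid for n != 0, which the patch
 * guarantees by checking !granule_size first.
 */
static bool has_extra_bits(uint64_t n)
{
	return (n & ~(1ULL << __builtin_ctzll(n))) != 0;
}

/* Count values in [1, limit] where the two tests disagree. */
static unsigned int count_mismatches(uint64_t limit)
{
	unsigned int mismatches = 0;

	for (uint64_t n = 1; n <= limit; n++)
		if (has_extra_bits(n) == is_power_of_2(n))
			mismatches++;

	return mismatches;
}
```

For nonzero input, has_extra_bits(n) is exactly !is_power_of_2(n), so the two conditions in the patch collapse into a single !is_power_of_2(granule_size).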

> + return;
> +
> + iova = inv_info->range.start;
> + size = inv_info->range.last - inv_info->range.start + 1;

If the design here is that user_data is so deeply driver-specific and
special to the point that it can't possibly be passed as a type-checked
union of the known and publicly-visible UAPI types that it is, wouldn't
it make sense to just encode the whole thing in the expected format and
not have to make these kinds of niggling little conversions at both ends?

> + if (!size)
> + return;
> +
> + cmd.tlbi.vmid = smmu_domain->s2->s2_cfg.vmid;
> + cmd.tlbi.leaf = inv_info->flags & IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF;
> + __arm_smmu_tlb_inv_range(&cmd, iova, size, granule_size, smmu_domain);
> + break;
> + case CMDQ_OP_TLBI_NH_ASID:
> + cmd.tlbi.asid = inv_info->asid;
> + fallthrough;
> + case CMDQ_OP_TLBI_NH_ALL:
> + cmd.tlbi.vmid = smmu_domain->s2->s2_cfg.vmid;
> + arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
> + break;
> + case CMDQ_OP_ATC_INV:
> + ssid = inv_info->ssid;
> + iova = inv_info->range.start;
> + size = inv_info->range.last - inv_info->range.start + 1;
> + break;

Can we do any better than multiplying every single ATC_INV command, even
for random bogus StreamIDs, into multiple commands across every physical
device? In fact, I'm not entirely confident this isn't problematic, if
the guest wishes to send invalidations for one device specifically while
it's put some other device into a state where sending it a command would
do something bad. At the very least, it's liable to be confusing if the
guest sends a command for one StreamID but gets an error back for a
different one.

And if we expect ATS, what about PRI? Per patch #4 you're currently
offering that to the guest as well.

> + default:
> + return;

What about NSNH_ALL? That still needs to invalidate all the S1 context
that the guest *thinks* it's invalidating.

Also, perhaps I've overlooked something obvious, but what's the
procedure for reflecting illegal commands back to userspace? Some of the
things we're silently ignoring here would be expected to raise
CERROR_ILL. Same goes for all the other fault events which may occur due
to invalid S1 config, come to think of it.

Thanks,
Robin.

> + }
> +
> + arm_smmu_atc_inv_domain(smmu_domain, ssid, iova, size);
> +}
> +
> static const struct iommu_domain_ops arm_smmu_nested_domain_ops = {
> .attach_dev = arm_smmu_attach_dev,
> .free = arm_smmu_domain_free,
> + .cache_invalidate_user = arm_smmu_cache_invalidate_user,
> };
>
> static struct iommu_domain *

Subject: RE: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3



> -----Original Message-----
> From: Jean-Philippe Brucker [mailto:[email protected]]
> Sent: 09 March 2023 13:42
> To: Nicolin Chen <[email protected]>
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Shameerali Kolothum Thodi
> <[email protected]>;
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures
> for ARM SMMUv3
>
> Hi Nicolin,
>
> On Thu, Mar 09, 2023 at 02:53:38AM -0800, Nicolin Chen wrote:
> > Add the following data structures for corresponding ioctls:
> > iommu_hwpt_arm_smmuv3 =>
> IOMMUFD_CMD_HWPT_ALLOC
> > iommu_hwpt_invalidate_arm_smmuv3 =>
> IOMMUFD_CMD_HWPT_INVALIDATE
> >
> > Also, add IOMMU_HW_INFO_TYPE_ARM_SMMUV3 and
> IOMMU_PGTBL_TYPE_ARM_SMMUV3_S1
> > to the header and corresponding type/size arrays.
> >
> > Signed-off-by: Nicolin Chen <[email protected]>
>
> > +/**
> > + * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 specific page table
> data
> > + *
> > + * @flags: page table entry attributes
> > + * @s2vmid: Virtual machine identifier
> > + * @s1ctxptr: Stage-1 context descriptor pointer
> > + * @s1cdmax: Number of CDs pointed to by s1ContextPtr
> > + * @s1fmt: Stage-1 Format
> > + * @s1dss: Default substream
> > + */
> > +struct iommu_hwpt_arm_smmuv3 {
> > +#define IOMMU_SMMUV3_FLAG_S2 (1 << 0) /* if unset, stage-1 */
>
> I don't understand the purpose of this flag, since the structure only
> provides stage-1 configuration fields
>
> > +#define IOMMU_SMMUV3_FLAG_VMID (1 << 1) /* vmid override */
>
> Doesn't this break isolation? The VMID space is global for the SMMU, so
> the guest could access cached mappings of another device

On platforms that support BTM [1], we may need the VMID allocated by KVM.
But again, getting that from userspace doesn't look safe. I have attempted to revise
the earlier RFC to pin and use the KVM VMID from SMMUv3 here [2].

But the problem is getting the KVM instance associated with the device. Currently I am
going through the VFIO layer to retrieve the KVM instance (vfio_device->kvm).

In the previous RFC discussion thread [3], Jean proposed:

" In the new design we can require from the start that creating a nesting IOMMU
container through /dev/iommu *must* come with a KVM context, that way
we're sure to reuse the existing VMID. "

Is that something we can still do, or is there a better way to handle this now?

Thanks,
Shameer


1. https://lore.kernel.org/linux-arm-kernel/YEEUocRn3IfIDpLj@myrica/T/#m478f7e7d5dcb729e02721beda35efa12c1d20707
2. https://github.com/hisilicon/kernel-dev/commits/iommufd-v6.2-rc4-nesting-arm-btm-v2
3. https://lore.kernel.org/linux-arm-kernel/YEEUocRn3IfIDpLj@myrica/T/#m11cde7534943ea7cf35f534cb809a023eabd9da3


2023-03-09 15:31:19

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Thu, Mar 09, 2023 at 02:49:14PM +0000, Robin Murphy wrote:

> If the design here is that user_data is so deeply driver-specific and
> special to the point that it can't possibly be passed as a type-checked
> union of the known and publicly-visible UAPI types that it is, wouldn't it
> make sense to just encode the whole thing in the expected format and not
> have to make these kinds of niggling little conversions at both ends?

Yes, I suspect the design for ARM should have the input be the entire
actual command work queue entry. There is no reason to burn CPU cycles
in userspace marshalling it to something else and then decoding it again
in the kernel. Organize things to point the ioctl directly at the
queue entry, and the kernel can do a single memcpy from guest
controlled pages to kernel memory then parse it?

More broadly, maybe this should be able to process a list of commands?
If the queue has a number of invalidations, batching them to the kernel
sure would be nice.
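A rough sketch of what "copy the raw queue entry, then parse it" could look like for a batch, written as plain userspace C. The function and struct names are hypothetical. The opcode values and the position of the opcode in bits [7:0] of the first dword follow the Linux driver's header as I recall it, and should be checked against the SMMUv3 specification. Only a subset of opcodes is listed.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CMD_DWORDS	2			/* one command is 2 x 64 bits */
#define CMD_OP(dw0)	((dw0) & 0xffULL)	/* opcode in bits [7:0] */

/* Subset of opcodes; values assumed from the Linux driver header. */
#define OP_CFGI_CD		0x05
#define OP_CFGI_CD_ALL		0x06
#define OP_TLBI_NH_ASID		0x11
#define OP_TLBI_NH_VA		0x12
#define OP_TLBI_NSNH_ALL	0x30
#define OP_ATC_INV		0x40

struct raw_cmd {
	uint64_t dw[CMD_DWORDS];
};

/* Return 1 if this sketch would forward the command, 0 otherwise. */
static int cmd_is_allowed(uint64_t dw0)
{
	switch (CMD_OP(dw0)) {
	case OP_CFGI_CD:
	case OP_CFGI_CD_ALL:
	case OP_TLBI_NH_ASID:
	case OP_TLBI_NH_VA:
	case OP_TLBI_NSNH_ALL:
	case OP_ATC_INV:
		return 1;
	default:
		return 0;
	}
}

/*
 * Hypothetical batch handler. Each entry is copied once out of the
 * guest-controlled queue memory and only the private copy is parsed.
 * On return *done is the number of commands consumed; on -EINVAL it
 * is also the index of the offending command.
 */
static int process_batch(const struct raw_cmd *queue, size_t n, size_t *done)
{
	for (*done = 0; *done < n; (*done)++) {
		struct raw_cmd cmd;

		memcpy(&cmd, &queue[*done], sizeof(cmd));
		if (!cmd_is_allowed(cmd.dw[0]))
			return -EINVAL;
		/* A real driver would build and issue the host command here. */
	}
	return 0;
}
```

Reporting how many commands were consumed gives the caller enough information to tell which guest command was rejected.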

Maybe also for Intel? Kevin?

> Also, perhaps I've overlooked something obvious, but what's the procedure
> for reflecting illegal commands back to userspace? Some of the things we're
> silently ignoring here would be expected to raise CERROR_ILL. Same goes for
> all the other fault events which may occur due to invalid S1 config, come to
> think of it.

Perhaps the ioctl should fail and the userspace viommu should inject
this CERROR_ILL?
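On the userspace side, injecting the error could amount to something like the sketch below. It is only a model of an emulated SMMUv3 register file: struct vsmmu_regs and vsmmu_inject_cerror_ill() are invented for this example, and the field layout (CMDQ_CONS.ERR in bits [30:24], CERROR_ILL encoded as 1, GERROR.CMDQ_ERR in bit 0 and signalled by toggling) is my reading of the architecture and should be verified against the spec.

```c
#include <stdint.h>

#define CONS_ERR_SHIFT	24
#define CONS_ERR_MASK	(0x7fu << CONS_ERR_SHIFT)	/* CMDQ_CONS.ERR */
#define CERROR_ILL	1u				/* illegal command */
#define GERROR_CMDQ_ERR	(1u << 0)

/* Hypothetical emulated register state of the vSMMU. */
struct vsmmu_regs {
	uint32_t cmdq_cons;
	uint32_t gerror;
};

/*
 * Called by the VMM after the invalidation ioctl failed on an illegal
 * command. stopped_at is the consumer index (including the wrap bit)
 * of the offending command, so the guest can find and fix it.
 */
static void vsmmu_inject_cerror_ill(struct vsmmu_regs *r, uint32_t stopped_at)
{
	r->cmdq_cons = (stopped_at & ~CONS_ERR_MASK) |
		       (CERROR_ILL << CONS_ERR_SHIFT);
	/* An error is active while GERROR differs from GERRORN. */
	r->gerror ^= GERROR_CMDQ_ERR;
}
```

The sketch leaves out raising the global error interrupt and stopping command consumption until the guest acknowledges the error.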

But I'm also wondering if we are making a mistake by not just having the
kernel driver expose a SW work queue in its native format, with the
ioctl being only 'read the queue'. Then it could (asynchronously!)
push back answers, real or emulated, as well, including all error
indications.

I think we went down this synchronous one-ioctl-per-invalidation path
because that was what the original generic stuff wanted to do. Is it
what we really want? Kevin what is your perspective?

Jason

2023-03-09 15:40:35

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Thu, Mar 09, 2023 at 03:26:12PM +0000, Shameerali Kolothum Thodi wrote:

> On platforms that support BTM [1], we may need the VMID allocated by KVM.
> But again, getting that from userspace doesn't look safe. I have attempted to revise
> the earlier RFC to pin and use the KVM VMID from SMMUv3 here[2].

Gurk

> " In the new design we can require from the start that creating a nesting IOMMU
> container through /dev/iommu *must* come with a KVM context, that way
> we're sure to reuse the existing VMID. "

I've been dreading this but yes I expect we will eventually need to
connect kvm and iommufd together. The iommu driver can receive a kvm
pointer as part of the alloc domain operation to do any setups like
this.

If there is no KVM it should either fail to set up the domain or set up
a domain disconnected from KVM.

If IOMMU HW and KVM HW are using the same ID number space then
arguably the two kernel drivers need to use a shared ID allocator in
the arch, regardless of what iommufd/etc does. Using KVM should not be
mandatory for iommufd.

For ARM cases where there is no shared VMID space with KVM, the ARM
VMID should be somehow assigned to the iommufd_ctx itself and the alloc
domain op should receive it from there.

Nicolin, that seems to be missing in this series? I'm not entirely
sure how to elegantly code it :\

Jason

Subject: RE: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3



> -----Original Message-----
> From: Jason Gunthorpe [mailto:[email protected]]
> Sent: 09 March 2023 15:40
> To: Shameerali Kolothum Thodi <[email protected]>
> Cc: Jean-Philippe Brucker <[email protected]>; Nicolin Chen
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures
> for ARM SMMUv3
>
> On Thu, Mar 09, 2023 at 03:26:12PM +0000, Shameerali Kolothum Thodi
> wrote:
>
> > On platforms that support BTM [1], we may need the VMID allocated by KVM.
> > But again, getting that from userspace doesn't look safe. I have attempted to revise
> > the earlier RFC to pin and use the KVM VMID from SMMUv3 here [2].
>
> Gurk
>
> > " In the new design we can require from the start that creating a nesting
> IOMMU
> > container through /dev/iommu *must* come with a KVM context, that way
> > we're sure to reuse the existing VMID. "
>
> I've been dreading this but yes I expect we will eventually need to
> connect kvm and iommufd together. The iommu driver can receive a kvm
> pointer as part of the alloc domain operation to do any setups like
> this.

That will make life easier :)

> If there is no KVM it should either fail to setup the domain or setup
> a domain disconnected from KVM.
>

If there is no KVM, the SMMUv3 can fall back to its internal VMID allocation, I guess.
And my intention was to use KVM VMID only if the platform supports
BTM.
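The fallback policy described here can be sketched as below. This is a userspace model only: struct kvm_ctx, struct vmid_pool, vmid_alloc() and pick_s2_vmid() are hypothetical names, the 8-bit VMID width is an assumption, and keeping VMID 0 reserved is a choice made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define VMID_BITS	8			/* assumed width */
#define NUM_VMIDS	(1u << VMID_BITS)

/* Hypothetical SMMU-internal VMID allocator state. */
struct vmid_pool {
	uint8_t used[NUM_VMIDS];
};

/* Hypothetical stand-in for a KVM instance whose VMID has been pinned. */
struct kvm_ctx {
	uint16_t vmid;
};

/* Allocate the lowest free VMID (0 is kept reserved); -1 when exhausted. */
static int vmid_alloc(struct vmid_pool *pool)
{
	for (unsigned int i = 1; i < NUM_VMIDS; i++) {
		if (!pool->used[i]) {
			pool->used[i] = 1;
			return (int)i;
		}
	}
	return -1;
}

/*
 * Use the KVM VMID only when a KVM instance was supplied and the
 * platform supports BTM; otherwise fall back to the internal pool.
 */
static int pick_s2_vmid(struct vmid_pool *pool, const struct kvm_ctx *kvm,
			bool has_btm)
{
	if (kvm && has_btm)
		return kvm->vmid;
	return vmid_alloc(pool);
}
```

The sketch does not stop a KVM VMID from colliding with one handed out by the internal pool. That is the case where the two drivers would need a shared ID allocator.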

> If IOMMU HW and KVM HW are using the same ID number space then
> arguably the two kernel drivers need to use a shared ID allocator in
> the arch, regardless of what iommufd/etc does. Using KVM should not be
> mandatory for iommufd.
>
> For ARM cases where there is no shared VMID space with KVM, the ARM
> VMID should be somehow assigned to the iommufd_ctx itself and the alloc
> domain op should receive it from there.

Is there any use of VMID outside SMMUv3? I was thinking that if the nested domain alloc
doesn't provide the KVM instance, then SMMUv3 can use its internal VMID.

Thanks,
Shameer

> Nicolin, that seems to be missing in this series? I'm not entirely
> sure how to elegantly code it :\
>
> Jason

2023-03-09 16:00:43

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Thu, Mar 09, 2023 at 03:51:42PM +0000, Shameerali Kolothum Thodi wrote:

> > For ARM cases where there is no shared VMID space with KVM, the ARM
> > VMID should be somehow assigned to the iommufd_ctx itself and the alloc
> > domain op should receive it from there.
>
> Is there any use of VMID outside SMMUv3? I was thinking if nested domain alloc
> doesn't provide the KVM instance, then SMMUv3 can use its internal VMID.

When we talk about exposing an SMMUv3 IOMMU CMDQ directly to userspace then
VMID is the security token that protects it.

So in that environment every domain under the same iommufd should
share the same VMID so that the CMDQs also share the same VMID.

I expect this to be a common sort of requirement as we will see
userspace command queues in the other HW as well.

So, I suppose the answer for now is that ARM SMMUv3 should just
allocate one VMID per iommu_domain and there should be no VMID in the
uapi at all.

Moving all iommu_domains to share the same VMID is a future patch.

Though.. I have no idea how vVMID is handled in the SMMUv3
architecture. I suppose the guest IOMMU HW caps are set in a way that
it knows it does not have VMID?

Jason

Subject: RE: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3



> -----Original Message-----
> From: Jason Gunthorpe [mailto:[email protected]]
> Sent: 09 March 2023 16:00
> To: Shameerali Kolothum Thodi <[email protected]>
> Cc: Jean-Philippe Brucker <[email protected]>; Nicolin Chen
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures
> for ARM SMMUv3
>
> On Thu, Mar 09, 2023 at 03:51:42PM +0000, Shameerali Kolothum Thodi
> wrote:
>
> > > For ARM cases where there is no shared VMID space with KVM, the ARM
> > > VMID should be somehow assigned to the iommufd_ctx itself and the alloc
> > > domain op should receive it from there.
> >
> > Is there any use of VMID outside SMMUv3? I was thinking if nested domain
> alloc
> > doesn't provide the KVM instance, then SMMUv3 can use its internal VMID.
>
> When we talk about exposing an SMMUv3 IOMMU CMDQ directly to
> userspace then
> VMID is the security token that protects it.
>
> So in that environment every domain under the same iommufd should
> share the same VMID so that the CMDQ's also share the same VMID.
>
> I expect this to be a common sort of requirement as we will see
> userspace command queues in the other HW as well.
>
> So, I suppose the answer for now is that ARM SMMUv3 should just
> allocate one VMID per iommu_domain and there should be no VMID in the
> uapi at all.
>
> Moving all iommu_domains to share the same VMID is a future patch.
>
> Though.. I have no idea how vVMID is handled in the SMMUv3
> architecture. I suppose the guest IOMMU HW caps are set in a way that
> it knows it does not have VMID?

I think the guest only sets up SMMUv3 stage 1 and doesn't use a VMID.

Thanks,
Shameer

Subject: RE: [PATCH v1 06/14] iommu/arm-smmu-v3: Unset corresponding STE fields when s2_cfg is NULL



> -----Original Message-----
> From: Robin Murphy [mailto:[email protected]]
> Sent: 09 March 2023 13:13
> To: Nicolin Chen <[email protected]>; [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; Shameerali Kolothum Thodi
> <[email protected]>; [email protected];
> [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH v1 06/14] iommu/arm-smmu-v3: Unset corresponding
> STE fields when s2_cfg is NULL
>
> On 2023-03-09 10:53, Nicolin Chen wrote:
> > From: Eric Auger <[email protected]>
> >
> > Despite the spec does not seem to mention this, on some implementations,
> > when the STE configuration switches from an S1+S2 cfg to an S1 only one,
> > a C_BAD_STE error would happen if dst[3] (S2TTB) is not reset.
>
> Can you provide more details, since it's not clear whether this is a
> hardware erratum workaround or a bodge around the driver itself doing
> something wrong like not doing a proper break-before-make transition of
> the STE. The architecture explicitly states that all the STE.S2* fields
> except S2VMID and potentially S2S are ignored when Stage 2 is bypassed.

Took a while to locate the email thread where this was discussed,
https://patchwork.kernel.org/cover/11449895/#23244457

This was observed on a HiSilicon implementation where, once the SMMUv3 has been configured in
both Stage 1 and Stage 2 (nested) mode, it is not possible to configure it back
to Stage 1 mode for the same device (stream ID).

IIRC, the SMMUv3 implementation on these boards expects the S2TTB field in the STE to be set to zero
when using S1; otherwise it reports a C_BAD_STE error. :(

You are right that the specification doesn't demand this, and I am not sure whether any other
hardware requires this.

Could we please have this with a comment added in the code?
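Something along these lines is what I have in mind, shown as a standalone sketch rather than the actual driver change. ste_write_s2ttb() and struct s2_cfg are simplified stand-ins, the S2TTB mask (bits [51:4] of STE dword 3) is taken from the driver header as I remember it, and endianness conversion and the STE update sequence are left out.

```c
#include <stddef.h>
#include <stdint.h>

#define STE_DWORDS		8
#define STE_3_S2TTB_MASK	0x000ffffffffffff0ULL	/* bits [51:4] */

/* Simplified stand-in for the driver's stage-2 configuration. */
struct s2_cfg {
	uint64_t vttbr;
};

/* Program or clear the stage-2 table base in STE dword 3. */
static void ste_write_s2ttb(uint64_t dst[STE_DWORDS], const struct s2_cfg *s2)
{
	if (s2) {
		dst[3] = s2->vttbr & STE_3_S2TTB_MASK;
		return;
	}

	/*
	 * The architecture says the S2* fields are ignored when stage 2
	 * is bypassed, but at least one HiSilicon implementation has been
	 * seen to raise C_BAD_STE when an STE goes from S1+S2 to S1 only
	 * with a stale S2TTB. Zero the field to be safe.
	 */
	dst[3] = 0;
}
```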

Thanks,
Shameer


2023-03-09 18:27:06

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 09, 2023 at 01:42:17PM +0000, Jean-Philippe Brucker wrote:
>
> > Although we can keep the alloc and hardware info separate for each IOMMU
> > architecture, we should try to come up with common invalidation methods.
>
> The invalidation language is tightly linked to the actual cache block
> and cache tag in the IOMMU HW design.

Concretely though, what are the incompatibilities between the HW designs?
They all need to remove a range of TLB entries, using some address space
tag. But if there is an actual difference I do need to know.

> Generality will lose or
> obfuscate the necessary specificity that is required for creating real
> vIOMMUs.
>
> Further, invalidation is a fast path, it is crazy to take a vIOMMU of
> a real HW receiving a native invalidation request, mangle it to some
> obfuscated kernel version and then de-mangle it again in the kernel
> driver. IMHO ideally qemu will simply point the invalidation at the
> WQE in the SW vIOMMU command queue and invoke the ioctl. (Nicolin, we
> should check more into this)

Avoiding copying a few bytes won't make up for the extra context switches
to userspace. An emulated IOMMU can easily decode commands and translate
them to generic kernel structures, in a handful of CPU cycles, just like
they decode STEs. It's what they do, and it's the opposite of obfuscation.

>
> The purpose of these interfaces is to support high performance full
> functionality vIOMMUs of the real HW, we should not lose sight of
> that goal.
>
> We are actually planning to go further and expose direct invalidation
> work queues complete with HW doorbells to userspace. This obviously
> must be in native HW format.

Doesn't seem relevant since direct access to command queue wouldn't use
this struct.

>
> Nicolin, I think we should tweak the uAPI here so that the
> invalidation opaque data has a format tagged on its own, instead of
> re-using the HWPT tag. I.e. you can have an ARM SMMUv3 invalidate type
> tag and also a virtio-viommu invalidate type tag.
>
> This will allow Jean to put the invalidation decoding in the iommu
> drivers if we think that is the right direction. virtio can
> standardize it as a "HW format".
>
> > Ideally I'd like something like this for vhost-iommu:
> >
> > * slow path through userspace for complex requests like attach-table and
> > probe, where the VMM can decode arch-specific information and translate
> > them to iommufd and vhost-iommu ioctls to update the configuration.
> >
> > * fast path within the kernel for performance-critical requests like
> > invalidate, page request and response. It would be absurd for the
> > vhost-iommu driver to translate generic invalidation requests from the
> > guest into arch-specific commands with special opcodes, when the next
> > step is calling the IOMMU driver which does that for free.
>
> Someone has to do the conversion. If you don't think virtio should do
> it then I'd be OK to add a type tag for virtio format invalidation and
> put it in the IOMMU driver.

Implementing two invalidation formats in each IOMMU driver does not seem
practical.

>
> But given virtio overall already has to know *a lot* about how the HW
> it is wrapping works I don't think it is necessarily absurd for virtio
> to do the conversion. I'd like to evaluate this in patches in context
> with how much other unique HW code ends up in kernel-side vhost-iommu.

Ideally none. I'd rather leave those, attach and probe, in userspace and
if possible compatible with iommufd to avoid register decoding.

>
> However, I don't know the rationale for virtio-viommu, it seems like a
> strange direction to me.

A couple of reasons are relevant here: non-QEMU VMMs don't want to emulate
all vendor IOMMUs, new architectures get vIOMMU mostly for free, and vhost
provides a faster path. Also the ability to optimize paravirtual
interfaces for things like combined invalidation (IOTLB+ATC) or, later,
nested page requests.

For a while the main vIOMMU use-case was assignment to guest userspace,
mainly DPDK, which works great with a generic and slow map/unmap
interface. Since vSVA is still a niche use-case, and nesting without page
faults requires pinning the whole guest memory, map/unmap still seems more
desirable to me. But there is some renewed interest in supporting page
tables with virtio-iommu for the reasons above.

> All the iommu drivers have native command
> queues. ARM and AMD are both supporting native command queues directly
> in the guest, complete with a direct guest MMIO doorbell ring.

Arm SMMUv3 mandates a single global command queue (SMMUv2 uses registers).
An SMMUv3 can optionally implement multiple command queues, though I don't
know if they can be safely assigned to guests. For a lot of SMMUv3
implementations that have a single queue and for other architectures, we
can do better than hardware emulation.

>
> If someone wants to optimize this I'd think the way to do it is to use
> virtio like techniques to put SW command queue processing in the
> kernel iommu driver and continue to use the HW native interface in the
> VM.

I didn't get which kernel this is.

>
> What benefit comes from replacing the HW native interface with virtio?
> Especially on ARM where the native interface is pretty clean?
>
> > During previous discussions we came up with generic invalidations that
> > could fit both Arm and x86 [1][2]. The only difference was the ASID
> > (called archid/id in those proposals) which VT-d didn't need. Could we try
> > to build on that?
>
> IMHO this was just unioning all the different invalidation types
> together. It makes sense for something like virtio but it is
> illogical/obfuscated as a user/kernel interface since it still
> requires a userspace HW driver to understand what subset of the
> invalidations are used on the actual HW.

As above, decoding arch-specific structures into generic ones is what an
emulated IOMMU does, and it doesn't make a performance difference in which
format it forwards that to the kernel. The host IOMMU driver checks the
guest request and copies them into the command queue. Whether that request
comes in the form of a structure binary-compatible with Arm SMMUvX.Y, or
some generic structure, does not make a difference.

Thanks,
Jean


2023-03-09 21:01:30

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Thu, Mar 09, 2023 at 06:26:59PM +0000, Jean-Philippe Brucker wrote:
> On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
> > On Thu, Mar 09, 2023 at 01:42:17PM +0000, Jean-Philippe Brucker wrote:
> >
> > > Although we can keep the alloc and hardware info separate for each IOMMU
> > > architecture, we should try to come up with common invalidation methods.
> >
> > The invalidation language is tightly linked to the actual cache block
> > and cache tag in the IOMMU HW design.
>
> Concretely though, what are the incompatibilities between the HW designs?
> They all need to remove a range of TLB entries, using some address space
> tag. But if there is an actual difference I do need to know.

For instance the address space tags and the cache entries they match
to are wildly different.

ARM uses a fine grained ASID and Intel uses a shared ASID called a DID
and incorporates the PASID into the cache tag.

AMD uses something called a DID that covers a different set of stuff
than the Intel DID, and it doesn't seem to work for nesting. AMD uses
PASID as the primary nested cache tag.

Superficially you can say all three have an ASID and you can have an
invalidate ASID operation and make it "look" the same, but the actual
behavior is totally ill defined and the whole thing is utterly
obfuscated as to what it actually MEANS.

But this doesn't matter for virtio. You have already got a spec that
defines invalidation in terms of virtio objects that sort of match
things like iommu_domains. I hope the virtio
VIRTIO_IOMMU_INVAL_S_DOMAIN is very well defined as to exactly what
objects a DOMAIN match applies to. The job of the hypervisor is to
translate that to whatever invalidation(s) the real HW requires.

I.e. virtio is going to say "invalidate this domain" and expect the
hypervisor to spew it to every attached PASID if that is what the HW
requires (eg AMD), or do a single ASID invalidation (Intel, sometimes).

But when a vIOMMU gets a vDID or vPASID invalidation command it
doesn't mean the same thing as virtio. The driver must invalidate
exactly what the vIOMMU programming model says to invalidate because
the guest is going to spew more invalidations to cover what it
needs. Over invalidation would be a performance problem.
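The contrast between the two models can be shown with a toy sketch. Nothing here is a real interface: struct hw_log simply records which per-PASID invalidations a pretend host would issue, and inv_domain() / inv_exact() are made-up names for the virtio-style and vIOMMU-style paths.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PASIDS	8

/* Records the per-PASID invalidations issued to the pretend HW. */
struct hw_log {
	uint32_t pasid[MAX_PASIDS];
	size_t n;
};

/* A domain and the PASIDs currently attached to it. */
struct domain {
	uint32_t pasids[MAX_PASIDS];
	size_t n;
};

static void hw_inv_pasid(struct hw_log *log, uint32_t pasid)
{
	if (log->n < MAX_PASIDS)
		log->pasid[log->n++] = pasid;
}

/*
 * virtio style: "invalidate this domain". The hypervisor decides what
 * that means for the HW; here it fans out to every attached PASID.
 */
static void inv_domain(struct hw_log *log, const struct domain *d)
{
	for (size_t i = 0; i < d->n; i++)
		hw_inv_pasid(log, d->pasids[i]);
}

/*
 * vIOMMU style: the guest named one tag, so invalidate exactly that
 * and nothing else; the guest sends further commands if it needs more.
 */
static void inv_exact(struct hw_log *log, uint32_t pasid)
{
	hw_inv_pasid(log, pasid);
}
```

In the first path the hypervisor chooses the scope of the invalidation. In the second path the guest chooses it, and anything wider than what it asked for is over-invalidation.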

Exposing these subtle differences to userspace is necessary. What I do
not want is leaking those differences through an ill-defined "generic"
interface.

Even more importantly Intel and ARM should not have to fight about the
subtle implementation specific details of the specification of the
"generic" interface. If Intel needs something dumb to make their
viommu work well then they should simply be able to do it. I don't
want to spend 6 months of pointless arguing about language details in
an unnecessary "generic" spec.

> > The purpose of these interfaces is to support high performance full
> > functionality vIOMMUs of the real HW, we should not lose sight of
> > that goal.
> >
> > We are actually planning to go further and expose direct invalidation
> > work queues complete with HW doorbells to userspace. This obviously
> > must be in native HW format.
>
> Doesn't seem relevant since direct access to command queue wouldn't use
> this struct.

The point is our design direction with iommufd is to expose the raw HW
to userspace, not to obfuscate it with ill defined generalizations.

> > Someone has to do the conversion. If you don't think virtio should do
> > it then I'd be OK to add a type tag for virtio format invalidation and
> > put it in the IOMMU driver.
>
> Implementing two invalidation formats in each IOMMU driver does not seem
> practical.

I don't see why.

The advantage of the kernel side is that the implementation is not
strong ABI. If we want to adjust and review the virtio invalidation
path as new HW comes along we can, so long as it is only in the
kernel.

On the other hand if we mess up the uABI for iommufd we are stuck with
it.

The safest and best uABI for iommufd is the HW native uABI because it,
almost by definition, cannot be wrong.

Anyhow, I'm still not very convinced adapting to virtio invalidation
format should be in iommu drivers. I think what you end up with for
virtio is that Intel/AMD have some nice common code to invalidate an
iommu_domain address range (probably even the existing invalidation
interface), and SMMUv3 is just totally different and special.

This is because SMMUv3 has no option to keep the PASID table in the
hypervisor so you are sadly forced to expose both the native ASID and
native PASID caches to the virtio protocol.

Given that the VM virtio driver has to have SMMUv3 specific code to
handle the CD table it must get, I don't see the problem with also
having SMMUv3 specific code in the hypervisor virtio driver to handle
invalidating based on the CD table.

Really, I want to see patches implementing all of this before we make
any decision on what the ops interface is for virtio-iommu kernel
side.

> > However, I don't know the rationale for virtio-viommu, it seems like a
> > strange direction to me.
>
> A couple of reasons are relevant here: non-QEMU VMMs don't want to emulate
> all vendor IOMMUs, new architectures get vIOMMU mostly for free,

So your argument is you can implement a simple map/unmap API riding
on the common IOMMU API and this is portable?

Seems sensible, but that falls apart pretty quickly when we talk about
nesting.. I don't think we can avoid VMM components to set this up, so
it stops being portable. At that point I'm back to asking why not use
the real HW model?

> > All the iommu drivers have native command
> > queues. ARM and AMD are both supporting native command queues directly
> > in the guest, complete with a direct guest MMIO doorbell ring.
>
> Arm SMMUv3 mandates a single global command queue (SMMUv2 uses
> registers). An SMMUv3 can optionally implement multiple command
> queues, though I don't know if they can be safely assigned to
> guests.

It is not standardized by ARM, but it can be (and has been) done.

> For a lot of SMMUv3 implementations that have a single queue and for
> other architectures, we can do better than hardware emulation.

How is using a SW emulated virtio formatted queue better than using a
SW emulated SMMUv3 ECMDQ?

The vSMMUv3 driver controls what capabilities are shown to the guest, so it
can definitely create an ECMDQ-enabled device and do something like
assign the secondary ECMDQs to hypervisor kernel SW queues the same way
virtio does.

I don't think there is a very solid argument that virtio-iommu is
necessary to get faster invalidation.

> > If someone wants to optimize this I'd think the way to do it is to use
> > virtio like techniques to put SW command queue processing in the
> > kernel iommu driver and continue to use the HW native interface in the
> > VM.
>
> I didn't get which kernel this is.

hypervisor kernel.

> > IMHO this was just unioning all the different invalidation types
> > together. It makes sense for something like virtio but it is
> > illogical/obfuscated as a user/kernel interface since it still
> > requires a userspace HW driver to understand what subset of the
> > invalidations are used on the actual HW.
>
> As above, decoding arch-specific structures into generic ones is what an
> emulated IOMMU does,

No, it is what virtio wants to do. We are deliberately trying not to
do that for real accelerated HW vIOMMU emulators.

> and it doesn't make a performance difference in which
> format it forwards that to the kernel. The host IOMMU driver checks the
> guest request and copies them into the command queue. Whether that request
> comes in the form of a structure binary-compatible with Arm SMMUvX.Y, or
> some generic structure, does not make a difference.

It is not the structure layouts that matter!

It is the semantic meaning of each request, on each unique piece of
hardware. We actually want to leak the subtle semantic differences to
userspace.

Doing that and continuing to give them the same command label is
exactly the kind of obfuscated ill defined "generic" interface I do
not want.

Jason

2023-03-10 01:17:30

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 04/14] iommu/arm-smmu-v3: Add arm_smmu_hw_info

Hi Robin,

Thanks for the inputs.

On Thu, Mar 09, 2023 at 01:03:41PM +0000, Robin Murphy wrote:
> On 2023-03-09 10:53, Nicolin Chen wrote:
> > This is used to forward the host IDR values to the user space, so the
> > hypervisor and the guest VM can learn about the underlying hardware's
> > capabilities.
> >
> > Also, set the driver_type to IOMMU_HW_INFO_TYPE_ARM_SMMUV3 to pass the
> > corresponding type sanity in the core.
> >
> > Signed-off-by: Nicolin Chen <[email protected]>
> > ---
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 25 +++++++++++++++++++++
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 ++
> > include/uapi/linux/iommufd.h | 14 ++++++++++++
> > 3 files changed, 41 insertions(+)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index f2425b0f0cd6..c1aac695ae0d 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -2005,6 +2005,29 @@ static bool arm_smmu_capable(struct device *dev, enum iommu_cap cap)
> > }
> > }
> >
> > +static void *arm_smmu_hw_info(struct device *dev, u32 *length)
> > +{
> > + struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> > + struct iommu_hw_info_smmuv3 *info;
> > + void *base_idr;
> > + int i;
> > +
> > + if (!master || !master->smmu)
> > + return ERR_PTR(-ENODEV);
> > +
> > + info = kzalloc(sizeof(*info), GFP_KERNEL);
> > + if (!info)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + base_idr = master->smmu->base + ARM_SMMU_IDR0;
> > + for (i = 0; i <= 5; i++)
> > + info->idr[i] = readl_relaxed(base_idr + 0x4 * i);
>
> You need to take firmware overrides etc. into account here. In
> particular, features like BTM may need to be hidden to work around
> errata either in the system integration or the SMMU itself. It isn't
> reasonable to expect every VMM to be aware of every erratum and
> workaround, and there may even be workarounds where we need to go out of
> our way to prevent guests from trying to use certain features in order
> to maintain correctness at S2.

We can add some overrides after this for errata, perhaps?

I am having trouble finding the errata docs. Would it be
possible for you to point me to them with a link?

> In general this should probably follow the same principle as KVM, where
> we only expose sanitised feature registers representing the
> functionality the host understands. Code written today is almost
> guaranteed to be running on hardware released in 2030, at least *somewhere*.

Yes.
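
To illustrate the "sanitised feature registers" idea, here is a minimal
userspace sketch, not driver code. It assumes an allow-list approach: only
bits the host driver understands survive, and a quirk can hide more. The
EX_* names, the ex_quirks structure and the bit positions are all
hypothetical placeholders and would need checking against the SMMUv3 spec
and the driver's IDR0_* definitions.

```c
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define EX_IDR0_S2P	(1u << 0)
#define EX_IDR0_S1P	(1u << 1)
#define EX_IDR0_BTM	(1u << 5)
#define EX_IDR0_ATS	(1u << 10)
#define EX_IDR0_PRI	(1u << 16)

/* Features the host understands and is willing to expose. */
#define EX_IDR0_ALLOWED	(EX_IDR0_S2P | EX_IDR0_S1P | EX_IDR0_ATS)

/* Hypothetical per-system overrides (firmware, errata workarounds). */
struct ex_quirks {
	int hide_ats;
};

/* Return the IDR0 value that would be reported to user space. */
uint32_t ex_sanitise_idr0(uint32_t raw, const struct ex_quirks *q)
{
	/* Allow-list: bits unknown to the host (or not allowed) are dropped. */
	uint32_t val = raw & EX_IDR0_ALLOWED;

	/* Quirks can only remove features, never add them. */
	if (q->hide_ats)
		val &= ~EX_IDR0_ATS;

	return val;
}
```

With this shape, BTM and PRI stay hidden simply because they are not in
the allow-list, and a VMM does not need to know why.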

Thanks
Nicolin

2023-03-10 01:19:35

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 13/14] iommu/arm-smmu-v3: Add CMDQ_OP_TLBI_NH_VAA and CMDQ_OP_TLBI_NH_ALL

On Thu, Mar 09, 2023 at 01:44:34PM +0000, Robin Murphy wrote:
> On 2023-03-09 10:53, Nicolin Chen wrote:
> > With a nested translation setup, a stage-1 Context Descriptor table can be
> > managed by a guest OS in the user space. So, the kernel driver should not
> > assume that the guest OS will use a user space device driver that doesn't
> > support TLBI_NH_VAA and TLBI_NH_ALL commands.
> >
> > Add them in the arm_smmu_cmdq_build_cmd(), to prepare for support of these
> > two TLBI invalidation requests from the guest level.
> >
> > Signed-off-by: Nicolin Chen <[email protected]>
> > ---
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 4 ++++
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 ++
> > 2 files changed, 6 insertions(+)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 1f318b5e0921..ac63185ae268 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -277,6 +277,9 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
> > /* Cover the entire SID range */
> > cmd[1] |= FIELD_PREP(CMDQ_CFGI_1_RANGE, 31);
> > break;
> > + case CMDQ_OP_TLBI_NH_VAA:
> > + ent->tlbi.asid = 0;
>
> This is backwards - NH_VA is a superset of NH_VAA (not to mention that
> quietly modifying the input argument is ugly; in fact it might be nice
> if ent was const here).

I see.

> Please follow the existing pattern, and decouple NH_VA from EL2_VA if
> necessary.

OK. I was trying to keep it neat, but it looks like decoupling
is the right way.
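
For reference, a minimal standalone sketch of what the decoupling might
look like: NH_VA adds the ASID and then shares the rest with NH_VAA, the
input entry is const, and EL2_VA gets its own case. The opcode values and
field positions are assumed to mirror the driver's CMDQ_* definitions, the
EX_* names are hypothetical, and the TG/TTL/range fields are left out.

```c
#include <stdint.h>

/* Assumed to match the driver's opcode values. */
enum ex_opcode {
	EX_OP_TLBI_NH_VA	= 0x12,
	EX_OP_TLBI_NH_VAA	= 0x13,
	EX_OP_TLBI_EL2_VA	= 0x22,
};

/* Assumed field layout of the two command words. */
#define EX_TLBI_0_VMID_SHIFT	32
#define EX_TLBI_0_ASID_SHIFT	48
#define EX_TLBI_1_LEAF		(1ULL << 0)
#define EX_TLBI_1_VA_MASK	(~0xfffULL)

struct ex_tlbi_ent {
	enum ex_opcode	opcode;
	uint16_t	asid;
	uint16_t	vmid;
	int		leaf;
	uint64_t	addr;
};

/* Build a TLBI command without touching the caller's entry. */
int ex_build_tlbi(uint64_t cmd[2], const struct ex_tlbi_ent *ent)
{
	cmd[0] = (uint64_t)ent->opcode;
	cmd[1] = 0;

	switch (ent->opcode) {
	case EX_OP_TLBI_NH_VA:
		/* NH_VA is NH_VAA plus an ASID. */
		cmd[0] |= (uint64_t)ent->asid << EX_TLBI_0_ASID_SHIFT;
		/* fall through */
	case EX_OP_TLBI_NH_VAA:
		cmd[0] |= (uint64_t)ent->vmid << EX_TLBI_0_VMID_SHIFT;
		cmd[1] |= ent->leaf ? EX_TLBI_1_LEAF : 0;
		cmd[1] |= ent->addr & EX_TLBI_1_VA_MASK;
		break;
	case EX_OP_TLBI_EL2_VA:
		/* EL2 has an ASID but no VMID. */
		cmd[0] |= (uint64_t)ent->asid << EX_TLBI_0_ASID_SHIFT;
		cmd[1] |= ent->leaf ? EX_TLBI_1_LEAF : 0;
		cmd[1] |= ent->addr & EX_TLBI_1_VA_MASK;
		break;
	default:
		return -1;
	}
	return 0;
}
```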

Thanks
Nic

2023-03-10 01:34:37

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 12/14] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED type of allocations

On Thu, Mar 09, 2023 at 02:28:09PM +0000, Robin Murphy wrote:
> On 2023-03-09 13:20, Robin Murphy wrote:
> > On 2023-03-09 10:53, Nicolin Chen wrote:
> > > Add domain allocation support for IOMMU_DOMAIN_NESTED type. This includes
> > > the "finalise" part to log in the user space Stream Table Entry info.
> > >
> > > Co-developed-by: Eric Auger <[email protected]>
> > > Signed-off-by: Eric Auger <[email protected]>
> > > Signed-off-by: Nicolin Chen <[email protected]>
> > > ---
> > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 38 +++++++++++++++++++--
> > > 1 file changed, 36 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > index 5ff74edfbd68..1f318b5e0921 100644
> > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > @@ -2214,6 +2214,19 @@ static int arm_smmu_domain_finalise(struct
> > > iommu_domain *domain,
> > > return 0;
> > > }
> > > + if (domain->type == IOMMU_DOMAIN_NESTED) {
> > > + if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1) ||
> > > + !(smmu->features & ARM_SMMU_FEAT_TRANS_S2)) {
> > > + dev_dbg(smmu->dev, "does not implement two stages\n");
> > > + return -EINVAL;
> > > + }
> > > + smmu_domain->stage = ARM_SMMU_DOMAIN_S1;
> > > + smmu_domain->s1_cfg.s1fmt = user_cfg->s1fmt;
> > > + smmu_domain->s1_cfg.s1cdmax = user_cfg->s1cdmax;
> > > + smmu_domain->s1_cfg.cdcfg.cdtab_dma = user_cfg->s1ctxptr;
> > > + return 0;
> >
> > How's that going to work? If the caller's asked for something we can't
> > provide, returning something else and hoping it fails later is not
> > sensible, we should just fail right here. It's even more worrying if
> > there's a chance it *won't* fail later, and a guest ends up with
> > "nested" translation giving it full access to host PA space :/
>
> Oops, apologies - in part thanks to the confusing indentation, I managed
> to miss the early return and misread this all being under the if
> condition for nesting not being supported. Sorry for the confusion :(

Perhaps this can help readability, considering that we have
multiple places checking the TRANS_S1 and TRANS_S2 features:

	bool feat_has_s1 = smmu->features & ARM_SMMU_FEAT_TRANS_S1;
	bool feat_has_s2 = smmu->features & ARM_SMMU_FEAT_TRANS_S2;

	if (domain->type == IOMMU_DOMAIN_NESTED) {
		if (!feat_has_s1 || !feat_has_s2) {
			dev_dbg(smmu->dev, "does not implement two stages\n");
			return -EINVAL;
		}
		...
		return 0;
	}

	if (user_cfg_s2 && !feat_has_s2)
		return -EINVAL;
	...
	if (!feat_has_s1)
		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
	if (!feat_has_s2)
		smmu_domain->stage = ARM_SMMU_DOMAIN_S1;

Would you like this?

Thanks
Nic

2023-03-10 01:54:58

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 06/14] iommu/arm-smmu-v3: Unset corresponding STE fields when s2_cfg is NULL

On Thu, Mar 09, 2023 at 06:24:29PM +0000, Shameerali Kolothum Thodi wrote:
> > -----Original Message-----
> > From: Robin Murphy [mailto:[email protected]]
> > Sent: 09 March 2023 13:13
> > To: Nicolin Chen <[email protected]>; [email protected]; [email protected]
> > Cc: [email protected]; [email protected]; [email protected];
> > [email protected]; Shameerali Kolothum Thodi
> > <[email protected]>; [email protected];
> > [email protected]; [email protected];
> > [email protected]
> > Subject: Re: [PATCH v1 06/14] iommu/arm-smmu-v3: Unset corresponding
> > STE fields when s2_cfg is NULL
> >
> > On 2023-03-09 10:53, Nicolin Chen wrote:
> > > From: Eric Auger <[email protected]>
> > >
> > > Although the spec does not seem to mention this, on some implementations,
> > > when the STE configuration switches from an S1+S2 cfg to an S1 only one,
> > > a C_BAD_STE error would happen if dst[3] (S2TTB) is not reset.
> >
> > Can you provide more details, since it's not clear whether this is a
> > hardware erratum workaround or a bodge around the driver itself doing
> > something wrong like not doing a proper break-before-make transition of
> > the STE. The architecture explicitly states that all the STE.S2* fields
> > except S2VMID and potentially S2S are ignored when Stage 2 is bypassed.
>
> Took a while to locate the email thread where this was discussed,
> https://patchwork.kernel.org/cover/11449895/#23244457
>
> This was observed on a HiSilicon implementation where, if the SMMUv3 is configured with
> both Stage 1 and Stage 2 (nested) mode once, then it is not possible to configure it back
> for Stage 1 mode for the same device(stream id).
>
> IIRC, the SMMUv3 implementation on these boards expects to set the S2TTB field in STE to zero
> when using S1, otherwise it reports C_BAD_STE error. :(
>
> You are right that the specification doesn't demand this and I am not sure there is any other
> hardware that requires this.
>
> Could we please have this with a comment added in the code?

Yes, I can add that, and put that link in the commit message too.
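
Roughly, the commented code could take the shape below. This is a
standalone sketch under assumptions, not the actual patch: the EX_* names
and the ex_s2_cfg structure are hypothetical, and S2TTB is assumed to sit
in bits 51:4 of STE word 3.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_STE_DWORDS		8
/* Assumed position of S2TTB in STE word 3: bits 51:4. */
#define EX_STE_3_S2TTB_MASK	(((1ULL << 48) - 1) << 4)

/* Hypothetical stand-in for the driver's stage-2 config. */
struct ex_s2_cfg {
	uint64_t vttbr;	/* stage-2 translation table base */
};

void ex_ste_write_s2ttb(uint64_t dst[EX_STE_DWORDS],
			const struct ex_s2_cfg *s2_cfg)
{
	if (s2_cfg) {
		dst[3] = s2_cfg->vttbr & EX_STE_3_S2TTB_MASK;
		return;
	}

	/*
	 * The architecture says the S2* fields are ignored when stage 2 is
	 * bypassed, but at least one HiSilicon implementation was observed
	 * to report C_BAD_STE if a stale S2TTB is left behind after an
	 * S1+S2 -> S1-only switch. Clear it explicitly.
	 */
	dst[3] = 0;
}
```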

Thanks
Nicolin

2023-03-10 03:52:03

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Thu, Mar 09, 2023 at 02:49:14PM +0000, Robin Murphy wrote:
> On 2023-03-09 10:53, Nicolin Chen wrote:
> > Add arm_smmu_cache_invalidate_user() function for user space to invalidate
> > TLB entries and Context Descriptors, since either an IO page table entry
> > or a Context Descriptor in the user space is still cached by the hardware.
> >
> > The input user_data is defined in "struct iommu_hwpt_invalidate_arm_smmuv3"
> > that contains the essential data for corresponding invalidation commands.
> >
> > Co-developed-by: Eric Auger <[email protected]>
> > Signed-off-by: Eric Auger <[email protected]>
> > Signed-off-by: Nicolin Chen <[email protected]>
> > ---
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 56 +++++++++++++++++++++
> > 1 file changed, 56 insertions(+)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index ac63185ae268..7d73eab5e7f4 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -2880,9 +2880,65 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
> > arm_smmu_sva_remove_dev_pasid(domain, dev, pasid);
> > }
> >
> > +static void arm_smmu_cache_invalidate_user(struct iommu_domain *domain,
> > + void *user_data)
> > +{
> > + struct iommu_hwpt_invalidate_arm_smmuv3 *inv_info = user_data;
> > + struct arm_smmu_cmdq_ent cmd = { .opcode = inv_info->opcode };
> > + struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> > + struct arm_smmu_device *smmu = smmu_domain->smmu;
> > + size_t granule_size = inv_info->granule_size;
> > + unsigned long iova = 0;
> > + size_t size = 0;
> > + int ssid = 0;
> > +
> > + if (!smmu || !smmu_domain->s2 || domain->type != IOMMU_DOMAIN_NESTED)
> > + return;
> > +
> > + switch (inv_info->opcode) {
> > + case CMDQ_OP_CFGI_CD:
> > + case CMDQ_OP_CFGI_CD_ALL:
> > + return arm_smmu_sync_cd(smmu_domain, inv_info->ssid, true);
>
> Since we let the guest choose its own S1Fmt (and S1CDMax, yet not
> S1DSS?), how can we assume leaf = true here?

The s1dss is forwarded in the user_data structure too, so the
driver should also have written it into the nested STE. I will
add this missing path.

And you are right that the guest OS can use a 2-level table, so
we should set leaf = false to cover all cases, I think.

> > + case CMDQ_OP_TLBI_NH_VA:
> > + cmd.tlbi.asid = inv_info->asid;
> > + fallthrough;
> > + case CMDQ_OP_TLBI_NH_VAA:
> > + if (!granule_size || !(granule_size & smmu->pgsize_bitmap) ||
>
> Non-range invalidations with TG=0 are perfectly legal, and should not be
> ignored.

I assume that you are talking about the pgsize_bitmap check.

QEMU embeds a !tg case into the granule_size [1]. So it might
not be straightforward to cover that case. Let me see how to
untangle different cases and handle them accordingly.

[1] https://patchew.org/QEMU/[email protected]/[email protected]/
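
For the range case, the granule validation in the quoted hunk amounts to
"exactly one bit set, and that bit is a supported page size". A minimal
userspace sketch of that check follows; the ex_* helpers are hypothetical
stand-ins (the kernel has its own is_power_of_2()). It deliberately does
not cover the non-range TG=0 case, which would need a separate path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's is_power_of_2(). */
bool ex_is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * A granule is acceptable for a range invalidation when it is a single
 * page size and that size is present in the supported-size bitmap.
 */
bool ex_granule_ok(uint64_t granule_size, uint64_t pgsize_bitmap)
{
	return ex_is_power_of_2(granule_size) &&
	       (granule_size & pgsize_bitmap) != 0;
}
```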

> > + granule_size & ~(1ULL << __ffs(granule_size)))
>
> If that's intended to mean is_power_of_2(), please just use is_power_of_2().
>
> > + return;
> > +
> > + iova = inv_info->range.start;
> > + size = inv_info->range.last - inv_info->range.start + 1;
>
> If the design here is that user_data is so deeply driver-specific and
> special to the point that it can't possibly be passed as a type-checked
> union of the known and publicly-visible UAPI types that it is, wouldn't
> it make sense to just encode the whole thing in the expected format and
> not have to make these kinds of niggling little conversions at both ends?

Hmm, that makes sense to me.

I just traced back the history of Eric's previous work. There
was a mismatch between guest and host in that RIL might not be
supported by the hardware. Now, the guest can get whatever
information it needs from the host to send supported commands.

> > + if (!size)
> > + return;
> > +
> > + cmd.tlbi.vmid = smmu_domain->s2->s2_cfg.vmid;
> > + cmd.tlbi.leaf = inv_info->flags & IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF;
> > + __arm_smmu_tlb_inv_range(&cmd, iova, size, granule_size, smmu_domain);
> > + break;
> > + case CMDQ_OP_TLBI_NH_ASID:
> > + cmd.tlbi.asid = inv_info->asid;
> > + fallthrough;
> > + case CMDQ_OP_TLBI_NH_ALL:
> > + cmd.tlbi.vmid = smmu_domain->s2->s2_cfg.vmid;
> > + arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
> > + break;
> > + case CMDQ_OP_ATC_INV:
> > + ssid = inv_info->ssid;
> > + iova = inv_info->range.start;
> > + size = inv_info->range.last - inv_info->range.start + 1;
> > + break;
>
> Can we do any better than multiplying every single ATC_INV command, even
> for random bogus StreamIDs, into multiple commands across every physical
> device? In fact, I'm not entirely confident this isn't problematic, if
> the guest wishes to send invalidations for one device specifically while
> it's put some other device into a state where sending it a command would
> do something bad. At the very least, it's liable to be confusing if the
> guest sends a command for one StreamID but gets an error back for a
> different one.

We'd need here an sid translation from the guest value to the
host value to specify a device, so as not to multiply the cmd
with the device list, if I understand it correctly?
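
Something like the lookup below is what I have in mind, as a rough
standalone sketch. The ex_sid_map table and its linear search are
hypothetical; a real implementation would presumably hang off the nested
domain and use a proper lookup structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-VM mapping from guest StreamID to host StreamID. */
struct ex_sid_map {
	uint32_t vsid;
	uint32_t sid;
};

/*
 * Translate a guest StreamID. Returns 0 and fills *sid on success, or -1
 * if the guest named a StreamID that was never assigned to it, in which
 * case no command should be issued to any physical device.
 */
int ex_vsid_to_sid(const struct ex_sid_map *map, size_t n,
		   uint32_t vsid, uint32_t *sid)
{
	for (size_t i = 0; i < n; i++) {
		if (map[i].vsid == vsid) {
			*sid = map[i].sid;
			return 0;
		}
	}
	return -1;
}
```

An ATC_INV for an unknown vSID then fails for that vSID only, instead of
being multiplied across every device in the domain.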

> And if we expect ATS, what about PRI? Per patch #4 you're currently
> offering that to the guest as well.

Oh, I should probably have blocked PRI. PRI and fault
injection will follow after the basic 2-stage translation
patches. And I don't have supporting hardware to test PRI.

>
> > + default:
> > + return;
>
> What about NSNH_ALL? That still needs to invalidate all the S1 context
> that the guest *thinks* it's invalidating.

NSNH_ALL is translated to NH_ALL at the guest level. But maybe
it should have been done here instead.
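
If the translation moves into the kernel, it could be as small as the
sketch below. The opcode values are assumed to match the driver's
definitions, and ex_host_opcode() is a hypothetical helper.

```c
#include <stdint.h>

/* Assumed opcode values. */
#define EX_OP_TLBI_NH_ALL	0x10
#define EX_OP_TLBI_NSNH_ALL	0x30

/*
 * The guest believes NSNH_ALL wipes all of its stage-1 context. On the
 * host this has to stay scoped to the guest, so it is narrowed to NH_ALL,
 * which the caller then issues with the guest's VMID.
 */
uint8_t ex_host_opcode(uint8_t guest_opcode)
{
	if (guest_opcode == EX_OP_TLBI_NSNH_ALL)
		return EX_OP_TLBI_NH_ALL;
	return guest_opcode;
}
```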

Thanks
Nic

2023-03-10 04:23:30

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Thu, Mar 09, 2023 at 11:31:04AM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 09, 2023 at 02:49:14PM +0000, Robin Murphy wrote:
>
> > If the design here is that user_data is so deeply driver-specific and
> > special to the point that it can't possibly be passed as a type-checked
> > union of the known and publicly-visible UAPI types that it is, wouldn't it
> > make sense to just encode the whole thing in the expected format and not
> > have to make these kinds of niggling little conversions at both ends?
>
> Yes, I suspect the design for ARM should have the input be the entire
> actual command work queue entry. There is no reason to burn CPU cycles
> in userspace marshalling it to something else and then decode it again
> in the kernel. Organize things to point the ioctl directly at the
> queue entry, and the kernel can do a single memcpy from guest
> controlled pages to kernel memory then parse it?

There still can be complications to do something straightforward
like that. Firstly, the consumer and producer indexes might need
to be synced between the guest and the host kernel? Secondly, things like
SID and VMID fields in the commands need to be replaced manually
when the host kernel reads commands out, which means that there
needs to be a translation table (or tables) in the host kernel to replace
those fields. These actually are parts of the features of VCMDQ
hardware itself.
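
To make the field replacement concrete, here is a minimal standalone
sketch of "copy the raw entry once, then fix it up". It assumes a 16-byte
entry with the opcode in bits 7:0 and the VMID in bits 47:32 of the first
word, and a little-endian host. The EX_* names and ex_fixup_guest_cmd()
are hypothetical, and only the stage-1 TLBI opcodes are handled.

```c
#include <stdint.h>
#include <string.h>

#define EX_CMDQ_ENT_BYTES	16
#define EX_CMDQ_0_OP_MASK	0xffULL
#define EX_TLBI_0_VMID_SHIFT	32
#define EX_TLBI_0_VMID_MASK	(0xffffULL << EX_TLBI_0_VMID_SHIFT)

/*
 * Copy one guest command out of guest-controlled memory exactly once, so
 * the guest cannot change it between validation and use, then overwrite
 * the VMID with the host's value. Returns 0 on success, -1 for opcodes
 * this sketch does not handle (CFGI and ATC_INV would also need a SID
 * translation).
 */
int ex_fixup_guest_cmd(uint64_t out[2], const void *guest_ent,
		       uint16_t host_vmid)
{
	memcpy(out, guest_ent, EX_CMDQ_ENT_BYTES);

	switch (out[0] & EX_CMDQ_0_OP_MASK) {
	case 0x10:	/* TLBI_NH_ALL */
	case 0x11:	/* TLBI_NH_ASID */
	case 0x12:	/* TLBI_NH_VA */
	case 0x13:	/* TLBI_NH_VAA */
		out[0] &= ~EX_TLBI_0_VMID_MASK;
		out[0] |= (uint64_t)host_vmid << EX_TLBI_0_VMID_SHIFT;
		return 0;
	default:
		return -1;
	}
}
```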

Though I am not sure how many CPU cycles this would save, it
at least can simplify the uAPI a bit and meanwhile address the
multiplying issue at the ATC_INV command that Robin raised, so
long as we ensure the consumer and producer indexes wouldn't be
messed up between host and guest?

Thanks
Nic

2023-03-10 04:51:17

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:

> Nicolin, I think we should tweak the uAPI here so that the
> invalidation opaque data has a format tagged on its own, instead of
> re-using the HWPT tag. Ie you can have a ARM SMMUv3 invalidate type
> tag and also a virtio-viommu invalidate type tag.

The invalidation tag is shared with the hwpt allocation. Does
it mean that virtio-iommu won't have its own allocation tag?
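
Just to check my understanding of a separately tagged invalidation format,
here is a rough standalone sketch. Everything in it is hypothetical: the
tag values, the ex_inv_req layout, and the 16-byte length check for the
SMMUv3 case (which assumes the data is one raw command queue entry).

```c
#include <stdint.h>

/* Hypothetical invalidation format tags, separate from the HWPT type. */
enum ex_inv_type {
	EX_INV_ARM_SMMUV3	= 1,
	EX_INV_VIRTIO_IOMMU	= 2,
};

struct ex_inv_req {
	uint32_t	type;		/* one of enum ex_inv_type */
	uint32_t	data_len;	/* length of the opaque data */
	const void	*data;		/* format-specific payload */
};

/* Placeholder handler: accepts exactly one 16-byte native command. */
static int ex_inv_smmuv3(const void *data, uint32_t len)
{
	(void)data;
	return len == 16 ? 0 : -1;
}

/* Placeholder handler: the virtio format is not defined in this sketch. */
static int ex_inv_virtio(const void *data, uint32_t len)
{
	(void)data;
	(void)len;
	return -2;
}

/* Dispatch on the invalidation tag alone. */
int ex_invalidate(const struct ex_inv_req *req)
{
	switch (req->type) {
	case EX_INV_ARM_SMMUV3:
		return ex_inv_smmuv3(req->data, req->data_len);
	case EX_INV_VIRTIO_IOMMU:
		return ex_inv_virtio(req->data, req->data_len);
	default:
		return -1;
	}
}
```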

Thanks
Nic

2023-03-10 05:04:42

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3


Hi Jean,

Allow me to reply to part of your email:

On Thu, Mar 09, 2023 at 01:42:17PM +0000, Jean-Philippe Brucker wrote:

> > +struct iommu_hwpt_arm_smmuv3 {
> > +#define IOMMU_SMMUV3_FLAG_S2 (1 << 0) /* if unset, stage-1 */
>
> I don't understand the purpose of this flag, since the structure only
> provides stage-1 configuration fields

I should have probably put more description for this flag. It
is used to allocate a stage-2 domain for a nested translation
setup. The default allocation for a kernel-managed domain will
allocate an S1 format of IO page table, at ARM_SMMU_DOMAIN_S1
stage. But a nested kernel-managed domain needs an S2 format,
at ARM_SMMU_DOMAIN_S2.

So the whole structure seems to only provide stage-1 info but
it's used for both stages. And a stage-2 allocation will only
need s2vmid if VMID flag is set (explaining below).
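
Put as a small standalone sketch, the intended flag semantics are roughly
as follows. The EX_* names and the trimmed ex_hwpt_cfg structure are
hypothetical stand-ins for the uAPI definitions in the patch, and
driver_vmid stands for whatever VMID the driver would allocate itself.

```c
#include <stdint.h>

#define EX_SMMUV3_FLAG_S2	(1u << 0)	/* if unset, stage-1 */
#define EX_SMMUV3_FLAG_VMID	(1u << 1)	/* vmid override */

enum ex_stage { EX_DOMAIN_S1, EX_DOMAIN_S2 };

/* Trimmed stand-in for the user config; stage-1 fields omitted. */
struct ex_hwpt_cfg {
	uint64_t flags;
	uint32_t s2vmid;
};

/* Kernel-managed domains default to S1 unless the S2 flag is set. */
enum ex_stage ex_pick_stage(const struct ex_hwpt_cfg *cfg)
{
	return (cfg->flags & EX_SMMUV3_FLAG_S2) ? EX_DOMAIN_S2 : EX_DOMAIN_S1;
}

/*
 * Returns the VMID for the domain, or -1 when none applies (stage 1).
 * s2vmid is only honoured for a stage-2 domain with the VMID flag set.
 */
int ex_pick_vmid(const struct ex_hwpt_cfg *cfg, int driver_vmid)
{
	if (ex_pick_stage(cfg) != EX_DOMAIN_S2)
		return -1;
	if (cfg->flags & EX_SMMUV3_FLAG_VMID)
		return (int)cfg->s2vmid;
	return driver_vmid;
}
```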

> > +#define IOMMU_SMMUV3_FLAG_VMID (1 << 1) /* vmid override */
>
> Doesn't this break isolation? The VMID space is global for the SMMU, so
> the guest could access cached mappings of another device

This flag isn't mature yet. I kept it from my internal RFC to
see if we can have a better solution. There are use cases on
certain platforms where the VMIDs across all devices in the
same VM need to be aligned.

Thanks
Nic

2023-03-10 05:24:10

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Thu, Mar 09, 2023 at 11:40:16AM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 09, 2023 at 03:26:12PM +0000, Shameerali Kolothum Thodi wrote:
>
> > On platforms that support BTM [1], we may need the VMID allocated by KVM.
> > But again getting that from user space doesn't look safe. I have attempted to revise
> > the earlier RFC to pin and use the KVM VMID from SMMUv3 here[2].
>
> Gurk
>
> > " In the new design we can require from the start that creating a nesting IOMMU
> > container through /dev/iommu *must* come with a KVM context, that way
> > we're sure to reuse the existing VMID. "
>
> I've been dreading this but yes I expect we will eventually need to
> connect kvm and iommufd together. The iommu driver can receive a kvm
> pointer as part of the alloc domain operation to do any setups like
> this.
>
> If there is no KVM it should either fail to setup the domain or setup
> a domain disconnected from KVM.
>
> If IOMMU HW and KVM HW are using the same ID number space then
> arguably the two kernel drivers need to use a shared ID allocator in
> the arch, regardless of what iommufd/etc does. Using KVM should not be
> mandatory for iommufd.
>
> For ARM cases where there is no shared VMID space with KVM, the ARM
> VMID should be somehow assigned to the iommufd_ctx itself and the alloc
> domain op should receive it from there.
>
> Nicolin, that seems to be missing in this series? I'm not entirely
> sure how to elegantly code it :\

Yea, it's missing. The VMID thing is supposed to be a sneak peek
of my next VCMDQ solution. Now it seems that BTM needs this too.

Remember that my previous VCMDQ series had a big complication to
share VMID across the passthrough devices in the same VM? During
that patch review, we concluded that IOMMUFD would simply align
VMIDs using a unified ctx ID or so, IIRC.
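
The rough shape of "one VMID per ctx" would be something like the
standalone sketch below. The ex_ctx and ex_vmid_pool structures are
hypothetical, the pool is a bare counter with no freeing or exhaustion
handling, and starting it at 1 is only an assumption that VMID 0 is kept
out of the pool.

```c
#include <stdint.h>

/* Hypothetical VMID source; a real one would be a bitmap or IDA. */
struct ex_vmid_pool {
	uint16_t next;
};

/* Hypothetical per-VM context shared by all of its stage-2 domains. */
struct ex_ctx {
	uint16_t	vmid;
	unsigned int	refs;
};

/* The first domain in a ctx allocates the VMID, later ones reuse it. */
uint16_t ex_ctx_get_vmid(struct ex_ctx *ctx, struct ex_vmid_pool *pool)
{
	if (ctx->refs == 0)
		ctx->vmid = pool->next++;
	ctx->refs++;
	return ctx->vmid;
}

/* Drop one reference; the VMID is simply forgotten at zero here. */
void ex_ctx_put_vmid(struct ex_ctx *ctx)
{
	if (ctx->refs)
		ctx->refs--;
}
```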

Thanks
Nic

2023-03-10 05:27:23

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Thu, Mar 09, 2023 at 04:07:54PM +0000, Shameerali Kolothum Thodi wrote:
> > -----Original Message-----
> > From: Jason Gunthorpe [mailto:[email protected]]
> > Sent: 09 March 2023 16:00
> > To: Shameerali Kolothum Thodi <[email protected]>
> > Cc: Jean-Philippe Brucker <[email protected]>; Nicolin Chen
> > <[email protected]>; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected]
> > Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures
> > for ARM SMMUv3
> >
> > On Thu, Mar 09, 2023 at 03:51:42PM +0000, Shameerali Kolothum Thodi
> > wrote:
> >
> > > > For ARM cases where there is no shared VMID space with KVM, the ARM
> > > > VMID should be somehow assigned to the iommufd_ctx itself and the alloc
> > > > domain op should receive it from there.
> > >
> > > Is there any use of VMID outside SMMUv3? I was thinking if nested domain
> > alloc
> > > doesn't provide the KVM instance, then SMMUv3 can use its internal VMID.
> >
> > When we talk about exposing an SMMUv3 IOMMU CMDQ directly to
> > userspace then
> > VMID is the security token that protects it.
> >
> > So in that environment every domain under the same iommufd should
> > share the same VMID so that the CMDQ's also share the same VMID.
> >
> > I expect this to be a common sort of requirement as we will see
> > userspace command queues in the other HW as well.
> >
> > So, I suppose the answer for now is that ARM SMMUv3 should just
> > allocate one VMID per iommu_domain and there should be no VMID in the
> > uapi at all.
> >
> > Moving all iommu_domains to share the same VMID is a future patch.
> >
> > Though.. I have no idea how vVMID is handled in the SMMUv3
> > architecture. I suppose the guest IOMMU HW caps are set in a way that
> > it knows it does not have VMID?
>
> I think, Guest only sets up the SMMUv3 S1 stage and it doesn't use VMID.

Yea, a vmid is only allocated in an S2 domain allocation. So,
a guest allocating only S1 domains always sets VMID=0. Yet, I
think that the hypervisor or somewhere in the host kernel should
replace the VMID=0 with a unified VMID.

Thanks
Nic

2023-03-10 05:36:46

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Thu, Mar 09, 2023 at 09:26:57PM -0800, Nicolin Chen wrote:
> On Thu, Mar 09, 2023 at 04:07:54PM +0000, Shameerali Kolothum Thodi wrote:
> > > -----Original Message-----
> > > From: Jason Gunthorpe [mailto:[email protected]]
> > > Sent: 09 March 2023 16:00
> > > To: Shameerali Kolothum Thodi <[email protected]>
> > > Cc: Jean-Philippe Brucker <[email protected]>; Nicolin Chen
> > > <[email protected]>; [email protected]; [email protected];
> > > [email protected]; [email protected]; [email protected];
> > > [email protected]; [email protected];
> > > [email protected]; [email protected]; [email protected]
> > > Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures
> > > for ARM SMMUv3
> > >
> > > On Thu, Mar 09, 2023 at 03:51:42PM +0000, Shameerali Kolothum Thodi
> > > wrote:
> > >
> > > > > For ARM cases where there is no shared VMID space with KVM, the ARM
> > > > > VMID should be somehow assigned to the iommufd_ctx itself and the alloc
> > > > > domain op should receive it from there.
> > > >
> > > > Is there any use of VMID outside SMMUv3? I was thinking if nested domain
> > > alloc
> > > > doesn't provide the KVM instance, then SMMUv3 can use its internal VMID.
> > >
> > > When we talk about exposing an SMMUv3 IOMMU CMDQ directly to
> > > userspace then
> > > VMID is the security token that protects it.
> > >
> > > So in that environment every domain under the same iommufd should
> > > share the same VMID so that the CMDQ's also share the same VMID.
> > >
> > > I expect this to be a common sort of requirement as we will see
> > > userspace command queues in the other HW as well.
> > >
> > > So, I suppose the answer for now is that ARM SMMUv3 should just
> > > allocate one VMID per iommu_domain and there should be no VMID in the
> > > uapi at all.
> > >
> > > Moving all iommu_domains to share the same VMID is a future patch.
> > >
> > > Though.. I have no idea how vVMID is handled in the SMMUv3
> > > architecture. I suppose the guest IOMMU HW caps are set in a way that
> > > it knows it does not have VMID?
> >
> > I think, Guest only sets up the SMMUv3 S1 stage and it doesn't use VMID.
>
> Yea, a vmid is only allocated in an S2 domain allocation. So,
> a guest allocating only S1 domains always sets VMID=0. Yet, I
> think that the hypervisor or somewhere in the host kernel should
> replace the VMID=0 with a unified VMID.

Ah, I just recalled a conversation with Jason that a VM should only
have one S2 domain. In that case, the VMID is already unified?

Thanks
Nic

2023-03-10 11:34:12

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

Hi,
On 3/9/23 14:42, Jean-Philippe Brucker wrote:
> Hi Nicolin,
>
> On Thu, Mar 09, 2023 at 02:53:38AM -0800, Nicolin Chen wrote:
>> Add the following data structures for corresponding ioctls:
>> iommu_hwpt_arm_smmuv3 => IOMMUFD_CMD_HWPT_ALLOC
>> iommu_hwpt_invalidate_arm_smmuv3 => IOMMUFD_CMD_HWPT_INVALIDATE
>>
>> Also, add IOMMU_HW_INFO_TYPE_ARM_SMMUV3 and IOMMU_PGTBL_TYPE_ARM_SMMUV3_S1
>> to the header and corresponding type/size arrays.
>>
>> Signed-off-by: Nicolin Chen <[email protected]>
>> +/**
>> + * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 specific page table data
>> + *
>> + * @flags: page table entry attributes
>> + * @s2vmid: Virtual machine identifier
>> + * @s1ctxptr: Stage-1 context descriptor pointer
>> + * @s1cdmax: Number of CDs pointed to by s1ContextPtr
>> + * @s1fmt: Stage-1 Format
>> + * @s1dss: Default substream
>> + */
>> +struct iommu_hwpt_arm_smmuv3 {
>> +#define IOMMU_SMMUV3_FLAG_S2 (1 << 0) /* if unset, stage-1 */
> I don't understand the purpose of this flag, since the structure only
> provides stage-1 configuration fields
>
>> +#define IOMMU_SMMUV3_FLAG_VMID (1 << 1) /* vmid override */
> Doesn't this break isolation? The VMID space is global for the SMMU, so
> the guest could access cached mappings of another device
>
>> + __u64 flags;
>> + __u32 s2vmid;
>> + __u32 __reserved;
>> + __u64 s1ctxptr;
>> + __u64 s1cdmax;
>> + __u64 s1fmt;
>> + __u64 s1dss;
>> +};
>> +
>
>> +/**
>> + * struct iommu_hwpt_invalidate_arm_smmuv3 - ARM SMMUv3 cache invalidation info
>> + * @flags: boolean attributes of cache invalidation command
>> + * @opcode: opcode of cache invalidation command
>> + * @ssid: SubStream ID
>> + * @granule_size: page/block size of the mapping in bytes
>> + * @range: IOVA range to invalidate
>> + */
>> +struct iommu_hwpt_invalidate_arm_smmuv3 {
>> +#define IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF (1 << 0)
>> + __u64 flags;
>> + __u8 opcode;
>> + __u8 padding[3];
>> + __u32 asid;
>> + __u32 ssid;
>> + __u32 granule_size;
>> + struct iommu_iova_range range;
>> +};
> Although we can keep the alloc and hardware info separate for each IOMMU
> architecture, we should try to come up with common invalidation methods.
>
> It matters because things like vSVA, or just efficient dynamic mappings,
> will require optimal invalidation latency. A paravirtual interface like
> vhost-iommu can help with that, as the host kernel will handle guest
> invalidations directly instead of doing a round-trip to host userspace
> (and we'll likely want the same path for PRI.)
>
> Supporting HW page tables for a common PV IOMMU does require some
> architecture-specific knowledge, but invalidation messages contain roughly
> the same information on all architectures. The PV IOMMU won't include
> command opcodes for each possible architecture if one generic command does
> the same job.
>
> Ideally I'd like something like this for vhost-iommu:
>
> * slow path through userspace for complex requests like attach-table and
> probe, where the VMM can decode arch-specific information and translate
> them to iommufd and vhost-iommu ioctls to update the configuration.
>
> * fast path within the kernel for performance-critical requests like
> invalidate, page request and response. It would be absurd for the
> vhost-iommu driver to translate generic invalidation requests from the
> guest into arch-specific commands with special opcodes, when the next
> step is calling the IOMMU driver which does that for free.
>
> During previous discussions we came up with generic invalidations that
> could fit both Arm and x86 [1][2]. The only difference was the ASID
> (called archid/id in those proposals) which VT-d didn't need. Could we try
> to build on that?

I do agree with Jean. We spent a lot of effort all together to define
this generic invalidation API and unless there is a compelling reason that
prevents us from using it, we should try to reuse it.

Thanks

Eric
>
> [1] https://elixir.bootlin.com/linux/v5.17/source/include/uapi/linux/iommu.h#L161
> [2] https://lists.oasis-open.org/archives/virtio-dev/202102/msg00014.html
>
> Thanks,
> Jean
>


2023-03-10 12:16:10

by Jean-Philippe Brucker

Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Thu, Mar 09, 2023 at 05:01:15PM -0400, Jason Gunthorpe wrote:
> > Concretely though, what are the incompatibilities between the HW designs?
> > They all need to remove a range of TLB entries, using some address space
> > tag. But if there is an actual difference I do need to know.
>
> For instance the address space tags and the cache entries they match
> to are wildly different.
>
> ARM uses a fine grained ASID and Intel uses a shared ASID called a DID
> and incorporates the PASID into the cache tag.
>
> AMD uses something called a DID that covers a different set of stuff
> than the Intel DID, and it doesn't seem to work for nesting. AMD uses
> PASID as the primary nested cache tag.

Thanks, we'll look into that


> This is because SMMUv3 has no option to keep the PASID table in the
> hypervisor so you are sadly forced to expose both the native ASID and
> native PASID caches to the virtio protocol.

It is possible to keep the PASID table in the host, but you need a way to
allocate GPA since the SMMU accesses it after stage-2 translation. I think
that necessarily requires a PV interface, but you could look into it.
Anyway, even with that, ATC invalidations take a PASID.

>
> Given that the VM virtio driver has to have SMMUv3 specific code to
> handle the CD table it must get, I don't see the problem with also
> having SMMUv3 specific code in the hypervisor virtio driver to handle
> invalidating based on the CD table.

There isn't much we can't do, I'm just hoping to build something
straightforward instead of having to work around awkward interfaces


> > A couple of reasons are relevant here: non-QEMU VMMs don't want to emulate
> > all vendor IOMMUs, new architectures get vIOMMU mostly for free,
>
> So your argument is you can implement a simple map/unmap API riding
> on the common IOMMU API and this is portable?
>
> Seems sensible, but that falls apart pretty quickly when we talk about
> nesting.. I don't think we can avoid VMM components to set this up, so
> it stops being portable. At that point I'm back to asking why not use
> the real HW model?

A single VMM component that shovels data from the virtqueue to the kernel
API and back, rather than four different hardware emulations, four
different queues, four different device tables. It's obviously better for
VMMs that don't do full-system emulation like QEMU, especially as they
generally already implement a virtio transport. Smaller attack surface,
fewer bugs.

The VMM developer gets a multi-platform vIOMMU without having to study all
the different architecture manuals. There is a small amount of HW specific
data in there, but it only relates to table formats.

Ideally it wouldn't need any HW knowledge, but that would require the
APIs to be aligned: instead of ID registers we pass plain features, and
invalidations don't require HW specific opcodes. Otherwise there is going
to be a layer of glue everywhere, which is what I'm trying to avoid here.

>
> > > All the iommu drivers have native command
> > > queues. ARM and AMD are both supporting native command queues directly
> > > in the guest, complete with a direct guest MMIO doorbell ring.
> >
> > Arm SMMUv3 mandates a single global command queue (SMMUv2 uses
> > registers). An SMMUv3 can optionally implement multiple command
> > queues, though I don't know if they can be safely assigned to
> > guests.
>
> It is not standardized by ARM, but it can (and has) been done.
>
> > For a lot of SMMUv3 implementations that have a single queue and for
> > other architectures, we can do better than hardware emulation.
>
> How is using a SW emulated virtio formatted queue better than using a
> SW emulated SMMUv3 ECMDQ?

We don't need to repeat it for all IOMMU architectures, nor emulate a new
queue in the kernel. The first motivator for virtio-iommu was to avoid
emulating hardware in the kernel. The SMMU maintainer saw how painful that
was to do for the GIC, saw that there is a virtualization queue readily
available in vhost and, well, it just made sense. Still does.


> > As above, decoding arch-specific structures into generic ones is what an
> > emulated IOMMU does,
>
> No, it is what virtio wants to do. We are deliberately trying not to
> do that for real accelerated HW vIOMMU emulators.

Yes there is a line somewhere, and I'd prefer it to be the page table.
Given how many possible hardware combinations exist and how many more will
show up, it would be good to abstract things where possible.

>
> > and it doesn't make a performance difference in which
> > format it forwards that to the kernel. The host IOMMU driver checks the
> > guest request and copies them into the command queue. Whether that request
> > comes in the form of a structure binary-compatible with Arm SMMUvX.Y, or
> > some generic structure, does not make a difference.
>
> It is not the structure layouts that matter!
>
> It is the semantic meaning of each request, on each unique piece of
> hardware. We actually want to leak the subtle semantic differences to
> userspace.

These are hardware emulations, of course they have to know about hardware
semantics. The QEMU IOMMUs can work in TCG mode where they decode and
handle everything themselves.

Thanks,
Jean

2023-03-10 12:51:51

by Jason Gunthorpe

Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Fri, Mar 10, 2023 at 12:33:12PM +0100, Eric Auger wrote:

> I do agree with Jean. We spent a lot of effort all together to define
> this generic invalidation API and unless there is a compelling reason that
> prevents us from using it, we should try to reuse it.

That's the compelling reason in a nutshell right there.

A lot of time was invested to create something that might be
general. We still don't know if it is well defined and general. Even
more time is going to be required on it before it could go forward. In
future more time will be needed for every future HW to try and fit
into it. We don't even know if it will scale to future HW. Nobody has
even checked what today's POWER and S390 HW need.

vs, this stuff was made in a few days. We know it is correct as a uAPI
since it mirrors the HW and we know it is scalable to different HW
schemes if they come up.

So I don't see a good reason to take a risk on a "general" uAPI. If we
make this wrong it could seriously damage the main goal of iommufd -
to build accelerated vIOMMU models.

Especially since the motivating reason in this thread - use it for
virtio-iommu - doesn't even want to use it as a uAPI!

If we get a vhost-virtio then we can decide what to do in-kernel and
maybe this general API returns as an in-kernel API, I don't know, we
need to see what this thing ends up looking like.

Jason

2023-03-10 12:55:28

by Jean-Philippe Brucker

Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Thu, Mar 09, 2023 at 08:50:52PM -0800, Nicolin Chen wrote:
> On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
>
> > Nicolin, I think we should tweak the uAPI here so that the
> > invalidation opaque data has a format tagged on its own, instead of
> > re-using the HWPT tag. Ie you can have a ARM SMMUv3 invalidate type
> > tag and also a virtio-viommu invalidate type tag.
>
> The invalidation tag is shared with the hwpt allocation. Does
> it mean that virtio-iommu won't have its own allocation tag?

I'm not entirely sure what you mean by allocation tag. For example with
SMMU, when attaching page tables (SMMUv2), the guest passes an ASID at
allocation, and when it modifies that address space it passes the same
ASID for invalidation. When attaching PASID tables (SMMUv3), it writes the
ASID/PASID in the PASID table, and passes both in the invalidation.

Note that none of this is set in stone. It copies the Linux API we
originally discussed, but we were waiting for progress on that front
before committing to anything. Now we'll probably align to the new API
where possible, leaving out what doesn't work for virtio-iommu.

Thanks,
Jean


2023-03-10 12:55:58

by Jason Gunthorpe

Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Thu, Mar 09, 2023 at 09:36:18PM -0800, Nicolin Chen wrote:
> > Yea, a vmid is only allocated in an S2 domain allocation. So,
> > a guest allocating only S1 domains always sets VMID=0. Yet, I
> > think that the hypervisor or some where in host kernel should
> > replace the VMID=0 with a unified VMID.
>
> Ah, I just recall a conversation with Jason that a VM should only
> have one S2 domain. In that case, the VMID is already unified?

Not required per se, but yes, most likely qemu would run that way.

But you can't just re-use the VMID however you like. AFAIK the VMID is
the cache tag for the S2 IOPTEs, so every VMID must refer to the same
S2 translation.

You can't mix different S2's with the same VMID.

Thus you are stuck with the single S2 model in qemu if you want to use
a userspace CMDQ.

I suppose that suggests that if KVM supplies the VMID then it is
assigned to a singular S2 iommu_domain also.
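
To make that constraint concrete, here is a minimal sketch of a table that
refuses to bind one VMID to two different S2 page tables. It is not code from
this series: struct vmid_table, vmid_bind() and MAX_VMID are hypothetical
names, and the S2 domain is represented by an opaque pointer.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_VMID 16  /* hypothetical, deliberately tiny VMID space */

/* Hypothetical table recording which S2 page table each VMID tags. */
struct vmid_table {
    const void *s2_owner[MAX_VMID];  /* NULL means the VMID is free */
};

/*
 * Bind a VMID to an S2 domain. Binding the same S2 again is allowed
 * (several devices of one VM sharing one S2). Binding a different S2 is
 * refused, since cached S2 entries tagged with that VMID would become
 * ambiguous. Returns 0 on success, -1 on conflict or bad arguments.
 */
static int vmid_bind(struct vmid_table *t, uint16_t vmid, const void *s2)
{
    if (vmid >= MAX_VMID || s2 == NULL)
        return -1;
    if (t->s2_owner[vmid] != NULL && t->s2_owner[vmid] != s2)
        return -1;
    t->s2_owner[vmid] = s2;
    return 0;
}
```

With such a check in place, a second S2 for the same VM needs its own VMID,
which is what rules it out when a single userspace CMDQ is tied to one VMID.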

Jason

2023-03-10 14:01:29

by Jason Gunthorpe

Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Fri, Mar 10, 2023 at 12:54:53PM +0000, Jean-Philippe Brucker wrote:
> On Thu, Mar 09, 2023 at 08:50:52PM -0800, Nicolin Chen wrote:
> > On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
> >
> > > Nicolin, I think we should tweak the uAPI here so that the
> > > invalidation opaque data has a format tagged on its own, instead of
> > > re-using the HWPT tag. Ie you can have a ARM SMMUv3 invalidate type
> > > tag and also a virtio-viommu invalidate type tag.
> >
> > The invalidation tag is shared with the hwpt allocation. Does
> > it mean that virtio-iommu won't have its own allocation tag?
>
> I'm not entirely sure what you mean by allocation tag.

He means the tag identifying the allocation driver specific data is
the same tag that is passed in to identify the invalidation driver
specific data.

With the notion that the allocation data and invalidation data would
be in the same driver's format.

> Note that none of this is set in stone. It copies the Linux API we
> originally discussed, but we were waiting for progress on that front
> before committing to anything. Now we'll probably align to the new API
> where possible, leaving out what doesn't work for virtio-iommu.

IMHO virtio-iommu should stand alone and make sense with its own
internal object model.

eg I would probably try not to have guests invalidate PASID. Have a
strong ASID model and in most cases have the hypervisor track where
the ASID's are mapped to PASID/etc and rely on the hypervisor to spew
the invalidations to PASID as required. It is more abstracted from the
actual HW for the guest. The guest can simply say it changed an IOPTE
under a certain ASID.
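
A rough sketch of that bookkeeping, assuming the hypervisor keeps a small
table of where each guest ASID is installed. Everything here (struct
asid_map, asid_map_add(), asid_invalidate(), the callback type) is
hypothetical and only illustrates the fan-out; it is not a proposed
virtio-iommu or iommufd interface.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BINDINGS 8  /* hypothetical fixed-size table for the example */

/* One place where a guest ASID is installed: a device (SID) and a PASID. */
struct binding {
    uint32_t asid;
    uint32_t sid;
    uint32_t pasid;
    int in_use;
};

struct asid_map {
    struct binding b[MAX_BINDINGS];
};

/* Callback that would issue the HW-specific invalidation for one binding. */
typedef void (*inv_fn)(uint32_t sid, uint32_t pasid, void *ctx);

/* Record that 'asid' is installed at (sid, pasid). Returns -1 if full. */
static int asid_map_add(struct asid_map *m, uint32_t asid,
                        uint32_t sid, uint32_t pasid)
{
    for (size_t i = 0; i < MAX_BINDINGS; i++) {
        if (!m->b[i].in_use) {
            m->b[i] = (struct binding){ asid, sid, pasid, 1 };
            return 0;
        }
    }
    return -1;
}

/*
 * The guest only says "I changed an IOPTE under this ASID". The hypervisor
 * walks its table and invalidates every (sid, pasid) the ASID is mapped to.
 * 'fn' may be NULL, in which case the matches are only counted.
 * Returns the number of bindings that matched.
 */
static size_t asid_invalidate(const struct asid_map *m, uint32_t asid,
                              inv_fn fn, void *ctx)
{
    size_t hits = 0;

    for (size_t i = 0; i < MAX_BINDINGS; i++) {
        if (m->b[i].in_use && m->b[i].asid == asid) {
            if (fn)
                fn(m->b[i].sid, m->b[i].pasid, ctx);
            hits++;
        }
    }
    return hits;
}
```

The guest-visible request carries only the ASID (plus a range, omitted here);
the PASID never appears in the protocol.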

The ugly wrinkle is SMMUv3 but perhaps your idea of allowing the
hypervisor to manage the CD table in guest memory is reasonable.

IMHO it is a missing SMMUv3 HW feature that the CD table doesn't have
the option to be in hypervisor memory. AMD allows both options - so
I'm not sure I would invest a huge amount to make special cases to
support this... Assume a SMMUv3 update might gain the option someday.

Jason

2023-03-10 15:00:23

by Robin Murphy

Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On 2023-03-09 21:01, Jason Gunthorpe wrote:
>> For a lot of SMMUv3 implementations that have a single queue and for
>> other architectures, we can do better than hardware emulation.
>
> How is using a SW emulated virtio formatted queue better than using a
> SW emulated SMMUv3 ECMDQ?

Since it's not been said, the really big thing is that virtio explicitly
informs the host whenever the guest maps something. Emulating SMMUv3
means the host has to chase all the pagetable pointers in guest memory
and trap writes such that it has visibility of invalid->valid
transitions and can update the physical shadow pagetable correspondingly.

FWIW we spent quite some time on and off discussing something like
VT-d's "caching mode", but never found a convincing argument that it was
a gap which needed filling, since we already had hardware nesting for
maximum performance and a paravirtualisation option for efficient
emulation. Thus full SMMUv3 emulation seems to just sit at the bottom as
the maximum-compatibility option for pushing an unmodified legacy
bare-metal software stack into a VM where nesting isn't available.

Cheers,
Robin.

2023-03-10 15:38:09

by Jason Gunthorpe

Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Fri, Mar 10, 2023 at 02:52:42PM +0000, Robin Murphy wrote:
> On 2023-03-09 21:01, Jason Gunthorpe wrote:
> > > For a lot of SMMUv3 implementations that have a single queue and for
> > > other architectures, we can do better than hardware emulation.
> >
> > How is using a SW emulated virtio formatted queue better than using a
> > SW emulated SMMUv3 ECMDQ?
>
> Since it's not been said, the really big thing is that virtio explicitly
> informs the host whenever the guest maps something. Emulating SMMUv3 means
> the host has to chase all the pagetable pointers in guest memory and trap
> writes such that it has visibility of invalid->valid transitions and can
> update the physical shadow pagetable correspondingly.

Sorry, I mean in the context of future virtio-iommu that is providing
nested translation.

eg why would anyone want to use virtio to provide SMMUv3 based HW
accelerated nesting?

Jean suggested that the invalidation flow for virtio-iommu could be
faster because it is in kernel, but I'm saying that we could also make
the SMMUv3 invalidation in-kernel with the same basic technique. (and
actively wondering if we should put more focus on that)

I understand the appeal of the virtio scheme with its current
map/unmap interface.

I could also see some appeal of a simple virtio-iommu SVA that could
point map a CPU page table as an option. The guest already has to know
how to manage these anyhow so it is nicely general.

If iommufd could provide a general cross-driver API to set exactly
that scenario up then VMM code could also be general. That seems
pretty interesting.

But if the plan is to expose more detailed stuff like the CD or GCR3
PASID tables as something the guest has to manipulate and then a bunch
of special invalidation to support that, and VMM code to back it, then
I'm questioning the whole point. We lost the generality.

Just use the normal HW accelerated SMMUv3 nesting model instead.

If virtio-iommu SVA is really important for ARM then I'd suggest
SMMUv3 should gain a new HW capability to allow the CD table to be
in hypervisor memory so it works consistently for virtio-iommu SVA.

Jason

2023-03-10 15:42:34

by Robin Murphy

Subject: Re: [PATCH v1 04/14] iommu/arm-smmu-v3: Add arm_smmu_hw_info

On 2023-03-10 01:17, Nicolin Chen wrote:
> Hi Robin,
>
> Thanks for the inputs.
>
> On Thu, Mar 09, 2023 at 01:03:41PM +0000, Robin Murphy wrote:
>> On 2023-03-09 10:53, Nicolin Chen wrote:
>>> This is used to forward the host IDR values to the user space, so the
>>> hypervisor and the guest VM can learn about the underlying hardware's
>>> capabilities.
>>>
>>> Also, set the driver_type to IOMMU_HW_INFO_TYPE_ARM_SMMUV3 to pass the
>>> corresponding type sanity in the core.
>>>
>>> Signed-off-by: Nicolin Chen <[email protected]>
>>> ---
>>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 25 +++++++++++++++++++++
>>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 ++
>>> include/uapi/linux/iommufd.h | 14 ++++++++++++
>>> 3 files changed, 41 insertions(+)
>>>
>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> index f2425b0f0cd6..c1aac695ae0d 100644
>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> @@ -2005,6 +2005,29 @@ static bool arm_smmu_capable(struct device *dev, enum iommu_cap cap)
>>> }
>>> }
>>>
>>> +static void *arm_smmu_hw_info(struct device *dev, u32 *length)
>>> +{
>>> + struct arm_smmu_master *master = dev_iommu_priv_get(dev);
>>> + struct iommu_hw_info_smmuv3 *info;
>>> + void *base_idr;
>>> + int i;
>>> +
>>> + if (!master || !master->smmu)
>>> + return ERR_PTR(-ENODEV);
>>> +
>>> + info = kzalloc(sizeof(*info), GFP_KERNEL);
>>> + if (!info)
>>> + return ERR_PTR(-ENOMEM);
>>> +
>>> + base_idr = master->smmu->base + ARM_SMMU_IDR0;
>>> + for (i = 0; i <= 5; i++)
>>> + info->idr[i] = readl_relaxed(base_idr + 0x4 * i);
>>
>> You need to take firmware overrides etc. into account here. In
>> particular, features like BTM may need to be hidden to work around
>> errata either in the system integration or the SMMU itself. It isn't
>> reasonable to expect every VMM to be aware of every erratum and
>> workaround, and there may even be workarounds where we need to go out of
>> our way to prevent guests from trying to use certain features in order
>> to maintain correctness at S2.
>
> We can add a bit of overrides after this for errata, perhaps?
>
> I have some trouble with finding the errata docs. Would it be
> possible for you to direct me to it with a link maybe?

The key Arm term is "Software Developer Errata Notice", or just SDEN.
Here's the ones for MMU-600 and MMU-700:

https://developer.arm.com/documentation/SDEN-946810/latest/
https://developer.arm.com/documentation/SDEN-1786925/latest/

Note that until now it has been extremely fortunate that in pretty much
every case Linux either hasn't supported the affected feature at all, or
has happened to avoid meeting the conditions. Once we do introduce
nesting support that all goes out the window (and I'll have to think
more when reviewing new errata in future...)

I've been putting off revisiting all the existing errata to figure out
what we'd need to do until new nesting patches appeared, so I'll try to
get to that soon now. I think in many cases it's likely to be best to
just disallow nesting entirely on affected implementations.
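
For the less drastic cases, hiding individual features from what gets
reported to userspace could look roughly like the sketch below. The bit
positions and the names FEAT_*, sanitise_idr() are placeholders for
illustration only, not the real IDR0 layout and not code from the driver.

```c
#include <stdint.h>

/* Placeholder feature bits, NOT the architectural SMMUv3 IDR0 layout. */
#define FEAT_STAGE1  (1u << 0)
#define FEAT_STAGE2  (1u << 1)
#define FEAT_BROKEN  (1u << 2)   /* a feature affected by some erratum */
#define FEAT_FUTURE  (1u << 31)  /* something today's host does not know */

/*
 * Report only what the host driver understands (host_known) and drop
 * anything that firmware overrides or errata workarounds require hiding
 * (errata_hidden), instead of forwarding the raw register value.
 */
static uint32_t sanitise_idr(uint32_t raw, uint32_t host_known,
                             uint32_t errata_hidden)
{
    return raw & host_known & ~errata_hidden;
}
```

The VMM then never sees a bit the host kernel did not deliberately expose,
so it does not need to know about each erratum itself.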

Thanks,
Robin.

>> In general this should probably follow the same principle as KVM, where
>> we only expose sanitised feature registers representing the
>> functionality the host understands. Code written today is almost
>> guaranteed to be running on hardware released in 2030, at least *somewhere*.
>
> Yes.
>
> Thanks
> Nicolin

2023-03-10 16:02:39

by Robin Murphy

Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On 2023-03-10 15:25, Jason Gunthorpe wrote:
> On Fri, Mar 10, 2023 at 02:52:42PM +0000, Robin Murphy wrote:
>> On 2023-03-09 21:01, Jason Gunthorpe wrote:
>>>> For a lot of SMMUv3 implementations that have a single queue and for
>>>> other architectures, we can do better than hardware emulation.
>>>
>>> How is using a SW emulated virtio formatted queue better than using a
>>> SW emulated SMMUv3 ECMDQ?
>>
>> Since it's not been said, the really big thing is that virtio explicitly
>> informs the host whenever the guest maps something. Emulating SMMUv3 means
>> the host has to chase all the pagetable pointers in guest memory and trap
>> writes such that it has visibility of invalid->valid transitions and can
>> update the physical shadow pagetable correspondingly.
>
> Sorry, I mean in the context of future virtio-iommu that is providing
> nested translation.

Ah, that's probably me missing the context again.

> eg why would anyone want to use virtio to provide SMMUv3 based HW
> accelerated nesting?
>
> Jean suggested that the invalidation flow for virtio-iommu could be
> faster because it is in kernel, but I'm saying that we could also make
> the SMMUv3 invalidation in-kernel with the same basic technique. (and
> actively wondering if we should put more focus on that)
>
> I understand the appeal of the virtio scheme with its current
> map/unmap interface.
>
> I could also see some appeal of a simple virtio-iommu SVA that could
> point map a CPU page table as an option. The guest already has to know
> how to manage these anyhow so it is nicely general.
>
> If iommufd could provide a general cross-driver API to set exactly
> that scenario up then VMM code could also be general. That seems
> pretty interesting.

Indeed, I've always assumed the niche for virtio would be that kind of
in-between use-case using nesting to accelerate simple translation,
where we plug a guest-owned pagetable into a host-owned context. That
way the guest retains the simple virtio interface and only needs to
understand a pagetable format (or as you say, simply share a CPU
pagetable) without having to care about the nitty-gritty of all the
IOMMU-specific moving parts around it. For guests that want to get into
more advanced stuff like managing their own PASID tables, pushing them
towards "native" nesting probably does make more sense.

> But if the plan is to expose more detailed stuff like the CD or GCR3
> PASID tables as something the guest has to manipulate and then a bunch
> of special invalidation to support that, and VMM code to back it, then
> I'm questioning the whole point. We lost the generality.
>
> Just use the normal HW accelerated SMMUv3 nesting model instead.
>
> If virtio-iommu SVA is really important for ARM then I'd suggest
> SMMUv3 should gain a new HW capability to allow the CD table to be
> in hypervisor memory so it works consistently for virtio-iommu SVA.

Oh, maybe I should have read this far before reasoning the exact same
thing from scratch... oh well, this time I'm not going to go back and
edit :)

Thanks,
Robin.

2023-03-10 16:06:21

by Jason Gunthorpe

Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Fri, Mar 10, 2023 at 03:57:27PM +0000, Robin Murphy wrote:

> about the nitty-gritty of all the IOMMU-specific moving parts around it. For
> guests that want to get into more advanced stuff like managing their own
> PASID tables, pushing them towards "native" nesting probably does make more
> sense.

IMHO with the simplified virtio model I would say the guest should
not have its own PASID table.

hyper trap to install a PASID and let the hypervisor driver handle
this abstractly. If abstraction is the whole point and benefit then
virtio should lean into that.

This also means virtio protocol doesn't do PASID invalidation. It
invalidates an ASID and the hypervisor takes care of whatever it is
connected to. Very simple and general for the VM.

Adding a S1 iommu_domain op for invalidate address range is perfectly
fine and the virtio kernel hypervisor driver can call it generically.

The primary reason to have guest-owned PASID tables is CC stuff, which
definitely won't be part of virtio-iommu.

Jason

2023-03-10 16:07:48

by Jason Gunthorpe

Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Thu, Mar 09, 2023 at 08:50:52PM -0800, Nicolin Chen wrote:
> On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
>
> > Nicolin, I think we should tweak the uAPI here so that the
> > invalidation opaque data has a format tagged on its own, instead of
> > re-using the HWPT tag. Ie you can have a ARM SMMUv3 invalidate type
> > tag and also a virtio-viommu invalidate type tag.
>
> The invalidation tag is shared with the hwpt allocation. Does
> it mean that virtio-iommu won't have its own allocation tag?

We probably shouldn't assume it will

Jason

2023-03-10 16:24:28

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Thu, Mar 09, 2023 at 08:20:03PM -0800, Nicolin Chen wrote:
> On Thu, Mar 09, 2023 at 11:31:04AM -0400, Jason Gunthorpe wrote:
> > On Thu, Mar 09, 2023 at 02:49:14PM +0000, Robin Murphy wrote:
> >
> > > If the design here is that user_data is so deeply driver-specific and
> > > special to the point that it can't possibly be passed as a type-checked
> > > union of the known and publicly-visible UAPI types that it is, wouldn't it
> > > make sense to just encode the whole thing in the expected format and not
> > > have to make these kinds of niggling little conversions at both ends?
> >
> > Yes, I suspect the design for ARM should have the input be the entire
> > actual command work queue entry. There is no reason to burn CPU cycles
> > in userspace marshalling it to something else and then decode it again
> > in the kernel. Organize things to point the ioctl directly at the
> > queue entry, and the kernel can do a single memcpy from guest
> > controlled pages to kernel memory then parse it?
>
> There still can be complications to do something straightforward
> like that.

> Firstly, the consumer and producer indexes might need
> to be synced between the host and kernel?

No, qemu would handle this. The kernel would just read the command
entries it is told by qemu to read which qemu has already sorted out.
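
Roughly, the VMM side could gather the pending entries like the sketch
below and hand the kernel a linear batch. This is a simplification under
stated assumptions: cons and prod are treated as free-running counters on a
power-of-two ring (the real SMMUv3 registers use an index plus a wrap bit),
and struct cmd and cmdq_collect() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define CMD_DWORDS 2  /* 16-byte command entries: two 64-bit words */

struct cmd {
    uint64_t w[CMD_DWORDS];
};

/*
 * Copy the entries in [cons, prod) out of the guest ring into 'batch',
 * stopping after 'max' entries. ring_size must be a power of two.
 * Returns the number of entries copied; the caller advances its own
 * consumer index by that amount after the kernel has processed them.
 */
static size_t cmdq_collect(const struct cmd *ring, uint32_t ring_size,
                           uint32_t cons, uint32_t prod,
                           struct cmd *batch, size_t max)
{
    size_t n = 0;

    while (cons != prod && n < max) {
        batch[n++] = ring[cons & (ring_size - 1)];
        cons++;
    }
    return n;
}
```

The kernel then only ever sees an array of entries and a count, with no
knowledge of the guest's queue indexes.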

> Secondly, things like SID and VMID fields in the commands need to
> be replaced manually when the host kernel reads commands out, which
> means that there need to be a translation table(s) in the host
> kernel to replace those fields. These actually are parts of the
> features of VCMDQ hardware itself.

VMID should be ignored in a guest request.

SID translation is a good point. Can qemu do this? How does SID
translation work with VCMDQ in HW? (Jean this is exactly the sort of
tiny detail that the generic interface ignored)

What I'm broadly thinking is if we have to make the infrastructure for
VCMDQ HW accelerated invalidation then it is not a big step to also
have the kernel SW path use the same infrastructure just with a CPU
wake up instead of a MMIO poke.

Ie we have a SW version of VCMDQ to speed up SMMUv3 cases without HW
support.

I suspect the answer to Robin's question on how to handle errors is
the most important deciding factor. If we have to capture and relay
actual HW errors back to userspace that really suggests we should do
something different than a synchronous ioctl.

Jason

2023-03-10 16:43:50

by Eric Auger

Subject: Re: [PATCH v1 05/14] iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED

Hi Nicolin,

On 3/9/23 11:53, Nicolin Chen wrote:
> IOMMUFD designs two iommu_domain pointers to represent two stages. The S1
s/designs/uses?
> iommu_domain (IOMMU_DOMAIN_NESTED type) represents the Context Descriptor
> table in the user space. The S2 iommu_domain (IOMMU_DOMAIN_UNMANAGED type)
> represents the translation table in the kernel, owned by a hypervisor.
>
> So there comes to no use case of the ARM_SMMU_DOMAIN_NESTED. Drop it, and
> use the type IOMMU_DOMAIN_NESTED instead.
last sentence may be rephrased as this patch does not use
IOMMU_DOMAIN_NESTED anywhere:
Generic IOMMU_DOMAIN_NESTED type will be used in nested SMMU
implementation instead.
>
> Also drop the unused arm_smmu_enable_nesting(). A following patch will
> configure the correct smmu_domain->stage.
>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 ------------------
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 -
If you go this way you may also remove it from arm/arm-smmu/arm-smmu.c.
Then if I am not wrong no other driver does implement enable_nesting cb.
Shouldn't we also remove it and fellow iommu_enable_nesting()?

Thanks

Eric
> 2 files changed, 19 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index c1aac695ae0d..c5616145e2a3 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1279,7 +1279,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> s1_cfg = &smmu_domain->s1_cfg;
> break;
> case ARM_SMMU_DOMAIN_S2:
> - case ARM_SMMU_DOMAIN_NESTED:
> s2_cfg = &smmu_domain->s2_cfg;
> break;
> default:
> @@ -2220,7 +2219,6 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
> fmt = ARM_64_LPAE_S1;
> finalise_stage_fn = arm_smmu_domain_finalise_s1;
> break;
> - case ARM_SMMU_DOMAIN_NESTED:
> case ARM_SMMU_DOMAIN_S2:
> ias = smmu->ias;
> oas = smmu->oas;
> @@ -2747,21 +2745,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
> return group;
> }
>
> -static int arm_smmu_enable_nesting(struct iommu_domain *domain)
> -{
> - struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> - int ret = 0;
> -
> - mutex_lock(&smmu_domain->init_mutex);
> - if (smmu_domain->smmu)
> - ret = -EPERM;
> - else
> - smmu_domain->stage = ARM_SMMU_DOMAIN_NESTED;
> - mutex_unlock(&smmu_domain->init_mutex);
> -
> - return ret;
> -}
> -
> static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
> {
> return iommu_fwspec_add_ids(dev, args->args, 1);
> @@ -2890,7 +2873,6 @@ static struct iommu_ops arm_smmu_ops = {
> .flush_iotlb_all = arm_smmu_flush_iotlb_all,
> .iotlb_sync = arm_smmu_iotlb_sync,
> .iova_to_phys = arm_smmu_iova_to_phys,
> - .enable_nesting = arm_smmu_enable_nesting,
> .free = arm_smmu_domain_free,
> }
> };
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index ba2b4562f4b2..233bfc377267 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -704,7 +704,6 @@ struct arm_smmu_master {
> enum arm_smmu_domain_stage {
> ARM_SMMU_DOMAIN_S1 = 0,
> ARM_SMMU_DOMAIN_S2,
> - ARM_SMMU_DOMAIN_NESTED,
> ARM_SMMU_DOMAIN_BYPASS,
> };
>


2023-03-10 16:49:26

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v1 03/14] iommufd/device: Setup MSI on kernel-managed domains

Hi Nicolin,

On 3/9/23 11:53, Nicolin Chen wrote:
> The IOMMU_RESV_SW_MSI is a kernel-managed domain thing. So, it should be
> only setup on a kernel-managed domain only. If the attaching domain is a
> user-managed domain, redirect the hwpt to hwpt->parent to do it correctly.
>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/iommu/iommufd/device.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index f95b558f5e95..a3e7d2889164 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -350,7 +350,8 @@ static int iommufd_group_setup_msi(struct iommufd_group *igroup,
> * call iommu_get_msi_cookie() on its behalf. This is necessary to setup
> * the MSI window so iommu_dma_prepare_msi() can install pages into our
> * domain after request_irq(). If it is not done interrupts will not
> - * work on this domain.
> + * work on this domain. And the msi_cookie should be always set into the
s/And the/The/
> + * kernel-managed (parent) domain.
> *
> * FIXME: This is conceptually broken for iommufd since we want to allow
> * userspace to change the domains, eg switch from an identity IOAS to a
> @@ -358,6 +359,8 @@ static int iommufd_group_setup_msi(struct iommufd_group *igroup,
> * matches what the IRQ layer actually expects in a newly created
> * domain.
> */
> + if (hwpt->parent)
> + hwpt = hwpt->parent;
I guess there is a guarantee that the parent hwpt is necessarily a
kernel-managed domain?
Is it that part of the spec that enforces it?
The IOMMU_HWPT_ALLOC doc says:
" * A user-managed HWPT will be created from a given parent HWPT via @pt_id, in
 * which the parent HWPT must be allocated previously via the same ioctl from a
 * given IOAS.
"
Maybe spell that out in the commit msg?

Thanks

Eric
> if (sw_msi_start != PHYS_ADDR_MAX && !hwpt->msi_cookie) {
> rc = iommu_get_msi_cookie(hwpt->domain, sw_msi_start);
> if (rc)


2023-03-10 17:09:28

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 05/14] iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED

On Fri, Mar 10, 2023 at 05:39:22PM +0100, Eric Auger wrote:

> > Also drop the unused arm_smmu_enable_nesting(). One following patche will
> > configure the correct smmu_domain->stage.
> >
> > Signed-off-by: Nicolin Chen <[email protected]>
> > ---
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 ------------------
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 -
> If you go this way you may also remove it from arm/arm-smmu/arm-smmu.c.
> Then if I am not wrong no other driver does implement enable_nesting cb.
> Shouldn't we also remove it and fellow iommu_enable_nesting()?

Yes, let's just put this patch in the series please:

https://lore.kernel.org/kvm/[email protected]/

Jason

2023-03-10 17:54:15

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On 2023-03-10 03:51, Nicolin Chen wrote:
> On Thu, Mar 09, 2023 at 02:49:14PM +0000, Robin Murphy wrote:
>> On 2023-03-09 10:53, Nicolin Chen wrote:
>>> Add arm_smmu_cache_invalidate_user() function for user space to invalidate
>>> TLB entries and Context Descriptors, since either an IO page table entrie
>>> or a Context Descriptor in the user space is still cached by the hardware.
>>>
>>> The input user_data is defined in "struct iommu_hwpt_invalidate_arm_smmuv3"
>>> that contains the essential data for corresponding invalidation commands.
>>>
>>> Co-developed-by: Eric Auger <[email protected]>
>>> Signed-off-by: Eric Auger <[email protected]>
>>> Signed-off-by: Nicolin Chen <[email protected]>
>>> ---
>>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 56 +++++++++++++++++++++
>>> 1 file changed, 56 insertions(+)
>>>
>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> index ac63185ae268..7d73eab5e7f4 100644
>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> @@ -2880,9 +2880,65 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
>>> arm_smmu_sva_remove_dev_pasid(domain, dev, pasid);
>>> }
>>>
>>> +static void arm_smmu_cache_invalidate_user(struct iommu_domain *domain,
>>> + void *user_data)
>>> +{
>>> + struct iommu_hwpt_invalidate_arm_smmuv3 *inv_info = user_data;
>>> + struct arm_smmu_cmdq_ent cmd = { .opcode = inv_info->opcode };
>>> + struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
>>> + struct arm_smmu_device *smmu = smmu_domain->smmu;
>>> + size_t granule_size = inv_info->granule_size;
>>> + unsigned long iova = 0;
>>> + size_t size = 0;
>>> + int ssid = 0;
>>> +
>>> + if (!smmu || !smmu_domain->s2 || domain->type != IOMMU_DOMAIN_NESTED)
>>> + return;
>>> +
>>> + switch (inv_info->opcode) {
>>> + case CMDQ_OP_CFGI_CD:
>>> + case CMDQ_OP_CFGI_CD_ALL:
>>> + return arm_smmu_sync_cd(smmu_domain, inv_info->ssid, true);
>>
>> Since we let the guest choose its own S1Fmt (and S1CDMax, yet not
>> S1DSS?), how can we assume leaf = true here?
>
> The s1dss is forwarded in the user_data structure too. So, the
> driver should have set that too down to a nested STE. Will add
> this missing pathway.
>
> And you are right that the guest OS can use a 2-level table, so
> we should set leaf = false to cover all cases, I think.
>
>>> + case CMDQ_OP_TLBI_NH_VA:
>>> + cmd.tlbi.asid = inv_info->asid;
>>> + fallthrough;
>>> + case CMDQ_OP_TLBI_NH_VAA:
>>> + if (!granule_size || !(granule_size & smmu->pgsize_bitmap) ||
>>
>> Non-range invalidations with TG=0 are perfectly legal, and should not be
>> ignored.
>
> I assume that you are talking about the pgsize_bitmap check.
>
> QEMU embeds a !tg case into the granule_size [1]. So it might
> not be straightforward to cover that case. Let me see how to
> untangle different cases and handle them accordingly.

Oh, double-checking patch #2, that might be me misunderstanding the
interface. I hadn't realised that the UAPI was apparently modelled on
arm_smmu_tlb_inv_range_asid() rather than actual SMMU commands :)

I really think UAPI should reflect the hardware and encode TG and TTL
directly. Especially since there's technically a flaw in the current
driver where we assume TTL in cases where it isn't actually known, thus
may potentially fail to invalidate level 2 block entries when removing a
level 1 table, since io-pgtable passes the level 3 granule in that case.
When range invalidation came along, the distinction between "all leaves
are definitely at the last level" and "use last-level granularity to
make sure everything at any level is hit" started to matter, but the
interface never caught up. It hasn't seemed desperately urgent to fix
(who does 1GB+ unmaps outside of VFIO teardown anyway?), but we must
definitely not bake the same mistake into user ABI.

Of course, there might then be cases where we need to transform
non-range commands into range commands for the sake of workarounds, but
that's our own problem to deal with.

> [1] https://patchew.org/QEMU/[email protected]/[email protected]/
>
>>> + granule_size & ~(1ULL << __ffs(granule_size)))
>>
>> If that's intended to mean is_power_of_2(), please just use is_power_of_2().
>>
>>> + return;
>>> +
>>> + iova = inv_info->range.start;
>>> + size = inv_info->range.last - inv_info->range.start + 1;
>>
>> If the design here is that user_data is so deeply driver-specific and
>> special to the point that it can't possibly be passed as a type-checked
>> union of the known and publicly-visible UAPI types that it is, wouldn't
>> it make sense to just encode the whole thing in the expected format and
>> not have to make these kinds of niggling little conversions at both ends?
>
> Hmm, that makes sense to me.
>
> I just tracked back the history of Eric's previous work. There
> was a mismatch between guest and host that RIL isn't supported
> by the hardware. Now, guest can have whatever information it'd
> need from the host to send supported instructions.
>
>>> + if (!size)
>>> + return;
>>> +
>>> + cmd.tlbi.vmid = smmu_domain->s2->s2_cfg.vmid;
>>> + cmd.tlbi.leaf = inv_info->flags & IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF;
>>> + __arm_smmu_tlb_inv_range(&cmd, iova, size, granule_size, smmu_domain);
>>> + break;
>>> + case CMDQ_OP_TLBI_NH_ASID:
>>> + cmd.tlbi.asid = inv_info->asid;
>>> + fallthrough;
>>> + case CMDQ_OP_TLBI_NH_ALL:
>>> + cmd.tlbi.vmid = smmu_domain->s2->s2_cfg.vmid;
>>> + arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
>>> + break;
>>> + case CMDQ_OP_ATC_INV:
>>> + ssid = inv_info->ssid;
>>> + iova = inv_info->range.start;
>>> + size = inv_info->range.last - inv_info->range.start + 1;
>>> + break;
>>
>> Can we do any better than multiplying every single ATC_INV command, even
>> for random bogus StreamIDs, into multiple commands across every physical
>> device? In fact, I'm not entirely confident this isn't problematic, if
>> the guest wishes to send invalidations for one device specifically while
>> it's put some other device into a state where sending it a command would
>> do something bad. At the very least, it's liable to be confusing if the
>> guest sends a command for one StreamID but gets an error back for a
>> different one.
>
> We'd need here an sid translation from the guest value to the
> host value to specify a device, so as not to multiply the cmd
> with the device list, if I understand it correctly?

I guess it depends on whether IOMMUFD is aware of the vSID->device
relationships that the VMM is using. If so, then it should be OK for the
VMM to pass through the vSID directly, and we can translate and
sanity-check it internally. Otherwise, the interface might have to
require the VMM to translate vSID->RID and pass the corresponding host
RID, which we can then map back to a SID (userspace cannot do the full
vSID->SID by itself, and even if it could that would probably be more
awkward to validate).

>> And if we expect ATS, what about PRI? Per patch #4 you're currently
>> offering that to the guest as well.
>
> Oh, I should have probably blocked PRI. The PRI and the fault
> injection will be followed after the basic 2-stage translation
> patches. And I don't have a supporting hardware to test PRI.
>
>>
>>> + default:
>>> + return;
>>
>> What about NSNH_ALL? That still needs to invalidate all the S1 context
>> that the guest *thinks* it's invalidating.
>
> NSNH_ALL is translated to NH_ALL at the guest level. But maybe
> it should have been done here instead.

Yes. It seems the worst of both worlds to have an interface which takes
raw opcodes rather than an enum of supported commands, but still
requires userspace to know which opcodes are supported and which ones
don't work as expected even though they are entirely reasonable to use
in the context of the stage-1-only SMMU being emulated.

Thanks,
Robin.
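
For reference, the is_power_of_2() point above can be checked outside the
kernel. The following is only a user-space sketch: is_pow2() is a local
stand-in for the kernel helper, and __builtin_ctzll() (GCC/Clang) is assumed
as a substitute for __ffs(). It shows that the open-coded expression in the
patch is simply "not a power of two" for any non-zero input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's is_power_of_2(): exactly one bit set. */
bool is_pow2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * The open-coded test from the patch: clear the lowest set bit and see
 * whether anything is left. __builtin_ctzll() stands in for the kernel's
 * __ffs(); both are undefined for 0, hence the assert.
 */
bool open_coded_not_pow2(uint64_t n)
{
	assert(n != 0);
	return (n & ~(1ULL << __builtin_ctzll(n))) != 0;
}
```

For every non-zero n, open_coded_not_pow2(n) equals !is_pow2(n), so the
named helper expresses the same check with the intent visible.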

2023-03-10 18:50:11

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Fri, Mar 10, 2023 at 05:53:46PM +0000, Robin Murphy wrote:

> I guess it depends on whether IOMMUFD is aware of the vSID->device
> relationships that the VMM is using. If so, then it should be OK for the VMM
> to pass through the vSID directly, and we can translate and sanity-check it
> internally. Otherwise, the interface might have to require the VMM to
> translate vSID->RID and pass the corresponding host RID, which we can then
> map back to a SID (userspace cannot do the full vSID->SID by itself, and
> even if it could that would probably be more awkward to validate).

The thing we have in iommufd is the "idevid", i.e. the handle for
the 'struct device', which is also the handle for the physical SID in
the iommu.

The trouble is that there is not such an easy way for the iommu driver
to translate an idevid at this point since it would have to call out
from a built-in kernel driver to the iommufd module :( :( We have to
eventually solve that but I was hoping it wouldn't have to be on the
fast path...

So, having a vSID xarray in the driver that holds the struct device *
is possibly a good thing. Especially if the vCMDQ scheme needs the
same information.

Jason
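
A rough illustration of the vSID lookup idea, as a user-space sketch only.
Everything here is hypothetical: a small fixed array stands in for the
xarray, and a physical SID number stands in for the struct device pointer.
The point is just that an unknown vSID is rejected instead of being fanned
out to every device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VSID_TABLE_SIZE 16	/* arbitrary size for the sketch */

struct vsid_entry {
	uint32_t vsid;	/* StreamID as seen by the guest */
	uint32_t psid;	/* physical StreamID on the host */
	int used;
};

struct vsid_table {
	struct vsid_entry e[VSID_TABLE_SIZE];
};

/* Register a vSID -> pSID pair; returns 0, or -1 if duplicate or full. */
int vsid_table_insert(struct vsid_table *t, uint32_t vsid, uint32_t psid)
{
	size_t i, free_slot = VSID_TABLE_SIZE;

	assert(t);
	for (i = 0; i < VSID_TABLE_SIZE; i++) {
		if (t->e[i].used && t->e[i].vsid == vsid)
			return -1;
		if (!t->e[i].used && free_slot == VSID_TABLE_SIZE)
			free_slot = i;
	}
	if (free_slot == VSID_TABLE_SIZE)
		return -1;

	t->e[free_slot].vsid = vsid;
	t->e[free_slot].psid = psid;
	t->e[free_slot].used = 1;
	return 0;
}

/* Translate a guest vSID; returns 0 and fills *psid, or -1 if unknown. */
int vsid_table_lookup(const struct vsid_table *t, uint32_t vsid, uint32_t *psid)
{
	size_t i;

	assert(t && psid);
	for (i = 0; i < VSID_TABLE_SIZE; i++) {
		if (t->e[i].used && t->e[i].vsid == vsid) {
			*psid = t->e[i].psid;
			return 0;
		}
	}
	return -1;
}
```

With something of this shape, an ATC_INV for a vSID the VMM never
registered would fail the lookup and could be dropped or reported, rather
than being multiplied across the device list.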

2023-03-10 20:39:37

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v1 08/14] iommu/arm-smmu-v3: Prepare for nested domain support

On 2023-03-09 10:53, Nicolin Chen wrote:
> In a nested translation setup, the device is attached to a stage-1 domain
> that represents the guest-level Context Descriptor table. A Stream Table
> Entry for a 2-stage translation needs both the stage-1 Context Descriptor
> table info and the stage-2 Translation table information, i.e. a pair of
> s1_cfg and s2_cfg.
>
> Add an "s2" pointer in struct arm_smmu_domain, so a nested stage-1 domain
> can simply navigate its stage-2 domain for the s2_cfg pointer. Also, add
> a to_s2_cfg() helper for this purpose, and use it at proper places.
>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 25 +++++++++++++++++++--
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
> 2 files changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 21d819979865..fee5977feef3 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -100,6 +100,24 @@ static void parse_driver_options(struct arm_smmu_device *smmu)
> } while (arm_smmu_options[++i].opt);
> }
>
> +static struct arm_smmu_s2_cfg *to_s2_cfg(struct arm_smmu_domain *smmu_domain)
> +{
> + if (!smmu_domain)
> + return NULL;
> +
> + switch (smmu_domain->stage) {
> + case ARM_SMMU_DOMAIN_S1:
> + if (smmu_domain->s2)
> + return &smmu_domain->s2->s2_cfg;
> + return NULL;
> + case ARM_SMMU_DOMAIN_S2:
> + return &smmu_domain->s2_cfg;
> + case ARM_SMMU_DOMAIN_BYPASS:
> + default:
> + return NULL;
> + }
> +}
> +
> /* Low-level queue manipulation functions */
> static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
> {
> @@ -1277,6 +1295,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> switch (smmu_domain->stage) {
> case ARM_SMMU_DOMAIN_S1:
> s1_cfg = &smmu_domain->s1_cfg;
> + s2_cfg = to_s2_cfg(smmu_domain);

TBH I'd say you only need a 2-line change here. All the other cases
below are when the stage is guaranteed to be ARM_SMMU_DOMAIN_S2 (once
ARM_SMMU_DOMAIN_NESTED is gone), so pretending it might be otherwise
seems unnecessarily confusing.

Thanks,
Robin.

> break;
> case ARM_SMMU_DOMAIN_S2:
> s2_cfg = &smmu_domain->s2_cfg;
> @@ -1846,6 +1865,7 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid,
> static void arm_smmu_tlb_inv_context(void *cookie)
> {
> struct arm_smmu_domain *smmu_domain = cookie;
> + struct arm_smmu_s2_cfg *s2_cfg = to_s2_cfg(smmu_domain);
> struct arm_smmu_device *smmu = smmu_domain->smmu;
> struct arm_smmu_cmdq_ent cmd;
>
> @@ -1860,7 +1880,7 @@ static void arm_smmu_tlb_inv_context(void *cookie)
> arm_smmu_tlb_inv_asid(smmu, smmu_domain->s1_cfg.cd.asid);
> } else {
> cmd.opcode = CMDQ_OP_TLBI_S12_VMALL;
> - cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid;
> + cmd.tlbi.vmid = s2_cfg->vmid;
> arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
> }
> arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
> @@ -1931,6 +1951,7 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
> size_t granule, bool leaf,
> struct arm_smmu_domain *smmu_domain)
> {
> + struct arm_smmu_s2_cfg *s2_cfg = to_s2_cfg(smmu_domain);
> struct arm_smmu_cmdq_ent cmd = {
> .tlbi = {
> .leaf = leaf,
> @@ -1943,7 +1964,7 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
> cmd.tlbi.asid = smmu_domain->s1_cfg.cd.asid;
> } else {
> cmd.opcode = CMDQ_OP_TLBI_S2_IPA;
> - cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid;
> + cmd.tlbi.vmid = s2_cfg->vmid;
> }
> __arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 1a93eeb993ea..6cf516852721 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -709,6 +709,7 @@ enum arm_smmu_domain_stage {
> };
>
> struct arm_smmu_domain {
> + struct arm_smmu_domain *s2;
> struct arm_smmu_device *smmu;
> struct mutex init_mutex; /* Protects smmu pointer */
>

2023-03-11 00:17:58

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 03/14] iommufd/device: Setup MSI on kernel-managed domains

On Fri, Mar 10, 2023 at 05:45:20PM +0100, Eric Auger wrote:
> Hi Nicolin,
>
> On 3/9/23 11:53, Nicolin Chen wrote:
> > The IOMMU_RESV_SW_MSI is a kernel-managed domain thing. So, it should be
> > only setup on a kernel-managed domain only. If the attaching domain is a
> > user-managed domain, redirect the hwpt to hwpt->parent to do it correctly.
> >
> > Signed-off-by: Nicolin Chen <[email protected]>
> > ---
> > drivers/iommu/iommufd/device.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> > index f95b558f5e95..a3e7d2889164 100644
> > --- a/drivers/iommu/iommufd/device.c
> > +++ b/drivers/iommu/iommufd/device.c
> > @@ -350,7 +350,8 @@ static int iommufd_group_setup_msi(struct iommufd_group *igroup,
> > * call iommu_get_msi_cookie() on its behalf. This is necessary to setup
> > * the MSI window so iommu_dma_prepare_msi() can install pages into our
> > * domain after request_irq(). If it is not done interrupts will not
> > - * work on this domain.
> > + * work on this domain. And the msi_cookie should be always set into the
> s/And the/The/

OK.

> > + * kernel-managed (parent) domain.
> > *
> > * FIXME: This is conceptually broken for iommufd since we want to allow
> > * userspace to change the domains, eg switch from an identity IOAS to a
> > @@ -358,6 +359,8 @@ static int iommufd_group_setup_msi(struct iommufd_group *igroup,
> > * matches what the IRQ layer actually expects in a newly created
> > * domain.
> > */
> > + if (hwpt->parent)
> > + hwpt = hwpt->parent;
> I guess there is a garantee the parent hwpt is necessarily a
> kernel-managed domain?

Yes. It must be.

> Is it that part of the spec that enforces it?

The hwpt_alloc() function has a sanity check to enforce that.

> IOMMU_HWPT_ALLOC doc says:
> " * A user-managed HWPT will be created from a given parent HWPT via
> @pt_id, in
> * which the parent HWPT must be allocated previously via the same ioctl
> from a
> * given IOAS.
> "
> Maybe precise that in the commit msg?

There is a paragraph just above that, for kernel-managed HWPT:

 * A normal HWPT will be created with the mappings from the given IOAS.
 * The @data_type for its allocation can be set to IOMMU_HWPT_TYPE_DEFAULT, or
 * another type (being listed below) to specialize a kernel-managed HWPT.

Perhaps we could replace "normal HWPT" with "kernel-managed
HWPT", to make it clearer.

Thanks
Nic

2023-03-11 00:26:39

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 05/14] iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED

On Fri, Mar 10, 2023 at 05:39:22PM +0100, Eric Auger wrote:
> Hi Nicolin,
>
> On 3/9/23 11:53, Nicolin Chen wrote:
> > IOMMUFD designs two iommu_domain pointers to represent two stages. The S1
> s/designs/uses?
> > iommu_domain (IOMMU_DOMAIN_NESTED type) represents the Context Descriptor
> > table in the user space. The S2 iommu_domain (IOMMU_DOMAIN_UNMANAGED type)
> > represents the translation table in the kernel, owned by a hypervisor.
> >
> > So there comes to no use case of the ARM_SMMU_DOMAIN_NESTED. Drop it, and
> > use the type IOMMU_DOMAIN_NESTED instead.
> last sentence may be rephrased as this patch does not use
> IOMMU_DOMAIN_NESTED anywhere:
> Generic IOMMU_DOMAIN_NESTED type will be used in nested SMMU
> implementation instead.
> >
> > Also drop the unused arm_smmu_enable_nesting(). One following patche will
> > configure the correct smmu_domain->stage.
> >
> > Signed-off-by: Nicolin Chen <[email protected]>
> > ---
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 ------------------
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 -
> If you go this way you may also remove it from arm/arm-smmu/arm-smmu.c.
> Then if I am not wrong no other driver does implement enable_nesting cb.
> Shouldn't we also remove it and fellow iommu_enable_nesting()?

We had a small discussion before this community version, where
Robin mentioned that we can remove that too after the nesting
series gets merged. Yet, I didn't want to touch the v2 driver
with this series since there's no nesting change being added to it.

And a few months ago, Jason had a patch removing that whole API
from the top. Perhaps that one can be resent after
all?

Thanks
Nic

2023-03-11 00:27:45

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 05/14] iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED

On Fri, Mar 10, 2023 at 01:05:36PM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 10, 2023 at 05:39:22PM +0100, Eric Auger wrote:
>
> > > Also drop the unused arm_smmu_enable_nesting(). One following patche will
> > > configure the correct smmu_domain->stage.
> > >
> > > Signed-off-by: Nicolin Chen <[email protected]>
> > > ---
> > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 ------------------
> > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 -
> > If you go this way you may also remove it from arm/arm-smmu/arm-smmu.c.
> > Then if I am not wrong no other driver does implement enable_nesting cb.
> > Shouldn't we also remove it and fellow iommu_enable_nesting()?
>
> Yes, lets just put this patch in the series please:
>
> https://lore.kernel.org/kvm/[email protected]/

Oh. Didn't read this before sending my previous reply..

Will do that.

Nic

2023-03-11 11:57:07

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Fri, Mar 10, 2023 at 12:19:50PM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 09, 2023 at 08:20:03PM -0800, Nicolin Chen wrote:
> > On Thu, Mar 09, 2023 at 11:31:04AM -0400, Jason Gunthorpe wrote:
> > > On Thu, Mar 09, 2023 at 02:49:14PM +0000, Robin Murphy wrote:
> > >
> > > > If the design here is that user_data is so deeply driver-specific and
> > > > special to the point that it can't possibly be passed as a type-checked
> > > > union of the known and publicly-visible UAPI types that it is, wouldn't it
> > > > make sense to just encode the whole thing in the expected format and not
> > > > have to make these kinds of niggling little conversions at both ends?
> > >
> > > Yes, I suspect the design for ARM should have the input be the entire
> > > actual command work queue entry. There is no reason to burn CPU cycles
> > > in userspace marshalling it to something else and then decode it again
> > > in the kernel. Organize things to point the ioctl directly at the
> > > queue entry, and the kernel can do a single memcpy from guest
> > > controlled pages to kernel memory then parse it?
> >
> > There still can be complications to do something straightforward
> > like that.
>
> > Firstly, the consumer and producer indexes might need
> > to be synced between the host and kernel?
>
> No, qemu would handle this. The kernel would just read the command
> entries it is told by qemu to read which qemu has already sorted out.

Then, instead of sending a command, we would forward the consumer index?

> > Secondly, things like SID and VMID fields in the commands need to
> > be replaced manually when the host kernel reads commands out, which
> > means that there need to be a translation table(s) in the host
> > kernel to replace those fields. These actually are parts of the
> > features of VCMDQ hardware itself.
>
> VMID should be ignored in a guest request.

The guest always sets the VMID fields to zero. But they should then
be handled in the host for most of the TLBI commands.

VCMDQ has a register to set the VMID explicitly, so hardware can fill
in the VMID fields automatically.

> SID translation is a good point. Can qemu do this? How does SID
> translation work with VCMDQ in HW? (Jean this is exactly the sort of
> tiny detail that the generic interface ignored)

VCMDQ has multiple pairs of MATCH and REPLACE registers to set
up a hardware lookup table for SIDs. So hardware can do the job,
replacing the SID fields in the TLBI commands.

> What I'm broadly thinking is if we have to make the infrastructure for
> VCMDQ HW accelerated invalidation then it is not a big step to also
> have the kernel SW path use the same infrastructure just with a CPU
> wake up instead of a MMIO poke.
>
> Ie we have a SW version of VCMDQ to speed up SMMUv3 cases without HW
> support.

Very interesting idea!

I recall that one difficulty is to pass the vSID from the guest
down to the host kernel driver and to link with the pSID. What I
did previously for VCMDQ was to set the SID_MATCH register with
iommu_group_id(group) and set the SID_REPLACE register with the
pSID. Then hyper will use the iommu_group_id to search for the
pair of the registers, and to set vSID. Perhaps we should think
of something smarter.

> I suspect the answer to Robin's question on how to handle errors is
> the most important deciding factor. If we have to capture and relay
> actual HW errors back to userspace that really suggests we should do
> something different than a synchronous ioctl.

Does a synchronous ioctl mean returning some value, rather than defining
cache_invalidate_user as void like we are doing now? A fault
injection pathway to report CERROR asynchronously is what we've
been doing though -- even with Eric's previous VFIO solution.

Thanks
Nic
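
To make the "point the ioctl at the queue entries" idea quoted above a bit
more concrete, here is a user-space sketch of the copy step only. It is not
the proposed interface: copy_cmds() and struct cmd are hypothetical names,
cons/prod are treated as free-running counters (the real SMMUv3 wrap-bit
handling is left out), and SID/VMID fixups are not attempted. It assumes a
16-byte command with the opcode in bits [7:0] of the first dword.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CMD_DWORDS 2	/* one command is two 64-bit dwords */

struct cmd {
	uint64_t dw[CMD_DWORDS];
};

/*
 * Copy the commands in [cons, prod) out of a power-of-two ring into a
 * private buffer, so they can be parsed without the producer changing
 * them underneath us. Returns the number of commands copied (<= max).
 */
size_t copy_cmds(const struct cmd *ring, uint32_t log2size,
		 uint32_t cons, uint32_t prod,
		 struct cmd *out, size_t max)
{
	uint32_t mask = (1u << log2size) - 1;
	size_t n = 0;

	assert(ring && out);
	while (cons != prod && n < max) {
		memcpy(&out[n++], &ring[cons & mask], sizeof(*out));
		cons++;
	}
	return n;
}

/* Extract the opcode from a copied command. */
uint8_t cmd_opcode(const struct cmd *c)
{
	return (uint8_t)(c->dw[0] & 0xff);
}
```

Parsing only the private copy means each command is validated exactly as it
will be issued, whatever the guest does to its queue afterwards.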

2023-03-11 12:38:17

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Fri, Mar 10, 2023 at 05:53:46PM +0000, Robin Murphy wrote:

> > > > + case CMDQ_OP_TLBI_NH_VA:
> > > > + cmd.tlbi.asid = inv_info->asid;
> > > > + fallthrough;
> > > > + case CMDQ_OP_TLBI_NH_VAA:
> > > > + if (!granule_size || !(granule_size & smmu->pgsize_bitmap) ||
> > >
> > > Non-range invalidations with TG=0 are perfectly legal, and should not be
> > > ignored.
> >
> > I assume that you are talking about the pgsize_bitmap check.
> >
> > QEMU embeds a !tg case into the granule_size [1]. So it might
> > not be straightforward to cover that case. Let me see how to
> > untangle different cases and handle them accordingly.
>
> Oh, double-checking patch #2, that might be me misunderstanding the
> interface. I hadn't realised that the UAPI was apparently modelled on
> arm_smmu_tlb_inv_range_asid() rather than actual SMMU commands :)

Yea. In fact, most of the invalidation info in QEMU was packed
for the previously defined general cache invalidation structure,
and the range invalidation part is still not quite independent.

> I really think UAPI should reflect the hardware and encode TG and TTL
> directly. Especially since there's technically a flaw in the current
> driver where we assume TTL in cases where it isn't actually known, thus
> may potentially fail to invalidate level 2 block entries when removing a
> level 1 table, since io-pgtable passes the level 3 granule in that case.

Do you mean something like hw_info forwarding pgsize_bitmap/tg
to the guest? Or the other direction?

> When range invalidation came along, the distinction between "all leaves
> are definitely at the last level" and "use last-level granularity to
> make sure everything at any level is hit" started to matter, but the
> interface never caught up. It hasn't seemed desperately urgent to fix
> (who does 1GB+ unmaps outside of VFIO teardown anyway?), but we must
> definitely not bake the same mistake into user ABI.
>
> Of course, there might then be cases where we need to transform
> non-range commands into range commands for the sake of workarounds, but
> that's our own problem to deal with.

Noted it down.

> > > What about NSNH_ALL? That still needs to invalidate all the S1 context
> > > that the guest *thinks* it's invalidating.
> >
> > NSNH_ALL is translated to NH_ALL at the guest level. But maybe
> > it should have been done here instead.
>
> Yes. It seems the worst of both worlds to have an interface which takes
> raw opcodes rather than an enum of supported commands, but still
> requires userspace to know which opcodes are supported and which ones
> don't work as expected even though they are entirely reasonable to use
> in the context of the stage-1-only SMMU being emulated.

Maybe a list of supported TLBI commands via the hw_info uAPI?

Thanks
Nic

2023-03-11 12:40:53

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 08/14] iommu/arm-smmu-v3: Prepare for nested domain support

On Fri, Mar 10, 2023 at 08:39:20PM +0000, Robin Murphy wrote:

> > @@ -1277,6 +1295,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> > switch (smmu_domain->stage) {
> > case ARM_SMMU_DOMAIN_S1:
> > s1_cfg = &smmu_domain->s1_cfg;
> > + s2_cfg = to_s2_cfg(smmu_domain);
>
> TBH I'd say you only need a 2-line change here. All the other cases
> below are when the stage is guaranteed to be ARM_SMMU_DOMAIN_S2 (once
> ARM_SMMU_DOMAIN_NESTED is gone), so pretending it might be otherwise
> seems unnecessarily confusing.

Oh right...I will drop those.

Thanks!
Nic

2023-03-11 12:53:23

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Sat, Mar 11, 2023 at 03:56:56AM -0800, Nicolin Chen wrote:

> I recall that one difficulty is to pass the vSID from the guest
> down to the host kernel driver and to link with the pSID. What I
> did previously for VCMDQ was to set the SID_MATCH register with
> iommu_group_id(group) and set the SID_REPLACE register with the
> pSID. Then hyper will use the iommu_group_id to search for the
> pair of the registers, and to set vSID. Perhaps we should think
> of something smarter.

I just found that the CFGI_STE command has the SID field, yet
we just didn't pack it in the data structure for a hwpt_alloc
ioctl. So, perhaps it isn't that difficult at all. I'll try a
bit of a test run next week.

Thanks
Nic

2023-03-13 13:07:57

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On 2023-03-11 12:38, Nicolin Chen wrote:
> On Fri, Mar 10, 2023 at 05:53:46PM +0000, Robin Murphy wrote:
>
>>>>> + case CMDQ_OP_TLBI_NH_VA:
>>>>> + cmd.tlbi.asid = inv_info->asid;
>>>>> + fallthrough;
>>>>> + case CMDQ_OP_TLBI_NH_VAA:
>>>>> + if (!granule_size || !(granule_size & smmu->pgsize_bitmap) ||
>>>>
>>>> Non-range invalidations with TG=0 are perfectly legal, and should not be
>>>> ignored.
>>>
>>> I assume that you are talking about the pgsize_bitmap check.
>>>
>>> QEMU embeds a !tg case into the granule_size [1]. So it might
>>> not be straightforward to cover that case. Let me see how to
>>> untangle different cases and handle them accordingly.
>>
>> Oh, double-checking patch #2, that might be me misunderstanding the
>> interface. I hadn't realised that the UAPI was apparently modelled on
>> arm_smmu_tlb_inv_range_asid() rather than actual SMMU commands :)
>
> Yea. In fact, most of the invalidation info in QEMU was packed
> for the previously defined general cache invalidation structure,
> and the range invalidation part is still not quite independent.
>
>> I really think UAPI should reflect the hardware and encode TG and TTL
>> directly. Especially since there's technically a flaw in the current
>> driver where we assume TTL in cases where it isn't actually known, thus
>> may potentially fail to invalidate level 2 block entries when removing a
>> level 1 table, since io-pgtable passes the level 3 granule in that case.
>
> Do you mean something like hw_info forwarding pgsize_bitmap/tg
> to the guest? Or the other direction?

I mean if the interface wants to support range invalidations in a way
which works correctly, then it should ideally carry both the TG and TTL
fields from the guest command straight through to the host. If not, then
at the very least the host must always assume TTL=0, because it cannot
correctly infer otherwise once the guest command's original intent has
been lost.
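A minimal sketch of what carrying the guest's hints through could look like, assuming the TG, TTL and leaf fields sit in word 1 of a TLBI-by-address command at the bit positions below (my reading of the encoding, to be verified). `struct tlbi_hint` and both helpers are hypothetical; the second helper shows the fallback of always reporting TTL=0 when the original command is no longer available.

```c
#include <stdint.h>

/* Word 1 field positions of a TLBI-by-VA command (assumed, verify against the spec). */
#define TLBI_1_LEAF       (1ULL << 0)
#define TLBI_1_TTL_SHIFT  8
#define TLBI_1_TG_SHIFT   10
#define TLBI_1_FIELD_MASK 0x3ULL

struct tlbi_hint {
	uint8_t tg;   /* 0 = no granule given (non-range form) */
	uint8_t ttl;  /* 0 = level unknown */
	uint8_t leaf; /* 1 = leaf-only invalidation */
};

/* Preferred path: copy the hints straight out of the guest command. */
static struct tlbi_hint tlbi_hint_from_cmd(const uint64_t cmd[2])
{
	struct tlbi_hint h;

	h.tg = (uint8_t)((cmd[1] >> TLBI_1_TG_SHIFT) & TLBI_1_FIELD_MASK);
	h.ttl = (uint8_t)((cmd[1] >> TLBI_1_TTL_SHIFT) & TLBI_1_FIELD_MASK);
	h.leaf = (cmd[1] & TLBI_1_LEAF) ? 1 : 0;
	return h;
}

/* Fallback path: the level cannot be inferred, so never claim one. */
static struct tlbi_hint tlbi_hint_unknown_level(uint8_t tg, uint8_t leaf)
{
	struct tlbi_hint h = { .tg = tg, .ttl = 0, .leaf = leaf };

	return h;
}
```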

>> When range invalidation came along, the distinction between "all leaves
>> are definitely at the last level" and "use last-level granularity to
>> make sure everything at any level is hit" started to matter, but the
>> interface never caught up. It hasn't seemed desperately urgent to fix
>> (who does 1GB+ unmaps outside of VFIO teardown anyway?), but we must
>> definitely not bake the same mistake into user ABI.
>>
>> Of course, there might then be cases where we need to transform
>> non-range commands into range commands for the sake of workarounds, but
>> that's our own problem to deal with.
>
> Noted it down.
>
>>>> What about NSNH_ALL? That still needs to invalidate all the S1 context
>>>> that the guest *thinks* it's invalidating.
>>>
>>> NSNH_ALL is translated to NH_ALL at the guest level. But maybe
>>> it should have been done here instead.
>>
>> Yes. It seems the worst of both worlds to have an interface which takes
>> raw opcodes rather than an enum of supported commands, but still
>> requires userspace to know which opcodes are supported and which ones
>> don't work as expected even though they are entirely reasonable to use
>> in the context of the stage-1-only SMMU being emulated.
>
> Maybe a list of supported TLBI commands via the hw_info uAPI?

I don't think it's all that difficult to implicitly support all commands
that are valid for a stage-1-only SMMU, it just needs the right
interface design to be capable of encoding them all completely and
unambiguously. Coming back to the previous point about the address
encoding, I think that means basing it more directly on the actual
SMMUv3 commands, rather than on io-pgtable's abstraction of invalidation
with SMMUv3 opcodes bolted on.
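As a rough illustration of implicitly supporting the stage-1 commands, the host side could map each guest opcode to the opcode it actually issues. This is only a sketch: the opcode values are as I understand the SMMUv3 encoding and need checking, and `s1_host_opcode()` is a hypothetical helper, not part of the series.

```c
#include <stdint.h>

/* Opcode values as I understand the SMMUv3 encoding; verify before use. */
enum {
	OP_TLBI_NH_ALL   = 0x10,
	OP_TLBI_NH_ASID  = 0x11,
	OP_TLBI_NH_VA    = 0x12,
	OP_TLBI_NH_VAA   = 0x13,
	OP_TLBI_NSNH_ALL = 0x30,
};

/*
 * Return the opcode the host should issue for a guest TLBI opcode,
 * or -1 if the command is not valid for a stage-1-only SMMU.
 */
static int s1_host_opcode(uint8_t guest_op)
{
	switch (guest_op) {
	case OP_TLBI_NH_ALL:
	case OP_TLBI_NH_ASID:
	case OP_TLBI_NH_VA:
	case OP_TLBI_NH_VAA:
		return guest_op;
	case OP_TLBI_NSNH_ALL:
		/* The guest only owns its own S1 context: narrow to NH_ALL. */
		return OP_TLBI_NH_ALL;
	default:
		return -1;
	}
}
```

The host would still have to scope whatever it issues to the guest's VMID; that part is left out here.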

Thanks,
Robin.

2023-03-16 00:01:41

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 13, 2023 at 01:07:42PM +0000, Robin Murphy wrote:
> On 2023-03-11 12:38, Nicolin Chen wrote:
> > On Fri, Mar 10, 2023 at 05:53:46PM +0000, Robin Murphy wrote:
> >
> > > > > > + case CMDQ_OP_TLBI_NH_VA:
> > > > > > + cmd.tlbi.asid = inv_info->asid;
> > > > > > + fallthrough;
> > > > > > + case CMDQ_OP_TLBI_NH_VAA:
> > > > > > + if (!granule_size || !(granule_size & smmu->pgsize_bitmap) ||
> > > > >
> > > > > Non-range invalidations with TG=0 are perfectly legal, and should not be
> > > > > ignored.
> > > >
> > > > I assume that you are talking about the pgsize_bitmap check.
> > > >
> > > > QEMU embeds a !tg case into the granule_size [1]. So it might
> > > > not be straightforward to cover that case. Let me see how to
> > > > untangle different cases and handle them accordingly.
> > >
> > > Oh, double-checking patch #2, that might be me misunderstanding the
> > > interface. I hadn't realised that the UAPI was apparently modelled on
> > > arm_smmu_tlb_inv_range_asid() rather than actual SMMU commands :)
> >
> > Yea. In fact, most of the invalidation info in QEMU was packed
> > for the previously defined general cache invalidation structure,
> > and the range invalidation part is still not quite independent.
> >
> > > I really think UAPI should reflect the hardware and encode TG and TTL
> > > directly. Especially since there's technically a flaw in the current
> > > driver where we assume TTL in cases where it isn't actually known, thus
> > > may potentially fail to invalidate level 2 block entries when removing a
> > > level 1 table, since io-pgtable passes the level 3 granule in that case.
> >
> > Do you mean something like hw_info forwarding pgsize_bitmap/tg
> > to the guest? Or the other direction?
>
> I mean if the interface wants to support range invalidations in a way
> which works correctly, then it should ideally carry both the TG and TTL
> fields from the guest command straight through to the host. If not, then
> at the very least the host must always assume TTL=0, because it cannot
> correctly infer otherwise once the guest command's original intent has
> been lost.

Oh, it's about hypervisor simply forwarding the entire CMD to
the host side. Jason is suggesting a fast approach by letting
host kernel read the CMDQ directly to get the raw CMD. Perhaps
that would address this comment about TG/TTL too.

I wonder if there could be cases other than a workaround (WAR)
where the TG/TTL fields from the guest aren't supported by the
host. And then should the host handle it with a different CMD?

> > > When range invalidation came along, the distinction between "all leaves
> > > are definitely at the last level" and "use last-level granularity to
> > > make sure everything at any level is hit" started to matter, but the
> > > interface never caught up. It hasn't seemed desperately urgent to fix
> > > (who does 1GB+ unmaps outside of VFIO teardown anyway?), but we must
> > > definitely not bake the same mistake into user ABI.
> > >
> > > Of course, there might then be cases where we need to transform
> > > non-range commands into range commands for the sake of workarounds, but
> > > that's our own problem to deal with.
> >
> > Noted it down.
> >
> > > > > What about NSNH_ALL? That still needs to invalidate all the S1 context
> > > > > that the guest *thinks* it's invalidating.
> > > >
> > > > NSNH_ALL is translated to NH_ALL at the guest level. But maybe
> > > > it should have been done here instead.
> > >
> > > Yes. It seems the worst of both worlds to have an interface which takes
> > > raw opcodes rather than an enum of supported commands, but still
> > > requires userspace to know which opcodes are supported and which ones
> > > don't work as expected even though they are entirely reasonable to use
> > > in the context of the stage-1-only SMMU being emulated.
> >
> > Maybe a list of supported TLBI commands via the hw_info uAPI?
>
> I don't think it's all that difficult to implicitly support all commands
> that are valid for a stage-1-only SMMU, it just needs the right
> interface design to be capable of encoding them all completely and
> unambiguously. Coming back to the previous point about the address
> encoding, I think that means basing it more directly on the actual
> SMMUv3 commands, rather than on io-pgtable's abstraction of invalidation
> with SMMUv3 opcodes bolted on.

Yea, with the actual commands from the guest, the host can do
something with its supported commands, I think.

Thanks
Nicolin

2023-03-16 00:13:40

by Nicolin Chen

Subject: Re: [PATCH v1 04/14] iommu/arm-smmu-v3: Add arm_smmu_hw_info

On Fri, Mar 10, 2023 at 03:28:56PM +0000, Robin Murphy wrote:
> On 2023-03-10 01:17, Nicolin Chen wrote:
> > Hi Robin,
> >
> > Thanks for the inputs.
> >
> > On Thu, Mar 09, 2023 at 01:03:41PM +0000, Robin Murphy wrote:
> > > On 2023-03-09 10:53, Nicolin Chen wrote:
> > > > This is used to forward the host IDR values to the user space, so the
> > > > hypervisor and the guest VM can learn about the underlying hardware's
> > > > capabilities.
> > > >
> > > > Also, set the driver_type to IOMMU_HW_INFO_TYPE_ARM_SMMUV3 to pass the
> > > > corresponding type sanity in the core.
> > > >
> > > > Signed-off-by: Nicolin Chen <[email protected]>
> > > > ---
> > > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 25 +++++++++++++++++++++
> > > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 ++
> > > > include/uapi/linux/iommufd.h | 14 ++++++++++++
> > > > 3 files changed, 41 insertions(+)
> > > >
> > > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > > index f2425b0f0cd6..c1aac695ae0d 100644
> > > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > > @@ -2005,6 +2005,29 @@ static bool arm_smmu_capable(struct device *dev, enum iommu_cap cap)
> > > > }
> > > > }
> > > >
> > > > +static void *arm_smmu_hw_info(struct device *dev, u32 *length)
> > > > +{
> > > > + struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> > > > + struct iommu_hw_info_smmuv3 *info;
> > > > + void *base_idr;
> > > > + int i;
> > > > +
> > > > + if (!master || !master->smmu)
> > > > + return ERR_PTR(-ENODEV);
> > > > +
> > > > + info = kzalloc(sizeof(*info), GFP_KERNEL);
> > > > + if (!info)
> > > > + return ERR_PTR(-ENOMEM);
> > > > +
> > > > + base_idr = master->smmu->base + ARM_SMMU_IDR0;
> > > > + for (i = 0; i <= 5; i++)
> > > > + info->idr[i] = readl_relaxed(base_idr + 0x4 * i);
> > >
> > > You need to take firmware overrides etc. into account here. In
> > > particular, features like BTM may need to be hidden to work around
> > > errata either in the system integration or the SMMU itself. It isn't
> > > reasonable to expect every VMM to be aware of every erratum and
> > > workaround, and there may even be workarounds where we need to go out of
> > > our way to prevent guests from trying to use certain features in order
> > > to maintain correctness at S2.
> >
> > We can add a bit of overrides after this for errata, perhaps?
> >
> > I have some trouble with finding the errata docs. Would it be
> > possible for you to direct me to it with a link maybe?
>
> The key Arm term is "Software Developer Errata Notice", or just SDEN.
> Here's the ones for MMU-600 and MMU-700:
>
> https://developer.arm.com/documentation/SDEN-946810/latest/

This page shows "Arm CoreLink MMU-600 System Memory Management
Unit Software Developer Errata Notice" but the downloaded file
is "Arm CoreLink CI-700 Coherent Interconnect" errata notice.
And I don't quite understand what it's about.

> https://developer.arm.com/documentation/SDEN-1786925/latest/

Yea, from this one I got an "MMU-700 System Memory Management
Unit" errata file that I can read and understand.

> Note that until now it has been extremely fortunate that in pretty much
> every case Linux either hasn't supported the affected feature at all, or
> has happened to avoid meeting the conditions. Once we do introduce
> nesting support that all goes out the window (and I'll have to think
> more when reviewing new errata in future...)
>
> I've been putting off revisiting all the existing errata to figure out
> what we'd need to do until new nesting patches appeared, so I'll try to
> get to that soon now. I think in many cases it's likely to be best to
> just disallow nesting entirely on affected implementations.

Do we already have a list of "affected implementations"? Or
would we need to make such a list now? In the latter case, can
these affected implementations be detected from their IDR0-5
registers, so that we can simply do something in hw_info()?
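To make "do something in hw_info()" concrete, here is a minimal sketch, assuming the host keeps a per-SMMU mask of feature bits it wants to hide and applies it to the raw IDR values before they reach userspace. `struct idr_overrides` and `report_idrs()` are hypothetical, and the BTM bit position is my assumption about the IDR0 layout.

```c
#include <stdint.h>

#define NUM_IDR   6
#define IDR0_BTM  (1u << 5)   /* assumed position of the BTM bit in IDR0 */

/* Hypothetical per-SMMU record of feature bits the host decided to hide. */
struct idr_overrides {
	uint32_t clear[NUM_IDR];
};

/* Copy the raw IDR values, masking out everything the host wants hidden. */
static void report_idrs(const uint32_t raw[NUM_IDR],
			const struct idr_overrides *ovr,
			uint32_t out[NUM_IDR])
{
	for (int i = 0; i < NUM_IDR; i++)
		out[i] = raw[i] & ~ovr->clear[i];
}
```

The mask would be filled from firmware overrides and errata detection at probe time, so no VMM has to know about individual errata.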

Thanks
Nic

2023-03-16 00:59:49

by Nicolin Chen

Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

On Fri, Mar 10, 2023 at 12:06:18PM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 09, 2023 at 08:50:52PM -0800, Nicolin Chen wrote:
> > On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
> >
> > > Nicolin, I think we should tweak the uAPI here so that the
> > > invalidation opaque data has a format tagged on its own, instead of
> > > re-using the HWPT tag. Ie you can have a ARM SMMUv3 invalidate type
> > > tag and also a virtio-viommu invalidate type tag.
> >
> > The invalidation tag is shared with the hwpt allocation. Does
> > it mean that virtio-iommu won't have its own allocation tag?
>
> We probably shouldn't assume it will

In that case, why do we need an invalidation tag/type of its
own? Can't we use an IOMMU_HWPT_TYPE_VIRTIO tag for allocation
and invalidation together for virtio?

Or did you mean that we should define a flag inside the data
structure like this?

struct iommu_hwpt_invalidate_arm_smmuv3 {
#define IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF	(1ULL << 0)
#define IOMMU_SMMUV3_FORMAT_VIRTIO	(1ULL << 63)
	__u64 flags;
};

Thanks
Nic

2023-03-16 14:58:51

by Robin Murphy

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On 2023-03-16 00:01, Nicolin Chen wrote:
> On Mon, Mar 13, 2023 at 01:07:42PM +0000, Robin Murphy wrote:
>> On 2023-03-11 12:38, Nicolin Chen wrote:
>>> On Fri, Mar 10, 2023 at 05:53:46PM +0000, Robin Murphy wrote:
>>>
>>>>>>> + case CMDQ_OP_TLBI_NH_VA:
>>>>>>> + cmd.tlbi.asid = inv_info->asid;
>>>>>>> + fallthrough;
>>>>>>> + case CMDQ_OP_TLBI_NH_VAA:
>>>>>>> + if (!granule_size || !(granule_size & smmu->pgsize_bitmap) ||
>>>>>>
>>>>>> Non-range invalidations with TG=0 are perfectly legal, and should not be
>>>>>> ignored.
>>>>>
>>>>> I assume that you are talking about the pgsize_bitmap check.
>>>>>
>>>>> QEMU embeds a !tg case into the granule_size [1]. So it might
>>>>> not be straightforward to cover that case. Let me see how to
>>>>> untangle different cases and handle them accordingly.
>>>>
>>>> Oh, double-checking patch #2, that might be me misunderstanding the
>>>> interface. I hadn't realised that the UAPI was apparently modelled on
>>>> arm_smmu_tlb_inv_range_asid() rather than actual SMMU commands :)
>>>
>>> Yea. In fact, most of the invalidation info in QEMU was packed
>>> for the previously defined general cache invalidation structure,
>>> and the range invalidation part is still not quite independent.
>>>
>>>> I really think UAPI should reflect the hardware and encode TG and TTL
>>>> directly. Especially since there's technically a flaw in the current
>>>> driver where we assume TTL in cases where it isn't actually known, thus
>>>> may potentially fail to invalidate level 2 block entries when removing a
>>>> level 1 table, since io-pgtable passes the level 3 granule in that case.
>>>
>>> Do you mean something like hw_info forwarding pgsize_bitmap/tg
>>> to the guest? Or the other direction?
>>
>> I mean if the interface wants to support range invalidations in a way
>> which works correctly, then it should ideally carry both the TG and TTL
>> fields from the guest command straight through to the host. If not, then
>> at the very least the host must always assume TTL=0, because it cannot
>> correctly infer otherwise once the guest command's original intent has
>> been lost.
>
> Oh, it's about hypervisor simply forwarding the entire CMD to
> the host side. Jason is suggesting a fast approach by letting
> host kernel read the CMDQ directly to get the raw CMD. Perhaps
> that would address this comment about TG/TTL too.

That did cross my mind, but given the usage model, having host userspace
give guest memory whose contents it can't control (unless it pauses the
whole VM on all CPUs) directly to the host kernel just seems to invite
more potential problems than necessary. Commands aren't big, so I think
it's fair to expect the VMM to marshal them into host memory, and save
the host kernel from ever having to reason about any races or other
emulation details which may exist between a VM and its VMM.
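A minimal sketch of the marshalling step on the VMM side, under simplifying assumptions: the ring size is a power of two and the indices are free-running counters (the real SMMU queue registers carry a wrap bit instead). `struct smmu_cmd` and `snapshot_cmds()` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One 128-bit command as two 64-bit words. */
struct smmu_cmd {
	uint64_t w[2];
};

/*
 * Copy the commands in [cons, prod) from a ring the guest can still
 * modify into VMM-private memory. Validation and the ioctl then work
 * on the private copy only, so later guest writes cannot change what
 * the host kernel sees.
 */
static size_t snapshot_cmds(const struct smmu_cmd *ring, uint32_t ring_entries,
			    uint32_t cons, uint32_t prod,
			    struct smmu_cmd *out, size_t out_max)
{
	uint32_t mask = ring_entries - 1;
	size_t n = 0;

	while (cons != prod && n < out_max) {
		memcpy(&out[n++], &ring[cons & mask], sizeof(out[0]));
		cons++;
	}
	return n;
}
```

A production VMM would also need to think about how the copy is ordered against guest writes; the sketch ignores that.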

> I wonder if there could be cases other than a workaround (WAR)
> where the TG/TTL fields from the guest aren't supported by the
> host. And then should the host handle it with a different CMD?

As Eric found previously, there's a clear benefit in emulating range
invalidation for guests even if the underlying hardware doesn't support
it, to minimise trapping. But that's not hard, and the patch as-is is
already achieving it. All we need to be careful to avoid is issuing
hardware commands with *less* scope than the guest originally asked for - if
the guest asks for a nonsense TG/TTL which doesn't match its current
context, that's fine. The interface just has to ensure that a VMM's SMMU
emulation *is* able to make a nested S1 context behave as expected by
the architecture; we don't need to care if a guest uses the architecture
wrong, since it's only hurting itself.

>>>> When range invalidation came along, the distinction between "all leaves
>>>> are definitely at the last level" and "use last-level granularity to
>>>> make sure everything at any level is hit" started to matter, but the
>>>> interface never caught up. It hasn't seemed desperately urgent to fix
>>>> (who does 1GB+ unmaps outside of VFIO teardown anyway?), but we must
>>>> definitely not bake the same mistake into user ABI.
>>>>
>>>> Of course, there might then be cases where we need to transform
>>>> non-range commands into range commands for the sake of workarounds, but
>>>> that's our own problem to deal with.
>>>
>>> Noted it down.
>>>
>>>>>> What about NSNH_ALL? That still needs to invalidate all the S1 context
>>>>>> that the guest *thinks* it's invalidating.
>>>>>
>>>>> NSNH_ALL is translated to NH_ALL at the guest level. But maybe
>>>>> it should have been done here instead.
>>>>
>>>> Yes. It seems the worst of both worlds to have an interface which takes
>>>> raw opcodes rather than an enum of supported commands, but still
>>>> requires userspace to know which opcodes are supported and which ones
>>>> don't work as expected even though they are entirely reasonable to use
>>>> in the context of the stage-1-only SMMU being emulated.
>>>
>>> Maybe a list of supported TLBI commands via the hw_info uAPI?
>>
>> I don't think it's all that difficult to implicitly support all commands
>> that are valid for a stage-1-only SMMU, it just needs the right
>> interface design to be capable of encoding them all completely and
>> unambiguously. Coming back to the previous point about the address
>> encoding, I think that means basing it more directly on the actual
>> SMMUv3 commands, rather than on io-pgtable's abstraction of invalidation
>> with SMMUv3 opcodes bolted on.
>
> Yea, with the actual commands from the guest, the host can do
> something with its supported commands, I think.

The one slightly fiddly case, of course, is CMD_SYNC, but I think that's
just a matter for clear documentation of the expectations and behaviour.

Thanks,
Robin.

2023-03-16 15:20:10

by Robin Murphy

Subject: Re: [PATCH v1 04/14] iommu/arm-smmu-v3: Add arm_smmu_hw_info

On 16/03/2023 12:13 am, Nicolin Chen wrote:
> On Fri, Mar 10, 2023 at 03:28:56PM +0000, Robin Murphy wrote:
>> On 2023-03-10 01:17, Nicolin Chen wrote:
>>> Hi Robin,
>>>
>>> Thanks for the inputs.
>>>
>>> On Thu, Mar 09, 2023 at 01:03:41PM +0000, Robin Murphy wrote:
>>>> On 2023-03-09 10:53, Nicolin Chen wrote:
>>>>> This is used to forward the host IDR values to the user space, so the
>>>>> hypervisor and the guest VM can learn about the underlying hardware's
>>>>> capabilities.
>>>>>
>>>>> Also, set the driver_type to IOMMU_HW_INFO_TYPE_ARM_SMMUV3 to pass the
>>>>> corresponding type sanity in the core.
>>>>>
>>>>> Signed-off-by: Nicolin Chen <[email protected]>
>>>>> ---
>>>>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 25 +++++++++++++++++++++
>>>>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 ++
>>>>> include/uapi/linux/iommufd.h | 14 ++++++++++++
>>>>> 3 files changed, 41 insertions(+)
>>>>>
>>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>> index f2425b0f0cd6..c1aac695ae0d 100644
>>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>> @@ -2005,6 +2005,29 @@ static bool arm_smmu_capable(struct device *dev, enum iommu_cap cap)
>>>>> }
>>>>> }
>>>>>
>>>>> +static void *arm_smmu_hw_info(struct device *dev, u32 *length)
>>>>> +{
>>>>> + struct arm_smmu_master *master = dev_iommu_priv_get(dev);
>>>>> + struct iommu_hw_info_smmuv3 *info;
>>>>> + void *base_idr;
>>>>> + int i;
>>>>> +
>>>>> + if (!master || !master->smmu)
>>>>> + return ERR_PTR(-ENODEV);
>>>>> +
>>>>> + info = kzalloc(sizeof(*info), GFP_KERNEL);
>>>>> + if (!info)
>>>>> + return ERR_PTR(-ENOMEM);
>>>>> +
>>>>> + base_idr = master->smmu->base + ARM_SMMU_IDR0;
>>>>> + for (i = 0; i <= 5; i++)
>>>>> + info->idr[i] = readl_relaxed(base_idr + 0x4 * i);
>>>>
>>>> You need to take firmware overrides etc. into account here. In
>>>> particular, features like BTM may need to be hidden to work around
>>>> errata either in the system integration or the SMMU itself. It isn't
>>>> reasonable to expect every VMM to be aware of every erratum and
>>>> workaround, and there may even be workarounds where we need to go out of
>>>> our way to prevent guests from trying to use certain features in order
>>>> to maintain correctness at S2.
>>>
>>> We can add a bit of overrides after this for errata, perhaps?
>>>
>>> I have some trouble with finding the errata docs. Would it be
>>> possible for you to direct me to it with a link maybe?
>>
>> The key Arm term is "Software Developer Errata Notice", or just SDEN.
>> Here's the ones for MMU-600 and MMU-700:
>>
>> https://developer.arm.com/documentation/SDEN-946810/latest/
>
> This page shows "Arm CoreLink MMU-600 System Memory Management
> Unit Software Developer Errata Notice" but the downloaded file
> is "Arm CoreLink CI-700 Coherent Interconnect" errata notice.
> And I don't quite understand what it's about.

Oh, wonderful... I've reported that now, hopefully it gets fixed soon...

>> https://developer.arm.com/documentation/SDEN-1786925/latest/
>
> Yea, from this one I got an "MMU-700 System Memory Management
> Unit" errata file that I can read and understand.
>
>> Note that until now it has been extremely fortunate that in pretty much
>> every case Linux either hasn't supported the affected feature at all, or
>> has happened to avoid meeting the conditions. Once we do introduce
>> nesting support that all goes out the window (and I'll have to think
>> more when reviewing new errata in future...)
>>
>> I've been putting off revisiting all the existing errata to figure out
>> what we'd need to do until new nesting patches appeared, so I'll try to
>> get to that soon now. I think in many cases it's likely to be best to
>> just disallow nesting entirely on affected implementations.
>
> Do we already have a list of "affected implementations"? Or
> would we need to make such a list now? In the latter case, can
> these affected implementations be detected from their IDR0-5
> registers, so that we can simply do something in hw_info()?

Somewhere I have a patch that adds all the IIDR stuff needed for this,
but I never sent it upstream since the erratum itself was an early
MMU-600 one which in practice doesn't matter. I'll dig that out and
update it with what I have in mind.
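For readers unfamiliar with IIDR-based matching, a minimal sketch of how an implementation could be identified and checked against a quirk table. The field layout and the Arm/MMU-600 ID values are my assumptions and must be verified; `struct nesting_quirk`, its `max_variant` cut-off and `nesting_blocked()` are purely hypothetical and do not describe any real erratum.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* IIDR field layout (assumed, verify against the spec). */
#define IIDR_IMPLEMENTER(x)  ((x) & 0xfffu)
#define IIDR_VARIANT(x)      (((x) >> 16) & 0xfu)
#define IIDR_PRODUCTID(x)    (((x) >> 20) & 0xfffu)

#define IMPLEMENTER_ARM      0x43bu  /* assumed value */
#define PRODUCTID_MMU_600    0x483u  /* assumed value */

/* Hypothetical quirk entry: block nesting up to and including max_variant. */
struct nesting_quirk {
	uint16_t implementer;
	uint16_t productid;
	uint8_t  max_variant;
};

static bool nesting_blocked(uint32_t iidr, const struct nesting_quirk *q, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (IIDR_IMPLEMENTER(iidr) == q[i].implementer &&
		    IIDR_PRODUCTID(iidr) == q[i].productid &&
		    IIDR_VARIANT(iidr) <= q[i].max_variant)
			return true;
	}
	return false;
}
```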

Thanks,
Robin.

2023-03-16 20:07:01

by Nicolin Chen

Subject: Re: [PATCH v1 04/14] iommu/arm-smmu-v3: Add arm_smmu_hw_info

On Thu, Mar 16, 2023 at 03:19:27PM +0000, Robin Murphy wrote:

> > > Note that until now it has been extremely fortunate that in pretty much
> > > every case Linux either hasn't supported the affected feature at all, or
> > > has happened to avoid meeting the conditions. Once we do introduce
> > > nesting support that all goes out the window (and I'll have to think
> > > more when reviewing new errata in future...)
> > >
> > > I've been putting off revisiting all the existing errata to figure out
> > > what we'd need to do until new nesting patches appeared, so I'll try to
> > > get to that soon now. I think in many cases it's likely to be best to
> > > just disallow nesting entirely on affected implementations.
> >
> > Do we already have a list of "affected implementations"? Or
> > would we need to make such a list now? In the latter case, can
> > these affected implementations be detected from their IDR0-5
> > registers, so that we can simply do something in hw_info()?
>
> Somewhere I have a patch that adds all the IIDR stuff needed for this,
> but I never sent it upstream since the erratum itself was an early
> MMU-600 one which in practice doesn't matter. I'll dig that out and
> update it with what I have in mind.

Nice!

Perhaps we should merge that first, or include it in this series
if you don't mind, so that we would be less worried about any
affected platform when releasing the new Linux version that has
this nesting feature.

Thanks!
Nic

2023-03-16 21:09:26

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Thu, Mar 16, 2023 at 02:58:39PM +0000, Robin Murphy wrote:

> > > > > I really think UAPI should reflect the hardware and encode TG and TTL
> > > > > directly. Especially since there's technically a flaw in the current
> > > > > driver where we assume TTL in cases where it isn't actually known, thus
> > > > > may potentially fail to invalidate level 2 block entries when removing a
> > > > > level 1 table, since io-pgtable passes the level 3 granule in that case.
> > > >
> > > > Do you mean something like hw_info forwarding pgsize_bitmap/tg
> > > > to the guest? Or the other direction?
> > >
> > > I mean if the interface wants to support range invalidations in a way
> > > which works correctly, then it should ideally carry both the TG and TTL
> > > fields from the guest command straight through to the host. If not, then
> > > at the very least the host must always assume TTL=0, because it cannot
> > > correctly infer otherwise once the guest command's original intent has
> > > been lost.
> >
> > Oh, it's about hypervisor simply forwarding the entire CMD to
> > the host side. Jason is suggesting a fast approach by letting
> > host kernel read the CMDQ directly to get the raw CMD. Perhaps
> > > that would address this comment about TG/TTL too.
>
> That did cross my mind, but given the usage model, having host userspace
> give guest memory whose contents it can't control (unless it pauses the
> whole VM on all CPUs) directly to the host kernel just seems to invite
> more potential problems than necessary. Commands aren't big, so I think
> it's fair to expect the VMM to marshal them into host memory, and save
> the host kernel from ever having to reason about any races or other
> emulation details which may exist between a VM and its VMM.

An invalidation ioctl is executed synchronously from the top
level in QEMU when it traps a CMDQ_PROD write. So, whether the
VMM packs the fields of a command into a data structure or just
forwards the raw command, the exposure to race conditions seems
to be the same?

> > I wonder if there could be cases other than a workaround (WAR)
> > where the TG/TTL fields from the guest aren't supported by the
> > host. And then should the host handle it with a different CMD?
>
> As Eric found previously, there's a clear benefit in emulating range
> invalidation for guests even if the underlying hardware doesn't support
> it, to minimise trapping. But that's not hard, and the patch as-is is
> already achieving it. All we need to be careful to avoid is issuing
> hardware commands with *less* scope than the guest originally asked for - if
> the guest asks for a nonsense TG/TTL which doesn't match its current
> context, that's fine. The interface just has to ensure that a VMM's SMMU
> emulation *is* able to make a nested S1 context behave as expected by
> the architecture; we don't need to care if a guest uses the architecture
> wrong, since it's only hurting itself.

Agreed. Yet, similar to moving the emulation of TLBI_NSNH_ALL
from QEMU to the kernel, could we move the emulation of the
other TLBI commands to the kernel too? Then a hypervisor would
not need to know which TLBI commands the host supports; it
would simply rely on the host to emulate each command with
whatever commands the host can actually issue. That would also
address one of your comments in the conversation below.
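As an illustration of host-side emulation, here is a minimal sketch that replaces one range invalidation with per-granule commands on hardware without range support. It assumes the range covers (NUM + 1) * 2^SCALE granules and that TG values 1/2/3 select 4K/16K/64K; `emulate_range()`, `issue_fn` and `count_issue()` are hypothetical, with the callback standing in for queueing a real non-range TLBI.

```c
#include <stdint.h>

/* Stand-in for "queue one non-range TLBI for this address". */
typedef void (*issue_fn)(uint64_t va, void *ctx);

/* Bytes covered by a range TLBI: (NUM + 1) * 2^SCALE granules (assumed). */
static uint64_t tlbi_range_bytes(uint8_t tg, uint8_t scale, uint8_t num)
{
	unsigned int granule_shift = 10 + 2 * tg; /* TG 1/2/3 -> 4K/16K/64K */

	return ((uint64_t)num + 1) << (scale + granule_shift);
}

/*
 * Walk the whole range at granule size so the emulation never covers
 * less than the guest asked for. Returns the number of commands issued.
 */
static uint64_t emulate_range(uint64_t va, uint8_t tg, uint8_t scale,
			      uint8_t num, issue_fn issue, void *ctx)
{
	uint64_t granule = 1ULL << (10 + 2 * tg);
	uint64_t end = va + tlbi_range_bytes(tg, scale, num);
	uint64_t issued = 0;

	for (; va < end; va += granule) {
		issue(va, ctx);
		issued++;
	}
	return issued;
}

/* Example callback: just count how many commands would be queued. */
static void count_issue(uint64_t va, void *ctx)
{
	(void)va;
	(*(uint64_t *)ctx)++;
}
```

For very large ranges a host would more likely fall back to an ASID-wide or full invalidation than issue millions of commands; the sketch leaves that policy out.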

> > > > > > > What about NSNH_ALL? That still needs to invalidate all the S1 context
> > > > > > > that the guest *thinks* it's invalidating.
> > > > > >
> > > > > > NSNH_ALL is translated to NH_ALL at the guest level. But maybe
> > > > > > it should have been done here instead.
> > > > >
> > > > > Yes. It seems the worst of both worlds to have an interface which takes
> > > > > raw opcodes rather than an enum of supported commands, but still
> > > > > requires userspace to know which opcodes are supported and which ones
> > > > > don't work as expected even though they are entirely reasonable to use
> > > > > in the context of the stage-1-only SMMU being emulated.
> > > >
> > > > Maybe a list of supported TLBI commands via the hw_info uAPI?
> > >
> > > I don't think it's all that difficult to implicitly support all commands
> > > that are valid for a stage-1-only SMMU, it just needs the right
> > > interface design to be capable of encoding them all completely and
> > > unambiguously. Coming back to the previous point about the address
> > > encoding, I think that means basing it more directly on the actual
> > > SMMUv3 commands, rather than on io-pgtable's abstraction of invalidation
> > > with SMMUv3 opcodes bolted on.
> >
> > Yea, with the actual commands from the guest, the host can do
> > something with its supported commands, I think.
>
> The one slightly fiddly case, of course, is CMD_SYNC, but I think that's
> just a matter for clear documentation of the expectations and behaviour.

What could be odd about CMD_SYNC?

Actually with QEMU, an ioctl for a CMD execution is initiated
by a CMDQ_PROD write trapped by QEMU, so a CMD_SYNC only
triggers an IRQ in this setup.
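A minimal sketch of that flow as I picture it, not QEMU's actual code: on a trapped CMDQ_PROD write the VMM walks the pending commands, forwards the invalidations, and treats CMD_SYNC as a completion marker only. The CMD_SYNC opcode value is an assumption, and `process_batch()`, `forward_fn` and `forward_s1_tlbi()` are hypothetical, with the callback standing in for the invalidation ioctl.

```c
#include <stddef.h>
#include <stdint.h>

#define OP_CMD_SYNC 0x46 /* assumed opcode value */

struct smmu_cmd {
	uint64_t w[2];
};

struct batch_result {
	size_t forwarded; /* commands handed to the host */
	size_t syncs;     /* CMD_SYNCs seen; each needs a completion signal */
	int bad_index;    /* index of the first rejected command, or -1 */
};

/* Stand-in for the invalidation ioctl: 0 on success, nonzero on rejection. */
typedef int (*forward_fn)(const struct smmu_cmd *cmd);

static struct batch_result process_batch(const struct smmu_cmd *cmds, size_t n,
					 forward_fn forward)
{
	struct batch_result r = { 0, 0, -1 };

	for (size_t i = 0; i < n; i++) {
		uint8_t op = (uint8_t)(cmds[i].w[0] & 0xff);

		if (op == OP_CMD_SYNC) {
			r.syncs++; /* the VMM raises the completion IRQ/MSI itself */
			continue;
		}
		if (forward(&cmds[i]) != 0) {
			r.bad_index = (int)i; /* where an error would be reported */
			break;
		}
		r.forwarded++;
	}
	return r;
}

/* Example callback: accept only opcodes 0x10..0x13 (assumed S1 TLBI range). */
static int forward_s1_tlbi(const struct smmu_cmd *cmd)
{
	uint8_t op = (uint8_t)(cmd->w[0] & 0xff);

	return (op >= 0x10 && op <= 0x13) ? 0 : -1;
}
```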

Thanks
Nic

2023-03-17 09:24:35

by Tian, Kevin

Subject: RE: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, March 9, 2023 11:31 PM
>
> > Also, perhaps I've overlooked something obvious, but what's the procedure
> > for reflecting illegal commands back to userspace? Some of the things we're
> > silently ignoring here would be expected to raise CERROR_ILL. Same goes for
> > all the other fault events which may occur due to invalid S1 config, come to
> > think of it.
>
> Perhaps the ioctl should fail and the userspace viommu should inject
> this CERROR_ILL?
>
> But I'm also wondering if we are making a mistake by not just having the
> kernel driver expose a SW work queue in its native format and the
> ioctl is only just 'read the queue'. Then it could (asynchronously!)
> push back answers, real or emulated, as well, including all error
> indications.
>
> I think we got down this synchronous one-ioctl-per-invalidation path
> because that was what the original generic stuff wanted to do. Is it
> what we really want? Kevin what is your perspective?
>

That's an interesting idea. I think the original synchronous model
also matches how the intel-iommu driver works today. Most of the time
it does one synchronous invalidation at a time.

Another problem is how to map invalidation scope in native descriptor
format to affected devices.

VT-d allows per-DID invalidation. This needs extra information to map
vDID to affected devices in the kernel.

It also allows a global invalidation type which invalidates all vDIDs. This
might be easy by simply looping every device bound to the iommufd_ctx.


2023-03-17 09:41:44

by Tian, Kevin

Subject: RE: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

> From: Jason Gunthorpe <[email protected]>
> Sent: Saturday, March 11, 2023 12:20 AM
>
> What I'm broadly thinking is if we have to make the infrastructure for
> VCMDQ HW accelerated invalidation then it is not a big step to also
> have the kernel SW path use the same infrastructure just with a CPU
> wake up instead of a MMIO poke.
>
> Ie we have a SW version of VCMDQ to speed up SMMUv3 cases without HW
> support.
>

I thought about this in the VT-d context. It looks like there are some difficulties.

The most prominent one is that head/tail of the VT-d invalidation queue
are in MMIO registers. Handling it in kernel iommu driver suggests
reading virtual tail register and updating virtual head register. Kind of
moving some vIOMMU awareness into the kernel which, iirc, is not
a welcomed model.

vhost doesn't have this problem as its vring structure fully resides in
memory including ring tail/head. As long as kernel vhost driver understands
the structure and can send/receive notification to/from kvm then the
in-kernel acceleration works seamlessly.

Not sure whether SMMU has a similar obstacle to VT-d. But this is my
impression of why vhost-iommu was preferred when such optimizations were
discussed before.

2023-03-17 09:47:56

by Tian, Kevin

Subject: RE: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

> From: Robin Murphy <[email protected]>
> Sent: Thursday, March 9, 2023 10:49 PM
> > + case CMDQ_OP_ATC_INV:
> > + ssid = inv_info->ssid;
> > + iova = inv_info->range.start;
> > + size = inv_info->range.last - inv_info->range.start + 1;
> > + break;
>
> Can we do any better than multiplying every single ATC_INV command, even
> for random bogus StreamIDs, into multiple commands across every physical
> device? In fact, I'm not entirely confident this isn't problematic, if
> the guest wishes to send invalidations for one device specifically while
> it's put some other device into a state where sending it a command would
> do something bad. At the very least, it's liable to be confusing if the
> guest sends a command for one StreamID but gets an error back for a
> different one.
>

Or do we need to support this cmd at all?

For vt-d we always implicitly invalidate the ATC following an iotlb invalidation
request from userspace. Then vIOMMU just treats it as a nop in the
virtual queue.

IMHO a sane iommu driver should always invalidate both iotlb and atc
together. I'm not sure there is a valid usage where iotlb is invalidated
while atc is left with some stale mappings.

2023-03-17 10:05:09

by Tian, Kevin

Subject: RE: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

> From: Robin Murphy <[email protected]>
> Sent: Friday, March 10, 2023 11:57 PM
>
> >
> > If iommufd could provide a general cross-driver API to set exactly
> > that scenario up then VMM code could also be general. That seems
> > pretty interesting.
>
> Indeed, I've always assumed the niche for virtio would be that kind of
> in-between use-case using nesting to accelerate simple translation,
> where we plug a guest-owned pagetable into a host-owned context. That
> way the guest retains the simple virtio interface and only needs to
> understand a pagetable format (or as you say, simply share a CPU
> pagetable) without having to care about the nitty-gritty of all the
> IOMMU-specific moving parts around it. For guests that want to get into
> more advanced stuff like managing their own PASID tables, pushing them
> towards "native" nesting probably does make more sense.
>

The interesting thing is that we cannot expose both virtio-iommu and
an emulated vIOMMU to one guest and let it choose. Then if the guest has
been using virtio-iommu for whatever reason, naturally it may
want more advanced features on it too.

2023-03-17 10:10:47

by Tian, Kevin

Subject: RE: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

> From: Jason Gunthorpe <[email protected]>
> Sent: Saturday, March 11, 2023 12:03 AM
>
> On Fri, Mar 10, 2023 at 03:57:27PM +0000, Robin Murphy wrote:
>
> > about the nitty-gritty of all the IOMMU-specific moving parts around it. For
> > guests that want to get into more advanced stuff like managing their own
> > PASID tables, pushing them towards "native" nesting probably does make more
> > sense.
>
> IMHO with the simplified virtio model I would say the guest should
> not have its own PASID table.
>
> hyper trap to install a PASID and let the hypervisor driver handle
> this abstractly. If abstractly is the whole point and benefit then
> virtio should lean into that.
>
> This also means virtio protocol doesn't do PASID invalidation. It
> invalidates an ASID and the hypervisor takes care of whatever it is
> connected to. Very simple and general for the VM.

This sounds fair, if ASID here refers to a general ID identifying the page
table instead of the ARM-specific ASID.

But the guest still needs to manage the PASID and program the PASID into
the assigned device to tag DMA.

>
> Adding a S1 iommu_domain op for invalidate address range is perfectly
> fine and the virtio kernel hypervisor driver can call it generically.
>
> The primary reason to have guest-owned PASID tables is CC stuff, which
> definitely won't be part of virtio-iommu.
>

This fits Intel well.

2023-03-17 10:17:23

by Tian, Kevin

Subject: RE: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, March 10, 2023 8:52 PM
>
> On Fri, Mar 10, 2023 at 12:33:12PM +0100, Eric Auger wrote:
>
> > I do agree with Jean. We spent a lot of efforts all together to define
> > this generic invalidation API and if there is compelling reason that
> > prevents from using it, we should try to reuse it.
>
> That's the compelling reason in a nutshell right there.
>
> A lot of time was invested to create something that might be
> general. We still don't know if it is well defined and general. Even
> more time is going to be required on it before it could go forward. In
> future more time will be needed for every future HW to try and fit
> into it. We don't even know if it will scale to future HW. Nobody has
> even checked what today's POWER and S390 HW need.
>
> vs, this stuff was made in a few days. We know it is correct as a uAPI
> since it mirrors the HW and we know it is scalable to different HW
> schemes if they come up.
>
> So I don't see a good reason to take a risk on a "general" uAPI. If we
> make this wrong it could seriously damage the main goal of iommufd -
> to build accelerated vIOMMU models.
>

I agree with this point. We can add a virtio format when it comes.

2023-03-17 14:16:36

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Fri, Mar 17, 2023 at 09:47:47AM +0000, Tian, Kevin wrote:
> > From: Robin Murphy <[email protected]>
> > Sent: Thursday, March 9, 2023 10:49 PM
> > > + case CMDQ_OP_ATC_INV:
> > > + ssid = inv_info->ssid;
> > > + iova = inv_info->range.start;
> > > + size = inv_info->range.last - inv_info->range.start + 1;
> > > + break;
> >
> > Can we do any better than multiplying every single ATC_INV command, even
> > for random bogus StreamIDs, into multiple commands across every physical
> > device? In fact, I'm not entirely confident this isn't problematic, if
> > the guest wishes to send invalidations for one device specifically while
> > it's put some other device into a state where sending it a command would
> > do something bad. At the very least, it's liable to be confusing if the
> > guest sends a command for one StreamID but gets an error back for a
> > different one.
> >
>
> Or do we need support this cmd at all?
>
> For vt-d we always implicitly invalidate ATC following a iotlb invalidation
> request from userspace. Then vIOMMU just treats it as a nop in the
> virtual queue.
>
> IMHO a sane iommu driver should always invalidate both iotlb and atc
> together. I'm not sure a valid usage where iotlb is invalidated while
> atc is left with some stale mappings.

vSMMU code in QEMU actually doesn't forward this command. So,
I guess that you are right about this support here and we may
just drop it.

Thanks!
Nic

2023-03-17 14:24:23

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Fri, Mar 17, 2023 at 09:41:34AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Saturday, March 11, 2023 12:20 AM
> >
> > What I'm broadly thinking is if we have to make the infrastructure for
> > VCMDQ HW accelerated invalidation then it is not a big step to also
> > have the kernel SW path use the same infrastructure just with a CPU
> > wake up instead of a MMIO poke.
> >
> > Ie we have a SW version of VCMDQ to speed up SMMUv3 cases without HW
> > support.
> >
>
> I thought about this in VT-d context. Looks there are some difficulties.
>
> The most prominent one is that head/tail of the VT-d invalidation queue
> are in MMIO registers. Handling it in kernel iommu driver suggests
> reading virtual tail register and updating virtual head register. Kind of
> moving some vIOMMU awareness into the kernel which, iirc, is not
> a welcomed model.

I had a similar question in another email:
"Firstly, the consumer and producer indexes might need
to be synced between the host and kernel?"
And Jason replied with this:
"No, qemu would handle this. The kernel would just read the command
entries it is told by qemu to read which qemu has already sorted out."

So maybe the head/tail reads are not a concern?

> vhost doesn't have this problem as its vring structure fully resides in
> memory including ring tail/head. As long as kernel vhost driver understands
> the structure and can send/receive notification to/from kvm then the
> in-kernel acceleration works seamlessly.
>
> Not sure whether SMMU has similar obstacle as VT-d. But this is my
> impression why vhost-iommu is preferred when talking about such
> optimization before.

SMMU has a similar pair of head/tail pointers for its invalidation
queue (consumer/producer indexes and command queue in SMMU terms).

Thanks
Nic

2023-03-20 01:38:01

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Thu, Mar 16, 2023 at 02:09:08PM -0700, Nicolin Chen wrote:
> On Thu, Mar 16, 2023 at 02:58:39PM +0000, Robin Murphy wrote:
>
> > > > > > I really think UAPI should reflect the hardware and encode TG and TTL
> > > > > > directly. Especially since there's technically a flaw in the current
> > > > > > driver where we assume TTL in cases where it isn't actually known, thus
> > > > > > may potentially fail to invalidate level 2 block entries when removing a
> > > > > > level 1 table, since io-pgtable passes the level 3 granule in that case.
> > > > >
> > > > > Do you mean something like hw_info forwarding pgsize_bitmap/tg
> > > > > to the guest? Or the other direction?
> > > >
> > > > I mean if the interface wants to support range invalidations in a way
> > > > which works correctly, then it should ideally carry both the TG and TTL
> > > > fields from the guest command straight through to the host. If not, then
> > > > at the very least the host must always assume TTL=0, because it cannot
> > > > correctly infer otherwise once the guest command's original intent has
> > > > been lost.
> > >
> > > Oh, it's about hypervisor simply forwarding the entire CMD to
> > > the host side. Jason is suggesting a fast approach by letting
> > > host kernel read the CMDQ directly to get the raw CMD. Perhaps
> > > that would address this comments about TG/TTL too.
> >
> > That did cross my mind, but given the usage model, having host userspace
> > give guest memory whose contents it can't control (unless it pauses the
> > whole VM on all CPUs) directly to the host kernel just seems to invite
> > more potential problems than necessary. Commands aren't big, so I think
> > it's fair to expect the VMM to marshal them into host memory, and save
> > the host kernel from ever having to reason about any races or other
> > emulation details which may exist between a VM and its VMM.
>
> An invalidation ioctl is synchronously executed from the top
> level in QEMU when it traps any CMDQ_PROD write. So, either
> packing the fields of a command into a data structure or just
> forwarding the command directly, it seems to be the same for
> the matter of worrying about race conditions?

I think I misread your reply here :)

What you suggested is exactly forwarding the command, vs. the host
reading the guest's command queue memory.

Although I haven't fully understood Jason's "sorting" approach,
this could already simplify the data structure holding all the
fields, by passing a "__u64 cmd[2]" alone. Some sample code:

+struct iommu_hwpt_invalidate_arm_smmuv3 {
+ struct iommu_iova_range range;
+ __u64 cmd[2];
+};

then...

+ cmd[0] = inv_info->cmd[0];
+ cmd[1] = inv_info->cmd[1];
+ switch (cmd[0] & 0xff) {
+ case CMDQ_OP_TLBI_NSNH_ALL:
+ cmd[0] &= ~0xffULL;
+ cmd[0] |= CMDQ_OP_TLBI_NH_ALL;
+ fallthrough;
+ case CMDQ_OP_TLBI_NH_VA:
+ case CMDQ_OP_TLBI_NH_VAA:
+ case CMDQ_OP_TLBI_NH_ALL:
+ case CMDQ_OP_TLBI_NH_ASID:
+ cmd[0] &= ~CMDQ_TLBI_0_VMID;
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, smmu_domain->s2->s2_cfg.vmid);
+ arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, true);
+ break;
+ case CMDQ_OP_CFGI_CD:
+ case CMDQ_OP_CFGI_CD_ALL:
+ arm_smmu_sync_cd(smmu_domain,
+ FIELD_GET(CMDQ_CFGI_0_SSID, cmd[0]), false);
+ break;
+ default:
+ return;
+ }

We could probably do batch forwarding too, if it's worthwhile?

Thanks
Nic
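
For illustration, the opcode and VMID fix-up in the sample above can be
tried out in user space. This is only a minimal sketch: the helper name
fixup_guest_tlbi() is hypothetical, and the opcode values and the VMID
position (cmd[0] bits 47:32) are assumptions that should be checked
against the SMMUv3 specification. Unlike the sample, it reports
unsupported commands to the caller instead of silently returning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed field layout: opcode in cmd[0][7:0], VMID in cmd[0][47:32]. */
#define OP_MASK    0xffULL
#define VMID_SHIFT 32
#define VMID_MASK  (0xffffULL << VMID_SHIFT)

/* Assumed opcode values; verify against the SMMUv3 specification. */
enum {
	OP_TLBI_NH_ALL   = 0x10,
	OP_TLBI_NH_ASID  = 0x11,
	OP_TLBI_NH_VA    = 0x12,
	OP_TLBI_NH_VAA   = 0x13,
	OP_TLBI_NSNH_ALL = 0x30,
};

/*
 * Hypothetical helper: rewrite a guest stage-1 TLBI command in place so
 * that it only affects the VMID owned by this VM. Returns false for any
 * command it does not recognise, leaving that command untouched.
 */
static bool fixup_guest_tlbi(uint64_t cmd[2], uint16_t host_vmid)
{
	assert(cmd);

	switch (cmd[0] & OP_MASK) {
	case OP_TLBI_NSNH_ALL:
		/* Narrow "invalidate everything" down to this VM only. */
		cmd[0] = (cmd[0] & ~OP_MASK) | OP_TLBI_NH_ALL;
		/* fall through */
	case OP_TLBI_NH_ALL:
	case OP_TLBI_NH_ASID:
	case OP_TLBI_NH_VA:
	case OP_TLBI_NH_VAA:
		/* Never trust the guest's VMID; force the host-assigned one. */
		cmd[0] = (cmd[0] & ~VMID_MASK) |
			 ((uint64_t)host_vmid << VMID_SHIFT);
		return true;
	default:
		return false;
	}
}
```

A false return is where a real implementation would have to decide how
to reflect an illegal command back to the guest, e.g. as CERROR_ILL.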

2023-03-20 12:59:38

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Fri, Mar 17, 2023 at 09:41:34AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Saturday, March 11, 2023 12:20 AM
> >
> > What I'm broadly thinking is if we have to make the infrastructure for
> > VCMDQ HW accelerated invalidation then it is not a big step to also
> > have the kernel SW path use the same infrastructure just with a CPU
> > wake up instead of a MMIO poke.
> >
> > Ie we have a SW version of VCMDQ to speed up SMMUv3 cases without HW
> > support.
> >
>
> I thought about this in VT-d context. Looks there are some difficulties.
>
> The most prominent one is that head/tail of the VT-d invalidation queue
> are in MMIO registers. Handling it in kernel iommu driver suggests
> reading virtual tail register and updating virtual head register. Kind of
> moving some vIOMMU awareness into the kernel which, iirc, is not
> a welcomed model.

qemu would trap the MMIO and generate an IOCTL with the written head
pointer. It isn't as efficient as having the kernel do the trap, but
does give batching.

Jason

2023-03-20 13:03:26

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Sat, Mar 11, 2023 at 03:56:50AM -0800, Nicolin Chen wrote:

> I recall that one difficulty is to pass the vSID from the guest
> down to the host kernel driver and to link with the pSID. What I
> did previously for VCMDQ was to set the SID_MATCH register with
> iommu_group_id(group) and set the SID_REPLACE register with the
> pSID. Then hyper will use the iommu_group_id to search for the
> pair of the registers, and to set vSID. Perhaps we should think
> of something smarter.

We need an ioctl for this, I think. To load a map of vSID to dev_id
into the driver. Kernel will convert dev_id to pSID. Driver will
program the map into HW.

SW path will program the map into an xarray

> > I suspect the answer to Robin's question on how to handle errors is
> > the most important deciding factor. If we have to capture and relay
> > actual HW errors back to userspace that really suggests we should do
> > something different than a synchronous ioctl.
>
> A synchronous ioctl is to return some values other than defining
> cache_invalidate_user as void, like we are doing now? An fault
> injection pathway to report CERROR asynchronously is what we've
> been doing though -- even with Eric's previous VFIO solution.

Where is this? How does it look?

Jason

2023-03-20 13:12:12

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Sun, Mar 19, 2023 at 06:32:03PM -0700, Nicolin Chen wrote:

> +struct iommu_hwpt_invalidate_arm_smmuv3 {
> + struct iommu_iova_range range;

what is this?

> + __u64 cmd[2];
> +};

You still have to do something with the SID. We can't just allow any
un-validated SID value - the driver has to check the incoming SID
against allowed SIDs for this iommufd_ctx

Jason

2023-03-20 15:37:04

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 10:11:54AM -0300, Jason Gunthorpe wrote:
> On Sun, Mar 19, 2023 at 06:32:03PM -0700, Nicolin Chen wrote:
>
> > +struct iommu_hwpt_invalidate_arm_smmuv3 {
> > + struct iommu_iova_range range;
>
> what is this?

Not used. A copy-n-paste mistake :(

>
> > + __u64 cmd[2];
> > +};
>
> You still have to do something with the SID. We can't just allow any
> un-validated SID value - the driver has to check the incoming SID
> against allowed SIDs for this iommufd_ctx

Hmm, that's something "missing" even in the current design.

Yet, most of the TLBI commands don't hold an SID field. So,
the hypervisor, only trapping a queue write-pointer movement,
cannot get the exact vSID for a TLBI command. What our QEMU
code currently does is simply broadcast to all the devices
on the list of devices attached to the vSMMU, which means
that such an enforcement in the kernel would basically just
allow any vSID (device) that's attached to the domain?

Thanks
Nic

2023-03-20 16:07:10

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 10:03:04AM -0300, Jason Gunthorpe wrote:
> On Sat, Mar 11, 2023 at 03:56:50AM -0800, Nicolin Chen wrote:
>
> > I recall that one difficulty is to pass the vSID from the guest
> > down to the host kernel driver and to link with the pSID. What I
> > did previously for VCMDQ was to set the SID_MATCH register with
> > iommu_group_id(group) and set the SID_REPLACE register with the
> > pSID. Then hyper will use the iommu_group_id to search for the
> > pair of the registers, and to set vSID. Perhaps we should think
> > of something smarter.
>
> We need an ioctl for this, I think. To load a map of vSID to dev_id
> into the driver. Kernel will convert dev_id to pSID. Driver will
> program the map into HW.

Can we just pass a vSID via the alloc ioctl like this?

-----------------------------------------------------------
@@ -429,7 +429,7 @@ struct iommu_hwpt_arm_smmuv3 {
#define IOMMU_SMMUV3_FLAG_VMID (1 << 1) /* vmid override */
__u64 flags;
__u32 s2vmid;
- __u32 __reserved;
+ __u32 sid;
__u64 s1ctxptr;
__u64 s1cdmax;
__u64 s1fmt;
-----------------------------------------------------------

An alloc is initiated by an SMMU_CMD_CFGI_STE command that has
an SID field anyway.

> SW path will program the map into an xarray

I found a tricky thing about SIDs in the SMMU driver when doing
this experiment: the SMMU kernel driver mostly handles devices
using struct arm_smmu_master. However, an arm_smmu_master might
have num_streams > 1, meaning a device can have multiple SIDs.
Though PCI devices might not be in this scope, a plain xarray
might not work for other types of devices in the long run, if
there are any?

> > > I suspect the answer to Robin's question on how to handle errors is
> > > the most important deciding factor. If we have to capture and relay
> > > actual HW errors back to userspace that really suggests we should do
> > > something different than a synchronous ioctl.
> >
> > A synchronous ioctl is to return some values other than defining
> > cache_invalidate_user as void, like we are doing now? An fault
> > injection pathway to report CERROR asynchronously is what we've
> > been doing though -- even with Eric's previous VFIO solution.
>
> Where is this? How does it look?

That's postponed with the PRI support, right? My use case does
not need PRI actually, but a fault injection pathway to guests.
This pathway should be able to take care of any CERROR (detected
by a host interrupt) or something funky in cache_invalidate_user
requests itself?

Thanks
Nic

2023-03-20 16:13:26

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 08:28:05AM -0700, Nicolin Chen wrote:
> On Mon, Mar 20, 2023 at 10:11:54AM -0300, Jason Gunthorpe wrote:
> > On Sun, Mar 19, 2023 at 06:32:03PM -0700, Nicolin Chen wrote:
> >
> > > +struct iommu_hwpt_invalidate_arm_smmuv3 {
> > > + struct iommu_iova_range range;
> >
> > what is this?
>
> Not used. A copy-n-paste mistake :(
>
> >
> > > + __u64 cmd[2];
> > > +};
> >
> > You still have to do something with the SID. We can't just allow any
> > un-validated SID value - the driver has to check the incoming SID
> > against allowed SIDs for this iommufd_ctx
>
> Hmm, that's something "missing" even in the current design.
>
> Yet, most of the TLBI commands don't hold an SID field. So,
> the hypervisor only trapping a queue write-pointer movement
> cannot get the exact vSID for a TLBI command. What our QEMU
> code currently does is simply broadcasting all the devices
> on the list of attaching devices to the vSMMU, which means
> that such an enforcement in the kernel would basically just
> allow any vSID (device) that's attached to the domain?

SID is only used for managing the ATC as far as I know. It is because
the ASID doesn't convey enough information to determine what PCI RID
to generate an ATC invalidation for.

We shouldn't be broadcasting for efficiency, at least it should not be
baked into the API.

You need to know what devices the vSID is targeting and issue
invalidations only for those devices.

Jason

2023-03-20 16:15:54

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 08:56:00AM -0700, Nicolin Chen wrote:
> On Mon, Mar 20, 2023 at 10:03:04AM -0300, Jason Gunthorpe wrote:
> > On Sat, Mar 11, 2023 at 03:56:50AM -0800, Nicolin Chen wrote:
> >
> > > I recall that one difficulty is to pass the vSID from the guest
> > > down to the host kernel driver and to link with the pSID. What I
> > > did previously for VCMDQ was to set the SID_MATCH register with
> > > iommu_group_id(group) and set the SID_REPLACE register with the
> > > pSID. Then hyper will use the iommu_group_id to search for the
> > > pair of the registers, and to set vSID. Perhaps we should think
> > > of something smarter.
> >
> > We need an ioctl for this, I think. To load a map of vSID to dev_id
> > into the driver. Kernel will convert dev_id to pSID. Driver will
> > program the map into HW.
>
> Can we just pass a vSID via the alloc ioctl like this?
>
> -----------------------------------------------------------
> @@ -429,7 +429,7 @@ struct iommu_hwpt_arm_smmuv3 {
> #define IOMMU_SMMUV3_FLAG_VMID (1 << 1) /* vmid override */
> __u64 flags;
> __u32 s2vmid;
> - __u32 __reserved;
> + __u32 sid;
> __u64 s1ctxptr;
> __u64 s1cdmax;
> __u64 s1fmt;
> -----------------------------------------------------------
>
> An alloc is initiated by an SMMU_CMD_CFGI_STE command that has
> an SID field anyway.

No, a HWPT is not a device or a SID. A HWPT is an ASID in the ARM
model.

dev_id is the SID.

The cfgi_ste will carry the vSID which is mapped to an iommufd dev_id.

The kernel has to translate the vSID to the dev_id to the pSID to
issue an ATC invalidation for the correct entity.

> > SW path will program the map into an xarray
>
> I found a tricky thing about SIDs in the SMMU driver when doing
> this experiment: the SMMU kernel driver mostly handles devices
> using struct arm_smmu_master. However, an arm_smmu_master might
> have a num_streams>1, meaning a device can have multiple SIDs.
> Though it seems that PCI devices might not be in this scope, a
> plain xarray might not work for other type of devices in a long
> run, if there'd be?

You'd replicate each of the vSIDs of the extra SIDs in the xarray.

> > > cache_invalidate_user as void, like we are doing now? An fault
> > > injection pathway to report CERROR asynchronously is what we've
> > > been doing though -- even with Eric's previous VFIO solution.
> >
> > Where is this? How does it look?
>
> That's postponed with the PRI support, right? My use case does
> not need PRI actually, but a fault injection pathway to guests.
> This pathway should be able to take care of any CERROR (detected
> by a host interrupt) or something funky in cache_invalidate_user
> requests itself?

I would expect that if invalidation can fail that we have a way to
signal that failure back to the guest.

Jason
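
As a rough sketch of the vSID to pSID translation described above, here
is a user-space stand-in for the xarray. All names (struct sid_map,
sid_map_add(), sid_map_lookup()) are hypothetical, the dev_id step is
folded away, and a fixed array replaces the real xarray purely to keep
the example self-contained. A device with several streams simply gets
one entry per SID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SIDS 16

/* One vSID -> pSID pair; a multi-stream device contributes several. */
struct sid_map_entry {
	uint32_t vsid;
	uint32_t psid;
};

struct sid_map {
	struct sid_map_entry e[MAX_SIDS];
	size_t n;
};

/* Returns 0 on success, -1 if the vSID already exists or the map is full. */
static int sid_map_add(struct sid_map *m, uint32_t vsid, uint32_t psid)
{
	assert(m);
	for (size_t i = 0; i < m->n; i++)
		if (m->e[i].vsid == vsid)
			return -1;
	if (m->n == MAX_SIDS)
		return -1;
	m->e[m->n].vsid = vsid;
	m->e[m->n].psid = psid;
	m->n++;
	return 0;
}

/*
 * Returns 0 and fills *psid if the vSID was registered, -1 otherwise.
 * An unknown vSID must be rejected rather than broadcast.
 */
static int sid_map_lookup(const struct sid_map *m, uint32_t vsid,
			  uint32_t *psid)
{
	assert(m && psid);
	for (size_t i = 0; i < m->n; i++) {
		if (m->e[i].vsid == vsid) {
			*psid = m->e[i].psid;
			return 0;
		}
	}
	return -1;
}
```

With such a map, an ATC invalidation carrying a vSID that was never
registered fails the lookup, which also gives the driver the SID
validation asked for earlier in the thread.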

2023-03-20 16:21:16

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 09:59:23AM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 17, 2023 at 09:41:34AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Saturday, March 11, 2023 12:20 AM
> > >
> > > What I'm broadly thinking is if we have to make the infrastructure for
> > > VCMDQ HW accelerated invalidation then it is not a big step to also
> > > have the kernel SW path use the same infrastructure just with a CPU
> > > wake up instead of a MMIO poke.
> > >
> > > Ie we have a SW version of VCMDQ to speed up SMMUv3 cases without HW
> > > support.
> > >
> >
> > I thought about this in VT-d context. Looks there are some difficulties.
> >
> > The most prominent one is that head/tail of the VT-d invalidation queue
> > are in MMIO registers. Handling it in kernel iommu driver suggests
> > reading virtual tail register and updating virtual head register. Kind of
> > moving some vIOMMU awareness into the kernel which, iirc, is not
> > a welcomed model.
>
> qemu would trap the MMIO and generate an IOCTL with the written head
> pointer. It isn't as efficient as having the kernel do the trap, but
> does give batching.

Rephrasing that to put into a design: the IOCTL would pass a
user pointer to the queue, the size of the queue, then a head
pointer and a tail pointer? Then the kernel reads out all the
commands between the head and the tail and handles all those
invalidation commands only?

Thanks
Nic
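
To make the question concrete, here is a minimal user-space sketch of
consuming the commands between a consumer and a producer index. It
assumes the SMMUv3-style encoding where each index carries a wrap flag
in the bit just above the index bits, and it assumes the VMM has already
copied the queue into host memory (as suggested earlier in the thread).
The names consume_cmds() and sum_opcodes() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 16-byte command, as two 64-bit words. */
struct cmd {
	uint64_t w[2];
};

typedef void (*cmd_handler)(const struct cmd *c, void *ctx);

/*
 * Walk a queue of (1 << log2size) entries from *cons up to prod.
 * Both indexes hold the entry index in bits [log2size-1:0] and a wrap
 * flag in bit log2size (assumed encoding). Returns the number of
 * commands handled, or -1 if the indexes describe more pending entries
 * than the queue can hold.
 */
static int consume_cmds(const struct cmd *q, unsigned int log2size,
			uint32_t *cons, uint32_t prod,
			cmd_handler fn, void *ctx)
{
	assert(q && cons && fn && log2size < 20);

	uint32_t size = 1u << log2size;
	uint32_t idx_mask = size - 1;
	uint32_t all_mask = (size << 1) - 1;	/* index bits + wrap bit */
	uint32_t pending = (prod - *cons) & all_mask;

	if (pending > size)
		return -1;

	for (uint32_t i = 0; i < pending; i++) {
		fn(&q[*cons & idx_mask], ctx);
		*cons = (*cons + 1) & all_mask;	/* toggles wrap on overflow */
	}
	return (int)pending;
}

/* Example handler: accumulate the opcode byte of every command seen. */
static void sum_opcodes(const struct cmd *c, void *ctx)
{
	*(uint64_t *)ctx += c->w[0] & 0xff;
}
```

Passing the indexes rather than a count lets one call cover a batch that
wraps around the end of the queue.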

2023-03-20 16:41:14

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 01:01:53PM -0300, Jason Gunthorpe wrote:
> On Mon, Mar 20, 2023 at 08:28:05AM -0700, Nicolin Chen wrote:
> > On Mon, Mar 20, 2023 at 10:11:54AM -0300, Jason Gunthorpe wrote:
> > > On Sun, Mar 19, 2023 at 06:32:03PM -0700, Nicolin Chen wrote:
> > >
> > > > +struct iommu_hwpt_invalidate_arm_smmuv3 {
> > > > + struct iommu_iova_range range;
> > >
> > > what is this?
> >
> > Not used. A copy-n-paste mistake :(
> >
> > >
> > > > + __u64 cmd[2];
> > > > +};
> > >
> > > You still have to do something with the SID. We can't just allow any
> > > un-validated SID value - the driver has to check the incoming SID
> > > against allowed SIDs for this iommufd_ctx
> >
> > Hmm, that's something "missing" even in the current design.
> >
> > Yet, most of the TLBI commands don't hold an SID field. So,
> > the hypervisor only trapping a queue write-pointer movement
> > cannot get the exact vSID for a TLBI command. What our QEMU
> > code currently does is simply broadcasting all the devices
> > on the list of attaching devices to the vSMMU, which means
> > that such an enforcement in the kernel would basically just
> > allow any vSID (device) that's attached to the domain?
>
> SID is only used for managing the ATC as far as I know. It is because
> the ASID doesn't convey enough information to determine what PCI RID
> to generate an ATC invalidation for.

Yes. And a CD invalidation too, though the kernel eventually
would do a broadcast to all devices that are using the same
CD.

> We shouldn't be broadcasting for efficiency, at least it should not be
> baked into the API.
>
> You need to know what devices the vSID is targeting and issue
> invalidations only for those devices.

I agree with that, yet cannot think of a solution to achieve
that out of vSID. QEMU code by means of emulating a physical
SMMU only reads the commands from the queue, without knowing
which device (vSID) actually sent these commands.

I probably can do something to the solution that is doing an
entire broadcasting, with the ASID fields from the commands,
yet it'd only improve the situation by having an ASID-based
broadcasting...

Thanks
Nic

2023-03-20 17:06:35

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 01:04:35PM -0300, Jason Gunthorpe wrote:

> > > We need an ioctl for this, I think. To load a map of vSID to dev_id
> > > into the driver. Kernel will convert dev_id to pSID. Driver will
> > > program the map into HW.
> >
> > Can we just pass a vSID via the alloc ioctl like this?
> >
> > -----------------------------------------------------------
> > @@ -429,7 +429,7 @@ struct iommu_hwpt_arm_smmuv3 {
> > #define IOMMU_SMMUV3_FLAG_VMID (1 << 1) /* vmid override */
> > __u64 flags;
> > __u32 s2vmid;
> > - __u32 __reserved;
> > + __u32 sid;
> > __u64 s1ctxptr;
> > __u64 s1cdmax;
> > __u64 s1fmt;
> > -----------------------------------------------------------
> >
> > An alloc is initiated by an SMMU_CMD_CFGI_STE command that has
> > an SID field anyway.
>
> No, a HWPT is not a device or a SID. a HWPT is an ASID in the ARM
> model.
>
> dev_id is the SID.
>
> The cfgi_ste will carry the vSID which is mapped to a iommufd dev_id.
>
> The kernel has to translate the vSID to the dev_id to the pSID to
> issue an ATC invalidation for the correct entity.

OK. This narrative makes sense. I think our solution (the entire
stack) here mixes these two terms between HWPT/ASID and STE/SID.

What QEMU does is trapping an SMMU_CMD_CFGI_STE command to send
the host an HWPT alloc ioctl. The former one is based on an SID
or a device, while the latter one is based on ASID.

So the correct way should be for QEMU to maintain an ASID-based
list, corresponding to the s1ctxptr from STEs, and only send an
alloc ioctl upon a new s1ctxptr/ASID. Meanwhile, at every trap
of SMMU_CMD_CFGI_STE, it calls a separate ioctl to tie a vSID to
a dev_id (and pSID accordingly).

In other words, an SMMU_CMD_CFGI_STE should do a mandatory SID
ioctl and an optional HWPT alloc ioctl (only allocates a HWPT if
the s1ctxptr in the STE is new).

What could be a good prototype of the ioctl? Would it be a VFIO
device one or IOMMUFD one?

> > > SW path will program the map into an xarray
> >
> > I found a tricky thing about SIDs in the SMMU driver when doing
> > this experiment: the SMMU kernel driver mostly handles devices
> > using struct arm_smmu_master. However, an arm_smmu_master might
> > have a num_streams>1, meaning a device can have multiple SIDs.
> > Though it seems that PCI devices might not be in this scope, a
> > plain xarray might not work for other types of devices in the long
> > run, if there are any?
>
> You'd replicate each of the vSIDs of the extra SIDs in the xarray.

Noted it down.

> > > > cache_invalidate_user as void, like we are doing now? A fault
> > > > injection pathway to report CERROR asynchronously is what we've
> > > > been doing though -- even with Eric's previous VFIO solution.
> > >
> > > Where is this? How does it look?
> >
> > That's postponed with the PRI support, right? My use case does
> > not need PRI actually, but a fault injection pathway to guests.
> > This pathway should be able to take care of any CERROR (detected
> > by a host interrupt) or something funky in cache_invalidate_user
> > requests itself?
>
> I would expect that if invalidation can fail that we have a way to
> signal that failure back to the guest.

That's plausible to me, and it could apply to a translation
fault too. So, should we add back the iommufd infrastructure
for the fault injection (without PRI), in v2?

Thanks
Nic

2023-03-20 18:08:48

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 09:12:06AM -0700, Nicolin Chen wrote:
> On Mon, Mar 20, 2023 at 09:59:23AM -0300, Jason Gunthorpe wrote:
> > On Fri, Mar 17, 2023 at 09:41:34AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <[email protected]>
> > > > Sent: Saturday, March 11, 2023 12:20 AM
> > > >
> > > > What I'm broadly thinking is if we have to make the infrastructure for
> > > > VCMDQ HW accelerated invalidation then it is not a big step to also
> > > > have the kernel SW path use the same infrastructure just with a CPU
> > > > wake up instead of a MMIO poke.
> > > >
> > > > Ie we have a SW version of VCMDQ to speed up SMMUv3 cases without HW
> > > > support.
> > > >
> > >
> > > I thought about this in VT-d context. Looks there are some difficulties.
> > >
> > > The most prominent one is that head/tail of the VT-d invalidation queue
> > > are in MMIO registers. Handling it in kernel iommu driver suggests
> > > reading virtual tail register and updating virtual head register. Kind of
> > > moving some vIOMMU awareness into the kernel which, iirc, is not
> > > a welcomed model.
> >
> > qemu would trap the MMIO and generate an IOCTL with the written head
> > pointer. It isn't as efficient as having the kernel do the trap, but
> > does give batching.
>
> Rephrasing that to put into a design: the IOCTL would pass a
> user pointer to the queue, the size of the queue, then a head
> pointer and a tail pointer? Then the kernel reads out all the
> commands between the head and the tail and handles all those
> invalidation commands only?

Yes, that is one possible design
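
To make that variant concrete, here is a minimal sketch of reading the
commands between the two indices of a ring. Everything named demo_* is
hypothetical and not part of any posted patch. It assumes "head" is the
first unread entry and "tail" is one past the last valid entry, both as
plain entry indices (no wrap bit as in the real PROD/CONS registers).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte command, the size of one SMMUv3 command. */
struct demo_cmd {
	uint64_t data[2];
};

/*
 * Copy the commands in [head, tail) out of a ring of @nents entries,
 * wrapping at the end of the ring, into the linear array @out.
 * Returns the number of commands copied (at most @out_max).
 * Out-of-range indices are rejected by returning 0.
 */
static size_t demo_batch_collect(const struct demo_cmd *ring, uint32_t nents,
				 uint32_t head, uint32_t tail,
				 struct demo_cmd *out, size_t out_max)
{
	size_t n = 0;

	if (nents == 0 || head >= nents || tail >= nents)
		return 0;

	while (head != tail && n < out_max) {
		out[n++] = ring[head];
		head = (head + 1) % nents;
	}
	return n;
}
```

A real handler would copy the ring from user memory first and would have
to report how far it got, so a failing command can be signalled back.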

Jason

2023-03-20 18:14:37

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 09:35:20AM -0700, Nicolin Chen wrote:

> > You need to know what devices the vSID is targeting and issue
> > invalidations only for those devices.
>
> I agree with that, yet cannot think of a solution to achieve
> that out of vSID. QEMU code by means of emulating a physical
> SMMU only reads the commands from the queue, without knowing
> which device (vSID) actually sent these commands.

Huh?

CMD_ATC_INV has the SID

Other commands have the ASID.

You never need to cross an ASID to a SID or vice versa.

If the guest is aware of ATS it will issue CMD_ATC_INV with vSIDs, and
the hypervisor just needs to convert vSID to pSID.

Otherwise vSID doesn't matter because it isn't used in the invalidation
API and you are just handling ASIDs that only need the VM_ID scope
applied.
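
A rough sketch of that split, keyed on the opcode in the first command
word. The demo_* names are hypothetical. The opcode values and field
positions are what I recall from the Linux driver's arm-smmu-v3.h
(CMDQ_OP_*, CMDQ_TLBI_0_ASID, CMDQ_ATC_0_SID), so treat them as
assumptions to be checked against the spec.

```c
#include <stdint.h>

/* Opcodes, assumed to match CMDQ_OP_* in the Linux SMMUv3 driver. */
#define DEMO_OP_CFGI_STE	0x03
#define DEMO_OP_TLBI_NH_ASID	0x11
#define DEMO_OP_TLBI_NH_VA	0x12
#define DEMO_OP_ATC_INV		0x40

enum demo_scope {
	DEMO_SCOPE_ASID,	/* needs only the VMID scope applied */
	DEMO_SCOPE_SID,		/* needs a vSID to pSID conversion */
	DEMO_SCOPE_OTHER,	/* not an invalidation handled here */
};

/* Opcode lives in bits [7:0] of word 0. */
static uint8_t demo_cmd_op(uint64_t w0)
{
	return (uint8_t)(w0 & 0xff);
}

/* ASID in bits [63:48] of word 0 for the TLBI_NH_* commands (assumed). */
static uint16_t demo_cmd_asid(uint64_t w0)
{
	return (uint16_t)(w0 >> 48);
}

/* StreamID in bits [63:32] of word 0 for ATC_INV (assumed). */
static uint32_t demo_cmd_sid(uint64_t w0)
{
	return (uint32_t)(w0 >> 32);
}

static enum demo_scope demo_cmd_scope(uint64_t w0)
{
	switch (demo_cmd_op(w0)) {
	case DEMO_OP_TLBI_NH_ASID:
	case DEMO_OP_TLBI_NH_VA:
		return DEMO_SCOPE_ASID;
	case DEMO_OP_ATC_INV:
		return DEMO_SCOPE_SID;
	default:
		return DEMO_SCOPE_OTHER;
	}
}
```

Only the DEMO_SCOPE_SID case ever needs the vSID table; the ASID case
never looks at a SID.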

Jason


2023-03-20 18:53:39

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 09:59:45AM -0700, Nicolin Chen wrote:
> On Mon, Mar 20, 2023 at 01:04:35PM -0300, Jason Gunthorpe wrote:
>
> > > > We need an ioctl for this, I think. To load a map of vSID to dev_id
> > > > into the driver. Kernel will convert dev_id to pSID. Driver will
> > > > program the map into HW.
> > >
> > > Can we just pass a vSID via the alloc ioctl like this?
> > >
> > > -----------------------------------------------------------
> > > @@ -429,7 +429,7 @@ struct iommu_hwpt_arm_smmuv3 {
> > > #define IOMMU_SMMUV3_FLAG_VMID (1 << 1) /* vmid override */
> > > __u64 flags;
> > > __u32 s2vmid;
> > > - __u32 __reserved;
> > > + __u32 sid;
> > > __u64 s1ctxptr;
> > > __u64 s1cdmax;
> > > __u64 s1fmt;
> > > -----------------------------------------------------------
> > >
> > > An alloc is initiated by an SMMU_CMD_CFGI_STE command that has
> > > an SID field anyway.
> >
> > No, a HWPT is not a device or a SID. a HWPT is an ASID in the ARM
> > model.
> >
> > dev_id is the SID.
> >
> > The cfgi_ste will carry the vSID which is mapped to a iommufd dev_id.
> >
> > The kernel has to translate the vSID to the dev_id to the pSID to
> > issue an ATC invalidation for the correct entity.
>
> OK. This narrative makes sense. I think our solution (the entire
> stack) here mixes these two terms between HWPT/ASID and STE/SID.

HWPT is an "ASID/DID" on Intel and a CD table on SMMUv3

> What QEMU does is trapping an SMMU_CMD_CFGI_STE command to send
> the host an HWPT alloc ioctl. The former one is based on an SID
> or a device, while the latter one is based on ASID.
>
> So the correct way should be for QEMU to maintain an ASID-based
> list, corresponding to the s1ctxptr from STEs, and only send an
> alloc ioctl upon a new s1ctxptr/ASID. Meanwhile, at every trap
> of SMMU_CMD_CFGI_STE, it calls a separate ioctl to tie a vSID to
> a dev_id (and pSID accordingly).

It is not ASID, it is just s1ctxptr's - de-duplicate them.

Do something about SMMUv3 not being able to interwork iommu_domains
across instances

> In another word, an SMMU_CMD_CFGI_STE should do a mandatory SID
> ioctl and an optional HWPT alloc ioctl (only allocates a HWPT if
> the s1ctxptr in the STE is new).

No, there is no SID ioctl at the STE stage.

The vSID was decided by qemu before the VM booted. It created it when
it built the vRID and the vPCI device. The vSID is tied to the vfio
device FD.

Somehow the VM knows the relationship between vSID and vPCI/vRID. IIRC
this is passed in through ACPI from qemu.

So vSID is an alias for the dev_id in iommufd language, and qemu
always has a translation table for it.

So CFGI_STE maps to allocating a de-duplicated HWPT for the CD table,
and then a replace operation on the device FD represented by the vSID
to change the pSTE to point to the HWPT.

The HWPT is effectively the "shadow STE".
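
The de-duplication on the VMM side could look roughly like the sketch
below: a small cache keyed by s1ctxptr with a user count. All demo_*
names are hypothetical, and next_id only stands in for whatever ID the
real HWPT alloc call would return.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_HWPT 16

struct demo_hwpt {
	uint64_t s1ctxptr;	/* guest CD table pointer from the vSTE */
	uint32_t hwpt_id;	/* placeholder for the allocated HWPT ID */
	unsigned int users;	/* vSTEs currently pointing at this CD table */
};

struct demo_hwpt_cache {
	struct demo_hwpt slot[DEMO_MAX_HWPT];
	uint32_t next_id;	/* stand-in for the real allocation */
};

/*
 * Return the HWPT for @s1ctxptr, "allocating" one only if this CD table
 * has not been seen before. Returns NULL when the cache is full.
 */
static struct demo_hwpt *demo_hwpt_get(struct demo_hwpt_cache *c,
				       uint64_t s1ctxptr)
{
	struct demo_hwpt *free_slot = NULL;
	size_t i;

	for (i = 0; i < DEMO_MAX_HWPT; i++) {
		struct demo_hwpt *h = &c->slot[i];

		if (h->users && h->s1ctxptr == s1ctxptr) {
			h->users++;
			return h;
		}
		if (!h->users && !free_slot)
			free_slot = h;
	}
	if (!free_slot)
		return NULL;

	free_slot->s1ctxptr = s1ctxptr;
	free_slot->hwpt_id = ++c->next_id;
	free_slot->users = 1;
	return free_slot;
}

/* Drop one user; a slot with no users can be reused by a later get. */
static void demo_hwpt_put(struct demo_hwpt *h)
{
	if (h->users)
		h->users--;
}
```

On each trapped CFGI_STE the VMM would do a get for the new s1ctxptr,
do the replace on the device, then put the HWPT the device used before.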

> What could be a good prototype of the ioctl? Would it be a VFIO
> device one or IOMMUFD one?

If we load the vSID table it should be an iommufd one, linked to the
ARM SMMUv3 driver and probably take in a pointer to an array of
vSID/dev_id pairs. Maybe an add/remove type of operation.
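
As a strawman for the add/remove idea, a plain table of vSID/dev_id
pairs is sketched below. The demo_* names are hypothetical and this is
not a proposed uAPI layout. A device with several stream IDs simply gets
one pair per vSID, all carrying the same dev_id.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PAIRS 32

struct demo_vsid_pair {
	uint32_t vsid;
	uint32_t dev_id;
};

struct demo_vsid_table {
	struct demo_vsid_pair pair[DEMO_MAX_PAIRS];
	size_t count;
};

/* Return the index of @vsid in the table, or -1 if it is not mapped. */
static int demo_vsid_find(const struct demo_vsid_table *t, uint32_t vsid)
{
	size_t i;

	for (i = 0; i < t->count; i++)
		if (t->pair[i].vsid == vsid)
			return (int)i;
	return -1;
}

/* Add a mapping; fails with -1 if the table is full or @vsid is taken. */
static int demo_vsid_add(struct demo_vsid_table *t, uint32_t vsid,
			 uint32_t dev_id)
{
	if (t->count == DEMO_MAX_PAIRS || demo_vsid_find(t, vsid) >= 0)
		return -1;
	t->pair[t->count].vsid = vsid;
	t->pair[t->count].dev_id = dev_id;
	t->count++;
	return 0;
}

/* Remove a mapping by moving the last pair into its place. */
static int demo_vsid_remove(struct demo_vsid_table *t, uint32_t vsid)
{
	int i = demo_vsid_find(t, vsid);

	if (i < 0)
		return -1;
	t->pair[i] = t->pair[--t->count];
	return 0;
}

/* Translate @vsid to a dev_id; the kernel would then find the pSID. */
static int demo_vsid_to_dev_id(const struct demo_vsid_table *t, uint32_t vsid,
			       uint32_t *dev_id)
{
	int i = demo_vsid_find(t, vsid);

	if (i < 0)
		return -1;
	*dev_id = t->pair[i].dev_id;
	return 0;
}
```

In the kernel this would more likely be an xarray indexed by vSID than
a linear table; the linear scan is only to keep the sketch short.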

> > I would expect that if invalidation can fail that we have a way to
> > signal that failure back to the guest.
>
> That's plausible to me, and it could apply to a translation
> fault too. So, should we add back the iommufd infrastructure
> for the fault injection (without PRI), in v2?

It would be nice if things were not so big, I don't think we need to
tackle translation fault at this time, but we should be thinking about
what invalidation cmd fault converts into.

Jason

2023-03-20 20:47:30

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 03:07:13PM -0300, Jason Gunthorpe wrote:
> On Mon, Mar 20, 2023 at 09:35:20AM -0700, Nicolin Chen wrote:
>
> > > You need to know what devices the vSID is targeting and issue
> > > invalidations only for those devices.
> >
> > I agree with that, yet cannot think of a solution to achieve
> > that out of vSID. QEMU code by means of emulating a physical
> > SMMU only reads the commands from the queue, without knowing
> > which device (vSID) actually sent these commands.
>
> Huh?
>
> CMD_ATC_INV has the SID
>
> Other commands have the ASID.
>
> You never need to cross an ASID to a SID or vice versa.
>
> If the guest is aware of ATS it will issue CMD_ATC_INV with vSIDs, and
> the hypervisor just needs to convert vSID to pSID.
>
> Otherwise vSID doesn't matter because it isn't used in the invalidation
> API and you are just handling ASIDs that only need the VM_ID scope
> applied.

Yea, I was thinking of your point (at the top) about how we could
ensure that an invalidation is targeting the correct vSID. So,
that narrative was only about CMD_ATC_INV...

Actually, we don't forward CMD_ATC_INV in QEMU. In another
thread, Kevin also asked whether we need to support that
in the host or not. And I plan to drop CMD_ATC_INV from the
list of cache_invalidate_user(), following his comments and
the QEMU situation. Our uAPI, either forwarding the commands
or a package of queue info, should be able to cover this in
the future whenever we think it's required.

Combining the two parts above, we probably don't need to know
at this moment which vSID an invalidation is targeting, nor
to only allow it to execute for those devices, since the rest
of commands are all ASID based.

Thanks
Nic

2023-03-20 21:23:07

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 03:45:54PM -0300, Jason Gunthorpe wrote:
> On Mon, Mar 20, 2023 at 09:59:45AM -0700, Nicolin Chen wrote:
> > On Mon, Mar 20, 2023 at 01:04:35PM -0300, Jason Gunthorpe wrote:
> >
> > > > > We need an ioctl for this, I think. To load a map of vSID to dev_id
> > > > > into the driver. Kernel will convert dev_id to pSID. Driver will
> > > > > program the map into HW.
> > > >
> > > > Can we just pass a vSID via the alloc ioctl like this?
> > > >
> > > > -----------------------------------------------------------
> > > > @@ -429,7 +429,7 @@ struct iommu_hwpt_arm_smmuv3 {
> > > > #define IOMMU_SMMUV3_FLAG_VMID (1 << 1) /* vmid override */
> > > > __u64 flags;
> > > > __u32 s2vmid;
> > > > - __u32 __reserved;
> > > > + __u32 sid;
> > > > __u64 s1ctxptr;
> > > > __u64 s1cdmax;
> > > > __u64 s1fmt;
> > > > -----------------------------------------------------------
> > > >
> > > > An alloc is initiated by an SMMU_CMD_CFGI_STE command that has
> > > > an SID field anyway.
> > >
> > > No, a HWPT is not a device or a SID. a HWPT is an ASID in the ARM
> > > model.
> > >
> > > dev_id is the SID.
> > >
> > > The cfgi_ste will carry the vSID which is mapped to a iommufd dev_id.
> > >
> > > The kernel has to translate the vSID to the dev_id to the pSID to
> > > issue an ATC invalidation for the correct entity.
> >
> > OK. This narrative makes sense. I think our solution (the entire
> > stack) here mixes these two terms between HWPT/ASID and STE/SID.
>
> HWPT is an "ASID/DID" on Intel and a CD table on SMMUv3
>
> > What QEMU does is trapping an SMMU_CMD_CFGI_STE command to send
> > the host an HWPT alloc ioctl. The former one is based on an SID
> > or a device, while the latter one is based on ASID.
> >
> > So the correct way should be for QEMU to maintain an ASID-based
> > list, corresponding to the s1ctxptr from STEs, and only send an
> > alloc ioctl upon a new s1ctxptr/ASID. Meanwhile, at every trap
> > of SMMU_CMD_CFGI_STE, it calls a separate ioctl to tie a vSID to
> > a dev_id (and pSID accordingly).
>
> It is not ASID, it is just s1ctxptr's - de-duplicate them.

SMMU has "ASID" too. And it's one per CD table. It can also be
seen as one per iommu_domain.

The following are lines from arm_smmu_domain_finalise_s1():
    ...
    ret = xa_alloc(&arm_smmu_asid_xa, &asid, &cfg->cd,
                   XA_LIMIT(1, (1 << smmu->asid_bits) - 1), GFP_KERNEL);
    ...
    cfg->cd.asid = (u16)asid;
    ...

> Do something about SMMUv3 not being able to interwork iommu_domains
> across instances

I don't follow this one. Device instances?

> > In another word, an SMMU_CMD_CFGI_STE should do a mandatory SID
> > ioctl and an optional HWPT alloc ioctl (only allocates a HWPT if
> > the s1ctxptr in the STE is new).
>
> No, there is no SID ioctl at the STE stage.
>
> The vSID was decided by qemu before the VM booted. It created it when
> it built the vRID and the vPCI device. The vSID is tied to the vfio
> device FD.
>
> Somehow the VM knows the relationship between vSID and vPCI/vRID. IIRC
> this is passed in through ACPI from qemu.

Yes.

> So vSID is an alias for the dev_id in iommufd language, and qemu
> always has a translation table for it.

I see.

> So CFGI_STE maps to allocating a de-duplicated HWPT for the CD table,
> and then a replace operation on the device FD represented by the vSID
> to change the pSTE to point to the HWPT.
>
> The HWPT is effectively the "shadow STE".

IIUC, the ioctl for the link of vSID/dev_id should happen at
the stage when the VM boots, while the HWPT alloc ioctl happens
at CFGI_STE.

> > What could be a good prototype of the ioctl? Would it be a VFIO
> > device one or IOMMUFD one?
>
> If we load the vSID table it should be a iommufd one, linked to the
> ARM SMMUv3 driver and probably take in a pointer to an array of
> vSID/dev_id pairs. Maybe an add/remove type of operation.

Will try some solution.

> > > I would expect that if invalidation can fail that we have a way to
> > > signal that failure back to the guest.
> >
> > That's plausible to me, and it could apply to a translation
> > fault too. So, should we add back the iommufd infrastructure
> > for the fault injection (without PRI), in v2?
>
> It would be nice if things were not so big, I don't think we need to
> tackle translation fault at this time, but we should be thinking about
> what invalidation cmd fault converts into.

Will see if we can add a compact one, or some other solution
for invalidation fault only.

Thanks
Nic

2023-03-20 22:14:26

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 01:46:52PM -0700, Nicolin Chen wrote:
> On Mon, Mar 20, 2023 at 03:07:13PM -0300, Jason Gunthorpe wrote:
> > On Mon, Mar 20, 2023 at 09:35:20AM -0700, Nicolin Chen wrote:
> >
> > > > You need to know what devices the vSID is targeting and issue
> > > > invalidations only for those devices.
> > >
> > > I agree with that, yet cannot think of a solution to achieve
> > > that out of vSID. QEMU code by means of emulating a physical
> > > SMMU only reads the commands from the queue, without knowing
> > > which device (vSID) actually sent these commands.
> >
> > Huh?
> >
> > CMD_ATC_INV has the SID
> >
> > Other commands have the ASID.
> >
> > You never need to cross an ASID to a SID or vice versa.
> >
> > If the guest is aware of ATS it will issue CMD_ATC_INV with vSIDs, and
> > the hypervisor just needs to convert vSID to pSID.
> >
> > Otherwise vSID doesn't matter because it isn't used in the invalidation
> > API and you are just handling ASIDs that only need the VM_ID scope
> > applied.
>
> Yea, I was thinking of your point (at the top) how we could
> ensure if an invalidation is targeting a correct vSID. So,
> that narrative was only about CMD_ATC_INV...
>
> Actually, we don't forward CMD_ATC_INV in QEMU. In another
> thread, Kevin also remarked whether we need to support that
> in the host or not. And I plan to drop CMD_ATC_INV from the
> list of cache_invalidate_user(), following his comments and
> the QEMU situation. Our uAPI, either forwarding the commands
> or a package of queue info, should be able to cover this in
> the future whenever we think it's required.

Something has to generate CMD_ATC_INV.

How do you plan to generate this from the hypervisor based on ASID
invalidations?

The hypervisor doesn't know what ASIDs are connected to what SIDs to
generate the ATC?

Intel is different, they know what devices the vDID is connected to,
so when they get a vDID invalidation they can elaborate it into an ATC
invalidation. ARM doesn't have that information.

Jason

2023-03-20 22:22:03

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 02:22:42PM -0700, Nicolin Chen wrote:
> > > What QEMU does is trapping an SMMU_CMD_CFGI_STE command to send
> > > the host an HWPT alloc ioctl. The former one is based on an SID
> > > or a device, while the latter one is based on ASID.
> > >
> > > So the correct way should be for QEMU to maintain an ASID-based
> > > list, corresponding to the s1ctxptr from STEs, and only send an
> > > alloc ioctl upon a new s1ctxptr/ASID. Meanwhile, at every trap
> > > of SMMU_CMD_CFGI_STE, it calls a separate ioctl to tie a vSID to
> > > a dev_id (and pSID accordingly).
> >
> > It is not ASID, it is just s1ctxptr's - de-duplicate them.
>
> SMMU has "ASID" too. And it's one per CD table. It can be also
> seen as one per iommu_domain.

Yes and no, the ASID in ARM is per CDE not per CD table. It is
associated with each TTB0/1 pointer and is effectively the handle for
the IOPTEs.

Every iommu_domain that has a TTB0/1 (ie represents IOPTEs) should
have an ASID.

The "nested" iommu_domains don't represent IOPTEs and don't have ASIDs.

The nested domains are just "shadow STEs".

> > Do something about SMMUv3 not being able to interwork iommu_domains
> > across instances
>
> I don't follow this one. Device instances?

There is some code that makes sure each iommu_domain is hooked to only
one smmu driver instance, IIRC.

> IIUC, the ioctl for the link of vSID/dev_id should happen at
> the stage when the VM boots, while the HWPT alloc ioctl happens
> at CFGI_STE.

Yes

> > > What could be a good prototype of the ioctl? Would it be a VFIO
> > > device one or IOMMUFD one?
> >
> > If we load the vSID table it should be a iommufd one, linked to the
> > ARM SMMUv3 driver and probably take in a pointer to an array of
> > vSID/dev_id pairs. Maybe an add/remove type of operation.
>
> Will try some solution.

It is only necessary if you want to do batching

For non-batching the SID invalidation should be done differently with
a device_id input instead. That is a bit tricky to organize as you
want iommufd to get back a 'struct device *' from the ID.

Jason

2023-03-21 08:37:25

by Tian, Kevin

Subject: RE: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, March 21, 2023 2:01 AM
>
> On Mon, Mar 20, 2023 at 09:12:06AM -0700, Nicolin Chen wrote:
> > On Mon, Mar 20, 2023 at 09:59:23AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Mar 17, 2023 at 09:41:34AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe <[email protected]>
> > > > > Sent: Saturday, March 11, 2023 12:20 AM
> > > > >
> > > > > What I'm broadly thinking is if we have to make the infrastructure for
> > > > > VCMDQ HW accelerated invalidation then it is not a big step to also
> > > > > have the kernel SW path use the same infrastructure just with a CPU
> > > > > wake up instead of a MMIO poke.
> > > > >
> > > > > > Ie we have a SW version of VCMDQ to speed up SMMUv3 cases without HW
> > > > > > support.
> > > > >
> > > >
> > > > I thought about this in VT-d context. Looks there are some difficulties.
> > > >
> > > > > The most prominent one is that head/tail of the VT-d invalidation queue
> > > > are in MMIO registers. Handling it in kernel iommu driver suggests
> > > > reading virtual tail register and updating virtual head register. Kind of
> > > > moving some vIOMMU awareness into the kernel which, iirc, is not
> > > > a welcomed model.
> > >
> > > qemu would trap the MMIO and generate an IOCTL with the written head
> > > pointer. It isn't as efficient as having the kernel do the trap, but
> > > does give batching.
> >
> > Rephrasing that to put into a design: the IOCTL would pass a
> > user pointer to the queue, the size of the queue, then a head
> > pointer and a tail pointer? Then the kernel reads out all the
> > commands between the head and the tail and handles all those
> > invalidation commands only?
>
> Yes, that is one possible design
>

If we cannot have the short path in the kernel then I'm not sure of the
value of using the native format and queue in the uAPI. Batching can
be enabled over any format.

Btw, probably a dumb question. The current invalidation IOCTL is
per hwpt. If we pick a native format, does that suggest making the IOCTL
per iommufd, given that the native format is per IOMMU and could carry
a scope bigger than a hwpt?

2023-03-21 11:48:47

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Tue, Mar 21, 2023 at 08:34:00AM +0000, Tian, Kevin wrote:

> > > Rephrasing that to put into a design: the IOCTL would pass a
> > > user pointer to the queue, the size of the queue, then a head
> > > pointer and a tail pointer? Then the kernel reads out all the
> > > commands between the head and the tail and handles all those
> > > invalidation commands only?
> >
> > Yes, that is one possible design
>
> If we cannot have the short path in the kernel then I'm not sure the
> value of using native format and queue in the uAPI. Batching can
> be enabled over any format.

SMMUv3 will have a hardware short path where the HW itself runs the
VM's command queue and does this logic.

So I like the symmetry of the SW path being close to that.

> Btw probably a dumb question. The current invalidation IOCTL is
> per hwpt. If picking a native format does it suggest making the IOCTL
> per iommufd given native format is per IOMMU and could carry
> scope bigger than a hwpt.

At least on SMMUv3 it depends on what happens with VMID.

If we can tie the VMID to the iommu_domain then the invalidation has
to flow through the iommu_domain to pick up the VMID.

If the VMID is tied to the entire iommufd_ctx then it can flow
independently.

Jason

2023-03-22 05:16:41

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 07:14:17PM -0300, Jason Gunthorpe wrote:
> On Mon, Mar 20, 2023 at 01:46:52PM -0700, Nicolin Chen wrote:
> > On Mon, Mar 20, 2023 at 03:07:13PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Mar 20, 2023 at 09:35:20AM -0700, Nicolin Chen wrote:
> > >
> > > > > You need to know what devices the vSID is targeting and issue
> > > > > invalidations only for those devices.
> > > >
> > > > I agree with that, yet cannot think of a solution to achieve
> > > > that out of vSID. QEMU code by means of emulating a physical
> > > > SMMU only reads the commands from the queue, without knowing
> > > > which device (vSID) actually sent these commands.
> > >
> > > Huh?
> > >
> > > CMD_ATC_INV has the SID
> > >
> > > Other commands have the ASID.
> > >
> > > You never need to cross an ASID to a SID or vice versa.
> > >
> > > If the guest is aware of ATS it will issue CMD_ATC_INV with vSIDs, and
> > > the hypervisor just needs to convert vSID to pSID.
> > >
> > > Otherwise vSID doesn't matter because it isn't used in the invalidation
> > > API and you are just handling ASIDs that only need the VM_ID scope
> > > applied.
> >
> > Yea, I was thinking of your point (at the top) how we could
> > ensure if an invalidation is targeting a correct vSID. So,
> > that narrative was only about CMD_ATC_INV...
> >
> > Actually, we don't forward CMD_ATC_INV in QEMU. In another
> > thread, Kevin also remarked whether we need to support that
> > in the host or not. And I plan to drop CMD_ATC_INV from the
> > list of cache_invalidate_user(), following his comments and
> > the QEMU situation. Our uAPI, either forwarding the commands
> > or a package of queue info, should be able to cover this in
> > the future whenever we think it's required.
>
> Something has to generate CMD_ATC_INV.
>
> How do you plan to generate this from the hypervisor based on ASID
> invalidations?
>
> The hypervisor doesn't know what ASIDs are connected to what SIDs to
> generate the ATC?
>
> Intel is different, they know what devices the vDID is connected to,
> so when they get a vDID invalidation they can elaborate it into a ATC
> invalidation. ARM doesn't have that information.

I see. Perhaps vSMMU still needs to forward CMD_ATC_INV. And,
as you suggested, it should go through a vSID sanity check by
the host handler. We can find the corresponding pSID to check
if the device is associated with the iommu_domain?

Thanks
Nic

2023-03-22 06:46:41

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Tue, Mar 21, 2023 at 08:48:31AM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 21, 2023 at 08:34:00AM +0000, Tian, Kevin wrote:
>
> > > > Rephrasing that to put into a design: the IOCTL would pass a
> > > > user pointer to the queue, the size of the queue, then a head
> > > > pointer and a tail pointer? Then the kernel reads out all the
> > > > commands between the head and the tail and handles all those
> > > > invalidation commands only?
> > >
> > > Yes, that is one possible design
> >
> > If we cannot have the short path in the kernel then I'm not sure the
> > value of using native format and queue in the uAPI. Batching can
> > be enabled over any format.
>
> SMMUv3 will have a hardware short path where the HW itself runs the
> VM's command queue and does this logic.
>
> So I like the symmetry of the SW path being close to that.

A tricky thing here that I just realized:

With VCMDQ, the guest will have two CMDQs. One is the vSMMU's
CMDQ handling all non-TLBI commands like CMD_CFGI_STE via the
invalidation IOCTL, and the other hardware accelerated VCMDQ
handling all TLBI commands by the HW. In this setup, we will
need a VCMDQ kernel driver to dispatch commands into the two
different queues.

Yet, it feels a bit different with this SW path exposing the
entire SMMU CMDQ, since now theoretically non-TLBI and TLBI
commands can be interlaced in one batch, so the hypervisor
should go through the queue first to handle and delete all
non-TLBI commands, and then forward the CMDQ to the host to
run the remaining TLBI commands, if there are any.
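
Roughly, the hypervisor-side pass I have in mind looks like the sketch
below. The demo_* names are hypothetical, and only two TLBI opcodes are
listed (values assumed to match CMDQ_OP_TLBI_NH_ASID/VA in the Linux
driver), so a real pass would need the full list.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte command, the size of one SMMUv3 command. */
struct demo_cmd {
	uint64_t data[2];
};

/* Partial TLBI check: opcode values are assumptions, see above. */
static int demo_is_tlbi(const struct demo_cmd *cmd)
{
	switch (cmd->data[0] & 0xff) {
	case 0x11:	/* TLBI_NH_ASID */
	case 0x12:	/* TLBI_NH_VA */
		return 1;
	default:
		return 0;
	}
}

/*
 * Compact @batch in place so that only the TLBI commands remain, in
 * their original order, and copy every other command to @local for the
 * hypervisor to handle itself. @local must hold at least @n entries.
 * Returns the number of TLBI commands left at the front of @batch.
 */
static size_t demo_split_batch(struct demo_cmd *batch, size_t n,
			       struct demo_cmd *local, size_t *nlocal)
{
	size_t i, kept = 0;

	*nlocal = 0;
	for (i = 0; i < n; i++) {
		if (demo_is_tlbi(&batch[i]))
			batch[kept++] = batch[i];
		else
			local[(*nlocal)++] = batch[i];
	}
	return kept;
}
```

This ignores any ordering requirement between the two groups, e.g. a
CMD_SYNC that is supposed to cover commands of both kinds.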

> > Btw probably a dumb question. The current invalidation IOCTL is
> > per hwpt. If picking a native format does it suggest making the IOCTL
> > per iommufd given native format is per IOMMU and could carry
> > scope bigger than a hwpt.
>
> At least on SMMUv3 it depends on what happens with VMID.
>
> If we can tie the VMID to the iommu_domain then the invalidation has
> to flow through the iommu_domain to pick up the VMID.

Yes. This is what we do now. An invalidation handler finds the
corresponding S2 domain pointer to pick up the VMID. And it'd
be safe, until the S2 domain gets replaced with another domain
I think?

> If the VMID is tied to the entire iommufd_ctx then it can flow
> independently.

One more thing about the VMID unification is that the SMMU might
have a limitation on the VMID range:
    smmu->vmid_bits = reg & IDR0_VMID16 ? 16 : 8;
    ...
    vmid = arm_smmu_bitmap_alloc(smmu->vmid_map, smmu->vmid_bits);

So, we'd likely need a CAP for that, to apply some limitation
with the iommufd_ctx too?

Thanks
Nic

2023-03-22 12:49:57

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Tue, Mar 21, 2023 at 11:42:25PM -0700, Nicolin Chen wrote:
> On Tue, Mar 21, 2023 at 08:48:31AM -0300, Jason Gunthorpe wrote:
> > On Tue, Mar 21, 2023 at 08:34:00AM +0000, Tian, Kevin wrote:
> >
> > > > > Rephrasing that to put into a design: the IOCTL would pass a
> > > > > user pointer to the queue, the size of the queue, then a head
> > > > > pointer and a tail pointer? Then the kernel reads out all the
> > > > > commands between the head and the tail and handles all those
> > > > > invalidation commands only?
> > > >
> > > > Yes, that is one possible design
> > >
> > > If we cannot have the short path in the kernel then I'm not sure the
> > > value of using native format and queue in the uAPI. Batching can
> > > be enabled over any format.
> >
> > SMMUv3 will have a hardware short path where the HW itself runs the
> > VM's command queue and does this logic.
> >
> > So I like the symmetry of the SW path being close to that.
>
> A tricky thing here that I just realized:
>
> With VCMDQ, the guest will have two CMDQs. One is the vSMMU's
> CMDQ handling all non-TLBI commands like CMD_CFGI_STE via the
> invalidation IOCTL, and the other hardware accelerated VCMDQ
> handling all TLBI commands by the HW. In this setup, we will
> need a VCMDQ kernel driver to dispatch commands into the two
> different queues.

You mean a VM kernel driver? Yes that was always the point, the VM
would use the extra CMDQ's only for invalidation

The main CMDQ would work as today through a trap.

> Yet, it feels a bit different with this SW path exposing the
> entire SMMU CMDQ, since now theoretically non-TLBI and TLBI
> commands can be interlaced in one batch, so the hypervisor
> should go through the queue first to handle and delete all
> non-TLBI commands, and then forward the CMDQ to the host to
> run remaining TLBI commands, if there's any.

Yes, there are a few different ways to handle this and still preserve
batching. It is part of the reason it would be hard to make the kernel
natively parse the command queue.

On the other hand, we could add some more native kernel support for a
SW emulated vCMDQ and that might be interesting for performance.

One of the biggest reasons to use nesting is to get to vSVA and
invalidation performance is very important in a vSVA environment. We
should not ignore this in the design.

> > If the VMID is tied to the entire iommufd_ctx then it can flow
> > independently.
>
> One more thing about the VMID unification is that SMMU might
> have limitation on the VMID range:
>     smmu->vmid_bits = reg & IDR0_VMID16 ? 16 : 8;
>     ...
>     vmid = arm_smmu_bitmap_alloc(smmu->vmid_map, smmu->vmid_bits);
>
> So, we'd likely need a CAP for that, to apply some limitation
> with the iommufd_ctx too?

I'd imagine the driver would have to allocate its internal data
against the iommufd_ctx

I'm not sure how best to organize that if it is the way to go.

Do we have a use case for more than one S2 iommu_domain on ARM?

Jason

2023-03-22 17:12:44

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Wed, Mar 22, 2023 at 09:43:43AM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 21, 2023 at 11:42:25PM -0700, Nicolin Chen wrote:
> > On Tue, Mar 21, 2023 at 08:48:31AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Mar 21, 2023 at 08:34:00AM +0000, Tian, Kevin wrote:
> > >
> > > > > > Rephrasing that to put into a design: the IOCTL would pass a
> > > > > > user pointer to the queue, the size of the queue, then a head
> > > > > > pointer and a tail pointer? Then the kernel reads out all the
> > > > > > commands between the head and the tail and handles all those
> > > > > > invalidation commands only?
> > > > >
> > > > > Yes, that is one possible design
> > > >
> > > > If we cannot have the short path in the kernel then I'm not sure the
> > > > value of using native format and queue in the uAPI. Batching can
> > > > be enabled over any format.
> > >
> > > SMMUv3 will have a hardware short path where the HW itself runs the
> > > VM's command queue and does this logic.
> > >
> > > So I like the symmetry of the SW path being close to that.
> >
> > A tricky thing here that I just realized:
> >
> > With VCMDQ, the guest will have two CMDQs. One is the vSMMU's
> > CMDQ handling all non-TLBI commands like CMD_CFGI_STE via the
> > invalidation IOCTL, and the other hardware accelerated VCMDQ
> > handling all TLBI commands by the HW. In this setup, we will
> > need a VCMDQ kernel driver to dispatch commands into the two
> > different queues.
>
> You mean a VM kernel driver? Yes that was always the point, the VM
> would use the extra CMDQ's only for invalidation

Yes, I was saying the guest kernel driver would dispatch the
commands.

> The main CMDQ would work as today through a trap.

Yes.

> > Yet, it feels a bit different with this SW path exposing the
> > entire SMMU CMDQ, since now theoretically non-TLBI and TLBI
> > commands can be interlaced in one batch, so the hypervisor
> > should go through the queue first to handle and delete all
> > non-TLBI commands, and then forward the CMDQ to the host to
> > run remaining TLBI commands, if there's any.
>
> Yes, there are a few different ways to handle this and still preserve
> batching. It is part of the reason it would be hard to make the kernel
> natively parse the commandq

Yea. I think the way I described above might be the cleanest,
since the host kernel would only handle all the leftover TLBI
commands? I am open to other, better ideas, if there are any.

> On the other hand, we could add some more native kernel support for a
> SW emulated vCMDQ and that might be interesting for performance.

That's something I have thought about too. But it would feel
like changing the "hardware" of the VM, right? If the host
kernel enables nesting, then we'd have this extra queue for
TLBI commands. From the driver perspective, it would feel
like detecting an extra feature bit in the HW register, but
there's no such bit in the SMMU HW spec :)

Yet, would you please elaborate on how it impacts performance?
I can only see the benefit of isolation, from having a SW
emulated VCMDQ exclusively for TLBI commands vs. having a
single CMDQ interlacing different commands, because both of
them require trapping and some sort of dispatching.

> One of the biggest reasons to use nesting is to get to vSVA and
> invalidation performance is very important in a vSVA environment. We
> should not ignore this in the design.
>
> > > If the VMID is tied to the entire iommufd_ctx then it can flow
> > > independently.
> >
> > One more thing about the VMID unification is that SMMU might
> > have limitation on the VMID range:
> > smmu->vmid_bits = reg & IDR0_VMID16 ? 16 : 8;
> > ...
> > vmid = arm_smmu_bitmap_alloc(smmu->vmid_map, smmu->vmid_bits);
> >
> > So, we'd likely need a CAP for that, to apply some limitation
> > with the iommufd_ctx too?
>
> I'd imagine the driver would have to allocate its internal data
> against the iommufd_ctx
>
> I'm not sure how best to organize that if it is the way to go.
>
> Do we have a use case for more than one S2 iommu_domain on ARM?

In the previous VFIO solution from Eric, a nested iommu_domain
represented an S1+S2 two-stage setup. Since every CMD_CFGI_STE
could trigger an iommu_domain allocation of that, there could
be multiple S2 domains, when we have 2+ passthrough devices.
That's why I had quite a few patches for VMID unification in the
old VCMDQ series.

But now, we have only one S2 domain that works well with multiple
devices. So, I can't really think of a use case that needs two
S2 domains. Yet, I am not very sure.

Btw, just to confirm my understanding, a use case having two
or more iommu_domains means an S2 iommu_domain replacement,
right? I.e. a running S2 iommu_domain gets replaced on the fly
by a different S2 iommu_domain holding a different VMID, while
the IOAS still has the previous mappings? When would that
actually happen in the real world?

Thanks
Nic

2023-03-22 17:36:20

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Wed, Mar 22, 2023 at 10:11:33AM -0700, Nicolin Chen wrote:

> > Yes, there are a few different ways to handle this and still preserve
> > batching. It is part of the reason it would be hard to make the kernel
> > natively parse the commandq
>
> Yea. I think the way I described above might be the cleanest,
> since the host kernel would only handle all the leftover TLBI
> commands? I am open for other better idea, if there's any.

It seems best to have userspace take a first pass over the cmdq and
then send what it didn't handle to the kernel
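
As a rough illustration of such a first pass (a sketch only, not code
from this series): the VMM walks the guest queue between the consumer
and producer indexes, emulates what it can handle itself, and collects
the rest into a batch for the kernel. The opcode values follow the
SMMUv3 command encoding; struct vcmd, vcmd_is_forwarded() and
vcmdq_first_pass() are hypothetical names, and the policy of forwarding
only the two stage-1 TLBI commands is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SMMUv3 command opcodes, carried in bits [7:0] of the first dword. */
enum {
	CMD_CFGI_STE     = 0x03,
	CMD_CFGI_CD      = 0x05,
	CMD_TLBI_NH_ASID = 0x11,
	CMD_TLBI_NH_VA   = 0x12,
	CMD_SYNC         = 0x46,
};

/* One 16-byte command as it sits in the guest queue. */
struct vcmd {
	uint64_t data[2];
};

static inline uint8_t vcmd_opcode(const struct vcmd *cmd)
{
	return cmd->data[0] & 0xff;
}

/* Assumed policy: only stage-1 TLBI commands go to the host kernel. */
static int vcmd_is_forwarded(const struct vcmd *cmd)
{
	switch (vcmd_opcode(cmd)) {
	case CMD_TLBI_NH_ASID:
	case CMD_TLBI_NH_VA:
		return 1;
	default:
		return 0;
	}
}

/*
 * Walk ring[cons..prod) with wrap-around (cons and prod are plain
 * indexes below nents). Commands to forward are copied to out[] in
 * order; everything else is counted in *handled as emulated by the
 * VMM. The walk stops early if out[] fills up. Returns the number of
 * commands copied to out[].
 */
static size_t vcmdq_first_pass(const struct vcmd *ring, uint32_t nents,
			       uint32_t cons, uint32_t prod,
			       struct vcmd *out, size_t out_max,
			       size_t *handled)
{
	size_t n = 0;

	assert(nents > 0);
	*handled = 0;
	for (; cons != prod; cons = (cons + 1) % nents) {
		const struct vcmd *cmd = &ring[cons];

		if (vcmd_is_forwarded(cmd)) {
			if (n == out_max)
				break;
			out[n++] = *cmd;
		} else {
			(*handled)++;	/* e.g. CFGI_STE or CMD_SYNC */
		}
	}
	return n;
}
```

The out[] array would then be what the invalidation ioctl receives, so
batching is preserved while the kernel never has to parse non-TLBI
commands.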

> > On the other hand, we could add some more native kernel support for a
> > SW emulated vCMDQ and that might be interesting for performance.
>
> That's something I have thought about too. But it would feel
> like changing the "hardware" of the VM, right? If the host
> kernel enables nesting, then we'd have this extra queue for
> TLBI commands. From the driver prospective, it would feels
> like detecting an extra feature bit in the HW register, but
> there's no such bit in the SMMU HW spec :)

You'd trigger it the same way vCMDQ triggers. It is basically SW
emulated vCMDQ.

> Yet, would you please elaborate how it impacts performance?
> I can only see the benefit of isolation, from having a SW
> emulated VCMDQ exclusively for TLBI commands v.s. having a
> single CMDQ interlacing different commands, because both of
> them requires trapping and some sort of dispatching.

In theory we could make it work like virtio-iommu where the
doorbell ring for the SW emulated vCMDQ is delivered directly to a
kernel thread and chop a bunch of latency out of it.

The issue is latency to complete invalidation as in a vSVA scenario
the virtual process MM will block on IOMMU invalidation whenever it
does any mm_struct maintenance. I.e. you slow a vast set of
operations. The less latency the better.

> Btw, just to confirm my understanding, a use case having two
> or more iommu_domains means an S2 iommu_domain replacement,
> right? I.e. a running S2 iommu_domain gets replaced on the fly
> by a different S2 iommu_domain holding a different VMID, while
> the IOAS still has the previous mappings? When would that
> actually happen in the real world?

It doesn't have to be replace - what is needed is that every vPCI
device connected to the same SMMU instance be using the same S2 and
thus the same VM_ID.

IOW every SID must be linked to the same VM_ID or invalidation commands
will not be properly processed.

qemu would have to have multiple SMMU instances according to S2
domains, which is probably true anyhow since we need to know what
physical SMMU instance to deliver the invalidation to anyhow.

Jason

2023-03-22 19:30:22

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Wed, Mar 22, 2023 at 02:28:38PM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 22, 2023 at 10:11:33AM -0700, Nicolin Chen wrote:
>
> > > Yes, there are a few different ways to handle this and still preserve
> > > batching. It is part of the reason it would be hard to make the kernel
> > > natively parse the commandq
> >
> > Yea. I think the way I described above might be the cleanest,
> > since the host kernel would only handle all the leftover TLBI
> > commands? I am open for other better idea, if there's any.
>
> It seems best to have userspace take a first pass over the cmdq and
> then send what it didn't handle to the kernel

Yes. I can go ahead with this approach for v2.

> > > On the other hand, we could add some more native kernel support for a
> > > SW emulated vCMDQ and that might be interesting for performance.
> >
> > That's something I have thought about too. But it would feel
> > like changing the "hardware" of the VM, right? If the host
> > kernel enables nesting, then we'd have this extra queue for
> > TLBI commands. From the driver prospective, it would feels
> > like detecting an extra feature bit in the HW register, but
> > there's no such bit in the SMMU HW spec :)
>
> You'd trigger it the same way vCMDQ triggers. It is basically SW
> emulated vCMDQ.

It still feels like something very big. Off the top of my head,
we'd need a pair of new emulated registers for consumer and
producer indexes, and perhaps some configuration registers
too. How should we put them into the MMIO space? Maybe we could
emulate that via ECMDQ? So, for QEMU, the SMMU device model
always has the ECMDQ feature so we can have this extra MMIO
space for a separate CMDQ.
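
For reference, a minimal sketch of what such a pair of emulated
producer/consumer registers would have to track, assuming the same
index-plus-wrap-bit scheme that SMMUv3 queues use. The struct and
function names (emu_cmdq and friends) are hypothetical and not part of
any existing device model.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical emulated queue: log2 size plus PROD/CONS registers. */
struct emu_cmdq {
	uint32_t max_n_shift;	/* queue holds 1 << max_n_shift entries */
	uint32_t prod;		/* advanced by the guest driver */
	uint32_t cons;		/* advanced by the emulation */
};

static void emu_cmdq_init(struct emu_cmdq *q, uint32_t max_n_shift)
{
	assert(max_n_shift < 31);	/* keep the shifts below well defined */
	q->max_n_shift = max_n_shift;
	q->prod = 0;
	q->cons = 0;
}

/* Entry index: the low max_n_shift bits of a register value. */
static uint32_t q_idx(const struct emu_cmdq *q, uint32_t reg)
{
	return reg & ((1u << q->max_n_shift) - 1);
}

/* Wrap flag: the single bit just above the index. */
static uint32_t q_wrp(const struct emu_cmdq *q, uint32_t reg)
{
	return reg & (1u << q->max_n_shift);
}

/* Empty: same index and same wrap flag. */
static int emu_cmdq_empty(const struct emu_cmdq *q)
{
	return q_idx(q, q->prod) == q_idx(q, q->cons) &&
	       q_wrp(q, q->prod) == q_wrp(q, q->cons);
}

/* Full: same index but the producer has wrapped once more. */
static int emu_cmdq_full(const struct emu_cmdq *q)
{
	return q_idx(q, q->prod) == q_idx(q, q->cons) &&
	       q_wrp(q, q->prod) != q_wrp(q, q->cons);
}

/* Advance a register by one entry; the wrap bit toggles on overflow. */
static uint32_t emu_cmdq_inc(const struct emu_cmdq *q, uint32_t reg)
{
	uint32_t mask = (1u << (q->max_n_shift + 1)) - 1;

	return (reg + 1) & mask;
}

/* Number of commands the guest has queued but the emulation has not run. */
static uint32_t emu_cmdq_pending(const struct emu_cmdq *q)
{
	uint32_t mask = (1u << (q->max_n_shift + 1)) - 1;

	return (q->prod - q->cons) & mask;
}
```

The configuration registers mentioned above (queue base and size) would
come on top of this; they are left out of the sketch.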

> > Yet, would you please elaborate how it impacts performance?
> > I can only see the benefit of isolation, from having a SW
> > emulated VCMDQ exclusively for TLBI commands v.s. having a
> > single CMDQ interlacing different commands, because both of
> > them requires trapping and some sort of dispatching.
>
> In theory would could make it work like virtio-iommu where the
> doorbell ring for the SW emulated vCMDQ is delivered directly to a
> kernel thread and chop a bunch of latency out of it.

With a SW emulated VCMDQ, the dispatching is moved to the
guest kernel, vs. the hypervisor. I still don't see a big
improvement here. Perhaps we should run a benchmark with
some experimental changes.

> The issue is latency to complete invalidation as in a vSVA scenario
> the virtual process MM will block on IOMMU invlidation whenever it
> does any mm_struct maintenance. Ie you slow a vast set of
> operations. The less latency the better.

Yea. If it has a noticeable perf gain, we should do that.

Do you prefer this to happen with this series? I would think
of adding this at a later stage, although I am not sure if
the uAPI would be completely compatible. It seems to me that
we would need a different uAPI, so as to set up a queue at an
earlier stage, and then to ring a bell when QEMU traps any
incoming commands in the emulated VCMDQ.

> > Btw, just to confirm my understanding, a use case having two
> > or more iommu_domains means an S2 iommu_domain replacement,
> > right? I.e. a running S2 iommu_domain gets replaced on the fly
> > by a different S2 iommu_domain holding a different VMID, while
> > the IOAS still has the previous mappings? When would that
> > actually happen in the real world?
>
> It doesn't have to be replace - what is needed is that evey vPCI
> device connected to the same SMMU instance be using the same S2 and
> thus the same VM_ID.
>
> IOW evey SID must be linked to the same VM_ID or invalidation commands
> will not be properly processed.
>
> qemu would have to have multiple SMMU instances according to S2
> domains, which is probably true anyhow since we need to know what
> physical SMMU instance to deliver the invalidation too anyhow.

I am not 100% following this part. So, you mean that we're
safe if we only have one SMMU instance, because there'd be
only one S2 domain, while multiple S2 domains would happen
if we have multiple SMMU instances?

Can we still use the same S2 domain for multiple instances?
Our approach of setting up a stage-2 mapping in QEMU is to
map the entire guest memory. I don't see a point in having
a separate S2 domain, even if there are multiple instances?

Btw, from a private discussion with Eric, he expressed the
difficulty of adding multiple SMMU instances in QEMU, as it
would complicate the device and ACPI components. For VCMDQ,
we do need a multi-instance environment, because there are
multiple physical pairs of SMMU+VCMDQ, i.e. multiple VCMDQ
MMIO regions being attached/used by different devices. So,
I have been exploring a different approach by creating an
internal multiplication inside VCMDQ...

Thanks
Nic

2023-03-22 19:44:15

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Wed, Mar 22, 2023 at 12:21:27PM -0700, Nicolin Chen wrote:

> Do you prefer this to happen with this series?

No, I just don't want to exclude doing it someday if people are
interested in optimizing this. As I said in the other thread I'd rather
optimize SMMUv3 emulation than try to use virtio-iommu to make it run
faster.

> the uAPI would be completely compatible. It seems to me that
> we would need a different uAPI, so as to setup a queue in an
> earlier stage, and then to ring a bell when QEMU traps any
> incoming commands in the emulated VCMDQ.

Yes, it would need more uAPI. Let's just make sure there is room and
maybe think a bit about what it would look like.

You should also draft through the HW vCMDQ stuff to ensure it fits
in here nicely.


> > > Btw, just to confirm my understanding, a use case having two
> > > or more iommu_domains means an S2 iommu_domain replacement,
> > > right? I.e. a running S2 iommu_domain gets replaced on the fly
> > > by a different S2 iommu_domain holding a different VMID, while
> > > the IOAS still has the previous mappings? When would that
> > > actually happen in the real world?
> >
> > It doesn't have to be replace - what is needed is that evey vPCI
> > device connected to the same SMMU instance be using the same S2 and
> > thus the same VM_ID.
> >
> > IOW evey SID must be linked to the same VM_ID or invalidation commands
> > will not be properly processed.
> >
> > qemu would have to have multiple SMMU instances according to S2
> > domains, which is probably true anyhow since we need to know what
> > physical SMMU instance to deliver the invalidation too anyhow.
>
> I am not 100% following this part. So, you mean that we're
> safe if we only have one SMMU instance, because there'd be
> only one S2 domain, while multiple S2 domains would happen
> if we have multiple SMMU instances?

Yes, that would happen today, especially since each smmu has its own
vm_id allocator IIRC

> Can we still use the same S2 domain for multiple instances?

I think not today.

At the core, if we share the same S2 domain then it is a problem to
figure out what smmu instance to send the invalidation command to. E.g.
if the userspace invalidates ASID 1 you'd have to replicate
invalidation to all SMMU instances. Even if ASID 1 is used by only a
single SID/STE that has a single SMMU instance backing it.

So I think for ARM we want to reflect the physical SMMU instances into
vSMMU instances and that feels best done by having a unique S2
iommu_domain for each SMMU instance. Then we know that an invalidation
for a SMMU instance is delivered to that S2's singular CMDQ and things
like vCMDQ become possible.
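
To put the replication cost in concrete terms, here is a toy sketch
with hypothetical types (smmu_inst, s2_domain) and no real driver
calls. With one S2 shared by all instances the owner of an ASID is
unknown, so each invalidation fans out to every instance; with one S2
per instance the domain itself names the single target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-instance state: counts invalidations it had to run. */
struct smmu_inst {
	unsigned int nr_inv;
};

/* Hypothetical S2 domain bound to exactly one physical SMMU instance. */
struct s2_domain {
	struct smmu_inst *smmu;
	uint16_t vmid;
};

/* Shared S2: the ASID's owner is unknown, so every instance is hit. */
static size_t inv_asid_shared_s2(struct smmu_inst *insts, size_t nr_insts,
				 uint16_t asid)
{
	size_t i;

	(void)asid;	/* a real command would carry the ASID */
	for (i = 0; i < nr_insts; i++)
		insts[i].nr_inv++;
	return nr_insts;
}

/* Per-instance S2: only the instance behind this domain is hit. */
static size_t inv_asid_per_instance_s2(struct s2_domain *s2, uint16_t asid)
{
	assert(s2 && s2->smmu);
	(void)asid;
	s2->smmu->nr_inv++;
	return 1;
}
```

Both helpers return the number of instances that received the command,
which is the cost being compared.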

> Our approach of setting up a stage-2 mapping in QEMU is to
> map the entire guest memory. I don't see a point in having
> a separate S2 domain, even if there are multiple instances?

And then this is the drawback, we don't really want to have duplicated
S2 page tables in the system for every stage 2.

Maybe we have made a mistake by allowing the S2 to be an unmanaged
domain. Perhaps we should create the S2 out of an unmanaged domain
like the S1.

Then the rules could be
- Unmanaged domain can be used with every smmu instance, only one
copy of the page table. The ASID in the iommu_domain is
kernel-global
- S2 domain is a child of a shared unmanaged domain. It can be used
only with the SMMU it is associated with, it has a per-SMMU VM ID
- S1 domain is a child of a S2 domain, it can be used only with the
SMMU its S2 is associated with, just because
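
The three rules above could be modelled roughly as below. This is only
a sketch: every type and field name is hypothetical, and nothing here
reflects the actual iommu_domain or arm_smmu_domain layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the one shared stage-2 page table. */
struct shared_pgtbl {
	uint64_t ttb;
};

/* Rule 1: the unmanaged domain owns the only copy of the page table. */
struct unmanaged_domain {
	struct shared_pgtbl pgtbl;
};

/* Rule 2: an S2 is a child of the unmanaged domain, tied to one SMMU. */
struct s2_domain {
	struct unmanaged_domain *parent;
	int smmu_id;		/* the SMMU instance this S2 belongs to */
	uint16_t vmid;		/* allocated per SMMU instance */
};

/* Rule 3: an S1 is a child of an S2 and inherits its SMMU binding. */
struct s1_domain {
	struct s2_domain *parent;
};

/* Every S2 resolves to the page table of its unmanaged parent. */
static const struct shared_pgtbl *s2_pgtbl(const struct s2_domain *s2)
{
	assert(s2->parent);
	return &s2->parent->pgtbl;
}

static int s2_usable_on(const struct s2_domain *s2, int smmu_id)
{
	return s2->smmu_id == smmu_id;
}

static int s1_usable_on(const struct s1_domain *s1, int smmu_id)
{
	assert(s1->parent);
	return s2_usable_on(s1->parent, smmu_id);
}
```

Two S2 domains on different SMMU instances would then carry different
VMIDs while s2_pgtbl() returns the same pointer for both.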

> Btw, from a private discussion with Eric, he expressed the
> difficulty of adding multiple SMMU instances in QEMU, as it
> would complicate the device and ACPI components.

I'm not surprised by this, but for efficiency we probably have to do
this. Eric, am I wrong?

qemu shouldn't have to do it immediately, but the kernel uAPI should
allow for a VMM that is optimized. We shouldn't exclude this by
mis-designing the kernel uAPI. qemu can replicate the invalidations
itself to make an inefficient single vSMMU.

> For VCMDQ, we do need a multi-instance environment, because there
> are multiple physical pairs of SMMU+VCMDQ, i.e. multiple VCMDQ MMIO
> regions being attached/used by different devices.

Yes. IMHO vCMDQ is the sane design here - invalidation performance is
important, having a kernel-bypass way to do it is ideal. I understand
AMD has a similar kernel-bypass queue approach for their stuff too. I
think everyone will eventually need to do this, especially for CC
applications. Having the hypervisor able to interfere with
invalidation feels like an attack vector.

So we should focus on long term designs that allow kernel-bypass to
work, and I don't see a way to hide multi-instance and still truly
support vCMDQ??

> So, I have been exploring a different approach by creating an
> internal multiplication inside VCMDQ...

How can that work?

You'd have to have the guest VM know to replicate to different
vCMDQs? Which isn't the standard SMMU programming model anymore..

Jason

2023-03-22 20:49:45

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Wed, Mar 22, 2023 at 04:41:32PM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 22, 2023 at 12:21:27PM -0700, Nicolin Chen wrote:
>
> > Do you prefer this to happen with this series?
>
> No, I just don't want to exclude doing it someday if people are
> interested to optimize this. As I said in the other thread I'd rather
> optimize SMMUv3 emulation than try to use virtio-iommu to make it run
> faster.

Got it. I will then just focus on reworking the invalidation
data structure with a list of command queue info.

> > the uAPI would be completely compatible. It seems to me that
> > we would need a different uAPI, so as to setup a queue in an
> > earlier stage, and then to ring a bell when QEMU traps any
> > incoming commands in the emulated VCMDQ.
>
> Yes, it would need more uAPI. Lets just make sure there is room and
> maybe think a bit about what it would look like.
>
> You should also draft through the HW vCMDQ stuff to ensure it fits
> in here nicely.

Yes.

> > > > Btw, just to confirm my understanding, a use case having two
> > > > or more iommu_domains means an S2 iommu_domain replacement,
> > > > right? I.e. a running S2 iommu_domain gets replaced on the fly
> > > > by a different S2 iommu_domain holding a different VMID, while
> > > > the IOAS still has the previous mappings? When would that
> > > > actually happen in the real world?
> > >
> > > It doesn't have to be replace - what is needed is that evey vPCI
> > > device connected to the same SMMU instance be using the same S2 and
> > > thus the same VM_ID.
> > >
> > > IOW evey SID must be linked to the same VM_ID or invalidation commands
> > > will not be properly processed.
> > >
> > > qemu would have to have multiple SMMU instances according to S2
> > > domains, which is probably true anyhow since we need to know what
> > > physical SMMU instance to deliver the invalidation too anyhow.
> >
> > I am not 100% following this part. So, you mean that we're
> > safe if we only have one SMMU instance, because there'd be
> > only one S2 domain, while multiple S2 domains would happen
> > if we have multiple SMMU instances?
>
> Yes, that would happen today, especially since each smmu has its own
> vm_id allocator IIRC
>
> > Can we still use the same S2 domain for multiple instances?
>
> I think not today.
>
> At the core, if we share the same S2 domain then it is a problem to
> figure out what smmu instance to send the invalidation command too. EG
> if the userspace invalidates ASID 1 you'd have to replicate
> invalidation to all SMMU instances. Even if ASID 1 is used by only a
> single SID/STE that has a single SMMU instance backing it.

Oh, right. That would be a perf drawback from potentially
unnecessary IOTLB misses, because with a single instance QEMU
has to broadcast that invalidation to all SMMU instances.

> So I think for ARM we want to reflect the physical SMMU instances into
> vSMMU instances and that feels best done by having a unique S2
> iommu_domain for each SMMU instance. Then we know that an invalidation
> for a SMMU instance is delivered to that S2's singular CMDQ and things
> like vCMDQ become possible.

In that environment, do we still need a VMID unification?

> > Our approach of setting up a stage-2 mapping in QEMU is to
> > map the entire guest memory. I don't see a point in having
> > a separate S2 domain, even if there are multiple instances?
>
> And then this is the drawback, we don't really want to have duplicated
> S2 page tables in the system for every stage 2.
>
> Maybe we have made a mistake by allowing the S2 to be an unmanaged
> domain. Perhaps we should create the S2 out of an unmanaged domain
> like the S1.
>
> Then the rules could be
> - Unmanaged domain can be used with every smmu instance, only one
> copy of the page table. The ASID in the iommu_domain is
> kernel-global
> - S2 domain is a child of a shared unmanaged domain. It can be used
> only with the SMMU it is associated with, it has a per-SMMU VM ID
> - S1 domain is a child of a S2 domain, it can be used only with the
> SMMU it's S2 is associated with, just because

The actual S2 pagetable has to stay at the unmanaged domain
for IOAS_MAP, while we maintain an s2_cfg data structure in
the shadow S2 domain per SMMU instance that has its own VMID
but a shared S2 page table pointer?

Hmm... Feels very complicated to me. How does that help?

> > Btw, from a private discussion with Eric, he expressed the
> > difficulty of adding multiple SMMU instances in QEMU, as it
> > would complicate the device and ACPI components.
>
> I'm not surprised by this, but for efficiency we probably have to do
> this. Eric am I wrong?
>
> qemu shouldn't have to do it immediately, but the kernel uAPI should
> allow for a VMM that is optimized. We shouldn't exclude this by
> mis-designing the kernel uAPI. qemu can replicate the invalidations
> itself to make an ineffecient single vSMMU.
>
> > For VCMDQ, we do need a multi-instance environment, because there
> > are multiple physical pairs of SMMU+VCMDQ, i.e. multiple VCMDQ MMIO
> > regions being attached/used by different devices.
>
> Yes. IMHO vCMDQ is the sane design here - invalidation performance is
> important, having a kernel-bypass way to do it is ideal. I understand
> AMD has a similar kernel-bypass queue approach for their stuff too. I
> think everyone will eventually need to do this, especially for CC
> applications. Having the hypervisor able to interfere with
> invalidation feels like an attack vector.
>
> So we should focus on long term designs that allow kernel-bypass to
> work, and I don't see way to hide multi-instance and still truely
> support vCMDQ??

Well, I agree and hope people across the board decide to move
towards the multi-instance direction.

> > So, I have been exploring a different approach by creating an
> > internal multiplication inside VCMDQ...
>
> How can that work?
>
> You'd have to have the guest VM to know to replicate to different
> vCMDQ's? Which isn't the standard SMMU programming model anymore..

VCMDQ has multiple VINTFs (Virtual Interfaces) that are supposed
to be used by the host to expose to multiple VMs.

In a multi-SMMU environment, every single SMMU+VCMDQ instance
would have one VINTF only that contains one or more VCMDQs. In
this case, passthrough devices behind different physical SMMU
instances are straightforwardly attached to different vSMMUs.

However, if we can't have multiple vSMMU instances, the guest
VM (its HW) would enable multiple VINTFs corresponding to the
number of physical SMMU/VCMDQ instances, for devices to attach
accordingly. That means I need to figure out a way to pin the
devices onto those VINTFs, by somehow passing their physical
SMMU IDs. The latest progress that I made is to have a bit of
a hack in the DSDT table by inserting a physical SMMU ID to
every single passthrough device node, though I still need to
confirm the legality of doing that...

Thanks
Nic

2023-03-22 20:59:17

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Mon, Mar 20, 2023 at 07:19:34PM -0300, Jason Gunthorpe wrote:

> > IIUIC, the ioctl for the link of vSID/dev_id should happen at
> > the stage when boot boots, while the HWPT alloc ioctl happens
> > at CFGI_STE.
>
> Yes
>
> > > > What could be a good prototype of the ioctl? Would it be a VFIO
> > > > device one or IOMMUFD one?
> > >
> > > If we load the vSID table it should be a iommufd one, linked to the
> > > ARM SMMUv3 driver and probably take in a pointer to an array of
> > > vSID/dev_id pairs. Maybe an add/remove type of operation.
> >
> > Will try some solution.
>
> It is only necessary if you want to do batching
>
> For non-batching the SID invalidation should be done differently with
> a device_id input instead. That is a bit tricky to organize as you
> want iommufd to get back a 'struct device *' from the ID.

I am wondering whether we need to have dev_id, i.e. IOMMUFD,
in play with the link of pSID<->vSID, as I am thinking of a
simplified approach by passing the vSID via the hwpt alloc
structure when we allocate an S2 domain.

The arm_smmu_domain_alloc_user() takes this vSID and a dev
pointer, so it can easily tie the vSID to the dev's pSID.

By doing so, we wouldn't need a new ioctl anymore.

Thanks
Nic

2023-03-23 12:19:47

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Wed, Mar 22, 2023 at 01:43:59PM -0700, Nicolin Chen wrote:

> > So I think for ARM we want to reflect the physical SMMU instances into
> > vSMMU instances and that feels best done by having a unique S2
> > iommu_domain for each SMMU instance. Then we know that an invalidation
> > for a SMMU instance is delivered to that S2's singular CMDQ and things
> > like vCMDQ become possible.
>
> In that environment, do we still need a VMID unification?

If each S2 is per-smmu-instance then the VMID can be local to the SMMU
instance

> > > Our approach of setting up a stage-2 mapping in QEMU is to
> > > map the entire guest memory. I don't see a point in having
> > > a separate S2 domain, even if there are multiple instances?
> >
> > And then this is the drawback, we don't really want to have duplicated
> > S2 page tables in the system for every stage 2.
> >
> > Maybe we have made a mistake by allowing the S2 to be an unmanaged
> > domain. Perhaps we should create the S2 out of an unmanaged domain
> > like the S1.
> >
> > Then the rules could be
> > - Unmanaged domain can be used with every smmu instance, only one
> > copy of the page table. The ASID in the iommu_domain is
> > kernel-global
> > - S2 domain is a child of a shared unmanaged domain. It can be used
> > only with the SMMU it is associated with, it has a per-SMMU VM ID
> > - S1 domain is a child of a S2 domain, it can be used only with the
> > SMMU it's S2 is associated with, just because
>
> The actual S2 pagetable has to stay at the unmanaged domain
> for IOAS_MAP, while we maintain an s2_cfg data structure in
> the shadow S2 domain per SMMU instance that has its own VMID
> but a shared S2 page table pointer?

Yes

> Hmm... Feels very complicated to me. How does that help?

It de-duplicates the page table across multiple SMMU instances.

> > > So, I have been exploring a different approach by creating an
> > > internal multiplication inside VCMDQ...
> >
> > How can that work?
> >
> > You'd have to have the guest VM to know to replicate to different
> > vCMDQ's? Which isn't the standard SMMU programming model anymore..
>
> VCMDQ has multiple VINTFs (Virtual Interfaces) that's supposed
> to be used by the host to expose to multiple VMs.
>
> In a multi-SMMU environment, every single SMMU+VCMDQ instance
> would have one VINTF only that contains one or more VCMDQs. In
> this case, passthrough devices behind different physical SMMU
> instances are straightforwardly attached to different vSMMUs.

Yes, this is the obvious simple implementation

> However, if we can't have multiple vSMMU instances, the guest
> VM (its HW) would enable multiple VINTFs corresponding to the
> number of physical SMMU/VCMDQ instances, for devices to attach
> accordingly. That means I need to figure out a way to pin the
> devices onto those VINTFs, by somehow passing their physical
> SMMU IDs.

And a way to request the correctly bound vCMDQ from the guest as well.
Sounds really messy, I'd think multi-smmu is the much cleaner choice

Jason

2023-03-23 12:19:53

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Wed, Mar 22, 2023 at 01:57:23PM -0700, Nicolin Chen wrote:
> On Mon, Mar 20, 2023 at 07:19:34PM -0300, Jason Gunthorpe wrote:
>
> > > IIUIC, the ioctl for the link of vSID/dev_id should happen at
> > > the stage when boot boots, while the HWPT alloc ioctl happens
> > > at CFGI_STE.
> >
> > Yes
> >
> > > > > What could be a good prototype of the ioctl? Would it be a VFIO
> > > > > device one or IOMMUFD one?
> > > >
> > > > If we load the vSID table it should be a iommufd one, linked to the
> > > > ARM SMMUv3 driver and probably take in a pointer to an array of
> > > > vSID/dev_id pairs. Maybe an add/remove type of operation.
> > >
> > > Will try some solution.
> >
> > It is only necessary if you want to do batching
> >
> > For non-batching the SID invalidation should be done differently with
> > a device_id input instead. That is a bit tricky to organize as you
> > want iommufd to get back a 'struct device *' from the ID.
>
> I am wondering whether we need to have dev_id, i.e. IOMMUFD,
> in play with the link of pSID<->vSID, as I am thinking of a
> simplified approach by passing the vSID via the hwpt alloc
> structure when we allocate an S2 domain.

No, that doesn't make sense. The vSID is per-STE, the S2 domain is
fully shared. You can't put SID information in the iommu_domains.
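
One way to keep the SID information out of the iommu_domain is a
separate table of vSID/dev_id pairs with add and remove operations, as
floated earlier in the thread. The sketch below is purely illustrative:
the fixed-size array, the names (vsid_table and friends) and the error
convention are all assumptions, not a proposed uAPI.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VSID_TABLE_MAX 16	/* arbitrary size for the sketch */

struct vsid_entry {
	uint32_t vsid;		/* stream ID as seen by the guest */
	uint32_t dev_id;	/* iommufd device ID on the host side */
	int used;
};

/* Hypothetical table kept next to the vSMMU, not in any iommu_domain. */
struct vsid_table {
	struct vsid_entry e[VSID_TABLE_MAX];
};

static void vsid_table_init(struct vsid_table *t)
{
	memset(t, 0, sizeof(*t));
}

/* Returns the entry for vsid, or NULL if it is not present. */
static struct vsid_entry *vsid_table_find(struct vsid_table *t, uint32_t vsid)
{
	int i;

	for (i = 0; i < VSID_TABLE_MAX; i++)
		if (t->e[i].used && t->e[i].vsid == vsid)
			return &t->e[i];
	return NULL;
}

/* Returns 0 on success, -1 if vsid already exists or the table is full. */
static int vsid_table_add(struct vsid_table *t, uint32_t vsid, uint32_t dev_id)
{
	int i;

	if (vsid_table_find(t, vsid))
		return -1;
	for (i = 0; i < VSID_TABLE_MAX; i++) {
		if (!t->e[i].used) {
			t->e[i].vsid = vsid;
			t->e[i].dev_id = dev_id;
			t->e[i].used = 1;
			return 0;
		}
	}
	return -1;
}

/* Returns 0 on success, -1 if vsid is not present. */
static int vsid_table_remove(struct vsid_table *t, uint32_t vsid)
{
	struct vsid_entry *ent = vsid_table_find(t, vsid);

	if (!ent)
		return -1;
	ent->used = 0;
	return 0;
}

/* Translate a guest vSID to a dev_id; returns 0 on success, -1 if unknown. */
static int vsid_table_lookup(struct vsid_table *t, uint32_t vsid,
			     uint32_t *dev_id)
{
	struct vsid_entry *ent = vsid_table_find(t, vsid);

	assert(dev_id);
	if (!ent)
		return -1;
	*dev_id = ent->dev_id;
	return 0;
}
```

A batched invalidation carrying a vSID would then go through
vsid_table_lookup() to find the device, independently of which S2
domain that device is attached to.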

Jason

2023-03-23 18:26:53

by Nicolin Chen

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Thu, Mar 23, 2023 at 09:16:51AM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 22, 2023 at 01:43:59PM -0700, Nicolin Chen wrote:
>
> > > So I think for ARM we want to reflect the physical SMMU instances into
> > > vSMMU instances and that feels best done by having a unique S2
> > > iommu_domain for each SMMU instance. Then we know that an invalidation
> > > for a SMMU instance is delivered to that S2's singular CMDQ and things
> > > like vCMDQ become possible.
> >
> > In that environment, do we still need a VMID unification?
>
> If each S2 is per-smmu-instance then the VMID can be local to the SMMU
> instance

It sounds like that is related to the multi-SMMU instance too? Anyway,
it's good to think that we have a way out from requiring this
VMID unification.

> > > > Our approach of setting up a stage-2 mapping in QEMU is to
> > > > map the entire guest memory. I don't see a point in having
> > > > a separate S2 domain, even if there are multiple instances?
> > >
> > > And then this is the drawback, we don't really want to have duplicated
> > > S2 page tables in the system for every stage 2.
> > >
> > > Maybe we have made a mistake by allowing the S2 to be an unmanaged
> > > domain. Perhaps we should create the S2 out of an unmanaged domain
> > > like the S1.
> > >
> > > Then the rules could be
> > > - Unmanaged domain can be used with every smmu instance, only one
> > > copy of the page table. The ASID in the iommu_domain is
> > > kernel-global
> > > - S2 domain is a child of a shared unmanaged domain. It can be used
> > > only with the SMMU it is associated with, it has a per-SMMU VM ID
> > > - S1 domain is a child of a S2 domain, it can be used only with the
> > > SMMU it's S2 is associated with, just because
> >
> > The actual S2 pagetable has to stay at the unmanaged domain
> > for IOAS_MAP, while we maintain an s2_cfg data structure in
> > the shadow S2 domain per SMMU instance that has its own VMID
> > but a shared S2 page table pointer?
>
> Yes
>
> > Hmm... Feels very complicated to me. How does that help?
>
> It de-duplicates the page table across multiple SMMU instances.

Oh. So that the s2_cfg data structures can have a shared S2
IOPT while having different VMIDs. This would be a big rework.
It changes the two-domain design for nesting. Should we do
this at a later stage when supporting multi-SMMU instance or
now? And I am not sure Intel would need this...
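
For illustration, a minimal sketch of the layout being discussed: one
shared S2 page table, plus one small per-SMMU-instance config that
carries its own VMID and points at that table. Every name below
(demo_*) is hypothetical and comes neither from the driver nor from
this series; the per-instance counter is only a stand-in for a real
VMID allocator.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_SMMUS 4	/* assumed number of physical SMMU instances */

/* Hypothetical: the single S2 page table owned by the unmanaged domain. */
struct demo_s2_pgtable {
	uint64_t ttb;		/* root of the shared stage-2 table */
};

/* Hypothetical: per-SMMU-instance S2 config, own VMID, shared table. */
struct demo_s2_cfg {
	unsigned int smmu_id;
	uint16_t vmid;
	const struct demo_s2_pgtable *pgtbl;
};

/* Hypothetical: VMIDs are allocated per instance, not kernel-global. */
struct demo_vmid_alloc {
	uint16_t next[DEMO_MAX_SMMUS];
};

/* Returns 0 on success, -1 on invalid arguments. */
int demo_s2_cfg_init(struct demo_s2_cfg *cfg, struct demo_vmid_alloc *va,
		     unsigned int smmu_id,
		     const struct demo_s2_pgtable *shared)
{
	if (!cfg || !va || !shared || smmu_id >= DEMO_MAX_SMMUS)
		return -1;

	cfg->smmu_id = smmu_id;
	cfg->vmid = ++va->next[smmu_id];	/* VMID 0 is left unused here */
	cfg->pgtbl = shared;			/* no page table duplication */
	return 0;
}
```

In this model two instances can legitimately end up with the same VMID
value, because the VMID only has meaning within its own SMMU instance.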

> > > > So, I have been exploring a different approach by creating an
> > > > internal multiplication inside VCMDQ...
> > >
> > > How can that work?
> > >
> > > You'd have to have the guest VM to know to replicate to different
> > > vCMDQ's? Which isn't the standard SMMU programming model anymore..
> >
> > VCMDQ has multiple VINTFs (Virtual Interfaces) that's supposed
> > to be used by the host to expose to multiple VMs.
> >
> > In a multi-SMMU environment, every single SMMU+VCMDQ instance
> > would have one VINTF only that contains one or more VCMDQs. In
> > this case, passthrough devices behind different physical SMMU
> > instances are straightforwardly attached to different vSMMUs.
>
> Yes, this is the obvious simple implementation
>
> > However, if we can't have multiple vSMMU instances, the guest
> > VM (its HW) would enable multiple VINTFs corresponding to the
> > number of physical SMMU/VCMDQ instances, for devices to attach
> > accordingly. That means I need to figure out a way to pin the
> > devices onto those VINTFs, by somehow passing their physical
> > SMMU IDs.
>
> And a way to request the correctly bound vCMDQ from the guest as well.
> Sounds really messy, I'd think multi-smmu is the much cleaner choice

Yes. I agree, we would need the entire QEMU community to give
consent to change that though.

Thanks!
Nicolin

2023-03-23 18:33:31

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Thu, Mar 23, 2023 at 11:13:48AM -0700, Nicolin Chen wrote:
> Oh. So that the s2_cfg data structures can have a shared S2
> IOPT while having different VMIDs. This would be a big rework.
> It changes the two-domain design for nesting. Should we do
> this at a later stage when supporting multi-SMMU instance or
> now? And I am not sure Intel would need this...

If we do nothing right now then the S2 unmanaged iommu_domain will
carry the vm_id and it will be locked to a single SMMU instance.

To support multi-instance HW qemu would have to duplicate the entire
S2 unmanaged domain to get different vm_ids.

This is basically status-quo today because SMMU already doesn't
support sharing the unmanaged iommu_domain between instances.

If we chart a path to using a dedicated S2 domain then qemu side would
have to change to make a normal HWPT to back the S2 and then create a
real S2 as a child.

This implies that the request for S2 has to be in the driver data
today so that the driver knows if it should enable the unmanaged
domain for S2 operation and lock it to an instance.

So long as that is OK we are probably OK to be incremental..

> > And a way to request the correctly bound vCMDQ from the guest as well.
> > Sounds really messy, I'd think multi-smmu is the much cleaner choice
>
> Yes. I agree, we would need the entire QEMU community to give
> consent to change that though.

I suppose it wasn't about consent; it was that someone needs to do the
difficult work.

Jason

2023-03-24 08:51:48

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, March 21, 2023 7:49 PM
>
> On Tue, Mar 21, 2023 at 08:34:00AM +0000, Tian, Kevin wrote:
>
> > > > Rephrasing that to put into a design: the IOCTL would pass a
> > > > user pointer to the queue, the size of the queue, then a head
> > > > pointer and a tail pointer? Then the kernel reads out all the
> > > > commands between the head and the tail and handles all those
> > > > invalidation commands only?
> > >
> > > Yes, that is one possible design
> >
> > If we cannot have the short path in the kernel then I'm not sure the
> > value of using native format and queue in the uAPI. Batching can
> > be enabled over any format.
>
> SMMUv3 will have a hardware short path where the HW itself runs the
> VM's command queue and does this logic.
>
> So I like the symmetry of the SW path being close to that.
>

Out of curiosity. VCMDQ is per SMMU. Does it imply that Qemu needs
to create multiple vSMMU instances if devices assigned to it are behind
different physical SMMUs (plus one instance specific for emulated
devices), to match VCMDQ with a specific device?

btw is VCMDQ in standard SMMU spec or a NVIDIA specific extension?
If the latter does it require extra changes in guest smmu driver?

The symmetry of the SW path has another merit beyond performance.
It allows live migration falling back to the sw short-path with not-so-bad
overhead when the dest machine cannot afford the same number of
VCMDQ's as the src.

But still the main open question for the in-kernel short-path is what
the framework would be to move part of the vIOMMU emulation into the
kernel. If this can be done cleanly then it's better than vhost-iommu,
which lags significantly behind regarding advanced features. But if it
cannot be done cleanly, leaving each vendor to move random emulation
logic into the kernel, then vhost-iommu sounds more friendly to the
kernel, though lots of work remains to fill the feature gap.
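
For illustration, a minimal sketch of the batching design quoted above:
the request carries a pointer to the queue, its size and two indices,
and the consumer handles every command between them. All names (demo_*)
and the request layout are hypothetical, not the uAPI of this series.
The thread's "head" and "tail" are called cons and prod here to avoid
guessing which end is which. The sketch assumes a power-of-two queue
where cons == prod means empty, and it omits the wrap bits that the
real SMMUv3 queue registers carry.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_cmd {
	uint64_t data[2];	/* an SMMUv3 command is 16 bytes */
};

/* Hypothetical request layout. */
struct demo_inv_req {
	const struct demo_cmd *queue;
	uint32_t nents;		/* number of entries, power of two */
	uint32_t cons;		/* next entry to handle */
	uint32_t prod;		/* one past the last entry to handle */
};

typedef int (*demo_cmd_fn)(const struct demo_cmd *cmd, void *ctx);

/*
 * Handle all commands in [cons, prod), wrapping at nents.
 * Returns the number of commands handled, or -1 on a malformed request.
 * Stops early, leaving cons at the failing entry, if fn() returns nonzero.
 */
int demo_consume(struct demo_inv_req *req, demo_cmd_fn fn, void *ctx)
{
	uint32_t mask;
	int handled = 0;

	if (!req || !req->queue || !fn || req->nents == 0 ||
	    (req->nents & (req->nents - 1)))
		return -1;

	mask = req->nents - 1;
	if (req->cons > mask || req->prod > mask)
		return -1;

	while (req->cons != req->prod) {
		if (fn(&req->queue[req->cons], ctx))
			break;
		req->cons = (req->cons + 1) & mask;
		handled++;
	}
	return handled;
}

/* Example handler: counts commands; opcode 0 is treated as invalid here. */
int demo_count_handler(const struct demo_cmd *cmd, void *ctx)
{
	if ((cmd->data[0] & 0xff) == 0)
		return -1;
	(*(unsigned int *)ctx)++;
	return 0;
}
```

Leaving cons at the failing entry mirrors how a command queue reports
which entry could not be processed, so the caller can tell how far the
batch got.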

2023-03-24 08:57:03

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, March 22, 2023 1:15 PM
>
> >
> > Something has to generate CMD_ATC_INV.
> >
> > How do you plan to generate this from the hypervisor based on ASID
> > invalidations?
> >
> > The hypervisor doesn't know what ASIDs are connected to what SIDs to
> > generate the ATC?
> >
> > Intel is different, they know what devices the vDID is connected to,
> > so when they get a vDID invalidation they can elaborate it into a ATC
> > invalidation. ARM doesn't have that information.
>
> I see. Perhaps vSMMU still needs to forward CMD_ATC_INV. And,

Ah that's quite a different story.

2023-03-24 09:15:50

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

> From: Nicolin Chen <[email protected]>
> Sent: Wednesday, March 22, 2023 2:42 PM
>
> On Tue, Mar 21, 2023 at 08:48:31AM -0300, Jason Gunthorpe wrote:
> > On Tue, Mar 21, 2023 at 08:34:00AM +0000, Tian, Kevin wrote:
> >
> > > > > Rephrasing that to put into a design: the IOCTL would pass a
> > > > > user pointer to the queue, the size of the queue, then a head
> > > > > pointer and a tail pointer? Then the kernel reads out all the
> > > > > commands between the head and the tail and handles all those
> > > > > invalidation commands only?
> > > >
> > > > Yes, that is one possible design
> > >
> > > If we cannot have the short path in the kernel then I'm not sure the
> > > value of using native format and queue in the uAPI. Batching can
> > > be enabled over any format.
> >
> > SMMUv3 will have a hardware short path where the HW itself runs the
> > VM's command queue and does this logic.
> >
> > So I like the symmetry of the SW path being close to that.
>
> A tricky thing here that I just realized:
>
> With VCMDQ, the guest will have two CMDQs. One is the vSMMU's
> CMDQ handling all non-TLBI commands like CMD_CFGI_STE via the
> invalidation IOCTL, and the other hardware accelerated VCMDQ
> handling all TLBI commands by the HW. In this setup, we will
> need a VCMDQ kernel driver to dispatch commands into the two
> different queues.
>

why doesn't hw generate a vm-exit for unsupported CMDs in VCMDQ
and then let them be emulated by vSMMU? such events should be rare
once map/unmap are being conducted...
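
For illustration, a minimal sketch of the two-queue dispatch described
in the quoted text: TLBI commands go to the hardware-accelerated VCMDQ
and everything else goes to the vSMMU's CMDQ. The names (demo_*) are
hypothetical, the opcode values are illustrative, and the exact set of
commands that VCMDQ accepts is an assumption here. Only the position of
the opcode, bits [7:0] of the first command word, follows the SMMUv3
command format.

```c
#include <stdint.h>

/* Illustrative opcodes; only a few are listed. */
#define DEMO_OP_CFGI_STE	0x03
#define DEMO_OP_TLBI_NH_ASID	0x11
#define DEMO_OP_TLBI_NH_VA	0x12

enum demo_queue {
	DEMO_Q_VSMMU_CMDQ,	/* emulated path, ends up in the ioctl */
	DEMO_Q_VCMDQ,		/* hardware-accelerated path */
};

/* Pick the target queue from the first 64-bit word of a command. */
enum demo_queue demo_pick_queue(uint64_t cmd_word0)
{
	switch (cmd_word0 & 0xff) {
	case DEMO_OP_TLBI_NH_ASID:
	case DEMO_OP_TLBI_NH_VA:
		return DEMO_Q_VCMDQ;
	default:
		/* CFGI and anything unknown take the emulated path. */
		return DEMO_Q_VSMMU_CMDQ;
	}
}
```

Sending unknown opcodes down the emulated path is a conservative
default for this sketch; it does not model the alternative raised in
this thread, where the hardware would pause on an unsupported command
and hand it to the hypervisor.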

2023-03-24 14:50:20

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Fri, Mar 24, 2023 at 08:47:20AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Tuesday, March 21, 2023 7:49 PM
> >
> > On Tue, Mar 21, 2023 at 08:34:00AM +0000, Tian, Kevin wrote:
> >
> > > > > Rephrasing that to put into a design: the IOCTL would pass a
> > > > > user pointer to the queue, the size of the queue, then a head
> > > > > pointer and a tail pointer? Then the kernel reads out all the
> > > > > commands between the head and the tail and handles all those
> > > > > invalidation commands only?
> > > >
> > > > Yes, that is one possible design
> > >
> > > If we cannot have the short path in the kernel then I'm not sure the
> > > value of using native format and queue in the uAPI. Batching can
> > > be enabled over any format.
> >
> > SMMUv3 will have a hardware short path where the HW itself runs the
> > VM's command queue and does this logic.
> >
> > So I like the symmetry of the SW path being close to that.
> >
>
> Out of curiosity. VCMDQ is per SMMU. Does it imply that Qemu needs
> to create multiple vSMMU instances if devices assigned to it are behind
> different physical SMMUs (plus one instance specific for emulated
> devices), to match VCMDQ with a specific device?

Yes

> btw is VCMDQ in standard SMMU spec or a NVIDIA specific extension?
> If the latter does it require extra changes in guest smmu driver?

It is a mash up of ARM standard ECMDQ with a few additions. I hope ARM
will standardize something someday

> The symmetry of the SW path has another merit beyond performance.
> It allows live migration falling back to the sw short-path with not-so-bad
> overhead when the dest machine cannot afford the same number of
> VCMDQ's as the src.

Well, that requires SW emulation of the VCMDQ thing, but yes

> But still the main open for in-kernel short-path is what would be the
> framework to move part of vIOMMU emulation into the kernel. If this
> can be done cleanly then it's better than vhost-iommu which lacks
> behind significantly regarding to advanced features. But if it cannot
> be done cleanly leaving each vendor move random emulation logic
> into the kernel then vhost-iommu sounds more friendly to the kernel
> though lots of work remains to fill the feature gap.

I assume there are reasonable ways to hook the kernel to kvm, vhost
does it. I've never looked at it. At worst we need to factor some of
the vhost code into some library to allow it.

We want a kernel thread to wakeup on a doorbell ring basically.

Jason

2023-03-24 14:57:39

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Fri, Mar 24, 2023 at 09:02:34AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <[email protected]>
> > Sent: Wednesday, March 22, 2023 2:42 PM
> >
> > On Tue, Mar 21, 2023 at 08:48:31AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Mar 21, 2023 at 08:34:00AM +0000, Tian, Kevin wrote:
> > >
> > > > > > Rephrasing that to put into a design: the IOCTL would pass a
> > > > > > user pointer to the queue, the size of the queue, then a head
> > > > > > pointer and a tail pointer? Then the kernel reads out all the
> > > > > > commands between the head and the tail and handles all those
> > > > > > invalidation commands only?
> > > > >
> > > > > Yes, that is one possible design
> > > >
> > > > If we cannot have the short path in the kernel then I'm not sure the
> > > > value of using native format and queue in the uAPI. Batching can
> > > > be enabled over any format.
> > >
> > > SMMUv3 will have a hardware short path where the HW itself runs the
> > > VM's command queue and does this logic.
> > >
> > > So I like the symmetry of the SW path being close to that.
> >
> > A tricky thing here that I just realized:
> >
> > With VCMDQ, the guest will have two CMDQs. One is the vSMMU's
> > CMDQ handling all non-TLBI commands like CMD_CFGI_STE via the
> > invalidation IOCTL, and the other hardware accelerated VCMDQ
> > handling all TLBI commands by the HW. In this setup, we will
> > need a VCMDQ kernel driver to dispatch commands into the two
> > different queues.
> >
>
> why doesn't hw generate a vm-exit for unsupported CMDs in VCMDQ
> and then let them emulated by vSMMU? such events should be rare
> once map/unmap are being conducted...

IIRC vcmdq is defined to only process invalidations, so it would be a
driver error to send anything else. I think this is what Nicolin
means. Most likely to use it the VM would have to see the nvidia acpi
extension and activate vcmdq in the VM.

If you suggest to overlay the main cmdq with the vcmdq and then don't
tell the guest about it.. Robin suggested something similar.

This idea would be a half and half, the HW would run the queue and the
doorbell and generate error interrupts back to the hypervisor and tell
it that the queue is paused and ask it to fix the failed entry and
restart.

I could see this as an interesting solution, but I don't know if this
HW can support it..

Jason

2023-03-24 15:34:38

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v1 11/14] iommu/arm-smmu-v3: Add arm_smmu_domain_alloc_user

Hi Nicolin,

On 3/9/23 11:53, Nicolin Chen wrote:
> The arm_smmu_domain_alloc_user callback function is used for userspace to
> allocate iommu_domains, such as standalone stage-1 domain, nested stage-1
> domain, and nested stage-2 domain. The input user_data is in the type of
> struct iommu_hwpt_arm_smmuv3 that contains the configurations of a nested
> stage-1 or a nested stage-2 iommu_domain. A NULL user_data will just opt
> in a standalone stage-1 domain allocation.
>
> Add a constitutive function __arm_smmu_domain_alloc to support that.
>
> Since ops->domain_alloc_user has a valid dev pointer, the master pointer
> is available when calling __arm_smmu_domain_alloc() in this case, meaning
> that arm_smmu_domain_finalise() can be done at the allocation stage. This
> allows IOMMUFD to initialize the hw_pagetable for the domain.
>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 95 ++++++++++++++-------
> 1 file changed, 65 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 2d29f7320570..5ff74edfbd68 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2053,36 +2053,6 @@ static void *arm_smmu_hw_info(struct device *dev, u32 *length)
> return info;
> }
>
> -static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
> -{
> - struct arm_smmu_domain *smmu_domain;
> -
> - if (type == IOMMU_DOMAIN_SVA)
> - return arm_smmu_sva_domain_alloc();
> -
> - if (type != IOMMU_DOMAIN_UNMANAGED &&
> - type != IOMMU_DOMAIN_DMA &&
> - type != IOMMU_DOMAIN_DMA_FQ &&
> - type != IOMMU_DOMAIN_IDENTITY)
> - return NULL;
> -
> - /*
> - * Allocate the domain and initialise some of its data structures.
> - * We can't really do anything meaningful until we've added a
> - * master.
> - */
> - smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
> - if (!smmu_domain)
> - return NULL;
> -
> - mutex_init(&smmu_domain->init_mutex);
> - INIT_LIST_HEAD(&smmu_domain->devices);
> - spin_lock_init(&smmu_domain->devices_lock);
> - INIT_LIST_HEAD(&smmu_domain->mmu_notifiers);
> -
> - return &smmu_domain->domain;
> -}
> -
> static struct iommu_domain *arm_smmu_get_unmanaged_domain(struct device *dev)
> {
> struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> @@ -2893,10 +2863,75 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
> arm_smmu_sva_remove_dev_pasid(domain, dev, pasid);
> }
>
> +static struct iommu_domain *
> +__arm_smmu_domain_alloc(unsigned type,
> + struct arm_smmu_domain *s2,
> + struct arm_smmu_master *master,
> + const struct iommu_hwpt_arm_smmuv3 *user_cfg)
> +{
> + struct arm_smmu_domain *smmu_domain;
> + struct iommu_domain *domain;
> + int ret = 0;
> +
> + if (type == IOMMU_DOMAIN_SVA)
> + return arm_smmu_sva_domain_alloc();
> +
> + if (type != IOMMU_DOMAIN_UNMANAGED &&
> + type != IOMMU_DOMAIN_DMA &&
> + type != IOMMU_DOMAIN_DMA_FQ &&
> + type != IOMMU_DOMAIN_IDENTITY)
> + return NULL;
> +
> + /*
> + * Allocate the domain and initialise some of its data structures.
> + * We can't really finalise the domain unless a master is given.
> + */
> + smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
> + if (!smmu_domain)
> + return NULL;
> + domain = &smmu_domain->domain;
> +
> + domain->type = type;
> + domain->ops = arm_smmu_ops.default_domain_ops;
Compared to the original code, that's something new. Please can you
explain why this is added in this patch?
> +
> + mutex_init(&smmu_domain->init_mutex);
> + INIT_LIST_HEAD(&smmu_domain->devices);
> + spin_lock_init(&smmu_domain->devices_lock);
> + INIT_LIST_HEAD(&smmu_domain->mmu_notifiers);
> +
> + if (master) {
> + smmu_domain->smmu = master->smmu;
> + ret = arm_smmu_domain_finalise(domain, master, user_cfg);
> + if (ret) {
> + kfree(smmu_domain);
> + return NULL;
> + }
> + }
> +
> + return domain;
> +}
> +
> +static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
> +{
> + return __arm_smmu_domain_alloc(type, NULL, NULL, NULL);
> +}
> +
> +static struct iommu_domain *
> +arm_smmu_domain_alloc_user(struct device *dev, struct iommu_domain *parent,
> + const void *user_data)
> +{
> + const struct iommu_hwpt_arm_smmuv3 *user_cfg = user_data;
> + struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> + unsigned type = IOMMU_DOMAIN_UNMANAGED;
is there any guarantee that master is non null? Don't we want to check?
> +
> + return __arm_smmu_domain_alloc(type, NULL, master, user_cfg);
> +}
> +
> static struct iommu_ops arm_smmu_ops = {
> .capable = arm_smmu_capable,
> .hw_info = arm_smmu_hw_info,
> .domain_alloc = arm_smmu_domain_alloc,
> + .domain_alloc_user = arm_smmu_domain_alloc_user,
> .get_unmanaged_domain = arm_smmu_get_unmanaged_domain,
> .probe_device = arm_smmu_probe_device,
> .release_device = arm_smmu_release_device,
Thanks

Eric

2023-03-24 15:39:01

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v1 11/14] iommu/arm-smmu-v3: Add arm_smmu_domain_alloc_user



On 3/9/23 11:53, Nicolin Chen wrote:
> The arm_smmu_domain_alloc_user callback function is used for userspace to
> allocate iommu_domains, such as standalone stage-1 domain, nested stage-1
> domain, and nested stage-2 domain. The input user_data is in the type of
> struct iommu_hwpt_arm_smmuv3 that contains the configurations of a nested
> stage-1 or a nested stage-2 iommu_domain. A NULL user_data will just opt
> in a standalone stage-1 domain allocation.
>
> Add a constitutive function __arm_smmu_domain_alloc to support that.
>
> Since ops->domain_alloc_user has a valid dev pointer, the master pointer
> is available when calling __arm_smmu_domain_alloc() in this case, meaning
> that arm_smmu_domain_finalise() can be done at the allocation stage. This
> allows IOMMUFD to initialize the hw_pagetable for the domain.
>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 95 ++++++++++++++-------
> 1 file changed, 65 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 2d29f7320570..5ff74edfbd68 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2053,36 +2053,6 @@ static void *arm_smmu_hw_info(struct device *dev, u32 *length)
> return info;
> }
>
> -static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
> -{
> - struct arm_smmu_domain *smmu_domain;
> -
> - if (type == IOMMU_DOMAIN_SVA)
> - return arm_smmu_sva_domain_alloc();
> -
> - if (type != IOMMU_DOMAIN_UNMANAGED &&
> - type != IOMMU_DOMAIN_DMA &&
> - type != IOMMU_DOMAIN_DMA_FQ &&
> - type != IOMMU_DOMAIN_IDENTITY)
> - return NULL;
> -
> - /*
> - * Allocate the domain and initialise some of its data structures.
> - * We can't really do anything meaningful until we've added a
> - * master.
> - */
> - smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
> - if (!smmu_domain)
> - return NULL;
> -
> - mutex_init(&smmu_domain->init_mutex);
> - INIT_LIST_HEAD(&smmu_domain->devices);
> - spin_lock_init(&smmu_domain->devices_lock);
> - INIT_LIST_HEAD(&smmu_domain->mmu_notifiers);
> -
> - return &smmu_domain->domain;
> -}
> -
> static struct iommu_domain *arm_smmu_get_unmanaged_domain(struct device *dev)
> {
> struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> @@ -2893,10 +2863,75 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
> arm_smmu_sva_remove_dev_pasid(domain, dev, pasid);
> }
>
> +static struct iommu_domain *
> +__arm_smmu_domain_alloc(unsigned type,
> + struct arm_smmu_domain *s2,
I think you should rather introduce the s2 param in "iommu/arm-smmu-v3:
Support IOMMU_DOMAIN_NESTED type of allocations" because it is not used
at all in this patch nor really explained in the commit msg

Thanks

Eric
> + struct arm_smmu_master *master,
> + const struct iommu_hwpt_arm_smmuv3 *user_cfg)
> +{
> + struct arm_smmu_domain *smmu_domain;
> + struct iommu_domain *domain;
> + int ret = 0;
> +
> + if (type == IOMMU_DOMAIN_SVA)
> + return arm_smmu_sva_domain_alloc();
> +
> + if (type != IOMMU_DOMAIN_UNMANAGED &&
> + type != IOMMU_DOMAIN_DMA &&
> + type != IOMMU_DOMAIN_DMA_FQ &&
> + type != IOMMU_DOMAIN_IDENTITY)
> + return NULL;
> +
> + /*
> + * Allocate the domain and initialise some of its data structures.
> + * We can't really finalise the domain unless a master is given.
> + */
> + smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
> + if (!smmu_domain)
> + return NULL;
> + domain = &smmu_domain->domain;
> +
> + domain->type = type;
> + domain->ops = arm_smmu_ops.default_domain_ops;
> +
> + mutex_init(&smmu_domain->init_mutex);
> + INIT_LIST_HEAD(&smmu_domain->devices);
> + spin_lock_init(&smmu_domain->devices_lock);
> + INIT_LIST_HEAD(&smmu_domain->mmu_notifiers);
> +
> + if (master) {
> + smmu_domain->smmu = master->smmu;
> + ret = arm_smmu_domain_finalise(domain, master, user_cfg);
> + if (ret) {
> + kfree(smmu_domain);
> + return NULL;
> + }
> + }
> +
> + return domain;
> +}
> +
> +static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
> +{
> + return __arm_smmu_domain_alloc(type, NULL, NULL, NULL);
> +}
> +
> +static struct iommu_domain *
> +arm_smmu_domain_alloc_user(struct device *dev, struct iommu_domain *parent,
> + const void *user_data)
> +{
> + const struct iommu_hwpt_arm_smmuv3 *user_cfg = user_data;
> + struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> + unsigned type = IOMMU_DOMAIN_UNMANAGED;
> +
> + return __arm_smmu_domain_alloc(type, NULL, master, user_cfg);
> +}
> +
> static struct iommu_ops arm_smmu_ops = {
> .capable = arm_smmu_capable,
> .hw_info = arm_smmu_hw_info,
> .domain_alloc = arm_smmu_domain_alloc,
> + .domain_alloc_user = arm_smmu_domain_alloc_user,
> .get_unmanaged_domain = arm_smmu_get_unmanaged_domain,
> .probe_device = arm_smmu_probe_device,
> .release_device = arm_smmu_release_device,

2023-03-24 15:46:29

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v1 12/14] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED type of allocations

Hi Nicolin,

On 3/9/23 11:53, Nicolin Chen wrote:
> Add domain allocation support for IOMMU_DOMAIN_NESTED type. This includes
> the "finalise" part to log in the user space Stream Table Entry info.

Please explain the domain ops specialization.
>
> Co-developed-by: Eric Auger <[email protected]>
> Signed-off-by: Eric Auger <[email protected]>
> Signed-off-by: Nicolin Chen <[email protected]>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 38 +++++++++++++++++++--
> 1 file changed, 36 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 5ff74edfbd68..1f318b5e0921 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2214,6 +2214,19 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
> return 0;
> }
>
> + if (domain->type == IOMMU_DOMAIN_NESTED) {
> + if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1) ||
> + !(smmu->features & ARM_SMMU_FEAT_TRANS_S2)) {
> + dev_dbg(smmu->dev, "does not implement two stages\n");
> + return -EINVAL;
> + }
> + smmu_domain->stage = ARM_SMMU_DOMAIN_S1;
> + smmu_domain->s1_cfg.s1fmt = user_cfg->s1fmt;
> + smmu_domain->s1_cfg.s1cdmax = user_cfg->s1cdmax;
> + smmu_domain->s1_cfg.cdcfg.cdtab_dma = user_cfg->s1ctxptr;
> + return 0;
> + }
> +
> if (user_cfg_s2 && !(smmu->features & ARM_SMMU_FEAT_TRANS_S2))
> return -EINVAL;
> if (user_cfg_s2)
> @@ -2863,6 +2876,11 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
> arm_smmu_sva_remove_dev_pasid(domain, dev, pasid);
> }
>
> +static const struct iommu_domain_ops arm_smmu_nested_domain_ops = {
> + .attach_dev = arm_smmu_attach_dev,
> + .free = arm_smmu_domain_free,
> +};
> +
> static struct iommu_domain *
> __arm_smmu_domain_alloc(unsigned type,
> struct arm_smmu_domain *s2,
> @@ -2877,11 +2895,15 @@ __arm_smmu_domain_alloc(unsigned type,
> return arm_smmu_sva_domain_alloc();
>
> if (type != IOMMU_DOMAIN_UNMANAGED &&
> + type != IOMMU_DOMAIN_NESTED &&
> type != IOMMU_DOMAIN_DMA &&
> type != IOMMU_DOMAIN_DMA_FQ &&
> type != IOMMU_DOMAIN_IDENTITY)
> return NULL;
>
> + if (s2 && s2->stage != ARM_SMMU_DOMAIN_S2)
> + return NULL;
> +
> /*
> * Allocate the domain and initialise some of its data structures.
> * We can't really finalise the domain unless a master is given.
> @@ -2889,10 +2911,14 @@ __arm_smmu_domain_alloc(unsigned type,
> smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
> if (!smmu_domain)
> return NULL;
> + smmu_domain->s2 = s2;
> domain = &smmu_domain->domain;
>
> domain->type = type;
> - domain->ops = arm_smmu_ops.default_domain_ops;
> + if (s2)
> + domain->ops = &arm_smmu_nested_domain_ops;
> + else
> + domain->ops = arm_smmu_ops.default_domain_ops;
>
> mutex_init(&smmu_domain->init_mutex);
> INIT_LIST_HEAD(&smmu_domain->devices);
> @@ -2923,8 +2949,16 @@ arm_smmu_domain_alloc_user(struct device *dev, struct iommu_domain *parent,
> const struct iommu_hwpt_arm_smmuv3 *user_cfg = user_data;
> struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> unsigned type = IOMMU_DOMAIN_UNMANAGED;
> + struct arm_smmu_domain *s2 = NULL;
> +
> + if (parent) {
> + if (parent->ops != arm_smmu_ops.default_domain_ops)
> + return NULL;
> + type = IOMMU_DOMAIN_NESTED;
> + s2 = to_smmu_domain(parent);
> + }
Please can you explain the (use) case where !parent. This creates an
unmanaged S1?

Thanks

Eric
>
> - return __arm_smmu_domain_alloc(type, NULL, master, user_cfg);
> + return __arm_smmu_domain_alloc(type, s2, master, user_cfg);
> }
>
> static struct iommu_ops arm_smmu_ops = {

2023-03-24 16:32:04

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 12/14] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED type of allocations

On Fri, Mar 24, 2023 at 04:44:58PM +0100, Eric Auger wrote:

> Please can you explain the (use) case where !parent. This creates an
> unmanaged S1?

If parent is not specified then userspace can force the IOPTE format
to be S1 or S2 of a normal unmanaged domain.

Not sure there is a usecase, but it seems reasonable to support. It
would be useful if there is further parameterization of the S1 like
limiting the number of address bits or something.

Jason

2023-03-24 17:47:55

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Fri, Mar 24, 2023 at 11:57:09AM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 24, 2023 at 09:02:34AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <[email protected]>
> > > Sent: Wednesday, March 22, 2023 2:42 PM
> > >
> > > On Tue, Mar 21, 2023 at 08:48:31AM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Mar 21, 2023 at 08:34:00AM +0000, Tian, Kevin wrote:
> > > >
> > > > > > > Rephrasing that to put into a design: the IOCTL would pass a
> > > > > > > user pointer to the queue, the size of the queue, then a head
> > > > > > > pointer and a tail pointer? Then the kernel reads out all the
> > > > > > > commands between the head and the tail and handles all those
> > > > > > > invalidation commands only?
> > > > > >
> > > > > > Yes, that is one possible design
> > > > >
> > > > > If we cannot have the short path in the kernel then I'm not sure the
> > > > > value of using native format and queue in the uAPI. Batching can
> > > > > be enabled over any format.
> > > >
> > > > SMMUv3 will have a hardware short path where the HW itself runs the
> > > > VM's command queue and does this logic.
> > > >
> > > > So I like the symmetry of the SW path being close to that.
> > >
> > > A tricky thing here that I just realized:
> > >
> > > With VCMDQ, the guest will have two CMDQs. One is the vSMMU's
> > > CMDQ handling all non-TLBI commands like CMD_CFGI_STE via the
> > > invalidation IOCTL, and the other hardware accelerated VCMDQ
> > > handling all TLBI commands by the HW. In this setup, we will
> > > need a VCMDQ kernel driver to dispatch commands into the two
> > > different queues.
> > >
> >
> > why doesn't hw generate a vm-exit for unsupported CMDs in VCMDQ
> > and then let them emulated by vSMMU? such events should be rare
> > once map/unmap are being conducted...
>
> IIRC vcmdq is defined to only process invalidations, so it would be a
> driver error to send anything else. I think this is what Nicolin
> means. Most likely to use it the VM would have to see the nvidia acpi
> extension and activate vcmdq in the VM.
>
> If you suggest to overlay the main cmdq with the vcmdq and then don't
> tell the guest about it.. Robin suggested something similar.

Yea, I remember that too, from the email that I received from
Robin on Christmas Eve :)

I haven't had a chance to run any experiments with that yet.

> This idea would be a half and half, the HW would run the queue and the
> doorbell and generate error interrupts back to the hypervisor and tell
> it that the queue is paused and ask it to fix the failed entry and
> restart.
>
> I could see this as an interesting solution, but I don't know if this
> HW can support it..

It possibly can, since an unsupported command will trigger an
Illegal Command interrupt, then the IRQ handler could read it
out of the CMDQ. Again, I'd need to run some experiments, once
this SMMU nesting series has settled down to a certain level.

One immediate thing about this solution is that we still need
multi-CMDQ support per SMMU instance, besides multi-SMMU
instance support. This might be implemented as the ECMDQ,
I guess. But I am not sure if there is ECMDQ HW available,
so that we can add its support first, to fit VCMDQ into it.

Overall, interesting topics! I'd like to carry on along the
way of this series, hoping we can figure out something smart
and solid to implement :)

Thanks
Nicolin

2023-03-24 17:47:58

by Nicolin Chen

[permalink] [raw]
Subject: Re: [PATCH v1 11/14] iommu/arm-smmu-v3: Add arm_smmu_domain_alloc_user

Hi Eric,

Thanks for the review.

On Fri, Mar 24, 2023 at 04:28:26PM +0100, Eric Auger wrote:

> > +static struct iommu_domain *
> > +__arm_smmu_domain_alloc(unsigned type,
> > + struct arm_smmu_domain *s2,
> > + struct arm_smmu_master *master,
> > + const struct iommu_hwpt_arm_smmuv3 *user_cfg)
> > +{
> > + struct arm_smmu_domain *smmu_domain;
> > + struct iommu_domain *domain;
> > + int ret = 0;
> > +
> > + if (type == IOMMU_DOMAIN_SVA)
> > + return arm_smmu_sva_domain_alloc();
> > +
> > + if (type != IOMMU_DOMAIN_UNMANAGED &&
> > + type != IOMMU_DOMAIN_DMA &&
> > + type != IOMMU_DOMAIN_DMA_FQ &&
> > + type != IOMMU_DOMAIN_IDENTITY)
> > + return NULL;
> > +
> > + /*
> > + * Allocate the domain and initialise some of its data structures.
> > + * We can't really finalise the domain unless a master is given.
> > + */
> > + smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
> > + if (!smmu_domain)
> > + return NULL;
> > + domain = &smmu_domain->domain;
> > +
> > + domain->type = type;
> > + domain->ops = arm_smmu_ops.default_domain_ops;
> Compared to the original code, that's something new. Please can you
> explain why this is added in this patch?

Yea, I probably should have mentioned in the commit message that
this function is called by IOMMUFD directly via
ops->domain_alloc_user(), without a wrapper, whereas the other
callers all go through the __iommu_domain_alloc() helper in the
iommu core, where the ops pointer gets set up.

So, this is something new, in order to work with IOMMUFD.

Thanks
Nicolin

2023-03-24 17:48:04

by Nicolin Chen

Subject: Re: [PATCH v1 11/14] iommu/arm-smmu-v3: Add arm_smmu_domain_alloc_user

On Fri, Mar 24, 2023 at 04:33:31PM +0100, Eric Auger wrote:

> > @@ -2893,10 +2863,75 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
> > arm_smmu_sva_remove_dev_pasid(domain, dev, pasid);
> > }
> >
> > +static struct iommu_domain *
> > +__arm_smmu_domain_alloc(unsigned type,
> > + struct arm_smmu_domain *s2,
> I think you should rather introduce the s2 param in "iommu/arm-smmu-v3:
> Support IOMMU_DOMAIN_NESTED type of allocations" because it is not used
> at all in this patch, nor really explained in the commit msg

OK. I will move it.

Thanks!
Nicolin

2023-03-24 17:52:03

by Jason Gunthorpe

Subject: Re: [PATCH v1 11/14] iommu/arm-smmu-v3: Add arm_smmu_domain_alloc_user

On Fri, Mar 24, 2023 at 10:40:46AM -0700, Nicolin Chen wrote:
> Hi Eric,
>
> Thanks for the review.
>
> On Fri, Mar 24, 2023 at 04:28:26PM +0100, Eric Auger wrote:
>
> > > +static struct iommu_domain *
> > > +__arm_smmu_domain_alloc(unsigned type,
> > > + struct arm_smmu_domain *s2,
> > > + struct arm_smmu_master *master,
> > > + const struct iommu_hwpt_arm_smmuv3 *user_cfg)
> > > +{
> > > + struct arm_smmu_domain *smmu_domain;
> > > + struct iommu_domain *domain;
> > > + int ret = 0;
> > > +
> > > + if (type == IOMMU_DOMAIN_SVA)
> > > + return arm_smmu_sva_domain_alloc();
> > > +
> > > + if (type != IOMMU_DOMAIN_UNMANAGED &&
> > > + type != IOMMU_DOMAIN_DMA &&
> > > + type != IOMMU_DOMAIN_DMA_FQ &&
> > > + type != IOMMU_DOMAIN_IDENTITY)
> > > + return NULL;
> > > +
> > > + /*
> > > + * Allocate the domain and initialise some of its data structures.
> > > + * We can't really finalise the domain unless a master is given.
> > > + */
> > > + smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
> > > + if (!smmu_domain)
> > > + return NULL;
> > > + domain = &smmu_domain->domain;
> > > +
> > > + domain->type = type;
> > > + domain->ops = arm_smmu_ops.default_domain_ops;
> > Compared to the original code, that's something new. Please can you
> > explain why this is added in this patch?
>
> Yea, I probably should have mentioned in the commit message that
> this function via ops->domain_alloc_user() is called by IOMMUFD
> directly without a wrapper, v.s. the other callers all go with
> the __iommu_domain_alloc() helper in the iommu core where an ops
> pointer gets setup.
>
> So, this is something new, in order to work with IOMMUFD.

But using default_domain_ops is not great, the ops should be set based
on the domain type being created and the various different flavours
should have their own types and ops.

So name the existing ops something logical like 'unmanaged_domain_ops'
and move it out of the inline initializer.

Make another ops for identity like shown here to get the ball rolling:

https://lore.kernel.org/r/[email protected]

There is a whole bunch of tidying here to make things follow the op
per type design.
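
As a rough illustration of the op-per-type idea, here is a userspace
sketch in plain C rather than kernel code. All names in it
(ops_domain, ops_unmanaged, ops_identity, ops_domain_alloc) are
hypothetical stand-ins, not the driver's real symbols: each flavour
gets its own named ops table, and the allocator picks the table from
the requested type instead of always pointing at one default table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types; not the real ones. */
enum ops_domain_type {
	OPS_DOMAIN_IDENTITY,
	OPS_DOMAIN_UNMANAGED,
	OPS_DOMAIN_NESTED,
};

struct ops_domain;

struct ops_domain_ops {
	const char *name;
	/* Returns 0 on success, -1 if this flavour cannot map. */
	int (*map)(struct ops_domain *d, unsigned long iova, unsigned long pa);
};

struct ops_domain {
	enum ops_domain_type type;
	const struct ops_domain_ops *ops;
	unsigned long nr_mapped;
};

static int ops_paging_map(struct ops_domain *d, unsigned long iova,
			  unsigned long pa)
{
	(void)iova;
	(void)pa;
	assert(d != NULL);
	d->nr_mapped++; /* a real driver would update page tables here */
	return 0;
}

static int ops_no_map(struct ops_domain *d, unsigned long iova,
		      unsigned long pa)
{
	(void)d;
	(void)iova;
	(void)pa;
	return -1; /* identity domains have nothing to map */
}

/* One named table per flavour, outside of any inline initializer. */
static const struct ops_domain_ops ops_unmanaged = {
	.name = "unmanaged",
	.map = ops_paging_map,
};

static const struct ops_domain_ops ops_identity = {
	.name = "identity",
	.map = ops_no_map,
};

/* The allocator selects the ops by type and rejects unknown types. */
static struct ops_domain *ops_domain_alloc(enum ops_domain_type type)
{
	const struct ops_domain_ops *ops;
	struct ops_domain *d;

	switch (type) {
	case OPS_DOMAIN_UNMANAGED:
		ops = &ops_unmanaged;
		break;
	case OPS_DOMAIN_IDENTITY:
		ops = &ops_identity;
		break;
	default:
		return NULL; /* no ops table exists for this type yet */
	}

	d = calloc(1, sizeof(*d));
	if (!d)
		return NULL;
	d->type = type;
	d->ops = ops;
	return d;
}
```

With separate tables, a check such as "is this parent a paging
domain?" can compare against the specific table for that flavour
rather than against a catch-all default.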

Jason

2023-03-24 17:52:36

by Nicolin Chen

Subject: Re: [PATCH v1 12/14] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED type of allocations

On Fri, Mar 24, 2023 at 04:44:58PM +0100, Eric Auger wrote:
> > @@ -2923,8 +2949,16 @@ arm_smmu_domain_alloc_user(struct device *dev, struct iommu_domain *parent,
> > const struct iommu_hwpt_arm_smmuv3 *user_cfg = user_data;
> > struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> > unsigned type = IOMMU_DOMAIN_UNMANAGED;
> > + struct arm_smmu_domain *s2 = NULL;
> > +
> > + if (parent) {
> > + if (parent->ops != arm_smmu_ops.default_domain_ops)
> > + return NULL;
> > + type = IOMMU_DOMAIN_NESTED;
> > + s2 = to_smmu_domain(parent);
> > + }
> Please can you explain the (use) case where !parent. This creates an
> unmanaged S1?

It creates an unmanaged type of domain. The decision to mark
it as an unmanaged S1 or an unmanaged S2 domain is made in the
finalise() function, which checks the S2 flag and sets the stage
accordingly.

I think that I could add a few lines of comments inline or at
the top of the function to ease the readability.
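
Roughly, the decision could be modelled as below. This is a
simplified userspace sketch, not the driver code: stage_domain,
stage_finalise() and STAGE_FLAG_S2 are hypothetical names, and the
check that a parent must itself be a stage-2 domain is an assumption
of the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define STAGE_FLAG_S2 (1u << 0) /* hypothetical "allocate as S2" flag */

enum stage_kind { STAGE_S1, STAGE_S2, STAGE_NESTED };

/* Hypothetical, heavily reduced stand-in for the driver's domain. */
struct stage_domain {
	enum stage_kind stage;
	const struct stage_domain *s2_parent;
};

/*
 * Pick the stage for a freshly allocated domain:
 *  - with a parent: a nested (user-managed) domain on top of that S2;
 *  - without a parent: an unmanaged S2 if the flag is set, else S1.
 * Returns 0 on success, -1 if the parent is not a stage-2 domain.
 */
static int stage_finalise(struct stage_domain *d,
			  const struct stage_domain *parent,
			  unsigned int flags)
{
	assert(d != NULL);

	if (parent) {
		if (parent->stage != STAGE_S2)
			return -1;
		d->stage = STAGE_NESTED;
		d->s2_parent = parent;
		return 0;
	}

	d->s2_parent = NULL;
	d->stage = (flags & STAGE_FLAG_S2) ? STAGE_S2 : STAGE_S1;
	return 0;
}
```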

Thanks
Nicolin

2023-03-24 17:54:03

by Jason Gunthorpe

Subject: Re: [PATCH v1 12/14] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED type of allocations

On Fri, Mar 24, 2023 at 10:50:34AM -0700, Nicolin Chen wrote:
> On Fri, Mar 24, 2023 at 04:44:58PM +0100, Eric Auger wrote:
> > > @@ -2923,8 +2949,16 @@ arm_smmu_domain_alloc_user(struct device *dev, struct iommu_domain *parent,
> > > const struct iommu_hwpt_arm_smmuv3 *user_cfg = user_data;
> > > struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> > > unsigned type = IOMMU_DOMAIN_UNMANAGED;
> > > + struct arm_smmu_domain *s2 = NULL;
> > > +
> > > + if (parent) {
> > > + if (parent->ops != arm_smmu_ops.default_domain_ops)
> > > + return NULL;
> > > + type = IOMMU_DOMAIN_NESTED;
> > > + s2 = to_smmu_domain(parent);
> > > + }
> > Please can you explain the (use) case where !parent. This creates an
> > unmanaged S1?
>
> It creates an unmanaged type of a domain. The decision to mark
> it as an unmanaged S1 or an unmanaged S2 domain, is done in the
> finalise() function that it checks the S2 flag and set a stage
> accordingly.

This also needs to be fixed up, the alloc_user should not return
incompletely initialized domains.

Jason

2023-03-24 17:59:43

by Nicolin Chen

Subject: Re: [PATCH v1 12/14] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED type of allocations

On Fri, Mar 24, 2023 at 02:51:45PM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 24, 2023 at 10:50:34AM -0700, Nicolin Chen wrote:
> > On Fri, Mar 24, 2023 at 04:44:58PM +0100, Eric Auger wrote:
> > > > @@ -2923,8 +2949,16 @@ arm_smmu_domain_alloc_user(struct device *dev, struct iommu_domain *parent,
> > > > const struct iommu_hwpt_arm_smmuv3 *user_cfg = user_data;
> > > > struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> > > > unsigned type = IOMMU_DOMAIN_UNMANAGED;
> > > > + struct arm_smmu_domain *s2 = NULL;
> > > > +
> > > > + if (parent) {
> > > > + if (parent->ops != arm_smmu_ops.default_domain_ops)
> > > > + return NULL;
> > > > + type = IOMMU_DOMAIN_NESTED;
> > > > + s2 = to_smmu_domain(parent);
> > > > + }
> > > Please can you explain the (use) case where !parent. This creates an
> > > unmanaged S1?
> >
> > It creates an unmanaged type of a domain. The decision to mark
> > it as an unmanaged S1 or an unmanaged S2 domain, is done in the
> > finalise() function that it checks the S2 flag and set a stage
> > accordingly.
>
> This also needs to be fixed up, the alloc_user should not return
> incompletely initialized domains.

finalise() is called at the end of __arm_smmu_domain_alloc(),
so alloc_user, which passes a dev pointer, actually completes
the initialization.

Thanks
Nicolin

2023-03-24 18:02:33

by Nicolin Chen

Subject: Re: [PATCH v1 11/14] iommu/arm-smmu-v3: Add arm_smmu_domain_alloc_user

On Fri, Mar 24, 2023 at 02:50:42PM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 24, 2023 at 10:40:46AM -0700, Nicolin Chen wrote:
> > Hi Eric,
> >
> > Thanks for the review.
> >
> > On Fri, Mar 24, 2023 at 04:28:26PM +0100, Eric Auger wrote:
> >
> > > > +static struct iommu_domain *
> > > > +__arm_smmu_domain_alloc(unsigned type,
> > > > + struct arm_smmu_domain *s2,
> > > > + struct arm_smmu_master *master,
> > > > + const struct iommu_hwpt_arm_smmuv3 *user_cfg)
> > > > +{
> > > > + struct arm_smmu_domain *smmu_domain;
> > > > + struct iommu_domain *domain;
> > > > + int ret = 0;
> > > > +
> > > > + if (type == IOMMU_DOMAIN_SVA)
> > > > + return arm_smmu_sva_domain_alloc();
> > > > +
> > > > + if (type != IOMMU_DOMAIN_UNMANAGED &&
> > > > + type != IOMMU_DOMAIN_DMA &&
> > > > + type != IOMMU_DOMAIN_DMA_FQ &&
> > > > + type != IOMMU_DOMAIN_IDENTITY)
> > > > + return NULL;
> > > > +
> > > > + /*
> > > > + * Allocate the domain and initialise some of its data structures.
> > > > + * We can't really finalise the domain unless a master is given.
> > > > + */
> > > > + smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
> > > > + if (!smmu_domain)
> > > > + return NULL;
> > > > + domain = &smmu_domain->domain;
> > > > +
> > > > + domain->type = type;
> > > > + domain->ops = arm_smmu_ops.default_domain_ops;
> > > Compared to the original code, that's something new. Please can you
> > > explain why this is added in this patch?
> >
> > Yea, I probably should have mentioned in the commit message that
> > this function via ops->domain_alloc_user() is called by IOMMUFD
> > directly without a wrapper, v.s. the other callers all go with
> > the __iommu_domain_alloc() helper in the iommu core where an ops
> > pointer gets setup.
> >
> > So, this is something new, in order to work with IOMMUFD.
>
> But using default_domain_ops is not great, the ops should be set based
> on the domain type being created and the various different flavours
> should have their own types and ops.
>
> So name the existing ops something logical like 'unmanaged_domain_ops'
> and move it out of the inline initializer.
>
> Make another ops for identity like shown here to get the ball rolling:
>
> https://lore.kernel.org/r/[email protected]
>
> There is a whole bunch of tidying here to make things follow the op
> per type design.

Thanks for the suggestion. Will add a patch doing that in v2.

Nicolin

2023-03-28 02:51:56

by Tian, Kevin

Subject: RE: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, March 24, 2023 10:45 PM
>
> > But still the main open question for the in-kernel short-path is what
> > the framework would be to move part of vIOMMU emulation into the kernel.
> > If this can be done cleanly then it's better than vhost-iommu, which lags
> > behind significantly with regard to advanced features. But if it cannot
> > be done cleanly, leaving each vendor to move random emulation logic
> > into the kernel, then vhost-iommu sounds more friendly to the kernel,
> > though lots of work remains to fill the feature gap.
>
> I assume there are reasonable ways to hook the kernel to kvm, vhost
> does it. I've never looked at it. At worst we need to factor some of
> the vhost code into some library to allow it.
>
> We want a kernel thread to wakeup on a doorbell ring basically.
>

kvm supports ioeventfd for the doorbell purpose.

Aside from that, I'm not sure which part of vhost can be generalized
for use by other vIOMMUs. It's an in-memory ring structure plus a
doorbell, so it's easy to fit in the kernel.

But emulated vIOMMUs typically use an MMIO-based ring structure,
which requires that 1) kvm provides a synchronous ioeventfd for
MMIO-based head/tail emulation; 2) the userspace vIOMMU shares its
virtual register page with the kernel, which can then update the
virtual tail/head registers w/o exiting to userspace; 3) the kernel
thread can selectively exit to userspace for cmds which it cannot
directly handle.

Those require a new framework to establish.

2023-03-28 03:10:37

by Tian, Kevin

Subject: RE: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

> From: Nicolin Chen <[email protected]>
> Sent: Saturday, March 25, 2023 1:35 AM
>
> On Fri, Mar 24, 2023 at 11:57:09AM -0300, Jason Gunthorpe wrote:
> >
> > If you suggest to overlay the main cmdq with the vcmdq and then don't
> > tell the guest about it.. Robin suggested something similar.

yes, that's my point.

>
> Yea, I remember that too, from the email that I received from
> Robin on Christmas Eve :)
>
> Yet, I haven't got a chance to run some experiment with that.
>
> > This idea would be a half and half, the HW would run the queue and the
> > doorbell and generate error interrupts back to the hypervisor and tell
> > it that the queue is paused and ask it to fix the failed entry and
> > restart.
> >
> > I could see this as an interesting solution, but I don't know if this
> > HW can support it..
>
> It possibly can, since an unsupported command will trigger an
> Illegal Command interrupt, then the IRQ handler could read it
> out of the CMDQ. Again, I'd need to run some experiment, once
> this SMMU nesting series is settled down to certain level.
>

Also, you want to ensure that the error is of a recoverable type, so
that once sw fixes it the hw can continue to behave correctly.

2023-03-28 12:32:02

by Jason Gunthorpe

Subject: Re: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

On Tue, Mar 28, 2023 at 02:48:31AM +0000, Tian, Kevin wrote:

> But emulated vIOMMUs are typically MMIO-based ring structure
> which requires 1) kvm provides a synchronous ioeventfd for MMIO
> based head/tail emulation; 2) userspace vIOMMU shares its virtual
> register page with the kernel which can then update virtual tail/head
> registers w/o exiting to the userspace; 3) the kernel thread can
> selectively exit to userspace for cmds which it cannot directly handle.

What is needed is for the kvm side to capture the store, execute it to
some backing memory, and also trigger the eventfd.

It shouldn't need to be synchronous.

For SMMU the interface is laid out with unique 4k pages per-CMDQ that
contain the 3 relevant 8 byte values.

So we could mmap a page from the kernel that has the 3 values. qemu
would install the page in the kvm memory map and it would
arrange things so that stores reach the 8 bytes and trigger an
eventfd.

Kernel simply reads the cons index after the eventfd, looks in the
IOAS to get the queue memory and does the operation async.

It is not especially conceptually difficult..
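
A rough userspace model of that flow, as a sketch only: the page
layout, the names (vq_page, vq_queue, vq_drain) and the free-running
indices are assumptions for illustration, not the real register
format (the real SMMU PROD/CONS registers carry a wrap bit instead).
The guest store is assumed to land on a producer index, and the
function below stands for what the kernel does after the eventfd
fires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VQ_ENTRIES 8u /* hypothetical queue depth */

/* Hypothetical layout of the three 8-byte values in the per-CMDQ page. */
struct vq_page {
	uint64_t base; /* guest address of the queue memory (unused here) */
	uint64_t prod; /* guest-written; the store triggers the eventfd */
	uint64_t cons; /* kernel-written once entries are consumed */
};

/* Stand-in for "look in the IOAS to get the queue memory". */
struct vq_queue {
	struct vq_page *page;
	uint64_t entries[VQ_ENTRIES];
};

typedef void (*vq_cmd_fn)(uint64_t cmd, void *ctx);

/* Example handler used for demonstration: sums the command words. */
static void vq_sum_cmd(uint64_t cmd, void *ctx)
{
	*(uint64_t *)ctx += cmd;
}

/*
 * Runs asynchronously after the eventfd: handle every entry between
 * cons and prod, then publish the new cons. Returns the number of
 * entries handled, or -1 if the guest-provided prod is out of range.
 */
static int vq_drain(struct vq_queue *q, vq_cmd_fn fn, void *ctx)
{
	uint64_t prod, cons;
	int handled = 0;

	assert(q != NULL && q->page != NULL && fn != NULL);
	prod = q->page->prod; /* snapshot: the guest may keep writing */
	cons = q->page->cons;

	if (prod - cons > VQ_ENTRIES)
		return -1; /* never trust guest-provided indices */

	while (cons != prod) {
		fn(q->entries[cons % VQ_ENTRIES], ctx);
		cons++;
		handled++;
	}
	q->page->cons = cons;
	return handled;
}
```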

Jason

2023-03-31 08:13:17

by Tian, Kevin

Subject: RE: [PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, March 28, 2023 8:27 PM
>
> On Tue, Mar 28, 2023 at 02:48:31AM +0000, Tian, Kevin wrote:
>
> > But emulated vIOMMUs are typically MMIO-based ring structure
> > which requires 1) kvm provides a synchronous ioeventfd for MMIO
> > based head/tail emulation; 2) userspace vIOMMU shares its virtual
> > register page with the kernel which can then update virtual tail/head
> > registers w/o exiting to the userspace; 3) the kernel thread can
> > selectively exit to userspace for cmds which it cannot directly handle.
>
> What is needed is for the kvm side to capture the store execute it to
> some backing memory, and also trigger the eventfd.
>
> It shouldn't need to be synchronous.

Correct

>
> For SMMU the interface is laid out with unique 4k pages per-CMDQ that
> contain the 3 relevant 8 byte values.

VT-d has only one invalidation queue, with the relevant registers mixed
with other VT-d registers in a 4k page. But this should be fine as long
as the new mechanism allows specifying which offsets in the mapped
page fall into the fast path.

>
> So we could mmap a page from the kernel that has the 3 values. qemu
> would install the page in the kvm memory map and it would
> arrange things so that stores reach the 8 bytes and trigger an
> eventfd.
>
> Kernel simply reads the cons index after the eventfd, looks in the
> IOAS to get the queue memory and does the operation async.
>
> It is not especially conceptually difficult..
>

Looks so, at least in concept.

Btw, regarding the initial nesting support on SMMU, do you want
to follow this unique 4k layout plus the native cmdq format, or just
the latter (i.e. the cmd format is native but head/tail/start is
defined in a sw-customized way)?

If the latter I wonder whether it's necessary to generalize it so
the batching format is vendor-agnostic while the specific cmd/
descriptor format is vendor specific.
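
To illustrate the split I have in mind, here is a rough sketch in
userspace C. Everything in it is hypothetical (inv_batch, inv_walk,
inv_vendor_handle are not an existing uAPI): the generic layer only
knows the entry size and count, and hands each opaque entry to a
vendor callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vendor-agnostic batch header; not an existing uAPI. */
struct inv_batch {
	uint32_t vendor_type;   /* recorded for dispatch; this sketch takes
				 * the handler as a parameter instead */
	uint32_t entry_len;     /* size of one vendor entry, in bytes */
	uint32_t entry_num;     /* number of entries in the buffer */
	const uint8_t *entries; /* opaque to the generic layer */
};

/* Vendor callback: returns 0 if the entry was handled. */
typedef int (*inv_entry_fn)(const uint8_t *entry, uint32_t len);

/*
 * Generic walk over the batch. Returns how many entries were handled
 * before the first failure, so the caller can report the failing
 * index back to userspace.
 */
static uint32_t inv_walk(const struct inv_batch *b, inv_entry_fn fn)
{
	uint32_t i;

	assert(b != NULL && fn != NULL);
	if (b->entry_len == 0 || b->entries == NULL)
		return 0;

	for (i = 0; i < b->entry_num; i++) {
		if (fn(b->entries + (size_t)i * b->entry_len, b->entry_len))
			break;
	}
	return i;
}

/*
 * Example vendor handler, made up for the demo: it accepts 16-byte
 * entries whose first byte (a pretend opcode) is non-zero.
 */
static int inv_vendor_handle(const uint8_t *entry, uint32_t len)
{
	return (len == 16 && entry[0] != 0) ? 0 : -1;
}
```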

Thanks
Kevin

2023-04-12 07:48:46

by Nicolin Chen

Subject: Re: [PATCH v1 04/14] iommu/arm-smmu-v3: Add arm_smmu_hw_info

Hi Robin,

On Thu, Mar 16, 2023 at 01:06:17PM -0700, Nicolin Chen wrote:
> On Thu, Mar 16, 2023 at 03:19:27PM +0000, Robin Murphy wrote:
>
> > > > Note that until now it has been extremely fortunate that in pretty much
> > > > every case Linux either hasn't supported the affected feature at all, or
> > > > has happened to avoid meeting the conditions. Once we do introduce
> > > > nesting support that all goes out the window (and I'll have to think
> > > > more when reviewing new errata in future...)
> > > >
> > > > I've been putting off revisiting all the existing errata to figure out
> > > > what we'd need to do until new nesting patches appeared, so I'll try to
> > > > get to that soon now. I think in many cases it's likely to be best to
> > > > just disallow nesting entirely on affected implementations.
> > >
> > > Do we already have a list of "affected implementations"? Or
> > > would we need to make such a list now? In the latter case, can
> > > these affected implementations be detected from their IDR0-5
> > > registers, so that we can simply do something in hw_info()?
> >
> > Somewhere I have a patch that adds all the IIDR stuff needed for this,
> > but I never sent it upstream since the erratum itself was an early
> > MMU-600 one which in practice doesn't matter. I'll dig that out and
> > update it with what I have in mind.
>
> Nice!
>
> Perhaps we should merge that first, or include in this series
> if you don't mind, so that we would be less worried about any
> affected platform when releasing the new Linux version having
> this nesting feature.

I just want to see if there's a possibility of adding the
patch that you mentioned above in the near term?

I'd like to send a v2 of this series for another round of
review before the next -rc1, so it'd be nicer to include
that.

Thanks
Nic