2023-12-12 16:02:16

by Suravee Suthikulpanit

Subject: [RFC PATCH 0/6] iommu/amd: Introduce hardware info reporting and nested translation support

OVERVIEW
--------
This is the first part of a multi-part series introducing AMD IOMMU
nested translation support with the Hardware-virtualized IOMMU (HW-vIOMMU),
which allows user-space to take advantage of the hardware-accelerated
AMD IOMMU in the guest.

Part 1 introduces AMD IOMMU hardware info reporting support via
the ioctl IOMMU_GET_HW_INFO with enum IOMMU_HW_INFO_TYPE_AMD. This allows
the hypervisor to communicate the supported capabilities to user-space.
For example, QEMU can use this information to populate the EFR and EFR2
fields in the guest IVRS table when building the guest ACPI tables.
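
For illustration, a user-space query of this information could look roughly
like the sketch below (not part of this series; the generic IOMMU_GET_HW_INFO
ioctl and struct iommu_hw_info already exist in iommufd, and only
struct iommu_hw_info_amd is added here):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/iommufd.h>

  /* Sketch: query EFR/EFR2 for a device already bound to iommufd as dev_id. */
  static int query_amd_iommu_info(int iommufd, __u32 dev_id,
                                  struct iommu_hw_info_amd *out)
  {
          struct iommu_hw_info cmd = {
                  .size = sizeof(cmd),
                  .dev_id = dev_id,
                  .data_len = sizeof(*out),
                  .data_uptr = (uintptr_t)out,
          };

          if (ioctl(iommufd, IOMMU_GET_HW_INFO, &cmd))
                  return -1;
          if (cmd.out_data_type != IOMMU_HW_INFO_TYPE_AMD)
                  return -1;      /* not an AMD IOMMU */

          /* out->efr / out->efr2 can then seed the guest IVRS EFR/EFR2 fields */
          return 0;
  }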

In addition, this series introduces nested translation support for the
AMD IOMMU v2 page table (stage-1) via the IOMMUFD nesting infrastructure.
This allows user-space to set up the v2 page table and communicate the
required information via struct iommu_hwpt_amd_v2 with enum
IOMMU_HWPT_DATA_AMD_V2.
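
As a rough sketch of the user-space side (illustrative only; it assumes the
device is bound to iommufd as dev_id and a nesting-parent HWPT was already
allocated as parent_hwpt_id with IOMMU_HWPT_ALLOC_NEST_PARENT):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/iommufd.h>

  /* Sketch: allocate a nested stage-1 HWPT on top of a nesting parent. */
  static int alloc_amd_nested_hwpt(int iommufd, __u32 dev_id, __u32 parent_hwpt_id,
                                   __u64 gcr3_gpa, __u16 gdom_id, __u32 *out_hwpt_id)
  {
          struct iommu_hwpt_amd_v2 data = {
                  .gcr3 = gcr3_gpa,        /* GPA of the guest GCR3 table */
                  .gdom_id = gdom_id,      /* guest domain ID */
                  .glx = 0,                /* GCR3 table levels (encoding per this series) */
                  .guest_paging_mode = 0,  /* guest v2 paging mode (encoding per this series) */
          };
          struct iommu_hwpt_alloc cmd = {
                  .size = sizeof(cmd),
                  .dev_id = dev_id,
                  .pt_id = parent_hwpt_id, /* allocated with IOMMU_HWPT_ALLOC_NEST_PARENT */
                  .data_type = IOMMU_HWPT_DATA_AMD_V2,
                  .data_len = sizeof(data),
                  .data_uptr = (uintptr_t)&data,
          };

          if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &cmd))
                  return -1;
          *out_hwpt_id = cmd.out_hwpt_id;
          return 0;
  }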

This series is based on top of:
* Linux v6.7-rc5
* Series [PATCH v2 0/9] Improve TLB invalidation logic
(Currently queued in https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git/log/?h=next)
* Series [PATCH v4 00/16] iommu/amd: SVA Support (part 3) - refactor support for GCR3 table
(https://lore.kernel.org/all/[email protected]/)

GITHUB
------
[1] https://github.com/AMDESE/linux/tree/iommu_sva_part3_v4_v6.7_rc5-amd_nesting_rfc_v1
(Complete code for the part 1 of this series)
[2] https://github.com/AMDESE/linux/tree/wip/iommu_sva_part3_v4_v6.7_rc5-amd_viommu_20231212
(WIP for Linux AMD Hw-vIOMMU support)

HISTORY
-------
* [RFC PATCH 00/21] iommu/amd: Introduce support for HW accelerated vIOMMU w/ nested page table
(https://lore.kernel.org/lkml/[email protected]/)

Thank you,
Suravee

Suravee Suthikulpanit (6):
iommu/amd: Update PASID, GATS, and GLX feature related macros
iommu/amd: Add support for hw_info for iommu capability query
iommu/amd: Introduce Guest-ID struct amd_iommu_vminfo
iommufd: Introduce data struct for AMD nested domain allocation
iommu/amd: Introduce helper functions to setup GCR3TRPMode
iommu/amd: Introduce nested translation support

drivers/iommu/amd/Makefile | 2 +-
drivers/iommu/amd/amd_iommu.h | 20 +-
drivers/iommu/amd/amd_iommu_types.h | 26 ++-
drivers/iommu/amd/init.c | 8 +-
drivers/iommu/amd/iommu.c | 299 +++++++++++++++++++++++++++-
drivers/iommu/amd/nested.c | 107 ++++++++++
include/uapi/linux/iommufd.h | 36 ++++
7 files changed, 475 insertions(+), 23 deletions(-)
create mode 100644 drivers/iommu/amd/nested.c

--
2.34.1


2023-12-12 16:02:23

by Suravee Suthikulpanit

Subject: [RFC PATCH 1/6] iommu/amd: Update PASID, GATS, and GLX feature related macros

Clean up and reorder the PASID, GATS, and GLX feature-related macros
according to their bit index. There is no functional change.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
drivers/iommu/amd/amd_iommu.h | 2 +-
drivers/iommu/amd/amd_iommu_types.h | 13 +++++++------
drivers/iommu/amd/init.c | 8 ++++----
3 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index bbed268e8abc..108253edbeb0 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -106,7 +106,7 @@ static inline bool check_feature2(u64 mask)

static inline int check_feature_gpt_level(void)
{
- return ((amd_iommu_efr >> FEATURE_GATS_SHIFT) & FEATURE_GATS_MASK);
+ return ((amd_iommu_efr & FEATURE_GATS_MASK) >> FEATURE_GATS_SHIFT);
}

static inline bool amd_iommu_gt_ppr_supported(void)
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 3dc39bbc05fc..14f67a8cf755 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -93,8 +93,6 @@
#define FEATURE_GA BIT_ULL(7)
#define FEATURE_HE BIT_ULL(8)
#define FEATURE_PC BIT_ULL(9)
-#define FEATURE_GATS_SHIFT (12)
-#define FEATURE_GATS_MASK (3ULL)
#define FEATURE_GAM_VAPIC BIT_ULL(21)
#define FEATURE_GIOSUP BIT_ULL(48)
#define FEATURE_HASUP BIT_ULL(49)
@@ -102,11 +100,14 @@
#define FEATURE_HDSUP BIT_ULL(52)
#define FEATURE_SNP BIT_ULL(63)

-#define FEATURE_PASID_SHIFT 32
-#define FEATURE_PASID_MASK (0x1fULL << FEATURE_PASID_SHIFT)
+#define FEATURE_GATS_SHIFT 12
+#define FEATURE_GATS_MASK (0x03ULL << FEATURE_GATS_SHIFT)

-#define FEATURE_GLXVAL_SHIFT 14
-#define FEATURE_GLXVAL_MASK (0x03ULL << FEATURE_GLXVAL_SHIFT)
+#define FEATURE_GLX_SHIFT 14
+#define FEATURE_GLX_MASK (0x03ULL << FEATURE_GLX_SHIFT)
+
+#define FEATURE_PASMAX_SHIFT 32
+#define FEATURE_PASMAX_MASK (0x1FULL << FEATURE_PASMAX_SHIFT)

/* Extended Feature 2 Bits */
#define FEATURE_SNPAVICSUP_SHIFT 5
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 959820ccfbcc..e84c69fe13d4 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -2080,14 +2080,14 @@ static int __init iommu_init_pci(struct amd_iommu *iommu)
int glxval;
u64 pasmax;

- pasmax = amd_iommu_efr & FEATURE_PASID_MASK;
- pasmax >>= FEATURE_PASID_SHIFT;
+ pasmax = amd_iommu_efr & FEATURE_PASMAX_MASK;
+ pasmax >>= FEATURE_PASMAX_SHIFT;
iommu->iommu.max_pasids = (1 << (pasmax + 1)) - 1;

BUG_ON(iommu->iommu.max_pasids & ~PASID_MASK);

- glxval = amd_iommu_efr & FEATURE_GLXVAL_MASK;
- glxval >>= FEATURE_GLXVAL_SHIFT;
+ glxval = amd_iommu_efr & FEATURE_GLX_MASK;
+ glxval >>= FEATURE_GLX_SHIFT;

if (amd_iommu_max_glx_val == -1)
amd_iommu_max_glx_val = glxval;
--
2.34.1

2023-12-12 16:02:27

by Suravee Suthikulpanit

Subject: [RFC PATCH 4/6] iommufd: Introduce data struct for AMD nested domain allocation

Introduce the IOMMU_HWPT_DATA_AMD_V2 data type for the AMD IOMMU v2 page
table, which is used as stage-1 in nested translation. The data structure
contains the information necessary for setting up AMD HW-vIOMMU support.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
include/uapi/linux/iommufd.h | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index bf4a1f8ab748..e2240d430dd1 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -389,14 +389,37 @@ struct iommu_hwpt_vtd_s1 {
__u32 __reserved;
};

+/**
+ * struct iommu_hwpt_amd_v2 - AMD IOMMU specific user-managed
+ * v2 I/O page table data
+ * @gcr3: GCR3 guest physical address
+ * @gid: Guest ID
+ * @iommu_id: IOMMU host device ID
+ * @gdev_id: Guest device ID
+ * @gdom_id: Guest domain ID
+ * @glx: GCR3 table levels
+ * @guest_paging_mode: Guest v2 page table paging mode
+ */
+struct iommu_hwpt_amd_v2 {
+ __aligned_u64 gcr3;
+ __u32 gid;
+ __u32 iommu_id;
+ __u16 gdev_id;
+ __u16 gdom_id;
+ __u16 glx;
+ __u16 guest_paging_mode;
+};
+
/**
* enum iommu_hwpt_data_type - IOMMU HWPT Data Type
* @IOMMU_HWPT_DATA_NONE: no data
* @IOMMU_HWPT_DATA_VTD_S1: Intel VT-d stage-1 page table
+ * @IOMMU_HWPT_DATA_AMD_V2: AMD IOMMUv2 page table
*/
enum iommu_hwpt_data_type {
IOMMU_HWPT_DATA_NONE,
IOMMU_HWPT_DATA_VTD_S1,
+ IOMMU_HWPT_DATA_AMD_V2,
};

/**
--
2.34.1

2023-12-12 16:02:34

by Suravee Suthikulpanit

Subject: [RFC PATCH 5/6] iommu/amd: Introduce helper functions to setup GCR3TRPMode

GCR3TRPMode allows the IOMMU hardware to use a GPA when programming the
GCR3 table root pointer (GCR3TRP) in the DTE. The GPA is translated
by the IOMMU using the v1 page table referenced by
DTE[Host Page Table Root Pointer].

Please see the AMD IOMMU Specification for more detail.
(https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf)

Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
drivers/iommu/amd/amd_iommu.h | 2 +
drivers/iommu/amd/amd_iommu_types.h | 1 +
drivers/iommu/amd/iommu.c | 132 +++++++++++++++++++++++++++-
3 files changed, 131 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 7783a933ad14..55479a6efaae 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -56,6 +56,8 @@ void amd_iommu_pdev_disable_cap_pri(struct pci_dev *pdev);
int amd_iommu_set_gcr3(struct iommu_dev_data *dev_data,
ioasid_t pasid, unsigned long gcr3);
int amd_iommu_clear_gcr3(struct iommu_dev_data *dev_data, ioasid_t pasid);
+int amd_iommu_set_gcr3tbl_trp(struct amd_iommu *iommu, struct pci_dev *pdev,
+ u64 gcr3_tbl, u16 glx, u16 guest_paging_mode);

/*
* This function flushes all internal caches of
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index a00731673c50..1b150e0cb689 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -541,6 +541,7 @@ struct gcr3_tbl_info {
u64 *gcr3_tbl; /* Guest CR3 table */
int glx; /* Number of levels for GCR3 table */
u32 pasid_cnt; /* Track attached PASIDs */
+ bool trp; /* TRP support */
};

struct amd_io_pgtable {
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index d18b23ac6357..8bf12674dc84 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -93,6 +93,9 @@ static void detach_device(struct device *dev);
static void set_dte_entry(struct amd_iommu *iommu,
struct iommu_dev_data *dev_data);

+static void amd_iommu_clear_gcr3tbl_trp(struct amd_iommu *iommu,
+ struct iommu_dev_data *dev_data);
+
/****************************************************************************
*
* Helper functions
@@ -2146,15 +2149,25 @@ static int do_attach(struct iommu_dev_data *dev_data,

static void do_detach(struct iommu_dev_data *dev_data)
{
+ struct gcr3_tbl_info *gcr3_info = &dev_data->gcr3_info;
struct protection_domain *domain = dev_data->domain;
struct amd_iommu *iommu;

iommu = get_amd_iommu_from_dev(dev_data->dev);

- /* Clear GCR3 table */
- if (domain->pd_mode == PD_MODE_V2) {
- __clear_gcr3(dev_data, 0);
- free_gcr3_table(dev_data);
+ if (gcr3_info->gcr3_tbl) {
+ if (gcr3_info->trp) {
+ /*
+ * In GCR3TRPMode, the GCR3 table contains GPA,
+ * which is set up by the guest kernel. Therefore, we just
+ * need to clean up the DTE settings for guest translation.
+ */
+ amd_iommu_clear_gcr3tbl_trp(iommu, dev_data);
+ } else {
+ /* Clear GCR3 table */
+ __clear_gcr3(dev_data, 0);
+ free_gcr3_table(dev_data);
+ }
}

/* Update data structures */
@@ -2951,6 +2964,117 @@ const struct iommu_ops amd_iommu_ops = {
}
};

+/*
+ * For GCR3TRPMode, user-space provides GPA for the GCR3 Root Pointer Table.
+ */
+int amd_iommu_set_gcr3tbl_trp(struct amd_iommu *iommu, struct pci_dev *pdev,
+ u64 gcr3_tbl, u16 glx, u16 guest_paging_mode)
+{
+ struct iommu_dev_data *dev_data = dev_iommu_priv_get(&pdev->dev);
+ struct dev_table_entry *dev_table = get_dev_table(iommu);
+ struct gcr3_tbl_info *gcr3_info = &dev_data->gcr3_info;
+ int devid = pci_dev_id(pdev);
+ u64 data0 = dev_table[devid].data[0];
+ u64 data1 = dev_table[devid].data[1];
+ u64 data2 = dev_table[devid].data[2];
+ u64 tmp;
+
+ pr_debug("%s: devid=%d, glx=%#x, gcr3_tbl=%#llx\n",
+ __func__, devid, glx, gcr3_tbl);
+
+ WARN_ON(gcr3_info->trp);
+
+ gcr3_info->trp = true;
+ gcr3_info->gcr3_tbl = (u64 *)gcr3_tbl;
+
+ data0 |= DTE_FLAG_GV | DTE_FLAG_GIOV;
+ tmp = glx;
+ data0 |= (tmp & DTE_GLX_MASK) << DTE_GLX_SHIFT;
+
+ /* First mask out possible old values for GCR3 table */
+ tmp = DTE_GCR3_VAL_A(~0ULL) << DTE_GCR3_SHIFT_A;
+ data0 &= ~tmp;
+
+ tmp = DTE_GCR3_VAL_B(~0ULL) << DTE_GCR3_SHIFT_B;
+ data1 &= ~tmp;
+
+ tmp = DTE_GCR3_VAL_C(~0ULL) << DTE_GCR3_SHIFT_C;
+ data1 &= ~tmp;
+
+ /* Encode GCR3 table into DTE */
+ tmp = DTE_GCR3_VAL_A(gcr3_tbl) << DTE_GCR3_SHIFT_A;
+ data0 |= tmp;
+
+ tmp = DTE_GCR3_VAL_B(gcr3_tbl) << DTE_GCR3_SHIFT_B;
+ data1 |= tmp;
+
+ tmp = DTE_GCR3_VAL_C(gcr3_tbl) << DTE_GCR3_SHIFT_C;
+ data1 |= tmp;
+
+ /* Mask out old values for GuestPagingMode */
+ data2 &= ~(0x3ULL << DTE_GPT_LEVEL_SHIFT);
+
+ /* Check 5-level support for the host before enabling on behalf of the guest */
+ tmp = (u64)guest_paging_mode;
+ if ((tmp == GUEST_PGTABLE_5_LEVEL) &&
+ (check_feature_gpt_level() < GUEST_PGTABLE_5_LEVEL)) {
+ pr_err("Cannot support 5-level v2 page table.\n");
+ return -EINVAL;
+ }
+ data2 |= (tmp << DTE_GPT_LEVEL_SHIFT);
+
+ dev_table[devid].data[2] = data2;
+ dev_table[devid].data[1] = data1;
+ dev_table[devid].data[0] = data0;
+
+ device_flush_dte(dev_data);
+ iommu_completion_wait(iommu);
+
+ return 0;
+}
+
+void amd_iommu_clear_gcr3tbl_trp(struct amd_iommu *iommu,
+ struct iommu_dev_data *dev_data)
+{
+ int devid = dev_data->devid;
+ struct gcr3_tbl_info *gcr3_info = &dev_data->gcr3_info;
+ struct dev_table_entry *dev_table = get_dev_table(iommu);
+ u64 data0 = dev_table[devid].data[0];
+ u64 data1 = dev_table[devid].data[1];
+ u64 data2 = dev_table[devid].data[2];
+ u64 tmp;
+
+ if (!gcr3_info->trp)
+ return;
+
+ pr_debug("%s: devid=%#x, gcr3_tbl=%#llx\n", __func__, devid,
+ (unsigned long long)gcr3_info->gcr3_tbl);
+
+ tmp = DTE_GLX_MASK;
+ data0 &= ~(tmp << DTE_GLX_SHIFT);
+ data0 &= ~(DTE_FLAG_GV | DTE_FLAG_GIOV);
+
+ /* Mask out possible old values for GCR3 table */
+ tmp = DTE_GCR3_VAL_A(~0ULL) << DTE_GCR3_SHIFT_A;
+ data0 &= ~tmp;
+
+ tmp = DTE_GCR3_VAL_B(~0ULL) << DTE_GCR3_SHIFT_B;
+ data1 &= ~tmp;
+
+ tmp = DTE_GCR3_VAL_C(~0ULL) << DTE_GCR3_SHIFT_C;
+ data1 &= ~tmp;
+
+ /* Mask out old values for GuestPagingMode */
+ data2 &= ~(0x3ULL << DTE_GPT_LEVEL_SHIFT);
+
+ dev_table[devid].data[2] = data2;
+ dev_table[devid].data[1] = data1;
+ dev_table[devid].data[0] = data0;
+
+ gcr3_info->trp = false;
+ gcr3_info->gcr3_tbl = NULL;
+}
+
#ifdef CONFIG_IRQ_REMAP

/*****************************************************************************
--
2.34.1

2023-12-12 16:02:37

by Suravee Suthikulpanit

Subject: [RFC PATCH 3/6] iommu/amd: Introduce Guest-ID struct amd_iommu_vminfo

The AMD HW-vIOMMU feature requires the IOMMU driver to specify a unique
16-bit Guest ID (GID) for each VM. This ID is used to index into various
data structures when configuring the hardware.

Introduce the amd_iommu_vminfo_hash hashtable, which uses the 16-bit GID
as a hash key to store per-VM configuration, along with helper functions.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
drivers/iommu/amd/amd_iommu.h | 6 +++
drivers/iommu/amd/amd_iommu_types.h | 6 +++
drivers/iommu/amd/iommu.c | 66 +++++++++++++++++++++++++++++
3 files changed, 78 insertions(+)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 4118129f4a24..7783a933ad14 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -182,4 +182,10 @@ void amd_iommu_domain_set_pgtable(struct protection_domain *domain,
struct dev_table_entry *get_dev_table(struct amd_iommu *iommu);

extern bool amd_iommu_snp_en;
+
+/* AMD IOMMU GID */
+int amd_iommu_vminfo_alloc(struct amd_iommu *iommu, struct amd_iommu_vminfo *vminfo);
+void amd_iommu_vminfo_free(struct amd_iommu *iommu, struct amd_iommu_vminfo *vminfo);
+struct amd_iommu_vminfo *amd_iommu_get_vminfo(int gid);
+
#endif
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 956fd4658a4a..a00731673c50 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -16,6 +16,7 @@
#include <linux/pci.h>
#include <linux/irqreturn.h>
#include <linux/io-pgtable.h>
+#include <linux/hashtable.h>

/*
* Maximum number of IOMMUs supported
@@ -1053,6 +1054,11 @@ struct amd_irte_ops {
void (*clear_allocated)(struct irq_remap_table *, int);
};

+struct amd_iommu_vminfo {
+ u16 gid;
+ struct hlist_node hnode;
+};
+
#ifdef CONFIG_IRQ_REMAP
extern struct amd_irte_ops irte_32_ops;
extern struct amd_irte_ops irte_128_ops;
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index c41932e9f16a..d18b23ac6357 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -23,6 +23,7 @@
#include <linux/amd-iommu.h>
#include <linux/notifier.h>
#include <linux/export.h>
+#include <linux/idr.h>
#include <linux/irq.h>
#include <linux/msi.h>
#include <linux/irqdomain.h>
@@ -66,6 +67,16 @@ LIST_HEAD(acpihid_map);
const struct iommu_ops amd_iommu_ops;
static const struct iommu_dirty_ops amd_dirty_ops;

+/* VMInfo Hashtable */
+#define AMD_IOMMU_VMINFO_HASH_BITS 16
+DEFINE_HASHTABLE(amd_iommu_vminfo_hash, AMD_IOMMU_VMINFO_HASH_BITS);
+DEFINE_SPINLOCK(amd_iommu_vminfo_hash_lock);
+
+/* Global VMID */
+#define AMD_IOMMU_VMID_INVALID (-1U)
+static DEFINE_IDA(amd_iommu_global_vmid_ida);
+static u32 amd_iommu_latest_gid;
+
int amd_iommu_max_glx_val = -1;

/*
@@ -101,6 +112,61 @@ static inline bool domain_id_is_per_dev(struct protection_domain *pdom)
return (pdom && pdom->pd_mode != PD_MODE_V1);
}

+int get_vmid(void)
+{
+ int ret;
+
+ ret = ida_alloc_range(&amd_iommu_global_vmid_ida, 1, 0xFFFF, GFP_KERNEL);
+ return ret < 0 ? AMD_IOMMU_VMID_INVALID : ret;
+}
+
+int amd_iommu_vminfo_alloc(struct amd_iommu *iommu, struct amd_iommu_vminfo *vminfo)
+{
+ u32 gid;
+ unsigned long flags;
+
+ spin_lock_irqsave(&amd_iommu_vminfo_hash_lock, flags);
+ gid = amd_iommu_latest_gid = get_vmid();
+ if (gid == AMD_IOMMU_VMID_INVALID)
+ return -EINVAL;
+
+ pr_debug("%s: gid=%u\n", __func__, gid);
+ vminfo->gid = gid;
+ hash_add(amd_iommu_vminfo_hash, &vminfo->hnode, vminfo->gid);
+ spin_unlock_irqrestore(&amd_iommu_vminfo_hash_lock, flags);
+ return 0;
+}
+
+void amd_iommu_vminfo_free(struct amd_iommu *iommu,
+ struct amd_iommu_vminfo *vminfo)
+{
+ unsigned long flags;
+
+ pr_debug("%s: gid=%u\n", __func__, vminfo->gid);
+ spin_lock_irqsave(&amd_iommu_vminfo_hash_lock, flags);
+ hash_del(&vminfo->hnode);
+ ida_free(&amd_iommu_global_vmid_ida, vminfo->gid);
+ spin_unlock_irqrestore(&amd_iommu_vminfo_hash_lock, flags);
+}
+
+struct amd_iommu_vminfo *amd_iommu_get_vminfo(int gid)
+{
+ unsigned long flags;
+ struct amd_iommu_vminfo *tmp, *ptr = NULL;
+
+ spin_lock_irqsave(&amd_iommu_vminfo_hash_lock, flags);
+ hash_for_each_possible(amd_iommu_vminfo_hash, tmp, hnode, gid) {
+ if (tmp->gid == gid) {
+ ptr = tmp;
+ break;
+ }
+ }
+ if (!ptr)
+ pr_debug("%s : gid=%u not found\n", __func__, gid);
+ spin_unlock_irqrestore(&amd_iommu_vminfo_hash_lock, flags);
+ return ptr;
+}
+
static inline int get_acpihid_device_id(struct device *dev,
struct acpihid_map_entry **entry)
{
--
2.34.1

2023-12-12 16:02:46

by Suravee Suthikulpanit

Subject: [RFC PATCH 2/6] iommu/amd: Add support for hw_info for iommu capability query

The AMD IOMMU Extended Feature (EFR) and Extended Feature 2 (EFR2) registers
specify the features supported by each IOMMU hardware instance.
The IOMMU driver checks the feature-specific bits before enabling
each feature at run time.

For IOMMUFD, the hypervisor determines which IOMMU features to support
in the guest, and communicates this information to user-space (e.g. QEMU)
via the iommufd IOMMU_GET_HW_INFO ioctl.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
drivers/iommu/amd/amd_iommu.h | 2 ++
drivers/iommu/amd/amd_iommu_types.h | 3 +++
drivers/iommu/amd/iommu.c | 38 +++++++++++++++++++++++++++++
include/uapi/linux/iommufd.h | 13 ++++++++++
4 files changed, 56 insertions(+)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 108253edbeb0..4118129f4a24 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -72,6 +72,8 @@ void amd_iommu_dev_flush_pasid_pages(struct iommu_dev_data *dev_data,
void amd_iommu_dev_flush_pasid_all(struct iommu_dev_data *dev_data,
ioasid_t pasid);

+void amd_iommu_build_efr(u64 *efr, u64 *efr2);
+
#ifdef CONFIG_IRQ_REMAP
int amd_iommu_create_irq_domain(struct amd_iommu *iommu);
#else
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 14f67a8cf755..956fd4658a4a 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -100,12 +100,15 @@
#define FEATURE_HDSUP BIT_ULL(52)
#define FEATURE_SNP BIT_ULL(63)

+#define FEATURE_GATS_5LEVEL 1ULL
#define FEATURE_GATS_SHIFT 12
#define FEATURE_GATS_MASK (0x03ULL << FEATURE_GATS_SHIFT)

+#define FEATURE_GLX_3LEVEL 0ULL
#define FEATURE_GLX_SHIFT 14
#define FEATURE_GLX_MASK (0x03ULL << FEATURE_GLX_SHIFT)

+#define FEATURE_PASMAX_16 0xFULL
#define FEATURE_PASMAX_SHIFT 32
#define FEATURE_PASMAX_MASK (0x1FULL << FEATURE_PASMAX_SHIFT)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 4e4ff1550cf3..c41932e9f16a 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2822,8 +2822,46 @@ static const struct iommu_dirty_ops amd_dirty_ops = {
.read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
};

+void amd_iommu_build_efr(u64 *efr, u64 *efr2)
+{
+ if (efr) {
+ *efr = (FEATURE_GT | FEATURE_GIOSUP | FEATURE_PPR);
+
+ /* 5-level v2 page table support */
+ *efr |= ((FEATURE_GATS_5LEVEL << FEATURE_GATS_SHIFT) &
+ FEATURE_GATS_MASK);
+
+ /* 3-level GCR3 table support */
+ *efr |= ((FEATURE_GLX_3LEVEL << FEATURE_GLX_SHIFT) &
+ FEATURE_GLX_MASK);
+
+ /* 16-bit PASMAX support */
+ *efr |= ((FEATURE_PASMAX_16 << FEATURE_PASMAX_SHIFT) &
+ FEATURE_PASMAX_MASK);
+ }
+
+ if (efr2)
+ *efr2 = 0;
+}
+
+static void *amd_iommu_hw_info(struct device *dev, u32 *length, u32 *type)
+{
+ struct iommu_hw_info_amd *hwinfo;
+
+ hwinfo = kzalloc(sizeof(*hwinfo), GFP_KERNEL);
+ if (!hwinfo)
+ return ERR_PTR(-ENOMEM);
+
+ *length = sizeof(*hwinfo);
+ *type = IOMMU_HW_INFO_TYPE_AMD;
+
+ amd_iommu_build_efr(&hwinfo->efr, &hwinfo->efr2);
+ return hwinfo;
+}
+
const struct iommu_ops amd_iommu_ops = {
.capable = amd_iommu_capable,
+ .hw_info = amd_iommu_hw_info,
.domain_alloc = amd_iommu_domain_alloc,
.domain_alloc_user = amd_iommu_domain_alloc_user,
.probe_device = amd_iommu_probe_device,
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 0b2bc6252e2c..bf4a1f8ab748 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -474,15 +474,28 @@ struct iommu_hw_info_vtd {
__aligned_u64 ecap_reg;
};

+/**
+ * struct iommu_hw_info_amd - AMD IOMMU device info
+ *
+ * @efr : Value of AMD IOMMU Extended Feature Register (EFR)
+ * @efr2: Value of AMD IOMMU Extended Feature 2 Register (EFR2)
+ */
+struct iommu_hw_info_amd {
+ __u64 efr;
+ __u64 efr2;
+};
+
/**
* enum iommu_hw_info_type - IOMMU Hardware Info Types
* @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware
* info
* @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
+ * @IOMMU_HW_INFO_TYPE_AMD: AMD IOMMU info type
*/
enum iommu_hw_info_type {
IOMMU_HW_INFO_TYPE_NONE,
IOMMU_HW_INFO_TYPE_INTEL_VTD,
+ IOMMU_HW_INFO_TYPE_AMD,
};

/**
--
2.34.1

2023-12-12 16:02:56

by Suravee Suthikulpanit

[permalink] [raw]
Subject: [RFC PATCH 6/6] iommu/amd: Introduce nested translation support

To support nested translation on the AMD IOMMU, the driver needs to
program DTE[GCR3 Table Root Pointer] with the address provided by
the guest via struct iommu_hwpt_amd_v2, which is passed to
struct iommu_ops.domain_alloc_user() when allocating a nested domain
on top of a parent allocated with the flag IOMMU_HWPT_ALLOC_NEST_PARENT.

Note that the current implementation only supports GCR3TRPMode for
nested translation, which uses a GPA to program the GCR3 Table Root Pointer.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
drivers/iommu/amd/Makefile | 2 +-
drivers/iommu/amd/amd_iommu.h | 8 +++
drivers/iommu/amd/amd_iommu_types.h | 3 +
drivers/iommu/amd/iommu.c | 63 ++++++++++++++--
drivers/iommu/amd/nested.c | 107 ++++++++++++++++++++++++++++
5 files changed, 175 insertions(+), 8 deletions(-)
create mode 100644 drivers/iommu/amd/nested.c

diff --git a/drivers/iommu/amd/Makefile b/drivers/iommu/amd/Makefile
index f454fbb1569e..447cb6bb48eb 100644
--- a/drivers/iommu/amd/Makefile
+++ b/drivers/iommu/amd/Makefile
@@ -1,3 +1,3 @@
# SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_AMD_IOMMU) += iommu.o init.o quirks.o io_pgtable.o io_pgtable_v2.o
+obj-$(CONFIG_AMD_IOMMU) += iommu.o init.o quirks.o io_pgtable.o io_pgtable_v2.o nested.o
obj-$(CONFIG_AMD_IOMMU_DEBUGFS) += debugfs.o
diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 55479a6efaae..6ea146a964df 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -7,6 +7,7 @@
#ifndef AMD_IOMMU_H
#define AMD_IOMMU_H

+#include <uapi/linux/iommufd.h>
#include <linux/iommu.h>

#include "amd_iommu_types.h"
@@ -75,6 +76,8 @@ void amd_iommu_dev_flush_pasid_all(struct iommu_dev_data *dev_data,
ioasid_t pasid);

void amd_iommu_build_efr(u64 *efr, u64 *efr2);
+int amd_iommu_attach_device(struct iommu_domain *dom, struct device *dev);
+void amd_iommu_domain_free(struct iommu_domain *dom);

#ifdef CONFIG_IRQ_REMAP
int amd_iommu_create_irq_domain(struct amd_iommu *iommu);
@@ -190,4 +193,9 @@ int amd_iommu_vminfo_alloc(struct amd_iommu *iommu, struct amd_iommu_vminfo *vmi
void amd_iommu_vminfo_free(struct amd_iommu *iommu, struct amd_iommu_vminfo *vminfo);
struct amd_iommu_vminfo *amd_iommu_get_vminfo(int gid);

+/* NESTED */
+struct protection_domain *to_pdomain(struct iommu_domain *dom);
+struct iommu_domain *amd_iommu_nested_domain_alloc(struct device *dev,
+ struct iommu_hwpt_amd_v2 *hwpt);
+
#endif
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 1b150e0cb689..c2055b476a97 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -114,6 +114,8 @@
#define FEATURE_PASMAX_MASK (0x1FULL << FEATURE_PASMAX_SHIFT)

/* Extended Feature 2 Bits */
+#define FEATURE_GCR3TRPMODE BIT_ULL(3)
+
#define FEATURE_SNPAVICSUP_SHIFT 5
#define FEATURE_SNPAVICSUP_MASK (0x07ULL << FEATURE_SNPAVICSUP_SHIFT)
#define FEATURE_SNPAVICSUP_GAM(x) \
@@ -1058,6 +1060,7 @@ struct amd_irte_ops {
struct amd_iommu_vminfo {
u16 gid;
struct hlist_node hnode;
+ u64 *devid_table;
};

#ifdef CONFIG_IRQ_REMAP
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 8bf12674dc84..2a7e29e8c112 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -260,7 +260,7 @@ static struct amd_iommu *rlookup_amd_iommu(struct device *dev)
return __rlookup_amd_iommu(seg, PCI_SBDF_TO_DEVID(devid));
}

-static struct protection_domain *to_pdomain(struct iommu_domain *dom)
+struct protection_domain *to_pdomain(struct iommu_domain *dom)
{
return container_of(dom, struct protection_domain, domain);
}
@@ -2526,21 +2526,70 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned int type)
return domain;
}

+static int udata_to_iommu_hwpt_amd_v2(const struct iommu_user_data *user_data,
+ struct iommu_hwpt_amd_v2 *hwpt)
+{
+ if (!user_data)
+ return -EINVAL;
+
+ if (user_data->type != IOMMU_HWPT_DATA_AMD_V2)
+ return -EOPNOTSUPP;
+
+ return iommu_copy_struct_from_user(hwpt, user_data,
+ IOMMU_HWPT_DATA_AMD_V2,
+ guest_paging_mode);
+}
+
+static bool check_nested_support(u32 flags)
+{
+ if (!(flags & IOMMU_HWPT_ALLOC_NEST_PARENT))
+ return true;
+
+ if (!check_feature(FEATURE_GT) ||
+ !check_feature(FEATURE_GIOSUP) ||
+ !check_feature2(FEATURE_GCR3TRPMODE))
+ return false;
+
+ return true;
+}
+
static struct iommu_domain *
amd_iommu_domain_alloc_user(struct device *dev, u32 flags,
struct iommu_domain *parent,
const struct iommu_user_data *user_data)
-
{
- unsigned int type = IOMMU_DOMAIN_UNMANAGED;
+ struct iommu_domain *dom;
+
+ if (parent) {
+ int ret;
+ struct iommu_hwpt_amd_v2 hwpt;
+
+ if (parent->ops != amd_iommu_ops.default_domain_ops)
+ return ERR_PTR(-EINVAL);

- if ((flags & ~IOMMU_HWPT_ALLOC_DIRTY_TRACKING) || parent || user_data)
+ ret = udata_to_iommu_hwpt_amd_v2(user_data, &hwpt);
+ if (ret)
+ return ERR_PTR(ret);
+
+ return amd_iommu_nested_domain_alloc(dev, &hwpt);
+ }
+
+ /* Check supported flags */
+ if (flags & (~(IOMMU_HWPT_ALLOC_NEST_PARENT |
+ IOMMU_HWPT_ALLOC_DIRTY_TRACKING)))
+ return ERR_PTR(-EOPNOTSUPP);
+
+ if (!check_nested_support(flags))
return ERR_PTR(-EOPNOTSUPP);

- return do_iommu_domain_alloc(type, dev, flags);
+ dom = iommu_domain_alloc(dev->bus);
+ if (!dom)
+ return ERR_PTR(-ENOMEM);
+
+ return dom;
}

-static void amd_iommu_domain_free(struct iommu_domain *dom)
+void amd_iommu_domain_free(struct iommu_domain *dom)
{
struct protection_domain *domain;
unsigned long flags;
@@ -2559,7 +2608,7 @@ static void amd_iommu_domain_free(struct iommu_domain *dom)
protection_domain_free(domain);
}

-static int amd_iommu_attach_device(struct iommu_domain *dom,
+int amd_iommu_attach_device(struct iommu_domain *dom,
struct device *dev)
{
struct iommu_dev_data *dev_data = dev_iommu_priv_get(dev);
diff --git a/drivers/iommu/amd/nested.c b/drivers/iommu/amd/nested.c
new file mode 100644
index 000000000000..332f7efcdc92
--- /dev/null
+++ b/drivers/iommu/amd/nested.c
@@ -0,0 +1,107 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 Advanced Micro Devices, Inc.
+ * Author: Suravee Suthikulpanit <[email protected]>
+ */
+
+#define pr_fmt(fmt) "AMD-Vi: " fmt
+#define dev_fmt(fmt) pr_fmt(fmt)
+
+#include <linux/iommu.h>
+#include <uapi/linux/iommufd.h>
+
+#include "amd_iommu.h"
+
+static struct amd_iommu *get_amd_iommu_from_devid(u16 devid)
+{
+ struct amd_iommu *iommu;
+
+ for_each_iommu(iommu)
+ if (iommu->devid == devid)
+ return iommu;
+ return NULL;
+}
+
+/*
+ * Note:
+ * Host-DevID is stored in the per-VM DevID mapping table,
+ * which is indexed by the Guest-DevID.
+ */
+static u16 get_hdev_id(struct amd_iommu *iommu, u16 guestId, u16 gdev_id)
+{
+ struct amd_iommu_vminfo *vminfo;
+ void *addr;
+ u64 offset;
+
+ vminfo = amd_iommu_get_vminfo(guestId);
+ if (!vminfo)
+ return -1;
+
+ addr = vminfo->devid_table;
+ offset = gdev_id << 4;
+ return (*((u64 *)(addr + offset)) >> 24) & 0xFFFF;
+}
+
+static int nested_gcr3_update(struct iommu_hwpt_amd_v2 *hwpt, struct iommu_domain *udom)
+{
+ int ret;
+ u16 hdev_id;
+ struct pci_dev *pdev;
+ struct amd_iommu *iommu;
+
+ iommu = get_amd_iommu_from_devid(hwpt->iommu_id);
+ hdev_id = get_hdev_id(iommu, hwpt->gid, hwpt->gdev_id);
+
+ pr_debug("%s: gid=%u, hdev_id=%#x, gcr3=%#llx\n",
+ __func__, hwpt->gid, hdev_id,
+ (unsigned long long) hwpt->gcr3);
+
+ pdev = pci_get_domain_bus_and_slot(0, PCI_BUS_NUM(hdev_id),
+ hdev_id & 0xff);
+ if (!pdev)
+ return -EINVAL;
+
+ /* Note: Currently only support GCR3TRPMode with nested translation */
+ if (!check_feature2(FEATURE_GCR3TRPMODE))
+ return -EOPNOTSUPP;
+
+ ret = amd_iommu_set_gcr3tbl_trp(iommu, pdev, hwpt->gcr3, hwpt->glx,
+ hwpt->guest_paging_mode);
+ if (ret) {
+ pr_err("%s: Fail to enable gcr3 (devid=%#x)\n", __func__,
+ pci_dev_id(pdev));
+ }
+
+ return ret;
+}
+
+static const struct iommu_domain_ops nested_domain_ops = {
+ .attach_dev = amd_iommu_attach_device,
+ .free = amd_iommu_domain_free,
+};
+
+struct iommu_domain *amd_iommu_nested_domain_alloc(struct device *dev,
+ struct iommu_hwpt_amd_v2 *hwpt)
+{
+ int ret;
+ struct iommu_domain *dom;
+ struct protection_domain *pdom;
+
+ dom = iommu_domain_alloc(dev->bus);
+ if (!dom)
+ return ERR_PTR(-ENOMEM);
+
+ pdom = to_pdomain(dom);
+ dom->type = IOMMU_DOMAIN_NESTED;
+ dom->ops = &nested_domain_ops;
+
+ ret = nested_gcr3_update(hwpt, dom);
+ if (ret)
+ goto err_out;
+
+ return dom;
+
+err_out:
+ iommu_domain_free(dom);
+ return ERR_PTR(-EINVAL);
+}
--
2.34.1

2023-12-13 13:28:17

by Jason Gunthorpe

Subject: Re: [RFC PATCH 2/6] iommu/amd: Add support for hw_info for iommu capability query

On Tue, Dec 12, 2023 at 10:01:35AM -0600, Suravee Suthikulpanit wrote:
> AMD IOMMU Extended Feature (EFR) and Extended Feature 2 (EFR2) registers
> specify features supported by each IOMMU hardware instance.
> The IOMMU driver checks each feature-specific bits before enabling
> each feature at run time.
>
> For IOMMUFD, the hypervisor determines which IOMMU features to support
> in the guest, and communicates this information to user-space (e.g. QEMU)
> via iommufd IOMMU_DEVICE_GET_HW_INFO ioctl.
>
> Signed-off-by: Suravee Suthikulpanit <[email protected]>
> ---
> drivers/iommu/amd/amd_iommu.h | 2 ++
> drivers/iommu/amd/amd_iommu_types.h | 3 +++
> drivers/iommu/amd/iommu.c | 38 +++++++++++++++++++++++++++++
> include/uapi/linux/iommufd.h | 13 ++++++++++
> 4 files changed, 56 insertions(+)
>
> diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
> index 108253edbeb0..4118129f4a24 100644
> --- a/drivers/iommu/amd/amd_iommu.h
> +++ b/drivers/iommu/amd/amd_iommu.h
> @@ -72,6 +72,8 @@ void amd_iommu_dev_flush_pasid_pages(struct iommu_dev_data *dev_data,
> void amd_iommu_dev_flush_pasid_all(struct iommu_dev_data *dev_data,
> ioasid_t pasid);
>
> +void amd_iommu_build_efr(u64 *efr, u64 *efr2);
> +
> #ifdef CONFIG_IRQ_REMAP
> int amd_iommu_create_irq_domain(struct amd_iommu *iommu);
> #else
> diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
> index 14f67a8cf755..956fd4658a4a 100644
> --- a/drivers/iommu/amd/amd_iommu_types.h
> +++ b/drivers/iommu/amd/amd_iommu_types.h
> @@ -100,12 +100,15 @@
> #define FEATURE_HDSUP BIT_ULL(52)
> #define FEATURE_SNP BIT_ULL(63)
>
> +#define FEATURE_GATS_5LEVEL 1ULL
> #define FEATURE_GATS_SHIFT 12
> #define FEATURE_GATS_MASK (0x03ULL << FEATURE_GATS_SHIFT)
>
> +#define FEATURE_GLX_3LEVEL 0ULL
> #define FEATURE_GLX_SHIFT 14
> #define FEATURE_GLX_MASK (0x03ULL << FEATURE_GLX_SHIFT)
>
> +#define FEATURE_PASMAX_16 0xFULL
> #define FEATURE_PASMAX_SHIFT 32
> #define FEATURE_PASMAX_MASK (0x1FULL << FEATURE_PASMAX_SHIFT)
>
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 4e4ff1550cf3..c41932e9f16a 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -2822,8 +2822,46 @@ static const struct iommu_dirty_ops amd_dirty_ops = {
> .read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
> };
>
> +void amd_iommu_build_efr(u64 *efr, u64 *efr2)
> +{
> + if (efr) {
> + *efr = (FEATURE_GT | FEATURE_GIOSUP | FEATURE_PPR);
> +
> + /* 5-level v2 page table support */
> + *efr |= ((FEATURE_GATS_5LEVEL << FEATURE_GATS_SHIFT) &
> + FEATURE_GATS_MASK);
> +
> + /* 3-level GCR3 table support */
> + *efr |= ((FEATURE_GLX_3LEVEL << FEATURE_GLX_SHIFT) &
> + FEATURE_GLX_MASK);
> +
> + /* 16-bit PASMAX support */
> + *efr |= ((FEATURE_PASMAX_16 << FEATURE_PASMAX_SHIFT) &
> + FEATURE_PASMAX_MASK);
> + }
> +
> + if (efr2)
> + *efr2 = 0;

Why are you checking for null here? It is never called with null

> +/**
> + * struct iommu_hw_info_amd - AMD IOMMU device info
> + *
> + * @efr : Value of AMD IOMMU Extended Feature Register (EFR)
> + * @efr2: Value of AMD IOMMU Extended Feature 2 Register (EFR2)

Please reference a section in the spec for what these are just for
clarity

> + */
> +struct iommu_hw_info_amd {
> + __u64 efr;
> + __u64 efr2;
> +};

__aligned_u64

Jason

2023-12-13 13:57:01

by Jason Gunthorpe

Subject: Re: [RFC PATCH 6/6] iommu/amd: Introduce nested translation support

On Tue, Dec 12, 2023 at 10:01:39AM -0600, Suravee Suthikulpanit wrote:

> - if ((flags & ~IOMMU_HWPT_ALLOC_DIRTY_TRACKING) || parent || user_data)
> + ret = udata_to_iommu_hwpt_amd_v2(user_data, &hwpt);
> + if (ret)
> + return ERR_PTR(ret);
> +
> + return amd_iommu_nested_domain_alloc(dev, &hwpt);
> + }
> +
> + /* Check supported flags */
> + if (flags & (~(IOMMU_HWPT_ALLOC_NEST_PARENT |
> + IOMMU_HWPT_ALLOC_DIRTY_TRACKING)))
> + return ERR_PTR(-EOPNOTSUPP);
> +
> + if (!check_nested_support(flags))
> return ERR_PTR(-EOPNOTSUPP);
>
> - return do_iommu_domain_alloc(type, dev, flags);
> + dom = iommu_domain_alloc(dev->bus);

Please don't call iommu_domain_alloc, call your internal function and
force it to allocate the v1 domain..

> +static int nested_gcr3_update(struct iommu_hwpt_amd_v2 *hwpt, struct iommu_domain *udom)
> +{
> + int ret;
> + u16 hdev_id;
> + struct pci_dev *pdev;
> + struct amd_iommu *iommu;
> +
> + iommu = get_amd_iommu_from_devid(hwpt->iommu_id);
> + hdev_id = get_hdev_id(iommu, hwpt->gid, hwpt->gdev_id);
> +
> + pr_debug("%s: gid=%u, hdev_id=%#x, gcr3=%#llx\n",
> + __func__, hwpt->gid, hdev_id,
> + (unsigned long long) hwpt->gcr3);
> +
> + pdev = pci_get_domain_bus_and_slot(0, PCI_BUS_NUM(hdev_id),
> + hdev_id & 0xff);

Huh? "hdev_id"? This is not OK..

The device you are allowed to look at is the "struct device *dev" passed
to alloc. You cannot pass in a struct device and then override it with
another value.

> + if (!pdev)
> + return -EINVAL;
> +
> + /* Note: Currently only support GCR3TRPMode with nested translation */
> + if (!check_feature2(FEATURE_GCR3TRPMODE))
> + return -EOPNOTSUPP;
> +
> + ret = amd_iommu_set_gcr3tbl_trp(iommu, pdev, hwpt->gcr3, hwpt->glx,
> + hwpt->guest_paging_mode);

Waah?

This is touching the dev table? That is not right, allocation is only
*ALLOCATION*. The dev table can't be changed until you do attachment.

Please look at the smmuv3 patches and try to be structurally
similar. AMD and SMMUv3 are *very similar* in how their HW works
excluding the viommu stuff.

You also can't assume your parent is currently attached to anything.

The construction of the DTE has to be from-scratch based on the parent
domain and the provided values in the "hwpt". Again see how smmuv3
does this where there is one function that builds the entire DTE
(called STE)

I'm skeptical you can do this properly without also restructuring the
DTE logic like I've mentioned before, there is a reason I did that for
SMMUv3. :)

> +struct iommu_domain *amd_iommu_nested_domain_alloc(struct device *dev,
> + struct iommu_hwpt_amd_v2 *hwpt)
> +{
> + int ret;
> + struct iommu_domain *dom;
> + struct protection_domain *pdom;
> +
> + dom = iommu_domain_alloc(dev->bus);
> + if (!dom)
> + return ERR_PTR(-ENOMEM);

Also no, do not allocate a normal domain and then 'wreck'
it into a nesting domain. Refactor the allocation code to be in
smaller chucks so you can alloc and init the memory directly here.

Jason

2023-12-13 14:04:16

by Jason Gunthorpe

Subject: Re: [RFC PATCH 4/6] iommufd: Introduce data struct for AMD nested domain allocation

On Tue, Dec 12, 2023 at 10:01:37AM -0600, Suravee Suthikulpanit wrote:
> Introduce IOMMU_HWPT_DATA_AMD_V2 data type for AMD IOMMU v2 page table,
> which is used for stage-1 in nested translation. The data structure
> contains information necessary for setting up the AMD HW-vIOMMU support.
>
> Signed-off-by: Suravee Suthikulpanit <[email protected]>
> ---
> include/uapi/linux/iommufd.h | 23 +++++++++++++++++++++++
> 1 file changed, 23 insertions(+)
>
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index bf4a1f8ab748..e2240d430dd1 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -389,14 +389,37 @@ struct iommu_hwpt_vtd_s1 {
> __u32 __reserved;
> };
>
> +/**
> + * struct iommu_hwpt_amd_v2 - AMD IOMMU specific user-managed
> + * v2 I/O page table data
> + * @gcr3: GCR3 guest physical ddress
> + * @gid: Guest ID
> + * @iommu_id: IOMMU host device ID
> + * @gdev_id: Guest device ID
> + * @gdom_id: Guest domain ID
> + * @glx: GCR3 table levels
> + * @guest_paging_mode: Guest v2 page table paging mode
> + */
> +struct iommu_hwpt_amd_v2 {
> + __aligned_u64 gcr3;
> + __u32 gid;
> + __u32 iommu_id;
> + __u16 gdev_id;
> + __u16 gdom_id;
> + __u16 glx;
> + __u16 guest_paging_mode;

Add explicit padding please

Also, I'm pretty sure that most of these IDs cannot be here.

These are Ok:

__aligned_u64 gcr3; // table top pointer
__u16 gdom_id; // virtual cache tag
__u16 glx; // configuration of radix
__u16 guest_paging_mode; // configuration of radix

These are confusing, probably incorrect.

+ __u32 gid;
+ __u32 iommu_id;
+ __u16 gdev_id;

iommu_id is the RID of the IOMMU, so definately not. The iommu
instance to work on is specifed by the generic dev_id which becomes a
struct device * in the driver callback. Access the target iommu
instance via dev_iommu_priv_get()/etc

The translation of gdev_id to pdev_dev needs to be connected to some
future viommu object, so this shouldn't be part of this series, and
will eventually be provided through some new viommu object API - see
my long outline email

The viommu object should hold the GID. I'm not sure you need a GID
right now (can you just issue invalidation on the physical side?), but
if you do need GID to bridge until the viommu is ready it should
probably be allocated by and stored in the nesting parent.

Jason

2023-12-13 14:05:42

by Jason Gunthorpe

Subject: Re: [RFC PATCH 5/6] iommu/amd: Introduce helper functions to setup GCR3TRPMode

On Tue, Dec 12, 2023 at 10:01:38AM -0600, Suravee Suthikulpanit wrote:
> +/*
> + * For GCR3TRPMode, user-space provides GPA for the GCR3 Root Pointer Table.
> + */
> +int amd_iommu_set_gcr3tbl_trp(struct amd_iommu *iommu, struct pci_dev *pdev,
> + u64 gcr3_tbl, u16 glx, u16 guest_paging_mode)
> +{
> + struct iommu_dev_data *dev_data = dev_iommu_priv_get(&pdev->dev);
> + struct dev_table_entry *dev_table = get_dev_table(iommu);
> + struct gcr3_tbl_info *gcr3_info = &dev_data->gcr3_info;
> + int devid = pci_dev_id(pdev);
> + u64 data0 = dev_table[devid].data[0];
> + u64 data1 = dev_table[devid].data[1];
> + u64 data2 = dev_table[devid].data[2];
> + u64 tmp;

Like I said in my other email, this whole function is conceptually
wrong - you can't read the DTE to learn the parent domain's
contribution to the nesting DTE and you can't write to the DTE during
allocation of a domain!

Jason

2023-12-15 07:32:26

by Tian, Kevin

Subject: RE: [RFC PATCH 2/6] iommu/amd: Add support for hw_info for iommu capability query

> From: Suravee Suthikulpanit <[email protected]>
> Sent: Wednesday, December 13, 2023 12:02 AM
>
> +void amd_iommu_build_efr(u64 *efr, u64 *efr2)
> +{
> + if (efr) {
> + *efr = (FEATURE_GT | FEATURE_GIOSUP | FEATURE_PPR);
> +
> + /* 5-level v2 page table support */
> + *efr |= ((FEATURE_GATS_5LEVEL << FEATURE_GATS_SHIFT) &
> + FEATURE_GATS_MASK);
> +
> + /* 3-level GCR3 table support */
> + *efr |= ((FEATURE_GLX_3LEVEL << FEATURE_GLX_SHIFT) &
> + FEATURE_GLX_MASK);
> +
> + /* 16-bit PASMAX support */
> + *efr |= ((FEATURE_PASMAX_16 << FEATURE_PASMAX_SHIFT)
> &
> + FEATURE_PASMAX_MASK);
> + }
> +
> + if (efr2)
> + *efr2 = 0;
> +}

Doesn't this need to check the support in hw?

2023-12-15 07:36:24

by Tian, Kevin

Subject: RE: [RFC PATCH 3/6] iommu/amd: Introduce Guest-ID struct amd_iommu_vminfo

> From: Suravee Suthikulpanit <[email protected]>
> Sent: Wednesday, December 13, 2023 12:02 AM
>
> AMD HW-vIOMMU feature requires IOMMU driver to specify a unique 16-bit
> Guest ID (GID) for each VM. This ID is used to index into various
> data structures for configuring the hardware.
>
> Introduce amd_iommu_vminfo_hash hashtable to store per-vm
> configuration,
> which uses 16-bit GID as a hash key along with helper functions.
>

Somehow it's unclear to me whether this series is only for hw
supporting vf or for broader hw supporting the nested capability. For
the latter case, is the GID still necessary?

2023-12-15 07:39:06

by Tian, Kevin

Subject: RE: [RFC PATCH 4/6] iommufd: Introduce data struct for AMD nested domain allocation

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, December 13, 2023 10:04 PM
>
> The viommu object should hold the GID. I'm not sure you need a GID
> right now (can you just issue invalidation on the physical side?), but
> if you do need GID to bridge until the viommu is ready it should
> probably be allocated by and stored in the nesting parent.
>

and how is GID different from existing Domain ID of nesting parent?

2023-12-15 07:43:38

by Tian, Kevin

Subject: RE: [RFC PATCH 5/6] iommu/amd: Introduce helper functions to setup GCR3TRPMode

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, December 13, 2023 9:53 PM
>
> On Tue, Dec 12, 2023 at 10:01:38AM -0600, Suravee Suthikulpanit wrote:
> > +/*
> > + * For GCR3TRPMode, user-space provides GPA for the GCR3 Root Pointer
> Table.
> > + */
> > +int amd_iommu_set_gcr3tbl_trp(struct amd_iommu *iommu, struct
> pci_dev *pdev,
> > + u64 gcr3_tbl, u16 glx, u16 guest_paging_mode)
> > +{
> > + struct iommu_dev_data *dev_data = dev_iommu_priv_get(&pdev-
> >dev);
> > + struct dev_table_entry *dev_table = get_dev_table(iommu);
> > + struct gcr3_tbl_info *gcr3_info = &dev_data->gcr3_info;
> > + int devid = pci_dev_id(pdev);
> > + u64 data0 = dev_table[devid].data[0];
> > + u64 data1 = dev_table[devid].data[1];
> > + u64 data2 = dev_table[devid].data[2];
> > + u64 tmp;
>
> Like I said in my other email, this whole function is conceptually
> wrong - you can't read the DTE to learn the parent domain's
> contribution to the nesting DTE and you can't write to the DTE during
> allocation of a domain!
>

Agree. DTE is updated only at attach/detach. domain allocation should
involve things only about the domain itself.

2023-12-15 07:55:55

by Tian, Kevin

Subject: RE: [RFC PATCH 6/6] iommu/amd: Introduce nested translation support

> From: Suravee Suthikulpanit <[email protected]>
> Sent: Wednesday, December 13, 2023 12:02 AM
>
> To support nested translation on AMD IOMMU, the driver needs to
> program DTE[GCR3 Table Root Pointer] with the address provided by
> the guest via struct iommu_hwpt_amd_v2, which is passed as a parameter
> of the struct iommu_ops.domain_alloc_user() with the flag
> IOMMU_HWPT_ALLOC_NEST_PARENT.
>
> Note that current implementation only support GCR3TRPMode for
> nested translation, which uses GPA to program GCR3 Table Root Pointer.
>

Does this mean there is a plan to support another mode in the future,
or does nested translation actually require GCR3TRPMode as a functional
requirement? IMHO the use of a GPA is assumed in the nested configuration
in concept...

2024-01-05 13:39:11

by Suravee Suthikulpanit

Subject: Re: [RFC PATCH 6/6] iommu/amd: Introduce nested translation support

Hi Jason

On 12/13/2023 8:52 PM, Jason Gunthorpe wrote:
> On Tue, Dec 12, 2023 at 10:01:39AM -0600, Suravee Suthikulpanit wrote:
>
>> - if ((flags & ~IOMMU_HWPT_ALLOC_DIRTY_TRACKING) || parent || user_data)
>> + ret = udata_to_iommu_hwpt_amd_v2(user_data, &hwpt);
>> + if (ret)
>> + return ERR_PTR(ret);
>> +
>> + return amd_iommu_nested_domain_alloc(dev, &hwpt);
>> + }
>> +
>> + /* Check supported flags */
>> + if (flags & (~(IOMMU_HWPT_ALLOC_NEST_PARENT |
>> + IOMMU_HWPT_ALLOC_DIRTY_TRACKING)))
>> + return ERR_PTR(-EOPNOTSUPP);
>> +
>> + if (!check_nested_support(flags))
>> return ERR_PTR(-EOPNOTSUPP);
>>
>> - return do_iommu_domain_alloc(type, dev, flags);
>> + dom = iommu_domain_alloc(dev->bus);
>
> Please don't call iommu_domain_alloc, call your internal function and
> force it to allocate the v1 domain..

Okay.

>> +static int nested_gcr3_update(struct iommu_hwpt_amd_v2 *hwpt, struct iommu_domain *udom)
>> +{
>> + int ret;
>> + u16 hdev_id;
>> + struct pci_dev *pdev;
>> + struct amd_iommu *iommu;
>> +
>> + iommu = get_amd_iommu_from_devid(hwpt->iommu_id);
>> + hdev_id = get_hdev_id(iommu, hwpt->gid, hwpt->gdev_id);
>> +
>> + pr_debug("%s: gid=%u, hdev_id=%#x, gcr3=%#llx\n",
>> + __func__, hwpt->gid, hdev_id,
>> + (unsigned long long) hwpt->gcr3);
>> +
>> + pdev = pci_get_domain_bus_and_slot(0, PCI_BUS_NUM(hdev_id),
>> + hdev_id & 0xff);
>
> Huh? "hdev_id"? This is not OK..
>
> The device you are allowed to look at is the "struct device *dev" passed
> to alloc. You cannot pass in a struct device and then override it with
> another value.

Good point. I'll fix this to use the dev.

>> + if (!pdev)
>> + return -EINVAL;
>> +
>> + /* Note: Currently only support GCR3TRPMode with nested translation */
>> + if (!check_feature2(FEATURE_GCR3TRPMODE))
>> + return -EOPNOTSUPP;
>> +
>> + ret = amd_iommu_set_gcr3tbl_trp(iommu, pdev, hwpt->gcr3, hwpt->glx,
>> + hwpt->guest_paging_mode);
>
> Waah?
>
> This is touching the dev table? That is not right, allocation is only
> *ALLOCATION*. The dev table can't be changed until you do attachment.

My understanding is QEMU should call:

1. iommufd_backend_get_ioas()
   -> ioctl(IOMMU_IOAS_ALLOC)

2. For parent domain
   iommufd_backend_alloc_hwpt(IOMMU_HWPT_ALLOC_NEST_PARENT)
   -> ioctl(IOMMU_HWPT_ALLOC)
      --- in Linux ---
      ....
      -> iommufd_hwpt_paging_alloc()
         -> struct iommu_ops.domain_alloc_user(IOMMU_HWPT_ALLOC_NEST_PARENT)

3. For parent domain
   iommufd_device_attach_hwpt()
   -> vfio_device_attach_container()
      -> ioctl(VFIO_DEVICE_ATTACH_IOMMUFD_PT)
         --- in Linux ---
         vfio_iommufd_physical_attach_ioas()
         -> iommufd_device_attach()
            -> iommufd_device_do_attach()
               -> iommufd_hw_pagetable_attach()

4. Same as (2) but for child domain w/ stage 1 table.

5. Same as (3) but for child domain w/ stage 1 table.

You want the GCR3 table root pointer in the DTE to be updated at step 5
only, right? If so, this should be okay.
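
At the raw ioctl level, that sequence is roughly the following (sketch only;
error handling omitted, and iommufd, vfio_dev_fd, dev_id and gcr3_gpa are
assumed to be set up already):

  /* Needs <stdint.h>, <sys/ioctl.h>, <linux/iommufd.h>, <linux/vfio.h> */
  struct iommu_ioas_alloc ioas = { .size = sizeof(ioas) };
  ioctl(iommufd, IOMMU_IOAS_ALLOC, &ioas);                    /* (1) IOAS */

  struct iommu_hwpt_alloc parent = {
          .size = sizeof(parent),
          .flags = IOMMU_HWPT_ALLOC_NEST_PARENT,
          .dev_id = dev_id,
          .pt_id = ioas.out_ioas_id,
  };
  ioctl(iommufd, IOMMU_HWPT_ALLOC, &parent);                  /* (2) parent (v1/stage-2) */

  struct vfio_device_attach_iommufd_pt at = {
          .argsz = sizeof(at),
          .pt_id = parent.out_hwpt_id,
  };
  ioctl(vfio_dev_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &at);     /* (3) attach parent */

  struct iommu_hwpt_amd_v2 v2 = { .gcr3 = gcr3_gpa /* , ... */ };
  struct iommu_hwpt_alloc child = {
          .size = sizeof(child),
          .dev_id = dev_id,
          .pt_id = parent.out_hwpt_id,
          .data_type = IOMMU_HWPT_DATA_AMD_V2,
          .data_len = sizeof(v2),
          .data_uptr = (uintptr_t)&v2,
  };
  ioctl(iommufd, IOMMU_HWPT_ALLOC, &child);                   /* (4) child (v2/stage-1) */

  at.pt_id = child.out_hwpt_id;
  ioctl(vfio_dev_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &at);     /* (5) attach child; DTE written here */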

> Please look at the smmuv3 patches and try to be structurally
> similar. AMD and SMMUv3 are *very similar* in how their HW works
> excluding the viommu stuff.

Could you please point me to the series? I found one but it was really
old. I might have missed the latest stuff.

> You also can't assume your parent is currently attached to anything.
>
> The construction of the DTE has to be from-scratch based on the parent
> domain and the provided values in the "hwpt". Again see how smmuv3
> does this where there is one function that builds the entire DTE
> (called STE)

OK. I'll program the DTE fields related to DMA remapping with the v1
and v2 tables using the information in the parent and child domains only.

> I'm skeptical you can do this properly without also restructuring the
> DTE logic like I've mentioned before, there is a reason I did that for
> SMMUv3. :)
A device can be attached to a domain, which could be either a one-level
or a nested domain. In the nested case, the parent domain contains
information for the stage-2 (v1) table, while the child domain contains
information for the stage-1 (v2) table. For each domain, we need to keep
track of its parent domain.

When calling set_dte_entry(), we need to check whether the device is
attached to a domain that has a parent. If so, we need to configure the
DTE using information from both domains accordingly.

I'll update set_dte_entry() to reflect this logic when programming the DTE.
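
Roughly along these lines (illustrative sketch only; build_dte_stage2() is a
hypothetical helper covering the existing v1 DTE fields, and the GCR3 encoding
mirrors patch 5/6):

  /*
   * Sketch: compose the DTE from the parent (stage-2/v1) domain plus the
   * guest-provided GCR3 information of the nested (stage-1/v2) domain.
   */
  static void set_dte_nested(struct dev_table_entry *dte,
                             struct protection_domain *parent,
                             struct iommu_hwpt_amd_v2 *hwpt)
  {
          /* Host page table root, domain ID, IR/IW, etc. from the parent */
          build_dte_stage2(dte, parent);          /* hypothetical helper */

          /* Guest translation enable and GCR3 table root pointer (GPA) */
          dte->data[0] |= DTE_FLAG_GV | DTE_FLAG_GIOV;
          dte->data[0] |= ((u64)hwpt->glx & DTE_GLX_MASK) << DTE_GLX_SHIFT;
          dte->data[0] |= DTE_GCR3_VAL_A(hwpt->gcr3) << DTE_GCR3_SHIFT_A;
          dte->data[1] |= DTE_GCR3_VAL_B(hwpt->gcr3) << DTE_GCR3_SHIFT_B;
          dte->data[1] |= DTE_GCR3_VAL_C(hwpt->gcr3) << DTE_GCR3_SHIFT_C;
          dte->data[2] |= (u64)hwpt->guest_paging_mode << DTE_GPT_LEVEL_SHIFT;
  }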

>> +struct iommu_domain *amd_iommu_nested_domain_alloc(struct device *dev,
>> + struct iommu_hwpt_amd_v2 *hwpt)
>> +{
>> + int ret;
>> + struct iommu_domain *dom;
>> + struct protection_domain *pdom;
>> +
>> + dom = iommu_domain_alloc(dev->bus);
>> + if (!dom)
>> + return ERR_PTR(-ENOMEM);
>
>
> Also no, do not allocate a normal domain and then 'wreck'
> it into a nesting domain. Refactor the allocation code to be in
> smaller chucks so you can alloc and init the memory directly here.

Good point. I'll take care of this in drivers/iommu/amd/iommu.c:
do_iommu_domain_alloc().

Thanks,
Suravee

2024-01-05 13:39:34

by Suravee Suthikulpanit

Subject: Re: [RFC PATCH 4/6] iommufd: Introduce data struct for AMD nested domain allocation

Hi Jason

On 12/13/2023 9:03 PM, Jason Gunthorpe wrote:
> On Tue, Dec 12, 2023 at 10:01:37AM -0600, Suravee Suthikulpanit wrote:
>> Introduce IOMMU_HWPT_DATA_AMD_V2 data type for AMD IOMMU v2 page table,
>> which is used for stage-1 in nested translation. The data structure
>> contains information necessary for setting up the AMD HW-vIOMMU support.
>>
>> Signed-off-by: Suravee Suthikulpanit <[email protected]>
>> ---
>> include/uapi/linux/iommufd.h | 23 +++++++++++++++++++++++
>> 1 file changed, 23 insertions(+)
>>
>> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
>> index bf4a1f8ab748..e2240d430dd1 100644
>> --- a/include/uapi/linux/iommufd.h
>> +++ b/include/uapi/linux/iommufd.h
>> @@ -389,14 +389,37 @@ struct iommu_hwpt_vtd_s1 {
>> __u32 __reserved;
>> };
>>
>> +/**
>> + * struct iommu_hwpt_amd_v2 - AMD IOMMU specific user-managed
>> + * v2 I/O page table data
>> + * @gcr3: GCR3 guest physical ddress
>> + * @gid: Guest ID
>> + * @iommu_id: IOMMU host device ID
>> + * @gdev_id: Guest device ID
>> + * @gdom_id: Guest domain ID
>> + * @glx: GCR3 table levels
>> + * @guest_paging_mode: Guest v2 page table paging mode
>> + */
>> +struct iommu_hwpt_amd_v2 {
>> + __aligned_u64 gcr3;
>> + __u32 gid;
>> + __u32 iommu_id;
>> + __u16 gdev_id;
>> + __u16 gdom_id;
>> + __u16 glx;
>> + __u16 guest_paging_mode;
>
> Add explicit padding please

OK

> Also, I'm pretty sure that most of these IDs cannot be here.
>
> These are Ok:
>
> __aligned_u64 gcr3; // table top pointer
> __u16 gdom_id; // virtual cache tag
> __u16 glx; // configuration of radix
> __u16 guest_paging_mode; // configuration of radix
>
> These are confusing, probably incorrect.
>
> + __u32 gid;
> + __u32 iommu_id;
> + __u16 gdev_id;

> iommu_id is the RID of the IOMMU, so definately not. The iommu
> instance to work on is specifed by the generic dev_id which becomes a
> struct device * in the driver callback. Access the target iommu
> instance via dev_iommu_priv_get()/etc
>
> The translation of gdev_id to pdev_dev needs to be connected to some
> future viommu object, so this shouldn't be part of this series, and
> will eventually be provided through some new viommu object API - see
> my long outline email

Got it.

> The viommu object should hold the GID. I'm not sure you need a GID
> right now (can you just issue invalidation on the physical side?), but
> if you do need GID to bridge until the viommu is ready it should
> probably be allocated by and stored in the nesting parent.

Currently, the GID is needed when setting up the domain ID mapping table
(and the device ID mapping table in a future series), which will be moved
to the point where the domain is attached to a device. By that time, the
GID information should already be stored in another struct (e.g. struct
iommu_dev_data). So it should be alright to remove the GID from
struct iommu_hwpt_amd_v2.

Thanks,
Suravee

2024-01-05 13:39:51

by Suravee Suthikulpanit

Subject: Re: [RFC PATCH 2/6] iommu/amd: Add support for hw_info for iommu capability query

Hi Jason

On 12/13/2023 8:27 PM, Jason Gunthorpe wrote:
> On Tue, Dec 12, 2023 at 10:01:35AM -0600, Suravee Suthikulpanit wrote:
>> AMD IOMMU Extended Feature (EFR) and Extended Feature 2 (EFR2) registers
>>
>> ...
>>
>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>> index 4e4ff1550cf3..c41932e9f16a 100644
>> --- a/drivers/iommu/amd/iommu.c
>> +++ b/drivers/iommu/amd/iommu.c
>> @@ -2822,8 +2822,46 @@ static const struct iommu_dirty_ops amd_dirty_ops = {
>> .read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
>> };
>>
>> +void amd_iommu_build_efr(u64 *efr, u64 *efr2)
>> +{
>> + if (efr) {
>> + *efr = (FEATURE_GT | FEATURE_GIOSUP | FEATURE_PPR);
>> +
>> + /* 5-level v2 page table support */
>> + *efr |= ((FEATURE_GATS_5LEVEL << FEATURE_GATS_SHIFT) &
>> + FEATURE_GATS_MASK);
>> +
>> + /* 3-level GCR3 table support */
>> + *efr |= ((FEATURE_GLX_3LEVEL << FEATURE_GLX_SHIFT) &
>> + FEATURE_GLX_MASK);
>> +
>> + /* 16-bit PASMAX support */
>> + *efr |= ((FEATURE_PASMAX_16 << FEATURE_PASMAX_SHIFT) &
>> + FEATURE_PASMAX_MASK);
>> + }
>> +
>> + if (efr2)
>> + *efr2 = 0;
>
> Why are you checking for null here? It is never called with null

In subsequent patches of part 2, this helper function will be used to
populate the EFR and/or EFR2 in different call paths, which may pass only
the efr or the efr2 parameter individually.

>> +/**
>> + * struct iommu_hw_info_amd - AMD IOMMU device info
>> + *
>> + * @efr : Value of AMD IOMMU Extended Feature Register (EFR)
>> + * @efr2: Value of AMD IOMMU Extended Feature 2 Register (EFR2)
>
> Please reference a section in the spec for what these are just for
> clarity

Sure

>> + */
>> +struct iommu_hw_info_amd {
>> + __u64 efr;
>> + __u64 efr2;
>> +};
>
> __aligned_u64

Okay.

Thanks,
Suravee

2024-01-05 13:40:13

by Suravee Suthikulpanit

Subject: Re: [RFC PATCH 6/6] iommu/amd: Introduce nested translation support

Hi Kevin,

On 12/15/2023 2:45 PM, Tian, Kevin wrote:
>> From: Suravee Suthikulpanit <[email protected]>
>> Sent: Wednesday, December 13, 2023 12:02 AM
>>
>> To support nested translation on AMD IOMMU, the driver needs to
>> program DTE[GCR3 Table Root Pointer] with the address provided by
>> the guest via struct iommu_hwpt_amd_v2, which is passed as a parameter
>> of the struct iommu_ops.domain_alloc_user() with the flag
>> IOMMU_HWPT_ALLOC_NEST_PARENT.
>>
>> Note that current implementation only support GCR3TRPMode for
>> nested translation, which uses GPA to program GCR3 Table Root Pointer.
>>
>
> means there is a plan to support another mode in the future or
> actually the nested translation requires GCR3TRPMode as a
> functional requirement? imho the point of GPA is assumed
> in the nested configuration in concept...

On (older) systems, which do not support GCR3TRPMode, the IOMMU driver
needs to program the device's DTE[GCR3 Table Root Pointer] field with an SPA.

When QEMU presents an AMD vIOMMU device to a guest, the guest programs
the guest DTE[GCR3 Table Root Pointer] with a GPA. Then we need to:
1. Trap the DTE write
2. Translate the GPA to an SPA
3. Program the DTE with the SPA.

With GCR3TRPMode, we can skip step 2 above and program the GPA directly
in step 3.
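
In pseudo-code, the difference is roughly the following (illustrative
sketch only; gpa_to_spa() and set_dte_gcr3_root() are hypothetical
helpers standing in for the actual translation and DTE-update paths):

/* Handling a trapped guest DTE[GCR3 Table Root Pointer] write */
static int handle_guest_gcr3_write(struct amd_iommu *iommu,
				   struct pci_dev *pdev, u64 gcr3_gpa)
{
	u64 root;

	if (check_feature2(FEATURE_GCR3TRPMODE)) {
		/* HW translates the root pointer through the v1 (stage-2) table */
		root = gcr3_gpa;
	} else {
		/* Older HW expects a system physical address */
		root = gpa_to_spa(gcr3_gpa);		/* hypothetical helper */
	}

	return set_dte_gcr3_root(iommu, pdev, root);	/* hypothetical helper */
}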

Suravee


2024-01-05 13:40:23

by Suravee Suthikulpanit

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] iommu/amd: Introduce Guest-ID struct amd_iommu_vminfo

Hi Kevin

On 12/15/2023 2:35 PM, Tian, Kevin wrote:
>> From: Suravee Suthikulpanit <[email protected]>
>> Sent: Wednesday, December 13, 2023 12:02 AM
>>
>> AMD HW-vIOMMU feature requires IOMMU driver to specify a unique 16-bit
>> Guest ID (GID) for each VM. This ID is used to index into various
>> data structures for configuring the hardware.
>>
>> Introduce amd_iommu_vminfo_hash hashtable to store per-vm
>> configuration,
>> which uses 16-bit GID as a hash key along with helper functions.
>>
>
> somehow it's unclear to me whether this series is only for hw
> supporting vf or broader hw supporting nested capability. for
> the latter case is GID still necessary?

I am restructuring the series and might defer the GID changes until the
broader hw support for the AMD HW-vIOMMU is introduced later.

Suravee

2024-01-05 13:40:59

by Suravee Suthikulpanit

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] iommu/amd: Add support for hw_info for iommu capability query

Hi Kevin,

On 12/15/2023 2:32 PM, Tian, Kevin wrote:
>> From: Suravee Suthikulpanit <[email protected]>
>> Sent: Wednesday, December 13, 2023 12:02 AM
>>
>> +void amd_iommu_build_efr(u64 *efr, u64 *efr2)
>> +{
>> + if (efr) {
>> + *efr = (FEATURE_GT | FEATURE_GIOSUP | FEATURE_PPR);
>> +
>> + /* 5-level v2 page table support */
>> + *efr |= ((FEATURE_GATS_5LEVEL << FEATURE_GATS_SHIFT) &
>> + FEATURE_GATS_MASK);
>> +
>> + /* 3-level GCR3 table support */
>> + *efr |= ((FEATURE_GLX_3LEVEL << FEATURE_GLX_SHIFT) &
>> + FEATURE_GLX_MASK);
>> +
>> + /* 16-bit PASMAX support */
>> + *efr |= ((FEATURE_PASMAX_16 << FEATURE_PASMAX_SHIFT) &
>> + FEATURE_PASMAX_MASK);
>> + }
>> +
>> + if (efr2)
>> + *efr2 = 0;
>> +}
>
> Don't this need to check the support in hw?

Ah, good point. Let me add checks that these features are actually
supported by the hardware.
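
Roughly along these lines (sketch only, assuming check_feature() remains
the helper for testing bits in the host EFR):

void amd_iommu_build_efr(u64 *efr, u64 *efr2)
{
	if (efr) {
		*efr = 0;

		/* Only advertise what the physical IOMMU supports */
		if (check_feature(FEATURE_GT))
			*efr |= FEATURE_GT;
		if (check_feature(FEATURE_GIOSUP))
			*efr |= FEATURE_GIOSUP;
		if (check_feature(FEATURE_PPR))
			*efr |= FEATURE_PPR;

		/*
		 * GATS, GLX and PASMAX would similarly be clamped to the
		 * values reported in amd_iommu_efr rather than hard-coded.
		 */
	}

	if (efr2)
		*efr2 = 0;
}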

Thanks,
Suravee

2024-01-05 13:57:13

by Suravee Suthikulpanit

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] iommu/amd: Introduce helper functions to setup GCR3TRPMode



On 12/13/2023 8:53 PM, Jason Gunthorpe wrote:
> On Tue, Dec 12, 2023 at 10:01:38AM -0600, Suravee Suthikulpanit wrote:
>> +/*
>> + * For GCR3TRPMode, user-space provides GPA for the GCR3 Root Pointer Table.
>> + */
>> +int amd_iommu_set_gcr3tbl_trp(struct amd_iommu *iommu, struct pci_dev *pdev,
>> + u64 gcr3_tbl, u16 glx, u16 guest_paging_mode)
>> +{
>> + struct iommu_dev_data *dev_data = dev_iommu_priv_get(&pdev->dev);
>> + struct dev_table_entry *dev_table = get_dev_table(iommu);
>> + struct gcr3_tbl_info *gcr3_info = &dev_data->gcr3_info;
>> + int devid = pci_dev_id(pdev);
>> + u64 data0 = dev_table[devid].data[0];
>> + u64 data1 = dev_table[devid].data[1];
>> + u64 data2 = dev_table[devid].data[2];
>> + u64 tmp;
>
> Like I said in my other email, this whole function is conceptually
> wrong - you can't read the DTE to learn the parent domain's
> contribution to the nesting DTE and you can't write to the DTE during
> allocation of a domain!
>
> Jason

I'll fix this in the v2.

Suravee

2024-01-05 14:32:16

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] iommu/amd: Introduce nested translation support

On Fri, Jan 05, 2024 at 08:38:47PM +0700, Suthikulpanit, Suravee wrote:
> > > + if (!pdev)
> > > + return -EINVAL;
> > > +
> > > + /* Note: Currently only support GCR3TRPMode with nested translation */
> > > + if (!check_feature2(FEATURE_GCR3TRPMODE))
> > > + return -EOPNOTSUPP;
> > > +
> > > + ret = amd_iommu_set_gcr3tbl_trp(iommu, pdev, hwpt->gcr3, hwpt->glx,
> > > + hwpt->guest_paging_mode);
> >
> > Waah?
> >
> > This is touching the dev table? That is not right, allocation is only
> > *ALLOCATION*. The dev table can't be changed until you do attachment.
>
> My understanding is QEMU should call:
>
> 1. iommufd_backend_get_ioas()
> -> ioctl(IOMMU_IOAS_ALLOC)
>
> 2. For parent domain
> iommufd_backend_alloc_hwpt(IOMMU_HWPT_ALLOC_NEST_PARENT)
> -> ioctl( IOMMU_HWPT_ALLOC)
> --- in Linux ---
> ....
> -> iommufd_hwpt_paging_alloc()
> -> struct iommu_ops.domain_alloc_user(IOMMU_HWPT_ALLOC_NEST_PARENT)
>
> 3. For parent domain
> iommufd_device_attach_hwpt()
> -> vfio_device_attach_container()
> -> ioctl(VFIO_DEVICE_ATTACH_IOMMUFD_PT)
> --- in Linux ---
> vfio_iommufd_physical_attach_ioas()
> -> iommufd_device_attach()
> -> iommufd_device_do_attach()
> -> iommufd_hw_pagetable_attach()
>
> 4. Same as (2) but for child domain w/ stage 1 table.
>
> 5. Same as (3) but for child domain w/ stage 1 table.

Yes, but understand that the driver cannot rely on that order as part of
the API. #3 can be skipped, there can be multiple parents, and everything
must still work.

> You want the gcr3 table root point in the DTE to be update at step 5 only,
> right? If so, this should be okay.

#3 sets the DTE to point to just the parent domain as a V1 page table

#4/5 sets the DTE to point to both the V1 page table and the GCR3 table

The DTE is calculated based only on the *currently* attached
iommu_domain's for the struct device.

> > Please look at the smmuv3 patches and try to be structurally
> > similar. AMD and SMMUv3 are *very similar* in how their HW works
> > excluding the viommu stuff.
>
> Could you please point me to the series? I found one but it was really old.
> I might have missed the latest stuff.

https://lore.kernel.org/linux-iommu/[email protected]/
https://lore.kernel.org/linux-iommu/[email protected]/

(and the github link has the other parts)

> > You also can't assume your parent is currently attached to anything.
> >
> > The construction of the DTE has to be from-scratch based on the parent
> > domain and the provided values in the "hwpt". Again see how smmuv3
> > does this where there is one function that builds the entire DTE
> > (called STE)
>
> Ok. I'll program fields of the DTE, which are related to DMA-remapping with
> v1 and v2 table using the information in the parent and child domains only.

The smmu code looks like:

static void arm_smmu_make_nested_domain_ste(
struct arm_smmu_ste *target, struct arm_smmu_master *master,
struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
{
[..]
// Incorporate the "v1 fields" into the "DTE"
arm_smmu_make_s2_domain_ste(target, master, nested_domain->s2_parent,
ats_enabled);

// Incorporate the "GCR3 fields" into the "DTE"
target->data[0] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_0_CFG,
STRTAB_STE_0_CFG_NESTED)) |
nested_domain->ste[0];

Where "arm_smmu_ste *" is stack memory that holds the "DTE" being
constructed. Once the make* family of functions figures out the exact
STE that the struct device needs, another function writes the STE to
the HW visible location(s).

The goal is to be able to calculate the required DTE for the #3 and
#4/5 cases you list above directly without making assumptions about
the current state of the DTE or anything else.
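
In AMD terms that could look roughly like this (illustrative only, the
helper names are made up):

static void amd_iommu_make_nested_dte(struct dev_table_entry *target,
				      struct iommu_dev_data *dev_data,
				      struct nested_domain *nested)
{
	/* Parent (v1 / stage-2) contribution, computed from scratch */
	amd_iommu_make_v1_dte(target, dev_data, nested->parent);

	/* Guest (GCR3) contribution taken from struct iommu_hwpt_amd_v2 */
	amd_iommu_set_dte_gcr3(target, nested->gcr3, nested->glx,
			       nested->guest_paging_mode);
}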

> A device can be attached to a domain, which could be either one-level or
> nested domain. In case of nesting, the parent domain contains information
> for the stage2 (v1) table, while the child domain contains information for
> the stage1 (v2) table. For each domain, we need to keep track of the parent
> domain.

The nesting domain, and only the nesting domain, stores a pointer to
its parent domain; they are a unit together.
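
i.e. something like (purely illustrative, field names are made up):

struct amd_nested_domain {
	struct iommu_domain domain;
	struct protection_domain *parent;	/* v1 / stage-2 parent */
	u64 gcr3;				/* guest GCR3 root (GPA) */
	u16 glx;
	u16 guest_paging_mode;
};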

Jason

2024-01-05 14:37:34

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] iommu/amd: Introduce nested translation support

On Fri, Jan 05, 2024 at 08:39:52PM +0700, Suthikulpanit, Suravee wrote:
> Hi Kevin,
>
> On 12/15/2023 2:45 PM, Tian, Kevin wrote:
> > > From: Suravee Suthikulpanit <[email protected]>
> > > Sent: Wednesday, December 13, 2023 12:02 AM
> > >
> > > To support nested translation on AMD IOMMU, the driver needs to
> > > program DTE[GCR3 Table Root Pointer] with the address provided by
> > > the guest via struct iommu_hwpt_amd_v2, which is passed as a parameter
> > > of the struct iommu_ops.domain_alloc_user() with the flag
> > > IOMMU_HWPT_ALLOC_NEST_PARENT.
> > >
> > > Note that current implementation only support GCR3TRPMode for
> > > nested translation, which uses GPA to program GCR3 Table Root Pointer.
> > >
> >
> > means there is a plan to support another mode in the future or
> > actually the nested translation requires GCR3TRPMode as a
> > functional requirement? imho the point of GPA is assumed
> > in the nested configuration in concept...
>
> On (older) system, which does not support GCR3TRPMode, the IOMMU driver
> needs to program the device's DTE[GCR3 Table Root Pointer] field w/ SPA.

Meaning that on older systems the GCR3 Table Root Pointer is not
translated by the parent v1 page table?

> When QEMU presents an AMD vIOMMU device to a guest, the guest programs the
> guest DTE[GCR3 Table Root Pointer] with GPA. Then we need to :
> 1. Traps the DTE write
> 2. Translate the GPA->SPA
> 3. Program DTE with SPA.
>
> With the GCR3TRPMode, we can skip step 2 above and directly program step 3
> with GPA.

Do you want to support this? It will be hard to do because it is not
just those three steps (which are easy); you also have to somehow
maintain coherence with any changes to the parent page table, so you
have to hook the iommu_domain unmap as well...

Jason

2024-01-05 14:38:32

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] iommu/amd: Introduce Guest-ID struct amd_iommu_vminfo

On Fri, Jan 05, 2024 at 08:39:56PM +0700, Suthikulpanit, Suravee wrote:
> Hi Kevin
>
> On 12/15/2023 2:35 PM, Tian, Kevin wrote:
> > > From: Suravee Suthikulpanit <[email protected]>
> > > Sent: Wednesday, December 13, 2023 12:02 AM
> > >
> > > AMD HW-vIOMMU feature requires IOMMU driver to specify a unique 16-bit
> > > Guest ID (GID) for each VM. This ID is used to index into various
> > > data structures for configuring the hardware.
> > >
> > > Introduce amd_iommu_vminfo_hash hashtable to store per-vm
> > > configuration,
> > > which uses 16-bit GID as a hash key along with helper functions.
> > >
> >
> > somehow it's unclear to me whether this series is only for hw
> > supporting vf or broader hw supporting nested capability. for
> > the latter case is GID still necessary?
>
> I am restructuring the series and might be moving GID stuff until later when
> introduce broader hw support for AMD vIOMMU.

I'm hoping you can just skip enabling the viommu features and still
have nesting? That should be OK right? The SW will manage the
invalidations.

I'd like to do ARM and AMD accelerated viommu nesting together since
they are so similar it will help to make the APIs correct.

Jason

2024-01-08 06:50:13

by Suravee Suthikulpanit

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] iommu/amd: Introduce nested translation support



On 1/5/2024 9:37 PM, Jason Gunthorpe wrote:
> On Fri, Jan 05, 2024 at 08:39:52PM +0700, Suthikulpanit, Suravee wrote:
>> Hi Kevin,
>>
>> On 12/15/2023 2:45 PM, Tian, Kevin wrote:
>>>> From: Suravee Suthikulpanit <[email protected]>
>>>> Sent: Wednesday, December 13, 2023 12:02 AM
>>>>
>>>> To support nested translation on AMD IOMMU, the driver needs to
>>>> program DTE[GCR3 Table Root Pointer] with the address provided by
>>>> the guest via struct iommu_hwpt_amd_v2, which is passed as a parameter
>>>> of the struct iommu_ops.domain_alloc_user() with the flag
>>>> IOMMU_HWPT_ALLOC_NEST_PARENT.
>>>>
>>>> Note that current implementation only support GCR3TRPMode for
>>>> nested translation, which uses GPA to program GCR3 Table Root Pointer.
>>>>
>>>
>>> means there is a plan to support another mode in the future or
>>> actually the nested translation requires GCR3TRPMode as a
>>> functional requirement? imho the point of GPA is assumed
>>> in the nested configuration in concept...
>>
>> On (older) system, which does not support GCR3TRPMode, the IOMMU driver
>> needs to program the device's DTE[GCR3 Table Root Pointer] field w/ SPA.
>
> Meaning that on older systems the GCR3 Table Root Pointer is not
> translated by the parent v1 page table?

Correct.

>> When QEMU presents an AMD vIOMMU device to a guest, the guest programs the
>> guest DTE[GCR3 Table Root Pointer] with GPA. Then we need to :
>> 1. Traps the DTE write
>> 2. Translate the GPA->SPA
>> 3. Program DTE with SPA.
>>
>> With the GCR3TRPMode, we can skip step 2 above and directly program step 3
>> with GPA.
>
> Do you want to support this? It will be hard to do because it is not
> just those three steps (which are easy) but that you have to somehow
> maintain coherence with any changes to the parent page table, so you
> have to hook the iommu_domain unmap as well...

I'm debating this. Let me get back on this part.

Thanks,
Suravee

2024-01-09 10:11:24

by Suravee Suthikulpanit

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] iommu/amd: Introduce Guest-ID struct amd_iommu_vminfo



On 1/5/2024 9:38 PM, Jason Gunthorpe wrote:
> On Fri, Jan 05, 2024 at 08:39:56PM +0700, Suthikulpanit, Suravee wrote:
>> Hi Kevin
>>
>> On 12/15/2023 2:35 PM, Tian, Kevin wrote:
>>>> From: Suravee Suthikulpanit <[email protected]>
>>>> Sent: Wednesday, December 13, 2023 12:02 AM
>>>>
>>>> AMD HW-vIOMMU feature requires IOMMU driver to specify a unique 16-bit
>>>> Guest ID (GID) for each VM. This ID is used to index into various
>>>> data structures for configuring the hardware.
>>>>
>>>> Introduce amd_iommu_vminfo_hash hashtable to store per-vm
>>>> configuration,
>>>> which uses 16-bit GID as a hash key along with helper functions.
>>>>
>>>
>>> somehow it's unclear to me whether this series is only for hw
>>> supporting vf or broader hw supporting nested capability. for
>>> the latter case is GID still necessary?
>>
>> I am restructuring the series and might be moving GID stuff until later when
>> introduce broader hw support for AMD vIOMMU.
>
> I'm hoping you can just skip enabling the viommu features and still
> have nesting? That should be OK right? The SW will manage the
> invalidations.

Correct, part 1 only adds support for nested translation (w/o the
HW-vIOMMU feature), so SW would need to manage the invalidations.

Suravee