2024-01-12 00:07:17

by Suravee Suthikulpanit

Subject: [RFCv2 PATCH 0/7] iommu/amd: Introduce hardware info reporting and nested translation support

OVERVIEW
--------
This is the first part of a multi-part series to introduce AMD IOMMU
nested translation support with a hardware-virtualized IOMMU (HW-vIOMMU),
which allows user-space to take advantage of the hardware-accelerated
AMD IOMMU in the guest.

Part 1 introduces AMD IOMMU hardware info reporting support via
the ioctl IOMMU_GET_HW_INFO with enum IOMMU_HW_INFO_TYPE_AMD. This allows
the hypervisor to communicate the supported capabilities to user-space.
For example, QEMU can use this information to populate the EFR and EFR2
fields in the guest IVRS table when building the guest ACPI tables.

In addition, this series introduces nested translation support for the
AMD IOMMU v2 page table (stage 1) via the IOMMUFD nesting infrastructure.
This allows user-space to set up the v2 page table and communicate the
necessary information via struct iommu_hwpt_amd_v2 with enum
IOMMU_HWPT_DATA_AMD_V2.

This series is based on top of:
* Linux v6.7-rc5
* Series [PATCH v2 0/9] Improve TLB invalidation logic
(Currently queued in https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git/log/?h=next)
* Series [PATCH v4 00/16] iommu/amd: SVA Support (part 3) - refactor support for GCR3 table
(https://lore.kernel.org/all/[email protected]/)

GITHUB
------
[1] https://github.com/AMDESE/linux/tree/iommu_sva_part3_v4_v6.7_rc5-amd_nesting_rfc_v2.1
(Complete code for the part 1 of this series)
[2] https://github.com/AMDESE/linux/tree/wip/iommu_sva_part3_v4_v6.7_rc5-amd_viommu_20240112.1
(WIP for Linux AMD Hw-vIOMMU support)

CHANGES
-------
* Patch 1 : New
* Patch 2 : New
* Patch 4 : Modify amd_iommu_build_efr() to also check the advertised capabilities against
the current hardware (per Kevin)
* Patch 4 : Also add a reference to the IOMMU spec in a comment (per Jason)
* Patch 5 : Remove gid, iommu_id, and gdev_id (per Jason); introduce the flags struct and giov tracking
* Patch 6,7 : Separate the logic for nested domain allocation and attachment into two patches (per Jason)

HISTORY
-------
* [RFC PATCH 0/6] iommu/amd: Introduce hardware info reporting and nested translation support
(https://lore.kernel.org/linux-iommu/[email protected]/)

* [RFC PATCH 00/21] iommu/amd: Introduce support for HW accelerated vIOMMU w/ nested page table
(https://lore.kernel.org/lkml/[email protected]/)

Thank you,
Suravee

Suravee Suthikulpanit (7):
iommu/amd: Introduce struct gcr3_tbl_info.giov
iommu/amd: Refactor set_dte_entry
iommu/amd: Update PASID, GATS, and GLX feature related macros
iommu/amd: Add support for hw_info for iommu capability query
iommufd: Introduce data struct for AMD nested domain allocation
iommu/amd: Add nested domain allocation support
iommu/amd: Add nested domain attach/detach support

drivers/iommu/amd/Makefile | 2 +-
drivers/iommu/amd/amd_iommu.h | 14 +-
drivers/iommu/amd/amd_iommu_types.h | 20 +-
drivers/iommu/amd/init.c | 8 +-
drivers/iommu/amd/iommu.c | 298 ++++++++++++++++++++++------
drivers/iommu/amd/nested.c | 75 +++++++
include/uapi/linux/iommufd.h | 45 +++++
7 files changed, 387 insertions(+), 75 deletions(-)
create mode 100644 drivers/iommu/amd/nested.c

--
2.34.1



2024-01-12 00:07:30

by Suravee Suthikulpanit

Subject: [RFCv2 PATCH 3/7] iommu/amd: Update PASID, GATS, and GLX feature related macros

Clean up the PASID, GATS, and GLX feature-related macros and reorder them
according to bit index. There is no functional change.
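
The reordering also changes the GATS macros from a field-relative mask (3ULL, applied after shifting) to a pre-shifted mask matching the GLX and PASMAX macros, which is why the hunk in check_feature_gpt_level() swaps the order of the mask and the shift. A standalone sketch (constants copied from the patch) showing that the two conventions extract the same 2-bit field:

```c
#include <assert.h>
#include <stdint.h>

/* Old convention: field-relative mask, applied after shifting */
#define OLD_FEATURE_GATS_SHIFT	12
#define OLD_FEATURE_GATS_MASK	3ULL

/* New convention: pre-shifted mask, matching GLX/PASMAX */
#define FEATURE_GATS_SHIFT	12
#define FEATURE_GATS_MASK	(0x03ULL << FEATURE_GATS_SHIFT)

static uint64_t gats_old(uint64_t efr)
{
	return (efr >> OLD_FEATURE_GATS_SHIFT) & OLD_FEATURE_GATS_MASK;
}

static uint64_t gats_new(uint64_t efr)
{
	return (efr & FEATURE_GATS_MASK) >> FEATURE_GATS_SHIFT;
}
```

With the pre-shifted mask, masking must happen before shifting for the result to stay the same; doing it in the old order against the new mask would always yield 0.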

Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
drivers/iommu/amd/amd_iommu.h | 2 +-
drivers/iommu/amd/amd_iommu_types.h | 13 +++++++------
drivers/iommu/amd/init.c | 8 ++++----
3 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index bbed268e8abc..108253edbeb0 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -106,7 +106,7 @@ static inline bool check_feature2(u64 mask)

static inline int check_feature_gpt_level(void)
{
- return ((amd_iommu_efr >> FEATURE_GATS_SHIFT) & FEATURE_GATS_MASK);
+ return ((amd_iommu_efr & FEATURE_GATS_MASK) >> FEATURE_GATS_SHIFT);
}

static inline bool amd_iommu_gt_ppr_supported(void)
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index ff56c857f6ad..f8baa8d88832 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -93,8 +93,6 @@
#define FEATURE_GA BIT_ULL(7)
#define FEATURE_HE BIT_ULL(8)
#define FEATURE_PC BIT_ULL(9)
-#define FEATURE_GATS_SHIFT (12)
-#define FEATURE_GATS_MASK (3ULL)
#define FEATURE_GAM_VAPIC BIT_ULL(21)
#define FEATURE_GIOSUP BIT_ULL(48)
#define FEATURE_HASUP BIT_ULL(49)
@@ -102,11 +100,14 @@
#define FEATURE_HDSUP BIT_ULL(52)
#define FEATURE_SNP BIT_ULL(63)

-#define FEATURE_PASID_SHIFT 32
-#define FEATURE_PASID_MASK (0x1fULL << FEATURE_PASID_SHIFT)
+#define FEATURE_GATS_SHIFT 12
+#define FEATURE_GATS_MASK (0x03ULL << FEATURE_GATS_SHIFT)

-#define FEATURE_GLXVAL_SHIFT 14
-#define FEATURE_GLXVAL_MASK (0x03ULL << FEATURE_GLXVAL_SHIFT)
+#define FEATURE_GLX_SHIFT 14
+#define FEATURE_GLX_MASK (0x03ULL << FEATURE_GLX_SHIFT)
+
+#define FEATURE_PASMAX_SHIFT 32
+#define FEATURE_PASMAX_MASK (0x1FULL << FEATURE_PASMAX_SHIFT)

/* Extended Feature 2 Bits */
#define FEATURE_SNPAVICSUP_SHIFT 5
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 959820ccfbcc..e84c69fe13d4 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -2080,14 +2080,14 @@ static int __init iommu_init_pci(struct amd_iommu *iommu)
int glxval;
u64 pasmax;

- pasmax = amd_iommu_efr & FEATURE_PASID_MASK;
- pasmax >>= FEATURE_PASID_SHIFT;
+ pasmax = amd_iommu_efr & FEATURE_PASMAX_MASK;
+ pasmax >>= FEATURE_PASMAX_SHIFT;
iommu->iommu.max_pasids = (1 << (pasmax + 1)) - 1;

BUG_ON(iommu->iommu.max_pasids & ~PASID_MASK);

- glxval = amd_iommu_efr & FEATURE_GLXVAL_MASK;
- glxval >>= FEATURE_GLXVAL_SHIFT;
+ glxval = amd_iommu_efr & FEATURE_GLX_MASK;
+ glxval >>= FEATURE_GLX_SHIFT;

if (amd_iommu_max_glx_val == -1)
amd_iommu_max_glx_val = glxval;
--
2.34.1


2024-01-12 00:07:43

by Suravee Suthikulpanit

Subject: [RFCv2 PATCH 2/7] iommu/amd: Refactor set_dte_entry

Separate the logic for setting the DTE[GCR3 Table Root Pointer] into a helper
function, set_dte_gcr3_table(), to prepare for adding nested domain support.
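
For reference, the GCR3 table root pointer is not stored contiguously in the DTE: it is split across three fields (VAL_A in data[0], VAL_B and VAL_C in data[1]). A standalone round-trip sketch using the field widths and shifts as defined in amd_iommu_types.h (hand-copied here, so treat the exact constants as assumptions):

```c
#include <assert.h>
#include <stdint.h>

/* GCR3 root pointer fields, hand-copied from amd_iommu_types.h (assumptions) */
#define DTE_GCR3_VAL_A(x)	(((x) >> 12) & 0x00007ULL)	/* GPA[14:12] */
#define DTE_GCR3_VAL_B(x)	(((x) >> 15) & 0x0ffffULL)	/* GPA[30:15] */
#define DTE_GCR3_VAL_C(x)	(((x) >> 31) & 0x1fffffULL)	/* GPA[51:31] */
#define DTE_GCR3_SHIFT_A	58
#define DTE_GCR3_SHIFT_B	16
#define DTE_GCR3_SHIFT_C	43

struct dev_table_entry { uint64_t data[4]; };

static void encode_gcr3(struct dev_table_entry *dte, uint64_t gcr3)
{
	dte->data[0] |= DTE_GCR3_VAL_A(gcr3) << DTE_GCR3_SHIFT_A;
	dte->data[1] |= DTE_GCR3_VAL_B(gcr3) << DTE_GCR3_SHIFT_B;
	dte->data[1] |= DTE_GCR3_VAL_C(gcr3) << DTE_GCR3_SHIFT_C;
}

static uint64_t decode_gcr3(const struct dev_table_entry *dte)
{
	uint64_t a = (dte->data[0] >> DTE_GCR3_SHIFT_A) & 0x00007ULL;
	uint64_t b = (dte->data[1] >> DTE_GCR3_SHIFT_B) & 0x0ffffULL;
	uint64_t c = (dte->data[1] >> DTE_GCR3_SHIFT_C) & 0x1fffffULL;

	return (a << 12) | (b << 15) | (c << 31);
}

static uint64_t roundtrip(uint64_t gcr3)
{
	struct dev_table_entry dte = { { 0, 0, 0, 0 } };

	encode_gcr3(&dte, gcr3);
	return decode_gcr3(&dte);
}
```

The three fields together cover GPA bits 51:12, so any 4KB-aligned table address below 2^52 survives the encode/decode round trip; this is the layout the new helper has to mask out and re-program.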

Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
drivers/iommu/amd/iommu.c | 124 ++++++++++++++++++++++----------------
1 file changed, 72 insertions(+), 52 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index b9759f6d8be2..71099e5fbaee 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1917,89 +1917,109 @@ int amd_iommu_clear_gcr3(struct iommu_dev_data *dev_data, ioasid_t pasid)
return ret;
}

+static void set_dte_gcr3_table(struct amd_iommu *iommu,
+ struct iommu_dev_data *dev_data,
+ struct dev_table_entry *target)
+{
+ struct gcr3_tbl_info *gcr3_info = &dev_data->gcr3_info;
+ int devid = dev_data->devid;
+ u64 tmp, gcr3 = 0;
+
+ if (!gcr3_info || !gcr3_info->gcr3_tbl)
+ return;
+
+ pr_debug("%s: devid=%#x, glx=%#x, giov=%#x, gcr3_tbl=%#llx\n",
+ __func__, devid, gcr3_info->glx, gcr3_info->giov,
+ (unsigned long long)gcr3_info->gcr3_tbl);
+
+ tmp = gcr3_info->glx;
+ target->data[0] |= (tmp & DTE_GLX_MASK) << DTE_GLX_SHIFT;
+ if (gcr3_info->giov)
+ target->data[0] |= DTE_FLAG_GIOV;
+ target->data[0] |= DTE_FLAG_GV;
+
+ /* First mask out possible old values for GCR3 table */
+ tmp = DTE_GCR3_VAL_A(~0ULL) << DTE_GCR3_SHIFT_A;
+ target->data[0] &= ~tmp;
+ tmp = DTE_GCR3_VAL_B(~0ULL) << DTE_GCR3_SHIFT_B;
+ target->data[1] &= ~tmp;
+ tmp = DTE_GCR3_VAL_C(~0ULL) << DTE_GCR3_SHIFT_C;
+ target->data[1] &= ~tmp;
+
+ gcr3 = iommu_virt_to_phys(gcr3_info->gcr3_tbl);
+
+ /* Encode GCR3 table into DTE */
+ tmp = DTE_GCR3_VAL_A(gcr3) << DTE_GCR3_SHIFT_A;
+ target->data[0] |= tmp;
+ tmp = DTE_GCR3_VAL_B(gcr3) << DTE_GCR3_SHIFT_B;
+ target->data[1] |= tmp;
+ tmp = DTE_GCR3_VAL_C(gcr3) << DTE_GCR3_SHIFT_C;
+ target->data[1] |= tmp;
+
+ /* Use system default */
+ tmp = amd_iommu_gpt_level;
+
+ /* Mask out old values for GuestPagingMode */
+ target->data[2] &= ~(0x3ULL << DTE_GPT_LEVEL_SHIFT);
+ target->data[2] |= (tmp << DTE_GPT_LEVEL_SHIFT);
+}
+
static void set_dte_entry(struct amd_iommu *iommu,
struct iommu_dev_data *dev_data)
{
- u64 pte_root = 0;
- u64 flags = 0;
- u32 old_domid;
- u16 devid = dev_data->devid;
u16 domid;
+ u16 devid = dev_data->devid;
struct protection_domain *domain = dev_data->domain;
+ struct dev_table_entry target = {.data = {0, 0, 0, 0}};
struct dev_table_entry *dev_table = get_dev_table(iommu);
- struct gcr3_tbl_info *gcr3_info = &dev_data->gcr3_info;
+ u32 old_domid = dev_table[devid].data[1] & DEV_DOMID_MASK;

if (domain_id_is_per_dev(domain))
domid = dev_data->domid;
else
domid = domain->id;

+ /*
+ * Need to get the current value in dte[1,2] because they contain
+ * interrupt-remapping settings, which have been programmed earlier.
+ */
+ target.data[1] = dev_table[devid].data[1];
+ target.data[2] = dev_table[devid].data[2];
+
if (domain->iop.mode != PAGE_MODE_NONE)
- pte_root = iommu_virt_to_phys(domain->iop.root);
+ target.data[0] = iommu_virt_to_phys(domain->iop.root);

- pte_root |= (domain->iop.mode & DEV_ENTRY_MODE_MASK)
+ target.data[0] |= (domain->iop.mode & DEV_ENTRY_MODE_MASK)
<< DEV_ENTRY_MODE_SHIFT;

- pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V;
+ target.data[0] |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V;

/*
* When SNP is enabled, Only set TV bit when IOMMU
* page translation is in use.
*/
if (!amd_iommu_snp_en || (domid != 0))
- pte_root |= DTE_FLAG_TV;
-
- flags = dev_table[devid].data[1];
+ target.data[0] |= DTE_FLAG_TV;

if (dev_data->ats_enabled)
- flags |= DTE_FLAG_IOTLB;
+ target.data[1] |= DTE_FLAG_IOTLB;

if (dev_data->ppr)
- pte_root |= 1ULL << DEV_ENTRY_PPR;
+ target.data[0] |= 1ULL << DEV_ENTRY_PPR;

if (domain->dirty_tracking)
- pte_root |= DTE_FLAG_HAD;
-
- if (gcr3_info && gcr3_info->gcr3_tbl) {
- u64 gcr3 = iommu_virt_to_phys(gcr3_info->gcr3_tbl);
- u64 glx = gcr3_info->glx;
- u64 tmp;
-
- pte_root |= DTE_FLAG_GV;
- pte_root |= (glx & DTE_GLX_MASK) << DTE_GLX_SHIFT;
-
- /* First mask out possible old values for GCR3 table */
- tmp = DTE_GCR3_VAL_B(~0ULL) << DTE_GCR3_SHIFT_B;
- flags &= ~tmp;
+ target.data[0] |= DTE_FLAG_HAD;

- tmp = DTE_GCR3_VAL_C(~0ULL) << DTE_GCR3_SHIFT_C;
- flags &= ~tmp;
-
- /* Encode GCR3 table into DTE */
- tmp = DTE_GCR3_VAL_A(gcr3) << DTE_GCR3_SHIFT_A;
- pte_root |= tmp;
-
- tmp = DTE_GCR3_VAL_B(gcr3) << DTE_GCR3_SHIFT_B;
- flags |= tmp;
-
- tmp = DTE_GCR3_VAL_C(gcr3) << DTE_GCR3_SHIFT_C;
- flags |= tmp;
-
- if (amd_iommu_gpt_level == PAGE_MODE_5_LEVEL) {
- dev_table[devid].data[2] |=
- ((u64)GUEST_PGTABLE_5_LEVEL << DTE_GPT_LEVEL_SHIFT);
- }
-
- if (gcr3_info->giov)
- pte_root |= DTE_FLAG_GIOV;
- }
+ target.data[1] &= ~DEV_DOMID_MASK;
+ target.data[1] |= domid;

- flags &= ~DEV_DOMID_MASK;
- flags |= domid;
+ set_dte_gcr3_table(iommu, dev_data, &target);

- old_domid = dev_table[devid].data[1] & DEV_DOMID_MASK;
- dev_table[devid].data[1] = flags;
- dev_table[devid].data[0] = pte_root;
+ dev_table[devid].data[0] = target.data[0];
+ dev_table[devid].data[1] = target.data[1];
+ dev_table[devid].data[2] = target.data[2];

/*
* A kdump kernel might be replacing a domain ID that was copied from
--
2.34.1


2024-01-12 00:08:02

by Suravee Suthikulpanit

Subject: [RFCv2 PATCH 1/7] iommu/amd: Introduce struct gcr3_tbl_info.giov

Introduce struct gcr3_tbl_info.giov to track DTE[GIOV] programming during
IOMMU domain attach. Also add logic to determine whether GIOV is required,
and set the field accordingly.

This is also a preparation for adding nested domain support, where the GIOV
setting is determined by the child domain.

Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
drivers/iommu/amd/amd_iommu_types.h | 1 +
drivers/iommu/amd/iommu.c | 11 +++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 3dc39bbc05fc..ff56c857f6ad 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -536,6 +536,7 @@ struct gcr3_tbl_info {
u64 *gcr3_tbl; /* Guest CR3 table */
int glx; /* Number of levels for GCR3 table */
u32 pasid_cnt; /* Track attached PASIDs */
+ bool giov; /* Track DTE[GIOV] */
};

struct amd_io_pgtable {
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 4e4ff1550cf3..b9759f6d8be2 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1990,8 +1990,7 @@ static void set_dte_entry(struct amd_iommu *iommu,
((u64)GUEST_PGTABLE_5_LEVEL << DTE_GPT_LEVEL_SHIFT);
}

- /* GIOV is supported with V2 page table mode only */
- if (pdom_is_v2_pgtbl_mode(domain))
+ if (gcr3_info->giov)
pte_root |= DTE_FLAG_GIOV;
}

@@ -2067,6 +2066,14 @@ static int do_attach(struct iommu_dev_data *dev_data,
free_gcr3_table(dev_data);
return ret;
}
+
+ /*
+ * GIOV is required for PD_MODE_V2 because we need
+ * to support the case where the end-point device
+ * does not have PASID in the TLP prefix when setting
+ * up to use the v2 table.
+ */
+ dev_data->gcr3_info.giov = true;
}

/* Update device table */
--
2.34.1


2024-01-12 00:08:05

by Suravee Suthikulpanit

Subject: [RFCv2 PATCH 5/7] iommufd: Introduce data struct for AMD nested domain allocation

Introduce the IOMMU_HWPT_DATA_AMD_V2 data type for the AMD IOMMU v2 page
table, which is used for stage 1 in nested translation. The data structure
contains the information necessary for setting up AMD HW-vIOMMU support.
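
Userspace can sanity-check the uAPI layout by mirroring the struct and asserting its size and field offsets. A minimal sketch (struct hand-copied from the patch, with __aligned_u64 replaced by uint64_t, which is already 8-byte aligned at these offsets, and __reserved renamed to avoid the reserved identifier namespace):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of struct iommu_hwpt_amd_v2 from the patch */
struct iommu_hwpt_amd_v2 {
	uint64_t gcr3;		/* GCR3 guest physical address */
	struct {
		uint64_t glx : 1,
			 giov : 1,
			 guest_paging_mode : 2,
			 reserved : 60;
	} flags;
	uint32_t gdom_id;	/* guest domain ID */
	uint32_t reserved;	/* __reserved in the uAPI; must be 0 */
};
```

The 24-byte size and the 8/16-byte field offsets should match what the kernel side sees when iommu_copy_struct_from_user() copies the structure up to its trailing reserved member.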

Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
include/uapi/linux/iommufd.h | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 9901b9f4abe2..b28ec5947571 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -389,14 +389,39 @@ struct iommu_hwpt_vtd_s1 {
__u32 __reserved;
};

+/**
+ * struct iommu_hwpt_amd_v2 - AMD IOMMU specific user-managed
+ * v2 I/O page table data
+ * @gcr3: GCR3 guest physical address
+ * @flags.glx: GCR3 table levels
+ * @flags.giov: GIOV mode
+ * @flags.guest_paging_mode: Guest v2 page table paging mode
+ * @flags.reserved: Must be 0
+ * @gdom_id: Guest domain ID
+ * @__reserved: Must be 0
+ */
+struct iommu_hwpt_amd_v2 {
+ __aligned_u64 gcr3;
+ struct {
+ __aligned_u64 glx : 1,
+ giov : 1,
+ guest_paging_mode : 2,
+ reserved : 60;
+ } flags;
+ __u32 gdom_id;
+ __u32 __reserved;
+};
+
/**
* enum iommu_hwpt_data_type - IOMMU HWPT Data Type
* @IOMMU_HWPT_DATA_NONE: no data
* @IOMMU_HWPT_DATA_VTD_S1: Intel VT-d stage-1 page table
+ * @IOMMU_HWPT_DATA_AMD_V2: AMD IOMMUv2 page table
*/
enum iommu_hwpt_data_type {
IOMMU_HWPT_DATA_NONE,
IOMMU_HWPT_DATA_VTD_S1,
+ IOMMU_HWPT_DATA_AMD_V2,
};

/**
--
2.34.1


2024-01-12 00:08:19

by Suravee Suthikulpanit

Subject: [RFCv2 PATCH 4/7] iommu/amd: Add support for hw_info for iommu capability query

AMD IOMMU Extended Feature (EFR) and Extended Feature 2 (EFR2) registers
specify features supported by each IOMMU hardware instance.
The IOMMU driver checks each feature-specific bits before enabling
each feature at run time.

For IOMMUFD, the hypervisor determines which IOMMU features to support
in the guest, and communicates this information to user-space (e.g. QEMU)
via the iommufd IOMMU_GET_HW_INFO ioctl.
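
The amd_iommu_build_efr() helper added below effectively masks the host EFR down to an allowlist of features the driver can virtualize. A standalone sketch of that masking, with bit positions hand-copied from the driver headers (treat the exact positions as assumptions):

```c
#include <assert.h>
#include <stdint.h>

#define BIT_ULL(n)		(1ULL << (n))

/* EFR bit positions hand-copied from amd_iommu_types.h (assumptions) */
#define FEATURE_PPR		BIT_ULL(1)
#define FEATURE_GT		BIT_ULL(4)
#define FEATURE_GIOSUP		BIT_ULL(48)
#define FEATURE_GATS_MASK	(0x03ULL << 12)
#define FEATURE_GLX_MASK	(0x03ULL << 14)
#define FEATURE_PASMAX_MASK	(0x1FULL << 32)

/* Mirrors the allowlist idea in amd_iommu_build_efr(): only feature
 * bits/fields the driver can virtualize survive into the guest EFR. */
static uint64_t build_guest_efr(uint64_t host_efr)
{
	const uint64_t allow = FEATURE_PPR | FEATURE_GT | FEATURE_GIOSUP |
			       FEATURE_GATS_MASK | FEATURE_GLX_MASK |
			       FEATURE_PASMAX_MASK;

	return host_efr & allow;
}
```

QEMU could then copy the returned efr/efr2 values into the guest IVRS table; any capability outside the allowlist reads back as unsupported in the guest.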

Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
drivers/iommu/amd/amd_iommu.h | 2 ++
drivers/iommu/amd/iommu.c | 36 +++++++++++++++++++++++++++++++++++
include/uapi/linux/iommufd.h | 20 +++++++++++++++++++
3 files changed, 58 insertions(+)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 108253edbeb0..4118129f4a24 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -72,6 +72,8 @@ void amd_iommu_dev_flush_pasid_pages(struct iommu_dev_data *dev_data,
void amd_iommu_dev_flush_pasid_all(struct iommu_dev_data *dev_data,
ioasid_t pasid);

+void amd_iommu_build_efr(u64 *efr, u64 *efr2);
+
#ifdef CONFIG_IRQ_REMAP
int amd_iommu_create_irq_domain(struct amd_iommu *iommu);
#else
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 71099e5fbaee..134f4af921dc 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2849,8 +2849,44 @@ static const struct iommu_dirty_ops amd_dirty_ops = {
.read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
};

+void amd_iommu_build_efr(u64 *efr, u64 *efr2)
+{
+ /* Build the EFR against the current hardware capabilities */
+ if (efr) {
+ *efr = 0ULL;
+ *efr |= (amd_iommu_efr & FEATURE_GT);
+ *efr |= (amd_iommu_efr & FEATURE_GIOSUP);
+ *efr |= (amd_iommu_efr & FEATURE_PPR);
+ *efr |= (amd_iommu_efr & FEATURE_GATS_MASK);
+ *efr |= (amd_iommu_efr & FEATURE_GLX_MASK);
+ *efr |= (amd_iommu_efr & FEATURE_PASMAX_MASK);
+ pr_debug("%s: efr=%#llx\n", __func__, *efr);
+ }
+
+ if (efr2) {
+ *efr2 = 0ULL;
+ pr_debug("%s: efr2=%#llx\n", __func__, *efr2);
+ }
+}
+
+static void *amd_iommu_hw_info(struct device *dev, u32 *length, u32 *type)
+{
+ struct iommu_hw_info_amd *hwinfo;
+
+ hwinfo = kzalloc(sizeof(*hwinfo), GFP_KERNEL);
+ if (!hwinfo)
+ return ERR_PTR(-ENOMEM);
+
+ *length = sizeof(*hwinfo);
+ *type = IOMMU_HW_INFO_TYPE_AMD;
+
+ amd_iommu_build_efr(&hwinfo->efr, &hwinfo->efr2);
+ return hwinfo;
+}
+
const struct iommu_ops amd_iommu_ops = {
.capable = amd_iommu_capable,
+ .hw_info = amd_iommu_hw_info,
.domain_alloc = amd_iommu_domain_alloc,
.domain_alloc_user = amd_iommu_domain_alloc_user,
.probe_device = amd_iommu_probe_device,
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 0b2bc6252e2c..9901b9f4abe2 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -474,15 +474,35 @@ struct iommu_hw_info_vtd {
__aligned_u64 ecap_reg;
};

+/**
+ * struct iommu_hw_info_amd - AMD IOMMU device info
+ *
+ * @efr : Value of AMD IOMMU Extended Feature Register (EFR)
+ * @efr2: Value of AMD IOMMU Extended Feature 2 Register (EFR2)
+ *
+ * Please see the description of these registers in the following sections of
+ * the AMD I/O Virtualization Technology (IOMMU) Specification.
+ * (https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf)
+ *
+ * - MMIO Offset 0030h IOMMU Extended Feature Register
+ * - MMIO Offset 01A0h IOMMU Extended Feature 2 Register
+ */
+struct iommu_hw_info_amd {
+ __aligned_u64 efr;
+ __aligned_u64 efr2;
+};
+
/**
* enum iommu_hw_info_type - IOMMU Hardware Info Types
* @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware
* info
* @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
+ * @IOMMU_HW_INFO_TYPE_AMD: AMD IOMMU info type
*/
enum iommu_hw_info_type {
IOMMU_HW_INFO_TYPE_NONE,
IOMMU_HW_INFO_TYPE_INTEL_VTD,
+ IOMMU_HW_INFO_TYPE_AMD,
};

/**
--
2.34.1


2024-01-12 00:08:50

by Suravee Suthikulpanit

Subject: [RFCv2 PATCH 7/7] iommu/amd: Add nested domain attach/detach support

With GCR3TRPMode, the AMD IOMMU driver does not need to allocate the GCR3
table or the v2 (stage 1) page table. Instead, it uses the GPA of the GCR3
table provided by the guest when attaching/detaching the nested domain.

Modify set_dte_gcr3_table() to program the DTE[GCR3 Table Root Pointer]
for a nested domain with the provided GPA.
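
The attach path also validates the guest-provided GuestPagingMode against the host capability, falling back to the host maximum. A minimal standalone sketch of that clamping (names are illustrative, not the driver's):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the GuestPagingMode check in set_dte_gcr3_table(): if the
 * guest asks for a paging mode beyond what the host IOMMU supports,
 * fall back to the host maximum (amd_iommu_gpt_level in the driver).
 */
static uint16_t clamp_guest_paging_mode(uint16_t requested, uint16_t host_max)
{
	return (requested > host_max) ? host_max : requested;
}
```

In the driver the fallback is also logged with pr_err(), since silently downgrading the guest's paging mode is worth flagging.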

Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
drivers/iommu/amd/iommu.c | 45 +++++++++++++++++++++++++++++++--------
1 file changed, 36 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 51716fa5ccb5..4041ac3fcd1b 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1927,15 +1927,16 @@ static void set_dte_gcr3_table(struct amd_iommu *iommu,
struct dev_table_entry *target)
{
struct gcr3_tbl_info *gcr3_info = &dev_data->gcr3_info;
+ struct protection_domain *pdom = dev_data->domain;
int devid = dev_data->devid;
u64 tmp, gcr3 = 0;

- if (!gcr3_info || !gcr3_info->gcr3_tbl)
+ if (!gcr3_info || (!gcr3_info->gcr3_tbl && !gcr3_info->trp_gpa))
return;

- pr_debug("%s: devid=%#x, glx=%#x, giov=%#x, gcr3_tbl=%#llx\n",
+ pr_debug("%s: devid=%#x, glx=%#x, giov=%#x, gcr3_tbl=%#llx, trp_gpa=%#llx\n",
__func__, devid, gcr3_info->glx, gcr3_info->giov,
- (unsigned long long)gcr3_info->gcr3_tbl);
+ (unsigned long long)gcr3_info->gcr3_tbl, gcr3_info->trp_gpa);

tmp = gcr3_info->glx;
target->data[0] |= (tmp & DTE_GLX_MASK) << DTE_GLX_SHIFT;
@@ -1951,7 +1952,11 @@ static void set_dte_gcr3_table(struct amd_iommu *iommu,
tmp = DTE_GCR3_VAL_C(~0ULL) << DTE_GCR3_SHIFT_C;
target->data[1] &= ~tmp;

- gcr3 = iommu_virt_to_phys(gcr3_info->gcr3_tbl);
+ /* For nested domain, use GCR3 GPA provided */
+ if (amd_iommu_domain_is_nested(pdom))
+ gcr3 = gcr3_info->trp_gpa;
+ else if (gcr3_info->gcr3_tbl)
+ gcr3 = iommu_virt_to_phys(gcr3_info->gcr3_tbl);

/* Encode GCR3 table into DTE */
tmp = DTE_GCR3_VAL_A(gcr3) << DTE_GCR3_SHIFT_A;
@@ -1963,8 +1968,21 @@ static void set_dte_gcr3_table(struct amd_iommu *iommu,
tmp = DTE_GCR3_VAL_C(gcr3) << DTE_GCR3_SHIFT_C;
target->data[1] |= tmp;

- /* Use system default */
- tmp = amd_iommu_gpt_level;
+ if (amd_iommu_domain_is_nested(pdom)) {
+ /*
+ * For a nested domain, the guest provides the guest paging mode.
+ * We need to check it against host capability before setting the mode.
+ */
+ tmp = pdom->guest_paging_mode;
+ if (tmp > amd_iommu_gpt_level) {
+ pr_err("Cannot support Guest paging mode=%#x (dom_id=%#x).\n",
+ pdom->guest_paging_mode, pdom->id);
+ tmp = amd_iommu_gpt_level;
+ }
+ } else {
+ /* Use system default */
+ tmp = amd_iommu_gpt_level;
+ }

/* Mask out old values for GuestPagingMode */
target->data[2] &= ~(0x3ULL << DTE_GPT_LEVEL_SHIFT);
@@ -1981,6 +1999,13 @@ static void set_dte_entry(struct amd_iommu *iommu,
struct dev_table_entry *dev_table = get_dev_table(iommu);
u32 old_domid = dev_table[devid].data[1] & DEV_DOMID_MASK;

+ /*
+ * For nested domain, use parent domain to setup v1 table
+ * information and domain id.
+ */
+ if (amd_iommu_domain_is_nested(domain))
+ domain = domain->parent;
+
if (domain_id_is_per_dev(domain))
domid = dev_data->domid;
else
@@ -2076,7 +2101,8 @@ static int do_attach(struct iommu_dev_data *dev_data,
dev_data->domid = domain_id_alloc();

/* Init GCR3 table and update device table */
- if (domain->pd_mode == PD_MODE_V2) {
+ if (!amd_iommu_domain_is_nested(domain) &&
+ pdom_is_v2_pgtbl_mode(domain)) {
/*
* By default, setup GCR3 table to support MAX PASIDs
* support by the IOMMU HW.
@@ -2117,8 +2143,9 @@ static void do_detach(struct iommu_dev_data *dev_data)

iommu = get_amd_iommu_from_dev(dev_data->dev);

- /* Clear GCR3 table */
- if (domain->pd_mode == PD_MODE_V2) {
+ if (!amd_iommu_domain_is_nested(domain) &&
+ pdom_is_v2_pgtbl_mode(domain)) {
+ /* Clear GCR3 table */
__clear_gcr3(dev_data, 0);
free_gcr3_table(dev_data);
}
--
2.34.1


2024-01-12 00:09:14

by Suravee Suthikulpanit

Subject: [RFCv2 PATCH 6/7] iommu/amd: Add nested domain allocation support

To support nested translation, the parent domain is allocated with the flag
IOMMU_HWPT_ALLOC_NEST_PARENT and stores information about the v1 page table
for stage 2 (i.e. GPA->SPA), whereas the child domain stores information
about the GCR3 root pointer table for stage 1 (i.e. GVA->GPA).
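
Conceptually, a nested walk composes the two stages: stage 1 translates GVA->GPA and stage 2 translates GPA->SPA. A toy sketch with flat lookup tables standing in for the real multi-level page tables:

```c
#include <assert.h>
#include <stdint.h>

#define NPAGES		16
#define PAGE_SHIFT	12
#define OFFS_MASK	((1ULL << PAGE_SHIFT) - 1)
#define INVALID		(~0ULL)

/* Toy single-level "page table": index = page number, value = next-level
 * page number or INVALID. The real v1/v2 tables are multi-level. */
static uint64_t translate(const uint64_t *table, uint64_t addr)
{
	uint64_t pfn = table[(addr >> PAGE_SHIFT) % NPAGES];

	return (pfn == INVALID) ? INVALID
				: ((pfn << PAGE_SHIFT) | (addr & OFFS_MASK));
}

/* Nested walk: stage 1 (GVA -> GPA), then stage 2 (GPA -> SPA) */
static uint64_t nested_translate(const uint64_t *s1, const uint64_t *s2,
				 uint64_t gva)
{
	uint64_t gpa = translate(s1, gva);

	return (gpa == INVALID) ? INVALID : translate(s2, gpa);
}

static uint64_t demo(void)
{
	uint64_t s1[NPAGES], s2[NPAGES];
	int i;

	for (i = 0; i < NPAGES; i++)
		s1[i] = s2[i] = INVALID;
	s1[1] = 5;	/* stage 1: GVA page 1 -> GPA page 5 */
	s2[5] = 9;	/* stage 2: GPA page 5 -> SPA page 9 */

	return nested_translate(s1, s2, 0x1234);	/* page offset 0x234 */
}
```

The hardware performs this composition itself once the DTE carries both the stage-2 root (from the parent) and the guest GCR3 pointer (from the child).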

Modify the current driver to handle domain allocation with type
IOMMU_DOMAIN_NESTED. Also, when allocating the child domain (i.e. when a
parent domain is specified), keep track of the parent using
struct protection_domain.parent.

Note that the current implementation requires the AMD IOMMU GCR3TRPMode
feature, which programs DTE[GCR3 Table Root Pointer] with the GPA provided
by the guest via struct iommu_hwpt_amd_v2, which is passed as a parameter
of struct iommu_ops.domain_alloc_user().

Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
drivers/iommu/amd/Makefile | 2 +-
drivers/iommu/amd/amd_iommu.h | 10 +++
drivers/iommu/amd/amd_iommu_types.h | 6 ++
drivers/iommu/amd/iommu.c | 96 ++++++++++++++++++++++++++---
drivers/iommu/amd/nested.c | 75 ++++++++++++++++++++++
5 files changed, 181 insertions(+), 8 deletions(-)
create mode 100644 drivers/iommu/amd/nested.c

diff --git a/drivers/iommu/amd/Makefile b/drivers/iommu/amd/Makefile
index f454fbb1569e..447cb6bb48eb 100644
--- a/drivers/iommu/amd/Makefile
+++ b/drivers/iommu/amd/Makefile
@@ -1,3 +1,3 @@
# SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_AMD_IOMMU) += iommu.o init.o quirks.o io_pgtable.o io_pgtable_v2.o
+obj-$(CONFIG_AMD_IOMMU) += iommu.o init.o quirks.o io_pgtable.o io_pgtable_v2.o nested.o
obj-$(CONFIG_AMD_IOMMU_DEBUGFS) += debugfs.o
diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 4118129f4a24..bb25d7c3bff5 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -7,6 +7,7 @@
#ifndef AMD_IOMMU_H
#define AMD_IOMMU_H

+#include <uapi/linux/iommufd.h>
#include <linux/iommu.h>

#include "amd_iommu_types.h"
@@ -182,4 +183,13 @@ void amd_iommu_domain_set_pgtable(struct protection_domain *domain,
struct dev_table_entry *get_dev_table(struct amd_iommu *iommu);

extern bool amd_iommu_snp_en;
+
+/* NESTED */
+struct protection_domain *to_pdomain(struct iommu_domain *dom);
+bool amd_iommu_domain_is_nested(struct protection_domain *pdom);
+struct iommu_domain *
+amd_iommu_nested_domain_alloc(struct device *dev, unsigned int type, u32 flags,
+ struct iommu_hwpt_amd_v2 *hwpt,
+ struct iommu_domain *parent);
+
#endif
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index f8baa8d88832..db77b050a496 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -110,6 +110,8 @@
#define FEATURE_PASMAX_MASK (0x1FULL << FEATURE_PASMAX_SHIFT)

/* Extended Feature 2 Bits */
+#define FEATURE_GCR3TRPMODE BIT_ULL(3)
+
#define FEATURE_SNPAVICSUP_SHIFT 5
#define FEATURE_SNPAVICSUP_MASK (0x07ULL << FEATURE_SNPAVICSUP_SHIFT)
#define FEATURE_SNPAVICSUP_GAM(x) \
@@ -535,6 +537,7 @@ struct amd_irte_ops;

struct gcr3_tbl_info {
u64 *gcr3_tbl; /* Guest CR3 table */
+ u64 trp_gpa; /* Guest CR3 TRP GPA for nested domain */
int glx; /* Number of levels for GCR3 table */
u32 pasid_cnt; /* Track attached PASIDs */
bool giov; /* Track DTE[GIOV] */
@@ -569,6 +572,9 @@ struct protection_domain {
bool dirty_tracking; /* dirty tracking is enabled in the domain */
unsigned dev_cnt; /* devices assigned to this domain */
unsigned dev_iommu[MAX_IOMMUS]; /* per-IOMMU reference count */
+ struct protection_domain *parent; /* Nested parent domain */
+ u16 guest_paging_mode; /* Guest paging mode */
+ u16 guest_domain_id; /* Guest domain ID */
};

/*
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 134f4af921dc..51716fa5ccb5 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -77,11 +77,16 @@ struct iommu_cmd {

struct kmem_cache *amd_iommu_irq_cache;

+static int amd_iommu_attach_device(struct iommu_domain *dom,
+ struct device *dev);
+
static void detach_device(struct device *dev);

static void set_dte_entry(struct amd_iommu *iommu,
struct iommu_dev_data *dev_data);

+static void amd_iommu_domain_free(struct iommu_domain *dom);
+
/****************************************************************************
*
* Helper functions
@@ -191,7 +196,7 @@ static struct amd_iommu *rlookup_amd_iommu(struct device *dev)
return __rlookup_amd_iommu(seg, PCI_SBDF_TO_DEVID(devid));
}

-static struct protection_domain *to_pdomain(struct iommu_domain *dom)
+struct protection_domain *to_pdomain(struct iommu_domain *dom)
{
return container_of(dom, struct protection_domain, domain);
}
@@ -2367,8 +2372,9 @@ static struct protection_domain *protection_domain_alloc(unsigned int type)
domain->nid = NUMA_NO_NODE;

switch (type) {
- /* No need to allocate io pgtable ops in passthrough mode */
+ /* No need to allocate io pgtable ops in passthrough and nested mode */
case IOMMU_DOMAIN_IDENTITY:
+ case IOMMU_DOMAIN_NESTED:
return domain;
case IOMMU_DOMAIN_DMA:
pgtable = amd_iommu_pgtable;
@@ -2423,7 +2429,12 @@ static bool amd_iommu_hd_support(struct amd_iommu *iommu)
return iommu && (iommu->features & FEATURE_HDSUP);
}

-static struct iommu_domain *do_iommu_domain_alloc(unsigned int type,
+static const struct iommu_domain_ops nested_domain_ops = {
+ .attach_dev = amd_iommu_attach_device,
+ .free = amd_iommu_domain_free,
+};
+
+struct iommu_domain *do_iommu_domain_alloc(unsigned int type,
struct device *dev, u32 flags)
{
bool dirty_tracking = flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
@@ -2454,7 +2465,10 @@ static struct iommu_domain *do_iommu_domain_alloc(unsigned int type,
if (iommu) {
domain->domain.type = type;
domain->domain.pgsize_bitmap = iommu->iommu.ops->pgsize_bitmap;
- domain->domain.ops = iommu->iommu.ops->default_domain_ops;
+ if (type == IOMMU_DOMAIN_NESTED)
+ domain->domain.ops = &nested_domain_ops;
+ else
+ domain->domain.ops = iommu->iommu.ops->default_domain_ops;

if (dirty_tracking)
domain->domain.dirty_ops = &amd_dirty_ops;
@@ -2474,18 +2488,86 @@ static struct iommu_domain *amd_iommu_domain_alloc(unsigned int type)
return domain;
}

+static int udata_to_iommu_hwpt_amd_v2(const struct iommu_user_data *user_data,
+ struct iommu_hwpt_amd_v2 *hwpt)
+{
+ if (!user_data)
+ return -EINVAL;
+
+ if (user_data->type != IOMMU_HWPT_DATA_AMD_V2)
+ return -EOPNOTSUPP;
+
+ return iommu_copy_struct_from_user(hwpt, user_data,
+ IOMMU_HWPT_DATA_AMD_V2,
+ __reserved);
+}
+
+static bool check_nested_support(u32 flags)
+{
+ if (!(flags & IOMMU_HWPT_ALLOC_NEST_PARENT))
+ return true;
+
+ if (!check_feature(FEATURE_GT) ||
+ !check_feature(FEATURE_GIOSUP) ||
+ !check_feature2(FEATURE_GCR3TRPMODE))
+ return false;
+
+ return true;
+}
+
+static u32 amd_iommu_hwpt_supported_flags =
+ IOMMU_HWPT_ALLOC_DIRTY_TRACKING |
+ IOMMU_HWPT_ALLOC_NEST_PARENT;
+
static struct iommu_domain *
amd_iommu_domain_alloc_user(struct device *dev, u32 flags,
struct iommu_domain *parent,
const struct iommu_user_data *user_data)
-
{
+ struct iommu_domain *dom;
+ struct iommu_dev_data *dev_data;
unsigned int type = IOMMU_DOMAIN_UNMANAGED;
+ bool nested_parent = flags & IOMMU_HWPT_ALLOC_NEST_PARENT;
+
+ if (parent) {
+ int ret;
+ struct iommu_hwpt_amd_v2 hwpt;
+
+ if (parent->ops != amd_iommu_ops.default_domain_ops)
+ return ERR_PTR(-EINVAL);
+
+ ret = udata_to_iommu_hwpt_amd_v2(user_data, &hwpt);
+ if (ret)
+ return ERR_PTR(ret);

- if ((flags & ~IOMMU_HWPT_ALLOC_DIRTY_TRACKING) || parent || user_data)
+ return amd_iommu_nested_domain_alloc(dev, type, flags,
+ &hwpt, parent);
+ }
+
+ /* Check supported flags */
+ if ((flags & ~amd_iommu_hwpt_supported_flags) ||
+ !check_nested_support(flags))
return ERR_PTR(-EOPNOTSUPP);

- return do_iommu_domain_alloc(type, dev, flags);
+ dev_data = dev_iommu_priv_get(dev);
+
+ /*
+ * When allocating a nested parent domain, the device may already
+ * have been attached to a domain. For example, the device may already
+ * be attached to the domain allocated by VFIO, which contains the
+ * GPA->SPA mappings. In that case, return a reference to the same domain.
+ */
+ if (dev_data->domain && nested_parent) {
+ pr_debug("%s: Found existing protection domain id=%#x\n",
+ __func__, dev_data->domain->id);
+ dom = &dev_data->domain->domain;
+ } else {
+ dom = do_iommu_domain_alloc(type, dev, flags);
+ if (IS_ERR(dom))
+ return dom;
+ }
+
+ return dom;
}

static void amd_iommu_domain_free(struct iommu_domain *dom)
diff --git a/drivers/iommu/amd/nested.c b/drivers/iommu/amd/nested.c
new file mode 100644
index 000000000000..1addcb21a38c
--- /dev/null
+++ b/drivers/iommu/amd/nested.c
@@ -0,0 +1,75 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 Advanced Micro Devices, Inc.
+ * Author: Suravee Suthikulpanit <[email protected]>
+ */
+
+#define pr_fmt(fmt) "AMD-Vi: " fmt
+#define dev_fmt(fmt) pr_fmt(fmt)
+
+#include <linux/iommu.h>
+#include <uapi/linux/iommufd.h>
+
+#include "amd_iommu.h"
+
+bool amd_iommu_domain_is_nested(struct protection_domain *pdom)
+{
+ return (pdom && pdom->parent != NULL);
+}
+
+static int nested_gcr3_update(struct iommu_hwpt_amd_v2 *hwpt,
+ struct protection_domain *pdom,
+ struct protection_domain *ppdom,
+ struct device *dev)
+{
+ struct pci_dev *pdev;
+ struct iommu_dev_data *dev_data = dev_iommu_priv_get(dev);
+
+ pdev = to_pci_dev(dev);
+ if (!pdev)
+ return -EINVAL;
+
+ /* Note: Currently only support GCR3TRPMode with nested translation */
+ if (!check_feature2(FEATURE_GCR3TRPMODE))
+ return -EOPNOTSUPP;
+
+ pdom->parent = ppdom;
+ pdom->guest_domain_id = hwpt->gdom_id;
+ pdom->guest_paging_mode = hwpt->flags.guest_paging_mode;
+
+ dev_data->gcr3_info.trp_gpa = hwpt->gcr3;
+ dev_data->gcr3_info.glx = hwpt->flags.glx;
+ dev_data->gcr3_info.giov = hwpt->flags.giov;
+
+ return 0;
+}
+
+struct iommu_domain *do_iommu_domain_alloc(unsigned int type,
+ struct device *dev, u32 flags);
+struct iommu_domain *
+amd_iommu_nested_domain_alloc(struct device *dev, unsigned int type, u32 flags,
+ struct iommu_hwpt_amd_v2 *hwpt,
+ struct iommu_domain *parent)
+{
+ int ret;
+ struct iommu_domain *dom;
+ struct protection_domain *pdom;
+
+ pr_debug("%s: Allocating nested domain with parent domid=%#x\n",
+ __func__, to_pdomain(parent)->id);
+
+ dom = do_iommu_domain_alloc(IOMMU_DOMAIN_NESTED, dev, flags);
+ if (IS_ERR(dom))
+ return ERR_PTR(-ENOMEM);
+
+ pdom = to_pdomain(dom);
+ ret = nested_gcr3_update(hwpt, pdom, to_pdomain(parent), dev);
+ if (ret)
+ goto err_out;
+
+ return dom;
+
+err_out:
+ iommu_domain_free(dom);
+ return ERR_PTR(-EINVAL);
+}
--
2.34.1


2024-01-22 08:41:27

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFCv2 PATCH 2/7] iommu/amd: Refactor set_dte_entry

> From: Suravee Suthikulpanit <[email protected]>
> Sent: Friday, January 12, 2024 8:07 AM
>
> +
> + /* Use system default */
> + tmp = amd_iommu_gpt_level;
> +
> + /* Mask out old values for GuestPagingMode */
> + target->data[2] &= ~(0x3ULL << DTE_GPT_LEVEL_SHIFT);
> + target->data[2] |= (tmp << DTE_GPT_LEVEL_SHIFT);

Just directly use amd_iommu_gpt_level here.

btw this sounds like a functional change as the original code only does
this for 5level:

	if (amd_iommu_gpt_level == PAGE_MODE_5_LEVEL) {
		dev_table[devid].data[2] |=
			((u64)GUEST_PGTABLE_5_LEVEL << DTE_GPT_LEVEL_SHIFT);
	}

If it's the desired change then better to make it a separate fix.
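For reference, a minimal userspace sketch of the original 5-level-only behavior Kevin is describing (the constants are local stand-ins for the kernel's, and the helper name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the kernel constants */
#define DTE_GPT_LEVEL_SHIFT	54
#define PAGE_MODE_4_LEVEL	0x03
#define PAGE_MODE_5_LEVEL	0x04
#define GUEST_PGTABLE_5_LEVEL	1ULL

/* Original behavior: clear the field, but only program GuestPagingMode
 * when the host supports 5-level guest page tables. */
static uint64_t set_gpt_level(uint64_t data2, int host_gpt_level)
{
	data2 &= ~(0x3ULL << DTE_GPT_LEVEL_SHIFT);
	if (host_gpt_level == PAGE_MODE_5_LEVEL)
		data2 |= GUEST_PGTABLE_5_LEVEL << DTE_GPT_LEVEL_SHIFT;
	return data2;
}
```

The refactored code writes `amd_iommu_gpt_level` into the field unconditionally, which changes what a 4-level system programs; hence the request to split it out as a separate fix.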

2024-01-22 08:49:27

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFCv2 PATCH 5/7] iommufd: Introduce data struct for AMD nested domain allocation

> From: Suravee Suthikulpanit <[email protected]>
> Sent: Friday, January 12, 2024 8:07 AM
>
> +/**
> + * struct iommu_hwpt_amd_v2 - AMD IOMMU specific user-managed
> + * v2 I/O page table data
> + * @gcr3: GCR3 guest physical address
> + * @flags.glx: GCR3 table levels
> + * @flags.giov: GIOV mode
> + * @flags.guest_paging_mode: Guest v2 page table paging mode
> + * @flags.reserved : Must be 0
> + * @gdom_id: Guest domain ID
> + * @__reserved: Must be 0
> + */
> +struct iommu_hwpt_amd_v2 {
> + __aligned_u64 gcr3;
> + struct {
> + __aligned_u64 glx : 1,
> + giov : 1,

My impression was that GIOV allows non-PASID requests to be treated
as if they were tagged with PASID 0, kind of a prerequisite for enabling
nested translation. How does it work if the guest disables it?

> + guest_paging_mode : 2,
> + reserved : 60;
> + } flags;
> + __u32 gdom_id;

this is not used in this series.

2024-01-22 08:56:00

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFCv2 PATCH 6/7] iommu/amd: Add nested domain allocation support

> From: Suravee Suthikulpanit <[email protected]>
> Sent: Friday, January 12, 2024 8:07 AM
>
> To support nested translation, the parent domain is allocated with flag
> IOMMU_HWPT_ALLOC_NEST_PARENT, and stores information of the v1 page table
> for stage 2 (i.e. GPA->SPA), whereas the child domain stores information
> of the GCR3 root pointer table for stage 1 (i.e. GVA->GPA).

put support of NEST_PARENT in a separate patch.

> @@ -569,6 +572,9 @@ struct protection_domain {
> bool dirty_tracking; /* dirty tracking is enabled in the domain */
> unsigned dev_cnt; /* devices assigned to this domain */
> 	unsigned dev_iommu[MAX_IOMMUS];	/* per-IOMMU reference count */
> + struct protection_domain *parent; /* Nested parent domain */
> + u16 guest_paging_mode; /* Guest paging mode */
> + u16 guest_domain_id; /* Guest domain ID */

not used

> +struct iommu_domain *do_iommu_domain_alloc(unsigned int type, struct device *dev, u32 flags)
> {
> 	bool dirty_tracking = flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> @@ -2454,7 +2465,10 @@ static struct iommu_domain *do_iommu_domain_alloc(unsigned int type,
> 	if (iommu) {
> 		domain->domain.type = type;
> 		domain->domain.pgsize_bitmap = iommu->iommu.ops->pgsize_bitmap;
> -		domain->domain.ops = iommu->iommu.ops->default_domain_ops;
> +		if (type == IOMMU_DOMAIN_NESTED)
> +			domain->domain.ops = &nested_domain_ops;
> +		else
> +			domain->domain.ops = iommu->iommu.ops->default_domain_ops;
>
> 		if (dirty_tracking)
> 			domain->domain.dirty_ops = &amd_dirty_ops;

dirty_tracking doesn't apply to nested domain

> +
> +static bool check_nested_support(u32 flags)
> +{
> + if (!(flags & IOMMU_HWPT_ALLOC_NEST_PARENT))
> + return true;

it's more readable to put this check in the caller.

> +
> +	/*
> +	 * When allocated nested parent domain, the device may already
> +	 * have been attached to a domain. For example, a device is already
> +	 * attached to the domain allocated by VFIO, which contains GPA->SPA mapping.
> +	 * In such case, return reference to the same domain.
> +	 */
> +	if (dev_data->domain && nested_parent) {
> +		pr_debug("%s: Found exist: protection domain id=%#x\n",
> +			 __func__, dev_data->domain->id);
> +		dom = &dev_data->domain->domain;

alloc() shouldn't deal with domain reuse. it's the decision in the
caller. If the caller wants to reuse then it can try to attach to an
existing domain if compatible.

if the caller wants to create a new domain then just follow it.

2024-01-22 08:58:00

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFCv2 PATCH 7/7] iommu/amd: Add nested domain attach/detach support

> From: Suravee Suthikulpanit <[email protected]>
> Sent: Friday, January 12, 2024 8:07 AM
>
>
> -	/* Use system default */
> -	tmp = amd_iommu_gpt_level;
> +	if (amd_iommu_domain_is_nested(pdom)) {
> +		/*
> +		 * For nested domain, the guest provides the guest-paging mode.
> +		 * We need to check host capability before setting the mode.
> +		 */
> +		tmp = pdom->guest_paging_mode;
> +		if (tmp > amd_iommu_gpt_level) {
> +			pr_err("Cannot support Guest paging mode=%#x (dom_id=%#x).\n",
> +			       pdom->guest_paging_mode, pdom->id);
> +			tmp = amd_iommu_gpt_level;
> +		}

why not return an error to the caller?
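A hedged sketch of the alternative Kevin is suggesting (the helper name is hypothetical): reject the configuration instead of silently clamping the guest's paging mode, and let the caller propagate the error.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper: fail instead of clamping when the guest asks for a
 * deeper paging mode than the host IOMMU supports. */
static int check_guest_paging_mode(int guest_mode, int host_gpt_level)
{
	if (guest_mode > host_gpt_level)
		return -EOPNOTSUPP;	/* caller propagates the error */
	return 0;
}
```

Silently falling back to the host default would leave the guest believing its requested mode is in effect, so surfacing the error is the safer design.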

2024-03-08 13:49:26

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFCv2 PATCH 1/7] iommu/amd: Introduce struct gcr3_tbl_info.giov

On Thu, Jan 11, 2024 at 06:06:40PM -0600, Suravee Suthikulpanit wrote:
> To track DTE[GIOV] programming during IOMMU domain attach, also add logic
> to determine if the GIOV is required, and set the variable accordingly.
>
> This is also a preparation for adding nested domain support, where the GIOV
> setting is determined by the child domain.
>
> Signed-off-by: Suravee Suthikulpanit <[email protected]>
> ---
> drivers/iommu/amd/amd_iommu_types.h | 1 +
> drivers/iommu/amd/iommu.c | 11 +++++++++--
> 2 files changed, 10 insertions(+), 2 deletions(-)

I really think the DTE handling needs to be cleaned up before nesting
will be easy.

Various bits of the DTE should just flow directly through from the
VM's version of the DTE, it is going to be a mess to do that in this
manner

> @@ -2067,6 +2066,14 @@ static int do_attach(struct iommu_dev_data *dev_data,
> free_gcr3_table(dev_data);
> return ret;
> }
> +
> + /*
> + * GIOV is required for PD_MODE_V2 because we need
> + * to support the case where the end-point device
> + * does not have PASID in the TLP prefix when setting
> + * up to use the v2 table.
> + */
> + dev_data->gcr3_info.giov = true;

Ie who clears this once you set it ? :(

Jason

2024-03-08 13:52:00

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFCv2 PATCH 2/7] iommu/amd: Refactor set_dte_entry

On Thu, Jan 11, 2024 at 06:06:41PM -0600, Suravee Suthikulpanit wrote:

> - old_domid = dev_table[devid].data[1] & DEV_DOMID_MASK;
> - dev_table[devid].data[1] = flags;
> - dev_table[devid].data[0] = pte_root;
> + dev_table[devid].data[0] = target.data[0];
> + dev_table[devid].data[1] = target.data[1];
> + dev_table[devid].data[2] = target.data[2];

Maybe you could have convinced me this doesn't have bugs today with 2
qword updates, but no way this is OK.

The multi qword update needs to be sequenced correctly so the HW
doesn't have tearing. The manual talks about this and it talks about
128 bit updates too.

This needs to follow the design ARM has where it uses the special
update logic to reliably sequence the change. Since nesting is making
the update more complex it becomes important.

Jason
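A rough userspace model of the sequencing Jason is asking for (the field layout and flush details are illustrative, not the real driver): invalidate the entry first, update the high qwords, then publish qword 0 last, so a concurrent walker never observes a torn mix of old and new fields.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DTE_FLAG_V	(1ULL << 0)	/* valid bit lives in qword 0 */

struct dte { _Atomic uint64_t data[4]; };

static void dte_update(struct dte *d, const uint64_t val[4])
{
	/* 1. Make the entry non-valid; a real driver would also flush the
	 *    DTE and wait for completion here. */
	atomic_store_explicit(&d->data[0], val[0] & ~DTE_FLAG_V,
			      memory_order_release);
	/* 2. Update the remaining qwords while the entry is not walked. */
	atomic_store_explicit(&d->data[1], val[1], memory_order_relaxed);
	atomic_store_explicit(&d->data[2], val[2], memory_order_relaxed);
	atomic_store_explicit(&d->data[3], val[3], memory_order_relaxed);
	/* 3. Publish qword 0 (with V set) last, ordered after the rest. */
	atomic_store_explicit(&d->data[0], val[0], memory_order_release);
}
```

This mirrors the break-before-make idea behind the ARM SMMU STE writer: single-qword stores are individually atomic, so the only ordering problem is across qwords, and that is solved by sequencing through the valid bit.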

2024-03-08 13:55:30

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFCv2 PATCH 3/7] iommu/amd: Update PASID, GATS, and GLX feature related macros

On Thu, Jan 11, 2024 at 06:06:42PM -0600, Suravee Suthikulpanit wrote:
> diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
> index ff56c857f6ad..f8baa8d88832 100644
> --- a/drivers/iommu/amd/amd_iommu_types.h
> +++ b/drivers/iommu/amd/amd_iommu_types.h
> @@ -93,8 +93,6 @@
> #define FEATURE_GA BIT_ULL(7)
> #define FEATURE_HE BIT_ULL(8)
> #define FEATURE_PC BIT_ULL(9)
> -#define FEATURE_GATS_SHIFT (12)
> -#define FEATURE_GATS_MASK (3ULL)
> #define FEATURE_GAM_VAPIC BIT_ULL(21)
> #define FEATURE_GIOSUP BIT_ULL(48)
> #define FEATURE_HASUP BIT_ULL(49)
> @@ -102,11 +100,14 @@
> #define FEATURE_HDSUP BIT_ULL(52)
> #define FEATURE_SNP BIT_ULL(63)
>
> -#define FEATURE_PASID_SHIFT 32
> -#define FEATURE_PASID_MASK (0x1fULL << FEATURE_PASID_SHIFT)
> +#define FEATURE_GATS_SHIFT 12
> +#define FEATURE_GATS_MASK (0x03ULL << FEATURE_GATS_SHIFT)
>
> -#define FEATURE_GLXVAL_SHIFT 14
> -#define FEATURE_GLXVAL_MASK (0x03ULL << FEATURE_GLXVAL_SHIFT)
> +#define FEATURE_GLX_SHIFT 14
> +#define FEATURE_GLX_MASK (0x03ULL << FEATURE_GLX_SHIFT)
> +
> +#define FEATURE_PASMAX_SHIFT 32
> +#define FEATURE_PASMAX_MASK (0x1FULL << FEATURE_PASMAX_SHIFT)

If you are touching these please convert them all over to GENMASK.
ie:

#define FEATURE_PASMAX GENMASK_ULL(36, 32)

pasmax = FIELD_GET(FEATURE_PASMAX, amd_iommu_efr)

Jason
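In userspace terms, the conversion Jason is suggesting looks roughly like the sketch below. GENMASK_ULL()/FIELD_GET() are re-implemented locally as stand-ins for the kernel helpers from linux/bits.h and linux/bitfield.h; the field positions match the macros in the quoted hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel's GENMASK_ULL()/FIELD_GET() */
#define GENMASK_ULL(h, l)	((~0ULL << (l)) & (~0ULL >> (63 - (h))))
#define FIELD_GET(mask, reg)	(((reg) & (mask)) / ((mask) & -(mask)))

/* Shift/mask pairs collapsed into single GENMASK definitions */
#define FEATURE_GATS	GENMASK_ULL(13, 12)
#define FEATURE_GLX	GENMASK_ULL(15, 14)
#define FEATURE_PASMAX	GENMASK_ULL(36, 32)
```

With this form there is no separate `_SHIFT` define to keep in sync, and extraction becomes a single `FIELD_GET(FEATURE_PASMAX, amd_iommu_efr)` call.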

2024-03-08 14:05:09

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFCv2 PATCH 6/7] iommu/amd: Add nested domain allocation support

On Thu, Jan 11, 2024 at 06:06:45PM -0600, Suravee Suthikulpanit wrote:
> @@ -2367,8 +2372,9 @@ static struct protection_domain *protection_domain_alloc(unsigned int type)
> domain->nid = NUMA_NO_NODE;
>
> switch (type) {
> - /* No need to allocate io pgtable ops in passthrough mode */
> + /* No need to allocate io pgtable ops in passthrough and nested mode */
> case IOMMU_DOMAIN_IDENTITY:
> + case IOMMU_DOMAIN_NESTED:

These constants should not show up in the driver, it needs to be
reorganized to use the new allocation APIs to avoid this.

> -static struct iommu_domain *do_iommu_domain_alloc(unsigned int type,
> +static const struct iommu_domain_ops nested_domain_ops = {
> + .attach_dev = amd_iommu_attach_device,
> + .free = amd_iommu_domain_free,
> +};

I would expect nesting to have its own attach function too, because it
should be quite different.

> static struct iommu_domain *
> amd_iommu_domain_alloc_user(struct device *dev, u32 flags,
> struct iommu_domain *parent,
> const struct iommu_user_data *user_data)
> -
> {
> + struct iommu_domain *dom;
> + struct iommu_dev_data *dev_data;
> unsigned int type = IOMMU_DOMAIN_UNMANAGED;
> + bool nested_parent = flags & IOMMU_HWPT_ALLOC_NEST_PARENT;
> +
> + if (parent) {
> + int ret;
> + struct iommu_hwpt_amd_v2 hwpt;
> +
> + if (parent->ops != amd_iommu_ops.default_domain_ops)
> + return ERR_PTR(-EINVAL);
> +
> + ret = udata_to_iommu_hwpt_amd_v2(user_data, &hwpt);
> + if (ret)
> + return ERR_PTR(ret);
>
> - if ((flags & ~IOMMU_HWPT_ALLOC_DIRTY_TRACKING) || parent || user_data)
> + return amd_iommu_nested_domain_alloc(dev, type, flags,
> + &hwpt, parent);
> + }
> +
> + /* Check supported flags */
> + if ((flags & ~amd_iommu_hwpt_supported_flags) ||
> + !check_nested_support(flags))
> return ERR_PTR(-EOPNOTSUPP);
>
> - return do_iommu_domain_alloc(type, dev, flags);
> + dev_data = dev_iommu_priv_get(dev);
> +
> + /*
> + * When allocated nested parent domain, the device may already
> + * have been attached to a domain. For example, a device is already
> + * attached to the domain allocated by VFIO, which contains GPA->SPA mapping.
> + * In such case, return reference to the same domain.
> + */

What? No! This would break all the lifecycle model. Domain allocation must
always allocate. qemu needs to allocate the nested partent domain type
from the very start.

> +static int nested_gcr3_update(struct iommu_hwpt_amd_v2 *hwpt,
> + struct protection_domain *pdom,
> + struct protection_domain *ppdom,
> + struct device *dev)
> +{
> + struct pci_dev *pdev;
> + struct iommu_dev_data *dev_data = dev_iommu_priv_get(dev);
> +
> + pdev = to_pci_dev(dev);
> + if (!pdev)
> + return -EINVAL;
> +
> + /* Note: Currently only support GCR3TRPMode with nested translation */
> + if (!check_feature2(FEATURE_GCR3TRPMODE))
> + return -EOPNOTSUPP;
> +
> + pdom->parent = ppdom;
> + pdom->guest_domain_id = hwpt->gdom_id;
> + pdom->guest_paging_mode = hwpt->flags.guest_paging_mode;
> +
> + dev_data->gcr3_info.trp_gpa = hwpt->gcr3;
> + dev_data->gcr3_info.glx = hwpt->flags.glx;
> + dev_data->gcr3_info.giov = hwpt->flags.giov;

Here again, this is just data copied from the vDTE to the pDTE - the
guest's gcr3 related parameters are just bits being flowed through

And as before "alloc" shouldn't touch anything outside the
iommu_domain being allocated, touching dev_data here is completely
wrong.

Jason

2024-03-08 14:07:32

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFCv2 PATCH 4/7] iommu/amd: Add support for hw_info for iommu capability query

On Thu, Jan 11, 2024 at 06:06:43PM -0600, Suravee Suthikulpanit wrote:

> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 71099e5fbaee..134f4af921dc 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -2849,8 +2849,44 @@ static const struct iommu_dirty_ops amd_dirty_ops = {
> .read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
> };
>
> +void amd_iommu_build_efr(u64 *efr, u64 *efr2)
> +{
> + /* Build the EFR against the current hardware capabilities */
> + if (efr) {
> + *efr = 0ULL;
> + *efr |= (amd_iommu_efr & FEATURE_GT);
> + *efr |= (amd_iommu_efr & FEATURE_GIOSUP);
> + *efr |= (amd_iommu_efr & FEATURE_PPR);
> + *efr |= (amd_iommu_efr & FEATURE_GATS_MASK);
> + *efr |= (amd_iommu_efr & FEATURE_GLX_MASK);
> + *efr |= (amd_iommu_efr & FEATURE_PASMAX_MASK);
> + pr_debug("%s: efr=%#llx\n", __func__, *efr);
> + }

Why scrub it like this in the kernel?

Userspace has to scrub it anyhow to be compatible with future kernels.

Is there a reason not to just forward the two efr registers directly?

Jason
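If the two registers were forwarded as-is, the hw_info payload could be as simple as the sketch below (a hypothetical struct, using uint64_t in place of the UAPI `__aligned_u64`):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hw_info payload: mirror the raw registers and let
 * userspace mask out the capabilities it understands. */
struct iommu_hw_info_amd {
	uint64_t efr;	/* Extended Feature Register */
	uint64_t efr2;	/* Extended Feature 2 Register */
};
```

The kernel then has nothing to keep in sync as new EFR bits are defined; filtering policy lives entirely in the VMM building the guest IVRS table.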

2024-03-08 14:08:34

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFCv2 PATCH 5/7] iommufd: Introduce data struct for AMD nested domain allocation

On Thu, Jan 11, 2024 at 06:06:44PM -0600, Suravee Suthikulpanit wrote:
> Introduce IOMMU_HWPT_DATA_AMD_V2 data type for AMD IOMMU v2 page table,
> which is used for stage-1 in nested translation. The data structure
> contains information necessary for setting up the AMD HW-vIOMMU support.
>
> Signed-off-by: Suravee Suthikulpanit <[email protected]>
> ---
> include/uapi/linux/iommufd.h | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 9901b9f4abe2..b28ec5947571 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -389,14 +389,39 @@ struct iommu_hwpt_vtd_s1 {
> __u32 __reserved;
> };
>
> +/**
> + * struct iommu_hwpt_amd_v2 - AMD IOMMU specific user-managed
> + * v2 I/O page table data
> + * @gcr3: GCR3 guest physical address
> + * @flags.glx: GCR3 table levels
> + * @flags.giov: GIOV mode
> + * @flags.guest_paging_mode: Guest v2 page table paging mode
> + * @flags.reserved : Must be 0
> + * @gdom_id: Guest domain ID
> + * @__reserved: Must be 0
> + */
> +struct iommu_hwpt_amd_v2 {
> + __aligned_u64 gcr3;
> + struct {
> + __aligned_u64 glx : 1,
> + giov : 1,
> + guest_paging_mode : 2,
> + reserved : 60;
> + } flags;

IIRC you should not put bitfields in UAPI headers.

I suggest this should just be something like:

__aligned_u64 dte[3];

And you actually just copy bits directly from the vDTE to the pDTE.

Jason
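A sketch of the bit-flow approach Jason describes (the mask values are purely illustrative, not the real DTE layout): the host copies only the guest-owned fields out of the vDTE qwords and preserves everything host-owned.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative per-qword masks of guest-writable DTE bits (not the real
 * layout): e.g. GCR3 root pointer pieces, GLX, GIOV, GuestPagingMode. */
static const uint64_t guest_mask[3] = {
	0x0007ffffffffffc0ULL,	/* qword 0: hypothetical GCR3/GV/GLX bits */
	0x00000000ffff0000ULL,	/* qword 1: hypothetical guest domain bits */
	0x00c0000000000000ULL,	/* qword 2: hypothetical GuestPagingMode */
};

/* Merge guest-controlled fields from the vDTE into the pDTE. */
static void vdte_to_pdte(uint64_t pdte[3], const uint64_t vdte[3])
{
	for (int i = 0; i < 3; i++)
		pdte[i] = (pdte[i] & ~guest_mask[i]) |
			  (vdte[i] & guest_mask[i]);
}
```

With a `__aligned_u64 dte[3]` UAPI, the per-field unpacking (glx, giov, guest_paging_mode) disappears and validation reduces to rejecting any set bit outside the guest-writable masks.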