2023-09-22 07:27:29

by Yi Liu

[permalink] [raw]
Subject: [PATCH v5 00/11] Add Intel VT-d nested translation

This is to add Intel VT-d nested translation based on IOMMUFD nesting
infrastructure. As the iommufd nesting infrastructure series[1], iommu
core supports new ops to report iommu hardware information, allocate
domains with user data and invalidate stage-1 IOTLB when there is mapping
changed in stage-1 page table. The data required in the three paths are
vendor-specific, so

1) IOMMU_HWPT_TYPE_VTD_S1 is defined for the Intel VT-d stage-1 page
table, it will be used in the stage-1 domain allocation and IOTLB
syncing path. struct iommu_hwpt_vtd_s1 is defined to pass user_data
for the Intel VT-d stage-1 domain allocation.
struct iommu_hwpt_vtd_s1_invalidate is defined to pass the data for
the Intel VT-d stage-1 IOTLB invalidation.
2) IOMMU_HW_INFO_TYPE_INTEL_VTD and struct iommu_hw_info_vtd are defined
to report iommu hardware information for Intel VT-d.

With above IOMMUFD extensions, the intel iommu driver implements the three
paths to support nested translation.

The first Intel platform supporting nested translation is Sapphire
Rapids which, unfortunately, has a hardware errata [2] requiring special
treatment. This errata happens when a stage-1 page table page (either
level) is located in a stage-2 read-only region. In that case the IOMMU
hardware may ignore the stage-2 RO permission and still set the A/D bit
in stage-1 page table entries during page table walking.

A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report
this errata to userspace. With that restriction the user should either
disable nested translation to favor RO stage-2 mappings or ensure no
RO stage-2 mapping to enable nested translation.

Intel-iommu driver is armed with necessary checks to prevent such mix
in patch12 of this series.

Qemu currently does add RO mappings though. The vfio agent in Qemu
simply maps all valid regions in the GPA address space which certainly
includes RO regions e.g. vbios.

In reality we don't know a usage relying on DMA reads from the BIOS
region. Hence finding a way to skip RO regions (e.g. via a discard manager)
in Qemu might be an acceptable tradeoff. The actual change needs more
discussion in Qemu community. For now we just hacked Qemu to test.

Complete code can be found in [3], corresponding QEMU could can be found
in [4].

[1] https://lore.kernel.org/linux-iommu/[email protected]/
[2] https://www.intel.com/content/www/us/en/content-details/772415/content-details.html
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1

Change log:

v5:
- Add Kevin's r-b for patch 2, 3 ,5 8, 10
- Drop enforce_cache_coherency callback from the nested type domain ops (Kevin)
- Remove duplicate agaw check in patch 04 (Kevin)
- Remove duplicate domain_update_iommu_cap() in patch 06 (Kevin)
- Check parent's force_snooping to set pgsnp in the pasid entry (Kevin)
- uapi data structure check (Kevin)
- Simplify the errata handling as user can allocate nested parent domain

v4: https://lore.kernel.org/linux-iommu/[email protected]/
- Remove ascii art tables (Jason)
- Drop EMT (Tina, Jason)
- Drop MTS and related definitions (Kevin)
- Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin)
- Put the vendor specific hwpt alloc data structure before enuma iommu_hwpt_type (Kevin)
- Do not trim the higher page levels of S2 domain in nested domain attachment as the
S2 domain may have been used independently. (Kevin)
- Remove the first-stage pgd check against the maximum address of s2_domain as hw
can check it anyhow. It makes sense to check every pfns used in the stage-1 page
table. But it cannot make it. So just leave it to hw. (Kevin)
- Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin)
- Change the policy of VT-d nesting errata, disallow RO mapping once a domain is used
as parent domain of a nested domain. This removes the nested_users counting. (Kevin)
- Minor fix for "make htmldocs"

v3: https://lore.kernel.org/linux-iommu/[email protected]/
- Further split the patches into an order of adding helpers for nested
domain, iotlb flush, nested domain attachment and nested domain allocation
callback, then report the hw_info to userspace.
- Add batch support in cache invalidation from userspace
- Disallow nested translation usage if RO mappings exists in stage-2 domain
due to errata on readonly mappings on Sapphire Rapids platform.

v2: https://lore.kernel.org/linux-iommu/[email protected]/
- The iommufd infrastructure is split to be separate series.

v1: https://lore.kernel.org/linux-iommu/[email protected]/

Regards,
Yi Liu

Lu Baolu (5):
iommu/vt-d: Extend dmar_domain to support nested domain
iommu/vt-d: Add helper for nested domain allocation
iommu/vt-d: Add helper to setup pasid nested translation
iommu/vt-d: Add nested domain allocation
iommu/vt-d: Disallow read-only mappings to nest parent domain

Yi Liu (6):
iommufd: Add data structure for Intel VT-d stage-1 domain allocation
iommu/vt-d: Make domain attach helpers to be extern
iommu/vt-d: Set the nested domain to a device
iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
iommu/vt-d: Make iotlb flush helpers to be extern
iommu/vt-d: Add iotlb flush for nested domain

drivers/iommu/intel/Makefile | 2 +-
drivers/iommu/intel/iommu.c | 60 +++++++++----
drivers/iommu/intel/iommu.h | 51 +++++++++--
drivers/iommu/intel/nested.c | 162 +++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.c | 125 +++++++++++++++++++++++++++
drivers/iommu/intel/pasid.h | 2 +
include/uapi/linux/iommufd.h | 76 +++++++++++++++-
7 files changed, 452 insertions(+), 26 deletions(-)
create mode 100644 drivers/iommu/intel/nested.c

--
2.34.1


2023-09-22 09:03:01

by Yi Liu

[permalink] [raw]
Subject: [PATCH v5 02/11] iommu/vt-d: Extend dmar_domain to support nested domain

From: Lu Baolu <[email protected]>

The nested domain fields are exclusive to those that used for a DMA
remapping domain. Use union to avoid memory waste.

Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Lu Baolu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
---
drivers/iommu/intel/iommu.h | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index c18fb699c87a..4a7f1cc16afa 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -597,15 +597,38 @@ struct dmar_domain {
struct list_head devices; /* all devices' list */
struct list_head dev_pasids; /* all attached pasids */

- struct dma_pte *pgd; /* virtual address */
- int gaw; /* max guest address width */
-
- /* adjusted guest address width, 0 is level 2 30-bit */
- int agaw;
int iommu_superpage;/* Level of superpages supported:
0 == 4KiB (no superpages), 1 == 2MiB,
2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
- u64 max_addr; /* maximum mapped address */
+ union {
+ /* DMA remapping domain */
+ struct {
+ /* virtual address */
+ struct dma_pte *pgd;
+ /* max guest address width */
+ int gaw;
+ /*
+ * adjusted guest address width:
+ * 0: level 2 30-bit
+ * 1: level 3 39-bit
+ * 2: level 4 48-bit
+ * 3: level 5 57-bit
+ */
+ int agaw;
+ /* maximum mapped address */
+ u64 max_addr;
+ };
+
+ /* Nested user domain */
+ struct {
+ /* parent page table which the user domain is nested on */
+ struct dmar_domain *s2_domain;
+ /* user page table pointer (in GPA) */
+ unsigned long s1_pgtbl;
+ /* page table attributes */
+ struct iommu_hwpt_vtd_s1 s1_cfg;
+ };
+ };

struct iommu_domain domain; /* generic domain data structure for
iommu core */
--
2.34.1

2023-09-22 10:53:32

by Yi Liu

[permalink] [raw]
Subject: [PATCH v5 09/11] iommu/vt-d: Add iotlb flush for nested domain

This implements the .cache_invalidate_user() callback to support iotlb
flush for nested domain.

Signed-off-by: Lu Baolu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
---
drivers/iommu/intel/nested.c | 58 ++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)

diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
index f9c6ade72416..4853fee216d9 100644
--- a/drivers/iommu/intel/nested.c
+++ b/drivers/iommu/intel/nested.c
@@ -68,9 +68,67 @@ static void intel_nested_domain_free(struct iommu_domain *domain)
kfree(to_dmar_domain(domain));
}

+static void domain_flush_iotlb_psi(struct dmar_domain *domain,
+ u64 addr, unsigned long npages)
+{
+ struct iommu_domain_info *info;
+ unsigned long i;
+
+ xa_for_each(&domain->iommu_array, i, info)
+ iommu_flush_iotlb_psi(info->iommu, domain,
+ addr >> VTD_PAGE_SHIFT, npages, 1, 0);
+}
+
+static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
+ struct iommu_user_data_array *array,
+ u32 *cerror_idx)
+{
+ const size_t min_len =
+ offsetofend(struct iommu_hwpt_vtd_s1_invalidate, __reserved);
+ struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+ struct iommu_hwpt_vtd_s1_invalidate inv_info;
+ u32 index;
+ int ret;
+
+ /* REVISIT:
+ * VT-d has defined ITE, ICE, IQE for invalidation failure per hardware,
+ * but no error code yet, so just set the error code to be 0.
+ */
+ *cerror_idx = 0;
+
+ if (array->entry_len < min_len)
+ return -EINVAL;
+
+ for (index = 0; index < array->entry_num; index++) {
+ ret = iommu_copy_user_data_from_array(&inv_info, array, index,
+ sizeof(inv_info), min_len);
+ if (ret) {
+ pr_err_ratelimited("Failed to fetch invalidation request\n");
+ break;
+ }
+
+ if (inv_info.__reserved || (inv_info.flags & ~IOMMU_VTD_QI_FLAGS_LEAF) ||
+ !IS_ALIGNED(inv_info.addr, VTD_PAGE_SIZE)) {
+ ret = -EINVAL;
+ break;
+ }
+
+ if (inv_info.addr == 0 && inv_info.npages == -1)
+ intel_flush_iotlb_all(domain);
+ else
+ domain_flush_iotlb_psi(dmar_domain,
+ inv_info.addr, inv_info.npages);
+ }
+
+ array->entry_num = index;
+
+ return ret;
+}
+
static const struct iommu_domain_ops intel_nested_domain_ops = {
.attach_dev = intel_nested_attach_dev,
.free = intel_nested_domain_free,
+ .cache_invalidate_user = intel_nested_cache_invalidate_user,
};

struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *s2_domain,
--
2.34.1

2023-09-27 10:00:14

by Tian, Kevin

[permalink] [raw]
Subject: RE: [PATCH v5 09/11] iommu/vt-d: Add iotlb flush for nested domain

> From: Liu, Yi L <[email protected]>
> Sent: Thursday, September 21, 2023 3:54 PM
> +
> + /* REVISIT:
> + * VT-d has defined ITE, ICE, IQE for invalidation failure per hardware,
> + * but no error code yet, so just set the error code to be 0.
> + */
> + *cerror_idx = 0;
> +

Is it "hardware doesn't provide error code now though it's defined in
spec" or "intel-iommu driver doesn't retrieve the error code though
it's provided by the hardware"?

Is there guarantee that '0' isn't used for an existing error code or
won't be used for any new error code later?

2023-10-13 12:55:22

by Yi Liu

[permalink] [raw]
Subject: Re: [PATCH v5 09/11] iommu/vt-d: Add iotlb flush for nested domain

On 2023/9/27 14:53, Tian, Kevin wrote:
>> From: Liu, Yi L <[email protected]>
>> Sent: Thursday, September 21, 2023 3:54 PM
>> +
>> + /* REVISIT:
>> + * VT-d has defined ITE, ICE, IQE for invalidation failure per hardware,
>> + * but no error code yet, so just set the error code to be 0.
>> + */
>> + *cerror_idx = 0;
>> +
>
> Is it "hardware doesn't provide error code now though it's defined in
> spec" or "intel-iommu driver doesn't retrieve the error code though
> it's provided by the hardware"?

I didn't see vtd spec defines error code for cache invalidation. :(

>
> Is there guarantee that '0' isn't used for an existing error code or
> won't be used for any new error code later?

may need to check it.

--
Regards,
Yi Liu