2020-04-21 18:48:29

by Jacob Pan

Subject: [PATCH v12 0/8] Nested Shared Virtual Address (SVA) VT-d support

Shared virtual address (SVA), a.k.a. shared virtual memory (SVM) on Intel
platforms, allows address space sharing between device DMA and applications.
SVA can reduce programming complexity and enhance security.
This series is intended to enable SVA virtualization, i.e. to enable use of
SVA within a guest user application.

This is the remaining portion of the original patchset that is based on
Joerg's x86/vt-d branch. The preparatory and cleanup patches are merged here.
(git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git)

Only the IOMMU portion of the changes is included in this series. Additional
support is needed in VFIO and QEMU (to be submitted separately) to complete
this functionality.

To make incremental changes and reduce the size of each patchset, this series
does not include support for page request services.

In the VT-d implementation, the PASID table is per device and maintained in
the host. The guest PASID table is shadowed in the VMM where the virtual
IOMMU is emulated.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                  CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables

This is the remaining VT-d only portion of V5 since the uAPIs and IOASID common
code have been applied to Joerg's IOMMU core branch.
(https://lkml.org/lkml/2019/10/2/833)

The complete set with VFIO patches is available here:
https://github.com/jacobpan/linux.git:siov_sva

The complete nested SVA upstream patches are divided into three phases:
1. Common APIs and PCI device direct assignment
2. Page Request Services (PRS) support
3. Mediated device assignment

With this set and the accompanying VFIO code, we will achieve phase #1.

Thanks,

Jacob

ChangeLog:
- v12
- Fixed IA64 cross-compile error
- Squashed two patches that add macros with their users
- Use rate-limited prints for all user-called APIs
- Check domain nesting attr for vSVA APIs.
- Misc style improvements

- v11 Misc fixes and improvements based on review by Kevin & Eric
- Fixed devTLB granularity conversion
- Simplified VT-d granularity lookup by replacing the 2D map array
with invalid entries.
- Fixed locking in bind guest PASID
- Added nesting domain attr check
- Squashed agaw checking patch with its user
- Use rate-limited error messages for all user-originated calls

- v10
- Addressed Eric's review of v7 and v9. Most fixes are in 3/10 and
6/10: extra condition checks and consolidation of duplicated code.

- v9
- Addressed Baolu's comments for v8 for IOTLB flush consolidation,
bug fixes
- Removed IOASID notifier code which will be submitted separately
to address PASID life cycle management with multiple users.

- v8
- Extracted cleanup patches from V7 and accepted into maintainer's
tree (https://lkml.org/lkml/2019/12/2/514).
- Added IOASID notifier and VT-d handler for termination of PASID
IOMMU context upon free. This will ensure success of the VFIO IOASID
free API regardless of whether the PASID is in use.
(https://lore.kernel.org/linux-iommu/[email protected]/)

- V7
- Respect vIOMMU PASID range in virtual command PASID/IOASID allocator
- Cache virtual command capabilities to avoid runtime checks that
could cause vmexits.

- V6
- Rebased on top of Joerg's core branch
(git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git core)
- Adapt to new uAPIs and IOASID allocators

- V5
Rebased on v5.3-rc4 which has some of the IOMMU fault APIs merged.
Addressed v4 review comments from Eric Auger, Baolu Lu, and
Jonathan Cameron. Specific changes are as follows:
- Refined custom IOASID allocator to support multiple vIOMMUs and hotplug
cases.
- Extracted vendor data from the IOMMU guest PASID bind data; for VT-d
this covers all necessary guest PASID entry fields for PASID
bind.
- Support non-identity host-guest PASID mapping
- Exception handling in various cases

- V4
- Redesigned IOASID allocator such that it can support custom
allocators with shared helper functions. Use separate XArray
to store IOASIDs per allocator. Took advice from Eric Auger to
have default allocator use the generic allocator structure.
Combined into one patch in that the default allocator is just
"another" allocator now. Can be built as a module in case of
driver use without IOMMU.
- Extended bind guest PASID data to support SMMU and non-identity
guest to host PASID mapping https://lkml.org/lkml/2019/5/21/802
- Rebased on Jean's sva/api common tree; new patches start with
[PATCH v4 10/22]

- V3
- Addressed thorough review comments from Eric Auger (Thank you!)
- Moved IOASID allocator from driver core to IOMMU code per
suggestion by Christoph Hellwig
(https://lkml.org/lkml/2019/4/26/462)
- Rebased on top of Jean's SVA API branch and Eric's v7[1]
(git://linux-arm.org/linux-jpb.git sva/api)
- All IOMMU APIs are unmodified (except the new bind guest PASID
call in patch 9/16)

- V2
- Rebased on Joerg's IOMMU x86/vt-d branch v5.1-rc4
- Integrated with Eric Auger's new v7 series for common APIs
(https://github.com/eauger/linux/tree/v5.1-rc3-2stage-v7)
- Addressed review comments from Andy Shevchenko and Alex Williamson on
IOASID custom allocator.
- Support multiple custom IOASID allocators (vIOMMUs) and dynamic
registration.



Jacob Pan (7):
iommu/vt-d: Move domain helper to header
iommu/vt-d: Use a helper function to skip agaw for SL
iommu/vt-d: Add nested translation helper function
iommu/vt-d: Add bind guest PASID support
iommu/vt-d: Support flushing more translation cache types
iommu/vt-d: Add svm/sva invalidate function
iommu/vt-d: Add custom allocator for IOASID

Lu Baolu (1):
iommu/vt-d: Enlightened PASID allocation

drivers/iommu/dmar.c | 40 ++++++
drivers/iommu/intel-iommu.c | 295 +++++++++++++++++++++++++++++++++----
drivers/iommu/intel-pasid.c | 344 ++++++++++++++++++++++++++++++++++++++++++--
drivers/iommu/intel-pasid.h | 23 ++-
drivers/iommu/intel-svm.c | 204 ++++++++++++++++++++++++++
include/linux/intel-iommu.h | 71 ++++++++-
include/linux/intel-svm.h | 17 +++
include/uapi/linux/iommu.h | 5 +
8 files changed, 948 insertions(+), 51 deletions(-)

--
2.7.4


2020-04-21 18:48:35

by Jacob Pan

Subject: [PATCH v12 5/8] iommu/vt-d: Support flushing more translation cache types

When Shared Virtual Memory is exposed to a guest via a vIOMMU, scalable-mode
IOTLB invalidation requests may be passed down from outside the IOMMU
subsystem. This patch adds invalidation functions that can be used for the
additional translation cache types.
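
For illustration only (not part of this patch), a minimal sketch of how the
two new helpers might be driven once a passdown request has been decoded;
the device/domain IDs, PASID and queue depth below are made up:

/*
 * Hedged sketch: sid/pfsid/qdep/did/pasid are placeholders, real
 * callers derive them from device_domain_info and the dmar_domain.
 */
static void example_flush_guest_caches(struct intel_iommu *iommu)
{
	u16 sid = PCI_DEVID(0, PCI_DEVFN(3, 0));   /* hypothetical 00:03.0 */
	u16 pfsid = sid;                           /* PF is its own requester */
	u16 qdep = 16;                             /* hypothetical ATS queue depth */
	u16 did = 1;                               /* hypothetical domain id */
	u32 pasid = 2;                             /* hypothetical PASID */

	/* Flush one 4KB page (size_order 0) from the PASID-based device IOTLB */
	qi_flush_dev_iotlb_pasid(iommu, sid, pfsid, pasid, qdep,
				 0x1000, 0, QI_DEV_IOTLB_GRAN_PASID_SEL);

	/* Drop the cached PASID entry for this domain/PASID pair */
	qi_flush_pasid_cache(iommu, did, QI_PC_PASID_SEL, pasid);
}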

Signed-off-by: Jacob Pan <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
---
drivers/iommu/dmar.c | 39 +++++++++++++++++++++++++++++++++++++++
drivers/iommu/intel-pasid.c | 3 ++-
include/linux/intel-iommu.h | 21 +++++++++++++++++----
3 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index f77dae7ba7d4..a2b64f5f0372 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1421,6 +1421,45 @@ void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
qi_submit_sync(&desc, iommu);
}

+/* PASID-based device IOTLB Invalidate */
+void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+ u32 pasid, u16 qdep, u64 addr,
+ unsigned int size_order, u64 granu)
+{
+ unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size_order - 1);
+ struct qi_desc desc = {.qw2 = 0, .qw3 = 0};
+
+ desc.qw0 = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) |
+ QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
+ QI_DEV_IOTLB_PFSID(pfsid);
+ desc.qw1 = QI_DEV_EIOTLB_GLOB(granu);
+
+ /*
+ * If S bit is 0, we only flush a single page. If S bit is set,
+ * the least significant zero bit indicates the invalidation address
+ * range. VT-d spec 6.5.2.6.
+ * e.g. address bit 12[0] indicates 8KB, 13[0] indicates 16KB.
+ * size order = 0 is PAGE_SIZE 4KB
+ * Max Invs Pending (MIP) is set to 0 for now until we have DIT in
+ * ECAP.
+ */
+ desc.qw1 |= addr & ~mask;
+ if (size_order)
+ desc.qw1 |= QI_DEV_EIOTLB_SIZE;
+
+ qi_submit_sync(&desc, iommu);
+}
+
+void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did,
+ u64 granu, int pasid)
+{
+ struct qi_desc desc = {.qw1 = 0, .qw2 = 0, .qw3 = 0};
+
+ desc.qw0 = QI_PC_PASID(pasid) | QI_PC_DID(did) |
+ QI_PC_GRAN(granu) | QI_PC_TYPE;
+ qi_submit_sync(&desc, iommu);
+}
+
/*
* Disable Queued Invalidation interface.
*/
diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index becd6cc7f608..708bfa08a3d7 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -435,7 +435,8 @@ pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
{
struct qi_desc desc;

- desc.qw0 = QI_PC_DID(did) | QI_PC_PASID_SEL | QI_PC_PASID(pasid);
+ desc.qw0 = QI_PC_DID(did) | QI_PC_GRAN(QI_PC_PASID_SEL) |
+ QI_PC_PASID(pasid) | QI_PC_TYPE;
desc.qw1 = 0;
desc.qw2 = 0;
desc.qw3 = 0;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index c8ce2336f8d8..9a44245c0f7d 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -334,7 +334,7 @@ enum {
#define QI_IOTLB_GRAN(gran) (((u64)gran) >> (DMA_TLB_FLUSH_GRANU_OFFSET-4))
#define QI_IOTLB_ADDR(addr) (((u64)addr) & VTD_PAGE_MASK)
#define QI_IOTLB_IH(ih) (((u64)ih) << 6)
-#define QI_IOTLB_AM(am) (((u8)am))
+#define QI_IOTLB_AM(am) (((u8)am) & 0x3f)

#define QI_CC_FM(fm) (((u64)fm) << 48)
#define QI_CC_SID(sid) (((u64)sid) << 32)
@@ -353,16 +353,21 @@ enum {
#define QI_PC_DID(did) (((u64)did) << 16)
#define QI_PC_GRAN(gran) (((u64)gran) << 4)

-#define QI_PC_ALL_PASIDS (QI_PC_TYPE | QI_PC_GRAN(0))
-#define QI_PC_PASID_SEL (QI_PC_TYPE | QI_PC_GRAN(1))
+/* PASID cache invalidation granu */
+#define QI_PC_ALL_PASIDS 0
+#define QI_PC_PASID_SEL 1

#define QI_EIOTLB_ADDR(addr) ((u64)(addr) & VTD_PAGE_MASK)
#define QI_EIOTLB_IH(ih) (((u64)ih) << 6)
-#define QI_EIOTLB_AM(am) (((u64)am))
+#define QI_EIOTLB_AM(am) (((u64)am) & 0x3f)
#define QI_EIOTLB_PASID(pasid) (((u64)pasid) << 32)
#define QI_EIOTLB_DID(did) (((u64)did) << 16)
#define QI_EIOTLB_GRAN(gran) (((u64)gran) << 4)

+/* QI Dev-IOTLB inv granu */
+#define QI_DEV_IOTLB_GRAN_ALL 1
+#define QI_DEV_IOTLB_GRAN_PASID_SEL 0
+
#define QI_DEV_EIOTLB_ADDR(a) ((u64)(a) & VTD_PAGE_MASK)
#define QI_DEV_EIOTLB_SIZE (((u64)1) << 11)
#define QI_DEV_EIOTLB_GLOB(g) ((u64)g)
@@ -687,8 +692,16 @@ extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
unsigned int size_order, u64 type);
extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
u16 qdep, u64 addr, unsigned mask);
+
void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
unsigned long npages, bool ih);
+
+void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+ u32 pasid, u16 qdep, u64 addr,
+ unsigned int size_order, u64 granu);
+void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu,
+ int pasid);
+
extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);

extern int dmar_ir_support(void);
--
2.7.4

2020-04-21 18:49:02

by Jacob Pan

Subject: [PATCH v12 8/8] iommu/vt-d: Add custom allocator for IOASID

When the VT-d driver runs in the guest, PASID allocation must be
performed via the virtual command interface. This patch registers a
custom IOASID allocator which takes precedence over the default
XArray-based allocator. The resulting IOASID allocation will always
come from the host. This ensures that the PASID namespace is
system-wide.

Virtual command registers are used in the guest only. To avoid vmexit
cost, we cache the capability during initialization.
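
As a rough sketch of the effect (not part of this patch), once the allocator
below is registered, a plain ioasid_alloc() in the guest ends up backed by
the host; the ioasid_alloc()/ioasid_free() signatures are assumed from the
IOASID common code already in the core branch:

/*
 * Hedged sketch: PASID_MIN and intel_pasid_max_id are the driver's
 * range limits; a NULL set is used here purely for illustration.
 */
static int example_guest_pasid_alloc(void)
{
	ioasid_t pasid;

	/* Routed to intel_vcmd_ioasid_alloc() when running in a guest */
	pasid = ioasid_alloc(NULL, PASID_MIN, intel_pasid_max_id - 1, NULL);
	if (pasid == INVALID_IOASID)
		return -ENOSPC;

	/* ... use the PASID, e.g. program it into a PASID table entry ... */

	ioasid_free(pasid);
	return 0;
}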

---
v12 - squashed virtual command register caching patch with its user
- reduce indentation
---

Signed-off-by: Lu Baolu <[email protected]>
Signed-off-by: Liu, Yi L <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
---
drivers/iommu/dmar.c | 1 +
drivers/iommu/intel-iommu.c | 85 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/intel-iommu.h | 7 ++++
3 files changed, 93 insertions(+)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index a2b64f5f0372..d9dc787feef7 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -963,6 +963,7 @@ static int map_iommu(struct intel_iommu *iommu, u64 phys_addr)
warn_invalid_dmar(phys_addr, " returns all ones");
goto unmap;
}
+ iommu->vccap = dmar_readq(iommu->reg + DMAR_VCCAP_REG);

/* the registers might be more than one page */
map_size = max_t(int, ecap_max_iotlb_offset(iommu->ecap),
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 24de233faaf5..771dba33f402 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1732,6 +1732,9 @@ static void free_dmar_iommu(struct intel_iommu *iommu)
if (ecap_prs(iommu->ecap))
intel_svm_finish_prq(iommu);
}
+ if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap))
+ ioasid_unregister_allocator(&iommu->pasid_allocator);
+
#endif
}

@@ -3266,6 +3269,85 @@ static int copy_translation_tables(struct intel_iommu *iommu)
return ret;
}

+#ifdef CONFIG_INTEL_IOMMU_SVM
+static ioasid_t intel_vcmd_ioasid_alloc(ioasid_t min, ioasid_t max, void *data)
+{
+ struct intel_iommu *iommu = data;
+ ioasid_t ioasid;
+
+ if (!iommu)
+ return INVALID_IOASID;
+ /*
+ * VT-d virtual command interface always uses the full 20 bit
+ * PASID range. Host can partition guest PASID range based on
+ * policies but it is out of guest's control.
+ */
+ if (min < PASID_MIN || max > intel_pasid_max_id)
+ return INVALID_IOASID;
+
+ if (vcmd_alloc_pasid(iommu, &ioasid))
+ return INVALID_IOASID;
+
+ return ioasid;
+}
+
+static void intel_vcmd_ioasid_free(ioasid_t ioasid, void *data)
+{
+ struct intel_iommu *iommu = data;
+
+ if (!iommu)
+ return;
+ /*
+ * Sanity check the ioasid owner is done at upper layer, e.g. VFIO
+ * We can only free the PASID when all the devices are unbound.
+ */
+ if (ioasid_find(NULL, ioasid, NULL)) {
+ pr_alert("Cannot free active IOASID %d\n", ioasid);
+ return;
+ }
+ vcmd_free_pasid(iommu, ioasid);
+}
+
+static void register_pasid_allocator(struct intel_iommu *iommu)
+{
+ /*
+ * If we are running in the host, no need for custom allocator
+ * in that PASIDs are allocated from the host system-wide.
+ */
+ if (!cap_caching_mode(iommu->cap))
+ return;
+
+ if (!sm_supported(iommu)) {
+ pr_warn("VT-d Scalable Mode not enabled, no PASID allocation\n");
+ return;
+ }
+
+ /*
+ * Register a custom PASID allocator if we are running in a guest,
+ * guest PASID must be obtained via virtual command interface.
+ * There can be multiple vIOMMUs in each guest but only one allocator
+ * is active. All vIOMMU allocators will eventually be calling the same
+ * host allocator.
+ */
+ if (!ecap_vcs(iommu->ecap) || !vccap_pasid(iommu->vccap))
+ return;
+
+ pr_info("Register custom PASID allocator\n");
+ iommu->pasid_allocator.alloc = intel_vcmd_ioasid_alloc;
+ iommu->pasid_allocator.free = intel_vcmd_ioasid_free;
+ iommu->pasid_allocator.pdata = (void *)iommu;
+ if (ioasid_register_allocator(&iommu->pasid_allocator)) {
+ pr_warn("Custom PASID allocator failed, scalable mode disabled\n");
+ /*
+ * Disable scalable mode on this IOMMU if there
+ * is no custom allocator. Mixing SM capable vIOMMU
+ * and non-SM vIOMMU are not supported.
+ */
+ intel_iommu_sm = 0;
+ }
+}
+#endif
+
static int __init init_dmars(void)
{
struct dmar_drhd_unit *drhd;
@@ -3383,6 +3465,9 @@ static int __init init_dmars(void)
*/
for_each_active_iommu(iommu, drhd) {
iommu_flush_write_buffer(iommu);
+#ifdef CONFIG_INTEL_IOMMU_SVM
+ register_pasid_allocator(iommu);
+#endif
iommu_set_root_entry(iommu);
iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL);
iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 1af16c6d3aec..cfac363bb2c7 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -19,6 +19,7 @@
#include <linux/iommu.h>
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/dmar.h>
+#include <linux/ioasid.h>

#include <asm/cacheflush.h>
#include <asm/iommu.h>
@@ -195,6 +196,9 @@
#define ecap_max_handle_mask(e) ((e >> 20) & 0xf)
#define ecap_sc_support(e) ((e >> 7) & 0x1) /* Snooping Control */

+/* Virtual command interface capability */
+#define vccap_pasid(v) (((v) & DMA_VCS_PAS)) /* PASID allocation */
+
/* IOTLB_REG */
#define DMA_TLB_FLUSH_GRANU_OFFSET 60
#define DMA_TLB_GLOBAL_FLUSH (((u64)1) << 60)
@@ -288,6 +292,7 @@

/* PRS_REG */
#define DMA_PRS_PPR ((u32)1)
+#define DMA_VCS_PAS ((u64)1)

#define IOMMU_WAIT_OP(iommu, offset, op, cond, sts) \
do { \
@@ -563,6 +568,7 @@ struct intel_iommu {
u64 reg_size; /* size of hw register set */
u64 cap;
u64 ecap;
+ u64 vccap;
u32 gcmd; /* Holds TE, EAFL. Don't need SRTP, SFL, WBF */
raw_spinlock_t register_lock; /* protect register handling */
int seq_id; /* sequence id of the iommu */
@@ -583,6 +589,7 @@ struct intel_iommu {
#ifdef CONFIG_INTEL_IOMMU_SVM
struct page_req_dsc *prq;
unsigned char prq_name[16]; /* Name for PRQ interrupt */
+ struct ioasid_allocator_ops pasid_allocator; /* Custom allocator for PASIDs */
#endif
struct q_inval *qi; /* Queued invalidation info */
u32 *iommu_state; /* Store iommu states between suspend and resume.*/
--
2.7.4

2020-04-21 18:49:11

by Jacob Pan

Subject: [PATCH v12 6/8] iommu/vt-d: Add svm/sva invalidate function

When Shared Virtual Address (SVA) is enabled for a guest OS via a
vIOMMU, we need to provide invalidation support at the IOMMU API and
driver level. This patch adds an Intel VT-d specific function to
implement the IOMMU passdown invalidate API for shared virtual address.

The use case is to support cache invalidation for assigned SVM-capable
devices. The emulated IOMMU exposes queued invalidation capability and
passes down all descriptors from the guest to the physical IOMMU.

The assumption is that the guest-to-host device ID mapping is resolved
before calling the IOMMU driver. Based on the device handle, the host
IOMMU driver can replace certain fields before submitting to the
invalidation queue.
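
For illustration (not part of this patch), a minimal sketch of the passdown
payload this function consumes; the PASID and address are placeholders, and
the core entry point iommu_cache_invalidate() is assumed from the
already-merged uAPI series:

/* Hedged sketch: flush one 4KB page of the PASID-tagged IOTLB */
static int example_passdown_iotlb_flush(struct iommu_domain *domain,
					struct device *dev)
{
	struct iommu_cache_invalidate_info inv_info = {
		.version     = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1,
		.cache       = IOMMU_CACHE_INV_TYPE_IOTLB,
		.granularity = IOMMU_INV_GRANU_ADDR,
		.addr_info = {
			.flags        = IOMMU_INV_ADDR_FLAGS_PASID,
			.pasid        = 2,            /* hypothetical */
			.addr         = 0x1000,       /* hypothetical GVA */
			.granule_size = 4096,
			.nb_granules  = 1,
		},
	};

	return iommu_cache_invalidate(domain, dev, &inv_info);
}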

---
v12 - Use rate-limited prints for all user-called APIs.
- Check for domain nesting attr
---
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Ashok Raj <[email protected]>
Signed-off-by: Liu, Yi L <[email protected]>

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel-iommu.c | 175 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 175 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 8862d6b0ef21..24de233faaf5 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5595,6 +5595,180 @@ static void intel_iommu_aux_detach_device(struct iommu_domain *domain,
aux_domain_remove_dev(to_dmar_domain(domain), dev);
}

+/*
+ * 2D array for converting and sanitizing IOMMU generic TLB granularity to
+ * VT-d granularity. Invalidation is typically included in the unmap operation
+ * as a result of DMA or VFIO unmap. However, for assigned devices guest
+ * owns the first level page tables. Invalidations of translation caches in the
+ * guest are trapped and passed down to the host.
+ *
+ * vIOMMU in the guest will only expose first level page tables, therefore
+ * we do not support IOTLB granularity for request without PASID (second level).
+ *
+ * For example, to find the VT-d granularity encoding for IOTLB
+ * type and page selective granularity within PASID:
+ * X: indexed by iommu cache type
+ * Y: indexed by enum iommu_inv_granularity
+ * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
+ */
+
+static const int
+inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
+ /*
+ * PASID based IOTLB invalidation: PASID selective (per PASID),
+ * page selective (address granularity)
+ */
+ {-EINVAL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
+ /* PASID based dev TLBs */
+ {-EINVAL, -EINVAL, QI_DEV_IOTLB_GRAN_PASID_SEL},
+ /* PASID cache */
+ {-EINVAL, -EINVAL, -EINVAL}
+};
+
+static inline int to_vtd_granularity(int type, int granu)
+{
+ return inv_type_granu_table[type][granu];
+}
+
+static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
+{
+ u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
+
+ /* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
+ * IOMMU cache invalidate API passes granu_size in bytes, and number of
+ * granu size in contiguous memory.
+ */
+ return order_base_2(nr_pages);
+}
+
+#ifdef CONFIG_INTEL_IOMMU_SVM
+static int
+intel_iommu_sva_invalidate(struct iommu_domain *domain, struct device *dev,
+ struct iommu_cache_invalidate_info *inv_info)
+{
+ struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+ struct device_domain_info *info;
+ struct intel_iommu *iommu;
+ unsigned long flags;
+ int cache_type;
+ u8 bus, devfn;
+ u16 did, sid;
+ int ret = 0;
+ u64 size = 0;
+
+ if (!inv_info || !dmar_domain ||
+ inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
+ return -EINVAL;
+
+ if (!dev || !dev_is_pci(dev))
+ return -ENODEV;
+
+ iommu = device_to_iommu(dev, &bus, &devfn);
+ if (!iommu)
+ return -ENODEV;
+
+ if (!(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE))
+ return -EINVAL;
+
+ spin_lock_irqsave(&device_domain_lock, flags);
+ spin_lock(&iommu->lock);
+ info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
+ if (!info) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+ did = dmar_domain->iommu_did[iommu->seq_id];
+ sid = PCI_DEVID(bus, devfn);
+
+ /* Size is only valid in non-PASID selective invalidation */
+ if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
+ size = to_vtd_size(inv_info->addr_info.granule_size,
+ inv_info->addr_info.nb_granules);
+
+ for_each_set_bit(cache_type,
+ (unsigned long *)&inv_info->cache,
+ IOMMU_CACHE_INV_TYPE_NR) {
+ int granu = 0;
+ u64 pasid = 0;
+
+ granu = to_vtd_granularity(cache_type, inv_info->granularity);
+ if (granu == -EINVAL) {
+ pr_err_ratelimited("Invalid cache type and granu combination %d/%d\n",
+ cache_type, inv_info->granularity);
+ break;
+ }
+
+ /*
+ * PASID is stored in different locations based on the
+ * granularity.
+ */
+ if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
+ (inv_info->pasid_info.flags & IOMMU_INV_PASID_FLAGS_PASID))
+ pasid = inv_info->pasid_info.pasid;
+ else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
+ (inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID))
+ pasid = inv_info->addr_info.pasid;
+
+ switch (BIT(cache_type)) {
+ case IOMMU_CACHE_INV_TYPE_IOTLB:
+ if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
+ size &&
+ (inv_info->addr_info.addr & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
+ pr_err_ratelimited("Address out of range, 0x%llx, size order %llu\n",
+ inv_info->addr_info.addr, size);
+ ret = -ERANGE;
+ goto out_unlock;
+ }
+
+ /*
+ * If granu is PASID-selective, address is ignored.
+ * We use npages = -1 to indicate that.
+ */
+ qi_flush_piotlb(iommu, did, pasid,
+ mm_to_dma_pfn(inv_info->addr_info.addr),
+ (granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
+ inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_LEAF);
+
+ /*
+ * Always flush device IOTLB if ATS is enabled. vIOMMU
+ * in the guest may assume IOTLB flush is inclusive,
+ * which is more efficient.
+ */
+ if (info->ats_enabled)
+ qi_flush_dev_iotlb_pasid(iommu, sid,
+ info->pfsid, pasid,
+ info->ats_qdep,
+ inv_info->addr_info.addr,
+ size, granu);
+ break;
+ case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
+ if (info->ats_enabled)
+ qi_flush_dev_iotlb_pasid(iommu, sid,
+ info->pfsid, pasid,
+ info->ats_qdep,
+ inv_info->addr_info.addr,
+ size, granu);
+ else
+ pr_warn_ratelimited("Passdown device IOTLB flush w/o ATS!\n");
+ break;
+ case IOMMU_CACHE_INV_TYPE_PASID:
+ qi_flush_pasid_cache(iommu, did, granu,
+ inv_info->pasid_info.pasid);
+ break;
+ default:
+ dev_err_ratelimited(dev, "Unsupported IOMMU invalidation type %d\n",
+ cache_type);
+ ret = -EINVAL;
+ }
+ }
+out_unlock:
+ spin_unlock(&iommu->lock);
+ spin_unlock_irqrestore(&device_domain_lock, flags);
+
+ return ret;
+}
+#endif
+
static int intel_iommu_map(struct iommu_domain *domain,
unsigned long iova, phys_addr_t hpa,
size_t size, int iommu_prot, gfp_t gfp)
@@ -6180,6 +6354,7 @@ const struct iommu_ops intel_iommu_ops = {
.is_attach_deferred = intel_iommu_is_attach_deferred,
.pgsize_bitmap = INTEL_IOMMU_PGSIZES,
#ifdef CONFIG_INTEL_IOMMU_SVM
+ .cache_invalidate = intel_iommu_sva_invalidate,
.sva_bind_gpasid = intel_svm_bind_gpasid,
.sva_unbind_gpasid = intel_svm_unbind_gpasid,
#endif
--
2.7.4

2020-04-21 18:49:32

by Jacob Pan

Subject: [PATCH v12 1/8] iommu/vt-d: Move domain helper to header

Move domain helper to header to be used by SVA code.

Signed-off-by: Jacob Pan <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
---
drivers/iommu/intel-iommu.c | 6 ------
include/linux/intel-iommu.h | 6 ++++++
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index ef0a5246700e..60c31db9ee93 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -446,12 +446,6 @@ static void init_translation_status(struct intel_iommu *iommu)
iommu->flags |= VTD_FLAG_TRANS_PRE_ENABLED;
}

-/* Convert generic 'struct iommu_domain to private struct dmar_domain */
-static struct dmar_domain *to_dmar_domain(struct iommu_domain *dom)
-{
- return container_of(dom, struct dmar_domain, domain);
-}
-
static int __init intel_iommu_setup(char *str)
{
if (!str)
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 980234ae0312..ed7171d2ae1f 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -595,6 +595,12 @@ static inline void __iommu_flush_cache(
clflush_cache_range(addr, size);
}

+/* Convert generic struct iommu_domain to private struct dmar_domain */
+static inline struct dmar_domain *to_dmar_domain(struct iommu_domain *dom)
+{
+ return container_of(dom, struct dmar_domain, domain);
+}
+
/*
* 0: readable
* 1: writable
--
2.7.4

2020-04-21 18:49:37

by Jacob Pan

Subject: [PATCH v12 3/8] iommu/vt-d: Add nested translation helper function

Nested translation mode is supported in the VT-d 3.0 spec, ch. 3.8.
With the PASID-granular translation type set to 011b (nested), the
translation result from the first level (FL) is also subject to a
second level (SL) page table translation. This mode is used for SVA
virtualization, where the FL performs guest virtual to guest physical
translation and the SL performs guest physical to host physical
translation.

This patch adds a helper function for setting up nested translation,
where the second level comes from a domain and the first level comes
from a guest PGD.
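
For illustration (not part of this patch), a minimal sketch of how a bind
path might drive the new helper; the real caller is added in patch 4/8, and
the bind data here simply mirrors the uAPI fields:

/* Hedged sketch: data->gpgd is the guest FL PGD in GPA, dmar_domain supplies SL */
static int example_setup_nested(struct intel_iommu *iommu, struct device *dev,
				struct dmar_domain *dmar_domain,
				struct iommu_gpasid_bind_data *data)
{
	int ret;

	spin_lock(&iommu->lock);
	ret = intel_pasid_setup_nested(iommu, dev,
				       (pgd_t *)data->gpgd,    /* FL root in GPA */
				       data->hpasid,           /* host PASID */
				       &data->vtd,             /* guest PASID entry bits */
				       dmar_domain,            /* provides SL tables */
				       data->addr_width);      /* guest VA width, 48 or 57 */
	spin_unlock(&iommu->lock);

	return ret;
}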

Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Liu, Yi L <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
---
drivers/iommu/intel-iommu.c | 25 -----
drivers/iommu/intel-pasid.c | 251 +++++++++++++++++++++++++++++++++++++++++++-
drivers/iommu/intel-pasid.h | 10 ++
include/linux/intel-iommu.h | 28 +++++
include/uapi/linux/iommu.h | 5 +
5 files changed, 291 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 60c31db9ee93..9c01e391a931 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -296,31 +296,6 @@ static inline void context_clear_entry(struct context_entry *context)
static struct dmar_domain *si_domain;
static int hw_pass_through = 1;

-/* si_domain contains mulitple devices */
-#define DOMAIN_FLAG_STATIC_IDENTITY BIT(0)
-
-/*
- * This is a DMA domain allocated through the iommu domain allocation
- * interface. But one or more devices belonging to this domain have
- * been chosen to use a private domain. We should avoid to use the
- * map/unmap/iova_to_phys APIs on it.
- */
-#define DOMAIN_FLAG_LOSE_CHILDREN BIT(1)
-
-/*
- * When VT-d works in the scalable mode, it allows DMA translation to
- * happen through either first level or second level page table. This
- * bit marks that the DMA translation for the domain goes through the
- * first level page table, otherwise, it goes through the second level.
- */
-#define DOMAIN_FLAG_USE_FIRST_LEVEL BIT(2)
-
-/*
- * Domain represents a virtual machine which demands iommu nested
- * translation mode support.
- */
-#define DOMAIN_FLAG_NESTING_MODE BIT(3)
-
#define for_each_domain_iommu(idx, domain) \
for (idx = 0; idx < g_num_of_iommus; idx++) \
if (domain->iommu_refcnt[idx])
diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index d9cea3011b58..becd6cc7f608 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -359,6 +359,76 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value)
pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
}

+/*
+ * Setup the Extended Memory Type(EMT) field (Bits 91-93)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_emt(struct pasid_entry *pe, u64 value)
+{
+ pasid_set_bits(&pe->val[1], GENMASK_ULL(29, 27), value << 27);
+}
+
+/*
+ * Setup the Page Attribute Table (PAT) field (Bits 96-127)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_pat(struct pasid_entry *pe, u64 value)
+{
+ pasid_set_bits(&pe->val[1], GENMASK_ULL(63, 32), value << 32);
+}
+
+/*
+ * Setup the Cache Disable (CD) field (Bit 89)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_cd(struct pasid_entry *pe)
+{
+ pasid_set_bits(&pe->val[1], 1 << 25, 1 << 25);
+}
+
+/*
+ * Setup the Extended Memory Type Enable (EMTE) field (Bit 90)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_emte(struct pasid_entry *pe)
+{
+ pasid_set_bits(&pe->val[1], 1 << 26, 1 << 26);
+}
+
+/*
+ * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_eafe(struct pasid_entry *pe)
+{
+ pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
+}
+
+/*
+ * Setup the Page-level Cache Disable (PCD) field (Bit 95)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_pcd(struct pasid_entry *pe)
+{
+ pasid_set_bits(&pe->val[1], 1 << 31, 1 << 31);
+}
+
+/*
+ * Setup the Page-level Write-Through (PWT) field (Bit 94)
+ * of a scalable mode PASID entry.
+ */
+static inline void
+pasid_set_pwt(struct pasid_entry *pe)
+{
+ pasid_set_bits(&pe->val[1], 1 << 30, 1 << 30);
+}
+
static void
pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
u16 did, int pasid)
@@ -492,7 +562,7 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));

/* Setup Present and PASID Granular Transfer Type: */
- pasid_set_translation_type(pte, 1);
+ pasid_set_translation_type(pte, PASID_ENTRY_PGTT_FL_ONLY);
pasid_set_present(pte);
pasid_flush_caches(iommu, pte, pasid, did);

@@ -561,7 +631,7 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
pasid_set_domain_id(pte, did);
pasid_set_slptr(pte, pgd_val);
pasid_set_address_width(pte, agaw);
- pasid_set_translation_type(pte, 2);
+ pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
pasid_set_fault_enable(pte);
pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));

@@ -595,7 +665,7 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
pasid_clear_entry(pte);
pasid_set_domain_id(pte, did);
pasid_set_address_width(pte, iommu->agaw);
- pasid_set_translation_type(pte, 4);
+ pasid_set_translation_type(pte, PASID_ENTRY_PGTT_PT);
pasid_set_fault_enable(pte);
pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));

@@ -609,3 +679,178 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,

return 0;
}
+
+static int
+intel_pasid_setup_bind_data(struct intel_iommu *iommu, struct pasid_entry *pte,
+ struct iommu_gpasid_bind_data_vtd *pasid_data)
+{
+ /*
+ * Not all guest PASID table entry fields are passed down during bind,
+ * here we only set up the ones that are dependent on guest settings.
+ * Execution related bits such as NXE, SMEP are not meaningful to IOMMU,
+ * therefore not set. Other fields, such as snoop related, are set based
+ * on host needs regardless of guest settings.
+ */
+ if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_SRE) {
+ if (!ecap_srs(iommu->ecap)) {
+ pr_err_ratelimited("No supervisor request support on %s\n",
+ iommu->name);
+ return -EINVAL;
+ }
+ pasid_set_sre(pte);
+ }
+
+ if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EAFE) {
+ if (!ecap_eafs(iommu->ecap)) {
+ pr_err_ratelimited("No extended access flag support on %s\n",
+ iommu->name);
+ return -EINVAL;
+ }
+ pasid_set_eafe(pte);
+ }
+
+ /*
+ * Memory type is only applicable to devices inside processor coherent
+ * domain. PCIe devices are not included. We can skip the rest of the
+ * flags if IOMMU does not support MTS.
+ */
+ if (!(pasid_data->flags & IOMMU_SVA_VTD_GPASID_MTS_MASK))
+ return 0;
+
+ if (!ecap_mts(iommu->ecap)) {
+ pr_err_ratelimited("No memory type support for bind guest PASID on %s\n",
+ iommu->name);
+ return -EINVAL;
+ }
+
+ if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EMTE) {
+ pasid_set_emte(pte);
+ pasid_set_emt(pte, pasid_data->emt);
+ }
+ if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PCD)
+ pasid_set_pcd(pte);
+ if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_PWT)
+ pasid_set_pwt(pte);
+ if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_CD)
+ pasid_set_cd(pte);
+ pasid_set_pat(pte, pasid_data->pat);
+
+ return 0;
+}
+
+/**
+ * intel_pasid_setup_nested() - Set up PASID entry for nested translation.
+ * This could be used for guest shared virtual address. In this case, the
+ * first level page tables are used for GVA-GPA translation in the guest,
+ * second level page tables are used for GPA-HPA translation.
+ *
+ * @iommu: IOMMU which the device belong to
+ * @dev: Device to be set up for translation
+ * @gpgd: FLPTPTR: First Level Page translation pointer in GPA
+ * @pasid: PASID to be programmed in the device PASID table
+ * @pasid_data: Additional PASID info from the guest bind request
+ * @domain: Domain info for setting up second level page tables
+ * @addr_width: Address width of the first level (guest)
+ */
+int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev,
+ pgd_t *gpgd, int pasid,
+ struct iommu_gpasid_bind_data_vtd *pasid_data,
+ struct dmar_domain *domain, int addr_width)
+{
+ struct pasid_entry *pte;
+ struct dma_pte *pgd;
+ int ret = 0;
+ u64 pgd_val;
+ int agaw;
+ u16 did;
+
+ if (!ecap_nest(iommu->ecap)) {
+ pr_err_ratelimited("IOMMU: %s: No nested translation support\n",
+ iommu->name);
+ return -EINVAL;
+ }
+
+ if (!(domain->flags & DOMAIN_FLAG_NESTING_MODE)) {
+ pr_err_ratelimited("Domain is not in nesting mode, %x\n",
+ domain->flags);
+ return -EINVAL;
+ }
+
+ pte = intel_pasid_get_entry(dev, pasid);
+ if (WARN_ON(!pte))
+ return -EINVAL;
+
+ /*
+ * Caller must ensure PASID entry is not in use, i.e. not bind the
+ * same PASID to the same device twice.
+ */
+ if (pasid_pte_is_present(pte))
+ return -EBUSY;
+
+ pasid_clear_entry(pte);
+
+ /* Sanity checking performed by caller to make sure address
+ * width matching in two dimensions:
+ * 1. CPU vs. IOMMU
+ * 2. Guest vs. Host.
+ */
+ switch (addr_width) {
+#ifdef CONFIG_X86
+ case ADDR_WIDTH_5LEVEL:
+ if (cpu_feature_enabled(X86_FEATURE_LA57) &&
+ cap_5lp_support(iommu->cap)) {
+ pasid_set_flpm(pte, 1);
+ } else {
+ dev_err_ratelimited(dev, "5-level paging not supported\n");
+ return -EINVAL;
+ }
+ break;
+#endif
+ case ADDR_WIDTH_4LEVEL:
+ pasid_set_flpm(pte, 0);
+ break;
+ default:
+ dev_err_ratelimited(dev, "Invalid guest address width %d\n",
+ addr_width);
+ return -EINVAL;
+ }
+
+ /* First level PGD is in GPA, must be supported by the second level */
+ if ((unsigned long long)gpgd > domain->max_addr) {
+ dev_err_ratelimited(dev,
+ "Guest PGD %llx not supported, max %llx\n",
+ (unsigned long long)gpgd, domain->max_addr);
+ return -EINVAL;
+ }
+ pasid_set_flptr(pte, (u64)gpgd);
+
+ ret = intel_pasid_setup_bind_data(iommu, pte, pasid_data);
+ if (ret) {
+ dev_err_ratelimited(dev, "Guest PASID bind data not supported\n");
+ return ret;
+ }
+
+ /* Setup the second level based on the given domain */
+ pgd = domain->pgd;
+
+ agaw = iommu_skip_agaw(domain, iommu, &pgd);
+ if (agaw < 0) {
+ dev_err_ratelimited(dev, "Invalid domain page table\n");
+ return -EINVAL;
+ }
+ pgd_val = virt_to_phys(pgd);
+ pasid_set_slptr(pte, pgd_val);
+ pasid_set_fault_enable(pte);
+
+ did = domain->iommu_did[iommu->seq_id];
+ pasid_set_domain_id(pte, did);
+
+ pasid_set_address_width(pte, agaw);
+ pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
+
+ pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
+ pasid_set_present(pte);
+ pasid_flush_caches(iommu, pte, pasid, did);
+
+ return ret;
+}
diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
index 92de6df24ccb..ccd50c2ae75c 100644
--- a/drivers/iommu/intel-pasid.h
+++ b/drivers/iommu/intel-pasid.h
@@ -36,6 +36,7 @@
* to vmalloc or even module mappings.
*/
#define PASID_FLAG_SUPERVISOR_MODE BIT(0)
+#define PASID_FLAG_NESTED BIT(1)

/*
* The PASID_FLAG_FL5LP flag Indicates using 5-level paging for first-
@@ -51,6 +52,11 @@ struct pasid_entry {
u64 val[8];
};

+#define PASID_ENTRY_PGTT_FL_ONLY (1)
+#define PASID_ENTRY_PGTT_SL_ONLY (2)
+#define PASID_ENTRY_PGTT_NESTED (3)
+#define PASID_ENTRY_PGTT_PT (4)
+
/* The representative of a PASID table */
struct pasid_table {
void *table; /* pasid table pointer */
@@ -99,6 +105,10 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
struct dmar_domain *domain,
struct device *dev, int pasid);
+int intel_pasid_setup_nested(struct intel_iommu *iommu,
+ struct device *dev, pgd_t *pgd, int pasid,
+ struct iommu_gpasid_bind_data_vtd *pasid_data,
+ struct dmar_domain *domain, int addr_width);
void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
struct device *dev, int pasid);

diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index ed7171d2ae1f..6da03f627ba3 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -42,6 +42,9 @@
#define DMA_FL_PTE_PRESENT BIT_ULL(0)
#define DMA_FL_PTE_XD BIT_ULL(63)

+#define ADDR_WIDTH_5LEVEL (57)
+#define ADDR_WIDTH_4LEVEL (48)
+
#define CONTEXT_TT_MULTI_LEVEL 0
#define CONTEXT_TT_DEV_IOTLB 1
#define CONTEXT_TT_PASS_THROUGH 2
@@ -480,6 +483,31 @@ struct context_entry {
u64 hi;
};

+/* si_domain contains mulitple devices */
+#define DOMAIN_FLAG_STATIC_IDENTITY BIT(0)
+
+/*
+ * This is a DMA domain allocated through the iommu domain allocation
+ * interface. But one or more devices belonging to this domain have
+ * been chosen to use a private domain. We should avoid to use the
+ * map/unmap/iova_to_phys APIs on it.
+ */
+#define DOMAIN_FLAG_LOSE_CHILDREN BIT(1)
+
+/*
+ * When VT-d works in the scalable mode, it allows DMA translation to
+ * happen through either first level or second level page table. This
+ * bit marks that the DMA translation for the domain goes through the
+ * first level page table, otherwise, it goes through the second level.
+ */
+#define DOMAIN_FLAG_USE_FIRST_LEVEL BIT(2)
+
+/*
+ * Domain represents a virtual machine which demands iommu nested
+ * translation mode support.
+ */
+#define DOMAIN_FLAG_NESTING_MODE BIT(3)
+
struct dmar_domain {
int nid; /* node id */

diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 4ad3496e5c43..e907b7091a46 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -285,6 +285,11 @@ struct iommu_gpasid_bind_data_vtd {
__u32 emt;
};

+#define IOMMU_SVA_VTD_GPASID_MTS_MASK (IOMMU_SVA_VTD_GPASID_CD | \
+ IOMMU_SVA_VTD_GPASID_EMTE | \
+ IOMMU_SVA_VTD_GPASID_PCD | \
+ IOMMU_SVA_VTD_GPASID_PWT)
+
/**
* struct iommu_gpasid_bind_data - Information about device and guest PASID binding
* @version: Version of this data structure
--
2.7.4

2020-04-21 18:50:29

by Jacob Pan

Subject: [PATCH v12 4/8] iommu/vt-d: Add bind guest PASID support

When supporting guest SVA with emulated IOMMU, the guest PASID
table is shadowed in VMM. Updates to guest vIOMMU PASID table
will result in PASID cache flush which will be passed down to
the host as bind guest PASID calls.

For the SL page tables, it will be harvested from device's
default domain (request w/o PASID), or aux domain in case of
mediated device.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                  CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables
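
For illustration (not part of this patch), a minimal sketch of the bind
request as it would arrive from the VMM via VFIO; all values are
placeholders, only the field and flag names come from the already-merged
uAPI:

/* Hedged sketch: bind a guest PASID/PGD to a device under a nesting domain */
static int example_bind_guest_pasid(struct iommu_domain *domain,
				    struct device *dev)
{
	struct iommu_gpasid_bind_data data = {
		.version    = IOMMU_GPASID_BIND_VERSION_1,
		.format     = IOMMU_PASID_FORMAT_INTEL_VTD,
		.flags      = IOMMU_SVA_GPASID_VAL,  /* guest PASID is provided */
		.gpgd       = 0x12345000,            /* guest CR3/FL PGD in GPA */
		.hpasid     = 2,                     /* host PASID from the allocator */
		.gpasid     = 5,                     /* guest-visible PASID */
		.addr_width = 48,                    /* 4-level guest paging */
	};

	return intel_svm_bind_gpasid(domain, dev, &data);
}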

Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Liu, Yi L <[email protected]>
---
drivers/iommu/intel-iommu.c | 4 +
drivers/iommu/intel-svm.c | 204 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/intel-iommu.h | 8 +-
include/linux/intel-svm.h | 17 ++++
4 files changed, 232 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 9c01e391a931..8862d6b0ef21 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -6179,6 +6179,10 @@ const struct iommu_ops intel_iommu_ops = {
.dev_disable_feat = intel_iommu_dev_disable_feat,
.is_attach_deferred = intel_iommu_is_attach_deferred,
.pgsize_bitmap = INTEL_IOMMU_PGSIZES,
+#ifdef CONFIG_INTEL_IOMMU_SVM
+ .sva_bind_gpasid = intel_svm_bind_gpasid,
+ .sva_unbind_gpasid = intel_svm_unbind_gpasid,
+#endif
};

static void quirk_iommu_igfx(struct pci_dev *dev)
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index 2998418f0a38..69b2070b843d 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -226,6 +226,210 @@ static LIST_HEAD(global_svm_list);
list_for_each_entry((sdev), &(svm)->devs, list) \
if ((d) != (sdev)->dev) {} else

+static inline void intel_svm_free_if_empty(struct intel_svm *svm, u64 pasid)
+{
+ if (list_empty(&svm->devs)) {
+ ioasid_set_data(pasid, NULL);
+ kfree(svm);
+ }
+}
+
+int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
+ struct iommu_gpasid_bind_data *data)
+{
+ struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
+ struct dmar_domain *dmar_domain;
+ struct intel_svm_dev *sdev;
+ struct intel_svm *svm;
+ int ret = 0;
+
+ if (WARN_ON(!iommu) || !data)
+ return -EINVAL;
+
+ if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
+ data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
+ return -EINVAL;
+
+ if (dev_is_pci(dev)) {
+ /* VT-d supports devices with full 20 bit PASIDs only */
+ if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
+ return -EINVAL;
+ } else {
+ return -ENOTSUPP;
+ }
+
+ /*
+ * We only check host PASID range, we have no knowledge to check
+ * guest PASID range.
+ */
+ if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
+ return -EINVAL;
+
+ dmar_domain = to_dmar_domain(domain);
+
+ mutex_lock(&pasid_mutex);
+ svm = ioasid_find(NULL, data->hpasid, NULL);
+ if (IS_ERR(svm)) {
+ ret = PTR_ERR(svm);
+ goto out;
+ }
+
+ if (svm) {
+ /*
+ * If we found svm for the PASID, there must be at
+ * least one device bond, otherwise svm should be freed.
+ */
+ if (WARN_ON(list_empty(&svm->devs))) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ for_each_svm_dev(sdev, svm, dev) {
+ /* In case of multiple sub-devices of the same pdev
+ * assigned, we should allow multiple bind calls with
+ * the same PASID and pdev.
+ */
+ sdev->users++;
+ goto out;
+ }
+ } else {
+ /* We come here when PASID has never been bond to a device. */
+ svm = kzalloc(sizeof(*svm), GFP_KERNEL);
+ if (!svm) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ /* REVISIT: upper layer/VFIO can track host process that bind
+ * the PASID. ioasid_set = mm might be sufficient for vfio to
+ * check pasid VMM ownership. We can drop the following line
+ * once VFIO and IOASID set check is in place.
+ */
+ svm->mm = get_task_mm(current);
+ svm->pasid = data->hpasid;
+ if (data->flags & IOMMU_SVA_GPASID_VAL) {
+ svm->gpasid = data->gpasid;
+ svm->flags |= SVM_FLAG_GUEST_PASID;
+ }
+ ioasid_set_data(data->hpasid, svm);
+ INIT_LIST_HEAD_RCU(&svm->devs);
+ mmput(svm->mm);
+ }
+ sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
+ if (!sdev) {
+ /*
+ * If this is a new PASID that never bond to a device, then
+ * the device list must be empty which indicates struct svm
+ * was allocated in this function.
+ */
+ intel_svm_free_if_empty(svm, data->hpasid);
+ ret = -ENOMEM;
+ goto out;
+ }
+ sdev->dev = dev;
+ sdev->users = 1;
+
+ /* Set up device context entry for PASID if not enabled already */
+ ret = intel_iommu_enable_pasid(iommu, sdev->dev);
+ if (ret) {
+ dev_err_ratelimited(dev, "Failed to enable PASID capability\n");
+ kfree(sdev);
+ intel_svm_free_if_empty(svm, data->hpasid);
+ goto out;
+ }
+
+ /*
+ * PASID table is per device for better security. Therefore, for
+ * each bind of a new device even with an existing PASID, we need to
+ * call the nested mode setup function here.
+ */
+ spin_lock(&iommu->lock);
+ ret = intel_pasid_setup_nested(iommu,
+ dev,
+ (pgd_t *)data->gpgd,
+ data->hpasid,
+ &data->vtd,
+ dmar_domain,
+ data->addr_width);
+ if (ret) {
+ dev_err_ratelimited(dev, "Failed to set up PASID %llu in nested mode, Err %d\n",
+ data->hpasid, ret);
+ /*
+ * PASID entry should be in cleared state if nested mode
+ * set up failed. So we only need to clear IOASID tracking
+ * data such that free call will succeed.
+ */
+ kfree(sdev);
+ intel_svm_free_if_empty(svm, data->hpasid);
+ spin_unlock(&iommu->lock);
+ goto out;
+ }
+ spin_unlock(&iommu->lock);
+ svm->flags |= SVM_FLAG_GUEST_MODE;
+
+ init_rcu_head(&sdev->rcu);
+ list_add_rcu(&sdev->list, &svm->devs);
+ out:
+ mutex_unlock(&pasid_mutex);
+ return ret;
+}
+
+int intel_svm_unbind_gpasid(struct device *dev, int pasid)
+{
+ struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
+ struct intel_svm_dev *sdev;
+ struct intel_svm *svm;
+ int ret = -EINVAL;
+
+ if (WARN_ON(!iommu))
+ return -EINVAL;
+
+ mutex_lock(&pasid_mutex);
+ svm = ioasid_find(NULL, pasid, NULL);
+ if (!svm) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (IS_ERR(svm)) {
+ ret = PTR_ERR(svm);
+ goto out;
+ }
+
+ for_each_svm_dev(sdev, svm, dev) {
+ ret = 0;
+ sdev->users--;
+ if (!sdev->users) {
+ list_del_rcu(&sdev->list);
+ intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
+ intel_flush_svm_range_dev(svm, sdev, 0, -1, 0);
+ /* TODO: Drain in flight PRQ for the PASID since it
+ * may get reused soon, we don't want to
+ * confuse with its previous life.
+ * intel_svm_drain_prq(dev, pasid);
+ */
+ kfree_rcu(sdev, rcu);
+
+ if (list_empty(&svm->devs)) {
+ /*
+ * We do not free the IOASID here in that
+ * IOMMU driver did not allocate it.
+ * Unlike native SVM, IOASID for guest use was
+ * allocated prior to the bind call.
+ * In any case, if the free call comes before
+ * the unbind, IOMMU driver will get notified
+ * and perform cleanup.
+ */
+ ioasid_set_data(pasid, NULL);
+ kfree(svm);
+ }
+ }
+ break;
+ }
+out:
+ mutex_unlock(&pasid_mutex);
+ return ret;
+}
+
int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
{
struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 6da03f627ba3..c8ce2336f8d8 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -706,7 +706,9 @@ struct dmar_domain *find_domain(struct device *dev);
extern void intel_svm_check(struct intel_iommu *iommu);
extern int intel_svm_enable_prq(struct intel_iommu *iommu);
extern int intel_svm_finish_prq(struct intel_iommu *iommu);
-
+int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
+ struct iommu_gpasid_bind_data *data);
+int intel_svm_unbind_gpasid(struct device *dev, int pasid);
struct svm_dev_ops;

struct intel_svm_dev {
@@ -723,9 +725,13 @@ struct intel_svm_dev {
struct intel_svm {
struct mmu_notifier notifier;
struct mm_struct *mm;
+
struct intel_iommu *iommu;
int flags;
int pasid;
+ int gpasid; /* Guest PASID in case of vSVA bind with non-identity host
+ * to guest PASID mapping.
+ */
struct list_head devs;
struct list_head list;
};
diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
index d7c403d0dd27..c19690937540 100644
--- a/include/linux/intel-svm.h
+++ b/include/linux/intel-svm.h
@@ -44,6 +44,23 @@ struct svm_dev_ops {
* do such IOTLB flushes automatically.
*/
#define SVM_FLAG_SUPERVISOR_MODE (1<<1)
+/*
+ * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind to a device.
+ * In this case the mm_struct is in the guest kernel or userspace, its life
+ * cycle is managed by VMM and VFIO layer. For IOMMU driver, this API provides
+ * means to bind/unbind guest CR3 with PASIDs allocated for a device.
+ */
+#define SVM_FLAG_GUEST_MODE (1<<2)
+/*
+ * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own PASID space,
+ * which requires guest and host PASID translation at both directions. We keep
+ * track of guest PASID in order to provide lookup service to device drivers.
+ * One such example is a physical function (PF) driver that supports mediated
+ * device (mdev) assignment. Guest programming of mdev configuration space can
+ * only be done with guest PASID, therefore PF driver needs to find the matching
+ * host PASID to program the real hardware.
+ */
+#define SVM_FLAG_GUEST_PASID (1<<3)

#ifdef CONFIG_INTEL_IOMMU_SVM

--
2.7.4

2020-04-21 18:50:30

by Jacob Pan

Subject: [PATCH v12 2/8] iommu/vt-d: Use a helper function to skip agaw for SL

An Intel iommu domain uses a 5-level page table by default. If the
iommu that the domain tries to attach to supports fewer page levels,
the top level page tables should be skipped. Add a helper to do
this so that it can be used in other places.
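
For illustration (not part of this patch), a minimal sketch of the helper in
use; with a hypothetical domain built with a 5-level table (domain->agaw == 3)
attached to an iommu that only walks 4 levels (iommu->agaw == 2), it descends
one level and returns agaw 2:

/* Hedged sketch: mirrors the caller in intel_pasid_setup_second_level() */
static int example_skip_agaw(struct dmar_domain *domain,
			     struct intel_iommu *iommu)
{
	struct dma_pte *pgd = domain->pgd;
	int agaw = iommu_skip_agaw(domain, iommu, &pgd);

	if (agaw < 0)
		return -EINVAL;  /* an intermediate level was not present */

	/* pgd now points at the level this iommu can actually walk */
	return agaw;
}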

Signed-off-by: Jacob Pan <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
---
drivers/iommu/intel-pasid.c | 33 +++++++++++++++++++++++----------
1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index 22b30f10b396..d9cea3011b58 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -500,6 +500,25 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
}

/*
+ * Skip top levels of page tables for iommu which has less agaw
+ * than default. Unnecessary for PT mode.
+ */
+static inline int iommu_skip_agaw(struct dmar_domain *domain,
+ struct intel_iommu *iommu,
+ struct dma_pte **pgd)
+{
+ int agaw;
+
+ for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
+ *pgd = phys_to_virt(dma_pte_addr(*pgd));
+ if (!dma_pte_present(*pgd))
+ return -EINVAL;
+ }
+
+ return agaw;
+}
+
+/*
* Set up the scalable mode pasid entry for second only translation type.
*/
int intel_pasid_setup_second_level(struct intel_iommu *iommu,
@@ -522,17 +541,11 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
return -EINVAL;
}

- /*
- * Skip top levels of page tables for iommu which has less agaw
- * than default. Unnecessary for PT mode.
- */
pgd = domain->pgd;
- for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
- pgd = phys_to_virt(dma_pte_addr(pgd));
- if (!dma_pte_present(pgd)) {
- dev_err(dev, "Invalid domain page table\n");
- return -EINVAL;
- }
+ agaw = iommu_skip_agaw(domain, iommu, &pgd);
+ if (agaw < 0) {
+ dev_err(dev, "Invalid domain page table\n");
+ return -EINVAL;
}

pgd_val = virt_to_phys(pgd);
--
2.7.4

2020-04-23 07:32:05

by Tian, Kevin

Subject: RE: [PATCH v12 2/8] iommu/vt-d: Use a helper function to skip agaw for SL

> From: Jacob Pan <[email protected]>
> Sent: Wednesday, April 22, 2020 2:53 AM
>
> An Intel iommu domain uses a 5-level page table by default. If the
> iommu that the domain tries to attach to supports fewer page levels,
> the top level page tables should be skipped. Add a helper to do
> this so that it can be used in other places.
>
> Signed-off-by: Jacob Pan <[email protected]>
> Reviewed-by: Eric Auger <[email protected]>
> ---
> drivers/iommu/intel-pasid.c | 33 +++++++++++++++++++++++----------
> 1 file changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 22b30f10b396..d9cea3011b58 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -500,6 +500,25 @@ int intel_pasid_setup_first_level(struct intel_iommu
> *iommu,
> }
>
> /*
> + * Skip top levels of page tables for iommu which has less agaw
> + * than default. Unnecessary for PT mode.
> + */
> +static inline int iommu_skip_agaw(struct dmar_domain *domain,
> + struct intel_iommu *iommu,
> + struct dma_pte **pgd)
> +{
> + int agaw;
> +
> + for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
> + *pgd = phys_to_virt(dma_pte_addr(*pgd));
> + if (!dma_pte_present(*pgd))
> + return -EINVAL;
> + }
> +
> + return agaw;
> +}
> +
> +/*
> * Set up the scalable mode pasid entry for second only translation type.
> */
> int intel_pasid_setup_second_level(struct intel_iommu *iommu,
> @@ -522,17 +541,11 @@ int intel_pasid_setup_second_level(struct
> intel_iommu *iommu,
> return -EINVAL;
> }
>
> - /*
> - * Skip top levels of page tables for iommu which has less agaw
> - * than default. Unnecessary for PT mode.
> - */
> pgd = domain->pgd;
> - for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
> - pgd = phys_to_virt(dma_pte_addr(pgd));
> - if (!dma_pte_present(pgd)) {
> - dev_err(dev, "Invalid domain page table\n");
> - return -EINVAL;
> - }
> + agaw = iommu_skip_agaw(domain, iommu, &pgd);
> + if (agaw < 0) {
> + dev_err(dev, "Invalid domain page table\n");
> + return -EINVAL;
> }
>
> pgd_val = virt_to_phys(pgd);
> --
> 2.7.4

Reviewed-by: Kevin Tian <[email protected]>

2020-04-24 10:50:22

by Tian, Kevin

Subject: RE: [PATCH v12 4/8] iommu/vt-d: Add bind guest PASID support

> From: Jacob Pan <[email protected]>
> Sent: Wednesday, April 22, 2020 2:53 AM
>
> When supporting guest SVA with emulated IOMMU, the guest PASID
> table is shadowed in VMM. Updates to guest vIOMMU PASID table
> will result in PASID cache flush which will be passed down to
> the host as bind guest PASID calls.

Above description is not accurate. Guest PASID table updates don't
'result in' PASID cache flush automatically. What about:
--
The guest needs to invalidate the PASID cache for any update to
guest PASID table. Those invalidation requests are intercepted
by the VMM and passed down to the host as binding guest PASID
calls.
--
>
> For the SL page tables, it will be harvested from device's
> default domain (request w/o PASID), or aux domain in case of
> mediated device.
>
>     .-------------.  .---------------------------.
>     |   vIOMMU    |  | Guest process CR3, FL only|
>     |             |  '---------------------------'
>     .----------------/
>     | PASID Entry |--- PASID cache flush -
>     '-------------'                       |
>     |             |                       V
>     |             |                  CR3 in GPA
>     '-------------'
> Guest
> ------| Shadow |--------------------------|--------
>       v        v                          v
> Host
>     .-------------.  .----------------------.
>     |   pIOMMU    |  | Bind FL for GVA-GPA  |
>     |             |  '----------------------'
>     .----------------/  |
>     | PASID Entry |     V (Nested xlate)
>     '----------------\.------------------------------.
>     |             |   |SL for GPA-HPA, default domain|
>     |             |   '------------------------------'
>     '-------------'
> Where:
> - FL = First level/stage one page tables
> - SL = Second level/stage two page tables
>
> Signed-off-by: Jacob Pan <[email protected]>
> Signed-off-by: Liu, Yi L <[email protected]>
> ---
> drivers/iommu/intel-iommu.c | 4 +
> drivers/iommu/intel-svm.c | 204
> ++++++++++++++++++++++++++++++++++++++++++++
> include/linux/intel-iommu.h | 8 +-
> include/linux/intel-svm.h | 17 ++++
> 4 files changed, 232 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 9c01e391a931..8862d6b0ef21 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -6179,6 +6179,10 @@ const struct iommu_ops intel_iommu_ops = {
> .dev_disable_feat = intel_iommu_dev_disable_feat,
> .is_attach_deferred = intel_iommu_is_attach_deferred,
> .pgsize_bitmap = INTEL_IOMMU_PGSIZES,
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> + .sva_bind_gpasid = intel_svm_bind_gpasid,
> + .sva_unbind_gpasid = intel_svm_unbind_gpasid,
> +#endif
> };
>
> static void quirk_iommu_igfx(struct pci_dev *dev)
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index 2998418f0a38..69b2070b843d 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -226,6 +226,210 @@ static LIST_HEAD(global_svm_list);
> list_for_each_entry((sdev), &(svm)->devs, list) \
> if ((d) != (sdev)->dev) {} else
>
> +static inline void intel_svm_free_if_empty(struct intel_svm *svm, u64 pasid)
> +{
> + if (list_empty(&svm->devs)) {
> + ioasid_set_data(pasid, NULL);
> + kfree(svm);
> + }
> +}

Do we really need a function form instead of putting the 4 lines directly
after the 'out' label?
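
(For readers following the thread: a minimal sketch of the alternative being
discussed, reusing the identifiers from the quoted intel_svm_bind_gpasid();
the label name is invented for illustration and this is not part of the
posted series. Error paths that may need to drop a freshly allocated svm
would do "goto free_out;" instead of calling intel_svm_free_if_empty(),
while paths that must not free keep using "goto out;".)

    free_out:
            /* Same cleanup the helper performs today, now shared via a label. */
            if (list_empty(&svm->devs)) {
                    ioasid_set_data(data->hpasid, NULL);
                    kfree(svm);
            }
    out:
            mutex_unlock(&pasid_mutex);
            return ret;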

> +
> +int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device
> *dev,
> + struct iommu_gpasid_bind_data *data)
> +{
> + struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> + struct dmar_domain *dmar_domain;
> + struct intel_svm_dev *sdev;
> + struct intel_svm *svm;
> + int ret = 0;
> +
> + if (WARN_ON(!iommu) || !data)
> + return -EINVAL;

Well, why not check !dev together with the others?

> +
> + if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> + data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> + return -EINVAL;
> +
> + if (dev_is_pci(dev)) {
> + /* VT-d supports devices with full 20 bit PASIDs only */
> + if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> + return -EINVAL;
> + } else {
> + return -ENOTSUPP;
> + }
> +
> + /*
> + * We only check host PASID range, we have no knowledge to check
> + * guest PASID range.
> + */
> + if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
> + return -EINVAL;
> +
> + dmar_domain = to_dmar_domain(domain);
> +
> + mutex_lock(&pasid_mutex);
> + svm = ioasid_find(NULL, data->hpasid, NULL);
> + if (IS_ERR(svm)) {
> + ret = PTR_ERR(svm);
> + goto out;
> + }
> +
> + if (svm) {
> + /*
> + * If we found svm for the PASID, there must be at
> + * least one device bond, otherwise svm should be freed.
> + */
> + if (WARN_ON(list_empty(&svm->devs))) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + for_each_svm_dev(sdev, svm, dev) {
> + /* In case of multiple sub-devices of the same pdev
> + * assigned, we should allow multiple bind calls with
> + * the same PASID and pdev.
> + */
> + sdev->users++;
> + goto out;

In the last review Eric raised the open question of binding the same
PASID to the same pdev multiple times. We discussed that this should be
disallowed. Can you check here whether aux_domain is enabled on the pdev,
to restrict multiple binding to sub-devices only?

> + }
> + } else {
> + /* We come here when PASID has never been bond to a
> device. */
> + svm = kzalloc(sizeof(*svm), GFP_KERNEL);
> + if (!svm) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + /* REVISIT: upper layer/VFIO can track host process that bind
> + * the PASID. ioasid_set = mm might be sufficient for vfio to
> + * check pasid VMM ownership. We can drop the following
> line
> + * once VFIO and IOASID set check is in place.
> + */

There is no check below this comment. The following lines simply
initialize the svm fields.

> + svm->mm = get_task_mm(current);
> + svm->pasid = data->hpasid;
> + if (data->flags & IOMMU_SVA_GPASID_VAL) {
> + svm->gpasid = data->gpasid;
> + svm->flags |= SVM_FLAG_GUEST_PASID;
> + }
> + ioasid_set_data(data->hpasid, svm);
> + INIT_LIST_HEAD_RCU(&svm->devs);
> + mmput(svm->mm);
> + }
> + sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
> + if (!sdev) {
> + /*
> + * If this is a new PASID that never bond to a device, then
> + * the device list must be empty which indicates struct svm
> + * was allocated in this function.
> + */
> + intel_svm_free_if_empty(svm, data->hpasid);
> + ret = -ENOMEM;
> + goto out;
> + }
> + sdev->dev = dev;
> + sdev->users = 1;
> +
> + /* Set up device context entry for PASID if not enabled already */
> + ret = intel_iommu_enable_pasid(iommu, sdev->dev);
> + if (ret) {
> + dev_err_ratelimited(dev, "Failed to enable PASID
> capability\n");

print hpasid

> + kfree(sdev);
> + intel_svm_free_if_empty(svm, data->hpasid);
> + goto out;
> + }
> +
> + /*
> + * PASID table is per device for better security. Therefore, for
> + * each bind of a new device even with an existing PASID, we need to
> + * call the nested mode setup function here.
> + */
> + spin_lock(&iommu->lock);
> + ret = intel_pasid_setup_nested(iommu,
> + dev,
> + (pgd_t *)data->gpgd,
> + data->hpasid,
> + &data->vtd,
> + dmar_domain,
> + data->addr_width);
> + if (ret) {
> + dev_err_ratelimited(dev, "Failed to set up PASID %llu in
> nested mode, Err %d\n",
> + data->hpasid, ret);
> + /*
> + * PASID entry should be in cleared state if nested mode
> + * set up failed. So we only need to clear IOASID tracking
> + * data such that free call will succeed.
> + */
> + kfree(sdev);
> + intel_svm_free_if_empty(svm, data->hpasid);
> + spin_unlock(&iommu->lock);
> + goto out;
> + }
> + spin_unlock(&iommu->lock);

spin_unlock can be moved before if(ret)?

> + svm->flags |= SVM_FLAG_GUEST_MODE;
> +
> + init_rcu_head(&sdev->rcu);
> + list_add_rcu(&sdev->list, &svm->devs);
> + out:
> + mutex_unlock(&pasid_mutex);
> + return ret;
> +}
> +
> +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> +{
> + struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> + struct intel_svm_dev *sdev;
> + struct intel_svm *svm;
> + int ret = -EINVAL;
> +
> + if (WARN_ON(!iommu))
> + return -EINVAL;
> +
> + mutex_lock(&pasid_mutex);
> + svm = ioasid_find(NULL, pasid, NULL);
> + if (!svm) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + if (IS_ERR(svm)) {
> + ret = PTR_ERR(svm);
> + goto out;
> + }
> +
> + for_each_svm_dev(sdev, svm, dev) {
> + ret = 0;
> + sdev->users--;
> + if (!sdev->users) {
> + list_del_rcu(&sdev->list);
> + intel_pasid_tear_down_entry(iommu, dev, svm-
> >pasid);
> + intel_flush_svm_range_dev(svm, sdev, 0, -1, 0);
> + /* TODO: Drain in flight PRQ for the PASID since it
> + * may get reused soon, we don't want to
> + * confuse with its previous life.
> + * intel_svm_drain_prq(dev, pasid);
> + */
> + kfree_rcu(sdev, rcu);
> +
> + if (list_empty(&svm->devs)) {
> + /*
> + * We do not free the IOASID here in that
> + * IOMMU driver did not allocate it.
> + * Unlike native SVM, IOASID for guest use
> was
> + * allocated prior to the bind call.
> + * In any case, if the free call comes before
> + * the unbind, IOMMU driver will get notified
> + * and perform cleanup.
> + */
> + ioasid_set_data(pasid, NULL);
> + kfree(svm);
> + }

Is it safer to move the above empty check outside of the loop?

> + }
> + break;
> + }
> +out:
> + mutex_unlock(&pasid_mutex);
> + return ret;
> +}
> +
> int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct
> svm_dev_ops *ops)
> {
> struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 6da03f627ba3..c8ce2336f8d8 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -706,7 +706,9 @@ struct dmar_domain *find_domain(struct device
> *dev);
> extern void intel_svm_check(struct intel_iommu *iommu);
> extern int intel_svm_enable_prq(struct intel_iommu *iommu);
> extern int intel_svm_finish_prq(struct intel_iommu *iommu);
> -
> +int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device
> *dev,
> + struct iommu_gpasid_bind_data *data);
> +int intel_svm_unbind_gpasid(struct device *dev, int pasid);
> struct svm_dev_ops;
>
> struct intel_svm_dev {
> @@ -723,9 +725,13 @@ struct intel_svm_dev {
> struct intel_svm {
> struct mmu_notifier notifier;
> struct mm_struct *mm;
> +
> struct intel_iommu *iommu;
> int flags;
> int pasid;
> + int gpasid; /* Guest PASID in case of vSVA bind with non-identity host
> + * to guest PASID mapping.
> + */

/* in case that guest PASID is different from host PASID */

> struct list_head devs;
> struct list_head list;
> };
> diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
> index d7c403d0dd27..c19690937540 100644
> --- a/include/linux/intel-svm.h
> +++ b/include/linux/intel-svm.h
> @@ -44,6 +44,23 @@ struct svm_dev_ops {
> * do such IOTLB flushes automatically.
> */
> #define SVM_FLAG_SUPERVISOR_MODE (1<<1)
> +/*
> + * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind to a
> device.
> + * In this case the mm_struct is in the guest kernel or userspace, its life

This statement is confusing. We still have an mm_struct on the host side
to claim ownership of a PASID.

> + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this API

why is a flag becoming an API?

> provides
> + * means to bind/unbind guest CR3 with PASIDs allocated for a device.
> + */
> +#define SVM_FLAG_GUEST_MODE (1<<2)
> +/*
> + * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own PASID
> space,
> + * which requires guest and host PASID translation at both directions. We
> keep
> + * track of guest PASID in order to provide lookup service to device drivers.
> + * One such example is a physical function (PF) driver that supports
> mediated
> + * device (mdev) assignment. Guest programming of mdev configuration
> space can
> + * only be done with guest PASID, therefore PF driver needs to find the
> matching
> + * host PASID to program the real hardware.

I feel such an example doesn't belong here; it is purely userspace policy.
Just describing what the flag is for should be sufficient.

> + */
> +#define SVM_FLAG_GUEST_PASID (1<<3)
>
> #ifdef CONFIG_INTEL_IOMMU_SVM
>
> --
> 2.7.4

Thanks
Kevin

2020-04-27 20:29:58

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v12 4/8] iommu/vt-d: Add bind guest PASID support

On Fri, 24 Apr 2020 10:47:45 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Jacob Pan <[email protected]>
> > Sent: Wednesday, April 22, 2020 2:53 AM
> >
> > When supporting guest SVA with emulated IOMMU, the guest PASID
> > table is shadowed in VMM. Updates to guest vIOMMU PASID table
> > will result in PASID cache flush which will be passed down to
> > the host as bind guest PASID calls.
>
> Above description is not accurate. Guest PASID table updates don't
> 'result in' PASID cache flush automatically. What about:
> --
> The guest needs to invalidate the PASID cache for any update to
> guest PASID table. Those invalidation requests are intercepted
> by the VMM and passed down to the host as binding guest PASID
> calls.
> --
It is good to add more details, thanks.

> >
> > For the SL page tables, it will be harvested from device's
> > default domain (request w/o PASID), or aux domain in case of
> > mediated device.
> >
> > .-------------. .---------------------------.
> > | vIOMMU | | Guest process CR3, FL only|
> > | | '---------------------------'
> > .----------------/
> > | PASID Entry |--- PASID cache flush -
> > '-------------' |
> > | | V
> > | | CR3 in GPA
> > '-------------'
> > Guest
> > ------| Shadow |--------------------------|--------
> > v v v
> > Host
> > .-------------. .----------------------.
> > | pIOMMU | | Bind FL for GVA-GPA |
> > | | '----------------------'
> > .----------------/ |
> > | PASID Entry | V (Nested xlate)
> > '----------------\.------------------------------.
> > | | |SL for GPA-HPA, default domain|
> > | | '------------------------------'
> > '-------------'
> > Where:
> > - FL = First level/stage one page tables
> > - SL = Second level/stage two page tables
> >
> > Signed-off-by: Jacob Pan <[email protected]>
> > Signed-off-by: Liu, Yi L <[email protected]>
> > ---
> > drivers/iommu/intel-iommu.c | 4 +
> > drivers/iommu/intel-svm.c | 204
> > ++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/intel-iommu.h | 8 +-
> > include/linux/intel-svm.h | 17 ++++
> > 4 files changed, 232 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/intel-iommu.c
> > b/drivers/iommu/intel-iommu.c index 9c01e391a931..8862d6b0ef21
> > 100644 --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -6179,6 +6179,10 @@ const struct iommu_ops intel_iommu_ops = {
> > .dev_disable_feat = intel_iommu_dev_disable_feat,
> > .is_attach_deferred =
> > intel_iommu_is_attach_deferred, .pgsize_bitmap =
> > INTEL_IOMMU_PGSIZES, +#ifdef CONFIG_INTEL_IOMMU_SVM
> > + .sva_bind_gpasid = intel_svm_bind_gpasid,
> > + .sva_unbind_gpasid = intel_svm_unbind_gpasid,
> > +#endif
> > };
> >
> > static void quirk_iommu_igfx(struct pci_dev *dev)
> > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > index 2998418f0a38..69b2070b843d 100644
> > --- a/drivers/iommu/intel-svm.c
> > +++ b/drivers/iommu/intel-svm.c
> > @@ -226,6 +226,210 @@ static LIST_HEAD(global_svm_list);
> > list_for_each_entry((sdev), &(svm)->devs, list) \
> > if ((d) != (sdev)->dev) {} else
> >
> > +static inline void intel_svm_free_if_empty(struct intel_svm *svm,
> > u64 pasid) +{
> > + if (list_empty(&svm->devs)) {
> > + ioasid_set_data(pasid, NULL);
> > + kfree(svm);
> > + }
> > +}
>
> Do we really need a function form instead of putting the 4 lines
> directly after the 'out' label?
>
it is more readable and good for code sharing.

> > +
> > +int intel_svm_bind_gpasid(struct iommu_domain *domain, struct
> > device *dev,
> > + struct iommu_gpasid_bind_data *data)
> > +{
> > + struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > + struct dmar_domain *dmar_domain;
> > + struct intel_svm_dev *sdev;
> > + struct intel_svm *svm;
> > + int ret = 0;
> > +
> > + if (WARN_ON(!iommu) || !data)
> > + return -EINVAL;
>
> well, why not checking !dev together?
This is a kernel API; unlike iommu and data, the caller fills in dev directly.

>
> > +
> > + if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> > + data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> > + return -EINVAL;
> > +
> > + if (dev_is_pci(dev)) {
> > + /* VT-d supports devices with full 20 bit PASIDs
> > only */
> > + if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> > + return -EINVAL;
> > + } else {
> > + return -ENOTSUPP;
> > + }
> > +
> > + /*
> > + * We only check host PASID range, we have no knowledge to
> > check
> > + * guest PASID range.
> > + */
> > + if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
> > + return -EINVAL;
> > +
> > + dmar_domain = to_dmar_domain(domain);
> > +
> > + mutex_lock(&pasid_mutex);
> > + svm = ioasid_find(NULL, data->hpasid, NULL);
> > + if (IS_ERR(svm)) {
> > + ret = PTR_ERR(svm);
> > + goto out;
> > + }
> > +
> > + if (svm) {
> > + /*
> > + * If we found svm for the PASID, there must be at
> > + * least one device bond, otherwise svm should be
> > freed.
> > + */
> > + if (WARN_ON(list_empty(&svm->devs))) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + for_each_svm_dev(sdev, svm, dev) {
> > + /* In case of multiple sub-devices of the
> > same pdev
> > + * assigned, we should allow multiple bind
> > calls with
> > + * the same PASID and pdev.
> > + */
> > + sdev->users++;
> > + goto out;
>
> in last review Eric raised the open about what about binding the same
> PASID to the same pdev multiple times. We discussed that should be
> disallowed. Here can you check whether aux_domain is enabled on pdev
> to restrict multiple-binding only for sub-devices?
Why would aux_domain be sufficient? A pdev could have aux_domain enabled
but still be bound many more times than it has mdevs.

Either we allow multiple binds or we don't.

>
> > + }
> > + } else {
> > + /* We come here when PASID has never been bond to a
> > device. */
> > + svm = kzalloc(sizeof(*svm), GFP_KERNEL);
> > + if (!svm) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > + /* REVISIT: upper layer/VFIO can track host
> > process that bind
> > + * the PASID. ioasid_set = mm might be sufficient
> > for vfio to
> > + * check pasid VMM ownership. We can drop the
> > following line
> > + * once VFIO and IOASID set check is in place.
> > + */
>
> there is no check below this comment. Following lines are simply
> initializing the svm fields.
>
What it is meant to say is that once the IOASID set is checked in the
VFIO layer, we can drop the assignment of svm->mm; the IOMMU driver will
not check it.

You are right, this is just a placeholder to help handle many moving
pieces.

> > + svm->mm = get_task_mm(current);
> > + svm->pasid = data->hpasid;
> > + if (data->flags & IOMMU_SVA_GPASID_VAL) {
> > + svm->gpasid = data->gpasid;
> > + svm->flags |= SVM_FLAG_GUEST_PASID;
> > + }
> > + ioasid_set_data(data->hpasid, svm);
> > + INIT_LIST_HEAD_RCU(&svm->devs);
> > + mmput(svm->mm);
> > + }
> > + sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
> > + if (!sdev) {
> > + /*
> > + * If this is a new PASID that never bond to a
> > device, then
> > + * the device list must be empty which indicates
> > struct svm
> > + * was allocated in this function.
> > + */
> > + intel_svm_free_if_empty(svm, data->hpasid);
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > + sdev->dev = dev;
> > + sdev->users = 1;
> > +
> > + /* Set up device context entry for PASID if not enabled
> > already */
> > + ret = intel_iommu_enable_pasid(iommu, sdev->dev);
> > + if (ret) {
> > + dev_err_ratelimited(dev, "Failed to enable PASID
> > capability\n");
>
> print hpasid

OK, sounds good.
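
(A sketch of what the message might look like with the PASID included,
using the same dev_err_ratelimited() call and the hpasid field from the
quoted patch:)

            dev_err_ratelimited(dev, "Failed to enable PASID capability for PASID %llu\n",
                                data->hpasid);
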
>
> > + kfree(sdev);
> > + intel_svm_free_if_empty(svm, data->hpasid);
> > + goto out;
> > + }
> > +
> > + /*
> > + * PASID table is per device for better security.
> > Therefore, for
> > + * each bind of a new device even with an existing PASID,
> > we need to
> > + * call the nested mode setup function here.
> > + */
> > + spin_lock(&iommu->lock);
> > + ret = intel_pasid_setup_nested(iommu,
> > + dev,
> > + (pgd_t *)data->gpgd,
> > + data->hpasid,
> > + &data->vtd,
> > + dmar_domain,
> > + data->addr_width);
> > + if (ret) {
> > + dev_err_ratelimited(dev, "Failed to set up PASID
> > %llu in nested mode, Err %d\n",
> > + data->hpasid, ret);
> > + /*
> > + * PASID entry should be in cleared state if
> > nested mode
> > + * set up failed. So we only need to clear IOASID
> > tracking
> > + * data such that free call will succeed.
> > + */
> > + kfree(sdev);
> > + intel_svm_free_if_empty(svm, data->hpasid);
> > + spin_unlock(&iommu->lock);
> > + goto out;
> > + }
> > + spin_unlock(&iommu->lock);
>
> spin_unlock can be moved before if(ret)?
Yes, good point. We can combine the unlock.
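
(A sketch of the combined flow being agreed on here, with the unlock
hoisted above the error handling; all identifiers as in the quoted patch:)

            spin_lock(&iommu->lock);
            ret = intel_pasid_setup_nested(iommu, dev, (pgd_t *)data->gpgd,
                                           data->hpasid, &data->vtd,
                                           dmar_domain, data->addr_width);
            spin_unlock(&iommu->lock);
            if (ret) {
                    dev_err_ratelimited(dev, "Failed to set up PASID %llu in nested mode, Err %d\n",
                                        data->hpasid, ret);
                    /* Only IOASID tracking data needs clearing on failure. */
                    kfree(sdev);
                    intel_svm_free_if_empty(svm, data->hpasid);
                    goto out;
            }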

>
> > + svm->flags |= SVM_FLAG_GUEST_MODE;
> > +
> > + init_rcu_head(&sdev->rcu);
> > + list_add_rcu(&sdev->list, &svm->devs);
> > + out:
> > + mutex_unlock(&pasid_mutex);
> > + return ret;
> > +}
> > +
> > +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> > +{
> > + struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > + struct intel_svm_dev *sdev;
> > + struct intel_svm *svm;
> > + int ret = -EINVAL;
> > +
> > + if (WARN_ON(!iommu))
> > + return -EINVAL;
> > +
> > + mutex_lock(&pasid_mutex);
> > + svm = ioasid_find(NULL, pasid, NULL);
> > + if (!svm) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + if (IS_ERR(svm)) {
> > + ret = PTR_ERR(svm);
> > + goto out;
> > + }
> > +
> > + for_each_svm_dev(sdev, svm, dev) {
> > + ret = 0;
> > + sdev->users--;
> > + if (!sdev->users) {
> > + list_del_rcu(&sdev->list);
> > + intel_pasid_tear_down_entry(iommu, dev,
> > svm-
> > >pasid);
> > + intel_flush_svm_range_dev(svm, sdev, 0,
> > -1, 0);
> > + /* TODO: Drain in flight PRQ for the PASID
> > since it
> > + * may get reused soon, we don't want to
> > + * confuse with its previous life.
> > + * intel_svm_drain_prq(dev, pasid);
> > + */
> > + kfree_rcu(sdev, rcu);
> > +
> > + if (list_empty(&svm->devs)) {
> > + /*
> > + * We do not free the IOASID here
> > in that
> > + * IOMMU driver did not allocate
> > it.
> > + * Unlike native SVM, IOASID for
> > guest use was
> > + * allocated prior to the bind
> > call.
> > + * In any case, if the free call
> > comes before
> > + * the unbind, IOMMU driver will
> > get notified
> > + * and perform cleanup.
> > + */
> > + ioasid_set_data(pasid, NULL);
> > + kfree(svm);
> > + }
>
> is it safer moving above empty check outside of the loop?
Why? Could you explain?

Note that this is not a loop.
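
(For readers following along: for_each_svm_dev() as quoted earlier in the
patch expands roughly as below, so the body only runs for the list entry
whose device matches, and with the break it executes at most once here.)

            /* Rough expansion of for_each_svm_dev(sdev, svm, dev) -- illustrative. */
            list_for_each_entry(sdev, &svm->devs, list)
                    if (dev != sdev->dev) {
                            /* skip entries belonging to other devices */
                    } else {
                            /* the caller's body runs only for the matching sdev */
                    }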

>
> > + }
> > + break;
> > + }
> > +out:
> > + mutex_unlock(&pasid_mutex);
> > + return ret;
> > +}
> > +
> > int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
> > struct svm_dev_ops *ops)
> > {
> > struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > diff --git a/include/linux/intel-iommu.h
> > b/include/linux/intel-iommu.h index 6da03f627ba3..c8ce2336f8d8
> > 100644 --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -706,7 +706,9 @@ struct dmar_domain *find_domain(struct device
> > *dev);
> > extern void intel_svm_check(struct intel_iommu *iommu);
> > extern int intel_svm_enable_prq(struct intel_iommu *iommu);
> > extern int intel_svm_finish_prq(struct intel_iommu *iommu);
> > -
> > +int intel_svm_bind_gpasid(struct iommu_domain *domain, struct
> > device *dev,
> > + struct iommu_gpasid_bind_data *data);
> > +int intel_svm_unbind_gpasid(struct device *dev, int pasid);
> > struct svm_dev_ops;
> >
> > struct intel_svm_dev {
> > @@ -723,9 +725,13 @@ struct intel_svm_dev {
> > struct intel_svm {
> > struct mmu_notifier notifier;
> > struct mm_struct *mm;
> > +
> > struct intel_iommu *iommu;
> > int flags;
> > int pasid;
> > + int gpasid; /* Guest PASID in case of vSVA bind with
> > non-identity host
> > + * to guest PASID mapping.
> > + */
>
> /* in case that guest PASID is different from host PASID */
OK, will do.

>
> > struct list_head devs;
> > struct list_head list;
> > };
> > diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
> > index d7c403d0dd27..c19690937540 100644
> > --- a/include/linux/intel-svm.h
> > +++ b/include/linux/intel-svm.h
> > @@ -44,6 +44,23 @@ struct svm_dev_ops {
> > * do such IOTLB flushes automatically.
> > */
> > #define SVM_FLAG_SUPERVISOR_MODE (1<<1)
> > +/*
> > + * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind
> > to a device.
> > + * In this case the mm_struct is in the guest kernel or userspace,
> > its life
>
> this statement is confusing. We still have mm_struct in the host side
> to claim the ownership of a PASID.
>
How about this:
/*
* The SVM_FLAG_GUEST_MODE flag is used when a PASID bind is for guest
* processes. Compared to the host bind, the primary differences are:
* 1. mm life cycle management
* 2. fault reporting
*/

> > + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this
> > API
>
> why is a flag becoming an API?
>
Will refer to it as a flag.

> > provides
> > + * means to bind/unbind guest CR3 with PASIDs allocated for a
> > device.
> > + */
> > +#define SVM_FLAG_GUEST_MODE (1<<2)
> > +/*
> > + * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own
> > PASID space,
> > + * which requires guest and host PASID translation at both
> > directions. We keep
> > + * track of guest PASID in order to provide lookup service to
> > device drivers.
> > + * One such example is a physical function (PF) driver that
> > supports mediated
> > + * device (mdev) assignment. Guest programming of mdev
> > configuration space can
> > + * only be done with guest PASID, therefore PF driver needs to
> > find the matching
> > + * host PASID to program the real hardware.
>
> I feel such example doesn't belong here, which is purely userspace
> policy. Here just describe what the flag is for should be sufficient.
>
Will remove the example. How about this?

/*
* The SVM_FLAG_GUEST_PASID flag is used when a guest has its own PASID space,
* which requires guest and host PASID translation at both directions.
*/


> > + */
> > +#define SVM_FLAG_GUEST_PASID (1<<3)
> >
> > #ifdef CONFIG_INTEL_IOMMU_SVM
> >
> > --
> > 2.7.4
>
> Thanks
> Kevin
>

[Jacob Pan]

2020-04-29 13:50:55

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v12 6/8] iommu/vt-d: Add svm/sva invalidate function

Hi Jacob,

On 4/21/20 8:52 PM, Jacob Pan wrote:
> When Shared Virtual Address (SVA) is enabled for a guest OS via
> vIOMMU, we need to provide invalidation support at IOMMU API and driver
> level. This patch adds Intel VT-d specific function to implement
> iommu passdown invalidate API for shared virtual address.
>
> The use case is for supporting caching structure invalidation
> of assigned SVM capable devices. Emulated IOMMU exposes queue
> invalidation capability and passes down all descriptors from the guest
> to the physical IOMMU.
>
> The assumption is that guest to host device ID mapping should be
> resolved prior to calling IOMMU driver. Based on the device handle,
> host IOMMU driver can replace certain fields before submit to the
> invalidation queue.
>
> ---
> v12 - Use ratelimited prints for all user called APIs.
> - Check for domain nesting attr
> ---
> Signed-off-by: Jacob Pan <[email protected]>
> Signed-off-by: Ashok Raj <[email protected]>
> Signed-off-by: Liu, Yi L <[email protected]>
Reviewed-by: Eric Auger <[email protected]>

Thanks

Eric
>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/intel-iommu.c | 175 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 175 insertions(+)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 8862d6b0ef21..24de233faaf5 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5595,6 +5595,180 @@ static void intel_iommu_aux_detach_device(struct iommu_domain *domain,
> aux_domain_remove_dev(to_dmar_domain(domain), dev);
> }
>
> +/*
> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> + * VT-d granularity. Invalidation is typically included in the unmap operation
> + * as a result of DMA or VFIO unmap. However, for assigned devices guest
> + * owns the first level page tables. Invalidations of translation caches in the
> + * guest are trapped and passed down to the host.
> + *
> + * vIOMMU in the guest will only expose first level page tables, therefore
> + * we do not support IOTLB granularity for request without PASID (second level).
> + *
> + * For example, to find the VT-d granularity encoding for IOTLB
> + * type and page selective granularity within PASID:
> + * X: indexed by iommu cache type
> + * Y: indexed by enum iommu_inv_granularity
> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> + */
> +
> +const static int
> +inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> + /*
> + * PASID based IOTLB invalidation: PASID selective (per PASID),
> + * page selective (address granularity)
> + */
> + {-EINVAL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> + /* PASID based dev TLBs */
> + {-EINVAL, -EINVAL, QI_DEV_IOTLB_GRAN_PASID_SEL},
> + /* PASID cache */
> + {-EINVAL, -EINVAL, -EINVAL}
> +};
> +
> +static inline int to_vtd_granularity(int type, int granu)
> +{
> + return inv_type_granu_table[type][granu];
> +}
> +
> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> +{
> + u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
> +
> + /* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
> + * IOMMU cache invalidate API passes granu_size in bytes, and number of
> + * granu size in contiguous memory.
> + */
> + return order_base_2(nr_pages);
> +}
> +
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +static int
> +intel_iommu_sva_invalidate(struct iommu_domain *domain, struct device *dev,
> + struct iommu_cache_invalidate_info *inv_info)
> +{
> + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> + struct device_domain_info *info;
> + struct intel_iommu *iommu;
> + unsigned long flags;
> + int cache_type;
> + u8 bus, devfn;
> + u16 did, sid;
> + int ret = 0;
> + u64 size = 0;
> +
> + if (!inv_info || !dmar_domain ||
> + inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> + return -EINVAL;
> +
> + if (!dev || !dev_is_pci(dev))
> + return -ENODEV;
> +
> + iommu = device_to_iommu(dev, &bus, &devfn);
> + if (!iommu)
> + return -ENODEV;
> +
> + if (!(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE))
> + return -EINVAL;
> +
> + spin_lock_irqsave(&device_domain_lock, flags);
> + spin_lock(&iommu->lock);
> + info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
> + if (!info) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> + did = dmar_domain->iommu_did[iommu->seq_id];
> + sid = PCI_DEVID(bus, devfn);
> +
> + /* Size is only valid in non-PASID selective invalidation */
> + if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> + size = to_vtd_size(inv_info->addr_info.granule_size,
> + inv_info->addr_info.nb_granules);
> +
> + for_each_set_bit(cache_type,
> + (unsigned long *)&inv_info->cache,
> + IOMMU_CACHE_INV_TYPE_NR) {
> + int granu = 0;
> + u64 pasid = 0;
> +
> + granu = to_vtd_granularity(cache_type, inv_info->granularity);
> + if (granu == -EINVAL) {
> + pr_err_ratelimited("Invalid cache type and granu combination %d/%d\n",
> + cache_type, inv_info->granularity);
> + break;
> + }
> +
> + /*
> + * PASID is stored in different locations based on the
> + * granularity.
> + */
> + if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
> + (inv_info->pasid_info.flags & IOMMU_INV_PASID_FLAGS_PASID))
> + pasid = inv_info->pasid_info.pasid;
> + else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
> + (inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID))
> + pasid = inv_info->addr_info.pasid;
> +
> + switch (BIT(cache_type)) {
> + case IOMMU_CACHE_INV_TYPE_IOTLB:
> + if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
> + size &&
> + (inv_info->addr_info.addr & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> + pr_err_ratelimited("Address out of range, 0x%llx, size order %llu\n",
> + inv_info->addr_info.addr, size);
> + ret = -ERANGE;
> + goto out_unlock;
> + }
> +
> + /*
> + * If granu is PASID-selective, address is ignored.
> + * We use npages = -1 to indicate that.
> + */
> + qi_flush_piotlb(iommu, did, pasid,
> + mm_to_dma_pfn(inv_info->addr_info.addr),
> + (granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> + inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_LEAF);
> +
> + /*
> + * Always flush device IOTLB if ATS is enabled. vIOMMU
> + * in the guest may assume IOTLB flush is inclusive,
> + * which is more efficient.
> + */
> + if (info->ats_enabled)
> + qi_flush_dev_iotlb_pasid(iommu, sid,
> + info->pfsid, pasid,
> + info->ats_qdep,
> + inv_info->addr_info.addr,
> + size, granu);
> + break;
> + case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> + if (info->ats_enabled)
> + qi_flush_dev_iotlb_pasid(iommu, sid,
> + info->pfsid, pasid,
> + info->ats_qdep,
> + inv_info->addr_info.addr,
> + size, granu);
> + else
> + pr_warn_ratelimited("Passdown device IOTLB flush w/o ATS!\n");
> + break;
> + case IOMMU_CACHE_INV_TYPE_PASID:
> + qi_flush_pasid_cache(iommu, did, granu,
> + inv_info->pasid_info.pasid);
> + break;
> + default:
> + dev_err_ratelimited(dev, "Unsupported IOMMU invalidation type %d\n",
> + cache_type);
> + ret = -EINVAL;
> + }
> + }
> +out_unlock:
> + spin_unlock(&iommu->lock);
> + spin_unlock_irqrestore(&device_domain_lock, flags);
> +
> + return ret;
> +}
> +#endif
> +
> static int intel_iommu_map(struct iommu_domain *domain,
> unsigned long iova, phys_addr_t hpa,
> size_t size, int iommu_prot, gfp_t gfp)
> @@ -6180,6 +6354,7 @@ const struct iommu_ops intel_iommu_ops = {
> .is_attach_deferred = intel_iommu_is_attach_deferred,
> .pgsize_bitmap = INTEL_IOMMU_PGSIZES,
> #ifdef CONFIG_INTEL_IOMMU_SVM
> + .cache_invalidate = intel_iommu_sva_invalidate,
> .sva_bind_gpasid = intel_svm_bind_gpasid,
> .sva_unbind_gpasid = intel_svm_unbind_gpasid,
> #endif
>
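
(A worked example for the to_vtd_size() helper quoted above, with values
chosen purely for illustration:)

            /*
             * granu_size = 4096 bytes, nr_granules = 512:
             *   nr_pages = (4096 * 512) >> VTD_PAGE_SHIFT = 512 4K pages
             *   order_base_2(512) = 9, i.e. the 2MB encoding mentioned
             *   in the comment ("0 for 4k, 9 for 2MB").
             */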

2020-04-29 14:16:55

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v12 4/8] iommu/vt-d: Add bind guest PASID support

Hi,

On 4/27/20 10:34 PM, Jacob Pan wrote:
> On Fri, 24 Apr 2020 10:47:45 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
>>> From: Jacob Pan <[email protected]>
>>> Sent: Wednesday, April 22, 2020 2:53 AM
>>>
>>> When supporting guest SVA with emulated IOMMU, the guest PASID
>>> table is shadowed in VMM. Updates to guest vIOMMU PASID table
>>> will result in PASID cache flush which will be passed down to
>>> the host as bind guest PASID calls.
>>
>> Above description is not accurate. Guest PASID table updates don't
>> 'result in' PASID cache flush automatically. What about:
>> --
>> The guest needs to invalidate the PASID cache for any update to
>> guest PASID table. Those invalidation requests are intercepted
>> by the VMM and passed down to the host as binding guest PASID
>> calls.
>> --
> It is good to add more details, thanks.
>
>>>
>>> For the SL page tables, it will be harvested from device's
>>> default domain (request w/o PASID), or aux domain in case of
>>> mediated device.
>>>
>>> .-------------. .---------------------------.
>>> | vIOMMU | | Guest process CR3, FL only|
>>> | | '---------------------------'
>>> .----------------/
>>> | PASID Entry |--- PASID cache flush -
>>> '-------------' |
>>> | | V
>>> | | CR3 in GPA
>>> '-------------'
>>> Guest
>>> ------| Shadow |--------------------------|--------
>>> v v v
>>> Host
>>> .-------------. .----------------------.
>>> | pIOMMU | | Bind FL for GVA-GPA |
>>> | | '----------------------'
>>> .----------------/ |
>>> | PASID Entry | V (Nested xlate)
>>> '----------------\.------------------------------.
>>> | | |SL for GPA-HPA, default domain|
>>> | | '------------------------------'
>>> '-------------'
>>> Where:
>>> - FL = First level/stage one page tables
>>> - SL = Second level/stage two page tables
>>>
>>> Signed-off-by: Jacob Pan <[email protected]>
>>> Signed-off-by: Liu, Yi L <[email protected]>
>>> ---
>>> drivers/iommu/intel-iommu.c | 4 +
>>> drivers/iommu/intel-svm.c | 204
>>> ++++++++++++++++++++++++++++++++++++++++++++
>>> include/linux/intel-iommu.h | 8 +-
>>> include/linux/intel-svm.h | 17 ++++
>>> 4 files changed, 232 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/iommu/intel-iommu.c
>>> b/drivers/iommu/intel-iommu.c index 9c01e391a931..8862d6b0ef21
>>> 100644 --- a/drivers/iommu/intel-iommu.c
>>> +++ b/drivers/iommu/intel-iommu.c
>>> @@ -6179,6 +6179,10 @@ const struct iommu_ops intel_iommu_ops = {
>>> .dev_disable_feat = intel_iommu_dev_disable_feat,
>>> .is_attach_deferred =
>>> intel_iommu_is_attach_deferred, .pgsize_bitmap =
>>> INTEL_IOMMU_PGSIZES, +#ifdef CONFIG_INTEL_IOMMU_SVM
>>> + .sva_bind_gpasid = intel_svm_bind_gpasid,
>>> + .sva_unbind_gpasid = intel_svm_unbind_gpasid,
>>> +#endif
>>> };
>>>
>>> static void quirk_iommu_igfx(struct pci_dev *dev)
>>> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
>>> index 2998418f0a38..69b2070b843d 100644
>>> --- a/drivers/iommu/intel-svm.c
>>> +++ b/drivers/iommu/intel-svm.c
>>> @@ -226,6 +226,210 @@ static LIST_HEAD(global_svm_list);
>>> list_for_each_entry((sdev), &(svm)->devs, list) \
>>> if ((d) != (sdev)->dev) {} else
>>>
>>> +static inline void intel_svm_free_if_empty(struct intel_svm *svm,
>>> u64 pasid) +{
>>> + if (list_empty(&svm->devs)) {
>>> + ioasid_set_data(pasid, NULL);
>>> + kfree(svm);
>>> + }
>>> +}
>>
>> Do we really need a function form instead of putting the 4 lines
>> directly after the 'out' label?
>>
> it is more readable and good for code sharing.
That's my fault: I suggested adding this helper because I noticed this
was repeated several times in the code. But adding a new goto label to
handle that job would work just as well.


>
>>> +
>>> +int intel_svm_bind_gpasid(struct iommu_domain *domain, struct
>>> device *dev,
>>> + struct iommu_gpasid_bind_data *data)
>>> +{
>>> + struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
>>> + struct dmar_domain *dmar_domain;
>>> + struct intel_svm_dev *sdev;
>>> + struct intel_svm *svm;
>>> + int ret = 0;
>>> +
>>> + if (WARN_ON(!iommu) || !data)
>>> + return -EINVAL;
>>
>> well, why not checking !dev together?
> This is kernel API, unlike iommu and data caller fills in dev directly.
>
>>
>>> +
>>> + if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
>>> + data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
>>> + return -EINVAL;
>>> +
>>> + if (dev_is_pci(dev)) {
>>> + /* VT-d supports devices with full 20 bit PASIDs
>>> only */
>>> + if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
>>> + return -EINVAL;
>>> + } else {
>>> + return -ENOTSUPP;
>>> + }
>>> +
>>> + /*
>>> + * We only check host PASID range, we have no knowledge to
>>> check
>>> + * guest PASID range.
>>> + */
>>> + if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
>>> + return -EINVAL;
>>> +
>>> + dmar_domain = to_dmar_domain(domain);
>>> +
>>> + mutex_lock(&pasid_mutex);
>>> + svm = ioasid_find(NULL, data->hpasid, NULL);
>>> + if (IS_ERR(svm)) {
>>> + ret = PTR_ERR(svm);
>>> + goto out;
>>> + }
>>> +
>>> + if (svm) {
>>> + /*
>>> + * If we found svm for the PASID, there must be at
>>> + * least one device bond, otherwise svm should be
>>> freed.
>>> + */
>>> + if (WARN_ON(list_empty(&svm->devs))) {
>>> + ret = -EINVAL;
>>> + goto out;
>>> + }
>>> +
>>> + for_each_svm_dev(sdev, svm, dev) {
>>> + /* In case of multiple sub-devices of the
>>> same pdev
>>> + * assigned, we should allow multiple bind
>>> calls with
>>> + * the same PASID and pdev.
>>> + */
>>> + sdev->users++;
>>> + goto out;
>>
>> in last review Eric raised the open about what about binding the same
>> PASID to the same pdev multiple times. We discussed that should be
>> disallowed. Here can you check whether aux_domain is enabled on pdev
>> to restrict multiple-binding only for sub-devices?
> Why aux_domain is sufficient? A pdev could have aux_domain enabled but
> still bind pdev many times more than its mdevs.
>
> Either we allow multiple bind or not.

I tried to figure out whether binding the same PASID to the same pdev
was meaningful. I understood it is not. If this case can be detected at
VFIO level I am fine as well.

Thanks

Eric
>
>>
>>> + }
>>> + } else {
>>> + /* We come here when PASID has never been bond to a
>>> device. */
>>> + svm = kzalloc(sizeof(*svm), GFP_KERNEL);
>>> + if (!svm) {
>>> + ret = -ENOMEM;
>>> + goto out;
>>> + }
>>> + /* REVISIT: upper layer/VFIO can track host
>>> process that bind
>>> + * the PASID. ioasid_set = mm might be sufficient
>>> for vfio to
>>> + * check pasid VMM ownership. We can drop the
>>> following line
>>> + * once VFIO and IOASID set check is in place.
>>> + */
>>
>> there is no check below this comment. Following lines are simply
>> initializing the svm fields.
>>
> What it meant to say is that once IOASID set is checked in VFIO layer,
> we can drop the assignment of svm->mm, IOMMU driver will not check.
>
> You are right, this is just a place holder to help handle many moving
> pieces.
>
>>> + svm->mm = get_task_mm(current);
>>> + svm->pasid = data->hpasid;
>>> + if (data->flags & IOMMU_SVA_GPASID_VAL) {
>>> + svm->gpasid = data->gpasid;
>>> + svm->flags |= SVM_FLAG_GUEST_PASID;
>>> + }
>>> + ioasid_set_data(data->hpasid, svm);
>>> + INIT_LIST_HEAD_RCU(&svm->devs);
>>> + mmput(svm->mm);
>>> + }
>>> + sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
>>> + if (!sdev) {
>>> + /*
>>> + * If this is a new PASID that never bond to a
>>> device, then
>>> + * the device list must be empty which indicates
>>> struct svm
>>> + * was allocated in this function.
>>> + */
>>> + intel_svm_free_if_empty(svm, data->hpasid);
>>> + ret = -ENOMEM;
>>> + goto out;
>>> + }
>>> + sdev->dev = dev;
>>> + sdev->users = 1;
>>> +
>>> + /* Set up device context entry for PASID if not enabled
>>> already */
>>> + ret = intel_iommu_enable_pasid(iommu, sdev->dev);
>>> + if (ret) {
>>> + dev_err_ratelimited(dev, "Failed to enable PASID
>>> capability\n");
>>
>> print hpasid
>
> OK, sounds good.
>>
>>> + kfree(sdev);
>>> + intel_svm_free_if_empty(svm, data->hpasid);
>>> + goto out;
>>> + }
>>> +
>>> + /*
>>> + * PASID table is per device for better security.
>>> Therefore, for
>>> + * each bind of a new device even with an existing PASID,
>>> we need to
>>> + * call the nested mode setup function here.
>>> + */
>>> + spin_lock(&iommu->lock);
>>> + ret = intel_pasid_setup_nested(iommu,
>>> + dev,
>>> + (pgd_t *)data->gpgd,
>>> + data->hpasid,
>>> + &data->vtd,
>>> + dmar_domain,
>>> + data->addr_width);
>>> + if (ret) {
>>> + dev_err_ratelimited(dev, "Failed to set up PASID
>>> %llu in nested mode, Err %d\n",
>>> + data->hpasid, ret);
>>> + /*
>>> + * PASID entry should be in cleared state if
>>> nested mode
>>> + * set up failed. So we only need to clear IOASID
>>> tracking
>>> + * data such that free call will succeed.
>>> + */
>>> + kfree(sdev);
>>> + intel_svm_free_if_empty(svm, data->hpasid);
>>> + spin_unlock(&iommu->lock);
>>> + goto out;
>>> + }
>>> + spin_unlock(&iommu->lock);
>>
>> spin_unlock can be moved before if(ret)?
> Yes, good point. We can combine the unlock.
>
>>
>>> + svm->flags |= SVM_FLAG_GUEST_MODE;
>>> +
>>> + init_rcu_head(&sdev->rcu);
>>> + list_add_rcu(&sdev->list, &svm->devs);
>>> + out:
>>> + mutex_unlock(&pasid_mutex);
>>> + return ret;
>>> +}
>>> +
>>> +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
>>> +{
>>> + struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
>>> + struct intel_svm_dev *sdev;
>>> + struct intel_svm *svm;
>>> + int ret = -EINVAL;
>>> +
>>> + if (WARN_ON(!iommu))
>>> + return -EINVAL;
>>> +
>>> + mutex_lock(&pasid_mutex);
>>> + svm = ioasid_find(NULL, pasid, NULL);
>>> + if (!svm) {
>>> + ret = -EINVAL;
>>> + goto out;
>>> + }
>>> +
>>> + if (IS_ERR(svm)) {
>>> + ret = PTR_ERR(svm);
>>> + goto out;
>>> + }
>>> +
>>> + for_each_svm_dev(sdev, svm, dev) {
>>> + ret = 0;
>>> + sdev->users--;
>>> + if (!sdev->users) {
>>> + list_del_rcu(&sdev->list);
>>> + intel_pasid_tear_down_entry(iommu, dev,
>>> svm-
>>>> pasid);
>>> + intel_flush_svm_range_dev(svm, sdev, 0,
>>> -1, 0);
>>> + /* TODO: Drain in flight PRQ for the PASID
>>> since it
>>> + * may get reused soon, we don't want to
>>> + * confuse with its previous life.
>>> + * intel_svm_drain_prq(dev, pasid);
>>> + */
>>> + kfree_rcu(sdev, rcu);
>>> +
>>> + if (list_empty(&svm->devs)) {
>>> + /*
>>> + * We do not free the IOASID here
>>> in that
>>> + * IOMMU driver did not allocate
>>> it.
>>> + * Unlike native SVM, IOASID for
>>> guest use was
>>> + * allocated prior to the bind
>>> call.
>>> + * In any case, if the free call
>>> comes before
>>> + * the unbind, IOMMU driver will
>>> get notified
>>> + * and perform cleanup.
>>> + */
>>> + ioasid_set_data(pasid, NULL);
>>> + kfree(svm);
>>> + }
>>
>> is it safer moving above empty check outside of the loop?
> why? could you explain.
>
> Note that this is not a loop.
>
>>
>>> + }
>>> + break;
>>> + }
>>> +out:
>>> + mutex_unlock(&pasid_mutex);
>>> + return ret;
>>> +}
>>> +
>>> int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
>>> struct svm_dev_ops *ops)
>>> {
>>> struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
>>> diff --git a/include/linux/intel-iommu.h
>>> b/include/linux/intel-iommu.h index 6da03f627ba3..c8ce2336f8d8
>>> 100644 --- a/include/linux/intel-iommu.h
>>> +++ b/include/linux/intel-iommu.h
>>> @@ -706,7 +706,9 @@ struct dmar_domain *find_domain(struct device
>>> *dev);
>>> extern void intel_svm_check(struct intel_iommu *iommu);
>>> extern int intel_svm_enable_prq(struct intel_iommu *iommu);
>>> extern int intel_svm_finish_prq(struct intel_iommu *iommu);
>>> -
>>> +int intel_svm_bind_gpasid(struct iommu_domain *domain, struct
>>> device *dev,
>>> + struct iommu_gpasid_bind_data *data);
>>> +int intel_svm_unbind_gpasid(struct device *dev, int pasid);
>>> struct svm_dev_ops;
>>>
>>> struct intel_svm_dev {
>>> @@ -723,9 +725,13 @@ struct intel_svm_dev {
>>> struct intel_svm {
>>> struct mmu_notifier notifier;
>>> struct mm_struct *mm;
>>> +
>>> struct intel_iommu *iommu;
>>> int flags;
>>> int pasid;
>>> + int gpasid; /* Guest PASID in case of vSVA bind with
>>> non-identity host
>>> + * to guest PASID mapping.
>>> + */
>>
>> /* in case that guest PASID is different from host PASID */
> OK, will do.
>
>>
>>> struct list_head devs;
>>> struct list_head list;
>>> };
>>> diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
>>> index d7c403d0dd27..c19690937540 100644
>>> --- a/include/linux/intel-svm.h
>>> +++ b/include/linux/intel-svm.h
>>> @@ -44,6 +44,23 @@ struct svm_dev_ops {
>>> * do such IOTLB flushes automatically.
>>> */
>>> #define SVM_FLAG_SUPERVISOR_MODE (1<<1)
>>> +/*
>>> + * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind
>>> to a device.
>>> + * In this case the mm_struct is in the guest kernel or userspace,
>>> its life
>>
>> this statement is confusing. We still have mm_struct in the host side
>> to claim the ownership of a PASID.
>>
> How about this:
> /*
> * The SVM_FLAG_GUEST_MODE flag is used when a PASID bind is for guest
> * processes. Compared to the host bind, the primary differences are:
> * 1. mm life cycle management
> * 2. fault reporting
> */
>
>>> + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this
>>> API
>>
>> why is a flag becoming an API?
>>
> will refer as flag.
>
>>> provides
>>> + * means to bind/unbind guest CR3 with PASIDs allocated for a
>>> device.
>>> + */
>>> +#define SVM_FLAG_GUEST_MODE (1<<2)
>>> +/*
>>> + * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own
>>> PASID space,
>>> + * which requires guest and host PASID translation at both
>>> directions. We keep
>>> + * track of guest PASID in order to provide lookup service to
>>> device drivers.
>>> + * One such example is a physical function (PF) driver that
>>> supports mediated
>>> + * device (mdev) assignment. Guest programming of mdev
>>> configuration space can
>>> + * only be done with guest PASID, therefore PF driver needs to
>>> find the matching
>>> + * host PASID to program the real hardware.
>>
>> I feel such example doesn't belong here, which is purely userspace
>> policy. Here just describe what the flag is for should be sufficient.
>>
> Will remove the example. How about this?
>
> /*
> * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own PASID space,
> * which requires guest and host PASID translation at both directions.
> */
>
>
>>> + */
>>> +#define SVM_FLAG_GUEST_PASID (1<<3)
>>>
>>> #ifdef CONFIG_INTEL_IOMMU_SVM
>>>
>>> --
>>> 2.7.4
>>
>> Thanks
>> Kevin
>>
>
> [Jacob Pan]
>

2020-04-29 22:27:59

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v12 4/8] iommu/vt-d: Add bind guest PASID support

On Wed, 29 Apr 2020 16:12:01 +0200
Auger Eric <[email protected]> wrote:

> >> in last review Eric raised the open about what about binding the
> >> same PASID to the same pdev multiple times. We discussed that
> >> should be disallowed. Here can you check whether aux_domain is
> >> enabled on pdev to restrict multiple-binding only for
> >> sub-devices?
> > Why aux_domain is sufficient? A pdev could have aux_domain enabled
> > but still bind pdev many times more than its mdevs.
> >
> > Either we allow multiple bind or not.
>
> I tried to figure out whether binding the same PASID to the same pdev
> was meaningful. I understood it is not. If this case can be detected
> at VFIO level I am fine as well.
I will remove the multiple bind support for now and reintroduce it when
we enable mdev.

Thanks,

Jacob
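
(Given the resolution above, one possible shape for the rebind check with
multiple-bind support removed is sketched below; the -EBUSY return and the
message are assumptions for illustration, not part of the posted series.)

            /* Reject a second bind of the same PASID to the same device. */
            for_each_svm_dev(sdev, svm, dev) {
                    dev_warn_ratelimited(dev, "PASID %llu already bound to this device\n",
                                         data->hpasid);
                    ret = -EBUSY;
                    goto out;
            }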