Shared Virtual Address (SVA), a.k.a. Shared Virtual Memory (SVM) on Intel
platforms, allows address space sharing between device DMA and applications.
SVA can reduce programming complexity and enhance security.
This series is intended to enable SVA virtualization, i.e. enable use of SVA
within a guest user application.
This is the remaining portion of the original patchset, based on Joerg's
x86/vt-d branch, into which the preparatory and cleanup patches have
already been merged.
(git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git)
Only the IOMMU portion of the changes is included in this series. Additional
support is needed in VFIO and QEMU (to be submitted separately) to complete
this functionality.
To keep changes incremental and reduce the size of each patchset, this series
does not include support for page request services.
In the VT-d implementation, the PASID table is per device and maintained in
the host. The guest PASID table is shadowed in the VMM where the virtual
IOMMU is emulated.
.-------------. .---------------------------.
| vIOMMU | | Guest process CR3, FL only|
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush -
'-------------' |
| | V
| | CR3 in GPA
'-------------'
Guest
------| Shadow |--------------------------|--------
v v v
Host
.-------------. .----------------------.
| pIOMMU | | Bind FL for GVA-GPA |
| | '----------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.------------------------------.
| | |SL for GPA-HPA, default domain|
| | '------------------------------'
'-------------'
Where:
- FL = First level/stage one page tables
- SL = Second level/stage two page tables
This is the remaining VT-d only portion of V5 since the uAPIs and IOASID common
code have been applied to Joerg's IOMMU core branch.
(https://lkml.org/lkml/2019/10/2/833)
The complete set with VFIO patches is here:
https://github.com/jacobpan/linux.git:siov_sva
The complete nested SVA upstream patches are divided into three phases:
1. Common APIs and PCI device direct assignment
2. Page Request Services (PRS) support
3. Mediated device assignment
With this set and the accompanying VFIO code, we will achieve phase #1.
Thanks,
Jacob
ChangeLog:
- v11 Misc fixes and improvements based on review by Kevin & Eric
- Fixed devTLB granularity conversion
- Simplified VT-d granularity lookup by replacing the 2D map array
with invalid entries.
- Fixed locking in bind guest PASID
- Added nesting domain attr check
- Squashed the agaw checking patch into its user
- Use rate limited error messages for all user-originated calls
- v10
- Addressed Eric's review comments on v7 and v9. Most fixes are in 3/10 and
6/10: extra condition checks and consolidation of duplicated code.
- v9
- Addressed Baolu's comments for v8 for IOTLB flush consolidation,
bug fixes
- Removed IOASID notifier code which will be submitted separately
to address PASID life cycle management with multiple users.
- v8
- Extracted cleanup patches from V7 and accepted into maintainer's
tree (https://lkml.org/lkml/2019/12/2/514).
- Added IOASID notifier and VT-d handler for termination of PASID
IOMMU context upon free. This will ensure success of the VFIO IOASID
free API regardless of whether the PASID is in use.
(https://lore.kernel.org/linux-iommu/[email protected]/)
- V7
- Respect vIOMMU PASID range in virtual command PASID/IOASID allocator
- Caching virtual command capabilities to avoid runtime checks that
could cause vmexits.
- V6
- Rebased on top of Joerg's core branch
(git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git core)
- Adapt to new uAPIs and IOASID allocators
- V5
Rebased on v5.3-rc4 which has some of the IOMMU fault APIs merged.
Addressed v4 review comments from Eric Auger, Baolu Lu, and
Jonathan Cameron. Specific changes are as follows:
- Refined custom IOASID allocator to support multiple vIOMMU, hotplug
cases.
- Extracted vendor data from the IOMMU guest PASID bind data; for VT-d
this supports all necessary guest PASID entry fields for PASID
bind.
- Support non-identity host-guest PASID mapping
- Exception handling in various cases
- V4
- Redesigned IOASID allocator such that it can support custom
allocators with shared helper functions. Use separate XArray
to store IOASIDs per allocator. Took advice from Eric Auger to
have default allocator use the generic allocator structure.
Combined into one patch since the default allocator is just
"another" allocator now. Can be built as a module in case of
driver use without IOMMU.
- Extended bind guest PASID data to support SMMU and non-identity
guest to host PASID mapping https://lkml.org/lkml/2019/5/21/802
- Rebased on Jean's sva/api common tree; new patches start with
[PATCH v4 10/22]
- V3
- Addressed thorough review comments from Eric Auger (Thank you!)
- Moved IOASID allocator from driver core to IOMMU code per
suggestion by Christoph Hellwig
(https://lkml.org/lkml/2019/4/26/462)
- Rebased on top of Jean's SVA API branch and Eric's v7[1]
(git://linux-arm.org/linux-jpb.git sva/api)
- All IOMMU APIs are unmodified (except the new bind guest PASID
call in patch 9/16)
- V2
- Rebased on Joerg's IOMMU x86/vt-d branch v5.1-rc4
- Integrated with Eric Auger's new v7 series for common APIs
(https://github.com/eauger/linux/tree/v5.1-rc3-2stage-v7)
- Addressed review comments from Andy Shevchenko and Alex Williamson on
IOASID custom allocator.
- Support multiple custom IOASID allocators (vIOMMUs) and dynamic
registration.
Jacob Pan (9):
iommu/vt-d: Move domain helper to header
iommu/uapi: Define a mask for bind data
iommu/vt-d: Use a helper function to skip agaw for SL
iommu/vt-d: Add nested translation helper function
iommu/vt-d: Add bind guest PASID support
iommu/vt-d: Support flushing more translation cache types
iommu/vt-d: Add svm/sva invalidate function
iommu/vt-d: Cache virtual command capability register
iommu/vt-d: Add custom allocator for IOASID
Lu Baolu (1):
iommu/vt-d: Enlightened PASID allocation
drivers/iommu/dmar.c | 37 +++++
drivers/iommu/intel-iommu.c | 277 ++++++++++++++++++++++++++++++++----
drivers/iommu/intel-pasid.c | 340 ++++++++++++++++++++++++++++++++++++++++++--
drivers/iommu/intel-pasid.h | 25 +++-
drivers/iommu/intel-svm.c | 206 +++++++++++++++++++++++++++
include/linux/intel-iommu.h | 70 ++++++++-
include/linux/intel-svm.h | 17 +++
include/uapi/linux/iommu.h | 5 +-
8 files changed, 925 insertions(+), 52 deletions(-)
--
2.7.4
Memory type related flags can be grouped together for one simple check.
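For illustration only (this sketch is not part of the patch, and the
exact flags field name of the VT-d bind data is an assumption here), a
consumer such as the nested translation helper added later in this
series could then test all memory type related flags in one go:

        /* bind_data is a hypothetical struct iommu_gpasid_bind_data pointer */
        if (bind_data->vtd.flags & IOMMU_SVA_VTD_GPASID_MTS_MASK) {
                /* program the memory type fields (EMT/PAT/CD/PCD/PWT)
                 * of the PASID entry from the guest-provided values
                 */
        }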
---
v9 renamed from EMT to MTS since these are memory type support flags.
---
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
include/uapi/linux/iommu.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 4ad3496e5c43..d7bcbc5f79b0 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -284,7 +284,10 @@ struct iommu_gpasid_bind_data_vtd {
__u32 pat;
__u32 emt;
};
-
+#define IOMMU_SVA_VTD_GPASID_MTS_MASK (IOMMU_SVA_VTD_GPASID_CD | \
+ IOMMU_SVA_VTD_GPASID_EMTE | \
+ IOMMU_SVA_VTD_GPASID_PCD | \
+ IOMMU_SVA_VTD_GPASID_PWT)
/**
* struct iommu_gpasid_bind_data - Information about device and guest PASID binding
* @version: Version of this data structure
--
2.7.4
When Shared Virtual Address (SVA) is enabled for a guest OS via a
vIOMMU, we need to provide invalidation support at the IOMMU API and
driver level. This patch adds an Intel VT-d specific function to
implement the IOMMU passdown invalidate API for shared virtual address.
The use case is to support caching structure invalidation for assigned
SVM-capable devices. The emulated IOMMU exposes queued invalidation
capability and passes down all descriptors from the guest to the
physical IOMMU.
The assumption is that the guest-to-host device ID mapping is resolved
prior to calling the IOMMU driver. Based on the device handle, the host
IOMMU driver can replace certain fields before submitting to the
invalidation queue.
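For illustration only (not part of this patch, local variable names are
made up), a caller such as VFIO trapping a guest page-selective IOTLB
invalidation within a PASID might populate the generic structure along
these lines before the passdown call:

        struct iommu_cache_invalidate_info inv_info = { 0 };

        inv_info.version     = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
        inv_info.cache       = IOMMU_CACHE_INV_TYPE_IOTLB;
        inv_info.granularity = IOMMU_INV_GRANU_ADDR;

        inv_info.addr_info.flags        = IOMMU_INV_ADDR_FLAGS_PASID;
        inv_info.addr_info.pasid        = hpasid;   /* host PASID after G->H translation */
        inv_info.addr_info.addr         = iova;     /* guest IOVA to invalidate */
        inv_info.addr_info.granule_size = SZ_4K;
        inv_info.addr_info.nb_granules  = nr_pages;

The host driver added below then converts the generic (cache type,
granularity) pair into the VT-d queued invalidation encoding and submits
the corresponding descriptors.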
---
v11 - Removed 2D map array, use -EINVAL in granularity lookup array.
Fixed devTLB invalidation granularity mapping. Disregard G=1 case
and use address selective invalidation only.
v7 review fixed in v10
---
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Ashok Raj <[email protected]>
Signed-off-by: Liu, Yi L <[email protected]>
---
drivers/iommu/intel-iommu.c | 158 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 158 insertions(+)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 94c7993dac6a..045c5c08d71d 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5594,6 +5594,163 @@ static void intel_iommu_aux_detach_device(struct iommu_domain *domain,
aux_domain_remove_dev(to_dmar_domain(domain), dev);
}
+/*
+ * 2D array for converting and sanitizing IOMMU generic TLB granularity to
+ * VT-d granularity. Invalidation is typically included in the unmap operation
+ * as a result of DMA or VFIO unmap. However, for assigned devices guest
+ * owns the first level page tables. Invalidations of translation caches in the
+ * guest are trapped and passed down to the host.
+ *
+ * vIOMMU in the guest will only expose first level page tables, therefore
+ * we do not support IOTLB granularity for request without PASID (second level).
+ *
+ * For example, to find the VT-d granularity encoding for IOTLB
+ * type and page selective granularity within PASID:
+ * X: indexed by iommu cache type
+ * Y: indexed by enum iommu_inv_granularity
+ * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
+ */
+
+const static int inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
+ /*
+ * PASID based IOTLB invalidation: PASID selective (per PASID),
+ * page selective (address granularity)
+ */
+ {-EINVAL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
+ /* PASID based dev TLBs */
+ {-EINVAL, -EINVAL, QI_DEV_IOTLB_GRAN_PASID_SEL},
+ /* PASID cache */
+ {-EINVAL, -EINVAL, -EINVAL}
+};
+
+static inline int to_vtd_granularity(int type, int granu)
+{
+ return inv_type_granu_table[type][granu];
+}
+
+static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
+{
+ u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
+
+ /* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
+ * IOMMU cache invalidate API passes granu_size in bytes, and number of
+ * granu size in contiguous memory.
+ */
+ return order_base_2(nr_pages);
+}
+
+#ifdef CONFIG_INTEL_IOMMU_SVM
+static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
+ struct device *dev, struct iommu_cache_invalidate_info *inv_info)
+{
+ struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+ struct device_domain_info *info;
+ struct intel_iommu *iommu;
+ unsigned long flags;
+ int cache_type;
+ u8 bus, devfn;
+ u16 did, sid;
+ int ret = 0;
+ u64 size = 0;
+
+ if (!inv_info || !dmar_domain ||
+ inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
+ return -EINVAL;
+
+ if (!dev || !dev_is_pci(dev))
+ return -ENODEV;
+
+ iommu = device_to_iommu(dev, &bus, &devfn);
+ if (!iommu)
+ return -ENODEV;
+
+ spin_lock_irqsave(&device_domain_lock, flags);
+ spin_lock(&iommu->lock);
+ info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
+ if (!info) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+ did = dmar_domain->iommu_did[iommu->seq_id];
+ sid = PCI_DEVID(bus, devfn);
+
+ /* Size is only valid in non-PASID selective invalidation */
+ if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
+ size = to_vtd_size(inv_info->addr_info.granule_size,
+ inv_info->addr_info.nb_granules);
+
+ for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache, IOMMU_CACHE_INV_TYPE_NR) {
+ int granu = 0;
+ u64 pasid = 0;
+
+ granu = to_vtd_granularity(cache_type, inv_info->granularity);
+ if (granu == -EINVAL) {
+ pr_err("Invalid cache type and granu combination %d/%d\n", cache_type,
+ inv_info->granularity);
+ break;
+ }
+
+ /* PASID is stored in different locations based on granularity */
+ if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
+ inv_info->pasid_info.flags & IOMMU_INV_PASID_FLAGS_PASID)
+ pasid = inv_info->pasid_info.pasid;
+ else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
+ inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID)
+ pasid = inv_info->addr_info.pasid;
+
+ switch (BIT(cache_type)) {
+ case IOMMU_CACHE_INV_TYPE_IOTLB:
+ if ((inv_info->granularity == IOMMU_INV_GRANU_ADDR) &&
+ size && (inv_info->addr_info.addr & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
+ pr_err("Address out of range, 0x%llx, size order %llu\n",
+ inv_info->addr_info.addr, size);
+ ret = -ERANGE;
+ goto out_unlock;
+ }
+
+ qi_flush_piotlb(iommu, did,
+ pasid,
+ mm_to_dma_pfn(inv_info->addr_info.addr),
+ (granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
+ inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_LEAF);
+
+ /*
+ * Always flush device IOTLB if ATS is enabled. vIOMMU
+ * in the guest may assume IOTLB flush is inclusive,
+ * which is more efficient.
+ */
+ if (info->ats_enabled)
+ qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
+ pasid, info->ats_qdep,
+ inv_info->addr_info.addr, size,
+ granu);
+ break;
+ case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
+ if (info->ats_enabled)
+ qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
+ inv_info->addr_info.pasid, info->ats_qdep,
+ inv_info->addr_info.addr, size,
+ granu);
+ else
+ pr_warn("Passdown device IOTLB flush w/o ATS!\n");
+ break;
+ case IOMMU_CACHE_INV_TYPE_PASID:
+ qi_flush_pasid_cache(iommu, did, granu, inv_info->pasid_info.pasid);
+ break;
+ default:
+ dev_err(dev, "Unsupported IOMMU invalidation type %d\n",
+ cache_type);
+ ret = -EINVAL;
+ }
+ }
+out_unlock:
+ spin_unlock(&iommu->lock);
+ spin_unlock_irqrestore(&device_domain_lock, flags);
+
+ return ret;
+}
+#endif
+
static int intel_iommu_map(struct iommu_domain *domain,
unsigned long iova, phys_addr_t hpa,
size_t size, int iommu_prot, gfp_t gfp)
@@ -6179,6 +6336,7 @@ const struct iommu_ops intel_iommu_ops = {
.is_attach_deferred = intel_iommu_is_attach_deferred,
.pgsize_bitmap = INTEL_IOMMU_PGSIZES,
#ifdef CONFIG_INTEL_IOMMU_SVM
+ .cache_invalidate = intel_iommu_sva_invalidate,
.sva_bind_gpasid = intel_svm_bind_gpasid,
.sva_unbind_gpasid = intel_svm_unbind_gpasid,
#endif
--
2.7.4
When supporting guest SVA with an emulated IOMMU, the guest PASID
table is shadowed in the VMM. Updates to the guest vIOMMU PASID table
result in PASID cache flushes, which are passed down to
the host as bind guest PASID calls.
The SL page tables are harvested from the device's
default domain (requests w/o PASID), or from the aux domain in the
case of a mediated device.
.-------------. .---------------------------.
| vIOMMU | | Guest process CR3, FL only|
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush -
'-------------' |
| | V
| | CR3 in GPA
'-------------'
Guest
------| Shadow |--------------------------|--------
v v v
Host
.-------------. .----------------------.
| pIOMMU | | Bind FL for GVA-GPA |
| | '----------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.------------------------------.
| | |SL for GPA-HPA, default domain|
| | '------------------------------'
'-------------'
Where:
- FL = First level/stage one page tables
- SL = Second level/stage two page tables
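For illustration only (not part of this patch, local variable names are
made up), the VMM-facing path would fill the bind data roughly as
follows when the trapped guest PASID cache flush carries a present FL
entry, before invoking the sva_bind_gpasid() op added below:

        struct iommu_gpasid_bind_data data = { 0 };

        data.version    = IOMMU_GPASID_BIND_VERSION_1;
        data.format     = IOMMU_PASID_FORMAT_INTEL_VTD;
        data.flags      = IOMMU_SVA_GPASID_VAL;  /* guest PASID is provided */
        data.gpgd       = guest_cr3_gpa;         /* FL page table root in GPA */
        data.hpasid     = hpasid;                /* host PASID backing this bind */
        data.gpasid     = gpasid;                /* guest view of the PASID */
        data.addr_width = guest_address_width;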
---
v11: Fixed locking, avoid duplicated paging mode check, added helper to
free svm if device list is empty. Use rate limited error message since
the bind gpasid call comes from user space.
---
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Liu, Yi L <[email protected]>
---
drivers/iommu/intel-iommu.c | 4 +
drivers/iommu/intel-svm.c | 206 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/intel-iommu.h | 8 +-
include/linux/intel-svm.h | 17 ++++
4 files changed, 234 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index c0dadec5a6b3..94c7993dac6a 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -6178,6 +6178,10 @@ const struct iommu_ops intel_iommu_ops = {
.dev_disable_feat = intel_iommu_dev_disable_feat,
.is_attach_deferred = intel_iommu_is_attach_deferred,
.pgsize_bitmap = INTEL_IOMMU_PGSIZES,
+#ifdef CONFIG_INTEL_IOMMU_SVM
+ .sva_bind_gpasid = intel_svm_bind_gpasid,
+ .sva_unbind_gpasid = intel_svm_unbind_gpasid,
+#endif
};
static void quirk_iommu_igfx(struct pci_dev *dev)
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index d7f2a5358900..7cf711318b87 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -226,6 +226,212 @@ static LIST_HEAD(global_svm_list);
list_for_each_entry((sdev), &(svm)->devs, list) \
if ((d) != (sdev)->dev) {} else
+
+static inline void intel_svm_free_if_empty(struct intel_svm *svm, u64 pasid)
+{
+ if (list_empty(&svm->devs)) {
+ ioasid_set_data(pasid, NULL);
+ kfree(svm);
+ }
+}
+
+int intel_svm_bind_gpasid(struct iommu_domain *domain,
+ struct device *dev,
+ struct iommu_gpasid_bind_data *data)
+{
+ struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
+ struct dmar_domain *dmar_domain;
+ struct intel_svm_dev *sdev;
+ struct intel_svm *svm;
+ int ret = 0;
+
+ if (WARN_ON(!iommu) || !data)
+ return -EINVAL;
+
+ if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
+ data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
+ return -EINVAL;
+
+ if (dev_is_pci(dev)) {
+ /* VT-d supports devices with full 20 bit PASIDs only */
+ if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
+ return -EINVAL;
+ } else {
+ return -ENOTSUPP;
+ }
+
+ /*
+ * We only check host PASID range, we have no knowledge to check
+ * guest PASID range.
+ */
+ if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
+ return -EINVAL;
+
+ dmar_domain = to_dmar_domain(domain);
+
+ mutex_lock(&pasid_mutex);
+ svm = ioasid_find(NULL, data->hpasid, NULL);
+ if (IS_ERR(svm)) {
+ ret = PTR_ERR(svm);
+ goto out;
+ }
+
+ if (svm) {
+ /*
+ * If we found svm for the PASID, there must be at
+ * least one device bond, otherwise svm should be freed.
+ */
+ if (WARN_ON(list_empty(&svm->devs))) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ for_each_svm_dev(sdev, svm, dev) {
+ /* In case of multiple sub-devices of the same pdev
+ * assigned, we should allow multiple bind calls with
+ * the same PASID and pdev.
+ */
+ sdev->users++;
+ goto out;
+ }
+ } else {
+ /* We come here when PASID has never been bond to a device. */
+ svm = kzalloc(sizeof(*svm), GFP_KERNEL);
+ if (!svm) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ /* REVISIT: upper layer/VFIO can track host process that bind the PASID.
+ * ioasid_set = mm might be sufficient for vfio to check pasid VMM
+ * ownership. We can drop the following line once VFIO and IOASID set
+ * check is in place.
+ */
+ svm->mm = get_task_mm(current);
+ svm->pasid = data->hpasid;
+ if (data->flags & IOMMU_SVA_GPASID_VAL) {
+ svm->gpasid = data->gpasid;
+ svm->flags |= SVM_FLAG_GUEST_PASID;
+ }
+ ioasid_set_data(data->hpasid, svm);
+ INIT_LIST_HEAD_RCU(&svm->devs);
+ mmput(svm->mm);
+ }
+ sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
+ if (!sdev) {
+ /*
+ * If this is a new PASID that never bond to a device, then
+ * the device list must be empty which indicates struct svm
+ * was allocated in this function.
+ */
+ intel_svm_free_if_empty(svm, data->hpasid);
+ ret = -ENOMEM;
+ goto out;
+ }
+ sdev->dev = dev;
+ sdev->users = 1;
+
+ /* Set up device context entry for PASID if not enabled already */
+ ret = intel_iommu_enable_pasid(iommu, sdev->dev);
+ if (ret) {
+ dev_err_ratelimited(dev, "Failed to enable PASID capability\n");
+ kfree(sdev);
+ intel_svm_free_if_empty(svm, data->hpasid);
+ goto out;
+ }
+
+ /*
+ * PASID table is per device for better security. Therefore, for
+ * each bind of a new device even with an existing PASID, we need to
+ * call the nested mode setup function here.
+ */
+ spin_lock(&iommu->lock);
+ ret = intel_pasid_setup_nested(iommu,
+ dev,
+ (pgd_t *)data->gpgd,
+ data->hpasid,
+ &data->vtd,
+ dmar_domain,
+ data->addr_width);
+ if (ret) {
+ dev_err_ratelimited(dev, "Failed to set up PASID %llu in nested mode, Err %d\n",
+ data->hpasid, ret);
+ /*
+ * PASID entry should be in cleared state if nested mode
+ * set up failed. So we only need to clear IOASID tracking
+ * data such that free call will succeed.
+ */
+ kfree(sdev);
+ intel_svm_free_if_empty(svm, data->hpasid);
+ spin_unlock(&iommu->lock);
+ goto out;
+ }
+ spin_unlock(&iommu->lock);
+ svm->flags |= SVM_FLAG_GUEST_MODE;
+
+ init_rcu_head(&sdev->rcu);
+ list_add_rcu(&sdev->list, &svm->devs);
+ out:
+ mutex_unlock(&pasid_mutex);
+ return ret;
+}
+
+int intel_svm_unbind_gpasid(struct device *dev, int pasid)
+{
+ struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
+ struct intel_svm_dev *sdev;
+ struct intel_svm *svm;
+ int ret = -EINVAL;
+
+ if (WARN_ON(!iommu))
+ return -EINVAL;
+
+ mutex_lock(&pasid_mutex);
+ svm = ioasid_find(NULL, pasid, NULL);
+ if (!svm) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (IS_ERR(svm)) {
+ ret = PTR_ERR(svm);
+ goto out;
+ }
+
+ for_each_svm_dev(sdev, svm, dev) {
+ ret = 0;
+ sdev->users--;
+ if (!sdev->users) {
+ list_del_rcu(&sdev->list);
+ intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
+ /* TODO: Drain in flight PRQ for the PASID since it
+ * may get reused soon, we don't want to
+ * confuse with its previous life.
+ * intel_svm_drain_prq(dev, pasid);
+ */
+ kfree_rcu(sdev, rcu);
+
+ if (list_empty(&svm->devs)) {
+ /*
+ * We do not free the IOASID here in that
+ * IOMMU driver did not allocate it.
+ * Unlike native SVM, IOASID for guest use was
+ * allocated prior to the bind call.
+ * In any case, if the free call comes before
+ * the unbind, IOMMU driver will get notified
+ * and perform cleanup.
+ */
+ ioasid_set_data(pasid, NULL);
+ kfree(svm);
+ }
+ }
+ break;
+ }
+out:
+ mutex_unlock(&pasid_mutex);
+
+ return ret;
+}
+
int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
{
struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 6da03f627ba3..a5bd53cf190c 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -706,7 +706,9 @@ struct dmar_domain *find_domain(struct device *dev);
extern void intel_svm_check(struct intel_iommu *iommu);
extern int intel_svm_enable_prq(struct intel_iommu *iommu);
extern int intel_svm_finish_prq(struct intel_iommu *iommu);
-
+extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
+ struct device *dev, struct iommu_gpasid_bind_data *data);
+extern int intel_svm_unbind_gpasid(struct device *dev, int pasid);
struct svm_dev_ops;
struct intel_svm_dev {
@@ -723,9 +725,13 @@ struct intel_svm_dev {
struct intel_svm {
struct mmu_notifier notifier;
struct mm_struct *mm;
+
struct intel_iommu *iommu;
int flags;
int pasid;
+ int gpasid; /* Guest PASID in case of vSVA bind with non-identity host
+ * to guest PASID mapping.
+ */
struct list_head devs;
struct list_head list;
};
diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
index d7c403d0dd27..c19690937540 100644
--- a/include/linux/intel-svm.h
+++ b/include/linux/intel-svm.h
@@ -44,6 +44,23 @@ struct svm_dev_ops {
* do such IOTLB flushes automatically.
*/
#define SVM_FLAG_SUPERVISOR_MODE (1<<1)
+/*
+ * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind to a device.
+ * In this case the mm_struct is in the guest kernel or userspace, its life
+ * cycle is managed by VMM and VFIO layer. For IOMMU driver, this API provides
+ * means to bind/unbind guest CR3 with PASIDs allocated for a device.
+ */
+#define SVM_FLAG_GUEST_MODE (1<<2)
+/*
+ * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own PASID space,
+ * which requires guest and host PASID translation at both directions. We keep
+ * track of guest PASID in order to provide lookup service to device drivers.
+ * One such example is a physical function (PF) driver that supports mediated
+ * device (mdev) assignment. Guest programming of mdev configuration space can
+ * only be done with guest PASID, therefore PF driver needs to find the matching
+ * host PASID to program the real hardware.
+ */
+#define SVM_FLAG_GUEST_PASID (1<<3)
#ifdef CONFIG_INTEL_IOMMU_SVM
--
2.7.4
Virtual command registers are used in the guest only. To avoid the
vmexit cost, we cache the capability and store it during initialization.
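For illustration (not part of this patch), a later consumer can then
test the cached value instead of reading DMAR_VCCAP_REG each time, for
example when deciding whether to set up the custom IOASID allocator
added later in this series:

        if (vccap_pasid(iommu->vccap))
                /* PASID allocation via the virtual command interface
                 * is supported; hypothetical helper name below.
                 */
                register_custom_pasid_allocator(iommu);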
Signed-off-by: Jacob Pan <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
---
v7 Reviewed by Eric & Baolu
---
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/dmar.c | 1 +
include/linux/intel-iommu.h | 5 +++++
2 files changed, 6 insertions(+)
diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 4d6b7b5b37ee..3b36491c8bbb 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -963,6 +963,7 @@ static int map_iommu(struct intel_iommu *iommu, u64 phys_addr)
warn_invalid_dmar(phys_addr, " returns all ones");
goto unmap;
}
+ iommu->vccap = dmar_readq(iommu->reg + DMAR_VCCAP_REG);
/* the registers might be more than one page */
map_size = max_t(int, ecap_max_iotlb_offset(iommu->ecap),
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index f775bb825343..e02a96848586 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -194,6 +194,9 @@
#define ecap_max_handle_mask(e) ((e >> 20) & 0xf)
#define ecap_sc_support(e) ((e >> 7) & 0x1) /* Snooping Control */
+/* Virtual command interface capability */
+#define vccap_pasid(v) ((v & DMA_VCS_PAS)) /* PASID allocation */
+
/* IOTLB_REG */
#define DMA_TLB_FLUSH_GRANU_OFFSET 60
#define DMA_TLB_GLOBAL_FLUSH (((u64)1) << 60)
@@ -287,6 +290,7 @@
/* PRS_REG */
#define DMA_PRS_PPR ((u32)1)
+#define DMA_VCS_PAS ((u64)1)
#define IOMMU_WAIT_OP(iommu, offset, op, cond, sts) \
do { \
@@ -562,6 +566,7 @@ struct intel_iommu {
u64 reg_size; /* size of hw register set */
u64 cap;
u64 ecap;
+ u64 vccap;
u32 gcmd; /* Holds TE, EAFL. Don't need SRTP, SFL, WBF */
raw_spinlock_t register_lock; /* protect register handling */
int seq_id; /* sequence id of the iommu */
--
2.7.4
On 2020/4/4 2:42, Jacob Pan wrote:
> Shared virtual address (SVA), a.k.a, Shared virtual memory (SVM) on Intel
> platforms allow address space sharing between device DMA and applications.
> SVA can reduce programming complexity and enhance security.
> This series is intended to enable SVA virtualization, i.e. enable use of SVA
> within a guest user application.
>
> This is the remaining portion of the original patchset that is based on
> Joerg's x86/vt-d branch. The preparatory and cleanup patches are merged here.
> (git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git)
With Eric's reviewed-by added, all patches in this series have been
piled up for v5.7.
Thank you all!
Best regards,
baolu
On Fri, Apr 03, 2020 at 11:42:06AM -0700, Jacob Pan wrote:
> Memory type related flags can be grouped together for one simple check.
>
> ---
> v9 renamed from EMT to MTS since these are memory type support flags.
> ---
>
> Reviewed-by: Eric Auger <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> include/uapi/linux/iommu.h | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 4ad3496e5c43..d7bcbc5f79b0 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -284,7 +284,10 @@ struct iommu_gpasid_bind_data_vtd {
> __u32 pat;
> __u32 emt;
> };
> -
> +#define IOMMU_SVA_VTD_GPASID_MTS_MASK (IOMMU_SVA_VTD_GPASID_CD | \
> + IOMMU_SVA_VTD_GPASID_EMTE | \
> + IOMMU_SVA_VTD_GPASID_PCD | \
> + IOMMU_SVA_VTD_GPASID_PWT)
Where and how will this be used? Can you add this in the patch that
actually makes use of it?
Also please add newlines before and after that define.
Regards,
Joerg
On Wed, 8 Apr 2020 15:07:22 +0200
Joerg Roedel <[email protected]> wrote:
> On Fri, Apr 03, 2020 at 11:42:06AM -0700, Jacob Pan wrote:
> > Memory type related flags can be grouped together for one simple
> > check.
> >
> > ---
> > v9 renamed from EMT to MTS since these are memory type support
> > flags. ---
> >
> > Reviewed-by: Eric Auger <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > include/uapi/linux/iommu.h | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > index 4ad3496e5c43..d7bcbc5f79b0 100644
> > --- a/include/uapi/linux/iommu.h
> > +++ b/include/uapi/linux/iommu.h
> > @@ -284,7 +284,10 @@ struct iommu_gpasid_bind_data_vtd {
> > __u32 pat;
> > __u32 emt;
> > };
> > -
> > +#define IOMMU_SVA_VTD_GPASID_MTS_MASK
> > (IOMMU_SVA_VTD_GPASID_CD | \
> > + IOMMU_SVA_VTD_GPASID_EMTE
> > | \
> > + IOMMU_SVA_VTD_GPASID_PCD
> > | \
> > +
> > IOMMU_SVA_VTD_GPASID_PWT)
>
> Where and how will this be used? Can you add this in the patch that
> actually makes use of it?
>
They are used in a later patch when setting up the PASID entry for guest SVA.
[PATCH v11 04/10] iommu/vt-d: Add nested translation helper function
> Also please add newlines before and after that define.
>
Will do, thanks.
>
> Regards,
>
> Joerg
[Jacob Pan]
Hi Jacob,
On 4/3/20 8:42 PM, Jacob Pan wrote:
> When supporting guest SVA with emulated IOMMU, the guest PASID
> table is shadowed in VMM. Updates to guest vIOMMU PASID table
> will result in PASID cache flush which will be passed down to
> the host as bind guest PASID calls.
>
> For the SL page tables, it will be harvested from device's
> default domain (request w/o PASID), or aux domain in case of
> mediated device.
>
> .-------------. .---------------------------.
> | vIOMMU | | Guest process CR3, FL only|
> | | '---------------------------'
> .----------------/
> | PASID Entry |--- PASID cache flush -
> '-------------' |
> | | V
> | | CR3 in GPA
> '-------------'
> Guest
> ------| Shadow |--------------------------|--------
> v v v
> Host
> .-------------. .----------------------.
> | pIOMMU | | Bind FL for GVA-GPA |
> | | '----------------------'
> .----------------/ |
> | PASID Entry | V (Nested xlate)
> '----------------\.------------------------------.
> | | |SL for GPA-HPA, default domain|
> | | '------------------------------'
> '-------------'
> Where:
> - FL = First level/stage one page tables
> - SL = Second level/stage two page tables
>
> ---
> v11: Fixed locking, avoid duplicated paging mode check, added helper to
> free svm if device list is empty. Use rate limited error message since
> the bind gpasid call comes from user space.
> ---
>
> Signed-off-by: Jacob Pan <[email protected]>
> Signed-off-by: Liu, Yi L <[email protected]>
> ---
> drivers/iommu/intel-iommu.c | 4 +
> drivers/iommu/intel-svm.c | 206 ++++++++++++++++++++++++++++++++++++++++++++
> include/linux/intel-iommu.h | 8 +-
> include/linux/intel-svm.h | 17 ++++
> 4 files changed, 234 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index c0dadec5a6b3..94c7993dac6a 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -6178,6 +6178,10 @@ const struct iommu_ops intel_iommu_ops = {
> .dev_disable_feat = intel_iommu_dev_disable_feat,
> .is_attach_deferred = intel_iommu_is_attach_deferred,
> .pgsize_bitmap = INTEL_IOMMU_PGSIZES,
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> + .sva_bind_gpasid = intel_svm_bind_gpasid,
> + .sva_unbind_gpasid = intel_svm_unbind_gpasid,
> +#endif
> };
>
> static void quirk_iommu_igfx(struct pci_dev *dev)
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index d7f2a5358900..7cf711318b87 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -226,6 +226,212 @@ static LIST_HEAD(global_svm_list);
> list_for_each_entry((sdev), &(svm)->devs, list) \
> if ((d) != (sdev)->dev) {} else
>
> +
> +static inline void intel_svm_free_if_empty(struct intel_svm *svm, u64 pasid)
> +{
> + if (list_empty(&svm->devs)) {
> + ioasid_set_data(pasid, NULL);
> + kfree(svm);
> + }
> +}
> +
> +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> + struct device *dev,
> + struct iommu_gpasid_bind_data *data)
> +{
> + struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> + struct dmar_domain *dmar_domain;
> + struct intel_svm_dev *sdev;
> + struct intel_svm *svm;
> + int ret = 0;
> +
> + if (WARN_ON(!iommu) || !data)
> + return -EINVAL;
> +
> + if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> + data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> + return -EINVAL;
> +
> + if (dev_is_pci(dev)) {
> + /* VT-d supports devices with full 20 bit PASIDs only */
> + if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> + return -EINVAL;
> + } else {
> + return -ENOTSUPP;
> + }
> +
> + /*
> + * We only check host PASID range, we have no knowledge to check
> + * guest PASID range.
> + */
> + if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
> + return -EINVAL;
> +
> + dmar_domain = to_dmar_domain(domain);
> +
> + mutex_lock(&pasid_mutex);
> + svm = ioasid_find(NULL, data->hpasid, NULL);
> + if (IS_ERR(svm)) {
> + ret = PTR_ERR(svm);
> + goto out;
> + }
> +
> + if (svm) {
> + /*
> + * If we found svm for the PASID, there must be at
> + * least one device bond, otherwise svm should be freed.
> + */
> + if (WARN_ON(list_empty(&svm->devs))) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + for_each_svm_dev(sdev, svm, dev) {
> + /* In case of multiple sub-devices of the same pdev
> + * assigned, we should allow multiple bind calls with
> + * the same PASID and pdev.
> + */
> + sdev->users++;
What if this is not an mdev device. Is it also allowed?
> + goto out;
> + }
> + } else {
> + /* We come here when PASID has never been bond to a device. */
> + svm = kzalloc(sizeof(*svm), GFP_KERNEL);
> + if (!svm) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + /* REVISIT: upper layer/VFIO can track host process that bind the PASID.
> + * ioasid_set = mm might be sufficient for vfio to check pasid VMM
> + * ownership. We can drop the following line once VFIO and IOASID set
> + * check is in place.
> + */
> + svm->mm = get_task_mm(current);
> + svm->pasid = data->hpasid;
> + if (data->flags & IOMMU_SVA_GPASID_VAL) {
> + svm->gpasid = data->gpasid;
> + svm->flags |= SVM_FLAG_GUEST_PASID;
> + }
> + ioasid_set_data(data->hpasid, svm);
> + INIT_LIST_HEAD_RCU(&svm->devs);
> + mmput(svm->mm);
> + }
> + sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
> + if (!sdev) {
> + /*
> + * If this is a new PASID that never bond to a device, then
> + * the device list must be empty which indicates struct svm
> + * was allocated in this function.
> + */
> + intel_svm_free_if_empty(svm, data->hpasid);
> + ret = -ENOMEM;
> + goto out;
> + }
> + sdev->dev = dev;
> + sdev->users = 1;
> +
> + /* Set up device context entry for PASID if not enabled already */
> + ret = intel_iommu_enable_pasid(iommu, sdev->dev);
> + if (ret) {
> + dev_err_ratelimited(dev, "Failed to enable PASID capability\n");
> + kfree(sdev);
> + intel_svm_free_if_empty(svm, data->hpasid);
> + goto out;
> + }
> +
> + /*
> + * PASID table is per device for better security. Therefore, for
> + * each bind of a new device even with an existing PASID, we need to
> + * call the nested mode setup function here.
> + */
> + spin_lock(&iommu->lock);
> + ret = intel_pasid_setup_nested(iommu,
> + dev,
> + (pgd_t *)data->gpgd,
> + data->hpasid,
> + &data->vtd,
> + dmar_domain,
> + data->addr_width);
> + if (ret) {
> + dev_err_ratelimited(dev, "Failed to set up PASID %llu in nested mode, Err %d\n",
> + data->hpasid, ret);
> + /*
> + * PASID entry should be in cleared state if nested mode
> + * set up failed. So we only need to clear IOASID tracking
> + * data such that free call will succeed.
> + */
> + kfree(sdev);
> + intel_svm_free_if_empty(svm, data->hpasid);
> + spin_unlock(&iommu->lock);
> + goto out;
> + }
> + spin_unlock(&iommu->lock);
> + svm->flags |= SVM_FLAG_GUEST_MODE;
> +
> + init_rcu_head(&sdev->rcu);
> + list_add_rcu(&sdev->list, &svm->devs);
> + out:
> + mutex_unlock(&pasid_mutex);
> + return ret;
> +}
> +
> +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> +{
This function's code looks very similar to intel_svm_unbind_mm()
apart from the mmu_notifier_unregister and the ioasid_free(). Don't we
have means to factorize the code by checking the SVM_FLAG_GUEST_MODE flag?
Please could you explain again what happens if the guest process
dies/the VM dies. How do we make sure the svm gets freed? I fail to
understand the whole picture at the moment as you cannot fully rely on
the userspace to call unbind_gpasid.
> + struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> + struct intel_svm_dev *sdev;
> + struct intel_svm *svm;
> + int ret = -EINVAL;
> +
> + if (WARN_ON(!iommu))
> + return -EINVAL;
> +
> + mutex_lock(&pasid_mutex);
> + svm = ioasid_find(NULL, pasid, NULL);
> + if (!svm) {
> + ret = -EINVAL;
As per the discussion on the VFIO series, shall we return an error in
that case? Same below?
> + goto out;
> + }
> +
> + if (IS_ERR(svm)) {
> + ret = PTR_ERR(svm);
> + goto out;
> + }
> +
> + for_each_svm_dev(sdev, svm, dev) {
> + ret = 0;
> + sdev->users--;
> + if (!sdev->users) {
> + list_del_rcu(&sdev->list);
> + intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
intel_svm_unbind_mm() calls intel_flush_svm_range_dev(svm, sdev, 0, -1, 0);
Don't we need to flush the (DEV-)IOTLBs as well?
> + /* TODO: Drain in flight PRQ for the PASID since it
> + * may get reused soon, we don't want to
> + * confuse with its previous life.
> + * intel_svm_drain_prq(dev, pasid);
> + */
> + kfree_rcu(sdev, rcu);
> +
> + if (list_empty(&svm->devs)) {
> + /*
> + * We do not free the IOASID here in that
> + * IOMMU driver did not allocate it.
s/in/as?
> + * Unlike native SVM, IOASID for guest use was
> + * allocated prior to the bind call.
> + * In any case, if the free call comes before
> + * the unbind, IOMMU driver will get notified
> + * and perform cleanup.
> + */
> + ioasid_set_data(pasid, NULL);
> + kfree(svm);
> + }
nit: you may use intel_svm_free_if_empty()
> + }
> + break;
> + }
> +out:
> + mutex_unlock(&pasid_mutex);
> +
nit : spare new line
> + return ret;
> +}
> +
> int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
> {
> struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 6da03f627ba3..a5bd53cf190c 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -706,7 +706,9 @@ struct dmar_domain *find_domain(struct device *dev);
> extern void intel_svm_check(struct intel_iommu *iommu);
> extern int intel_svm_enable_prq(struct intel_iommu *iommu);
> extern int intel_svm_finish_prq(struct intel_iommu *iommu);
> -
> +extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
> + struct device *dev, struct iommu_gpasid_bind_data *data);
> +extern int intel_svm_unbind_gpasid(struct device *dev, int pasid);
> struct svm_dev_ops;
>
> struct intel_svm_dev {
> @@ -723,9 +725,13 @@ struct intel_svm_dev {
> struct intel_svm {
> struct mmu_notifier notifier;
> struct mm_struct *mm;
> +
> struct intel_iommu *iommu;
> int flags;
> int pasid;
> + int gpasid; /* Guest PASID in case of vSVA bind with non-identity host
> + * to guest PASID mapping.
> + */
> struct list_head devs;
> struct list_head list;
> };
> diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
> index d7c403d0dd27..c19690937540 100644
> --- a/include/linux/intel-svm.h
> +++ b/include/linux/intel-svm.h
> @@ -44,6 +44,23 @@ struct svm_dev_ops {
> * do such IOTLB flushes automatically.
> */
> #define SVM_FLAG_SUPERVISOR_MODE (1<<1)
> +/*
> + * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind to a device.
> + * In this case the mm_struct is in the guest kernel or userspace, its life
> + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this API provides
> + * means to bind/unbind guest CR3 with PASIDs allocated for a device.
> + */
> +#define SVM_FLAG_GUEST_MODE (1<<2)
> +/*
> + * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own PASID space,
> + * which requires guest and host PASID translation at both directions. We keep
> + * track of guest PASID in order to provide lookup service to device drivers.
> + * One such example is a physical function (PF) driver that supports mediated
> + * device (mdev) assignment. Guest programming of mdev configuration space can
> + * only be done with guest PASID, therefore PF driver needs to find the matching
> + * host PASID to program the real hardware.
> + */
> +#define SVM_FLAG_GUEST_PASID (1<<3)
>
> #ifdef CONFIG_INTEL_IOMMU_SVM
>
>
Thanks
Eric
Hi Jacob,
On 4/3/20 8:42 PM, Jacob Pan wrote:
> When Shared Virtual Address (SVA) is enabled for a guest OS via
> vIOMMU, we need to provide invalidation support at IOMMU API and driver
> level. This patch adds Intel VT-d specific function to implement
> iommu passdown invalidate API for shared virtual address.
>
> The use case is for supporting caching structure invalidation
> of assigned SVM capable devices. Emulated IOMMU exposes queue
> invalidation capability and passes down all descriptors from the guest
> to the physical IOMMU.
>
> The assumption is that guest to host device ID mapping should be
> resolved prior to calling IOMMU driver. Based on the device handle,
> host IOMMU driver can replace certain fields before submit to the
> invalidation queue.
>
> ---
> v11 - Removed 2D map array, use -EINVAL in granularity lookup array.
> Fixed devTLB invalidation granularity mapping. Disregard G=1 case
> and use address selective invalidation only.
>
> v7 review fixed in v10
> ---
>
> Signed-off-by: Jacob Pan <[email protected]>
> Signed-off-by: Ashok Raj <[email protected]>
> Signed-off-by: Liu, Yi L <[email protected]>
> ---
> drivers/iommu/intel-iommu.c | 158 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 158 insertions(+)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 94c7993dac6a..045c5c08d71d 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5594,6 +5594,163 @@ static void intel_iommu_aux_detach_device(struct iommu_domain *domain,
> aux_domain_remove_dev(to_dmar_domain(domain), dev);
> }
>
> +/*
> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> + * VT-d granularity. Invalidation is typically included in the unmap operation
> + * as a result of DMA or VFIO unmap. However, for assigned devices guest
> + * owns the first level page tables. Invalidations of translation caches in the
> + * guest are trapped and passed down to the host.
> + *
> + * vIOMMU in the guest will only expose first level page tables, therefore
> + * we do not support IOTLB granularity for request without PASID (second level).
> + *
> + * For example, to find the VT-d granularity encoding for IOTLB
> + * type and page selective granularity within PASID:
> + * X: indexed by iommu cache type
> + * Y: indexed by enum iommu_inv_granularity
> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> + */
> +
> +const static int inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> + /*
> + * PASID based IOTLB invalidation: PASID selective (per PASID),
> + * page selective (address granularity)
> + */
> + {-EINVAL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> + /* PASID based dev TLBs */
> + {-EINVAL, -EINVAL, QI_DEV_IOTLB_GRAN_PASID_SEL},
> + /* PASID cache */
> + {-EINVAL, -EINVAL, -EINVAL}
> +};
> +
> +static inline int to_vtd_granularity(int type, int granu)
> +{
> + return inv_type_granu_table[type][granu];
> +}
> +
> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> +{
> + u64 nr_pages = (granu_size * nr_granules) >> VTD_PAGE_SHIFT;
> +
> + /* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
> + * IOMMU cache invalidate API passes granu_size in bytes, and number of
> + * granu size in contiguous memory.
> + */
> + return order_base_2(nr_pages);
> +}
> +
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> + struct device *dev, struct iommu_cache_invalidate_info *inv_info)
> +{
> + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> + struct device_domain_info *info;
> + struct intel_iommu *iommu;
> + unsigned long flags;
> + int cache_type;
> + u8 bus, devfn;
> + u16 did, sid;
> + int ret = 0;
> + u64 size = 0;
> +
> + if (!inv_info || !dmar_domain ||
> + inv_info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> + return -EINVAL;
> +
> + if (!dev || !dev_is_pci(dev))
> + return -ENODEV;
Check (domain->flags & DOMAIN_FLAG_NESTING_MODE)?
> +
> + iommu = device_to_iommu(dev, &bus, &devfn);
> + if (!iommu)
> + return -ENODEV;
> +
> + spin_lock_irqsave(&device_domain_lock, flags);
> + spin_lock(&iommu->lock);
> + info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
> + if (!info) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> + did = dmar_domain->iommu_did[iommu->seq_id];
> + sid = PCI_DEVID(bus, devfn);
> +
> + /* Size is only valid in non-PASID selective invalidation */
> + if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> + size = to_vtd_size(inv_info->addr_info.granule_size,
> + inv_info->addr_info.nb_granules);
> +
> + for_each_set_bit(cache_type, (unsigned long *)&inv_info->cache, IOMMU_CACHE_INV_TYPE_NR) {
> + int granu = 0;
> + u64 pasid = 0;
> +
> + granu = to_vtd_granularity(cache_type, inv_info->granularity);
> + if (granu == -EINVAL) {
> + pr_err("Invalid cache type and granu combination %d/%d\n", cache_type,
> + inv_info->granularity);
These traces are not rate limited, here and after.
ret = -EINVAL?
also related to the discussion we had on the VFIO series. What is the
error policy?
> + break;
> + }
> +
> + /* PASID is stored in different locations based on granularity */
not sure the above comment is really needed
> + if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
> + inv_info->pasid_info.flags & IOMMU_INV_PASID_FLAGS_PASID)
> + pasid = inv_info->pasid_info.pasid;
> + else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
> + inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID)
> + pasid = inv_info->addr_info.pasid;
> +
> + switch (BIT(cache_type)) {
> + case IOMMU_CACHE_INV_TYPE_IOTLB:
> + if ((inv_info->granularity == IOMMU_INV_GRANU_ADDR) &&
> + size && (inv_info->addr_info.addr & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> + pr_err("Address out of range, 0x%llx, size order %llu\n",
> + inv_info->addr_info.addr, size);
> + ret = -ERANGE;
> + goto out_unlock;
> + }
> +
> + qi_flush_piotlb(iommu, did,
> + pasid,
> + mm_to_dma_pfn(inv_info->addr_info.addr),
This does not sound correct to me:
inv_info->addr_info.addr and inv_info->addr_info.flags are not valid in
case of PASID-selective invalidation
> + (granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> + inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_LEAF);
> +
> + /*
> + * Always flush device IOTLB if ATS is enabled. vIOMMU
> + * in the guest may assume IOTLB flush is inclusive,
> + * which is more efficient.
> + */
> + if (info->ats_enabled)
> + qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> + pasid, info->ats_qdep,
> + inv_info->addr_info.addr, size,
same
> + granu);
> + break;
> + case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> + if (info->ats_enabled)
> + qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> + inv_info->addr_info.pasid, info->ats_qdep,
nit: use pasid directly
> + inv_info->addr_info.addr, size,
> + granu);
> + else
> + pr_warn("Passdown device IOTLB flush w/o ATS!\n");
> + break;
> + case IOMMU_CACHE_INV_TYPE_PASID:
> + qi_flush_pasid_cache(iommu, did, granu, inv_info->pasid_info.pasid);
> + break;
> + default:
> + dev_err(dev, "Unsupported IOMMU invalidation type %d\n",
> + cache_type);
> + ret = -EINVAL;
> + }
> + }
> +out_unlock:
> + spin_unlock(&iommu->lock);
> + spin_unlock_irqrestore(&device_domain_lock, flags);
> +
> + return ret;
> +}
> +#endif
> +
> static int intel_iommu_map(struct iommu_domain *domain,
> unsigned long iova, phys_addr_t hpa,
> size_t size, int iommu_prot, gfp_t gfp)
> @@ -6179,6 +6336,7 @@ const struct iommu_ops intel_iommu_ops = {
> .is_attach_deferred = intel_iommu_is_attach_deferred,
> .pgsize_bitmap = INTEL_IOMMU_PGSIZES,
> #ifdef CONFIG_INTEL_IOMMU_SVM
> + .cache_invalidate = intel_iommu_sva_invalidate,
> .sva_bind_gpasid = intel_svm_bind_gpasid,
> .sva_unbind_gpasid = intel_svm_unbind_gpasid,
> #endif
>
Thanks
Eric
Hi Jacob,
On 4/3/20 8:42 PM, Jacob Pan wrote:
> Virtual command registers are used in the guest only, to prevent
> vmexit cost, we cache the capability and store it during initialization.
>
> Signed-off-by: Jacob Pan <[email protected]>
> Reviewed-by: Eric Auger <[email protected]>
> Reviewed-by: Lu Baolu <[email protected]>
nit: following Joerg's comment on 02/10, may be worth squashing it into
the patch that uses it (10/10)
Thanks
Eric
>
> ---
> v7 Reviewed by Eric & Baolu
> ---
>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/dmar.c | 1 +
> include/linux/intel-iommu.h | 5 +++++
> 2 files changed, 6 insertions(+)
>
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index 4d6b7b5b37ee..3b36491c8bbb 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -963,6 +963,7 @@ static int map_iommu(struct intel_iommu *iommu, u64 phys_addr)
> warn_invalid_dmar(phys_addr, " returns all ones");
> goto unmap;
> }
> + iommu->vccap = dmar_readq(iommu->reg + DMAR_VCCAP_REG);
>
> /* the registers might be more than one page */
> map_size = max_t(int, ecap_max_iotlb_offset(iommu->ecap),
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index f775bb825343..e02a96848586 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -194,6 +194,9 @@
> #define ecap_max_handle_mask(e) ((e >> 20) & 0xf)
> #define ecap_sc_support(e) ((e >> 7) & 0x1) /* Snooping Control */
>
> +/* Virtual command interface capability */
> +#define vccap_pasid(v) ((v & DMA_VCS_PAS)) /* PASID allocation */
> +
> /* IOTLB_REG */
> #define DMA_TLB_FLUSH_GRANU_OFFSET 60
> #define DMA_TLB_GLOBAL_FLUSH (((u64)1) << 60)
> @@ -287,6 +290,7 @@
>
> /* PRS_REG */
> #define DMA_PRS_PPR ((u32)1)
> +#define DMA_VCS_PAS ((u64)1)
>
> #define IOMMU_WAIT_OP(iommu, offset, op, cond, sts) \
> do { \
> @@ -562,6 +566,7 @@ struct intel_iommu {
> u64 reg_size; /* size of hw register set */
> u64 cap;
> u64 ecap;
> + u64 vccap;
> u32 gcmd; /* Holds TE, EAFL. Don't need SRTP, SFL, WBF */
> raw_spinlock_t register_lock; /* protect register handling */
> int seq_id; /* sequence id of the iommu */
>
Hi Eric,
On Thu, 9 Apr 2020 09:41:32 +0200
Auger Eric <[email protected]> wrote:
> Hi Jacob,
>
> On 4/3/20 8:42 PM, Jacob Pan wrote:
> > When supporting guest SVA with emulated IOMMU, the guest PASID
> > table is shadowed in VMM. Updates to guest vIOMMU PASID table
> > will result in PASID cache flush which will be passed down to
> > the host as bind guest PASID calls.
> >
> > For the SL page tables, it will be harvested from device's
> > default domain (request w/o PASID), or aux domain in case of
> > mediated device.
> >
> > .-------------. .---------------------------.
> > | vIOMMU | | Guest process CR3, FL only|
> > | | '---------------------------'
> > .----------------/
> > | PASID Entry |--- PASID cache flush -
> > '-------------' |
> > | | V
> > | | CR3 in GPA
> > '-------------'
> > Guest
> > ------| Shadow |--------------------------|--------
> > v v v
> > Host
> > .-------------. .----------------------.
> > | pIOMMU | | Bind FL for GVA-GPA |
> > | | '----------------------'
> > .----------------/ |
> > | PASID Entry | V (Nested xlate)
> > '----------------\.------------------------------.
> > | | |SL for GPA-HPA, default domain|
> > | | '------------------------------'
> > '-------------'
> > Where:
> > - FL = First level/stage one page tables
> > - SL = Second level/stage two page tables
> >
> > ---
> > v11: Fixed locking, avoid duplicated paging mode check, added
> > helper to free svm if device list is empty. Use rate limited error
> > message since the bind gpasid call comes from user space.
> > ---
> >
> > Signed-off-by: Jacob Pan <[email protected]>
> > Signed-off-by: Liu, Yi L <[email protected]>
> > ---
> > drivers/iommu/intel-iommu.c | 4 +
> > drivers/iommu/intel-svm.c | 206
> > ++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/intel-iommu.h | 8 +- include/linux/intel-svm.h |
> > 17 ++++ 4 files changed, 234 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/intel-iommu.c
> > b/drivers/iommu/intel-iommu.c index c0dadec5a6b3..94c7993dac6a
> > 100644 --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -6178,6 +6178,10 @@ const struct iommu_ops intel_iommu_ops = {
> > .dev_disable_feat = intel_iommu_dev_disable_feat,
> > .is_attach_deferred =
> > intel_iommu_is_attach_deferred, .pgsize_bitmap =
> > INTEL_IOMMU_PGSIZES, +#ifdef CONFIG_INTEL_IOMMU_SVM
> > + .sva_bind_gpasid = intel_svm_bind_gpasid,
> > + .sva_unbind_gpasid = intel_svm_unbind_gpasid,
> > +#endif
> > };
> >
> > static void quirk_iommu_igfx(struct pci_dev *dev)
> > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > index d7f2a5358900..7cf711318b87 100644
> > --- a/drivers/iommu/intel-svm.c
> > +++ b/drivers/iommu/intel-svm.c
> > @@ -226,6 +226,212 @@ static LIST_HEAD(global_svm_list);
> > list_for_each_entry((sdev), &(svm)->devs, list) \
> > if ((d) != (sdev)->dev) {} else
> >
> > +
> > +static inline void intel_svm_free_if_empty(struct intel_svm *svm,
> > u64 pasid) +{
> > + if (list_empty(&svm->devs)) {
> > + ioasid_set_data(pasid, NULL);
> > + kfree(svm);
> > + }
> > +}
> > +
> > +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > + struct device *dev,
> > + struct iommu_gpasid_bind_data *data)
> > +{
> > + struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > + struct dmar_domain *dmar_domain;
> > + struct intel_svm_dev *sdev;
> > + struct intel_svm *svm;
> > + int ret = 0;
> > +
> > + if (WARN_ON(!iommu) || !data)
> > + return -EINVAL;
> > +
> > + if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> > + data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> > + return -EINVAL;
> > +
> > + if (dev_is_pci(dev)) {
> > + /* VT-d supports devices with full 20 bit PASIDs
> > only */
> > + if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> > + return -EINVAL;
> > + } else {
> > + return -ENOTSUPP;
> > + }
> > +
> > + /*
> > + * We only check host PASID range, we have no knowledge to
> > check
> > + * guest PASID range.
> > + */
> > + if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
> > + return -EINVAL;
> > +
> > + dmar_domain = to_dmar_domain(domain);
> > +
> > + mutex_lock(&pasid_mutex);
> > + svm = ioasid_find(NULL, data->hpasid, NULL);
> > + if (IS_ERR(svm)) {
> > + ret = PTR_ERR(svm);
> > + goto out;
> > + }
> > +
> > + if (svm) {
> > + /*
> > + * If we found svm for the PASID, there must be at
> > + * least one device bond, otherwise svm should be
> > freed.
> > + */
> > + if (WARN_ON(list_empty(&svm->devs))) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + for_each_svm_dev(sdev, svm, dev) {
> > + /* In case of multiple sub-devices of the
> > same pdev
> > + * assigned, we should allow multiple bind
> > calls with
> > + * the same PASID and pdev.
> > + */
> > + sdev->users++;
> What if this is not an mdev device. Is it also allowed?
Yes. The IOMMU and VT-d drivers are not mdev aware. Here mdev is just an
example of a normal use case. You can bind the same PCI device (PF or
SRIOV VF) more than once to the same PASID; you just need to unbind as
well. Do you see any issues?
> > + goto out;
> > + }
> > + } else {
> > + /* We come here when PASID has never been bond to
> > a device. */
> > + svm = kzalloc(sizeof(*svm), GFP_KERNEL);
> > + if (!svm) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > + /* REVISIT: upper layer/VFIO can track host
> > process that bind the PASID.
> > + * ioasid_set = mm might be sufficient for vfio to
> > check pasid VMM
> > + * ownership. We can drop the following line once
> > VFIO and IOASID set
> > + * check is in place.
> > + */
> > + svm->mm = get_task_mm(current);
> > + svm->pasid = data->hpasid;
> > + if (data->flags & IOMMU_SVA_GPASID_VAL) {
> > + svm->gpasid = data->gpasid;
> > + svm->flags |= SVM_FLAG_GUEST_PASID;
> > + }
> > + ioasid_set_data(data->hpasid, svm);
> > + INIT_LIST_HEAD_RCU(&svm->devs);
> > + mmput(svm->mm);
> > + }
> > + sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
> > + if (!sdev) {
> > + /*
> > + * If this is a new PASID that never bond to a
> > device, then
> > + * the device list must be empty which indicates
> > struct svm
> > + * was allocated in this function.
> > + */
> > + intel_svm_free_if_empty(svm, data->hpasid);
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > + sdev->dev = dev;
> > + sdev->users = 1;
> > +
> > + /* Set up device context entry for PASID if not enabled
> > already */
> > + ret = intel_iommu_enable_pasid(iommu, sdev->dev);
> > + if (ret) {
> > + dev_err_ratelimited(dev, "Failed to enable PASID
> > capability\n");
> > + kfree(sdev);
> > + intel_svm_free_if_empty(svm, data->hpasid);
> > + goto out;
> > + }
> > +
> > + /*
> > + * PASID table is per device for better security.
> > Therefore, for
> > + * each bind of a new device even with an existing PASID,
> > we need to
> > + * call the nested mode setup function here.
> > + */
> > + spin_lock(&iommu->lock);
> > + ret = intel_pasid_setup_nested(iommu,
> > + dev,
> > + (pgd_t *)data->gpgd,
> > + data->hpasid,
> > + &data->vtd,
> > + dmar_domain,
> > + data->addr_width);
> > + if (ret) {
> > + dev_err_ratelimited(dev, "Failed to set up PASID
> > %llu in nested mode, Err %d\n",
> > + data->hpasid, ret);
> > + /*
> > + * PASID entry should be in cleared state if
> > nested mode
> > + * set up failed. So we only need to clear IOASID
> > tracking
> > + * data such that free call will succeed.
> > + */
> > + kfree(sdev);
> > + intel_svm_free_if_empty(svm, data->hpasid);
> > + spin_unlock(&iommu->lock);
> > + goto out;
> > + }
> > + spin_unlock(&iommu->lock);
> > + svm->flags |= SVM_FLAG_GUEST_MODE;
> > +
> > + init_rcu_head(&sdev->rcu);
> > + list_add_rcu(&sdev->list, &svm->devs);
> > + out:
> > + mutex_unlock(&pasid_mutex);
> > + return ret;
> > +}
> > +
> > +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> > +{
> This function's code looks very very similar to intel_svm_unbind_mm()
> besides to the mmu_notifier_unregister and the ioasid_free(). Don't we
> have means to factorize the code by checking the SVM_FLAG_GUEST_MODE
> flag?
>
Yes, I agree there is room to consolidate. But at the same time there are
two potential changes coming:
1. I am also trying to consolidate with Jean's generic sva_bind_device()
for native bind.
https://lkml.org/lkml/2020/2/24/1351
2. Support enqcmd and lazy PASID free.
https://lkml.org/lkml/2020/3/30/910
Perhaps we can revisit after those get resolved?
> Please could you explain again what does happen if the guest process
> dies/VM dies. How do we make sure the svm get freed. I fail to
> understand the whole picture at the moment as you cannot fully rely on
> the userspace to call unbind_gpasid.
The fault handling needs to be better documented. Here is the
current plan, as I proposed in the IOASID extension set with the IOASID
notifier:
https://lkml.org/lkml/2020/3/25/874
I will write this up in the next version. In short, on VT-d, IOASID has
the following subscribers:
1. IOMMU driver, when free comes before unbind, since there is no way
to hook up with the guest mmu_notifier directly.
2. KVM, which needs to update the VMCS PASID translation table. This is
needed when we support non-identity G-H PASID mapping.
3. VDCM (virtual device composition module), which, similar to an SR-IOV
PF driver, also needs G-H PASID translation.
So when a guest goes away, VFIO, triggered by the FD close, needs to
free all the PASIDs belonging to the guest. The proposal is to call
ioasid_free_set(), where the ioasid_set contains all the IOASIDs
for the VM.
A blocking notifier will be called; the IOMMU driver gets notified and
performs all the cleanup on the IOMMU side as follows:
1. clear the PASID table entry; flush the PASID cache, IOTLB and devTLB;
drain page requests for the PASID. The same things you would do during
unbind.
2. mark the PASID defunct so that we can drop further DMA faults and PRQ.
For an individual guest PASID, the normal order is
1. alloc (via virtual cmd)
2. bind_gpasid
3. unbind_gpasid
4. free (via virtual cmd)
If 4 comes before 3, free will call the ioasid notifier. The IOMMU driver
gets notified and performs the cleanup as above. When unbind comes, there
is no PASID entry, so it simply exits.
One challenge is that the cleanup can be lengthy and requires thread
context, while the SVA code needs to call ioasid_free in atomic context
(mmu_notifier release, SRCU callback). With Jean's patch to defer
ioasid_free, this restriction may go away.
https://www.spinics.net/lists/iommu/msg43065.html
I still feel the cleanup work in the notifier handler is too heavy.
Two options I can think of:
1. let VFIO enforce the unbind->free order
2. introduce a reclaim phase for IOASIDs instead of putting the
IOASID back to the pool immediately.
Any suggestions?
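For the free-before-unbind case, a rough, hypothetical sketch of what
the VT-d notifier handler could look like (ioasid_free_set() and the
IOASID_FREE event are from the proposal linked above, not merged code;
the handler name is made up):

	static int intel_svm_ioasid_event(struct notifier_block *nb,
					  unsigned long event, void *data)
	{
		ioasid_t pasid = *(ioasid_t *)data;
		struct intel_svm *svm;

		if (event != IOASID_FREE)	/* proposed event code */
			return NOTIFY_DONE;

		mutex_lock(&pasid_mutex);
		svm = ioasid_find(NULL, pasid, NULL);
		if (IS_ERR_OR_NULL(svm))
			goto out;
		/*
		 * Step 1: clear the PASID entry on each bound device, flush
		 * PASID cache, IOTLB and devTLB, drain PRQ, as during unbind.
		 * Step 2: mark the PASID defunct so later DMA faults and PRQ
		 * are dropped until the guest stops using it.
		 */
	out:
		mutex_unlock(&pasid_mutex);
		return NOTIFY_OK;
	}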
> > + struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > + struct intel_svm_dev *sdev;
> > + struct intel_svm *svm;
> > + int ret = -EINVAL;
> > +
> > + if (WARN_ON(!iommu))
> > + return -EINVAL;
> > +
> > + mutex_lock(&pasid_mutex);
> > + svm = ioasid_find(NULL, pasid, NULL);
> > + if (!svm) {
> > + ret = -EINVAL;
> As per the discussion on the VFIO series, shall we return an error in
> that case? Same below?
You mean make unbind a void function? I agree, there is really no way
to recover. Same for invalidate, which cannot fail.
Perhaps another series?
> > + goto out;
> > + }
> > +
> > + if (IS_ERR(svm)) {
> > + ret = PTR_ERR(svm);
> > + goto out;
> > + }
> > +
> > + for_each_svm_dev(sdev, svm, dev) {
> > + ret = 0;
> > + sdev->users--;
> > + if (!sdev->users) {
> > + list_del_rcu(&sdev->list);
> > + intel_pasid_tear_down_entry(iommu, dev,
> > svm->pasid);
> intel_svm_unbind_mm() calls intel_flush_svm_range_dev(svm, sdev, 0,
> -1, 0); Don't we need to flush the (DEV-)IOTLBs as well?
> > + /* TODO: Drain in flight PRQ for the PASID
> > since it
> > + * may get reused soon, we don't want to
> > + * confuse with its previous life.
> > + * intel_svm_drain_prq(dev, pasid);
> > + */
> > + kfree_rcu(sdev, rcu);
> > +
> > + if (list_empty(&svm->devs)) {
> > + /*
> > + * We do not free the IOASID here
> > in that
> > + * IOMMU driver did not allocate
> > it.
> s/in/as?
> > + * Unlike native SVM, IOASID for
> > guest use was
> > + * allocated prior to the bind
> > call.> + * In any case, if the free
> > call comes before
> > + * the unbind, IOMMU driver will
> > get notified
> > + * and perform cleanup.
> > + */
> > + ioasid_set_data(pasid, NULL);
> > + kfree(svm);
> > + }
> nit: you may use intel_svm_free_if_empty()
> > + }
> > + break;
> > + }
> > +out:
> > + mutex_unlock(&pasid_mutex);
> > +
> nit : spare new line
Sounds good.
> > + return ret;
> > +}
> > +
> > int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
> > struct svm_dev_ops *ops) {
> > struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > diff --git a/include/linux/intel-iommu.h
> > b/include/linux/intel-iommu.h index 6da03f627ba3..a5bd53cf190c
> > 100644 --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -706,7 +706,9 @@ struct dmar_domain *find_domain(struct device
> > *dev); extern void intel_svm_check(struct intel_iommu *iommu);
> > extern int intel_svm_enable_prq(struct intel_iommu *iommu);
> > extern int intel_svm_finish_prq(struct intel_iommu *iommu);
> > -
> > +extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > + struct device *dev, struct iommu_gpasid_bind_data
> > *data); +extern int intel_svm_unbind_gpasid(struct device *dev, int
> > pasid); struct svm_dev_ops;
> >
> > struct intel_svm_dev {
> > @@ -723,9 +725,13 @@ struct intel_svm_dev {
> > struct intel_svm {
> > struct mmu_notifier notifier;
> > struct mm_struct *mm;
> > +
> > struct intel_iommu *iommu;
> > int flags;
> > int pasid;
> > + int gpasid; /* Guest PASID in case of vSVA bind with
> > non-identity host
> > + * to guest PASID mapping.
> > + */
> > struct list_head devs;
> > struct list_head list;
> > };
> > diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
> > index d7c403d0dd27..c19690937540 100644
> > --- a/include/linux/intel-svm.h
> > +++ b/include/linux/intel-svm.h
> > @@ -44,6 +44,23 @@ struct svm_dev_ops {
> > * do such IOTLB flushes automatically.
> > */
> > #define SVM_FLAG_SUPERVISOR_MODE (1<<1)
> > +/*
> > + * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind
> > to a device.
> > + * In this case the mm_struct is in the guest kernel or userspace,
> > its life
> > + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this
> > API provides
> > + * means to bind/unbind guest CR3 with PASIDs allocated for a
> > device.
> > + */
> > +#define SVM_FLAG_GUEST_MODE (1<<2)
> > +/*
> > + * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own
> > PASID space,
> > + * which requires guest and host PASID translation at both
> > directions. We keep
> > + * track of guest PASID in order to provide lookup service to
> > device drivers.
> > + * One such example is a physical function (PF) driver that
> > supports mediated
> > + * device (mdev) assignment. Guest programming of mdev
> > configuration space can
> > + * only be done with guest PASID, therefore PF driver needs to
> > find the matching
> > + * host PASID to program the real hardware.
> > + */
> > +#define SVM_FLAG_GUEST_PASID (1<<3)
> >
> > #ifdef CONFIG_INTEL_IOMMU_SVM
> >
> >
> Thanks
>
> Eric
>
[Jacob Pan]
Hi Eric,
Missed a few things in the last reply.
On Thu, 9 Apr 2020 09:41:32 +0200
Auger Eric <[email protected]> wrote:
> > + intel_pasid_tear_down_entry(iommu, dev,
> > svm->pasid);
> intel_svm_unbind_mm() calls intel_flush_svm_range_dev(svm, sdev, 0,
> -1, 0); Don't we need to flush the (DEV-)IOTLBs as well?
Right, PASID teardown should always include the (DEV-)IOTLB flush; I
initially thought it was taken care of by intel_pasid_tear_down_entry().
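A minimal sketch of the fix in intel_svm_unbind_gpasid(), assuming
intel_flush_svm_range_dev() keeps the call shape intel_svm_unbind_mm()
uses today:

	intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
	/* flush PASID-tagged IOTLB and, with ATS enabled, the device TLB */
	intel_flush_svm_range_dev(svm, sdev, 0, -1, 0);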
> > + /* TODO: Drain in flight PRQ for the PASID
> > since it
> > + * may get reused soon, we don't want to
> > + * confuse with its previous life.
> > + * intel_svm_drain_prq(dev, pasid);
> > + */
> > + kfree_rcu(sdev, rcu);
> > +
> > + if (list_empty(&svm->devs)) {
> > + /*
> > + * We do not free the IOASID here
> > in that
> > + * IOMMU driver did not allocate
> > it.
> s/in/as?
I meant to say "in that" as "for the reason that"
> > + * Unlike native SVM, IOASID for
> > guest use was
> > + * allocated prior to the bind
> > call.> + * In any case, if the free
> > call comes before
> > + * the unbind, IOMMU driver will
> > get notified
> > + * and perform cleanup.
> > + */
> > + ioasid_set_data(pasid, NULL);
> > + kfree(svm);
> > + }
> nit: you may use intel_svm_free_if_empty()
True, but I intend to insert the ioasid_notifier call under the
list_empty condition in the follow-up ioasid patch.
Thanks,
Jacob
On Thu, 9 Apr 2020 10:50:34 +0200
Auger Eric <[email protected]> wrote:
> Hi Jacob,
>
> On 4/3/20 8:42 PM, Jacob Pan wrote:
> > When Shared Virtual Address (SVA) is enabled for a guest OS via
> > vIOMMU, we need to provide invalidation support at IOMMU API and
> > driver level. This patch adds Intel VT-d specific function to
> > implement iommu passdown invalidate API for shared virtual address.
> >
> > The use case is for supporting caching structure invalidation
> > of assigned SVM capable devices. Emulated IOMMU exposes queue
> > invalidation capability and passes down all descriptors from the
> > guest to the physical IOMMU.
> >
> > The assumption is that guest to host device ID mapping should be
> > resolved prior to calling IOMMU driver. Based on the device handle,
> > host IOMMU driver can replace certain fields before submit to the
> > invalidation queue.
> >
> > ---
> > v11 - Removed 2D map array, use -EINVAL in granularity lookup array.
> > Fixed devTLB invalidation granularity mapping. Disregard G=1
> > case and use address selective invalidation only.
> >
> > v7 review fixed in v10
> > ---
> >
> > Signed-off-by: Jacob Pan <[email protected]>
> > Signed-off-by: Ashok Raj <[email protected]>
> > Signed-off-by: Liu, Yi L <[email protected]>
> > ---
> > drivers/iommu/intel-iommu.c | 158
> > ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 158
> > insertions(+)
> >
> > diff --git a/drivers/iommu/intel-iommu.c
> > b/drivers/iommu/intel-iommu.c index 94c7993dac6a..045c5c08d71d
> > 100644 --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -5594,6 +5594,163 @@ static void
> > intel_iommu_aux_detach_device(struct iommu_domain *domain,
> > aux_domain_remove_dev(to_dmar_domain(domain), dev); }
> >
> > +/*
> > + * 2D array for converting and sanitizing IOMMU generic TLB
> > granularity to
> > + * VT-d granularity. Invalidation is typically included in the
> > unmap operation
> > + * as a result of DMA or VFIO unmap. However, for assigned devices
> > guest
> > + * owns the first level page tables. Invalidations of translation
> > caches in the
> > + * guest are trapped and passed down to the host.
> > + *
> > + * vIOMMU in the guest will only expose first level page tables,
> > therefore
> > + * we do not support IOTLB granularity for request without PASID
> > (second level).
> > + *
> > + * For example, to find the VT-d granularity encoding for IOTLB
> > + * type and page selective granularity within PASID:
> > + * X: indexed by iommu cache type
> > + * Y: indexed by enum iommu_inv_granularity
> > + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> > + */
> > +
> > +const static int
> > inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] =
> > {
> > + /*
> > + * PASID based IOTLB invalidation: PASID selective (per
> > PASID),
> > + * page selective (address granularity)
> > + */
> > + {-EINVAL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> > + /* PASID based dev TLBs */
> > + {-EINVAL, -EINVAL, QI_DEV_IOTLB_GRAN_PASID_SEL},
> > + /* PASID cache */
> > + {-EINVAL, -EINVAL, -EINVAL}
> > +};
> > +
> > +static inline int to_vtd_granularity(int type, int granu)
> > +{
> > + return inv_type_granu_table[type][granu];
> > +}
> > +
> > +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> > +{
> > + u64 nr_pages = (granu_size * nr_granules) >>
> > VTD_PAGE_SHIFT; +
> > + /* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9
> > for 2MB, etc.
> > + * IOMMU cache invalidate API passes granu_size in bytes,
> > and number of
> > + * granu size in contiguous memory.
> > + */
> > + return order_base_2(nr_pages);
> > +}
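As a quick worked example of the encoding above (illustrative only, not
part of the patch): a 2MB granule with nb_granules = 1 gives
(2MB * 1) >> 12 = 512 pages, and order_base_2(512) = 9, matching the
"9 for 2MB" comment.

	u64 nr_pages = (SZ_2M * 1) >> VTD_PAGE_SHIFT;	/* 512 */
	u64 size = order_base_2(nr_pages);		/* 9 */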
> > +
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> > + struct device *dev, struct
> > iommu_cache_invalidate_info *inv_info) +{
> > + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > + struct device_domain_info *info;
> > + struct intel_iommu *iommu;
> > + unsigned long flags;
> > + int cache_type;
> > + u8 bus, devfn;
> > + u16 did, sid;
> > + int ret = 0;
> > + u64 size = 0;
> > +
> > + if (!inv_info || !dmar_domain ||
> > + inv_info->version !=
> > IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> > + return -EINVAL;
> > +
> > + if (!dev || !dev_is_pci(dev))
> > + return -ENODEV;
>
> Check (domain->flags & DOMAIN_FLAG_NESTING_MODE)?
Good point
> > +
> > + iommu = device_to_iommu(dev, &bus, &devfn);
> > + if (!iommu)
> > + return -ENODEV;
> > +
> > + spin_lock_irqsave(&device_domain_lock, flags);
> > + spin_lock(&iommu->lock);
> > + info = iommu_support_dev_iotlb(dmar_domain, iommu, bus,
> > devfn);
> > + if (!info) {
> > + ret = -EINVAL;
> > + goto out_unlock;
> > + }
> > + did = dmar_domain->iommu_did[iommu->seq_id];
> > + sid = PCI_DEVID(bus, devfn);
> > +
> > + /* Size is only valid in non-PASID selective invalidation
> > */
> > + if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> > + size =
> > to_vtd_size(inv_info->addr_info.granule_size,
> > +
> > inv_info->addr_info.nb_granules); +
> > + for_each_set_bit(cache_type, (unsigned long
> > *)&inv_info->cache, IOMMU_CACHE_INV_TYPE_NR) {
> > + int granu = 0;
> > + u64 pasid = 0;
> > +
> > + granu = to_vtd_granularity(cache_type,
> > inv_info->granularity);
> > + if (granu == -EINVAL) {
> > + pr_err("Invalid cache type and granu
> > combination %d/%d\n", cache_type,
> > + inv_info->granularity);
> rate unlimited traces here and after.
> ret = -EINVAL?
> also related to the discussion we had on the VFIO series. What is the
> error policy?
Invalidation should not fail, as there is no way to tell the guest.
This can be a void function, or VFIO can ignore the return value,
similar to what is done for iommu_unmap.
> > + break;
> > + }
> > +
> > + /* PASID is stored in different locations based on
> > granularity */
> not sure the above comment really is requested
Just to explain that the PASID can be in either pasid_info or
addr_info, depending on the granularity.
> > + if (inv_info->granularity == IOMMU_INV_GRANU_PASID
> > &&
> > + inv_info->pasid_info.flags &
> > IOMMU_INV_PASID_FLAGS_PASID)
> > + pasid = inv_info->pasid_info.pasid;
> > + else if (inv_info->granularity ==
> > IOMMU_INV_GRANU_ADDR &&
> > + inv_info->addr_info.flags &
> > IOMMU_INV_ADDR_FLAGS_PASID)
> > + pasid = inv_info->addr_info.pasid;
> > +
> > + switch (BIT(cache_type)) {
> > + case IOMMU_CACHE_INV_TYPE_IOTLB:
> > + if ((inv_info->granularity ==
> > IOMMU_INV_GRANU_ADDR) &&
> > + size && (inv_info->addr_info.addr
> > & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> > + pr_err("Address out of range,
> > 0x%llx, size order %llu\n",
> > + inv_info->addr_info.addr,
> > size);
> > + ret = -ERANGE;
> > + goto out_unlock;
> > + }
> > +
> > + qi_flush_piotlb(iommu, did,
> > + pasid,
> > +
> > mm_to_dma_pfn(inv_info->addr_info.addr),
> This does not sound correct to me:
> inv_info->addr_info.addr and inv_info->addr_info.flags are not valid
> in case of PASID-selective invalidation
If granu is PASID-selective, this address is ignored, since npages = -1.
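To make the two call shapes explicit, a hypothetical helper that only
mirrors the quoted code above (the helper name is made up, not part of
the patch):

	static void example_flush_piotlb(struct intel_iommu *iommu, u16 did,
					 u32 pasid, int granu, u64 addr,
					 u64 size, bool leaf)
	{
		if (granu == QI_GRAN_NONG_PASID)
			/* whole-PASID flush: addr and leaf hint are don't-care */
			qi_flush_piotlb(iommu, did, pasid, 0, -1, 0);
		else
			/* page-selective flush of 1 << size 4K pages */
			qi_flush_piotlb(iommu, did, pasid, mm_to_dma_pfn(addr),
					1UL << size, leaf);
	}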
> > + (granu ==
> > QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> > + inv_info->addr_info.flags
> > & IOMMU_INV_ADDR_FLAGS_LEAF); +
> > + /*
> > + * Always flush device IOTLB if ATS is
> > enabled. vIOMMU
> > + * in the guest may assume IOTLB flush is
> > inclusive,
> > + * which is more efficient.
> > + */
> > + if (info->ats_enabled)
> > + qi_flush_dev_iotlb_pasid(iommu,
> > sid, info->pfsid,
> > + pasid,
> > info->ats_qdep,
> > +
> > inv_info->addr_info.addr, size,
> same
> > + granu);
> > + break;
> > + case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> > + if (info->ats_enabled)
> > + qi_flush_dev_iotlb_pasid(iommu,
> > sid, info->pfsid,
> > +
> > inv_info->addr_info.pasid, info->ats_qdep,
> nit: use pasid directly
will do.
> > +
> > inv_info->addr_info.addr, size,
> > + granu);
> > + else
> > + pr_warn("Passdown device IOTLB
> > flush w/o ATS!\n");
> > + break;
> > + case IOMMU_CACHE_INV_TYPE_PASID:
> > + qi_flush_pasid_cache(iommu, did, granu,
> > inv_info->pasid_info.pasid);
> > + break;
> > + default:
> > + dev_err(dev, "Unsupported IOMMU
> > invalidation type %d\n",
> > + cache_type);
> > + ret = -EINVAL;
> > + }
> > + }
> > +out_unlock:
> > + spin_unlock(&iommu->lock);
> > + spin_unlock_irqrestore(&device_domain_lock, flags);
> > +
> > + return ret;
> > +}
> > +#endif
> > +
> > static int intel_iommu_map(struct iommu_domain *domain,
> > unsigned long iova, phys_addr_t hpa,
> > size_t size, int iommu_prot, gfp_t gfp)
> > @@ -6179,6 +6336,7 @@ const struct iommu_ops intel_iommu_ops = {
> > .is_attach_deferred =
> > intel_iommu_is_attach_deferred, .pgsize_bitmap =
> > INTEL_IOMMU_PGSIZES, #ifdef CONFIG_INTEL_IOMMU_SVM
> > + .cache_invalidate = intel_iommu_sva_invalidate,
> > .sva_bind_gpasid = intel_svm_bind_gpasid,
> > .sva_unbind_gpasid = intel_svm_unbind_gpasid,
> > #endif
> >
> Thanks
>
> Eric
>
[Jacob Pan]
Hi Jacob,
On 4/10/20 9:45 PM, Jacob Pan wrote:
> Hi Eric,
>
> On Thu, 9 Apr 2020 09:41:32 +0200
> Auger Eric <[email protected]> wrote:
>
>> Hi Jacob,
>>
>> On 4/3/20 8:42 PM, Jacob Pan wrote:
>>> When supporting guest SVA with emulated IOMMU, the guest PASID
>>> table is shadowed in VMM. Updates to guest vIOMMU PASID table
>>> will result in PASID cache flush which will be passed down to
>>> the host as bind guest PASID calls.
>>>
>>> For the SL page tables, it will be harvested from device's
>>> default domain (request w/o PASID), or aux domain in case of
>>> mediated device.
>>>
>>> .-------------. .---------------------------.
>>> | vIOMMU | | Guest process CR3, FL only|
>>> | | '---------------------------'
>>> .----------------/
>>> | PASID Entry |--- PASID cache flush -
>>> '-------------' |
>>> | | V
>>> | | CR3 in GPA
>>> '-------------'
>>> Guest
>>> ------| Shadow |--------------------------|--------
>>> v v v
>>> Host
>>> .-------------. .----------------------.
>>> | pIOMMU | | Bind FL for GVA-GPA |
>>> | | '----------------------'
>>> .----------------/ |
>>> | PASID Entry | V (Nested xlate)
>>> '----------------\.------------------------------.
>>> | | |SL for GPA-HPA, default domain|
>>> | | '------------------------------'
>>> '-------------'
>>> Where:
>>> - FL = First level/stage one page tables
>>> - SL = Second level/stage two page tables
>>>
>>> ---
>>> v11: Fixed locking, avoid duplicated paging mode check, added
>>> helper to free svm if device list is empty. Use rate limited error
>>> message since the bind gpasid call comes from user space.
>>> ---
>>>
>>> Signed-off-by: Jacob Pan <[email protected]>
>>> Signed-off-by: Liu, Yi L <[email protected]>
>>> ---
>>> drivers/iommu/intel-iommu.c | 4 +
>>> drivers/iommu/intel-svm.c | 206
>>> ++++++++++++++++++++++++++++++++++++++++++++
>>> include/linux/intel-iommu.h | 8 +- include/linux/intel-svm.h |
>>> 17 ++++ 4 files changed, 234 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/iommu/intel-iommu.c
>>> b/drivers/iommu/intel-iommu.c index c0dadec5a6b3..94c7993dac6a
>>> 100644 --- a/drivers/iommu/intel-iommu.c
>>> +++ b/drivers/iommu/intel-iommu.c
>>> @@ -6178,6 +6178,10 @@ const struct iommu_ops intel_iommu_ops = {
>>> .dev_disable_feat = intel_iommu_dev_disable_feat,
>>> .is_attach_deferred =
>>> intel_iommu_is_attach_deferred, .pgsize_bitmap =
>>> INTEL_IOMMU_PGSIZES, +#ifdef CONFIG_INTEL_IOMMU_SVM
>>> + .sva_bind_gpasid = intel_svm_bind_gpasid,
>>> + .sva_unbind_gpasid = intel_svm_unbind_gpasid,
>>> +#endif
>>> };
>>>
>>> static void quirk_iommu_igfx(struct pci_dev *dev)
>>> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
>>> index d7f2a5358900..7cf711318b87 100644
>>> --- a/drivers/iommu/intel-svm.c
>>> +++ b/drivers/iommu/intel-svm.c
>>> @@ -226,6 +226,212 @@ static LIST_HEAD(global_svm_list);
>>> list_for_each_entry((sdev), &(svm)->devs, list) \
>>> if ((d) != (sdev)->dev) {} else
>>>
>>> +
>>> +static inline void intel_svm_free_if_empty(struct intel_svm *svm,
>>> u64 pasid) +{
>>> + if (list_empty(&svm->devs)) {
>>> + ioasid_set_data(pasid, NULL);
>>> + kfree(svm);
>>> + }
>>> +}
>>> +
>>> +int intel_svm_bind_gpasid(struct iommu_domain *domain,
>>> + struct device *dev,
>>> + struct iommu_gpasid_bind_data *data)
>>> +{
>>> + struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
>>> + struct dmar_domain *dmar_domain;
>>> + struct intel_svm_dev *sdev;
>>> + struct intel_svm *svm;
>>> + int ret = 0;
>>> +
>>> + if (WARN_ON(!iommu) || !data)
>>> + return -EINVAL;
>>> +
>>> + if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
>>> + data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
>>> + return -EINVAL;
>>> +
>>> + if (dev_is_pci(dev)) {
>>> + /* VT-d supports devices with full 20 bit PASIDs
>>> only */
>>> + if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
>>> + return -EINVAL;
>>> + } else {
>>> + return -ENOTSUPP;
>>> + }
>>> +
>>> + /*
>>> + * We only check host PASID range, we have no knowledge to
>>> check
>>> + * guest PASID range.
>>> + */
>>> + if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
>>> + return -EINVAL;
>>> +
>>> + dmar_domain = to_dmar_domain(domain);
>>> +
>>> + mutex_lock(&pasid_mutex);
>>> + svm = ioasid_find(NULL, data->hpasid, NULL);
>>> + if (IS_ERR(svm)) {
>>> + ret = PTR_ERR(svm);
>>> + goto out;
>>> + }
>>> +
>>> + if (svm) {
>>> + /*
>>> + * If we found svm for the PASID, there must be at
>>> + * least one device bond, otherwise svm should be
>>> freed.
>>> + */
>>> + if (WARN_ON(list_empty(&svm->devs))) {
>>> + ret = -EINVAL;
>>> + goto out;
>>> + }
>>> +
>>> + for_each_svm_dev(sdev, svm, dev) {
>>> + /* In case of multiple sub-devices of the
>>> same pdev
>>> + * assigned, we should allow multiple bind
>>> calls with
>>> + * the same PASID and pdev.
>>> + */
>>> + sdev->users++;
>> What if this is not an mdev device. Is it also allowed?
> Yes. IOMMU and VT-d driver is not mdev aware. Here mdev is just an
> example of normal use case. You can bind the same PCI device (PF or
> SRIOV VF) more than once to the same PASID. Just need to unbind also.
I don't get the point of binding a non-mdev device several times with
the same PASID. Do you intend to allow that at the userspace level or to
prevent it from happening in VFIO?
Besides, the comment is a bit misleading as it gives the impression it
is only true for mdev and there is no associated check.
>
> Do you see any issues?
no I don't
>
>>> + goto out;
>>> + }
>>> + } else {
>>> + /* We come here when PASID has never been bond to
>>> a device. */
>>> + svm = kzalloc(sizeof(*svm), GFP_KERNEL);
>>> + if (!svm) {
>>> + ret = -ENOMEM;
>>> + goto out;
>>> + }
>>> + /* REVISIT: upper layer/VFIO can track host
>>> process that bind the PASID.
>>> + * ioasid_set = mm might be sufficient for vfio to
>>> check pasid VMM
>>> + * ownership. We can drop the following line once
>>> VFIO and IOASID set
>>> + * check is in place.
>>> + */
>>> + svm->mm = get_task_mm(current);
>>> + svm->pasid = data->hpasid;
>>> + if (data->flags & IOMMU_SVA_GPASID_VAL) {
>>> + svm->gpasid = data->gpasid;
>>> + svm->flags |= SVM_FLAG_GUEST_PASID;
>>> + }
>>> + ioasid_set_data(data->hpasid, svm);
>>> + INIT_LIST_HEAD_RCU(&svm->devs);
>>> + mmput(svm->mm);
>>> + }
>>> + sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
>>> + if (!sdev) {
>>> + /*
>>> + * If this is a new PASID that never bond to a
>>> device, then
>>> + * the device list must be empty which indicates
>>> struct svm
>>> + * was allocated in this function.
>>> + */
>>> + intel_svm_free_if_empty(svm, data->hpasid);
>>> + ret = -ENOMEM;
>>> + goto out;
>>> + }
>>> + sdev->dev = dev;
>>> + sdev->users = 1;
>>> +
>>> + /* Set up device context entry for PASID if not enabled
>>> already */
>>> + ret = intel_iommu_enable_pasid(iommu, sdev->dev);
>>> + if (ret) {
>>> + dev_err_ratelimited(dev, "Failed to enable PASID
>>> capability\n");
>>> + kfree(sdev);
>>> + intel_svm_free_if_empty(svm, data->hpasid);
>>> + goto out;
>>> + }
>>> +
>>> + /*
>>> + * PASID table is per device for better security.
>>> Therefore, for
>>> + * each bind of a new device even with an existing PASID,
>>> we need to
>>> + * call the nested mode setup function here.
>>> + */
>>> + spin_lock(&iommu->lock);
>>> + ret = intel_pasid_setup_nested(iommu,
>>> + dev,
>>> + (pgd_t *)data->gpgd,
>>> + data->hpasid,
>>> + &data->vtd,
>>> + dmar_domain,
>>> + data->addr_width);
>>> + if (ret) {
>>> + dev_err_ratelimited(dev, "Failed to set up PASID
>>> %llu in nested mode, Err %d\n",
>>> + data->hpasid, ret);
>>> + /*
>>> + * PASID entry should be in cleared state if
>>> nested mode
>>> + * set up failed. So we only need to clear IOASID
>>> tracking
>>> + * data such that free call will succeed.
>>> + */
>>> + kfree(sdev);
>>> + intel_svm_free_if_empty(svm, data->hpasid);
>>> + spin_unlock(&iommu->lock);
>>> + goto out;
>>> + }
>>> + spin_unlock(&iommu->lock);
>>> + svm->flags |= SVM_FLAG_GUEST_MODE;
>>> +
>>> + init_rcu_head(&sdev->rcu);
>>> + list_add_rcu(&sdev->list, &svm->devs);
>>> + out:
>>> + mutex_unlock(&pasid_mutex);
>>> + return ret;
>>> +}
>>> +
>>> +int intel_svm_unbind_gpasid(struct device *dev, int pasid)
>>> +{
>> This function's code looks very very similar to intel_svm_unbind_mm()
>> besides to the mmu_notifier_unregister and the ioasid_free(). Don't we
>> have means to factorize the code by checking the SVM_FLAG_GUEST_MODE
>> flag?
>>
> Yes, I agree there is room to consolidate. But the same time there are
> two potential changes coming:
> 1. I also try to consolidate with Jean's generic sva_bind_device() for
> native bind.
> https://lkml.org/lkml/2020/2/24/1351
> 2. Support enqcmd and lazy PASID free.
> https://lkml.org/lkml/2020/3/30/910
>
> Perhaps we can revisit after those get resolved?
yep
>
>> Please could you explain again what does happen if the guest process
>> dies/VM dies. How do we make sure the svm get freed. I fail to
>> understand the whole picture at the moment as you cannot fully rely on
>> the userspace to call unbind_gpasid.
>
> The fault handling needs to be better documented. Here is the
> current plan as I proposed in the IOASID extension set with IOASID
> notifier.
> https://lkml.org/lkml/2020/3/25/874
> I will do a document in the next version. In short On VT-d, IOASID has
> the following subscribers:
> 1. IOMMU driver, when free comes before unbind, since there is no way
> to hook up with guest mmu_notifier directly.
> 2. KVM, needs to update VMCS PASID translation table. This is needed
> when we support non-identity G-H PASID mapping.
> 3. VDCM (Virtual device composition module), similar to SRIOV PF
> driver, also needs G-H PASID translation.
>
> So when a guest goes away, based on FD close VFIO needs to free all
> the PASIDs belong to the guest. The proposal is to call
> ioasid_free_set() where the ioasid_set contains all the IOASIDs
> for the VM.
> A blocking notifier will be called, IOMMU gets notified and performs
> all the clean up on IOMMU side as follows:
> 1.clear PASID table entry, flush PASID and IOTLB, devTLB. Drain page
> requests for the PASID. The same thing you would do during unbind.
> 2.mark PASID defunct such that we can drop further DMA faults and PRQ
>
> For individual guest PASID, the normal order is
> 1. alloc (via virtual cmd)
> 2. bind_gpasid
> 3. unbind_gpasid
> 4. free (via virtual cmd)
>
> If 4 comes before 3, free will call ioasid notifier. IOMMU gets
> notified and perform cleanup as above. When unbind comes, there is no
> PASID entry then exit.
[*] could you detail when the free comes before unbind and the opposite?
>
> One challenge is the cleanup can be lengthy and requires thread
> context. SVA code needed to call ioasid_free in atomic context
> (mmu_notifier release, SRCU callback). This might be changing with
> Jean's patch to defer ioasid_free, this restriction may go away.
> https://www.spinics.net/lists/iommu/msg43065.html
>
> I still feel the clean up work in notifier handler is too heavy.
> Two options I can think of:
> 1. let VFIO enforce the unbind->free order
depends on [*]
> 2. introduce some reclaim phase for IOASID instead of putting the
> IOASID back to the pool immediately.
What do you mean by a reclaim phase?
Thanks
Eric
>
> Any suggestions?
>
>>> + struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
>>> + struct intel_svm_dev *sdev;
>>> + struct intel_svm *svm;
>>> + int ret = -EINVAL;
>>> +
>>> + if (WARN_ON(!iommu))
>>> + return -EINVAL;
>>> +
>>> + mutex_lock(&pasid_mutex);
>>> + svm = ioasid_find(NULL, pasid, NULL);
>>> + if (!svm) {
>>> + ret = -EINVAL;
>> As per the discussion on the VFIO series, shall we return an error in
>> that case? Same below?
> You mean make unbind a void function? I agree, there is really no way
> to recover. Same for invalidate, which cannot fail.
> Perhaps another series?
>
>>> + goto out;
>>> + }
>>> +
>>> + if (IS_ERR(svm)) {
>>> + ret = PTR_ERR(svm);
>>> + goto out;
>>> + }
>>> +
>>> + for_each_svm_dev(sdev, svm, dev) {
>>> + ret = 0;
>>> + sdev->users--;
>>> + if (!sdev->users) {
>>> + list_del_rcu(&sdev->list);
>>> + intel_pasid_tear_down_entry(iommu, dev,
>>> svm->pasid);
>> intel_svm_unbind_mm() calls intel_flush_svm_range_dev(svm, sdev, 0,
>> -1, 0); Don't we need to flush the (DEV-)IOTLBs as well?
>>> + /* TODO: Drain in flight PRQ for the PASID
>>> since it
>>> + * may get reused soon, we don't want to
>>> + * confuse with its previous life.
>>> + * intel_svm_drain_prq(dev, pasid);
>>> + */
>>> + kfree_rcu(sdev, rcu);
>>> +
>>> + if (list_empty(&svm->devs)) {
>>> + /*
>>> + * We do not free the IOASID here
>>> in that
>>> + * IOMMU driver did not allocate
>>> it.
>> s/in/as?
>>> + * Unlike native SVM, IOASID for
>>> guest use was
>>> + * allocated prior to the bind
>>> call.> + * In any case, if the free
>>> call comes before
>>> + * the unbind, IOMMU driver will
>>> get notified
>>> + * and perform cleanup.
>>> + */
>>> + ioasid_set_data(pasid, NULL);
>>> + kfree(svm);
>>> + }
>> nit: you may use intel_svm_free_if_empty()
>>> + }
>>> + break;
>>> + }
>>> +out:
>>> + mutex_unlock(&pasid_mutex);
>>> +
>> nit : spare new line
> Sounds good.
>
>>> + return ret;
>>> +}
>>> +
>>> int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
>>> struct svm_dev_ops *ops) {
>>> struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
>>> diff --git a/include/linux/intel-iommu.h
>>> b/include/linux/intel-iommu.h index 6da03f627ba3..a5bd53cf190c
>>> 100644 --- a/include/linux/intel-iommu.h
>>> +++ b/include/linux/intel-iommu.h
>>> @@ -706,7 +706,9 @@ struct dmar_domain *find_domain(struct device
>>> *dev); extern void intel_svm_check(struct intel_iommu *iommu);
>>> extern int intel_svm_enable_prq(struct intel_iommu *iommu);
>>> extern int intel_svm_finish_prq(struct intel_iommu *iommu);
>>> -
>>> +extern int intel_svm_bind_gpasid(struct iommu_domain *domain,
>>> + struct device *dev, struct iommu_gpasid_bind_data
>>> *data); +extern int intel_svm_unbind_gpasid(struct device *dev, int
>>> pasid); struct svm_dev_ops;
>>>
>>> struct intel_svm_dev {
>>> @@ -723,9 +725,13 @@ struct intel_svm_dev {
>>> struct intel_svm {
>>> struct mmu_notifier notifier;
>>> struct mm_struct *mm;
>>> +
>>> struct intel_iommu *iommu;
>>> int flags;
>>> int pasid;
>>> + int gpasid; /* Guest PASID in case of vSVA bind with
>>> non-identity host
>>> + * to guest PASID mapping.
>>> + */
>>> struct list_head devs;
>>> struct list_head list;
>>> };
>>> diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
>>> index d7c403d0dd27..c19690937540 100644
>>> --- a/include/linux/intel-svm.h
>>> +++ b/include/linux/intel-svm.h
>>> @@ -44,6 +44,23 @@ struct svm_dev_ops {
>>> * do such IOTLB flushes automatically.
>>> */
>>> #define SVM_FLAG_SUPERVISOR_MODE (1<<1)
>>> +/*
>>> + * The SVM_FLAG_GUEST_MODE flag is used when a guest process bind
>>> to a device.
>>> + * In this case the mm_struct is in the guest kernel or userspace,
>>> its life
>>> + * cycle is managed by VMM and VFIO layer. For IOMMU driver, this
>>> API provides
>>> + * means to bind/unbind guest CR3 with PASIDs allocated for a
>>> device.
>>> + */
>>> +#define SVM_FLAG_GUEST_MODE (1<<2)
>>> +/*
>>> + * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own
>>> PASID space,
>>> + * which requires guest and host PASID translation at both
>>> directions. We keep
>>> + * track of guest PASID in order to provide lookup service to
>>> device drivers.
>>> + * One such example is a physical function (PF) driver that
>>> supports mediated
>>> + * device (mdev) assignment. Guest programming of mdev
>>> configuration space can
>>> + * only be done with guest PASID, therefore PF driver needs to
>>> find the matching
>>> + * host PASID to program the real hardware.
>>> + */
>>> +#define SVM_FLAG_GUEST_PASID (1<<3)
>>>
>>> #ifdef CONFIG_INTEL_IOMMU_SVM
>>>
>>>
>> Thanks
>>
>> Eric
>>
>
> [Jacob Pan]
>
Hi Jacob,
On 4/10/20 11:06 PM, Jacob Pan wrote:
> Hi Eric,
>
> Missed a few things in the last reply.
>
> On Thu, 9 Apr 2020 09:41:32 +0200
> Auger Eric <[email protected]> wrote:
>
>>> + intel_pasid_tear_down_entry(iommu, dev,
>>> svm->pasid);
>> intel_svm_unbind_mm() calls intel_flush_svm_range_dev(svm, sdev, 0,
>> -1, 0); Don't we need to flush the (DEV-)IOTLBs as well?
> Right, pasid tear down should always include (DEV-)IOTLB flush, I
> initially thought it is taken care of by intel_pasid_tear_down_entry().
>
>>> + /* TODO: Drain in flight PRQ for the PASID
>>> since it
>>> + * may get reused soon, we don't want to
>>> + * confuse with its previous life.
>>> + * intel_svm_drain_prq(dev, pasid);
>>> + */
>>> + kfree_rcu(sdev, rcu);
>>> +
>>> + if (list_empty(&svm->devs)) {
>>> + /*
>>> + * We do not free the IOASID here
>>> in that
>>> + * IOMMU driver did not allocate
>>> it.
>> s/in/as?
> I meant to say "in that" as "for the reason that"
ok sorry
>
>>> + * Unlike native SVM, IOASID for
>>> guest use was
>>> + * allocated prior to the bind
>>> call.> + * In any case, if the free
>>> call comes before
>>> + * the unbind, IOMMU driver will
>>> get notified
>>> + * and perform cleanup.
>>> + */
>>> + ioasid_set_data(pasid, NULL);
>>> + kfree(svm);
>>> + }
>> nit: you may use intel_svm_free_if_empty()
> True, but I meant to insert ioasid_notifier under the list_empty
> condition in the following ioasid patch.
ok
Thanks
Eric
>
>
> Thanks,
>
> Jacob
>
Hi Jacob,
On 4/10/20 11:56 PM, Jacob Pan wrote:
> On Thu, 9 Apr 2020 10:50:34 +0200
> Auger Eric <[email protected]> wrote:
>
>> Hi Jacob,
>>
>> On 4/3/20 8:42 PM, Jacob Pan wrote:
>>> When Shared Virtual Address (SVA) is enabled for a guest OS via
>>> vIOMMU, we need to provide invalidation support at IOMMU API and
>>> driver level. This patch adds Intel VT-d specific function to
>>> implement iommu passdown invalidate API for shared virtual address.
>>>
>>> The use case is for supporting caching structure invalidation
>>> of assigned SVM capable devices. Emulated IOMMU exposes queue
>>> invalidation capability and passes down all descriptors from the
>>> guest to the physical IOMMU.
>>>
>>> The assumption is that guest to host device ID mapping should be
>>> resolved prior to calling IOMMU driver. Based on the device handle,
>>> host IOMMU driver can replace certain fields before submit to the
>>> invalidation queue.
>>>
>>> ---
>>> v11 - Removed 2D map array, use -EINVAL in granularity lookup array.
>>> Fixed devTLB invalidation granularity mapping. Disregard G=1
>>> case and use address selective invalidation only.
>>>
>>> v7 review fixed in v10
>>> ---
>>>
>>> Signed-off-by: Jacob Pan <[email protected]>
>>> Signed-off-by: Ashok Raj <[email protected]>
>>> Signed-off-by: Liu, Yi L <[email protected]>
>>> ---
>>> drivers/iommu/intel-iommu.c | 158
>>> ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 158
>>> insertions(+)
>>>
>>> diff --git a/drivers/iommu/intel-iommu.c
>>> b/drivers/iommu/intel-iommu.c index 94c7993dac6a..045c5c08d71d
>>> 100644 --- a/drivers/iommu/intel-iommu.c
>>> +++ b/drivers/iommu/intel-iommu.c
>>> @@ -5594,6 +5594,163 @@ static void
>>> intel_iommu_aux_detach_device(struct iommu_domain *domain,
>>> aux_domain_remove_dev(to_dmar_domain(domain), dev); }
>>>
>>> +/*
>>> + * 2D array for converting and sanitizing IOMMU generic TLB
>>> granularity to
>>> + * VT-d granularity. Invalidation is typically included in the
>>> unmap operation
>>> + * as a result of DMA or VFIO unmap. However, for assigned devices
>>> guest
>>> + * owns the first level page tables. Invalidations of translation
>>> caches in the
>>> + * guest are trapped and passed down to the host.
>>> + *
>>> + * vIOMMU in the guest will only expose first level page tables,
>>> therefore
>>> + * we do not support IOTLB granularity for request without PASID
>>> (second level).
>>> + *
>>> + * For example, to find the VT-d granularity encoding for IOTLB
>>> + * type and page selective granularity within PASID:
>>> + * X: indexed by iommu cache type
>>> + * Y: indexed by enum iommu_inv_granularity
>>> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
>>> + */
>>> +
>>> +const static int
>>> inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] =
>>> {
>>> + /*
>>> + * PASID based IOTLB invalidation: PASID selective (per
>>> PASID),
>>> + * page selective (address granularity)
>>> + */
>>> + {-EINVAL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
>>> + /* PASID based dev TLBs */
>>> + {-EINVAL, -EINVAL, QI_DEV_IOTLB_GRAN_PASID_SEL},
>>> + /* PASID cache */
>>> + {-EINVAL, -EINVAL, -EINVAL}
>>> +};
>>> +
>>> +static inline int to_vtd_granularity(int type, int granu)
>>> +{
>>> + return inv_type_granu_table[type][granu];
>>> +}
>>> +
>>> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
>>> +{
>>> + u64 nr_pages = (granu_size * nr_granules) >>
>>> VTD_PAGE_SHIFT; +
>>> + /* VT-d size is encoded as 2^size of 4K pages, 0 for 4k, 9
>>> for 2MB, etc.
>>> + * IOMMU cache invalidate API passes granu_size in bytes,
>>> and number of
>>> + * granu size in contiguous memory.
>>> + */
>>> + return order_base_2(nr_pages);
>>> +}
>>> +
>>> +#ifdef CONFIG_INTEL_IOMMU_SVM
>>> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
>>> + struct device *dev, struct
>>> iommu_cache_invalidate_info *inv_info) +{
>>> + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>>> + struct device_domain_info *info;
>>> + struct intel_iommu *iommu;
>>> + unsigned long flags;
>>> + int cache_type;
>>> + u8 bus, devfn;
>>> + u16 did, sid;
>>> + int ret = 0;
>>> + u64 size = 0;
>>> +
>>> + if (!inv_info || !dmar_domain ||
>>> + inv_info->version !=
>>> IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
>>> + return -EINVAL;
>>> +
>>> + if (!dev || !dev_is_pci(dev))
>>> + return -ENODEV;
>>
>> Check (domain->flags & DOMAIN_FLAG_NESTING_MODE)?
> Good point
>
>>> +
>>> + iommu = device_to_iommu(dev, &bus, &devfn);
>>> + if (!iommu)
>>> + return -ENODEV;
>>> +
>>> + spin_lock_irqsave(&device_domain_lock, flags);
>>> + spin_lock(&iommu->lock);
>>> + info = iommu_support_dev_iotlb(dmar_domain, iommu, bus,
>>> devfn);
>>> + if (!info) {
>>> + ret = -EINVAL;
>>> + goto out_unlock;
>>> + }
>>> + did = dmar_domain->iommu_did[iommu->seq_id];
>>> + sid = PCI_DEVID(bus, devfn);
>>> +
>>> + /* Size is only valid in non-PASID selective invalidation
>>> */
>>> + if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
>>> + size =
>>> to_vtd_size(inv_info->addr_info.granule_size,
>>> +
>>> inv_info->addr_info.nb_granules); +
>>> + for_each_set_bit(cache_type, (unsigned long
>>> *)&inv_info->cache, IOMMU_CACHE_INV_TYPE_NR) {
>>> + int granu = 0;
>>> + u64 pasid = 0;
>>> +
>>> + granu = to_vtd_granularity(cache_type,
>>> inv_info->granularity);
>>> + if (granu == -EINVAL) {
>>> + pr_err("Invalid cache type and granu
>>> combination %d/%d\n", cache_type,
>>> + inv_info->granularity);
>> rate unlimited traces here and after.
still to be fixed
>> ret = -EINVAL?
>> also related to the discussion we had on the VFIO series. What is the
>> error policy?
> Invalidation should not fail, there is no way to tell the guest. this
> can be a void function or VFIO can ignore it similar to what is done to
> iommu_unmap.
OK, it was just a reminder.
>
>
>>> + break;
>>> + }
>>> +
>>> + /* PASID is stored in different locations based on
>>> granularity */
>> not sure the above comment really is requested
> Just to explain the situation where PASID can be in pasid_info or
> addr_info.
>
>>> + if (inv_info->granularity == IOMMU_INV_GRANU_PASID
>>> &&
>>> + inv_info->pasid_info.flags &
>>> IOMMU_INV_PASID_FLAGS_PASID)
>>> + pasid = inv_info->pasid_info.pasid;
>>> + else if (inv_info->granularity ==
>>> IOMMU_INV_GRANU_ADDR &&
>>> + inv_info->addr_info.flags &
>>> IOMMU_INV_ADDR_FLAGS_PASID)
>>> + pasid = inv_info->addr_info.pasid;
>>> +
>>> + switch (BIT(cache_type)) {
>>> + case IOMMU_CACHE_INV_TYPE_IOTLB:
>>> + if ((inv_info->granularity ==
>>> IOMMU_INV_GRANU_ADDR) &&
>>> + size && (inv_info->addr_info.addr
>>> & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
>>> + pr_err("Address out of range,
>>> 0x%llx, size order %llu\n",
>>> + inv_info->addr_info.addr,
>>> size);
>>> + ret = -ERANGE;
>>> + goto out_unlock;
>>> + }
>>> +
>>> + qi_flush_piotlb(iommu, did,
>>> + pasid,
>>> +
>>> mm_to_dma_pfn(inv_info->addr_info.addr),
>> This does not sound correct to me:
>> inv_info->addr_info.addr and inv_info->addr_info.flags are not valid
>> in case of PASID-selective invalidation
> If granu is PASID-selective, this address is ignored. Since npages = -1.
Ah OK. Maybe add a comment, since it is a bit strange to see irrelevant
fields being passed in the PASID-selective case.
Thanks
Eric
>
>>> + (granu ==
>>> QI_GRAN_NONG_PASID) ? -1 : 1 << size,
>>> + inv_info->addr_info.flags
>>> & IOMMU_INV_ADDR_FLAGS_LEAF); +
>>> + /*
>>> + * Always flush device IOTLB if ATS is
>>> enabled. vIOMMU
>>> + * in the guest may assume IOTLB flush is
>>> inclusive,
>>> + * which is more efficient.
>>> + */
>>> + if (info->ats_enabled)
>>> + qi_flush_dev_iotlb_pasid(iommu,
>>> sid, info->pfsid,
>>> + pasid,
>>> info->ats_qdep,
>>> +
>>> inv_info->addr_info.addr, size,
>> same
>>> + granu);
>>> + break;
>>> + case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
>>> + if (info->ats_enabled)
>>> + qi_flush_dev_iotlb_pasid(iommu,
>>> sid, info->pfsid,
>>> +
>>> inv_info->addr_info.pasid, info->ats_qdep,
>> nit: use pasid directly
> will do.
>
>>> +
>>> inv_info->addr_info.addr, size,
>>> + granu);
>>> + else
>>> + pr_warn("Passdown device IOTLB
>>> flush w/o ATS!\n");
>>> + break;
>>> + case IOMMU_CACHE_INV_TYPE_PASID:
>>> + qi_flush_pasid_cache(iommu, did, granu,
>>> inv_info->pasid_info.pasid);
>>> + break;
>>> + default:
>>> + dev_err(dev, "Unsupported IOMMU
>>> invalidation type %d\n",
>>> + cache_type);
>>> + ret = -EINVAL;
>>> + }
>>> + }
>>> +out_unlock:
>>> + spin_unlock(&iommu->lock);
>>> + spin_unlock_irqrestore(&device_domain_lock, flags);
>>> +
>>> + return ret;
>>> +}
>>> +#endif
>>> +
>>> static int intel_iommu_map(struct iommu_domain *domain,
>>> unsigned long iova, phys_addr_t hpa,
>>> size_t size, int iommu_prot, gfp_t gfp)
>>> @@ -6179,6 +6336,7 @@ const struct iommu_ops intel_iommu_ops = {
>>> .is_attach_deferred =
>>> intel_iommu_is_attach_deferred, .pgsize_bitmap =
>>> INTEL_IOMMU_PGSIZES, #ifdef CONFIG_INTEL_IOMMU_SVM
>>> + .cache_invalidate = intel_iommu_sva_invalidate,
>>> .sva_bind_gpasid = intel_svm_bind_gpasid,
>>> .sva_unbind_gpasid = intel_svm_unbind_gpasid,
>>> #endif
>>>
>> Thanks
>>
>> Eric
>>
>
> [Jacob Pan]
>
> From: Auger Eric
> Sent: Thursday, April 16, 2020 6:43 PM
>
[...]
> >>> + if (svm) {
> >>> + /*
> >>> + * If we found svm for the PASID, there must be at
> >>> + * least one device bond, otherwise svm should be
> >>> freed.
> >>> + */
> >>> + if (WARN_ON(list_empty(&svm->devs))) {
> >>> + ret = -EINVAL;
> >>> + goto out;
> >>> + }
> >>> +
> >>> + for_each_svm_dev(sdev, svm, dev) {
> >>> + /* In case of multiple sub-devices of the
> >>> same pdev
> >>> + * assigned, we should allow multiple bind
> >>> calls with
> >>> + * the same PASID and pdev.
> >>> + */
> >>> + sdev->users++;
> >> What if this is not an mdev device. Is it also allowed?
> > Yes. IOMMU and VT-d driver is not mdev aware. Here mdev is just an
> > example of normal use case. You can bind the same PCI device (PF or
> > SRIOV VF) more than once to the same PASID. Just need to unbind also.
>
> I don't get the point of binding a non mdev device several times with
> the same PASID. Do you intend to allow that at userspace level or
> prevent this from happening in VFIO?
I feel it's better to prevent this from happening; otherwise VFIO also
needs to track the bind count and do multiple unbinds at mm_exit.
But it's not necessary to prevent it in VFIO. We can check here whether
the aux_domain is valid, and if not return -EBUSY.
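One possible shape of that check in the bind path (a sketch only; it
uses iommu_dev_feature_enabled(dev, IOMMU_DEV_FEAT_AUX) as a stand-in
for "aux_domain is valid", which may not be the exact test the driver
ends up using):

	for_each_svm_dev(sdev, svm, dev) {
		/* repeated binds only make sense for aux-domain (mdev) usage */
		if (!iommu_dev_feature_enabled(dev, IOMMU_DEV_FEAT_AUX)) {
			ret = -EBUSY;
			goto out;
		}
		sdev->users++;
		goto out;
	}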
>
> Besides, the comment is a bit misleading as it gives the impression it
> is only true for mdev and there is no associated check.
Thanks
Kevin
Hi Kevin,
On 4/17/20 4:45 AM, Tian, Kevin wrote:
>> From: Auger Eric
>> Sent: Thursday, April 16, 2020 6:43 PM
>>
> [...]
>>>>> + if (svm) {
>>>>> + /*
>>>>> + * If we found svm for the PASID, there must be at
>>>>> + * least one device bond, otherwise svm should be
>>>>> freed.
>>>>> + */
>>>>> + if (WARN_ON(list_empty(&svm->devs))) {
>>>>> + ret = -EINVAL;
>>>>> + goto out;
>>>>> + }
>>>>> +
>>>>> + for_each_svm_dev(sdev, svm, dev) {
>>>>> + /* In case of multiple sub-devices of the
>>>>> same pdev
>>>>> + * assigned, we should allow multiple bind
>>>>> calls with
>>>>> + * the same PASID and pdev.
>>>>> + */
>>>>> + sdev->users++;
>>>> What if this is not an mdev device. Is it also allowed?
>>> Yes. IOMMU and VT-d driver is not mdev aware. Here mdev is just an
>>> example of normal use case. You can bind the same PCI device (PF or
>>> SRIOV VF) more than once to the same PASID. Just need to unbind also.
>>
>> I don't get the point of binding a non mdev device several times with
>> the same PASID. Do you intend to allow that at userspace level or
>> prevent this from happening in VFIO?
>
> I feel it's better to prevent this from happening, otherwise VFIO also
> needs to track the bind count and do multiple unbinds at mm_exit.
> But it's not necessary to prevent it in VFIO. We can check here
> upon whether aux_domain is valid, and if not return -EBUSY.
Ah OK. So if we can detect the case here, that is even better.
Thanks
Eric
>
>>
>> Besides, the comment is a bit misleading as it gives the impression it
>> is only true for mdev and there is no associated check.
>
> Thanks
> Kevin
>
On Fri, 17 Apr 2020 09:46:55 +0200
Auger Eric <[email protected]> wrote:
> Hi Kevin,
> On 4/17/20 4:45 AM, Tian, Kevin wrote:
> >> From: Auger Eric
> >> Sent: Thursday, April 16, 2020 6:43 PM
> >>
> > [...]
> >>>>> + if (svm) {
> >>>>> + /*
> >>>>> + * If we found svm for the PASID, there must
> >>>>> be at
> >>>>> + * least one device bond, otherwise svm should
> >>>>> be freed.
> >>>>> + */
> >>>>> + if (WARN_ON(list_empty(&svm->devs))) {
> >>>>> + ret = -EINVAL;
> >>>>> + goto out;
> >>>>> + }
> >>>>> +
> >>>>> + for_each_svm_dev(sdev, svm, dev) {
> >>>>> + /* In case of multiple sub-devices of
> >>>>> the same pdev
> >>>>> + * assigned, we should allow multiple
> >>>>> bind calls with
> >>>>> + * the same PASID and pdev.
> >>>>> + */
> >>>>> + sdev->users++;
> >>>> What if this is not an mdev device. Is it also allowed?
> >>> Yes. IOMMU and VT-d driver is not mdev aware. Here mdev is just an
> >>> example of normal use case. You can bind the same PCI device (PF
> >>> or SRIOV VF) more than once to the same PASID. Just need to
> >>> unbind also.
> >>
> >> I don't get the point of binding a non mdev device several times
> >> with the same PASID. Do you intend to allow that at userspace
> >> level or prevent this from happening in VFIO?
> >
> > I feel it's better to prevent this from happening, otherwise VFIO
> > also needs to track the bind count and do multiple unbinds at
> > mm_exit. But it's not necessary to prevent it in VFIO. We can check
> > here upon whether aux_domain is valid, and if not return -EBUSY.
> Ah OK. So if we can detect the case here it is even better
>
I don't understand why VFIO cannot track this, since it is mdev aware.
If we don't refcount the users, one mdev unbind will result in unbinding
for all mdevs under the same pdev. That may not be the right thing to do.
> Thanks
>
> Eric
> >
> >>
> >> Besides, the comment is a bit misleading as it gives the
> >> impression it is only true for mdev and there is no associated
> >> check.
> >
> > Thanks
> > Kevin
> >
>
[Jacob Pan]
On Thu, 16 Apr 2020 12:59:14 +0200
Auger Eric <[email protected]> wrote:
> Hi Jacob,
> On 4/10/20 11:56 PM, Jacob Pan wrote:
> > On Thu, 9 Apr 2020 10:50:34 +0200
> > Auger Eric <[email protected]> wrote:
> >
> >> Hi Jacob,
> >>
> >> On 4/3/20 8:42 PM, Jacob Pan wrote:
> >>> When Shared Virtual Address (SVA) is enabled for a guest OS via
> >>> vIOMMU, we need to provide invalidation support at IOMMU API and
> >>> driver level. This patch adds Intel VT-d specific function to
> >>> implement iommu passdown invalidate API for shared virtual
> >>> address.
> >>>
> >>> The use case is for supporting caching structure invalidation
> >>> of assigned SVM capable devices. Emulated IOMMU exposes queue
> >>> invalidation capability and passes down all descriptors from the
> >>> guest to the physical IOMMU.
> >>>
> >>> The assumption is that guest to host device ID mapping should be
> >>> resolved prior to calling IOMMU driver. Based on the device
> >>> handle, host IOMMU driver can replace certain fields before
> >>> submit to the invalidation queue.
> >>>
> >>> ---
> >>> v11 - Removed 2D map array, use -EINVAL in granularity lookup
> >>> array. Fixed devTLB invalidation granularity mapping. Disregard
> >>> G=1 case and use address selective invalidation only.
> >>>
> >>> v7 review fixed in v10
> >>> ---
> >>>
> >>> Signed-off-by: Jacob Pan <[email protected]>
> >>> Signed-off-by: Ashok Raj <[email protected]>
> >>> Signed-off-by: Liu, Yi L <[email protected]>
> >>> ---
> >>> drivers/iommu/intel-iommu.c | 158
> >>> ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 158
> >>> insertions(+)
> >>>
> >>> diff --git a/drivers/iommu/intel-iommu.c
> >>> b/drivers/iommu/intel-iommu.c index 94c7993dac6a..045c5c08d71d
> >>> 100644 --- a/drivers/iommu/intel-iommu.c
> >>> +++ b/drivers/iommu/intel-iommu.c
> >>> @@ -5594,6 +5594,163 @@ static void
> >>> intel_iommu_aux_detach_device(struct iommu_domain *domain,
> >>> aux_domain_remove_dev(to_dmar_domain(domain), dev); }
> >>>
> >>> +/*
> >>> + * 2D array for converting and sanitizing IOMMU generic TLB
> >>> granularity to
> >>> + * VT-d granularity. Invalidation is typically included in the
> >>> unmap operation
> >>> + * as a result of DMA or VFIO unmap. However, for assigned
> >>> devices guest
> >>> + * owns the first level page tables. Invalidations of translation
> >>> caches in the
> >>> + * guest are trapped and passed down to the host.
> >>> + *
> >>> + * vIOMMU in the guest will only expose first level page tables,
> >>> therefore
> >>> + * we do not support IOTLB granularity for request without PASID
> >>> (second level).
> >>> + *
> >>> + * For example, to find the VT-d granularity encoding for IOTLB
> >>> + * type and page selective granularity within PASID:
> >>> + * X: indexed by iommu cache type
> >>> + * Y: indexed by enum iommu_inv_granularity
> >>> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> >>> + */
> >>> +
> >>> +const static int
> >>> inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR]
> >>> = {
> >>> + /*
> >>> + * PASID based IOTLB invalidation: PASID selective (per
> >>> PASID),
> >>> + * page selective (address granularity)
> >>> + */
> >>> + {-EINVAL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> >>> + /* PASID based dev TLBs */
> >>> + {-EINVAL, -EINVAL, QI_DEV_IOTLB_GRAN_PASID_SEL},
> >>> + /* PASID cache */
> >>> + {-EINVAL, -EINVAL, -EINVAL}
> >>> +};
> >>> +
> >>> +static inline int to_vtd_granularity(int type, int granu)
> >>> +{
> >>> + return inv_type_granu_table[type][granu];
> >>> +}
> >>> +
> >>> +static inline u64 to_vtd_size(u64 granu_size, u64 nr_granules)
> >>> +{
> >>> + u64 nr_pages = (granu_size * nr_granules) >>
> >>> VTD_PAGE_SHIFT; +
> >>> + /* VT-d size is encoded as 2^size of 4K pages, 0 for 4k,
> >>> 9 for 2MB, etc.
> >>> + * IOMMU cache invalidate API passes granu_size in bytes,
> >>> and number of
> >>> + * granu size in contiguous memory.
> >>> + */
> >>> + return order_base_2(nr_pages);
> >>> +}
> >>> +
> >>> +#ifdef CONFIG_INTEL_IOMMU_SVM
> >>> +static int intel_iommu_sva_invalidate(struct iommu_domain
> >>> *domain,
> >>> + struct device *dev, struct
> >>> iommu_cache_invalidate_info *inv_info) +{
> >>> + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> >>> + struct device_domain_info *info;
> >>> + struct intel_iommu *iommu;
> >>> + unsigned long flags;
> >>> + int cache_type;
> >>> + u8 bus, devfn;
> >>> + u16 did, sid;
> >>> + int ret = 0;
> >>> + u64 size = 0;
> >>> +
> >>> + if (!inv_info || !dmar_domain ||
> >>> + inv_info->version !=
> >>> IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> >>> + return -EINVAL;
> >>> +
> >>> + if (!dev || !dev_is_pci(dev))
> >>> + return -ENODEV;
> >>
> >> Check (domain->flags & DOMAIN_FLAG_NESTING_MODE)?
> > Good point
> >
> >>> +
> >>> + iommu = device_to_iommu(dev, &bus, &devfn);
> >>> + if (!iommu)
> >>> + return -ENODEV;
> >>> +
> >>> + spin_lock_irqsave(&device_domain_lock, flags);
> >>> + spin_lock(&iommu->lock);
> >>> + info = iommu_support_dev_iotlb(dmar_domain, iommu, bus,
> >>> devfn);
> >>> + if (!info) {
> >>> + ret = -EINVAL;
> >>> + goto out_unlock;
> >>> + }
> >>> + did = dmar_domain->iommu_did[iommu->seq_id];
> >>> + sid = PCI_DEVID(bus, devfn);
> >>> +
> >>> + /* Size is only valid in non-PASID selective invalidation
> >>> */
> >>> + if (inv_info->granularity != IOMMU_INV_GRANU_PASID)
> >>> + size =
> >>> to_vtd_size(inv_info->addr_info.granule_size,
> >>> +
> >>> inv_info->addr_info.nb_granules); +
> >>> + for_each_set_bit(cache_type, (unsigned long
> >>> *)&inv_info->cache, IOMMU_CACHE_INV_TYPE_NR) {
> >>> + int granu = 0;
> >>> + u64 pasid = 0;
> >>> +
> >>> + granu = to_vtd_granularity(cache_type,
> >>> inv_info->granularity);
> >>> + if (granu == -EINVAL) {
> >>> + pr_err("Invalid cache type and granu
> >>> combination %d/%d\n", cache_type,
> >>> + inv_info->granularity);
> >> rate unlimited traces here and after.
> still to be fixed
Yes. will do.
> >> ret = -EINVAL?
> >> also related to the discussion we had on the VFIO series. What is
> >> the error policy?
> > Invalidation should not fail, there is no way to tell the guest.
> > this can be a void function or VFIO can ignore it similar to what
> > is done to iommu_unmap.
> OK it was just for reminder
It will also be documented in an IOMMU UAPI doc, which I plan to send
out along with the feedback on the UAPI version and sizing patchset.
Could you check whether the following statement is true?
"TLB invalidations should always succeed from the vIOMMU's perspective.
There is no architectural way to report back to the vIOMMU if the UAPI
data is not compatible. For this reason the following IOMMU UAPIs
cannot fail:
1. Free PASID
2. Unbind guest PASID
3. Unbind guest PASID table (SMMU)
4. Cache invalidate
5. Page response
"
> >
> >
> >>> + break;
> >>> + }
> >>> +
> >>> +		/* PASID is stored in different locations based on granularity */
> >> not sure the above comment is really needed
> > Just to explain the situation where PASID can be in pasid_info or
> > addr_info.
> >
> >>> +		if (inv_info->granularity == IOMMU_INV_GRANU_PASID &&
> >>> +		    inv_info->pasid_info.flags & IOMMU_INV_PASID_FLAGS_PASID)
> >>> +			pasid = inv_info->pasid_info.pasid;
> >>> +		else if (inv_info->granularity == IOMMU_INV_GRANU_ADDR &&
> >>> +			 inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID)
> >>> +			pasid = inv_info->addr_info.pasid;
> >>> +
> >>> +		switch (BIT(cache_type)) {
> >>> +		case IOMMU_CACHE_INV_TYPE_IOTLB:
> >>> +			if ((inv_info->granularity == IOMMU_INV_GRANU_ADDR) &&
> >>> +			    size && (inv_info->addr_info.addr & ((BIT(VTD_PAGE_SHIFT + size)) - 1))) {
> >>> +				pr_err("Address out of range, 0x%llx, size order %llu\n",
> >>> +					inv_info->addr_info.addr, size);
> >>> +				ret = -ERANGE;
> >>> +				goto out_unlock;
> >>> +			}
> >>> +
> >>> +			qi_flush_piotlb(iommu, did, pasid,
> >>> +					mm_to_dma_pfn(inv_info->addr_info.addr),
> >> This does not sound correct to me:
> >> inv_info->addr_info.addr and inv_info->addr_info.flags are not
> >> valid in case of PASID-selective invalidation
> > If granu is PASID-selective, this address is ignored, since npages = -1.
> Ah OK. Maybe add a comment since it is a bit strange to see irrelevant
> fields being passed in the PASID-selective case.
>
OK, will do.
Thanks!
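For reference, the comment could read along these lines (illustrative
wording only):

	/*
	 * PASID-selective granularity: addr_info.addr and the LEAF flag are
	 * passed through but ignored by qi_flush_piotlb(), because npages
	 * is -1 in that case.
	 */
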
> Thanks
>
> Eric
> >
> >>> +					(granu == QI_GRAN_NONG_PASID) ? -1 : 1 << size,
> >>> +					inv_info->addr_info.flags & IOMMU_INV_ADDR_FLAGS_LEAF);
> >>> +
> >>> +			/*
> >>> +			 * Always flush device IOTLB if ATS is enabled. vIOMMU
> >>> +			 * in the guest may assume IOTLB flush is inclusive,
> >>> +			 * which is more efficient.
> >>> +			 */
> >>> +			if (info->ats_enabled)
> >>> +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> >>> +						pasid, info->ats_qdep,
> >>> +						inv_info->addr_info.addr, size,
> >> same
> >>> + granu);
> >>> + break;
> >>> + case IOMMU_CACHE_INV_TYPE_DEV_IOTLB:
> >>> + if (info->ats_enabled)
> >>> +				qi_flush_dev_iotlb_pasid(iommu, sid, info->pfsid,
> >>> +						inv_info->addr_info.pasid, info->ats_qdep,
> >> nit: use pasid directly
> > will do.
> >
> >>> +						inv_info->addr_info.addr, size,
> >>> +						granu);
> >>> + else
> >>> + pr_warn("Passdown device IOTLB
> >>> flush w/o ATS!\n");
> >>> + break;
> >>> + case IOMMU_CACHE_INV_TYPE_PASID:
> >>> +			qi_flush_pasid_cache(iommu, did, granu,
> >>> +					inv_info->pasid_info.pasid);
> >>> + break;
> >>> + default:
> >>> + dev_err(dev, "Unsupported IOMMU
> >>> invalidation type %d\n",
> >>> + cache_type);
> >>> + ret = -EINVAL;
> >>> + }
> >>> + }
> >>> +out_unlock:
> >>> + spin_unlock(&iommu->lock);
> >>> + spin_unlock_irqrestore(&device_domain_lock, flags);
> >>> +
> >>> + return ret;
> >>> +}
> >>> +#endif
> >>> +
> >>>  static int intel_iommu_map(struct iommu_domain *domain,
> >>>  			   unsigned long iova, phys_addr_t hpa,
> >>>  			   size_t size, int iommu_prot, gfp_t gfp)
> >>> @@ -6179,6 +6336,7 @@ const struct iommu_ops intel_iommu_ops = {
> >>>  	.is_attach_deferred	= intel_iommu_is_attach_deferred,
> >>>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> >>>  #ifdef CONFIG_INTEL_IOMMU_SVM
> >>> +	.cache_invalidate	= intel_iommu_sva_invalidate,
> >>>  	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> >>>  	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> >>>  #endif
> >>>
> >> Thanks
> >>
> >> Eric
> >>
> >
> > [Jacob Pan]
> >
>
[Jacob Pan]
> From: Jacob Pan <[email protected]>
> Sent: Friday, April 17, 2020 11:29 PM
>
> On Fri, 17 Apr 2020 09:46:55 +0200
> Auger Eric <[email protected]> wrote:
>
> > Hi Kevin,
> > On 4/17/20 4:45 AM, Tian, Kevin wrote:
> > >> From: Auger Eric
> > >> Sent: Thursday, April 16, 2020 6:43 PM
> > >>
> > > [...]
> > >>>>> +	if (svm) {
> > >>>>> +		/*
> > >>>>> +		 * If we found svm for the PASID, there must be at
> > >>>>> +		 * least one device bond, otherwise svm should be freed.
> > >>>>> +		 */
> > >>>>> +		if (WARN_ON(list_empty(&svm->devs))) {
> > >>>>> +			ret = -EINVAL;
> > >>>>> +			goto out;
> > >>>>> +		}
> > >>>>> +
> > >>>>> +		for_each_svm_dev(sdev, svm, dev) {
> > >>>>> +			/* In case of multiple sub-devices of the same pdev
> > >>>>> +			 * assigned, we should allow multiple bind calls with
> > >>>>> +			 * the same PASID and pdev.
> > >>>>> +			 */
> > >>>>> +			sdev->users++;
> > >>>> What if this is not an mdev device. Is it also allowed?
> > >>> Yes. The IOMMU and VT-d drivers are not mdev aware. Here mdev is just
> > >>> an example of a normal use case. You can bind the same PCI device (PF
> > >>> or SRIOV VF) more than once to the same PASID; you just need to unbind
> > >>> the same number of times.
> > >>
> > >> I don't get the point of binding a non-mdev device several times
> > >> with the same PASID. Do you intend to allow that at userspace
> > >> level or prevent this from happening in VFIO?
> > >
> > > I feel it's better to prevent this from happening, otherwise VFIO
> > > also needs to track the bind count and do multiple unbinds at
> > > mm_exit. But it's not necessary to prevent it in VFIO. We can check
> > > here whether aux_domain is valid, and if not return -EBUSY.
> > Ah OK. So if we can detect the case here it is even better
> >
> I don't understand why VFIO cannot track, since it is mdev aware. If we
> don't refcount the users, one mdev unbind will result in unbinding for all
> mdevs under the same pdev. That may not be the right thing to do.
>
The open question here is not about mdev, for which the refcount is still
required. Eric's point is for non-mdev endpoints. It's meaningless and not
intuitive to allow binding a PASID multiple times to the same device.
Thanks
Kevin
On Fri, 17 Apr 2020 23:46:13 +0000
"Tian, Kevin" <[email protected]> wrote:
> > From: Jacob Pan <[email protected]>
> > Sent: Friday, April 17, 2020 11:29 PM
> >
> > On Fri, 17 Apr 2020 09:46:55 +0200
> > Auger Eric <[email protected]> wrote:
> >
> > > [...]
> > >
> > I don't understand why VFIO cannot track, since it is mdev aware.
> > If we don't refcount the users, one mdev unbind will result in
> > unbinding for all mdevs under the same pdev. That may not be the
> > right thing to do.
>
> The open question here is not about mdev, for which the refcount is still
> required. Eric's point is for non-mdev endpoints. It's meaningless and not
> intuitive to allow binding a PASID multiple times to the same device.
>
That seems contradictory. The refcount here is intended/required for
sub-devices such as mdev. Since the IOMMU driver is not mdev aware, we
cannot treat devices differently.
Perhaps Yi can clarify whether this case is handled within VFIO; if so, I
can drop the refcount here.
> Thanks
> Kevin
[Jacob Pan]
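For the record, a minimal sketch of the two directions being weighed for
for_each_svm_dev() (names such as is_aux_domain_enabled() are stand-ins,
not the final code): reject a repeated bind of the same PASID on a
non-sub-device with -EBUSY as Kevin suggests, or keep the per-sdev refcount
and require matching unbinds as in the current patch.

	for_each_svm_dev(sdev, svm, dev) {
		/*
		 * Option 1 (suggested): rebinding the same PASID to the same
		 * pdev only makes sense for sub-devices such as mdev; reject
		 * it otherwise. is_aux_domain_enabled() is a placeholder for
		 * whatever aux-domain check the driver settles on.
		 */
		if (!is_aux_domain_enabled(dev)) {
			ret = -EBUSY;
			goto out;
		}
		/* Option 2 (current patch): refcount and require matching unbinds. */
		sdev->users++;
		goto out;
	}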