Shared Virtual Addressing (SVA), a.k.a. Shared Virtual Memory (SVM) on
Intel platforms, allows address space sharing between device DMA and
applications. SVA can reduce programming complexity and enhance security.
This VFIO series is intended to expose SVA usage to VMs, i.e., sharing
guest application address spaces with passthrough devices. This is called
vSVA in this series. The complete vSVA enablement requires QEMU/VFIO/IOMMU
changes. The IOMMU and QEMU changes are in separate series (listed under
"Related series").

The high-level architecture for SVA virtualization is shown below. The key
design of vSVA support is to utilize the dual-stage IOMMU translation
(also known as IOMMU nested translation) capability of the host IOMMU.
    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
- FL = First level/stage one page tables
- SL = Second level/stage two page tables
Patch Overview:
1. a refactor of the vfio_iommu_type1 ioctl (patch 0001)
2. report IOMMU nesting info to userspace (patch 0002, 0003, 0004 and 0015)
3. vfio support for PASID allocation and free for VMs (patch 0005, 0006, 0007)
4. vfio support for binding guest page table to host (patch 0008, 0009, 0010)
5. vfio support for IOMMU cache invalidation from VMs (patch 0011)
6. vfio support for vSVA usage on IOMMU-backed mdevs (patch 0012)
7. expose PASID capability to VM (patch 0013)
8. add doc for VFIO dual stage control (patch 0014)
The complete vSVA kernel upstream patches are divided into three phases:
1. Common APIs and PCI device direct assignment
2. IOMMU-backed Mediated Device assignment
3. Page Request Services (PRS) support
This patchset targets phase 1 and phase 2, and is based on Jacob's
series below.
*) [PATCH v4 0/5] IOMMU user API enhancement - wip
https://lore.kernel.org/linux-iommu/[email protected]/
*) [PATCH 00/10] IOASID extensions for guest SVA - wip
https://lkml.org/lkml/2020/3/25/874
The latest IOASID code adds the new interface below for iterating over all
PASIDs of an ioasid_set. The implementation has not been sent out yet as
Jacob needs to do some cleanup; it can be found in the vsva-linux-5.8-rc3-v5
branch on GitHub (mentioned below):
int ioasid_set_for_each_ioasid(int sid, void (*fn)(ioasid_t id, void *data), void *data);
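As an illustration only (not part of this series), VFIO could use this
iterator roughly as below to walk and release all PASIDs of its set when
the VM goes away; vfio_pasid_release_one() is a hypothetical callback
named just for this sketch, and the exact ioasid_free() signature follows
the current mainline API:

    static void vfio_pasid_release_one(ioasid_t pasid, void *data)
    {
            ioasid_free(pasid);
    }

    /* sid obtained from the vfio_mm of the container owner */
    ioasid_set_for_each_ioasid(vfio_mm_ioasid_sid(vmm),
                               vfio_pasid_release_one, NULL);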
The complete set for the current vSVA work can be found in the branch below.
https://github.com/luxis1999/linux-vsva.git: vsva-linux-5.8-rc3-v5
The corresponding QEMU patch series is included in the branch below:
https://github.com/luxis1999/qemu.git: vsva_5.8_rc3_qemu_rfcv8
Regards,
Yi Liu
Changelog:
- Patch v4 -> Patch v5:
a) Address comments against v4
Patch v4: https://lore.kernel.org/kvm/[email protected]/
- Patch v3 -> Patch v4:
a) Address comments against v3
b) Add rb from Stefan on patch 14/15
Patch v3: https://lore.kernel.org/linux-iommu/[email protected]/
- Patch v2 -> Patch v3:
a) Rebase on top of Jacob's v3 iommu uapi patchset
b) Address comments from Kevin and Stefan Hajnoczi
c) Reuse DOMAIN_ATTR_NESTING to get iommu nesting info
d) Drop [PATCH v2 07/15] iommu/uapi: Add iommu_gpasid_unbind_data
Patch v2: https://lore.kernel.org/linux-iommu/[email protected]/#r
- Patch v1 -> Patch v2:
a) Refactor vfio_iommu_type1_ioctl() per suggestion from Christoph
Hellwig.
b) Re-sequence the patch series for better bisect support.
c) Report IOMMU nesting cap info in detail instead of a format in
v1.
d) Enforce one group per nesting type container for vfio iommu type1
driver.
e) Build the vfio_mm related code from vfio.c to be a separate
vfio_pasid.ko.
f) Add PASID ownership check in IOMMU driver.
g) Adapted to the latest IOMMU UAPI design. Removed IOMMU UAPI version
check. Added iommu_gpasid_unbind_data for unbind requests from
userspace.
h) Define a single ioctl VFIO_IOMMU_NESTING_OP for bind/unbind_gtbl
and cache_invld.
i) Document dual stage control in vfio.rst.
Patch v1: https://lore.kernel.org/linux-iommu/[email protected]/
- RFC v3 -> Patch v1:
a) Address comments to the PASID request(alloc/free) path
b) Report PASID alloc/free availability to user-space
c) Add a vfio_iommu_type1 parameter to support pasid quota tuning
d) Adjusted to latest ioasid code implementation. e.g. remove the
code for tracking the allocated PASIDs as latest ioasid code
will track it, VFIO could use ioasid_free_set() to free all
PASIDs.
RFC v3: https://lore.kernel.org/linux-iommu/[email protected]/
- RFC v2 -> v3:
a) Refine the whole patchset to roughly fit the parts in this series
b) Adds complete vfio PASID management framework. e.g. pasid alloc,
free, reclaim in VM crash/down and per-VM PASID quota to prevent
PASID abuse.
c) Adds IOMMU uAPI version check and page table format check to ensure
version compatibility and hardware compatibility.
d) Adds vSVA vfio support for IOMMU-backed mdevs.
RFC v2: https://lore.kernel.org/linux-iommu/[email protected]/
- RFC v1 -> v2:
Dropped vfio: VFIO_IOMMU_ATTACH/DETACH_PASID_TABLE.
RFC v1: https://lore.kernel.org/linux-iommu/[email protected]/
---
Eric Auger (1):
vfio: Document dual stage control
Liu Yi L (13):
vfio/type1: Refactor vfio_iommu_type1_ioctl()
iommu: Report domain nesting info
iommu/smmu: Report empty domain nesting info
vfio/type1: Report iommu nesting info to userspace
vfio: Add PASID allocation/free support
iommu/vt-d: Support setting ioasid set to domain
vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)
iommu/vt-d: Check ownership for PASIDs from user-space
vfio/type1: Support binding guest page tables to PASID
vfio/type1: Allow invalidating first-level/stage IOMMU cache
vfio/type1: Add vSVA support for IOMMU-backed mdevs
vfio/pci: Expose PCIe PASID capability to guest
iommu/vt-d: Support reporting nesting capability info
Yi Sun (1):
iommu: Pass domain to sva_unbind_gpasid()
Documentation/driver-api/vfio.rst | 67 +++
drivers/iommu/arm-smmu-v3.c | 29 +-
drivers/iommu/arm-smmu.c | 29 +-
drivers/iommu/intel/iommu.c | 113 ++++-
drivers/iommu/intel/svm.c | 10 +-
drivers/iommu/iommu.c | 2 +-
drivers/vfio/Kconfig | 6 +
drivers/vfio/Makefile | 1 +
drivers/vfio/pci/vfio_pci_config.c | 2 +-
drivers/vfio/vfio_iommu_type1.c | 818 ++++++++++++++++++++++++++++---------
drivers/vfio/vfio_pasid.c | 271 ++++++++++++
include/linux/intel-iommu.h | 23 +-
include/linux/iommu.h | 4 +-
include/linux/vfio.h | 54 +++
include/uapi/linux/iommu.h | 77 ++++
include/uapi/linux/vfio.h | 90 ++++
16 files changed, 1395 insertions(+), 201 deletions(-)
create mode 100644 drivers/vfio/vfio_pasid.c
--
2.7.4
This patch exposes the PCIe PASID capability to the guest for assigned
devices. The existing vfio_pci driver hides it from the guest by setting
the capability length to 0 in pci_ext_cap_length[].

This patch only exposes the PASID capability for devices which have the
PCIe PASID extended structure in their configuration space. Hence, the
PASID capability will not be seen on VFs, as a VF does not implement the
PASID extended structure in its configuration space. VF support is a TODO
for the future.

Related discussion can be found in the link below:
https://lkml.org/lkml/2020/4/7/693
Cc: Kevin Tian <[email protected]>
CC: Jacob Pan <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Eric Auger <[email protected]>
Cc: Jean-Philippe Brucker <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Lu Baolu <[email protected]>
Signed-off-by: Liu Yi L <[email protected]>
---
v1 -> v2:
*) added in v2, but it was sent in a separate patch series before
---
drivers/vfio/pci/vfio_pci_config.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index d98843f..07ff2e6 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -95,7 +95,7 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = {
[PCI_EXT_CAP_ID_LTR] = PCI_EXT_CAP_LTR_SIZEOF,
[PCI_EXT_CAP_ID_SECPCI] = 0, /* not yet */
[PCI_EXT_CAP_ID_PMUX] = 0, /* not yet */
- [PCI_EXT_CAP_ID_PASID] = 0, /* not yet */
+ [PCI_EXT_CAP_ID_PASID] = PCI_EXT_CAP_PASID_SIZEOF,
};
/*
--
2.7.4
From the IOMMU's point of view, PASIDs allocated and managed by external
components (e.g. VFIO) will be passed in for gpasid_bind/unbind operations.
The IOMMU needs some knowledge to check the PASID ownership, hence add an
interface for those components to tell the IOMMU the PASID owner.

In the latest kernel design, PASID ownership is managed by the IOASID set
the PASID is allocated from. This patch adds support for setting the ioasid
set ID on the domains used for nesting/vSVA. Subsequent SVA operations on
the PASID will be checked against its IOASID set for proper ownership.
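For illustration, the expected caller pattern (as used by the VFIO patch
later in this series) looks roughly like below; error handling is omitted
and the vfio_mm is assumed to have been obtained via
vfio_mm_get_from_task():

    int sid = vfio_mm_ioasid_sid(vmm);

    ret = iommu_domain_set_attr(domain, DOMAIN_ATTR_IOASID_SID, &sid);
    if (ret)
            /* not a nesting domain, or another set is already bound */
            goto out;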
Cc: Kevin Tian <[email protected]>
CC: Jacob Pan <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Eric Auger <[email protected]>
Cc: Jean-Philippe Brucker <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Lu Baolu <[email protected]>
Signed-off-by: Liu Yi L <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
v4 -> v5:
*) address comments from Eric Auger.
---
drivers/iommu/intel/iommu.c | 22 ++++++++++++++++++++++
include/linux/intel-iommu.h | 4 ++++
include/linux/iommu.h | 1 +
3 files changed, 27 insertions(+)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 72ae6a2..4d54198 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1793,6 +1793,7 @@ static struct dmar_domain *alloc_domain(int flags)
if (first_level_by_default())
domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
domain->has_iotlb_device = false;
+ domain->ioasid_sid = INVALID_IOASID_SET;
INIT_LIST_HEAD(&domain->devices);
return domain;
@@ -6039,6 +6040,27 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain,
}
spin_unlock_irqrestore(&device_domain_lock, flags);
break;
+ case DOMAIN_ATTR_IOASID_SID:
+ {
+ int sid = *(int *)data;
+
+ if (!(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE)) {
+ ret = -ENODEV;
+ break;
+ }
+ spin_lock_irqsave(&device_domain_lock, flags);
+ if (dmar_domain->ioasid_sid != INVALID_IOASID_SET &&
+ dmar_domain->ioasid_sid != sid) {
+ pr_warn_ratelimited("multi ioasid_set (%d:%d) setting",
+ dmar_domain->ioasid_sid, sid);
+ ret = -EBUSY;
+ spin_unlock_irqrestore(&device_domain_lock, flags);
+ break;
+ }
+ dmar_domain->ioasid_sid = sid;
+ spin_unlock_irqrestore(&device_domain_lock, flags);
+ break;
+ }
default:
ret = -EINVAL;
break;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 3f23c26..0d0ab32 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -549,6 +549,10 @@ struct dmar_domain {
2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
u64 max_addr; /* maximum mapped address */
+ int ioasid_sid; /*
+ * the ioasid set which tracks all
+ * PASIDs used by the domain.
+ */
int default_pasid; /*
* The default pasid used for non-SVM
* traffic on mediated devices.
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 7ca9d48..e84a1d5 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -124,6 +124,7 @@ enum iommu_attr {
DOMAIN_ATTR_FSL_PAMUV1,
DOMAIN_ATTR_NESTING, /* two stages of translation */
DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
+ DOMAIN_ATTR_IOASID_SID,
DOMAIN_ATTR_MAX,
};
--
2.7.4
This patch allows user space to request PASID allocation/free, e.g. when
serving the request from the guest.

PASIDs that are not freed by userspace are automatically freed when the
IOASID set is destroyed on process exit.
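A rough userspace usage sketch (illustrative only); container_fd is
assumed to be an open VFIO container with the nesting iommu type set,
and the structure and flags follow the uAPI added by this patch:

    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* returns the allocated PASID on success, negative value on failure */
    static int request_guest_pasid(int container_fd)
    {
            struct vfio_iommu_type1_pasid_request req = {
                    .argsz = sizeof(req),
                    .flags = VFIO_IOMMU_FLAG_ALLOC_PASID,
                    .range = { .min = 1, .max = 1023 }, /* inclusive */
            };

            return ioctl(container_fd, VFIO_IOMMU_PASID_REQUEST, &req);
    }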
Cc: Kevin Tian <[email protected]>
CC: Jacob Pan <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Eric Auger <[email protected]>
Cc: Jean-Philippe Brucker <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Lu Baolu <[email protected]>
Signed-off-by: Liu Yi L <[email protected]>
Signed-off-by: Yi Sun <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
v4 -> v5:
*) address comments from Eric Auger.
*) the comments for the PASID_FREE request are addressed in patch 5/15 of
   this series.
v3 -> v4:
*) address comments from v3, except the below comment against the range
   of the PASID_FREE request. Needs more help on it.
"> +if (req.range.min > req.range.max)
Is it exploitable that a user can spin the kernel for a long time in
the case of a free by calling this with [0, MAX_UINT] regardless of
their actual allocations?"
https://lore.kernel.org/linux-iommu/[email protected]/
v1 -> v2:
*) move the vfio_mm related code to be a separate module
*) use a single structure for alloc/free, could support a range of PASIDs
*) fetch vfio_mm at group_attach time instead of at iommu driver open time
---
drivers/vfio/Kconfig | 1 +
drivers/vfio/vfio_iommu_type1.c | 85 +++++++++++++++++++++++++++++++++++++++++
drivers/vfio/vfio_pasid.c | 10 +++++
include/linux/vfio.h | 6 +++
include/uapi/linux/vfio.h | 37 ++++++++++++++++++
5 files changed, 139 insertions(+)
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 3d8a108..95d90c6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -2,6 +2,7 @@
config VFIO_IOMMU_TYPE1
tristate
depends on VFIO
+ select VFIO_PASID if (X86)
default n
config VFIO_IOMMU_SPAPR_TCE
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ed80104..55b4065 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -76,6 +76,7 @@ struct vfio_iommu {
bool dirty_page_tracking;
bool pinned_page_dirty_scope;
struct iommu_nesting_info *nesting_info;
+ struct vfio_mm *vmm;
};
struct vfio_domain {
@@ -1937,6 +1938,11 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
{
+ if (iommu->vmm) {
+ vfio_mm_put(iommu->vmm);
+ iommu->vmm = NULL;
+ }
+
kfree(iommu->nesting_info);
iommu->nesting_info = NULL;
}
@@ -2071,6 +2077,26 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
iommu->nesting_info);
if (ret)
goto out_detach;
+
+ if (iommu->nesting_info->features &
+ IOMMU_NESTING_FEAT_SYSWIDE_PASID) {
+ struct vfio_mm *vmm;
+ int sid;
+
+ vmm = vfio_mm_get_from_task(current);
+ if (IS_ERR(vmm)) {
+ ret = PTR_ERR(vmm);
+ goto out_detach;
+ }
+ iommu->vmm = vmm;
+
+ sid = vfio_mm_ioasid_sid(vmm);
+ ret = iommu_domain_set_attr(domain->domain,
+ DOMAIN_ATTR_IOASID_SID,
+ &sid);
+ if (ret)
+ goto out_detach;
+ }
}
/* Get aperture info */
@@ -2855,6 +2881,63 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
return -EINVAL;
}
+static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
+ unsigned int min,
+ unsigned int max)
+{
+ int ret = -EOPNOTSUPP;
+
+ mutex_lock(&iommu->lock);
+ if (iommu->vmm)
+ ret = vfio_pasid_alloc(iommu->vmm, min, max);
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
+static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
+ unsigned int min,
+ unsigned int max)
+{
+ int ret = -EOPNOTSUPP;
+
+ mutex_lock(&iommu->lock);
+ if (iommu->vmm) {
+ vfio_pasid_free_range(iommu->vmm, min, max);
+ ret = 0;
+ }
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
+static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu,
+ unsigned long arg)
+{
+ struct vfio_iommu_type1_pasid_request req;
+ unsigned long minsz;
+
+ minsz = offsetofend(struct vfio_iommu_type1_pasid_request, range);
+
+ if (copy_from_user(&req, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (req.argsz < minsz || (req.flags & ~VFIO_PASID_REQUEST_MASK))
+ return -EINVAL;
+
+ if (req.range.min > req.range.max)
+ return -EINVAL;
+
+ switch (req.flags & VFIO_PASID_REQUEST_MASK) {
+ case VFIO_IOMMU_FLAG_ALLOC_PASID:
+ return vfio_iommu_type1_pasid_alloc(iommu,
+ req.range.min, req.range.max);
+ case VFIO_IOMMU_FLAG_FREE_PASID:
+ return vfio_iommu_type1_pasid_free(iommu,
+ req.range.min, req.range.max);
+ default:
+ return -EINVAL;
+ }
+}
+
static long vfio_iommu_type1_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
{
@@ -2871,6 +2954,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
return vfio_iommu_type1_unmap_dma(iommu, arg);
case VFIO_IOMMU_DIRTY_PAGES:
return vfio_iommu_type1_dirty_pages(iommu, arg);
+ case VFIO_IOMMU_PASID_REQUEST:
+ return vfio_iommu_type1_pasid_request(iommu, arg);
default:
return -ENOTTY;
}
diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c
index 66e6054e..ebec244 100644
--- a/drivers/vfio/vfio_pasid.c
+++ b/drivers/vfio/vfio_pasid.c
@@ -61,6 +61,7 @@ void vfio_mm_put(struct vfio_mm *vmm)
{
kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio_mm_lock);
}
+EXPORT_SYMBOL_GPL(vfio_mm_put);
static void vfio_mm_get(struct vfio_mm *vmm)
{
@@ -114,6 +115,13 @@ struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
mmput(mm);
return vmm;
}
+EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
+
+int vfio_mm_ioasid_sid(struct vfio_mm *vmm)
+{
+ return vmm->ioasid_sid;
+}
+EXPORT_SYMBOL_GPL(vfio_mm_ioasid_sid);
/*
* Find PASID within @min and @max
@@ -197,6 +205,7 @@ int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max)
return pasid;
}
+EXPORT_SYMBOL_GPL(vfio_pasid_alloc);
void vfio_pasid_free_range(struct vfio_mm *vmm,
ioasid_t min, ioasid_t max)
@@ -213,6 +222,7 @@ void vfio_pasid_free_range(struct vfio_mm *vmm,
vfio_remove_pasid(vmm, vid);
mutex_unlock(&vmm->pasid_lock);
}
+EXPORT_SYMBOL_GPL(vfio_pasid_free_range);
static int __init vfio_pasid_init(void)
{
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 31472a9..a111108 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -101,6 +101,7 @@ struct vfio_mm;
#if IS_ENABLED(CONFIG_VFIO_PASID)
extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
extern void vfio_mm_put(struct vfio_mm *vmm);
+int vfio_mm_ioasid_sid(struct vfio_mm *vmm);
extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max);
extern void vfio_pasid_free_range(struct vfio_mm *vmm,
ioasid_t min, ioasid_t max);
@@ -114,6 +115,11 @@ static inline void vfio_mm_put(struct vfio_mm *vmm)
{
}
+static inline int vfio_mm_ioasid_sid(struct vfio_mm *vmm)
+{
+ return -ENOTTY;
+}
+
static inline int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max)
{
return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 46a78af..96a115f 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1172,6 +1172,43 @@ struct vfio_iommu_type1_dirty_bitmap_get {
#define VFIO_IOMMU_DIRTY_PAGES _IO(VFIO_TYPE, VFIO_BASE + 17)
+/**
+ * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 18,
+ * struct vfio_iommu_type1_pasid_request)
+ *
+ * PASID (Processor Address Space ID) is a PCIe concept for tagging
+ * address spaces in DMA requests. When system-wide PASID allocation
+ * is required by underlying iommu driver (e.g. Intel VT-d), this
+ * provides an interface for userspace to request pasid alloc/free
+ * for its assigned devices. Userspace should check the availability
+ * of this API by checking VFIO_IOMMU_TYPE1_INFO_CAP_NESTING through
+ * VFIO_IOMMU_GET_INFO.
+ *
+ * @flags=VFIO_IOMMU_FLAG_ALLOC_PASID, allocate a single PASID within @range.
+ * @flags=VFIO_IOMMU_FLAG_FREE_PASID, free the PASIDs within @range.
+ * @range is [min, max], which means both @min and @max are inclusive.
+ * ALLOC_PASID and FREE_PASID are mutually exclusive.
+ *
+ * returns: allocated PASID value on success, -errno on failure for
+ * ALLOC_PASID;
+ * 0 for FREE_PASID operation;
+ */
+struct vfio_iommu_type1_pasid_request {
+ __u32 argsz;
+#define VFIO_IOMMU_FLAG_ALLOC_PASID (1 << 0)
+#define VFIO_IOMMU_FLAG_FREE_PASID (1 << 1)
+ __u32 flags;
+ struct {
+ __u32 min;
+ __u32 max;
+ } range;
+};
+
+#define VFIO_PASID_REQUEST_MASK (VFIO_IOMMU_FLAG_ALLOC_PASID | \
+ VFIO_IOMMU_FLAG_FREE_PASID)
+
+#define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE + 18)
+
/* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
/*
--
2.7.4
This patch exports iommu nesting capability info to user space through
VFIO. User space is expected to check this info for supported uAPIs (e.g.
PASID alloc/free, bind page table, and cache invalidation) and the vendor
specific format information for the first level/stage page table that will
be bound to it.

The nesting info is available only after the nesting iommu type is set
for a container. The current implementation imposes one limitation - one
nesting container should include at most one group. The philosophy of the
vfio container is having all groups/devices within the container share
the same IOMMU context. When vSVA is enabled, one IOMMU context could
include one 2nd-level address space and multiple 1st-level address spaces.
While the 2nd-level address space is reasonably sharable by multiple groups,
blindly sharing 1st-level address spaces across all groups within the
container might instead break the guest expectation. In the future a sub/
super-container concept might be introduced to allow partial address space
sharing within an IOMMU context. But for now let's go with this restriction
by requiring a singleton container for using nesting iommu features. The
link below has the related discussion about this decision.
https://lkml.org/lkml/2020/5/15/1028
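For illustration, userspace would discover the capability roughly as below
(error handling and the usual two-call argsz dance are omitted); 'info' is
assumed to be a buffer already filled by VFIO_IOMMU_GET_INFO with
VFIO_IOMMU_INFO_CAPS set:

    struct vfio_info_cap_header *hdr;
    struct vfio_iommu_type1_info_cap_nesting *nesting = NULL;

    for (hdr = (void *)info + info->cap_offset; ;
         hdr = (void *)info + hdr->next) {
            if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_NESTING) {
                    nesting = (void *)hdr;
                    break;
            }
            if (!hdr->next)
                    break;
    }
    /* nesting->info.features then tells which uAPIs must be used */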
Cc: Kevin Tian <[email protected]>
CC: Jacob Pan <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Eric Auger <[email protected]>
Cc: Jean-Philippe Brucker <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Lu Baolu <[email protected]>
Signed-off-by: Liu Yi L <[email protected]>
---
v4 -> v5:
*) address comments from Eric Auger.
*) return struct iommu_nesting_info for VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
   the cap is quite "cheap"; if it needs extension in future, just define
   another cap.
https://lore.kernel.org/kvm/[email protected]/
v3 -> v4:
*) address comments against v3.
v1 -> v2:
*) added in v2
---
drivers/vfio/vfio_iommu_type1.c | 102 +++++++++++++++++++++++++++++++++++-----
include/uapi/linux/vfio.h | 19 ++++++++
2 files changed, 109 insertions(+), 12 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 3bd70ff..ed80104 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit,
"Maximum number of user DMA mappings per container (65535).");
struct vfio_iommu {
- struct list_head domain_list;
- struct list_head iova_list;
- struct vfio_domain *external_domain; /* domain for external user */
- struct mutex lock;
- struct rb_root dma_list;
- struct blocking_notifier_head notifier;
- unsigned int dma_avail;
- uint64_t pgsize_bitmap;
- bool v2;
- bool nesting;
- bool dirty_page_tracking;
- bool pinned_page_dirty_scope;
+ struct list_head domain_list;
+ struct list_head iova_list;
+ /* domain for external user */
+ struct vfio_domain *external_domain;
+ struct mutex lock;
+ struct rb_root dma_list;
+ struct blocking_notifier_head notifier;
+ unsigned int dma_avail;
+ uint64_t pgsize_bitmap;
+ bool v2;
+ bool nesting;
+ bool dirty_page_tracking;
+ bool pinned_page_dirty_scope;
+ struct iommu_nesting_info *nesting_info;
};
struct vfio_domain {
@@ -130,6 +132,9 @@ struct vfio_regions {
#define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) \
(!list_empty(&iommu->domain_list))
+#define CONTAINER_HAS_DOMAIN(iommu) (((iommu)->external_domain) || \
+ (!list_empty(&(iommu)->domain_list)))
+
#define DIRTY_BITMAP_BYTES(n) (ALIGN(n, BITS_PER_TYPE(u64)) / BITS_PER_BYTE)
/*
@@ -1929,6 +1934,13 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
list_splice_tail(iova_copy, iova);
}
+
+static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
+{
+ kfree(iommu->nesting_info);
+ iommu->nesting_info = NULL;
+}
+
static int vfio_iommu_type1_attach_group(void *iommu_data,
struct iommu_group *iommu_group)
{
@@ -1959,6 +1971,12 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
}
}
+ /* Nesting type container can include only one group */
+ if (iommu->nesting && CONTAINER_HAS_DOMAIN(iommu)) {
+ mutex_unlock(&iommu->lock);
+ return -EINVAL;
+ }
+
group = kzalloc(sizeof(*group), GFP_KERNEL);
domain = kzalloc(sizeof(*domain), GFP_KERNEL);
if (!group || !domain) {
@@ -2029,6 +2047,32 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
if (ret)
goto out_domain;
+ /* Nesting cap info is available only after attaching */
+ if (iommu->nesting) {
+ struct iommu_nesting_info tmp = { .size = 0, };
+
+ /* First get the size of vendor specific nesting info */
+ ret = iommu_domain_get_attr(domain->domain,
+ DOMAIN_ATTR_NESTING,
+ &tmp);
+ if (ret)
+ goto out_detach;
+
+ iommu->nesting_info = kzalloc(tmp.size, GFP_KERNEL);
+ if (!iommu->nesting_info) {
+ ret = -ENOMEM;
+ goto out_detach;
+ }
+
+ /* Now get the nesting info */
+ iommu->nesting_info->size = tmp.size;
+ ret = iommu_domain_get_attr(domain->domain,
+ DOMAIN_ATTR_NESTING,
+ iommu->nesting_info);
+ if (ret)
+ goto out_detach;
+ }
+
/* Get aperture info */
iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);
@@ -2138,6 +2182,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
return 0;
out_detach:
+ vfio_iommu_release_nesting_info(iommu);
vfio_iommu_detach_group(domain, group);
out_domain:
iommu_domain_free(domain->domain);
@@ -2338,6 +2383,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
vfio_iommu_unmap_unpin_all(iommu);
else
vfio_iommu_unmap_unpin_reaccount(iommu);
+
+ vfio_iommu_release_nesting_info(iommu);
}
iommu_domain_free(domain->domain);
list_del(&domain->next);
@@ -2546,6 +2593,31 @@ static int vfio_iommu_migration_build_caps(struct vfio_iommu *iommu,
return vfio_info_add_capability(caps, &cap_mig.header, sizeof(cap_mig));
}
+static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
+ struct vfio_info_cap *caps)
+{
+ struct vfio_info_cap_header *header;
+ struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
+ size_t size;
+
+ size = offsetof(struct vfio_iommu_type1_info_cap_nesting, info) +
+ iommu->nesting_info->size;
+
+ header = vfio_info_cap_add(caps, size,
+ VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
+ if (IS_ERR(header))
+ return PTR_ERR(header);
+
+ nesting_cap = container_of(header,
+ struct vfio_iommu_type1_info_cap_nesting,
+ header);
+
+ memcpy(&nesting_cap->info, iommu->nesting_info,
+ iommu->nesting_info->size);
+
+ return 0;
+}
+
static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
unsigned long arg)
{
@@ -2581,6 +2653,12 @@ static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
if (!ret)
ret = vfio_iommu_iova_build_caps(iommu, &caps);
+ if (iommu->nesting_info) {
+ ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
+ if (ret)
+ return ret;
+ }
+
mutex_unlock(&iommu->lock);
if (ret)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9204705..46a78af 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -14,6 +14,7 @@
#include <linux/types.h>
#include <linux/ioctl.h>
+#include <linux/iommu.h>
#define VFIO_API_VERSION 0
@@ -1039,6 +1040,24 @@ struct vfio_iommu_type1_info_cap_migration {
__u64 max_dirty_bitmap_size; /* in bytes */
};
+/*
+ * The nesting capability allows to report the related capability
+ * and info for nesting iommu type.
+ *
+ * The structures below define version 1 of this capability.
+ *
+ * User space selected VFIO_TYPE1_NESTING_IOMMU type should check
+ * this capto get supported features.
+ *
+ * @info: the nesting info provided by IOMMU driver.
+ */
+#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING 3
+
+struct vfio_iommu_type1_info_cap_nesting {
+ struct vfio_info_cap_header header;
+ struct iommu_nesting_info info;
+};
+
#define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
/**
--
2.7.4
IOMMUs that support nesting translation needs report the capability info
to userspace, e.g. the format of first level/stage paging structures.
This patch reports nesting info by DOMAIN_ATTR_NESTING. Caller can get
nesting info after setting DOMAIN_ATTR_NESTING.
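For illustration, a caller (e.g. the VFIO patch in this series) is expected
to query the info in two steps, first the size and then the full
vendor-specific data; error handling is omitted in this sketch:

    struct iommu_nesting_info tmp = { .size = 0, };
    struct iommu_nesting_info *info;

    /* first call reports the required size in tmp.size */
    iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &tmp);

    info = kzalloc(tmp.size, GFP_KERNEL);
    info->size = tmp.size;

    /* second call fills in format, features and vendor data */
    iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, info);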
Cc: Kevin Tian <[email protected]>
CC: Jacob Pan <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Eric Auger <[email protected]>
Cc: Jean-Philippe Brucker <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Lu Baolu <[email protected]>
Signed-off-by: Liu Yi L <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
v4 -> v5:
*) address comments from Eric Auger.
v3 -> v4:
*) split the SMMU driver changes to be a separate patch
*) move the @addr_width and @pasid_bits from vendor specific
part to generic part.
*) tweak the description for the @features field of struct
iommu_nesting_info.
*) add description on the @data[] field of struct iommu_nesting_info
v2 -> v3:
*) remove cap/ecap_mask in iommu_nesting_info.
*) reuse DOMAIN_ATTR_NESTING to get nesting info.
*) return an empty iommu_nesting_info for SMMU drivers per Jean's
   suggestion.
---
include/uapi/linux/iommu.h | 77 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 77 insertions(+)
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 1afc661..d2a47c4 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -332,4 +332,81 @@ struct iommu_gpasid_bind_data {
} vendor;
};
+/*
+ * struct iommu_nesting_info - Information for nesting-capable IOMMU.
+ * user space should check it before using
+ * nesting capability.
+ *
+ * @size: size of the whole structure
+ * @format: PASID table entry format, the same definition as struct
+ * iommu_gpasid_bind_data @format.
+ * @features: supported nesting features.
+ * @flags: currently reserved for future extension.
+ * @addr_width: The output addr width of first level/stage translation
+ * @pasid_bits: Maximum supported PASID bits, 0 represents no PASID
+ * support.
+ * @data: vendor specific cap info. data[] structure type can be deduced
+ * from @format field.
+ *
+ * +===============+======================================================+
+ * | feature | Notes |
+ * +===============+======================================================+
+ * | SYSWIDE_PASID | PASIDs are managed in system-wide, instead of per |
+ * | | device. When a device is assigned to userspace or |
+ * | | VM, proper uAPI (userspace driver framework uAPI, |
+ * | | e.g. VFIO) must be used to allocate/free PASIDs for |
+ * | | the assigned device. |
+ * +---------------+------------------------------------------------------+
+ * | BIND_PGTBL | The owner of the first level/stage page table must |
+ * | | explicitly bind the page table to associated PASID |
+ * | | (either the one specified in bind request or the |
+ * | | default PASID of iommu domain), through userspace |
+ * | | driver framework uAPI (e.g. VFIO_IOMMU_NESTING_OP). |
+ * +---------------+------------------------------------------------------+
+ * | CACHE_INVLD | The owner of the first level/stage page table must |
+ * | | explicitly invalidate the IOMMU cache through uAPI |
+ * | | provided by userspace driver framework (e.g. VFIO) |
+ * | | according to vendor-specific requirement when |
+ * | | changing the page table. |
+ * +---------------+------------------------------------------------------+
+ *
+ * @data[] types defined for @format:
+ * +================================+=====================================+
+ * | @format | @data[] |
+ * +================================+=====================================+
+ * | IOMMU_PASID_FORMAT_INTEL_VTD | struct iommu_nesting_info_vtd |
+ * +--------------------------------+-------------------------------------+
+ *
+ */
+struct iommu_nesting_info {
+ __u32 size;
+ __u32 format;
+#define IOMMU_NESTING_FEAT_SYSWIDE_PASID (1 << 0)
+#define IOMMU_NESTING_FEAT_BIND_PGTBL (1 << 1)
+#define IOMMU_NESTING_FEAT_CACHE_INVLD (1 << 2)
+ __u32 features;
+ __u32 flags;
+ __u16 addr_width;
+ __u16 pasid_bits;
+ __u32 padding;
+ __u8 data[];
+};
+
+/*
+ * struct iommu_nesting_info_vtd - Intel VT-d specific nesting info
+ *
+ * @flags: VT-d specific flags. Currently reserved for future
+ * extension.
+ * @cap_reg: Describe basic capabilities as defined in VT-d capability
+ * register.
+ * @ecap_reg: Describe the extended capabilities as defined in VT-d
+ * extended capability register.
+ */
+struct iommu_nesting_info_vtd {
+ __u32 flags;
+ __u32 padding;
+ __u64 cap_reg;
+ __u64 ecap_reg;
+};
+
#endif /* _UAPI_IOMMU_H */
--
2.7.4
Hi Yi,
On 7/12/20 1:20 PM, Liu Yi L wrote:
> IOMMUs that support nesting translation needs report the capability info
s/needs/need to report
> to userspace, e.g. the format of first level/stage paging structures.
It gives information about requirements the userspace needs to implement
plus other features characterizing the physical implementation.
>
> This patch reports nesting info by DOMAIN_ATTR_NESTING. Caller can get
> nesting info after setting DOMAIN_ATTR_NESTING.
I guess you meant after selecting VFIO_TYPE1_NESTING_IOMMU?
>
> Cc: Kevin Tian <[email protected]>
> CC: Jacob Pan <[email protected]>
> Cc: Alex Williamson <[email protected]>
> Cc: Eric Auger <[email protected]>
> Cc: Jean-Philippe Brucker <[email protected]>
> Cc: Joerg Roedel <[email protected]>
> Cc: Lu Baolu <[email protected]>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> v4 -> v5:
> *) address comments from Eric Auger.
>
> v3 -> v4:
> *) split the SMMU driver changes to be a separate patch
> *) move the @addr_width and @pasid_bits from vendor specific
> part to generic part.
> *) tweak the description for the @features field of struct
> iommu_nesting_info.
> *) add description on the @data[] field of struct iommu_nesting_info
>
> v2 -> v3:
> *) remvoe cap/ecap_mask in iommu_nesting_info.
> *) reuse DOMAIN_ATTR_NESTING to get nesting info.
> *) return an empty iommu_nesting_info for SMMU drivers per Jean'
> suggestion.
> ---
> include/uapi/linux/iommu.h | 77 ++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 77 insertions(+)
>
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 1afc661..d2a47c4 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -332,4 +332,81 @@ struct iommu_gpasid_bind_data {
> } vendor;
> };
>
> +/*
> + * struct iommu_nesting_info - Information for nesting-capable IOMMU.
> + * user space should check it before using
> + * nesting capability.
> + *
> + * @size: size of the whole structure
> + * @format: PASID table entry format, the same definition as struct
> + * iommu_gpasid_bind_data @format.
> + * @features: supported nesting features.
> + * @flags: currently reserved for future extension.
> + * @addr_width: The output addr width of first level/stage translation
> + * @pasid_bits: Maximum supported PASID bits, 0 represents no PASID
> + * support.
> + * @data: vendor specific cap info. data[] structure type can be deduced
> + * from @format field.
> + *
> + * +===============+======================================================+
> + * | feature | Notes |
> + * +===============+======================================================+
> + * | SYSWIDE_PASID | PASIDs are managed in system-wide, instead of per |
s/in system-wide/system-wide ?
> + * | | device. When a device is assigned to userspace or |
> + * | | VM, proper uAPI (userspace driver framework uAPI, |
> + * | | e.g. VFIO) must be used to allocate/free PASIDs for |
> + * | | the assigned device.
Isn't it possible to be more explicit, something like:
|
System-wide PASID management is mandated by the physical IOMMU. All
PASIDs allocation must be mediated through the TBD API.
> + * +---------------+------------------------------------------------------+
> + * | BIND_PGTBL | The owner of the first level/stage page table must |
> + * | | explicitly bind the page table to associated PASID |
> + * | | (either the one specified in bind request or the |
> + * | | default PASID of iommu domain), through userspace |
> + * | | driver framework uAPI (e.g. VFIO_IOMMU_NESTING_OP). |
As per your answer in https://lkml.org/lkml/2020/7/6/383, I now
understand ARM would not expose that BIND_PGTBL nesting feature, but I still
think the above wording is a bit confusing. Maybe you could explicitly
talk about the PASID *entry* that needs to be passed from guest to host.
On ARM we directly pass the PASID table but when reading the above
description I fail to determine if this does not fit that description.
> + * +---------------+------------------------------------------------------+
> + * | CACHE_INVLD | The owner of the first level/stage page table must |
> + * | | explicitly invalidate the IOMMU cache through uAPI |
> + * | | provided by userspace driver framework (e.g. VFIO) |
> + * | | according to vendor-specific requirement when |
> + * | | changing the page table. |
> + * +---------------+------------------------------------------------------+
instead of using the "uAPI provided by userspace driver framework (e.g.
VFIO)", can't we use the so-called IOMMU UAPI terminology which now has
a userspace documentation?
> + *
> + * @data[] types defined for @format:
> + * +================================+=====================================+
> + * | @format | @data[] |
> + * +================================+=====================================+
> + * | IOMMU_PASID_FORMAT_INTEL_VTD | struct iommu_nesting_info_vtd |
> + * +--------------------------------+-------------------------------------+
> + *
> + */
> +struct iommu_nesting_info {
> + __u32 size;
shouldn't it be @argsz to fit the iommu uapi convention and take benefit
to put the flags field just below?
> + __u32 format;
> +#define IOMMU_NESTING_FEAT_SYSWIDE_PASID (1 << 0)
> +#define IOMMU_NESTING_FEAT_BIND_PGTBL (1 << 1)
> +#define IOMMU_NESTING_FEAT_CACHE_INVLD (1 << 2)
> + __u32 features;
> + __u32 flags;
> + __u16 addr_width;
> + __u16 pasid_bits;
> + __u32 padding;
> + __u8 data[];
> +};
> +
> +/*
> + * struct iommu_nesting_info_vtd - Intel VT-d specific nesting info
> + *
> + * @flags: VT-d specific flags. Currently reserved for future
> + * extension.
must be set to 0?
> + * @cap_reg: Describe basic capabilities as defined in VT-d capability
> + * register.
> + * @ecap_reg: Describe the extended capabilities as defined in VT-d
> + * extended capability register.
> + */
> +struct iommu_nesting_info_vtd {
> + __u32 flags;
> + __u32 padding;
> + __u64 cap_reg;
> + __u64 ecap_reg;
> +};
> +
> #endif /* _UAPI_IOMMU_H */
Thanks
Eric
>
Yi,
On 7/12/20 1:20 PM, Liu Yi L wrote:
> This patch exports iommu nesting capability info to user space through
> VFIO. User space is expected to check this info for supported uAPIs (e.g.
it is not only to check the supported uAPIs but rather to know which
callbacks it must call upon vIOMMU events and which features are
supported by the physical IOMMU.
> PASID alloc/free, bind page table, and cache invalidation) and the vendor
> specific format information for first level/stage page table that will be
> bound to.
>
> The nesting info is available only after the nesting iommu type is set
> for a container.
to NESTED type
Current implementation imposes one limitation - one
> nesting container should include at most one group. The philosophy of
> vfio container is having all groups/devices within the container share
> the same IOMMU context. When vSVA is enabled, one IOMMU context could
> include one 2nd-level address space and multiple 1st-level address spaces.
> While the 2nd-level address space is reasonably sharable by multiple groups
> , blindly sharing 1st-level address spaces across all groups within the
> container might instead break the guest expectation. In the future sub/
> super container concept might be introduced to allow partial address space
> sharing within an IOMMU context. But for now let's go with this restriction
> by requiring singleton container for using nesting iommu features. Below
> link has the related discussion about this decision.
Maybe add a note about SMMU related changes spotted by Jean.
>
> https://lkml.org/lkml/2020/5/15/1028
>
> Cc: Kevin Tian <[email protected]>
> CC: Jacob Pan <[email protected]>
> Cc: Alex Williamson <[email protected]>
> Cc: Eric Auger <[email protected]>
> Cc: Jean-Philippe Brucker <[email protected]>
> Cc: Joerg Roedel <[email protected]>
> Cc: Lu Baolu <[email protected]>
> Signed-off-by: Liu Yi L <[email protected]>
> ---
> v4 -> v5:
> *) address comments from Eric Auger.
> *) return struct iommu_nesting_info for VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
> cap is much "cheap", if needs extension in future, just define another cap.
> https://lore.kernel.org/kvm/[email protected]/
>
> v3 -> v4:
> *) address comments against v3.
>
> v1 -> v2:
> *) added in v2
> ---
> drivers/vfio/vfio_iommu_type1.c | 102 +++++++++++++++++++++++++++++++++++-----
> include/uapi/linux/vfio.h | 19 ++++++++
> 2 files changed, 109 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 3bd70ff..ed80104 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit,
> "Maximum number of user DMA mappings per container (65535).");
>
> struct vfio_iommu {
> - struct list_head domain_list;
> - struct list_head iova_list;
> - struct vfio_domain *external_domain; /* domain for external user */
> - struct mutex lock;
> - struct rb_root dma_list;
> - struct blocking_notifier_head notifier;
> - unsigned int dma_avail;
> - uint64_t pgsize_bitmap;
> - bool v2;
> - bool nesting;
> - bool dirty_page_tracking;
> - bool pinned_page_dirty_scope;
> + struct list_head domain_list;
> + struct list_head iova_list;
> + /* domain for external user */
> + struct vfio_domain *external_domain;
> + struct mutex lock;
> + struct rb_root dma_list;
> + struct blocking_notifier_head notifier;
> + unsigned int dma_avail;
> + uint64_t pgsize_bitmap;
> + bool v2;
> + bool nesting;
> + bool dirty_page_tracking;
> + bool pinned_page_dirty_scope;
> + struct iommu_nesting_info *nesting_info;
> };
>
> struct vfio_domain {
> @@ -130,6 +132,9 @@ struct vfio_regions {
> #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) \
> (!list_empty(&iommu->domain_list))
>
> +#define CONTAINER_HAS_DOMAIN(iommu) (((iommu)->external_domain) || \
> + (!list_empty(&(iommu)->domain_list)))
> +
> #define DIRTY_BITMAP_BYTES(n) (ALIGN(n, BITS_PER_TYPE(u64)) / BITS_PER_BYTE)
>
> /*
> @@ -1929,6 +1934,13 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
>
> list_splice_tail(iova_copy, iova);
> }
> +
> +static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
> +{
> + kfree(iommu->nesting_info);
> + iommu->nesting_info = NULL;
> +}
> +
> static int vfio_iommu_type1_attach_group(void *iommu_data,
> struct iommu_group *iommu_group)
> {
> @@ -1959,6 +1971,12 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> }
> }
>
> + /* Nesting type container can include only one group */
> + if (iommu->nesting && CONTAINER_HAS_DOMAIN(iommu)) {
> + mutex_unlock(&iommu->lock);
> + return -EINVAL;
> + }
> +
> group = kzalloc(sizeof(*group), GFP_KERNEL);
> domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> if (!group || !domain) {
> @@ -2029,6 +2047,32 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> if (ret)
> goto out_domain;
>
> + /* Nesting cap info is available only after attaching */
> + if (iommu->nesting) {
> + struct iommu_nesting_info tmp = { .size = 0, };
> +
> + /* First get the size of vendor specific nesting info */
> + ret = iommu_domain_get_attr(domain->domain,
> + DOMAIN_ATTR_NESTING,
> + &tmp);
> + if (ret)
> + goto out_detach;
> +
> + iommu->nesting_info = kzalloc(tmp.size, GFP_KERNEL);
> + if (!iommu->nesting_info) {
> + ret = -ENOMEM;
> + goto out_detach;
> + }
> +
> + /* Now get the nesting info */
> + iommu->nesting_info->size = tmp.size;
> + ret = iommu_domain_get_attr(domain->domain,
> + DOMAIN_ATTR_NESTING,
> + iommu->nesting_info);
> + if (ret)
> + goto out_detach;
> + }
> +
> /* Get aperture info */
> iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);
>
> @@ -2138,6 +2182,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> return 0;
>
> out_detach:
> + vfio_iommu_release_nesting_info(iommu);
> vfio_iommu_detach_group(domain, group);
> out_domain:
> iommu_domain_free(domain->domain);
> @@ -2338,6 +2383,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> vfio_iommu_unmap_unpin_all(iommu);
> else
> vfio_iommu_unmap_unpin_reaccount(iommu);
> +
> + vfio_iommu_release_nesting_info(iommu);
> }
> iommu_domain_free(domain->domain);
> list_del(&domain->next);
> @@ -2546,6 +2593,31 @@ static int vfio_iommu_migration_build_caps(struct vfio_iommu *iommu,
> return vfio_info_add_capability(caps, &cap_mig.header, sizeof(cap_mig));
> }
>
> +static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> + struct vfio_info_cap *caps)
> +{
> + struct vfio_info_cap_header *header;
> + struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> + size_t size;
> +
> + size = offsetof(struct vfio_iommu_type1_info_cap_nesting, info) +
> + iommu->nesting_info->size;
> +
> + header = vfio_info_cap_add(caps, size,
> + VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> + if (IS_ERR(header))
> + return PTR_ERR(header);
> +
> + nesting_cap = container_of(header,
> + struct vfio_iommu_type1_info_cap_nesting,
> + header);
> +
> + memcpy(&nesting_cap->info, iommu->nesting_info,
> + iommu->nesting_info->size);
you must check whether nesting_info is non NULL before doing that.
Besides I agree with Jean on the fact it may be better to not report the
capability if nested is not supported.
> +
> + return 0;
> +}
> +
> static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
> unsigned long arg)
> {
> @@ -2581,6 +2653,12 @@ static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
> if (!ret)
> ret = vfio_iommu_iova_build_caps(iommu, &caps);
>
> + if (iommu->nesting_info) {
> + ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
> + if (ret)
> + return ret;
> + }
> +
> mutex_unlock(&iommu->lock);
>
> if (ret)
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9204705..46a78af 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -14,6 +14,7 @@
>
> #include <linux/types.h>
> #include <linux/ioctl.h>
> +#include <linux/iommu.h>
>
> #define VFIO_API_VERSION 0
>
> @@ -1039,6 +1040,24 @@ struct vfio_iommu_type1_info_cap_migration {
> __u64 max_dirty_bitmap_size; /* in bytes */
> };
>
> +/*
> + * The nesting capability allows to report the related capability
> + * and info for nesting iommu type.
> + *
> + * The structures below define version 1 of this capability.
> + *
> + * User space selected VFIO_TYPE1_NESTING_IOMMU type should check
> + * this capto get supported features.
s/capto/capability to get
> + *
> + * @info: the nesting info provided by IOMMU driver.
> + */
> +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING 3
> +
> +struct vfio_iommu_type1_info_cap_nesting {
> + struct vfio_info_cap_header header;
> + struct iommu_nesting_info info;
> +};
> +
> #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
>
> /**
>
Thanks
Eric
Yi,
On 7/12/20 1:21 PM, Liu Yi L wrote:
> From IOMMU p.o.v., PASIDs allocated and managed by external components
> (e.g. VFIO) will be passed in for gpasid_bind/unbind operation. IOMMU
> needs some knowledge to check the PASID ownership, hence add an interface
> for those components to tell the PASID owner.
>
> In latest kernel design, PASID ownership is managed by IOASID set where
> the PASID is allocated from. This patch adds support for setting ioasid
> set ID to the domains used for nesting/vSVA. Subsequent SVA operations
> on the PASID will be checked against its IOASID set for proper ownership.
Subsequent SVA operations will check the PASID against its IOASID set
for proper ownership.
>
> Cc: Kevin Tian <[email protected]>
> CC: Jacob Pan <[email protected]>
> Cc: Alex Williamson <[email protected]>
> Cc: Eric Auger <[email protected]>
> Cc: Jean-Philippe Brucker <[email protected]>
> Cc: Joerg Roedel <[email protected]>
> Cc: Lu Baolu <[email protected]>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> v4 -> v5:
> *) address comments from Eric Auger.
> ---
> drivers/iommu/intel/iommu.c | 22 ++++++++++++++++++++++
> include/linux/intel-iommu.h | 4 ++++
> include/linux/iommu.h | 1 +
> 3 files changed, 27 insertions(+)
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 72ae6a2..4d54198 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -1793,6 +1793,7 @@ static struct dmar_domain *alloc_domain(int flags)
> if (first_level_by_default())
> domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
> domain->has_iotlb_device = false;
> + domain->ioasid_sid = INVALID_IOASID_SET;
> INIT_LIST_HEAD(&domain->devices);
>
> return domain;
> @@ -6039,6 +6040,27 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain,
> }
> spin_unlock_irqrestore(&device_domain_lock, flags);
> break;
> + case DOMAIN_ATTR_IOASID_SID:
> + {
> + int sid = *(int *)data;
> +
> + if (!(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE)) {
> + ret = -ENODEV;
> + break;
> + }
> + spin_lock_irqsave(&device_domain_lock, flags);
I think the lock should be taken before the DOMAIN_FLAG_NESTING_MODE
check. Otherwise, the flags can theoretically be changed in between the
check and the test below?
Thanks
Eric
> + if (dmar_domain->ioasid_sid != INVALID_IOASID_SET &&
> + dmar_domain->ioasid_sid != sid) {
> + pr_warn_ratelimited("multi ioasid_set (%d:%d) setting",
> + dmar_domain->ioasid_sid, sid);
> + ret = -EBUSY;
> + spin_unlock_irqrestore(&device_domain_lock, flags);
> + break;
> + }
> + dmar_domain->ioasid_sid = sid;
> + spin_unlock_irqrestore(&device_domain_lock, flags);
> + break;
> + }
> default:
> ret = -EINVAL;
> break;
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 3f23c26..0d0ab32 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -549,6 +549,10 @@ struct dmar_domain {
> 2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
> u64 max_addr; /* maximum mapped address */
>
> + int ioasid_sid; /*
> + * the ioasid set which tracks all
> + * PASIDs used by the domain.
> + */
> int default_pasid; /*
> * The default pasid used for non-SVM
> * traffic on mediated devices.
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 7ca9d48..e84a1d5 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -124,6 +124,7 @@ enum iommu_attr {
> DOMAIN_ATTR_FSL_PAMUV1,
> DOMAIN_ATTR_NESTING, /* two stages of translation */
> DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
> + DOMAIN_ATTR_IOASID_SID,
> DOMAIN_ATTR_MAX,
> };
>
>
Yi,
On 7/12/20 1:21 PM, Liu Yi L wrote:
> This patch allows user space to request PASID allocation/free, e.g. when
> serving the request from the guest.
>
> PASIDs that are not freed by userspace are automatically freed when the
> IOASID set is destroyed when process exits.
>
> Cc: Kevin Tian <[email protected]>
> CC: Jacob Pan <[email protected]>
> Cc: Alex Williamson <[email protected]>
> Cc: Eric Auger <[email protected]>
> Cc: Jean-Philippe Brucker <[email protected]>
> Cc: Joerg Roedel <[email protected]>
> Cc: Lu Baolu <[email protected]>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Yi Sun <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> v4 -> v5:
> *) address comments from Eric Auger.
> *) the comments for the PASID_FREE request is addressed in patch 5/15 of
> this series.
>
> v3 -> v4:
> *) address comments from v3, except the below comment against the range
> of PASID_FREE request. needs more help on it.
> "> +if (req.range.min > req.range.max)
>
> Is it exploitable that a user can spin the kernel for a long time in
> the case of a free by calling this with [0, MAX_UINT] regardless of
> their actual allocations?"
> https://lore.kernel.org/linux-iommu/[email protected]/
>
> v1 -> v2:
> *) move the vfio_mm related code to be a seprate module
> *) use a single structure for alloc/free, could support a range of PASIDs
> *) fetch vfio_mm at group_attach time instead of at iommu driver open time
> ---
> drivers/vfio/Kconfig | 1 +
> drivers/vfio/vfio_iommu_type1.c | 85 +++++++++++++++++++++++++++++++++++++++++
> drivers/vfio/vfio_pasid.c | 10 +++++
> include/linux/vfio.h | 6 +++
> include/uapi/linux/vfio.h | 37 ++++++++++++++++++
> 5 files changed, 139 insertions(+)
>
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 3d8a108..95d90c6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -2,6 +2,7 @@
> config VFIO_IOMMU_TYPE1
> tristate
> depends on VFIO
> + select VFIO_PASID if (X86)
> default n
>
> config VFIO_IOMMU_SPAPR_TCE
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index ed80104..55b4065 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -76,6 +76,7 @@ struct vfio_iommu {
> bool dirty_page_tracking;
> bool pinned_page_dirty_scope;
> struct iommu_nesting_info *nesting_info;
> + struct vfio_mm *vmm;
> };
>
> struct vfio_domain {
> @@ -1937,6 +1938,11 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
>
> static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
> {
> + if (iommu->vmm) {
> + vfio_mm_put(iommu->vmm);
> + iommu->vmm = NULL;
> + }
> +
> kfree(iommu->nesting_info);
> iommu->nesting_info = NULL;
> }
> @@ -2071,6 +2077,26 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> iommu->nesting_info);
> if (ret)
> goto out_detach;
> +
> + if (iommu->nesting_info->features &
> + IOMMU_NESTING_FEAT_SYSWIDE_PASID) {
> + struct vfio_mm *vmm;
> + int sid;
> +
> + vmm = vfio_mm_get_from_task(current);
> + if (IS_ERR(vmm)) {
> + ret = PTR_ERR(vmm);
> + goto out_detach;
> + }
> + iommu->vmm = vmm;
> +
> + sid = vfio_mm_ioasid_sid(vmm);
> + ret = iommu_domain_set_attr(domain->domain,
> + DOMAIN_ATTR_IOASID_SID,
> + &sid);
> + if (ret)
> + goto out_detach;
> + }
> }
>
> /* Get aperture info */
> @@ -2855,6 +2881,63 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
> return -EINVAL;
> }
>
> +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> + unsigned int min,
> + unsigned int max)
> +{
> + int ret = -EOPNOTSUPP;
> +
> + mutex_lock(&iommu->lock);
> + if (iommu->vmm)
> + ret = vfio_pasid_alloc(iommu->vmm, min, max);
> + mutex_unlock(&iommu->lock);
> + return ret;
> +}
> +
> +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> + unsigned int min,
> + unsigned int max)
> +{
> + int ret = -EOPNOTSUPP;
> +
> + mutex_lock(&iommu->lock);
> + if (iommu->vmm) {
> + vfio_pasid_free_range(iommu->vmm, min, max);
> + ret = 0;
> + }
> + mutex_unlock(&iommu->lock);
> + return ret;
> +}
> +
> +static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu,
> + unsigned long arg)
> +{
> + struct vfio_iommu_type1_pasid_request req;
> + unsigned long minsz;
> +
> + minsz = offsetofend(struct vfio_iommu_type1_pasid_request, range);
> +
> + if (copy_from_user(&req, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (req.argsz < minsz || (req.flags & ~VFIO_PASID_REQUEST_MASK))
> + return -EINVAL;
> +
> + if (req.range.min > req.range.max)
> + return -EINVAL;
> +
> + switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> + case VFIO_IOMMU_FLAG_ALLOC_PASID:
not sure it is worth to introduce both vfio_iommu_type1_pasid_free/alloc
helpers. You could take the lock above, and test iommu->vmm as well above.
> + return vfio_iommu_type1_pasid_alloc(iommu,
> + req.range.min, req.range.max);
> + case VFIO_IOMMU_FLAG_FREE_PASID:
> + return vfio_iommu_type1_pasid_free(iommu,
> + req.range.min, req.range.max);
> + default:
> + return -EINVAL;
> + }
> +}
> +
> static long vfio_iommu_type1_ioctl(void *iommu_data,
> unsigned int cmd, unsigned long arg)
> {
> @@ -2871,6 +2954,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> return vfio_iommu_type1_unmap_dma(iommu, arg);
> case VFIO_IOMMU_DIRTY_PAGES:
> return vfio_iommu_type1_dirty_pages(iommu, arg);
> + case VFIO_IOMMU_PASID_REQUEST:
> + return vfio_iommu_type1_pasid_request(iommu, arg);
> default:
> return -ENOTTY;
> }
> diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c
> index 66e6054e..ebec244 100644
> --- a/drivers/vfio/vfio_pasid.c
> +++ b/drivers/vfio/vfio_pasid.c
> @@ -61,6 +61,7 @@ void vfio_mm_put(struct vfio_mm *vmm)
> {
> kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio_mm_lock);
> }
> +EXPORT_SYMBOL_GPL(vfio_mm_put);
>
> static void vfio_mm_get(struct vfio_mm *vmm)
> {
> @@ -114,6 +115,13 @@ struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> mmput(mm);
> return vmm;
> }
> +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> +
> +int vfio_mm_ioasid_sid(struct vfio_mm *vmm)
> +{
> + return vmm->ioasid_sid;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_ioasid_sid);
>
> /*
> * Find PASID within @min and @max
> @@ -197,6 +205,7 @@ int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max)
>
> return pasid;
> }
> +EXPORT_SYMBOL_GPL(vfio_pasid_alloc);
>
> void vfio_pasid_free_range(struct vfio_mm *vmm,
> ioasid_t min, ioasid_t max)
> @@ -213,6 +222,7 @@ void vfio_pasid_free_range(struct vfio_mm *vmm,
> vfio_remove_pasid(vmm, vid);
> mutex_unlock(&vmm->pasid_lock);
> }
> +EXPORT_SYMBOL_GPL(vfio_pasid_free_range);
>
> static int __init vfio_pasid_init(void)
> {
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 31472a9..a111108 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -101,6 +101,7 @@ struct vfio_mm;
> #if IS_ENABLED(CONFIG_VFIO_PASID)
> extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> extern void vfio_mm_put(struct vfio_mm *vmm);
> +int vfio_mm_ioasid_sid(struct vfio_mm *vmm);
I still don't get why it does not take an extern as the other functions
implemented in vfio_pasid.c and used in other modules. what's the
difference?
> extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> extern void vfio_pasid_free_range(struct vfio_mm *vmm,
> ioasid_t min, ioasid_t max);
> @@ -114,6 +115,11 @@ static inline void vfio_mm_put(struct vfio_mm *vmm)
> {
> }
>
> +static inline int vfio_mm_ioasid_sid(struct vfio_mm *vmm)
> +{
> + return -ENOTTY;
> +}
> +
> static inline int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> {
> return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 46a78af..96a115f 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1172,6 +1172,43 @@ struct vfio_iommu_type1_dirty_bitmap_get {
>
> #define VFIO_IOMMU_DIRTY_PAGES _IO(VFIO_TYPE, VFIO_BASE + 17)
>
> +/**
> + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 18,
> + * struct vfio_iommu_type1_pasid_request)
> + *
> + * PASID (Processor Address Space ID) is a PCIe concept for tagging
> + * address spaces in DMA requests. When system-wide PASID allocation
> + * is required by underlying iommu driver (e.g. Intel VT-d), this
by the underlying
> + * provides an interface for userspace to request pasid alloc/free
> + * for its assigned devices. Userspace should check the availability
> + * of this API by checking VFIO_IOMMU_TYPE1_INFO_CAP_NESTING through
> + * VFIO_IOMMU_GET_INFO.
> + *
> + * @flags=VFIO_IOMMU_FLAG_ALLOC_PASID, allocate a single PASID within @range.
> + * @flags=VFIO_IOMMU_FLAG_FREE_PASID, free the PASIDs within @range.
> + * @range is [min, max], which means both @min and @max are inclusive.
> + * ALLOC_PASID and FREE_PASID are mutually exclusive.
> + *
> + * returns: allocated PASID value on success, -errno on failure for
> + * ALLOC_PASID;
> + * 0 for FREE_PASID operation;
> + */
> +struct vfio_iommu_type1_pasid_request {
> + __u32 argsz;
> +#define VFIO_IOMMU_FLAG_ALLOC_PASID (1 << 0)
> +#define VFIO_IOMMU_FLAG_FREE_PASID (1 << 1)
> + __u32 flags;
> + struct {
> + __u32 min;
> + __u32 max;
> + } range;
> +};
> +
> +#define VFIO_PASID_REQUEST_MASK (VFIO_IOMMU_FLAG_ALLOC_PASID | \
> + VFIO_IOMMU_FLAG_FREE_PASID)
> +
> +#define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE + 18)
> +
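Just to illustrate the expected usage from userspace (a sketch under
assumptions: the function name is made up, "container_fd" is an already
set-up VFIO container of the nesting type, and the availability check via
VFIO_IOMMU_GET_INFO has been done; needs <sys/ioctl.h> and <linux/vfio.h>):

static int example_alloc_free_pasid(int container_fd)
{
	struct vfio_iommu_type1_pasid_request req = {
		.argsz	= sizeof(req),
		.flags	= VFIO_IOMMU_FLAG_ALLOC_PASID,
		.range	= { .min = 1, .max = 1023 },	/* inclusive range */
	};
	int pasid;

	/* on success the ioctl returns the allocated PASID value */
	pasid = ioctl(container_fd, VFIO_IOMMU_PASID_REQUEST, &req);
	if (pasid < 0)
		return -1;

	/* ... use the PASID, then hand it back ... */
	req.flags = VFIO_IOMMU_FLAG_FREE_PASID;
	req.range.min = req.range.max = pasid;
	return ioctl(container_fd, VFIO_IOMMU_PASID_REQUEST, &req);
}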
> /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>
> /*
>
Otherwise, besides the pieces I would have put directly in the
vfio_pasid module patch (EXPORT_SYMBOL_GPL and vfio_mm_ioasid_sid),
looks good.
Thanks
Eric
Hi Eric,
> From: Auger Eric <[email protected]>
> Sent: Saturday, July 18, 2020 12:29 AM
>
> Hi Yi,
>
> On 7/12/20 1:20 PM, Liu Yi L wrote:
> > IOMMUs that support nesting translation needs report the capability info
> s/needs/need to report
yep.
> > to userspace, e.g. the format of first level/stage paging structures.
> It gives information about requirements the userspace needs to implement
> plus other features characterizing the physical implementation.
got it. will add it in next version.
> >
> > This patch reports nesting info by DOMAIN_ATTR_NESTING. Caller can get
> > nesting info after setting DOMAIN_ATTR_NESTING.
> I guess you meant after selecting VFIO_TYPE1_NESTING_IOMMU?
yes, it is. ok, perhaps it's better to say the nesting info can be fetched
after selecting VFIO_TYPE1_NESTING_IOMMU.
> >
> > Cc: Kevin Tian <[email protected]>
> > CC: Jacob Pan <[email protected]>
> > Cc: Alex Williamson <[email protected]>
> > Cc: Eric Auger <[email protected]>
> > Cc: Jean-Philippe Brucker <[email protected]>
> > Cc: Joerg Roedel <[email protected]>
> > Cc: Lu Baolu <[email protected]>
> > Signed-off-by: Liu Yi L <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > v4 -> v5:
> > *) address comments from Eric Auger.
> >
> > v3 -> v4:
> > *) split the SMMU driver changes to be a separate patch
> > *) move the @addr_width and @pasid_bits from vendor specific
> > part to generic part.
> > *) tweak the description for the @features field of struct
> > iommu_nesting_info.
> > *) add description on the @data[] field of struct iommu_nesting_info
> >
> > v2 -> v3:
> > *) remvoe cap/ecap_mask in iommu_nesting_info.
> > *) reuse DOMAIN_ATTR_NESTING to get nesting info.
> > *) return an empty iommu_nesting_info for SMMU drivers per Jean'
> > suggestion.
> > ---
> > include/uapi/linux/iommu.h | 77
> ++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 77 insertions(+)
> >
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > index 1afc661..d2a47c4 100644
> > --- a/include/uapi/linux/iommu.h
> > +++ b/include/uapi/linux/iommu.h
> > @@ -332,4 +332,81 @@ struct iommu_gpasid_bind_data {
> > } vendor;
> > };
> >
> > +/*
> > + * struct iommu_nesting_info - Information for nesting-capable IOMMU.
> > + * user space should check it before using
> > + * nesting capability.
> > + *
> > + * @size: size of the whole structure
> > + * @format: PASID table entry format, the same definition as struct
> > + * iommu_gpasid_bind_data @format.
> > + * @features: supported nesting features.
> > + * @flags: currently reserved for future extension.
> > + * @addr_width: The output addr width of first level/stage translation
> > + * @pasid_bits: Maximum supported PASID bits, 0 represents no PASID
> > + * support.
> > + * @data: vendor specific cap info. data[] structure type can be deduced
> > + * from @format field.
> > + *
> > + *
> > + * +===============+======================================================+
> > + * | feature       | Notes                                                |
> > + * +===============+======================================================+
> > + * | SYSWIDE_PASID | PASIDs are managed in system-wide, instead of per |
> s/in system-wide/system-wide ?
got it.
> > + * | | device. When a device is assigned to userspace or |
> > + * | | VM, proper uAPI (userspace driver framework uAPI, |
> > + * | | e.g. VFIO) must be used to allocate/free PASIDs for |
> > + * | | the assigned device.
> Isn't it possible to be more explicit, something like:
> |
> System-wide PASID management is mandated by the physical IOMMU. All
> PASIDs allocation must be mediated through the TBD API.
yep, I can add it.
> > + * +---------------+------------------------------------------------------+
> > + * | BIND_PGTBL | The owner of the first level/stage page table must |
> > + * | | explicitly bind the page table to associated PASID |
> > + * | | (either the one specified in bind request or the |
> > + * | | default PASID of iommu domain), through userspace |
> > + * | | driver framework uAPI (e.g. VFIO_IOMMU_NESTING_OP). |
> As per your answer in https://lkml.org/lkml/2020/7/6/383, I now
> understand ARM would not expose that BIND_PGTBL nesting feature,
yes, that's my point.
> I still
> think the above wording is a bit confusing. Maybe you may explicitly
> talk about the PASID *entry* that needs to be passed from guest to host.
> On ARM we directly pass the PASID table but when reading the above
> description I fail to determine if this does not fit that description.
yes, I can do it.
> > + * +---------------+------------------------------------------------------+
> > + * | CACHE_INVLD | The owner of the first level/stage page table must |
> > + * | | explicitly invalidate the IOMMU cache through uAPI |
> > + * | | provided by userspace driver framework (e.g. VFIO) |
> > + * | | according to vendor-specific requirement when |
> > + * | | changing the page table. |
> > + * +---------------+------------------------------------------------------+
>
> instead of using the "uAPI provided by userspace driver framework (e.g.
> VFIO)", can't we use the so-called IOMMU UAPI terminology which now has
> a userspace documentation?
the problem is that the current IOMMU UAPI definitions are actually embedded
in other VFIO UAPIs. if it can make the description clearer, I can follow
your suggestion. :-)
>
> > + *
> > + * @data[] types defined for @format:
> > + *
> > + * +================================+=====================================+
> > + * | @format                        | @data[]                             |
> > + * +================================+=====================================+
> > + * | IOMMU_PASID_FORMAT_INTEL_VTD | struct iommu_nesting_info_vtd |
> > + * +--------------------------------+-------------------------------------+
> > + *
> > + */
> > +struct iommu_nesting_info {
> > + __u32 size;
> shouldn't it be @argsz to fit the iommu uapi convention and take benefit
> to put the flags field just below?
makes sense.
> > + __u32 format;
> > +#define IOMMU_NESTING_FEAT_SYSWIDE_PASID (1 << 0)
> > +#define IOMMU_NESTING_FEAT_BIND_PGTBL (1 << 1)
> > +#define IOMMU_NESTING_FEAT_CACHE_INVLD (1 << 2)
> > + __u32 features;
> > + __u32 flags;
> > + __u16 addr_width;
> > + __u16 pasid_bits;
> > + __u32 padding;
> > + __u8 data[];
> > +};
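Following up on the @argsz comment above, the reworked header of the struct
could then start roughly as below (purely a sketch; the final field order and
whether the padding can be dropped is still to be decided in the respin):

struct iommu_nesting_info {
	__u32	argsz;
	__u32	flags;
	__u32	format;
	__u32	features;
	__u16	addr_width;
	__u16	pasid_bits;
	__u8	data[];
};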
> > +
> > +/*
> > + * struct iommu_nesting_info_vtd - Intel VT-d specific nesting info
> > + *
> > + * @flags: VT-d specific flags. Currently reserved for future
> > + * extension.
> must be set to 0?
yes. will add it.
Thanks,
Yi Liu
> > + * @cap_reg: Describe basic capabilities as defined in VT-d capability
> > + * register.
> > + * @ecap_reg: Describe the extended capabilities as defined in VT-d
> > + * extended capability register.
> > + */
> > +struct iommu_nesting_info_vtd {
> > + __u32 flags;
> > + __u32 padding;
> > + __u64 cap_reg;
> > + __u64 ecap_reg;
> > +};
> > +
> > #endif /* _UAPI_IOMMU_H */
> Thanks
>
> Eric
> >
Hi Eric,
> From: Auger Eric <[email protected]>
> Sent: Saturday, July 18, 2020 1:34 AM
>
> Yi,
>
> On 7/12/20 1:20 PM, Liu Yi L wrote:
> > This patch exports iommu nesting capability info to user space through
> > VFIO. User space is expected to check this info for supported uAPIs (e.g.
> it is not only to check the supported uAPIS but rather to know which
> callbacks it must call upon vIOMMU events and which features are
> supported by the physical IOMMU.
yes, will refine the description per your comment.
> > PASID alloc/free, bind page table, and cache invalidation) and the vendor
> > specific format information for first level/stage page table that will be
> > bound to.
> >
> > The nesting info is available only after the nesting iommu type is set
> > for a container.
> to NESTED type
you mean "The nesting info is available only after container set to be NESTED type."
right?
> Current implementation imposes one limitation - one
> > nesting container should include at most one group. The philosophy of
> > vfio container is having all groups/devices within the container share
> > the same IOMMU context. When vSVA is enabled, one IOMMU context could
> > include one 2nd-level address space and multiple 1st-level address spaces.
> > While the 2nd-level address space is reasonably sharable by multiple groups
> > , blindly sharing 1st-level address spaces across all groups within the
> > container might instead break the guest expectation. In the future sub/
> > super container concept might be introduced to allow partial address space
> > sharing within an IOMMU context. But for now let's go with this restriction
> > by requiring singleton container for using nesting iommu features. Below
> > link has the related discussion about this decision.
>
> Maybe add a note about SMMU related changes spotted by Jean.
I guess you mean the comments Jean gave in patch 3/15, right? I'll
copy his comments and also give the below link as well.
https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/
> >
> > https://lkml.org/lkml/2020/5/15/1028
> >
> > Cc: Kevin Tian <[email protected]>
> > CC: Jacob Pan <[email protected]>
> > Cc: Alex Williamson <[email protected]>
> > Cc: Eric Auger <[email protected]>
> > Cc: Jean-Philippe Brucker <[email protected]>
> > Cc: Joerg Roedel <[email protected]>
> > Cc: Lu Baolu <[email protected]>
> > Signed-off-by: Liu Yi L <[email protected]>
> > ---
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > *) return struct iommu_nesting_info for
> VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
> > cap is much "cheap", if needs extension in future, just define another cap.
> > https://lore.kernel.org/kvm/[email protected]/
> >
> > v3 -> v4:
> > *) address comments against v3.
> >
> > v1 -> v2:
> > *) added in v2
> > ---
> > drivers/vfio/vfio_iommu_type1.c | 102
> +++++++++++++++++++++++++++++++++++-----
> > include/uapi/linux/vfio.h | 19 ++++++++
> > 2 files changed, 109 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 3bd70ff..ed80104 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit,
> > "Maximum number of user DMA mappings per container (65535).");
> >
> > struct vfio_iommu {
> > - struct list_head domain_list;
> > - struct list_head iova_list;
> > - struct vfio_domain *external_domain; /* domain for external user */
> > - struct mutex lock;
> > - struct rb_root dma_list;
> > - struct blocking_notifier_head notifier;
> > - unsigned int dma_avail;
> > - uint64_t pgsize_bitmap;
> > - bool v2;
> > - bool nesting;
> > - bool dirty_page_tracking;
> > - bool pinned_page_dirty_scope;
> > + struct list_head domain_list;
> > + struct list_head iova_list;
> > + /* domain for external user */
> > + struct vfio_domain *external_domain;
> > + struct mutex lock;
> > + struct rb_root dma_list;
> > + struct blocking_notifier_head notifier;
> > + unsigned int dma_avail;
> > + uint64_t pgsize_bitmap;
> > + bool v2;
> > + bool nesting;
> > + bool dirty_page_tracking;
> > + bool pinned_page_dirty_scope;
> > + struct iommu_nesting_info *nesting_info;
> > };
> >
> > struct vfio_domain {
> > @@ -130,6 +132,9 @@ struct vfio_regions {
> > #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) \
> > (!list_empty(&iommu->domain_list))
> >
> > +#define CONTAINER_HAS_DOMAIN(iommu) (((iommu)->external_domain) || \
> > + (!list_empty(&(iommu)->domain_list)))
> > +
> > #define DIRTY_BITMAP_BYTES(n) (ALIGN(n, BITS_PER_TYPE(u64)) /
> BITS_PER_BYTE)
> >
> > /*
> > @@ -1929,6 +1934,13 @@ static void vfio_iommu_iova_insert_copy(struct
> vfio_iommu *iommu,
> >
> > list_splice_tail(iova_copy, iova);
> > }
> > +
> > +static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
> > +{
> > + kfree(iommu->nesting_info);
> > + iommu->nesting_info = NULL;
> > +}
> > +
> > static int vfio_iommu_type1_attach_group(void *iommu_data,
> > struct iommu_group *iommu_group)
> > {
> > @@ -1959,6 +1971,12 @@ static int vfio_iommu_type1_attach_group(void
> *iommu_data,
> > }
> > }
> >
> > + /* Nesting type container can include only one group */
> > + if (iommu->nesting && CONTAINER_HAS_DOMAIN(iommu)) {
> > + mutex_unlock(&iommu->lock);
> > + return -EINVAL;
> > + }
> > +
> > group = kzalloc(sizeof(*group), GFP_KERNEL);
> > domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> > if (!group || !domain) {
> > @@ -2029,6 +2047,32 @@ static int vfio_iommu_type1_attach_group(void
> *iommu_data,
> > if (ret)
> > goto out_domain;
> >
> > + /* Nesting cap info is available only after attaching */
> > + if (iommu->nesting) {
> > + struct iommu_nesting_info tmp = { .size = 0, };
> > +
> > + /* First get the size of vendor specific nesting info */
> > + ret = iommu_domain_get_attr(domain->domain,
> > + DOMAIN_ATTR_NESTING,
> > + &tmp);
> > + if (ret)
> > + goto out_detach;
> > +
> > + iommu->nesting_info = kzalloc(tmp.size, GFP_KERNEL);
> > + if (!iommu->nesting_info) {
> > + ret = -ENOMEM;
> > + goto out_detach;
> > + }
> > +
> > + /* Now get the nesting info */
> > + iommu->nesting_info->size = tmp.size;
> > + ret = iommu_domain_get_attr(domain->domain,
> > + DOMAIN_ATTR_NESTING,
> > + iommu->nesting_info);
> > + if (ret)
> > + goto out_detach;
get nesting_info failure will result in group_attach failure.[1]
> > + }
> > +
> > /* Get aperture info */
> > iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY,
> &geo);
> >
> > @@ -2138,6 +2182,7 @@ static int vfio_iommu_type1_attach_group(void
> *iommu_data,
> > return 0;
> >
> > out_detach:
> > + vfio_iommu_release_nesting_info(iommu);
> > vfio_iommu_detach_group(domain, group);
> > out_domain:
> > iommu_domain_free(domain->domain);
> > @@ -2338,6 +2383,8 @@ static void vfio_iommu_type1_detach_group(void
> *iommu_data,
> > vfio_iommu_unmap_unpin_all(iommu);
> > else
> >
> vfio_iommu_unmap_unpin_reaccount(iommu);
> > +
> > + vfio_iommu_release_nesting_info(iommu);
> > }
> > iommu_domain_free(domain->domain);
> > list_del(&domain->next);
> > @@ -2546,6 +2593,31 @@ static int vfio_iommu_migration_build_caps(struct
> vfio_iommu *iommu,
> > return vfio_info_add_capability(caps, &cap_mig.header, sizeof(cap_mig));
> > }
> >
> > +static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > + struct vfio_info_cap *caps)
> > +{
> > + struct vfio_info_cap_header *header;
> > + struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> > + size_t size;
> > +
> > + size = offsetof(struct vfio_iommu_type1_info_cap_nesting, info) +
> > + iommu->nesting_info->size;
> > +
> > + header = vfio_info_cap_add(caps, size,
> > + VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> > + if (IS_ERR(header))
> > + return PTR_ERR(header);
> > +
> > + nesting_cap = container_of(header,
> > + struct vfio_iommu_type1_info_cap_nesting,
> > + header);
> > +
> > + memcpy(&nesting_cap->info, iommu->nesting_info,
> > + iommu->nesting_info->size);
> you must check whether nesting_info is non NULL before doing that.
it's now validated in the caller of this function. :-)
> Besides I agree with Jean on the fact it may be better to not report the
> capability if nested is not supported.
I see. in this patch, just few lines below [2], vfio only reports nesting
cap when iommu->nesting_info is non-null. so even if userspace selected
nesting type, it may fail at the VFIO_SET_IOMMU phase since group_attach
will be failed for NESTED type container if host iommu doesn’t support
nesting. the code is marked as [1] in this email.
> > +
> > + return 0;
> > +}
> > +
> > static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
> > unsigned long arg)
> > {
> > @@ -2581,6 +2653,12 @@ static int vfio_iommu_type1_get_info(struct
> vfio_iommu *iommu,
> > if (!ret)
> > ret = vfio_iommu_iova_build_caps(iommu, &caps);
> >
> > + if (iommu->nesting_info) {
> > + ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
> > + if (ret)
> > + return ret;
here checks nesting_info before reporting nesting cap. [2]
> > + }
> > +
> > mutex_unlock(&iommu->lock);
> >
> > if (ret)
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 9204705..46a78af 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -14,6 +14,7 @@
> >
> > #include <linux/types.h>
> > #include <linux/ioctl.h>
> > +#include <linux/iommu.h>
> >
> > #define VFIO_API_VERSION 0
> >
> > @@ -1039,6 +1040,24 @@ struct vfio_iommu_type1_info_cap_migration {
> > __u64 max_dirty_bitmap_size; /* in bytes */
> > };
> >
> > +/*
> > + * The nesting capability allows to report the related capability
> > + * and info for nesting iommu type.
> > + *
> > + * The structures below define version 1 of this capability.
> > + *
> > + * User space selected VFIO_TYPE1_NESTING_IOMMU type should check
> > + * this capto get supported features.
> s/capto/capability to get
got it.
Regards,
Yi Liu
> > + *
> > + * @info: the nesting info provided by IOMMU driver.
> > + */
> > +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING 3
> > +
> > +struct vfio_iommu_type1_info_cap_nesting {
> > + struct vfio_info_cap_header header;
> > + struct iommu_nesting_info info;
> > +};
> > +
> > #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
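For context, a rough sketch of how userspace is expected to find this
capability (the helper name and surrounding setup are made up here; the real
code just walks the standard vfio_info_cap_header chain returned by
VFIO_IOMMU_GET_INFO):

static struct vfio_iommu_type1_info_cap_nesting *
find_nesting_cap(struct vfio_iommu_type1_info *info)
{
	__u32 off;

	if (!(info->flags & VFIO_IOMMU_INFO_CAPS))
		return NULL;

	/* cap offsets are relative to the start of the info buffer */
	for (off = info->cap_offset; off; ) {
		struct vfio_info_cap_header *hdr =
			(void *)((__u8 *)info + off);

		if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_NESTING)
			return (struct vfio_iommu_type1_info_cap_nesting *)hdr;
		off = hdr->next;
	}
	return NULL;
}

Userspace would then key its behaviour off the returned info.features, e.g.
only issuing VFIO_IOMMU_PASID_REQUEST when IOMMU_NESTING_FEAT_SYSWIDE_PASID
is reported.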
> >
> > /**
> >
>
> Thanks
>
> Eric
Hi Eric,
> From: Auger Eric <[email protected]>
> Sent: Sunday, July 19, 2020 11:38 PM
>
> Yi,
>
> On 7/12/20 1:21 PM, Liu Yi L wrote:
> > From IOMMU p.o.v., PASIDs allocated and managed by external components
> > (e.g. VFIO) will be passed in for gpasid_bind/unbind operation. IOMMU
> > needs some knowledge to check the PASID ownership, hence add an
> > interface for those components to tell the PASID owner.
> >
> > In latest kernel design, PASID ownership is managed by IOASID set
> > where the PASID is allocated from. This patch adds support for setting
> > ioasid set ID to the domains used for nesting/vSVA. Subsequent SVA
> > operations on the PASID will be checked against its IOASID set for proper
> ownership.
> Subsequent SVA operations will check the PASID against its IOASID set for proper
> ownership.
got it.
> >
> > Cc: Kevin Tian <[email protected]>
> > CC: Jacob Pan <[email protected]>
> > Cc: Alex Williamson <[email protected]>
> > Cc: Eric Auger <[email protected]>
> > Cc: Jean-Philippe Brucker <[email protected]>
> > Cc: Joerg Roedel <[email protected]>
> > Cc: Lu Baolu <[email protected]>
> > Signed-off-by: Liu Yi L <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > ---
> > drivers/iommu/intel/iommu.c | 22 ++++++++++++++++++++++
> > include/linux/intel-iommu.h | 4 ++++
> > include/linux/iommu.h | 1 +
> > 3 files changed, 27 insertions(+)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 72ae6a2..4d54198 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -1793,6 +1793,7 @@ static struct dmar_domain *alloc_domain(int flags)
> > if (first_level_by_default())
> > domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
> > domain->has_iotlb_device = false;
> > + domain->ioasid_sid = INVALID_IOASID_SET;
> > INIT_LIST_HEAD(&domain->devices);
> >
> > return domain;
> > @@ -6039,6 +6040,27 @@ intel_iommu_domain_set_attr(struct iommu_domain
> *domain,
> > }
> > spin_unlock_irqrestore(&device_domain_lock, flags);
> > break;
> > + case DOMAIN_ATTR_IOASID_SID:
> > + {
> > + int sid = *(int *)data;
> > +
> > + if (!(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE)) {
> > + ret = -ENODEV;
> > + break;
> > + }
> > + spin_lock_irqsave(&device_domain_lock, flags);
> I think the lock should be taken before the DOMAIN_FLAG_NESTING_MODE check.
> Otherwise, the flags can be theretically changed inbetween the check and the test
> below?
I see. will correct it.
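i.e. something along the lines of the below (illustration only, to be
confirmed in the next revision):

	case DOMAIN_ATTR_IOASID_SID:
	{
		int sid = *(int *)data;

		spin_lock_irqsave(&device_domain_lock, flags);
		if (!(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE)) {
			ret = -ENODEV;
			spin_unlock_irqrestore(&device_domain_lock, flags);
			break;
		}
		if (dmar_domain->ioasid_sid != INVALID_IOASID_SET &&
		    dmar_domain->ioasid_sid != sid) {
			pr_warn_ratelimited("multi ioasid_set (%d:%d) setting",
					    dmar_domain->ioasid_sid, sid);
			ret = -EBUSY;
			spin_unlock_irqrestore(&device_domain_lock, flags);
			break;
		}
		dmar_domain->ioasid_sid = sid;
		spin_unlock_irqrestore(&device_domain_lock, flags);
		break;
	}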
Thanks,
Yi Liu
> Thanks
>
> Eric
> > + if (dmar_domain->ioasid_sid != INVALID_IOASID_SET &&
> > + dmar_domain->ioasid_sid != sid) {
> > + pr_warn_ratelimited("multi ioasid_set (%d:%d) setting",
> > + dmar_domain->ioasid_sid, sid);
> > + ret = -EBUSY;
> > + spin_unlock_irqrestore(&device_domain_lock, flags);
> > + break;
> > + }
> > + dmar_domain->ioasid_sid = sid;
> > + spin_unlock_irqrestore(&device_domain_lock, flags);
> > + break;
> > + }
> > default:
> > ret = -EINVAL;
> > break;
> > diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> > index 3f23c26..0d0ab32 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -549,6 +549,10 @@ struct dmar_domain {
> > 2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
> > u64 max_addr; /* maximum mapped address */
> >
> > + int ioasid_sid; /*
> > + * the ioasid set which tracks all
> > + * PASIDs used by the domain.
> > + */
> > int default_pasid; /*
> > * The default pasid used for non-SVM
> > * traffic on mediated devices.
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h index
> > 7ca9d48..e84a1d5 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -124,6 +124,7 @@ enum iommu_attr {
> > DOMAIN_ATTR_FSL_PAMUV1,
> > DOMAIN_ATTR_NESTING, /* two stages of translation */
> > DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
> > + DOMAIN_ATTR_IOASID_SID,
> > DOMAIN_ATTR_MAX,
> > };
> >
> >
Yi,
On 7/20/20 9:51 AM, Liu, Yi L wrote:
> Hi Eric,
>
>> From: Auger Eric <[email protected]>
>> Sent: Saturday, July 18, 2020 1:34 AM
>>
>> Yi,
>>
>> On 7/12/20 1:20 PM, Liu Yi L wrote:
>>> This patch exports iommu nesting capability info to user space through
>>> VFIO. User space is expected to check this info for supported uAPIs (e.g.
>> it is not only to check the supported uAPIS but rather to know which
>> callbacks it must call upon vIOMMU events and which features are
>> supported by the physical IOMMU.
>
> yes, will refine the description per your comment.
>
>>> PASID alloc/free, bind page table, and cache invalidation) and the vendor
>>> specific format information for first level/stage page table that will be
>>> bound to.
>>>
>>> The nesting info is available only after the nesting iommu type is set
>>> for a container.
>> to NESTED type
>
> you mean "The nesting info is available only after container set to be NESTED type."
>
> right?
correct
>
>> Current implementation imposes one limitation - one
>>> nesting container should include at most one group. The philosophy of
>>> vfio container is having all groups/devices within the container share
>>> the same IOMMU context. When vSVA is enabled, one IOMMU context could
>>> include one 2nd-level address space and multiple 1st-level address spaces.
>>> While the 2nd-level address space is reasonably sharable by multiple groups
>>> , blindly sharing 1st-level address spaces across all groups within the
>>> container might instead break the guest expectation. In the future sub/
>>> super container concept might be introduced to allow partial address space
>>> sharing within an IOMMU context. But for now let's go with this restriction
>>> by requiring singleton container for using nesting iommu features. Below
>>> link has the related discussion about this decision.
>>
>> Maybe add a note about SMMU related changes spotted by Jean.
>
> I guess you mean the comments Jean gave in patch 3/15, right? I'll
> copy his comments and also give the below link as well.
>
> https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/
correct
Thanks
Eric
>
>>>
>>> https://lkml.org/lkml/2020/5/15/1028
>>>
>>> Cc: Kevin Tian <[email protected]>
>>> CC: Jacob Pan <[email protected]>
>>> Cc: Alex Williamson <[email protected]>
>>> Cc: Eric Auger <[email protected]>
>>> Cc: Jean-Philippe Brucker <[email protected]>
>>> Cc: Joerg Roedel <[email protected]>
>>> Cc: Lu Baolu <[email protected]>
>>> Signed-off-by: Liu Yi L <[email protected]>
>>> ---
>>> v4 -> v5:
>>> *) address comments from Eric Auger.
>>> *) return struct iommu_nesting_info for
>> VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
>>> cap is much "cheap", if needs extension in future, just define another cap.
>>> https://lore.kernel.org/kvm/[email protected]/
>>>
>>> v3 -> v4:
>>> *) address comments against v3.
>>>
>>> v1 -> v2:
>>> *) added in v2
>>> ---
>>> drivers/vfio/vfio_iommu_type1.c | 102
>> +++++++++++++++++++++++++++++++++++-----
>>> include/uapi/linux/vfio.h | 19 ++++++++
>>> 2 files changed, 109 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index 3bd70ff..ed80104 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit,
>>> "Maximum number of user DMA mappings per container (65535).");
>>>
>>> struct vfio_iommu {
>>> - struct list_head domain_list;
>>> - struct list_head iova_list;
>>> - struct vfio_domain *external_domain; /* domain for external user */
>>> - struct mutex lock;
>>> - struct rb_root dma_list;
>>> - struct blocking_notifier_head notifier;
>>> - unsigned int dma_avail;
>>> - uint64_t pgsize_bitmap;
>>> - bool v2;
>>> - bool nesting;
>>> - bool dirty_page_tracking;
>>> - bool pinned_page_dirty_scope;
>>> + struct list_head domain_list;
>>> + struct list_head iova_list;
>>> + /* domain for external user */
>>> + struct vfio_domain *external_domain;
>>> + struct mutex lock;
>>> + struct rb_root dma_list;
>>> + struct blocking_notifier_head notifier;
>>> + unsigned int dma_avail;
>>> + uint64_t pgsize_bitmap;
>>> + bool v2;
>>> + bool nesting;
>>> + bool dirty_page_tracking;
>>> + bool pinned_page_dirty_scope;
>>> + struct iommu_nesting_info *nesting_info;
>>> };
>>>
>>> struct vfio_domain {
>>> @@ -130,6 +132,9 @@ struct vfio_regions {
>>> #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) \
>>> (!list_empty(&iommu->domain_list))
>>>
>>> +#define CONTAINER_HAS_DOMAIN(iommu) (((iommu)->external_domain) || \
>>> + (!list_empty(&(iommu)->domain_list)))
>>> +
>>> #define DIRTY_BITMAP_BYTES(n) (ALIGN(n, BITS_PER_TYPE(u64)) /
>> BITS_PER_BYTE)
>>>
>>> /*
>>> @@ -1929,6 +1934,13 @@ static void vfio_iommu_iova_insert_copy(struct
>> vfio_iommu *iommu,
>>>
>>> list_splice_tail(iova_copy, iova);
>>> }
>>> +
>>> +static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
>>> +{
>>> + kfree(iommu->nesting_info);
>>> + iommu->nesting_info = NULL;
>>> +}
>>> +
>>> static int vfio_iommu_type1_attach_group(void *iommu_data,
>>> struct iommu_group *iommu_group)
>>> {
>>> @@ -1959,6 +1971,12 @@ static int vfio_iommu_type1_attach_group(void
>> *iommu_data,
>>> }
>>> }
>>>
>>> + /* Nesting type container can include only one group */
>>> + if (iommu->nesting && CONTAINER_HAS_DOMAIN(iommu)) {
>>> + mutex_unlock(&iommu->lock);
>>> + return -EINVAL;
>>> + }
>>> +
>>> group = kzalloc(sizeof(*group), GFP_KERNEL);
>>> domain = kzalloc(sizeof(*domain), GFP_KERNEL);
>>> if (!group || !domain) {
>>> @@ -2029,6 +2047,32 @@ static int vfio_iommu_type1_attach_group(void
>> *iommu_data,
>>> if (ret)
>>> goto out_domain;
>>>
>>> + /* Nesting cap info is available only after attaching */
>>> + if (iommu->nesting) {
>>> + struct iommu_nesting_info tmp = { .size = 0, };
>>> +
>>> + /* First get the size of vendor specific nesting info */
>>> + ret = iommu_domain_get_attr(domain->domain,
>>> + DOMAIN_ATTR_NESTING,
>>> + &tmp);
>>> + if (ret)
>>> + goto out_detach;
>>> +
>>> + iommu->nesting_info = kzalloc(tmp.size, GFP_KERNEL);
>>> + if (!iommu->nesting_info) {
>>> + ret = -ENOMEM;
>>> + goto out_detach;
>>> + }
>>> +
>>> + /* Now get the nesting info */
>>> + iommu->nesting_info->size = tmp.size;
>>> + ret = iommu_domain_get_attr(domain->domain,
>>> + DOMAIN_ATTR_NESTING,
>>> + iommu->nesting_info);
>>> + if (ret)
>>> + goto out_detach;
>
> get nesting_info failure will result in group_attach failure.[1]
>
>>> + }
>>> +
>>> /* Get aperture info */
>>> iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY,
>> &geo);
>>>
>>> @@ -2138,6 +2182,7 @@ static int vfio_iommu_type1_attach_group(void
>> *iommu_data,
>>> return 0;
>>>
>>> out_detach:
>>> + vfio_iommu_release_nesting_info(iommu);
>>> vfio_iommu_detach_group(domain, group);
>>> out_domain:
>>> iommu_domain_free(domain->domain);
>>> @@ -2338,6 +2383,8 @@ static void vfio_iommu_type1_detach_group(void
>> *iommu_data,
>>> vfio_iommu_unmap_unpin_all(iommu);
>>> else
>>>
>> vfio_iommu_unmap_unpin_reaccount(iommu);
>>> +
>>> + vfio_iommu_release_nesting_info(iommu);
>>> }
>>> iommu_domain_free(domain->domain);
>>> list_del(&domain->next);
>>> @@ -2546,6 +2593,31 @@ static int vfio_iommu_migration_build_caps(struct
>> vfio_iommu *iommu,
>>> return vfio_info_add_capability(caps, &cap_mig.header, sizeof(cap_mig));
>>> }
>>>
>>> +static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>>> + struct vfio_info_cap *caps)
>>> +{
>>> + struct vfio_info_cap_header *header;
>>> + struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
>>> + size_t size;
>>> +
>>> + size = offsetof(struct vfio_iommu_type1_info_cap_nesting, info) +
>>> + iommu->nesting_info->size;
>>> +
>>> + header = vfio_info_cap_add(caps, size,
>>> + VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
>>> + if (IS_ERR(header))
>>> + return PTR_ERR(header);
>>> +
>>> + nesting_cap = container_of(header,
>>> + struct vfio_iommu_type1_info_cap_nesting,
>>> + header);
>>> +
>>> + memcpy(&nesting_cap->info, iommu->nesting_info,
>>> + iommu->nesting_info->size);
>> you must check whether nesting_info is non NULL before doing that.
>
> it's now validated in the caller of this function. :-)
ah ok sorry I missed that.
>
>> Besides I agree with Jean on the fact it may be better to not report the
>> capability if nested is not supported.
>
> I see. in this patch, just few lines below [2], vfio only reports nesting
> cap when iommu->nesting_info is non-null. so even if userspace selected
> nesting type, it may fail at the VFIO_SET_IOMMU phase since group_attach
> will be failed for NESTED type container if host iommu doesn’t support
> nesting. the code is marked as [1] in this email.
OK that's already there then.
Thanks
Eric
>
>>> +
>>> + return 0;
>>> +}
>>> +
>>> static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
>>> unsigned long arg)
>>> {
>>> @@ -2581,6 +2653,12 @@ static int vfio_iommu_type1_get_info(struct
>> vfio_iommu *iommu,
>>> if (!ret)
>>> ret = vfio_iommu_iova_build_caps(iommu, &caps);
>>>
>>> + if (iommu->nesting_info) {
>>> + ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
>>> + if (ret)
>>> + return ret;
>
> here checks nesting_info before reporting nesting cap. [2]
>
>>> + }
>>> +
>>> mutex_unlock(&iommu->lock);
>>>
>>> if (ret)
>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>> index 9204705..46a78af 100644
>>> --- a/include/uapi/linux/vfio.h
>>> +++ b/include/uapi/linux/vfio.h
>>> @@ -14,6 +14,7 @@
>>>
>>> #include <linux/types.h>
>>> #include <linux/ioctl.h>
>>> +#include <linux/iommu.h>
>>>
>>> #define VFIO_API_VERSION 0
>>>
>>> @@ -1039,6 +1040,24 @@ struct vfio_iommu_type1_info_cap_migration {
>>> __u64 max_dirty_bitmap_size; /* in bytes */
>>> };
>>>
>>> +/*
>>> + * The nesting capability allows to report the related capability
>>> + * and info for nesting iommu type.
>>> + *
>>> + * The structures below define version 1 of this capability.
>>> + *
>>> + * User space selected VFIO_TYPE1_NESTING_IOMMU type should check
>>> + * this capto get supported features.
>> s/capto/capability to get
>
> got it.
>
> Regards,
> Yi Liu
>
>>> + *
>>> + * @info: the nesting info provided by IOMMU driver.
>>> + */
>>> +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING 3
>>> +
>>> +struct vfio_iommu_type1_info_cap_nesting {
>>> + struct vfio_info_cap_header header;
>>> + struct iommu_nesting_info info;
>>> +};
>>> +
>>> #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
>>>
>>> /**
>>>
>>
>> Thanks
>>
>> Eric
>
Hi Eric,
> From: Auger Eric <[email protected]>
> Sent: Monday, July 20, 2020 4:33 PM
>
> Yi,
>
> On 7/20/20 9:51 AM, Liu, Yi L wrote:
> > Hi Eric,
> >
> >> From: Auger Eric <[email protected]>
> >> Sent: Saturday, July 18, 2020 1:34 AM
> >>
> >> Yi,
> >>
> >> On 7/12/20 1:20 PM, Liu Yi L wrote:
> >>> This patch exports iommu nesting capability info to user space
> >>> through VFIO. User space is expected to check this info for supported uAPIs (e.g.
> >> it is not only to check the supported uAPIS but rather to know which
> >> callbacks it must call upon vIOMMU events and which features are
> >> supported by the physical IOMMU.
> >
> > yes, will refine the description per your comment.
> >
> >>> PASID alloc/free, bind page table, and cache invalidation) and the
> >>> vendor specific format information for first level/stage page table
> >>> that will be bound to.
> >>>
> >>> The nesting info is available only after the nesting iommu type is
> >>> set for a container.
> >> to NESTED type
> >
> > you mean "The nesting info is available only after container set to be NESTED
> type."
> >
> > right?
> correct
got you.
> >
> >> Current implementation imposes one limitation - one
> >>> nesting container should include at most one group. The philosophy
> >>> of vfio container is having all groups/devices within the container
> >>> share the same IOMMU context. When vSVA is enabled, one IOMMU
> >>> context could include one 2nd-level address space and multiple 1st-level address
> spaces.
> >>> While the 2nd-level address space is reasonably sharable by multiple
> >>> groups , blindly sharing 1st-level address spaces across all groups
> >>> within the container might instead break the guest expectation. In
> >>> the future sub/ super container concept might be introduced to allow
> >>> partial address space sharing within an IOMMU context. But for now
> >>> let's go with this restriction by requiring singleton container for
> >>> using nesting iommu features. Below link has the related discussion about this
> decision.
> >>
> >> Maybe add a note about SMMU related changes spotted by Jean.
> >
> > I guess you mean the comments Jean gave in patch 3/15, right? I'll
> > copy his comments and also give the below link as well.
> >
> > https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/
> correct
I see.
Regards,
Yi Liu
> Thanks
>
> Eric
> >
> >>>
> >>> https://lkml.org/lkml/2020/5/15/1028
> >>>
> >>> Cc: Kevin Tian <[email protected]>
> >>> CC: Jacob Pan <[email protected]>
> >>> Cc: Alex Williamson <[email protected]>
> >>> Cc: Eric Auger <[email protected]>
> >>> Cc: Jean-Philippe Brucker <[email protected]>
> >>> Cc: Joerg Roedel <[email protected]>
> >>> Cc: Lu Baolu <[email protected]>
> >>> Signed-off-by: Liu Yi L <[email protected]>
> >>> ---
> >>> v4 -> v5:
> >>> *) address comments from Eric Auger.
> >>> *) return struct iommu_nesting_info for
> >> VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
> >>> cap is much "cheap", if needs extension in future, just define another cap.
> >>> https://lore.kernel.org/kvm/[email protected]/
> >>>
> >>> v3 -> v4:
> >>> *) address comments against v3.
> >>>
> >>> v1 -> v2:
> >>> *) added in v2
> >>> ---
> >>> drivers/vfio/vfio_iommu_type1.c | 102
> >> +++++++++++++++++++++++++++++++++++-----
> >>> include/uapi/linux/vfio.h | 19 ++++++++
> >>> 2 files changed, 109 insertions(+), 12 deletions(-)
> >>>
> >>> diff --git a/drivers/vfio/vfio_iommu_type1.c
> >>> b/drivers/vfio/vfio_iommu_type1.c index 3bd70ff..ed80104 100644
> >>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>> @@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit,
> >>> "Maximum number of user DMA mappings per container (65535).");
> >>>
> >>> struct vfio_iommu {
> >>> - struct list_head domain_list;
> >>> - struct list_head iova_list;
> >>> - struct vfio_domain *external_domain; /* domain for external user */
> >>> - struct mutex lock;
> >>> - struct rb_root dma_list;
> >>> - struct blocking_notifier_head notifier;
> >>> - unsigned int dma_avail;
> >>> - uint64_t pgsize_bitmap;
> >>> - bool v2;
> >>> - bool nesting;
> >>> - bool dirty_page_tracking;
> >>> - bool pinned_page_dirty_scope;
> >>> + struct list_head domain_list;
> >>> + struct list_head iova_list;
> >>> + /* domain for external user */
> >>> + struct vfio_domain *external_domain;
> >>> + struct mutex lock;
> >>> + struct rb_root dma_list;
> >>> + struct blocking_notifier_head notifier;
> >>> + unsigned int dma_avail;
> >>> + uint64_t pgsize_bitmap;
> >>> + bool v2;
> >>> + bool nesting;
> >>> + bool dirty_page_tracking;
> >>> + bool pinned_page_dirty_scope;
> >>> + struct iommu_nesting_info *nesting_info;
> >>> };
> >>>
> >>> struct vfio_domain {
> >>> @@ -130,6 +132,9 @@ struct vfio_regions {
> >>> #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) \
> >>> (!list_empty(&iommu->domain_list))
> >>>
> >>> +#define CONTAINER_HAS_DOMAIN(iommu) (((iommu)->external_domain) || \
> >>> + (!list_empty(&(iommu)->domain_list)))
> >>> +
> >>> #define DIRTY_BITMAP_BYTES(n) (ALIGN(n, BITS_PER_TYPE(u64)) /
> >> BITS_PER_BYTE)
> >>>
> >>> /*
> >>> @@ -1929,6 +1934,13 @@ static void
> >>> vfio_iommu_iova_insert_copy(struct
> >> vfio_iommu *iommu,
> >>>
> >>> list_splice_tail(iova_copy, iova); }
> >>> +
> >>> +static void vfio_iommu_release_nesting_info(struct vfio_iommu
> >>> +*iommu) {
> >>> + kfree(iommu->nesting_info);
> >>> + iommu->nesting_info = NULL;
> >>> +}
> >>> +
> >>> static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>> struct iommu_group *iommu_group)
> { @@ -1959,6 +1971,12 @@
> >>> static int vfio_iommu_type1_attach_group(void
> >> *iommu_data,
> >>> }
> >>> }
> >>>
> >>> + /* Nesting type container can include only one group */
> >>> + if (iommu->nesting && CONTAINER_HAS_DOMAIN(iommu)) {
> >>> + mutex_unlock(&iommu->lock);
> >>> + return -EINVAL;
> >>> + }
> >>> +
> >>> group = kzalloc(sizeof(*group), GFP_KERNEL);
> >>> domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> >>> if (!group || !domain) {
> >>> @@ -2029,6 +2047,32 @@ static int vfio_iommu_type1_attach_group(void
> >> *iommu_data,
> >>> if (ret)
> >>> goto out_domain;
> >>>
> >>> + /* Nesting cap info is available only after attaching */
> >>> + if (iommu->nesting) {
> >>> + struct iommu_nesting_info tmp = { .size = 0, };
> >>> +
> >>> + /* First get the size of vendor specific nesting info */
> >>> + ret = iommu_domain_get_attr(domain->domain,
> >>> + DOMAIN_ATTR_NESTING,
> >>> + &tmp);
> >>> + if (ret)
> >>> + goto out_detach;
> >>> +
> >>> + iommu->nesting_info = kzalloc(tmp.size, GFP_KERNEL);
> >>> + if (!iommu->nesting_info) {
> >>> + ret = -ENOMEM;
> >>> + goto out_detach;
> >>> + }
> >>> +
> >>> + /* Now get the nesting info */
> >>> + iommu->nesting_info->size = tmp.size;
> >>> + ret = iommu_domain_get_attr(domain->domain,
> >>> + DOMAIN_ATTR_NESTING,
> >>> + iommu->nesting_info);
> >>> + if (ret)
> >>> + goto out_detach;
> >
> > get nesting_info failure will result in group_attach failure.[1]
> >
> >>> + }
> >>> +
> >>> /* Get aperture info */
> >>> iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY,
> >> &geo);
> >>>
> >>> @@ -2138,6 +2182,7 @@ static int vfio_iommu_type1_attach_group(void
> >> *iommu_data,
> >>> return 0;
> >>>
> >>> out_detach:
> >>> + vfio_iommu_release_nesting_info(iommu);
> >>> vfio_iommu_detach_group(domain, group);
> >>> out_domain:
> >>> iommu_domain_free(domain->domain);
> >>> @@ -2338,6 +2383,8 @@ static void vfio_iommu_type1_detach_group(void
> >> *iommu_data,
> >>> vfio_iommu_unmap_unpin_all(iommu);
> >>> else
> >>>
> >> vfio_iommu_unmap_unpin_reaccount(iommu);
> >>> +
> >>> + vfio_iommu_release_nesting_info(iommu);
> >>> }
> >>> iommu_domain_free(domain->domain);
> >>> list_del(&domain->next);
> >>> @@ -2546,6 +2593,31 @@ static int
> >>> vfio_iommu_migration_build_caps(struct
> >> vfio_iommu *iommu,
> >>> return vfio_info_add_capability(caps, &cap_mig.header,
> >>> sizeof(cap_mig)); }
> >>>
> >>> +static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> >>> + struct vfio_info_cap *caps) {
> >>> + struct vfio_info_cap_header *header;
> >>> + struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> >>> + size_t size;
> >>> +
> >>> + size = offsetof(struct vfio_iommu_type1_info_cap_nesting, info) +
> >>> + iommu->nesting_info->size;
> >>> +
> >>> + header = vfio_info_cap_add(caps, size,
> >>> + VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> >>> + if (IS_ERR(header))
> >>> + return PTR_ERR(header);
> >>> +
> >>> + nesting_cap = container_of(header,
> >>> + struct vfio_iommu_type1_info_cap_nesting,
> >>> + header);
> >>> +
> >>> + memcpy(&nesting_cap->info, iommu->nesting_info,
> >>> + iommu->nesting_info->size);
> >> you must check whether nesting_info is non NULL before doing that.
> >
> > it's now validated in the caller of this function. :-)
> ah ok sorry I missed that.
> >
> >> Besides I agree with Jean on the fact it may be better to not report
> >> the capability if nested is not supported.
> >
> > I see. in this patch, just few lines below [2], vfio only reports
> > nesting cap when iommu->nesting_info is non-null. so even if userspace
> > selected nesting type, it may fail at the VFIO_SET_IOMMU phase since
> > group_attach will be failed for NESTED type container if host iommu
> > doesn’t support nesting. the code is marked as [1] in this email.
> OK that's already there then.
>
> Thanks
>
> Eric
> >
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
> >>> unsigned long arg)
> >>> {
> >>> @@ -2581,6 +2653,12 @@ static int vfio_iommu_type1_get_info(struct
> >> vfio_iommu *iommu,
> >>> if (!ret)
> >>> ret = vfio_iommu_iova_build_caps(iommu, &caps);
> >>>
> >>> + if (iommu->nesting_info) {
> >>> + ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
> >>> + if (ret)
> >>> + return ret;
> >
> > here checks nesting_info before reporting nesting cap. [2]
> >
> >>> + }
> >>> +
> >>> mutex_unlock(&iommu->lock);
> >>>
> >>> if (ret)
> >>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>> index 9204705..46a78af 100644
> >>> --- a/include/uapi/linux/vfio.h
> >>> +++ b/include/uapi/linux/vfio.h
> >>> @@ -14,6 +14,7 @@
> >>>
> >>> #include <linux/types.h>
> >>> #include <linux/ioctl.h>
> >>> +#include <linux/iommu.h>
> >>>
> >>> #define VFIO_API_VERSION 0
> >>>
> >>> @@ -1039,6 +1040,24 @@ struct vfio_iommu_type1_info_cap_migration {
> >>> __u64 max_dirty_bitmap_size; /* in bytes */
> >>> };
> >>>
> >>> +/*
> >>> + * The nesting capability allows to report the related capability
> >>> + * and info for nesting iommu type.
> >>> + *
> >>> + * The structures below define version 1 of this capability.
> >>> + *
> >>> + * User space selected VFIO_TYPE1_NESTING_IOMMU type should check
> >>> + * this capto get supported features.
> >> s/capto/capability to get
> >
> > got it.
> >
> > Regards,
> > Yi Liu
> >
> >>> + *
> >>> + * @info: the nesting info provided by IOMMU driver.
> >>> + */
> >>> +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING 3
> >>> +
> >>> +struct vfio_iommu_type1_info_cap_nesting {
> >>> + struct vfio_info_cap_header header;
> >>> + struct iommu_nesting_info info;
> >>> +};
> >>> +
> >>> #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> >>>
> >>> /**
> >>>
> >>
> >> Thanks
> >>
> >> Eric
> >
Hi Eric,
> From: Auger Eric <[email protected]>
> Sent: Sunday, July 19, 2020 11:39 PM
>
> Yi,
>
> On 7/12/20 1:21 PM, Liu Yi L wrote:
> > This patch allows user space to request PASID allocation/free, e.g.
> > when serving the request from the guest.
> >
> > PASIDs that are not freed by userspace are automatically freed when
> > the IOASID set is destroyed when process exits.
> >
> > Cc: Kevin Tian <[email protected]>
> > CC: Jacob Pan <[email protected]>
> > Cc: Alex Williamson <[email protected]>
> > Cc: Eric Auger <[email protected]>
> > Cc: Jean-Philippe Brucker <[email protected]>
> > Cc: Joerg Roedel <[email protected]>
> > Cc: Lu Baolu <[email protected]>
> > Signed-off-by: Liu Yi L <[email protected]>
> > Signed-off-by: Yi Sun <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > *) the comments for the PASID_FREE request is addressed in patch 5/15 of
> > this series.
> >
> > v3 -> v4:
> > *) address comments from v3, except the below comment against the range
> > of PASID_FREE request. needs more help on it.
> > "> +if (req.range.min > req.range.max)
> >
> > Is it exploitable that a user can spin the kernel for a long time in
> > the case of a free by calling this with [0, MAX_UINT] regardless of
> > their actual allocations?"
> >
> > https://lore.kernel.org/linux-iommu/[email protected]/
> >
> > v1 -> v2:
> > *) move the vfio_mm related code to be a seprate module
> > *) use a single structure for alloc/free, could support a range of
> > PASIDs
> > *) fetch vfio_mm at group_attach time instead of at iommu driver open
> > time
> > ---
> > drivers/vfio/Kconfig | 1 +
> > drivers/vfio/vfio_iommu_type1.c | 85
> +++++++++++++++++++++++++++++++++++++++++
> > drivers/vfio/vfio_pasid.c | 10 +++++
> > include/linux/vfio.h | 6 +++
> > include/uapi/linux/vfio.h | 37 ++++++++++++++++++
> > 5 files changed, 139 insertions(+)
> >
> > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index
> > 3d8a108..95d90c6 100644
> > --- a/drivers/vfio/Kconfig
> > +++ b/drivers/vfio/Kconfig
> > @@ -2,6 +2,7 @@
> > config VFIO_IOMMU_TYPE1
> > tristate
> > depends on VFIO
> > + select VFIO_PASID if (X86)
> > default n
> >
> > config VFIO_IOMMU_SPAPR_TCE
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index ed80104..55b4065 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -76,6 +76,7 @@ struct vfio_iommu {
> > bool dirty_page_tracking;
> > bool pinned_page_dirty_scope;
> > struct iommu_nesting_info *nesting_info;
> > + struct vfio_mm *vmm;
> > };
> >
> > struct vfio_domain {
> > @@ -1937,6 +1938,11 @@ static void vfio_iommu_iova_insert_copy(struct
> > vfio_iommu *iommu,
> >
> > static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
> > {
> > + if (iommu->vmm) {
> > + vfio_mm_put(iommu->vmm);
> > + iommu->vmm = NULL;
> > + }
> > +
> > kfree(iommu->nesting_info);
> > iommu->nesting_info = NULL;
> > }
> > @@ -2071,6 +2077,26 @@ static int vfio_iommu_type1_attach_group(void
> *iommu_data,
> > iommu->nesting_info);
> > if (ret)
> > goto out_detach;
> > +
> > + if (iommu->nesting_info->features &
> > + IOMMU_NESTING_FEAT_SYSWIDE_PASID)
> {
> > + struct vfio_mm *vmm;
> > + int sid;
> > +
> > + vmm = vfio_mm_get_from_task(current);
> > + if (IS_ERR(vmm)) {
> > + ret = PTR_ERR(vmm);
> > + goto out_detach;
> > + }
> > + iommu->vmm = vmm;
> > +
> > + sid = vfio_mm_ioasid_sid(vmm);
> > + ret = iommu_domain_set_attr(domain->domain,
> > + DOMAIN_ATTR_IOASID_SID,
> > + &sid);
> > + if (ret)
> > + goto out_detach;
> > + }
> > }
> >
> > /* Get aperture info */
> > @@ -2855,6 +2881,63 @@ static int vfio_iommu_type1_dirty_pages(struct
> vfio_iommu *iommu,
> > return -EINVAL;
> > }
> >
> > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > + unsigned int min,
> > + unsigned int max)
> > +{
> > + int ret = -EOPNOTSUPP;
> > +
> > + mutex_lock(&iommu->lock);
> > + if (iommu->vmm)
> > + ret = vfio_pasid_alloc(iommu->vmm, min, max);
> > + mutex_unlock(&iommu->lock);
> > + return ret;
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > + unsigned int min,
> > + unsigned int max)
> > +{
> > + int ret = -EOPNOTSUPP;
> > +
> > + mutex_lock(&iommu->lock);
> > + if (iommu->vmm) {
> > + vfio_pasid_free_range(iommu->vmm, min, max);
> > + ret = 0;
> > + }
> > + mutex_unlock(&iommu->lock);
> > + return ret;
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu,
> > + unsigned long arg)
> > +{
> > + struct vfio_iommu_type1_pasid_request req;
> > + unsigned long minsz;
> > +
> > + minsz = offsetofend(struct vfio_iommu_type1_pasid_request, range);
> > +
> > + if (copy_from_user(&req, (void __user *)arg, minsz))
> > + return -EFAULT;
> > +
> > + if (req.argsz < minsz || (req.flags & ~VFIO_PASID_REQUEST_MASK))
> > + return -EINVAL;
> > +
> > + if (req.range.min > req.range.max)
> > + return -EINVAL;
> > +
> > + switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > + case VFIO_IOMMU_FLAG_ALLOC_PASID:
> not sure it is worth to introduce both vfio_iommu_type1_pasid_free/alloc
> helpers. You could take the lock above, and test iommu->vmm as well above.
nice suggestion. thanks.
> > + return vfio_iommu_type1_pasid_alloc(iommu,
> > + req.range.min, req.range.max);
> > + case VFIO_IOMMU_FLAG_FREE_PASID:
> > + return vfio_iommu_type1_pasid_free(iommu,
> > + req.range.min, req.range.max);
> > + default:
> > + return -EINVAL;
> > + }
> > +}
> > +
> > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > unsigned int cmd, unsigned long arg) { @@ -
> 2871,6 +2954,8 @@
> > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > return vfio_iommu_type1_unmap_dma(iommu, arg);
> > case VFIO_IOMMU_DIRTY_PAGES:
> > return vfio_iommu_type1_dirty_pages(iommu, arg);
> > + case VFIO_IOMMU_PASID_REQUEST:
> > + return vfio_iommu_type1_pasid_request(iommu, arg);
> > default:
> > return -ENOTTY;
> > }
> > diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c
> > index 66e6054e..ebec244 100644
> > --- a/drivers/vfio/vfio_pasid.c
> > +++ b/drivers/vfio/vfio_pasid.c
> > @@ -61,6 +61,7 @@ void vfio_mm_put(struct vfio_mm *vmm) {
> > kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio_mm_lock); }
> > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> >
> > static void vfio_mm_get(struct vfio_mm *vmm)
> > {
> > @@ -114,6 +115,13 @@ struct vfio_mm *vfio_mm_get_from_task(struct
> task_struct *task)
> > mmput(mm);
> > return vmm;
> > }
> > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > +
> > +int vfio_mm_ioasid_sid(struct vfio_mm *vmm)
> > +{
> > + return vmm->ioasid_sid;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_ioasid_sid);
> >
> > /*
> > * Find PASID within @min and @max
> > @@ -197,6 +205,7 @@ int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> >
> > return pasid;
> > }
> > +EXPORT_SYMBOL_GPL(vfio_pasid_alloc);
> >
> > void vfio_pasid_free_range(struct vfio_mm *vmm,
> > ioasid_t min, ioasid_t max)
> > @@ -213,6 +222,7 @@ void vfio_pasid_free_range(struct vfio_mm *vmm,
> > vfio_remove_pasid(vmm, vid);
> > mutex_unlock(&vmm->pasid_lock);
> > }
> > +EXPORT_SYMBOL_GPL(vfio_pasid_free_range);
> >
> > static int __init vfio_pasid_init(void)
> > {
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index 31472a9..a111108 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -101,6 +101,7 @@ struct vfio_mm;
> > #if IS_ENABLED(CONFIG_VFIO_PASID)
> > extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> > extern void vfio_mm_put(struct vfio_mm *vmm);
> > +int vfio_mm_ioasid_sid(struct vfio_mm *vmm);
> I still don't get why it does not take an extern as the other functions
> implemented in vfio_pasid.c and used in other modules. what's the
> difference?
it's my mistake. it should have "extern" :-)
> > extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> > extern void vfio_pasid_free_range(struct vfio_mm *vmm,
> > ioasid_t min, ioasid_t max);
> > @@ -114,6 +115,11 @@ static inline void vfio_mm_put(struct vfio_mm *vmm)
> > {
> > }
> >
> > +static inline int vfio_mm_ioasid_sid(struct vfio_mm *vmm)
> > +{
> > + return -ENOTTY;
> > +}
> > +
> > static inline int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > {
> > return -ENOTTY;
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 46a78af..96a115f 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -1172,6 +1172,43 @@ struct vfio_iommu_type1_dirty_bitmap_get {
> >
> > #define VFIO_IOMMU_DIRTY_PAGES _IO(VFIO_TYPE, VFIO_BASE + 17)
> >
> > +/**
> > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 18,
> > + * struct vfio_iommu_type1_pasid_request)
> > + *
> > + * PASID (Processor Address Space ID) is a PCIe concept for tagging
> > + * address spaces in DMA requests. When system-wide PASID allocation
> > + * is required by underlying iommu driver (e.g. Intel VT-d), this
> by the underlying
got it.
> > + * provides an interface for userspace to request pasid alloc/free
> > + * for its assigned devices. Userspace should check the availability
> > + * of this API by checking VFIO_IOMMU_TYPE1_INFO_CAP_NESTING through
> > + * VFIO_IOMMU_GET_INFO.
> > + *
> > + * @flags=VFIO_IOMMU_FLAG_ALLOC_PASID, allocate a single PASID within
> @range.
> > + * @flags=VFIO_IOMMU_FLAG_FREE_PASID, free the PASIDs within @range.
> > + * @range is [min, max], which means both @min and @max are inclusive.
> > + * ALLOC_PASID and FREE_PASID are mutually exclusive.
> > + *
> > + * returns: allocated PASID value on success, -errno on failure for
> > + * ALLOC_PASID;
> > + * 0 for FREE_PASID operation;
> > + */
> > +struct vfio_iommu_type1_pasid_request {
> > + __u32 argsz;
> > +#define VFIO_IOMMU_FLAG_ALLOC_PASID (1 << 0)
> > +#define VFIO_IOMMU_FLAG_FREE_PASID (1 << 1)
> > + __u32 flags;
> > + struct {
> > + __u32 min;
> > + __u32 max;
> > + } range;
> > +};
> > +
> > +#define VFIO_PASID_REQUEST_MASK (VFIO_IOMMU_FLAG_ALLOC_PASID | \
> > + VFIO_IOMMU_FLAG_FREE_PASID)
> > +
> > +#define VFIO_IOMMU_PASID_REQUEST _IO(VFIO_TYPE, VFIO_BASE + 18)
> > +
> > /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> >
> > /*
> >
> Otherwise, besides the pieces I would have put directly in the
> vfio_pasid module patch (EXPORT_SYMBOL_GPL and vfio_mm_ioasid_sid),
> looks good.
we discussed this in another thread :-).
Regards,
Yi Liu
>
> Thanks
>
> Eric
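[Editor's note: below is a minimal userspace sketch of how the proposed
VFIO_IOMMU_PASID_REQUEST interface quoted above would be driven. It is not
part of the series; the container fd handling and error checking are elided,
and the struct/flag definitions are copied from the uAPI hunk in this mail
because installed <linux/vfio.h> headers may predate this RFC.]

#include <string.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/vfio.h>

/* Definitions copied from the uAPI hunk above, in case the installed
 * headers do not yet carry this proposed interface. */
#ifndef VFIO_IOMMU_PASID_REQUEST
struct vfio_iommu_type1_pasid_request {
	__u32 argsz;
#define VFIO_IOMMU_FLAG_ALLOC_PASID	(1 << 0)
#define VFIO_IOMMU_FLAG_FREE_PASID	(1 << 1)
	__u32 flags;
	struct {
		__u32 min;
		__u32 max;
	} range;
};
#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 18)
#endif

/* Allocate one PASID from the inclusive range [min, max] on an opened
 * VFIO container fd. Availability should first be checked via
 * VFIO_IOMMU_TYPE1_INFO_CAP_NESTING through VFIO_IOMMU_GET_INFO. */
static int pasid_alloc(int container_fd, __u32 min, __u32 max)
{
	struct vfio_iommu_type1_pasid_request req;

	memset(&req, 0, sizeof(req));
	req.argsz = sizeof(req);
	req.flags = VFIO_IOMMU_FLAG_ALLOC_PASID;	/* exclusive with FREE_PASID */
	req.range.min = min;				/* both bounds are inclusive */
	req.range.max = max;

	/* Returns the allocated PASID value on success, -1 with errno set
	 * on failure. A FREE_PASID request would return 0 on success. */
	return ioctl(container_fd, VFIO_IOMMU_PASID_REQUEST, &req);
}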
Yi,
On 7/12/20 1:21 PM, Liu Yi L wrote:
> This patch exposes PCIe PASID capability to guest for assigned devices.
> Existing vfio_pci driver hides it from guest by setting the capability
> length as 0 in pci_ext_cap_length[].
>
> And this patch only exposes PASID capability for devices which has PCIe
> PASID extended struture in its configuration space. So VFs, will will
s/will//
> not see PASID capability on VFs as VF doesn't implement PASID extended
suggested rewording: VFs will not expose the PASID capability as they do
not implement the PASID extended structure in their config space?
> structure in its configuration space. For VF, it is a TODO in future.
> Related discussion can be found in below link:
>
> https://lkml.org/lkml/2020/4/7/693
>
> Cc: Kevin Tian <[email protected]>
> CC: Jacob Pan <[email protected]>
> Cc: Alex Williamson <[email protected]>
> Cc: Eric Auger <[email protected]>
> Cc: Jean-Philippe Brucker <[email protected]>
> Cc: Joerg Roedel <[email protected]>
> Cc: Lu Baolu <[email protected]>
> Signed-off-by: Liu Yi L <[email protected]>
> ---
> v1 -> v2:
> *) added in v2, but it was sent in a separate patchseries before
> ---
> drivers/vfio/pci/vfio_pci_config.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index d98843f..07ff2e6 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -95,7 +95,7 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = {
> [PCI_EXT_CAP_ID_LTR] = PCI_EXT_CAP_LTR_SIZEOF,
> [PCI_EXT_CAP_ID_SECPCI] = 0, /* not yet */
> [PCI_EXT_CAP_ID_PMUX] = 0, /* not yet */
> - [PCI_EXT_CAP_ID_PASID] = 0, /* not yet */
> + [PCI_EXT_CAP_ID_PASID] = PCI_EXT_CAP_PASID_SIZEOF,
> };
>
> /*
>
Reviewed-by: Eric Auger <[email protected]>
Thanks
Eric
Hi Eric,
> From: Auger Eric <[email protected]>
> Sent: Monday, July 20, 2020 8:35 PM
>
> Yi,
>
> On 7/12/20 1:21 PM, Liu Yi L wrote:
> > This patch exposes PCIe PASID capability to guest for assigned devices.
> > Existing vfio_pci driver hides it from guest by setting the capability
> > length as 0 in pci_ext_cap_length[].
> >
> > And this patch only exposes PASID capability for devices which has PCIe
> > PASID extended struture in its configuration space. So VFs, will will
> s/will//
got it.
> > not see PASID capability on VFs as VF doesn't implement PASID extended
> suggested rewording: VFs will not expose the PASID capability as they do
> not implement the PASID extended structure in their config space?
makes sense. will modify it.
> > structure in its configuration space. For VF, it is a TODO in future.
> > Related discussion can be found in below link:
> >
> > https://lkml.org/lkml/2020/4/7/693
>
> >
> > Cc: Kevin Tian <[email protected]>
> > CC: Jacob Pan <[email protected]>
> > Cc: Alex Williamson <[email protected]>
> > Cc: Eric Auger <[email protected]>
> > Cc: Jean-Philippe Brucker <[email protected]>
> > Cc: Joerg Roedel <[email protected]>
> > Cc: Lu Baolu <[email protected]>
> > Signed-off-by: Liu Yi L <[email protected]>
> > ---
> > v1 -> v2:
> > *) added in v2, but it was sent in a separate patchseries before
> > ---
> > drivers/vfio/pci/vfio_pci_config.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> > index d98843f..07ff2e6 100644
> > --- a/drivers/vfio/pci/vfio_pci_config.c
> > +++ b/drivers/vfio/pci/vfio_pci_config.c
> > @@ -95,7 +95,7 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX +
> 1] = {
> > [PCI_EXT_CAP_ID_LTR] = PCI_EXT_CAP_LTR_SIZEOF,
> > [PCI_EXT_CAP_ID_SECPCI] = 0, /* not yet */
> > [PCI_EXT_CAP_ID_PMUX] = 0, /* not yet */
> > - [PCI_EXT_CAP_ID_PASID] = 0, /* not yet */
> > + [PCI_EXT_CAP_ID_PASID] = PCI_EXT_CAP_PASID_SIZEOF,
> > };
> >
> > /*
> >
> Reviewed-by: Eric Auger <[email protected]>
thanks.
Regards,
Yi Liu
> Thanks
>
> Eric
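[Editor's note: as a side note on the one-line change reviewed above, once
pci_ext_cap_length[PCI_EXT_CAP_ID_PASID] is non-zero, vfio_pci no longer
drops the PASID extended capability from the virtualized config space, so a
guest walking the PCIe extended capability chain can find it. The sketch
below is illustrative only; read_cfg32() is a hypothetical accessor standing
in for whatever mechanism reads 32 bits of the device's config space.]

#include <stdint.h>

#define PCI_EXT_CAP_ID_PASID	0x1b
#define PCI_EXT_CAP_START	0x100	/* extended caps begin at offset 256 */

extern uint32_t read_cfg32(uint16_t offset);	/* hypothetical accessor */

/* Walk the extended capability chain and return the offset of the PASID
 * capability, or 0 if it is not exposed (e.g. hidden by a zero length in
 * pci_ext_cap_length[], or a VF without the PASID extended structure). */
static uint16_t find_pasid_ecap(void)
{
	uint16_t pos = PCI_EXT_CAP_START;

	while (pos) {
		uint32_t hdr = read_cfg32(pos);

		if ((hdr & 0xffff) == PCI_EXT_CAP_ID_PASID)
			return pos;
		pos = (hdr >> 20) & 0xffc;	/* next-capability pointer */
	}
	return 0;
}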