Hi Joerg and All,
(Rebased to 4.17-rc1. resend)
Shared virtual address (SVA), a.k.a. shared virtual memory (SVM), on Intel
platforms allows address space sharing between device DMA and applications.
SVA can reduce programming complexity and enhance security. To enable SVA
in the guest, i.e. to share a guest application address space with physical
device DMA, the IOMMU driver must provide some new functionality.
This patchset is a follow-up on the discussions held at LPC 2017
VFIO/IOMMU/PCI track. Slides and notes can be found here:
https://linuxplumbersconf.org/2017/ocw/events/LPC2017/tracks/636
The complete guest SVA support also involves changes in QEMU and VFIO,
which have been posted earlier.
https://www.spinics.net/lists/kvm/msg148798.html
This is the IOMMU-portion follow-up to the more complete series of
kernel changes to support vSVA. Please refer to the link below for more
details. https://www.spinics.net/lists/kvm/msg148819.html
Generic APIs are introduced in addition to the Intel VT-d specific changes;
the goal is to have common interfaces across IOMMU and device types for
both VFIO and other in-kernel users.
At the top level, new IOMMU interfaces are introduced as follows:
- bind guest PASID table
- passdown invalidations of translation caches
- IOMMU device fault reporting including page request/response and
non-recoverable faults.
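As a rough usage sketch only (placeholder flow; the exact signatures are in
the individual patches), an in-kernel user such as VFIO would exercise the
new interfaces roughly like this:

    /* Rough call-flow sketch, not a working driver:
     *
     * 1. Register a per-device fault handler so page requests and
     *    unrecoverable faults are forwarded to the caller:
     *        iommu_register_device_fault_handler(dev, handler, data);
     *
     * 2. Bind the guest PASID table to the device's IOMMU domain via the
     *    new bind_pasid_table interface.
     *
     * 3. Pass down guest-initiated cache invalidations:
     *        iommu_sva_invalidate(domain, dev, inv_info);
     *
     * 4. When the guest answers a recoverable fault, complete it:
     *        iommu_page_response(dev, msg);
     */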
For IOMMU-detected device fault reporting, struct device is extended to
provide a callback and tracking at the device level. The original proposal was
discussed here: "Error handling for I/O memory management units"
(https://lwn.net/Articles/608914/). I have experimented with two alternative
solutions:
1. Use a shared group notifier; this does not scale well and also causes
unwanted notification traffic when a sibling device in the group reports
faults.
2. Place the fault callback in the device IOMMU arch data, e.g.
device_domain_info in the Intel/FSL IOMMU drivers. This causes code
duplication, since per-device fault reporting is generic.
The additional patches are Intel VT-d specific; they either implement the
generic interfaces or replace existing private interfaces with them.
This patchset is based on the work and ideas from many people, especially:
Ashok Raj <[email protected]>
Liu, Yi L <[email protected]>
Jean-Philippe Brucker <[email protected]>
Thanks,
Jacob
V4
- Further integrated feedback on iommu_param and iommu_fault_param
from Jean and others.
- Handle fault reporting errors and race conditions. Keep track of
pending page requests per device such that page group responses can be
sanitized.
- Added a timer to handle an unresponsive guest that does not send a
page response in time.
- Use a workqueue for VT-d non-recoverable IRQ fault handling.
- Added trace events for invalidation and fault reporting.
V3
- Consolidated fault reporting data format based on discussions on v2,
including input from ARM and AMD.
- Renamed invalidation APIs from svm to sva based on discussions on v2
- Use a parent pointer under struct device for all iommu per device data
- Simplified the device fault callback, allowing driver private data to be
registered. This may make it easier to replace the domain fault handler.
V2
- Replaced the hybrid interface data model (generic data + vendor-specific
data) with all generic data. This has the security benefit that
data passed from user space can be sanitized by all software layers if
needed.
- Addressed review comments from V1
- Use per device fault report data
- Support page request/response communications between host IOMMU and
guest or other in-kernel users.
- Added unrecoverable fault reporting to DMAR
- Use threaded IRQ function for DMAR fault interrupt and fault
reporting
Jacob Pan (21):
iommu: introduce bind_pasid_table API function
iommu/vt-d: move device_domain_info to header
iommu/vt-d: add a flag for pasid table bound status
iommu/vt-d: add bind_pasid_table function
iommu/vt-d: add definitions for PFSID
iommu/vt-d: fix dev iotlb pfsid use
iommu/vt-d: support flushing more translation cache types
iommu/vt-d: add svm/sva invalidate function
iommu: introduce device fault data
driver core: add per device iommu param
iommu: introduce device fault report API
iommu: introduce page response function
iommu: handle page response timeout
iommu/config: add build dependency for dmar
iommu/vt-d: report non-recoverable faults to device
iommu/intel-svm: report device page request
iommu/intel-svm: replace dev ops with fault report API
iommu/intel-svm: do not flush iotlb for viommu
iommu/vt-d: add intel iommu page response function
trace/iommu: add sva trace events
iommu: use sva invalidate and device fault trace event
Liu, Yi L (1):
iommu: introduce iommu invalidate API function
drivers/iommu/Kconfig | 1 +
drivers/iommu/dmar.c | 209 ++++++++++++++++++++++-
drivers/iommu/intel-iommu.c | 376 +++++++++++++++++++++++++++++++++++++++---
drivers/iommu/intel-svm.c | 84 ++++++++--
drivers/iommu/iommu.c | 284 ++++++++++++++++++++++++++++++-
include/linux/device.h | 3 +
include/linux/dma_remapping.h | 1 +
include/linux/dmar.h | 2 +-
include/linux/intel-iommu.h | 52 +++++-
include/linux/intel-svm.h | 20 +--
include/linux/iommu.h | 226 ++++++++++++++++++++++++-
include/trace/events/iommu.h | 112 +++++++++++++
include/uapi/linux/iommu.h | 111 +++++++++++++
13 files changed, 1409 insertions(+), 72 deletions(-)
create mode 100644 include/uapi/linux/iommu.h
--
2.7.4
Device faults detected by the IOMMU can be reported outside the IOMMU
subsystem for further processing. This patch provides generic device
fault data such that device drivers can be informed of IOMMU faults
without model-specific knowledge.
The proposed format is the result of discussion at:
https://lkml.org/lkml/2017/11/10/291
Part of the code is based on Jean-Philippe Brucker's patchset
(https://patchwork.kernel.org/patch/9989315/).
The assumption is that the model-specific IOMMU driver can filter and
handle most of the internal faults if the cause is within the IOMMU driver's
control. Therefore, the fault reasons that can be reported are grouped
and generalized based on common specifications such as PCI ATS.
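As a minimal sketch (for illustration only; the registration call is
introduced later in this series and the handler name here is hypothetical),
a device driver could consume this data as follows:

    /* Sketch of a driver-level handler for the generic fault data. */
    static int my_driver_iommu_fault_handler(struct iommu_fault_event *evt,
                                             void *data)
    {
            struct device *dev = data;

            if (evt->type == IOMMU_FAULT_DMA_UNRECOV)
                    dev_err(dev, "unrecoverable DMA fault: reason %d addr 0x%llx\n",
                            evt->reason, evt->addr);
            else if (evt->type == IOMMU_FAULT_PAGE_REQ)
                    dev_dbg(dev, "page request: pasid %u group %u addr 0x%llx\n",
                            evt->pasid, evt->page_req_group_id, evt->addr);

            return 0;
    }

    /* Registered e.g. at probe time (API added in a later patch):
     *     iommu_register_device_fault_handler(dev, my_driver_iommu_fault_handler, dev);
     */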
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Jean-Philippe Brucker <[email protected]>
Signed-off-by: Liu, Yi L <[email protected]>
Signed-off-by: Ashok Raj <[email protected]>
---
include/linux/iommu.h | 102 +++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 100 insertions(+), 2 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e963dbd..8968933 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -49,13 +49,17 @@ struct bus_type;
struct device;
struct iommu_domain;
struct notifier_block;
+struct iommu_fault_event;
/* iommu fault flags */
-#define IOMMU_FAULT_READ 0x0
-#define IOMMU_FAULT_WRITE 0x1
+#define IOMMU_FAULT_READ (1 << 0)
+#define IOMMU_FAULT_WRITE (1 << 1)
+#define IOMMU_FAULT_EXEC (1 << 2)
+#define IOMMU_FAULT_PRIV (1 << 3)
typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
struct device *, unsigned long, int, void *);
+typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *, void *);
struct iommu_domain_geometry {
dma_addr_t aperture_start; /* First address that can be mapped */
@@ -264,6 +268,99 @@ struct iommu_device {
struct device *dev;
};
+/* Generic fault types; can be expanded, e.g. for IRQ remapping faults */
+enum iommu_fault_type {
+ IOMMU_FAULT_DMA_UNRECOV = 1, /* unrecoverable fault */
+ IOMMU_FAULT_PAGE_REQ, /* page request fault */
+};
+
+enum iommu_fault_reason {
+ IOMMU_FAULT_REASON_UNKNOWN = 0,
+
+ /* IOMMU internal error, no specific reason to report out */
+ IOMMU_FAULT_REASON_INTERNAL,
+
+ /* Could not access the PASID table */
+ IOMMU_FAULT_REASON_PASID_FETCH,
+
+ /*
+ * PASID is out of range (e.g. exceeds the maximum PASID
+ * supported by the IOMMU) or disabled.
+ */
+ IOMMU_FAULT_REASON_PASID_INVALID,
+
+ /* Could not access the page directory (Invalid PASID entry) */
+ IOMMU_FAULT_REASON_PGD_FETCH,
+
+ /* Could not access the page table entry (Bad address) */
+ IOMMU_FAULT_REASON_PTE_FETCH,
+
+ /* Protection flag check failed */
+ IOMMU_FAULT_REASON_PERMISSION,
+};
+
+/**
+ * struct iommu_fault_event - Generic per device fault data
+ *
+ * - PCI and non-PCI devices
+ * - Recoverable faults (e.g. page request), information based on PCI ATS
+ * and PASID spec.
+ * - Un-recoverable faults of device interest
+ * - DMA remapping and IRQ remapping faults
+ *
+ * @type: fault type
+ * @reason: fault reason, if relevant outside the IOMMU driver; internal
+ *          faults within the IOMMU driver's control are not reported
+ * @addr: the offending page address
+ * @pasid: contains process address space ID, used in shared virtual memory(SVM)
+ * @rid: requestor ID
+ * @page_req_group_id: page request group index
+ * @last_req: last request in a page request group
+ * @pasid_valid: indicates if the PRQ has a valid PASID
+ * @prot: page access protection flag, e.g. IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
+ * @device_private: if present, uniquely identify device-specific
+ * private data for an individual page request.
+ * @iommu_private: used by the IOMMU driver for storing fault-specific
+ * data. Users should not modify this field before
+ * sending the fault response.
+ */
+struct iommu_fault_event {
+ enum iommu_fault_type type;
+ enum iommu_fault_reason reason;
+ u64 addr;
+ u32 pasid;
+ u32 page_req_group_id;
+ u32 last_req : 1;
+ u32 pasid_valid : 1;
+ u32 prot;
+ u64 device_private;
+ u64 iommu_private;
+};
+
+/**
+ * struct iommu_fault_param - per-device IOMMU fault data
+ * @handler: Callback function to handle IOMMU faults at device level
+ * @data: handler private data
+ *
+ */
+struct iommu_fault_param {
+ iommu_dev_fault_handler_t handler;
+ void *data;
+};
+
+/**
+ * struct iommu_param - collection of per-device IOMMU data
+ *
+ * @fault_param: IOMMU detected device fault reporting data
+ *
+ * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
+ * struct iommu_group *iommu_group;
+ * struct iommu_fwspec *iommu_fwspec;
+ */
+struct iommu_param {
+ struct iommu_fault_param *fault_param;
+};
+
int iommu_device_register(struct iommu_device *iommu);
void iommu_device_unregister(struct iommu_device *iommu);
int iommu_device_sysfs_add(struct iommu_device *iommu,
@@ -437,6 +534,7 @@ struct iommu_ops {};
struct iommu_group {};
struct iommu_fwspec {};
struct iommu_device {};
+struct iommu_fault_param {};
static inline bool iommu_present(struct bus_type *bus)
{
--
2.7.4
When Shared Virtual Address (SVA) is enabled for a guest OS via a
vIOMMU, we need to provide invalidation support at the IOMMU API and driver
level. This patch adds an Intel VT-d specific function to implement the
IOMMU passdown invalidate API for shared virtual address.
The use case is to support caching structure invalidation
of assigned SVM-capable devices. The emulated IOMMU exposes the queued
invalidation capability and passes down all descriptors from the guest
to the physical IOMMU.
The assumption is that the guest-to-host device ID mapping is
resolved prior to calling the IOMMU driver. Based on the device handle,
the host IOMMU driver can replace certain fields before submitting to the
invalidation queue.
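As a rough illustration (the field values are hypothetical; the structure
layout comes from the generic invalidate API patch in this series), a caller
that has already resolved the host device could pass down a PASID-tagged,
page-selective IOTLB invalidation like this:

    struct tlb_invalidate_info inv_info = {
            .hdr.type     = IOMMU_INV_TYPE_TLB,
            .granularity  = IOMMU_INV_GRANU_PAGE_PASID,
            .flags        = IOMMU_INVALIDATE_PASID_TAGGED,
            .pasid        = pasid,            /* guest PASID, assumed resolved */
            .addr         = gva & PAGE_MASK,  /* guest virtual address to flush */
            .size         = 0,                /* single 4KB page */
    };

    ret = iommu_sva_invalidate(domain, dev, &inv_info);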
Signed-off-by: Liu, Yi L <[email protected]>
Signed-off-by: Ashok Raj <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel-iommu.c | 170 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 170 insertions(+)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index cae4042..c765448 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4973,6 +4973,175 @@ static void intel_iommu_detach_device(struct iommu_domain *domain,
dmar_remove_one_dev_info(to_dmar_domain(domain), dev);
}
+/*
+ * 3D array for converting IOMMU generic type-granularity to VT-d granularity
+ * X indexed by enum iommu_inv_type
+ * Y indicates request without and with PASID
+ * Z indexed by enum iommu_inv_granularity
+ *
+ * For example, to find the VT-d granularity encoding for the IOTLB type,
+ * a DMA request with PASID, and page-selective invalidation, the lookup indices are:
+ * [1][1][8], where
+ * 1: IOMMU_INV_TYPE_TLB
+ * 1: with PASID
+ * 8: IOMMU_INV_GRANU_PAGE_PASID
+ *
+ * The granu_map array indicates validity of the entries. 1: valid, 0: invalid
+ *
+ */
+static const int inv_type_granu_map[IOMMU_INV_NR_TYPE][2][IOMMU_INV_NR_GRANU] = {
+ /* extended dev IOTLBs, for dev-IOTLB, only global is valid,
+ for dev-EXIOTLB, two valid granu */
+ {
+ {1},
+ {0, 0, 0, 0, 1, 1, 0, 0, 0}
+ },
+ /* IOTLB and EIOTLB */
+ {
+ {1, 1, 0, 1, 0, 0, 0, 0, 0},
+ {0, 0, 0, 0, 1, 0, 1, 1, 1}
+ },
+ /* PASID cache */
+ {
+ {0},
+ {0, 0, 0, 0, 1, 1, 0, 0, 0}
+ },
+ /* context cache */
+ {
+ {1, 1, 1}
+ }
+};
+
+static const u64 inv_type_granu_table[IOMMU_INV_NR_TYPE][2][IOMMU_INV_NR_GRANU] = {
+ /* extended dev IOTLBs, only global is valid */
+ {
+ {QI_DEV_IOTLB_GRAN_ALL},
+ {0, 0, 0, 0, QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0, 0, 0}
+ },
+ /* IOTLB and EIOTLB */
+ {
+ {DMA_TLB_GLOBAL_FLUSH, DMA_TLB_DSI_FLUSH, 0, DMA_TLB_PSI_FLUSH},
+ {0, 0, 0, 0, QI_GRAN_ALL_ALL, 0, QI_GRAN_NONG_ALL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID}
+ },
+ /* PASID cache */
+ {
+ {0},
+ {0, 0, 0, 0, QI_PC_ALL_PASIDS, QI_PC_PASID_SEL}
+ },
+ /* context cache */
+ {
+ {DMA_CCMD_GLOBAL_INVL, DMA_CCMD_DOMAIN_INVL, DMA_CCMD_DEVICE_INVL}
+ }
+};
+
+static inline int to_vtd_granularity(int type, int granu, int with_pasid, u64 *vtd_granu)
+{
+ if (type >= IOMMU_INV_NR_TYPE || granu >= IOMMU_INV_NR_GRANU || with_pasid > 1)
+ return -EINVAL;
+
+ if (inv_type_granu_map[type][with_pasid][granu] == 0)
+ return -EINVAL;
+
+ *vtd_granu = inv_type_granu_table[type][with_pasid][granu];
+
+ return 0;
+}
+
+static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
+ struct device *dev, struct tlb_invalidate_info *inv_info)
+{
+ struct intel_iommu *iommu;
+ struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+ struct device_domain_info *info;
+ u16 did, sid;
+ u8 bus, devfn;
+ int ret = 0;
+ u64 granu;
+ unsigned long flags;
+
+ if (!inv_info || !dmar_domain)
+ return -EINVAL;
+
+ if (!dev || !dev_is_pci(dev))
+ return -ENODEV;
+
+ iommu = device_to_iommu(dev, &bus, &devfn);
+ if (!iommu)
+ return -ENODEV;
+
+ did = dmar_domain->iommu_did[iommu->seq_id];
+ sid = PCI_DEVID(bus, devfn);
+ ret = to_vtd_granularity(inv_info->hdr.type, inv_info->granularity,
+ !!(inv_info->flags & IOMMU_INVALIDATE_PASID_TAGGED), &granu);
+ if (ret) {
+ pr_err("Invalid range type %d, granu %d\n", inv_info->hdr.type,
+ inv_info->granularity);
+ return ret;
+ }
+
+ spin_lock(&iommu->lock);
+ spin_lock_irqsave(&device_domain_lock, flags);
+
+ switch (inv_info->hdr.type) {
+ case IOMMU_INV_TYPE_CONTEXT:
+ iommu->flush.flush_context(iommu, did, sid,
+ DMA_CCMD_MASK_NOBIT, granu);
+ break;
+ case IOMMU_INV_TYPE_TLB:
+ /* We need to deal with two scenarios:
+ * - IOTLB for request w/o PASID
+ * - extended IOTLB for request with PASID.
+ */
+ if (inv_info->size &&
+ (inv_info->addr & ((1 << (VTD_PAGE_SHIFT + inv_info->size)) - 1))) {
+ pr_err("Addr out of range, addr 0x%llx, size order %d\n",
+ inv_info->addr, inv_info->size);
+ ret = -ERANGE;
+ goto out_unlock;
+ }
+
+ if (inv_info->flags & IOMMU_INVALIDATE_PASID_TAGGED)
+ qi_flush_eiotlb(iommu, did, mm_to_dma_pfn(inv_info->addr),
+ inv_info->pasid,
+ inv_info->size, granu,
+ inv_info->flags & IOMMU_INVALIDATE_GLOBAL_PAGE);
+ else
+ qi_flush_iotlb(iommu, did, mm_to_dma_pfn(inv_info->addr),
+ inv_info->size, granu);
+ /*
+ * Always flush device IOTLB if ATS is enabled since guest
+ * vIOMMU exposes CM = 1, no device IOTLB flush will be passed
+ * down.
+ */
+ info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
+ if (info && info->ats_enabled) {
+ if (inv_info->flags & IOMMU_INVALIDATE_PASID_TAGGED)
+ qi_flush_dev_eiotlb(iommu, sid,
+ inv_info->pasid, info->ats_qdep,
+ inv_info->addr, inv_info->size,
+ granu);
+ else
+ qi_flush_dev_iotlb(iommu, sid, info->pfsid,
+ info->ats_qdep, inv_info->addr,
+ inv_info->size);
+ }
+ break;
+ case IOMMU_INV_TYPE_PASID:
+ qi_flush_pasid(iommu, did, granu, inv_info->pasid);
+
+ break;
+ default:
+ dev_err(dev, "Unknown IOMMU invalidation type %d\n",
+ inv_info->hdr.type);
+ ret = -EINVAL;
+ }
+out_unlock:
+ spin_unlock(&iommu->lock);
+ spin_unlock_irqrestore(&device_domain_lock, flags);
+
+ return ret;
+}
+
static int intel_iommu_map(struct iommu_domain *domain,
unsigned long iova, phys_addr_t hpa,
size_t size, int iommu_prot)
@@ -5398,6 +5567,7 @@ const struct iommu_ops intel_iommu_ops = {
#ifdef CONFIG_INTEL_IOMMU_SVM
.bind_pasid_table = intel_iommu_bind_pasid_table,
.unbind_pasid_table = intel_iommu_unbind_pasid_table,
+ .sva_invalidate = intel_iommu_sva_invalidate,
#endif
.map = intel_iommu_map,
.unmap = intel_iommu_unmap,
--
2.7.4
PFSID should be used in the invalidation descriptor for flushing
device IOTLBs on SRIOV VFs.
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/dmar.c | 6 +++---
drivers/iommu/intel-iommu.c | 16 +++++++++++++++-
include/linux/intel-iommu.h | 5 ++---
3 files changed, 20 insertions(+), 7 deletions(-)
diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index accf5838..38bb90f 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1339,8 +1339,8 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
qi_submit_sync(&desc, iommu);
}
-void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep,
- u64 addr, unsigned mask)
+void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+ u16 qdep, u64 addr, unsigned mask)
{
struct qi_desc desc;
@@ -1355,7 +1355,7 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep,
qdep = 0;
desc.low = QI_DEV_IOTLB_SID(sid) | QI_DEV_IOTLB_QDEP(qdep) |
- QI_DIOTLB_TYPE;
+ QI_DIOTLB_TYPE | QI_DEV_IOTLB_PFSID(pfsid);
qi_submit_sync(&desc, iommu);
}
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index d8058be..cae4042 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1459,6 +1459,19 @@ static void iommu_enable_dev_iotlb(struct device_domain_info *info)
return;
pdev = to_pci_dev(info->dev);
+ /* For IOMMU that supports device IOTLB throttling (DIT), we assign
+ * PFSID to the invalidation desc of a VF such that IOMMU HW can gauge
+ * queue depth at PF level. If DIT is not set, PFSID will be treated as
+ * reserved, which should be set to 0.
+ */
+ if (!ecap_dit(info->iommu->ecap)) {
+ info->pfsid = 0;
+ if (pdev && pdev->is_virtfn)
+ dev_warn(&pdev->dev, "SRIOV VF device IOTLB enabled without flow control\n");
+ } else if (pdev && pdev->is_virtfn) {
+ info->pfsid = PCI_DEVID(pdev->physfn->bus->number, pdev->physfn->devfn);
+ } else
+ info->pfsid = PCI_DEVID(info->bus, info->devfn);
#ifdef CONFIG_INTEL_IOMMU_SVM
/* The PCIe spec, in its wisdom, declares that the behaviour of
@@ -1524,7 +1537,8 @@ static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
sid = info->bus << 8 | info->devfn;
qdep = info->ats_qdep;
- qi_flush_dev_iotlb(info->iommu, sid, qdep, addr, mask);
+ qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
+ qdep, addr, mask);
}
spin_unlock_irqrestore(&device_domain_lock, flags);
}
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index dfacd49..678a0f4 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -504,9 +504,8 @@ extern void qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid,
u8 fm, u64 type);
extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
unsigned int size_order, u64 type);
-extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep,
- u64 addr, unsigned mask);
-
+extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+ u16 qdep, u64 addr, unsigned mask);
extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);
extern int dmar_ir_support(void);
--
2.7.4
For performance and debugging purposes, these trace events help
analyze device faults and passdown invalidations that interact
with the IOMMU subsystem.
E.g.
IOMMU:0000:00:0a.0 type=2 reason=0 addr=0x00000000007ff000 pasid=1
group=1 last=0 prot=1
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/iommu.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index f6512692..e2090d2 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -979,6 +979,7 @@ int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
mutex_unlock(&fparam->lock);
}
ret = fparam->handler(evt, fparam->data);
+ trace_dev_fault(dev, evt);
done_unlock:
mutex_unlock(&dev->iommu_param->lock);
return ret;
@@ -1547,6 +1548,7 @@ int iommu_sva_invalidate(struct iommu_domain *domain,
return -ENODEV;
ret = domain->ops->sva_invalidate(domain, dev, inv_info);
+ trace_sva_invalidate(dev, inv_info);
return ret;
}
@@ -1584,6 +1586,7 @@ int iommu_page_response(struct device *dev,
if (evt->pasid == msg->pasid &&
msg->page_req_group_id == evt->page_req_group_id) {
msg->private_data = evt->iommu_private;
+ trace_dev_page_response(dev, msg);
ret = domain->ops->page_response(dev, msg);
list_del(&evt->list);
kfree(evt);
--
2.7.4
With the introduction of the generic IOMMU device fault reporting API, we
can replace the private fault callback functions with the standard function
and event data.
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel-svm.c | 7 +------
include/linux/intel-svm.h | 20 +++-----------------
2 files changed, 4 insertions(+), 23 deletions(-)
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index a8186f8..bdda1b6 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -299,7 +299,7 @@ static const struct mmu_notifier_ops intel_mmuops = {
static DEFINE_MUTEX(pasid_mutex);
-int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
+int intel_svm_bind_mm(struct device *dev, int *pasid, int flags)
{
struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
struct intel_svm_dev *sdev;
@@ -346,10 +346,6 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
list_for_each_entry(sdev, &svm->devs, list) {
if (dev == sdev->dev) {
- if (sdev->ops != ops) {
- ret = -EBUSY;
- goto out;
- }
sdev->users++;
goto success;
}
@@ -375,7 +371,6 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
}
/* Finish the setup now we know we're keeping it */
sdev->users = 1;
- sdev->ops = ops;
init_rcu_head(&sdev->rcu);
if (!svm) {
diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
index 99bc5b3..a39a502 100644
--- a/include/linux/intel-svm.h
+++ b/include/linux/intel-svm.h
@@ -18,18 +18,6 @@
struct device;
-struct svm_dev_ops {
- void (*fault_cb)(struct device *dev, int pasid, u64 address,
- u32 private, int rwxp, int response);
-};
-
-/* Values for rxwp in fault_cb callback */
-#define SVM_REQ_READ (1<<3)
-#define SVM_REQ_WRITE (1<<2)
-#define SVM_REQ_EXEC (1<<1)
-#define SVM_REQ_PRIV (1<<0)
-
-
/*
* The SVM_FLAG_PRIVATE_PASID flag requests a PASID which is *not* the "main"
* PASID for the current process. Even if a PASID already exists, a new one
@@ -60,7 +48,6 @@ struct svm_dev_ops {
* @dev: Device to be granted acccess
* @pasid: Address for allocated PASID
* @flags: Flags. Later for requesting supervisor mode, etc.
- * @ops: Callbacks to device driver
*
* This function attempts to enable PASID support for the given device.
* If the @pasid argument is non-%NULL, a PASID is allocated for access
@@ -82,8 +69,7 @@ struct svm_dev_ops {
* Multiple calls from the same process may result in the same PASID
* being re-used. A reference count is kept.
*/
-extern int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
- struct svm_dev_ops *ops);
+extern int intel_svm_bind_mm(struct device *dev, int *pasid, int flags);
/**
* intel_svm_unbind_mm() - Unbind a specified PASID
@@ -120,7 +106,7 @@ extern int intel_svm_is_pasid_valid(struct device *dev, int pasid);
#else /* CONFIG_INTEL_IOMMU_SVM */
static inline int intel_svm_bind_mm(struct device *dev, int *pasid,
- int flags, struct svm_dev_ops *ops)
+ int flags)
{
return -ENOSYS;
}
@@ -136,6 +122,6 @@ static int intel_svm_is_pasid_valid(struct device *dev, int pasid)
}
#endif /* CONFIG_INTEL_IOMMU_SVM */
-#define intel_svm_available(dev) (!intel_svm_bind_mm((dev), NULL, 0, NULL))
+#define intel_svm_available(dev) (!intel_svm_bind_mm((dev), NULL, 0))
#endif /* __INTEL_SVM_H__ */
--
2.7.4
vIOMMU passdown invalidation will be inclusive: PASID cache invalidation
includes the TLBs. See the Intel VT-d Specification, Ch 6.5.2.2 for details.
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel-svm.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index bdda1b6..697d5c2 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -284,7 +284,9 @@ static void intel_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
rcu_read_lock();
list_for_each_entry_rcu(sdev, &svm->devs, list) {
intel_flush_pasid_dev(svm, sdev, svm->pasid);
- intel_flush_svm_range_dev(svm, sdev, 0, -1, 0, !svm->mm);
+ /* for emulated iommu, PASID cache invalidation implies IOTLB/DTLB */
+ if (!cap_caching_mode(svm->iommu->cap))
+ intel_flush_svm_range_dev(svm, sdev, 0, -1, 0, !svm->mm);
}
rcu_read_unlock();
--
2.7.4
Signed-off-by: Jacob Pan <[email protected]>
---
include/trace/events/iommu.h | 112 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 112 insertions(+)
diff --git a/include/trace/events/iommu.h b/include/trace/events/iommu.h
index 72b4582..e64eb29 100644
--- a/include/trace/events/iommu.h
+++ b/include/trace/events/iommu.h
@@ -12,6 +12,8 @@
#define _TRACE_IOMMU_H
#include <linux/tracepoint.h>
+#include <linux/iommu.h>
+#include <uapi/linux/iommu.h>
struct device;
@@ -161,6 +163,116 @@ DEFINE_EVENT(iommu_error, io_page_fault,
TP_ARGS(dev, iova, flags)
);
+
+TRACE_EVENT(dev_fault,
+
+ TP_PROTO(struct device *dev, struct iommu_fault_event *evt),
+
+ TP_ARGS(dev, evt),
+
+ TP_STRUCT__entry(
+ __string(device, dev_name(dev))
+ __field(int, type)
+ __field(int, reason)
+ __field(u64, addr)
+ __field(u32, pasid)
+ __field(u32, pgid)
+ __field(u32, last_req)
+ __field(u32, prot)
+ ),
+
+ TP_fast_assign(
+ __assign_str(device, dev_name(dev));
+ __entry->type = evt->type;
+ __entry->reason = evt->reason;
+ __entry->addr = evt->addr;
+ __entry->pasid = evt->pasid;
+ __entry->pgid = evt->page_req_group_id;
+ __entry->last_req = evt->last_req;
+ __entry->prot = evt->prot;
+ ),
+
+ TP_printk("IOMMU:%s type=%d reason=%d addr=0x%016llx pasid=%d group=%d last=%d prot=%d",
+ __get_str(device),
+ __entry->type,
+ __entry->reason,
+ __entry->addr,
+ __entry->pasid,
+ __entry->pgid,
+ __entry->last_req,
+ __entry->prot
+ )
+);
+
+TRACE_EVENT(dev_page_response,
+
+ TP_PROTO(struct device *dev, struct page_response_msg *msg),
+
+ TP_ARGS(dev, msg),
+
+ TP_STRUCT__entry(
+ __string(device, dev_name(dev))
+ __field(int, code)
+ __field(u64, addr)
+ __field(u32, pasid)
+ __field(u32, pgid)
+ ),
+
+ TP_fast_assign(
+ __assign_str(device, dev_name(dev));
+ __entry->code = msg->resp_code;
+ __entry->addr = msg->addr;
+ __entry->pasid = msg->pasid;
+ __entry->pgid = msg->page_req_group_id;
+ ),
+
+ TP_printk("IOMMU:%s code=%d addr=0x%016llx pasid=%d group=%d",
+ __get_str(device),
+ __entry->code,
+ __entry->addr,
+ __entry->pasid,
+ __entry->pgid
+ )
+);
+
+TRACE_EVENT(sva_invalidate,
+
+ TP_PROTO(struct device *dev, struct tlb_invalidate_info *ti),
+
+ TP_ARGS(dev, ti),
+
+ TP_STRUCT__entry(
+ __string(device, dev_name(dev))
+ __field(int, type)
+ __field(u32, granu)
+ __field(u32, flags)
+ __field(u8, size)
+ __field(u32, pasid)
+ __field(u64, addr)
+ ),
+
+ TP_fast_assign(
+ __assign_str(device, dev_name(dev));
+ __entry->type = ti->hdr.type;
+ __entry->flags = ti->flags;
+ __entry->granu = ti->granularity;
+ __entry->size = ti->size;
+ __entry->pasid = ti->pasid;
+ __entry->addr = ti->addr;
+ ),
+
+ TP_printk("IOMMU:%s type=%d flags=0x%08x granu=%d size=%d pasid=%d addr=0x%016llx",
+ __get_str(device),
+ __entry->type,
+ __entry->flags,
+ __entry->granu,
+ __entry->size,
+ __entry->pasid,
+ __entry->addr
+ )
+);
+
+
#endif /* _TRACE_IOMMU_H */
/* This part must be outside protection */
--
2.7.4
When Shared Virtual Memory is exposed to a guest via a vIOMMU, extended
IOTLB invalidations may be passed down from outside the IOMMU subsystem.
This patch adds invalidation functions that can be used for the additional
translation cache types.
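For illustration (argument values are hypothetical), the new helpers are
expected to be used by the passdown path roughly as follows:

    /* Flush one extended-IOTLB page for a given PASID, non-global */
    qi_flush_eiotlb(iommu, did, addr, pasid, 0, QI_GRAN_PSI_PASID, false);

    /* Drop the PASID-cache entry for that PASID */
    qi_flush_pasid(iommu, did, QI_PC_PASID_SEL, pasid);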
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/dmar.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/intel-iommu.h | 21 +++++++++++++++++++--
2 files changed, 63 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 38bb90f..71bfc73 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1339,6 +1339,18 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
qi_submit_sync(&desc, iommu);
}
+void qi_flush_eiotlb(struct intel_iommu *iommu, u16 did, u64 addr, u32 pasid,
+ unsigned int size_order, u64 granu, bool global)
+{
+ struct qi_desc desc;
+
+ desc.low = QI_EIOTLB_PASID(pasid) | QI_EIOTLB_DID(did) |
+ QI_EIOTLB_GRAN(granu) | QI_EIOTLB_TYPE;
+ desc.high = QI_EIOTLB_ADDR(addr) | QI_EIOTLB_GL(global) |
+ QI_EIOTLB_IH(0) | QI_EIOTLB_AM(size_order);
+ qi_submit_sync(&desc, iommu);
+}
+
void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
u16 qdep, u64 addr, unsigned mask)
{
@@ -1360,6 +1372,38 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
qi_submit_sync(&desc, iommu);
}
+void qi_flush_dev_eiotlb(struct intel_iommu *iommu, u16 sid,
+ u32 pasid, u16 qdep, u64 addr, unsigned size, u64 granu)
+{
+ struct qi_desc desc;
+
+ desc.low = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) |
+ QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE;
+ desc.high = QI_DEV_EIOTLB_GLOB(granu);
+
+ /* If S bit is 0, we only flush a single page. If S bit is set,
+ * the least significant zero bit indicates the size. VT-d spec
+ * 6.5.2.6
+ */
+ if (!size)
+ desc.high |= QI_DEV_EIOTLB_ADDR(addr) & ~QI_DEV_EIOTLB_SIZE;
+ else {
+ unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size);
+
+ desc.high |= QI_DEV_EIOTLB_ADDR(addr & ~mask) | QI_DEV_EIOTLB_SIZE;
+ }
+ qi_submit_sync(&desc, iommu);
+}
+
+void qi_flush_pasid(struct intel_iommu *iommu, u16 did, u64 granu, int pasid)
+{
+ struct qi_desc desc;
+
+ desc.high = 0;
+ desc.low = QI_PC_TYPE | QI_PC_DID(did) | QI_PC_GRAN(granu) | QI_PC_PASID(pasid);
+
+ qi_submit_sync(&desc, iommu);
+}
/*
* Disable Queued Invalidation interface.
*/
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 678a0f4..c54bce1 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -262,6 +262,10 @@ enum {
#define QI_PGRP_RESP_TYPE 0x9
#define QI_PSTRM_RESP_TYPE 0xa
+#define QI_DID(did) (((u64)did & 0xffff) << 16)
+#define QI_DID_MASK GENMASK(31, 16)
+#define QI_TYPE_MASK GENMASK(3, 0)
+
#define QI_IEC_SELECTIVE (((u64)1) << 4)
#define QI_IEC_IIDEX(idx) (((u64)(idx & 0xffff) << 32))
#define QI_IEC_IM(m) (((u64)(m & 0x1f) << 27))
@@ -293,8 +297,9 @@ enum {
#define QI_PC_DID(did) (((u64)did) << 16)
#define QI_PC_GRAN(gran) (((u64)gran) << 4)
-#define QI_PC_ALL_PASIDS (QI_PC_TYPE | QI_PC_GRAN(0))
-#define QI_PC_PASID_SEL (QI_PC_TYPE | QI_PC_GRAN(1))
+/* PASID cache invalidation granu */
+#define QI_PC_ALL_PASIDS 0
+#define QI_PC_PASID_SEL 1
#define QI_EIOTLB_ADDR(addr) ((u64)(addr) & VTD_PAGE_MASK)
#define QI_EIOTLB_GL(gl) (((u64)gl) << 7)
@@ -304,6 +309,10 @@ enum {
#define QI_EIOTLB_DID(did) (((u64)did) << 16)
#define QI_EIOTLB_GRAN(gran) (((u64)gran) << 4)
+/* QI Dev-IOTLB inv granu */
+#define QI_DEV_IOTLB_GRAN_ALL 0
+#define QI_DEV_IOTLB_GRAN_PASID_SEL 1
+
#define QI_DEV_EIOTLB_ADDR(a) ((u64)(a) & VTD_PAGE_MASK)
#define QI_DEV_EIOTLB_SIZE (((u64)1) << 11)
#define QI_DEV_EIOTLB_GLOB(g) ((u64)g)
@@ -332,6 +341,7 @@ enum {
#define QI_RESP_INVALID 0x1
#define QI_RESP_FAILURE 0xf
+/* QI EIOTLB inv granu */
#define QI_GRAN_ALL_ALL 0
#define QI_GRAN_NONG_ALL 1
#define QI_GRAN_NONG_PASID 2
@@ -504,8 +514,15 @@ extern void qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid,
u8 fm, u64 type);
extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
unsigned int size_order, u64 type);
+extern void qi_flush_eiotlb(struct intel_iommu *iommu, u16 did, u64 addr,
+ u32 pasid, unsigned int size_order, u64 type, bool global);
extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
u16 qdep, u64 addr, unsigned mask);
+
+extern void qi_flush_dev_eiotlb(struct intel_iommu *iommu, u16 sid,
+ u32 pasid, u16 qdep, u64 addr, unsigned size, u64 granu);
+extern void qi_flush_pasid(struct intel_iommu *iommu, u16 did, u64 granu, int pasid);
+
extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);
extern int dmar_ir_support(void);
--
2.7.4
When an SRIOV VF device IOTLB is invalidated, we need to provide
the PF source ID such that IOMMU hardware can gauge the depth
of the invalidation queue, which is shared among VFs. This is needed
when the device invalidation throttling (DIT) capability is supported.
This patch adds bit definitions for checking and tracking PFSID.
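For illustration (mirroring the later patch in this series that wires this up
in iommu_enable_dev_iotlb()), the PFSID is derived and carried roughly like
this:

    /* VF: use the parent PF's requester ID as the PFSID */
    if (pdev->is_virtfn)
            info->pfsid = PCI_DEVID(pdev->physfn->bus->number,
                                    pdev->physfn->devfn);
    else
            info->pfsid = PCI_DEVID(info->bus, info->devfn);

    /* ... and encode it into the dev-IOTLB invalidation descriptor */
    desc.low |= QI_DEV_IOTLB_PFSID(info->pfsid);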
Signed-off-by: Jacob Pan <[email protected]>
---
include/linux/intel-iommu.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index ddc7d79..dfacd49 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -114,6 +114,7 @@
* Extended Capability Register
*/
+#define ecap_dit(e) ((e >> 41) & 0x1)
#define ecap_pasid(e) ((e >> 40) & 0x1)
#define ecap_pss(e) ((e >> 35) & 0x1f)
#define ecap_eafs(e) ((e >> 34) & 0x1)
@@ -284,6 +285,7 @@ enum {
#define QI_DEV_IOTLB_SID(sid) ((u64)((sid) & 0xffff) << 32)
#define QI_DEV_IOTLB_QDEP(qdep) (((qdep) & 0x1f) << 16)
#define QI_DEV_IOTLB_ADDR(addr) ((u64)(addr) & VTD_PAGE_MASK)
+#define QI_DEV_IOTLB_PFSID(pfsid) (((u64)(pfsid & 0xf) << 12) | ((u64)(pfsid & 0xff0) << 48))
#define QI_DEV_IOTLB_SIZE 1
#define QI_DEV_IOTLB_MAX_INVS 32
@@ -308,6 +310,7 @@ enum {
#define QI_DEV_EIOTLB_PASID(p) (((u64)p) << 32)
#define QI_DEV_EIOTLB_SID(sid) ((u64)((sid) & 0xffff) << 16)
#define QI_DEV_EIOTLB_QDEP(qd) ((u64)((qd) & 0x1f) << 4)
+#define QI_DEV_EIOTLB_PFSID(pfsid) (((u64)(pfsid & 0xf) << 12) | ((u64)(pfsid & 0xff0) << 48))
#define QI_DEV_EIOTLB_MAX_INVS 32
#define QI_PGRP_IDX(idx) (((u64)(idx)) << 55)
@@ -467,6 +470,7 @@ struct device_domain_info {
struct list_head global; /* link to global list */
u8 bus; /* PCI bus number */
u8 devfn; /* PCI devfn number */
+ u16 pfsid; /* SRIOV physical function source ID */
u8 pasid_supported:3;
u8 pasid_enabled:1;
u8 pri_supported:1;
--
2.7.4
This patch adds page response support for Intel VT-d.
The generic response data is taken from the IOMMU API
and parsed into the VT-d specific response descriptor format.
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel-iommu.c | 47 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/intel-iommu.h | 3 +++
2 files changed, 50 insertions(+)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index a6ea67d..38f76d4 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5142,6 +5142,52 @@ static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
return ret;
}
+static int intel_iommu_page_response(struct device *dev, struct page_response_msg *msg)
+{
+ struct qi_desc resp;
+ struct intel_iommu *iommu;
+ struct pci_dev *pdev;
+ u8 bus, devfn;
+ u16 rid;
+ u64 desc;
+
+ pdev = to_pci_dev(dev);
+ iommu = device_to_iommu(dev, &bus, &devfn);
+ if (!iommu) {
+ dev_err(dev, "No IOMMU for device to unbind PASID table\n");
+ return -ENODEV;
+ }
+
+ pci_dev_get(pdev);
+ rid = ((u16)bus << 8) | devfn;
+ /* The IOMMU private data contains the preserved page request descriptor,
+ * so we inspect the SRR bit for the response type, then queue the response
+ * with only the private data bits [54:32].
+ */
+ desc = msg->private_data;
+ if (desc & QI_PRQ_SRR) {
+ /* Page Stream Response */
+ resp.low = QI_PSTRM_IDX(msg->page_req_group_id) |
+ (desc & QI_PRQ_PRIV) | QI_PSTRM_BUS(PCI_BUS_NUM(pdev->bus->number)) |
+ QI_PSTRM_PASID(msg->pasid) | QI_PSTRM_RESP_TYPE;
+ resp.high = QI_PSTRM_ADDR(msg->addr) | QI_PSTRM_DEVFN(pdev->devfn & 0xff) |
+ QI_PSTRM_RESP_CODE(msg->resp_code);
+ } else {
+ /* Page Group Response */
+ resp.low = QI_PGRP_PASID(msg->pasid) |
+ QI_PGRP_DID(rid) |
+ QI_PGRP_PASID_P(msg->pasid_present) |
+ QI_PGRP_RESP_TYPE;
+ resp.high = QI_PGRP_IDX(msg->page_req_group_id) |
+ (desc & QI_PRQ_PRIV) | QI_PGRP_RESP_CODE(msg->resp_code);
+
+ }
+ qi_submit_sync(&resp, iommu);
+ pci_dev_put(pdev);
+
+ return 0;
+}
+
static int intel_iommu_map(struct iommu_domain *domain,
unsigned long iova, phys_addr_t hpa,
size_t size, int iommu_prot)
@@ -5568,6 +5614,7 @@ const struct iommu_ops intel_iommu_ops = {
.bind_pasid_table = intel_iommu_bind_pasid_table,
.unbind_pasid_table = intel_iommu_unbind_pasid_table,
.sva_invalidate = intel_iommu_sva_invalidate,
+ .page_response = intel_iommu_page_response,
#endif
.map = intel_iommu_map,
.unmap = intel_iommu_unmap,
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index dbe8c93..ed2883a 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -337,6 +337,9 @@ enum {
#define QI_PSTRM_BUS(bus) (((u64)(bus)) << 24)
#define QI_PSTRM_PASID(pasid) (((u64)(pasid)) << 4)
+#define QI_PRQ_SRR BIT_ULL(0)
+#define QI_PRQ_PRIV GENMASK_ULL(54, 32)
+
#define QI_RESP_SUCCESS 0x0
#define QI_RESP_INVALID 0x1
#define QI_RESP_FAILURE 0xf
--
2.7.4
Currently, the DMAR fault IRQ handler does nothing more than a
rate-limited printk; no critical hardware handling needs to be done
in IRQ context.
For some use cases, such as vIOMMU, it might be useful to report
non-recoverable faults outside the host IOMMU subsystem. DMAR faults
can come from both DMA and interrupt remapping, which have to be
set up early, before threaded IRQs are available.
This patch adds an option and a workqueue such that when fault
reporting is requested, the DMAR fault IRQ handler can use the IOMMU
fault reporting API to report faults.
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Liu, Yi L <[email protected]>
Signed-off-by: Ashok Raj <[email protected]>
---
drivers/iommu/dmar.c | 159 ++++++++++++++++++++++++++++++++++++++++++--
drivers/iommu/intel-iommu.c | 6 +-
include/linux/dmar.h | 2 +-
include/linux/intel-iommu.h | 1 +
4 files changed, 159 insertions(+), 9 deletions(-)
diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 71bfc73..9dcc95a 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1110,6 +1110,12 @@ static int alloc_iommu(struct dmar_drhd_unit *drhd)
return err;
}
+static inline void dmar_free_fault_wq(struct intel_iommu *iommu)
+{
+ if (iommu->fault_wq)
+ destroy_workqueue(iommu->fault_wq);
+}
+
static void free_iommu(struct intel_iommu *iommu)
{
if (intel_iommu_enabled) {
@@ -1126,6 +1132,7 @@ static void free_iommu(struct intel_iommu *iommu)
free_irq(iommu->irq, iommu);
dmar_free_hwirq(iommu->irq);
iommu->irq = 0;
+ dmar_free_fault_wq(iommu);
}
if (iommu->qi) {
@@ -1554,6 +1561,31 @@ static const char *irq_remap_fault_reasons[] =
"Blocked an interrupt request due to source-id verification failure",
};
+/* fault data and status */
+enum intel_iommu_fault_reason {
+ INTEL_IOMMU_FAULT_REASON_SW,
+ INTEL_IOMMU_FAULT_REASON_ROOT_NOT_PRESENT,
+ INTEL_IOMMU_FAULT_REASON_CONTEXT_NOT_PRESENT,
+ INTEL_IOMMU_FAULT_REASON_CONTEXT_INVALID,
+ INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH,
+ INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS,
+ INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS,
+ INTEL_IOMMU_FAULT_REASON_NEXT_PT_INVALID,
+ INTEL_IOMMU_FAULT_REASON_ROOT_ADDR_INVALID,
+ INTEL_IOMMU_FAULT_REASON_CONTEXT_PTR_INVALID,
+ INTEL_IOMMU_FAULT_REASON_NONE_ZERO_RTP,
+ INTEL_IOMMU_FAULT_REASON_NONE_ZERO_CTP,
+ INTEL_IOMMU_FAULT_REASON_NONE_ZERO_PTE,
+ NR_INTEL_IOMMU_FAULT_REASON,
+};
+
+/* fault reasons that are allowed to be reported outside IOMMU subsystem */
+#define INTEL_IOMMU_FAULT_REASON_ALLOWED \
+ ((1ULL << INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH) | \
+ (1ULL << INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS) | \
+ (1ULL << INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS))
+
+
static const char *dmar_get_fault_reason(u8 fault_reason, int *fault_type)
{
if (fault_reason >= 0x20 && (fault_reason - 0x20 <
@@ -1634,11 +1666,91 @@ void dmar_msi_read(int irq, struct msi_msg *msg)
raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
}
+static enum iommu_fault_reason to_iommu_fault_reason(u8 reason)
+{
+ if (reason >= NR_INTEL_IOMMU_FAULT_REASON) {
+ pr_warn("unknown DMAR fault reason %d\n", reason);
+ return IOMMU_FAULT_REASON_UNKNOWN;
+ }
+ switch (reason) {
+ case INTEL_IOMMU_FAULT_REASON_SW:
+ case INTEL_IOMMU_FAULT_REASON_ROOT_NOT_PRESENT:
+ case INTEL_IOMMU_FAULT_REASON_CONTEXT_NOT_PRESENT:
+ case INTEL_IOMMU_FAULT_REASON_CONTEXT_INVALID:
+ case INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH:
+ case INTEL_IOMMU_FAULT_REASON_ROOT_ADDR_INVALID:
+ case INTEL_IOMMU_FAULT_REASON_CONTEXT_PTR_INVALID:
+ return IOMMU_FAULT_REASON_INTERNAL;
+ case INTEL_IOMMU_FAULT_REASON_NEXT_PT_INVALID:
+ case INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS:
+ case INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS:
+ return IOMMU_FAULT_REASON_PERMISSION;
+ default:
+ return IOMMU_FAULT_REASON_UNKNOWN;
+ }
+}
+
+struct dmar_fault_work {
+ struct work_struct fault_work;
+ struct intel_iommu *iommu;
+ u64 addr;
+ int type;
+ int fault_type;
+ enum intel_iommu_fault_reason reason;
+ u16 sid;
+};
+
+static void report_fault_to_device(struct work_struct *work)
+{
+ struct dmar_fault_work *dfw = container_of(work, struct dmar_fault_work,
+ fault_work);
+ struct iommu_fault_event event;
+ struct pci_dev *pdev;
+ u8 bus, devfn;
+
+ memset(&event, 0, sizeof(struct iommu_fault_event));
+
+ /* check if fault reason is permitted to report outside IOMMU */
+ if (!((1 << dfw->reason) & INTEL_IOMMU_FAULT_REASON_ALLOWED)) {
+ pr_debug("Fault reason %d not allowed to report to device\n",
+ dfw->reason);
+ goto free_work;
+ }
+
+ bus = PCI_BUS_NUM(dfw->sid);
+ devfn = PCI_DEVFN(PCI_SLOT(dfw->sid), PCI_FUNC(dfw->sid));
+ /*
+ * we need to check if the fault reporting is requested for the
+ * offending device.
+ */
+ pdev = pci_get_domain_bus_and_slot(dfw->iommu->segment, bus, devfn);
+ if (!pdev) {
+ pr_warn("No PCI device found for source ID %x\n", dfw->sid);
+ goto free_work;
+ }
+ /*
+ * unrecoverable fault is reported per IOMMU, notifier handler can
+ * resolve PCI device based on source ID.
+ */
+ event.reason = to_iommu_fault_reason(dfw->reason);
+ event.addr = dfw->addr;
+ event.type = IOMMU_FAULT_DMA_UNRECOV;
+ event.prot = dfw->type ? IOMMU_READ : IOMMU_WRITE;
+ dev_warn(&pdev->dev, "report device unrecoverable fault: %d, %x, %d\n",
+ event.reason, dfw->sid, event.type);
+ iommu_report_device_fault(&pdev->dev, &event);
+ pci_dev_put(pdev);
+
+free_work:
+ kfree(dfw);
+}
+
static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
u8 fault_reason, u16 source_id, unsigned long long addr)
{
const char *reason;
int fault_type;
+ struct dmar_fault_work *dfw;
reason = dmar_get_fault_reason(fault_reason, &fault_type);
@@ -1647,11 +1759,29 @@ static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
source_id >> 8, PCI_SLOT(source_id & 0xFF),
PCI_FUNC(source_id & 0xFF), addr >> 48,
fault_reason, reason);
- else
+ else {
pr_err("[%s] Request device [%02x:%02x.%d] fault addr %llx [fault reason %02d] %s\n",
type ? "DMA Read" : "DMA Write",
source_id >> 8, PCI_SLOT(source_id & 0xFF),
PCI_FUNC(source_id & 0xFF), addr, fault_reason, reason);
+ }
+
+ dfw = kmalloc(sizeof(*dfw), GFP_ATOMIC);
+ if (!dfw)
+ return -ENOMEM;
+
+ INIT_WORK(&dfw->fault_work, report_fault_to_device);
+ dfw->addr = addr;
+ dfw->type = type;
+ dfw->fault_type = fault_type;
+ dfw->reason = fault_reason;
+ dfw->sid = source_id;
+ dfw->iommu = iommu;
+ if (!queue_work(iommu->fault_wq, &dfw->fault_work)) {
+ kfree(dfw);
+ return -EBUSY;
+ }
+
return 0;
}
@@ -1731,10 +1861,28 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
return IRQ_HANDLED;
}
-int dmar_set_interrupt(struct intel_iommu *iommu)
+static int dmar_set_fault_wq(struct intel_iommu *iommu)
+{
+ if (iommu->fault_wq)
+ return 0;
+
+ iommu->fault_wq = alloc_ordered_workqueue(iommu->name, 0);
+ if (!iommu->fault_wq)
+ return -ENOMEM;
+
+ return 0;
+}
+
+int dmar_set_interrupt(struct intel_iommu *iommu, bool queue_fault)
{
int irq, ret;
+ /* fault can be reported back to device drivers via a wq */
+ if (queue_fault) {
+ ret = dmar_set_fault_wq(iommu);
+ if (ret)
+ pr_err("Failed to create fault handling workqueue\n");
+ }
/*
* Check if the fault interrupt is already initialized.
*/
@@ -1748,10 +1896,11 @@ int dmar_set_interrupt(struct intel_iommu *iommu)
pr_err("No free IRQ vectors\n");
return -EINVAL;
}
-
ret = request_irq(irq, dmar_fault, IRQF_NO_THREAD, iommu->name, iommu);
- if (ret)
+ if (ret) {
pr_err("Can't request irq\n");
+ dmar_free_fault_wq(iommu);
+ }
return ret;
}
@@ -1765,7 +1914,7 @@ int __init enable_drhd_fault_handling(void)
*/
for_each_iommu(iommu, drhd) {
u32 fault_status;
- int ret = dmar_set_interrupt(iommu);
+ int ret = dmar_set_interrupt(iommu, false);
if (ret) {
pr_err("DRHD %Lx: failed to enable fault, interrupt, ret %d\n",
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index c765448..a6ea67d 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3401,10 +3401,10 @@ static int __init init_dmars(void)
goto free_iommu;
}
#endif
- ret = dmar_set_interrupt(iommu);
+ ret = dmar_set_interrupt(iommu, true);
+
if (ret)
goto free_iommu;
-
if (!translation_pre_enabled(iommu))
iommu_enable_translation(iommu);
@@ -4291,7 +4291,7 @@ static int intel_iommu_add(struct dmar_drhd_unit *dmaru)
goto disable_iommu;
}
#endif
- ret = dmar_set_interrupt(iommu);
+ ret = dmar_set_interrupt(iommu, true);
if (ret)
goto disable_iommu;
diff --git a/include/linux/dmar.h b/include/linux/dmar.h
index e2433bc..21f2162 100644
--- a/include/linux/dmar.h
+++ b/include/linux/dmar.h
@@ -278,7 +278,7 @@ extern void dmar_msi_unmask(struct irq_data *data);
extern void dmar_msi_mask(struct irq_data *data);
extern void dmar_msi_read(int irq, struct msi_msg *msg);
extern void dmar_msi_write(int irq, struct msi_msg *msg);
-extern int dmar_set_interrupt(struct intel_iommu *iommu);
+extern int dmar_set_interrupt(struct intel_iommu *iommu, bool queue_fault);
extern irqreturn_t dmar_fault(int irq, void *dev_id);
extern int dmar_alloc_hwirq(int id, int node, void *arg);
extern void dmar_free_hwirq(int irq);
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index c54bce1..dbe8c93 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -472,6 +472,7 @@ struct intel_iommu {
struct iommu_device iommu; /* IOMMU core code handle */
int node;
u32 flags; /* Software defined flags */
+ struct workqueue_struct *fault_wq; /* Reporting IOMMU fault to device */
};
/* PCI domain-device relationship */
--
2.7.4
When IO page faults are reported outside the IOMMU subsystem, the page
request handler may fail for various reasons, e.g. a guest received
page requests but did not get a chance to run for a long time. The
unresponsive behavior could hold off limited resources on the pending
device.
There can be hardware or credit-based software solutions, as suggested
in PCI ATS Ch-4. To provide a basic safety net, this patch
introduces a per-device deferrable timer which monitors the longest
pending page fault that requires a response. Proper action, such as
sending a failure response code, could be taken when the timer expires,
but is not included in this patch. We need to consider the life cycle of
page group IDs to prevent confusion with group IDs reused by a device.
For now, a warning message provides a clue of such failures.
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Ashok Raj <[email protected]>
---
drivers/iommu/iommu.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++--
include/linux/iommu.h | 4 ++++
2 files changed, 62 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 628346c..f6512692 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -799,6 +799,39 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
}
EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
+/* Max time to wait for a pending page request */
+#define IOMMU_PAGE_RESPONSE_MAXTIME (HZ * 10)
+static void iommu_dev_fault_timer_fn(struct timer_list *t)
+{
+ struct iommu_fault_param *fparam = from_timer(fparam, t, timer);
+ struct iommu_fault_event *evt, *iter;
+
+ u64 now;
+
+ now = get_jiffies_64();
+
+ /* The goal is to ensure that the driver or guest page fault handler
+ * (via VFIO) sends the page response on time. Otherwise, limited queue
+ * resources may be occupied by unresponsive guests or drivers.
+ * When the per-device pending fault list is not empty, we periodically check
+ * if any anticipated page response time has expired.
+ *
+ * TODO:
+ * We could do the following if response time expires:
+ * 1. send page response code FAILURE to all pending PRQ
+ * 2. inform device driver or vfio
+ * 3. drain in-flight page requests and responses for this device
+ * 4. clear pending fault list such that driver can unregister fault
+ * handler(otherwise blocked when pending faults are present).
+ */
+ list_for_each_entry_safe(evt, iter, &fparam->faults, list) {
+ if (time_after64(evt->expire, now))
+ pr_err("Page response time expired!, pasid %d gid %d exp %llu now %llu\n",
+ evt->pasid, evt->page_req_group_id, evt->expire, now);
+ }
+ mod_timer(t, now + IOMMU_PAGE_RESPONSE_MAXTIME);
+}
+
/**
* iommu_register_device_fault_handler() - Register a device fault handler
* @dev: the device
@@ -806,8 +839,8 @@ EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
* @data: private data passed as argument to the handler
*
* When an IOMMU fault event is received, call this handler with the fault event
- * and data as argument. The handler should return 0. If the fault is
- * recoverable (IOMMU_FAULT_PAGE_REQ), the handler must also complete
+ * and data as argument. The handler should return 0 on success. If the fault is
+ * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also complete
* the fault by calling iommu_page_response() with one of the following
* response code:
* - IOMMU_PAGE_RESP_SUCCESS: retry the translation
@@ -848,6 +881,9 @@ int iommu_register_device_fault_handler(struct device *dev,
param->fault_param->data = data;
INIT_LIST_HEAD(¶m->fault_param->faults);
+ timer_setup(¶m->fault_param->timer, iommu_dev_fault_timer_fn,
+ TIMER_DEFERRABLE);
+
mutex_unlock(¶m->lock);
return 0;
@@ -905,6 +941,8 @@ int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
{
int ret = 0;
struct iommu_fault_event *evt_pending;
+ struct timer_list *tmr;
+ u64 exp;
struct iommu_fault_param *fparam;
/* iommu_param is allocated when device is added to group */
@@ -925,6 +963,17 @@ int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
goto done_unlock;
}
memcpy(evt_pending, evt, sizeof(struct iommu_fault_event));
+ /* Keep track of response expiration time */
+ exp = get_jiffies_64() + IOMMU_PAGE_RESPONSE_MAXTIME;
+ evt_pending->expire = exp;
+
+ if (list_empty(&fparam->faults)) {
+ /* First pending event, start timer */
+ tmr = &dev->iommu_param->fault_param->timer;
+ WARN_ON(timer_pending(tmr));
+ mod_timer(tmr, exp);
+ }
+
mutex_lock(&fparam->lock);
list_add_tail(&evt_pending->list, &fparam->faults);
mutex_unlock(&fparam->lock);
@@ -1542,6 +1591,13 @@ int iommu_page_response(struct device *dev,
}
}
+ /* stop response timer if no more pending request */
+ if (list_empty(¶m->fault_param->faults) &&
+ timer_pending(¶m->fault_param->timer)) {
+ pr_debug("no pending PRQ, stop timer\n");
+ del_timer(¶m->fault_param->timer);
+ }
+
done_unlock:
mutex_unlock(¶m->fault_param->lock);
return ret;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 058b552..40088d6 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -375,6 +375,7 @@ enum iommu_fault_reason {
* @iommu_private: used by the IOMMU driver for storing fault-specific
* data. Users should not modify this field before
* sending the fault response.
+ * @expire: time limit in jiffies will wait for page response
*/
struct iommu_fault_event {
struct list_head list;
@@ -388,6 +389,7 @@ struct iommu_fault_event {
u32 prot;
u64 device_private;
u64 iommu_private;
+ u64 expire;
};
/**
@@ -395,11 +397,13 @@ struct iommu_fault_event {
* @dev_fault_handler: Callback function to handle IOMMU faults at device level
* @data: handler private data
* @faults: holds the pending faults which needs response, e.g. page response.
+ * @timer: track page request pending time limit
* @lock: protect pending PRQ event list
*/
struct iommu_fault_param {
iommu_dev_fault_handler_t handler;
struct list_head faults;
+ struct timer_list timer;
struct mutex lock;
void *data;
};
--
2.7.4
Intel VT-d interrupts come from both IRQ remapping and DMA remapping.
In order to report non-recoverable faults back to the device driver, we
need access to the IOMMU fault reporting APIs. This patch adds a
build dependency to the DMAR code, where fault IRQ handlers can
selectively report faults.
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 73590ba..8d8b63f 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -139,6 +139,7 @@ config AMD_IOMMU_V2
# Intel IOMMU support
config DMAR_TABLE
bool
+ select IOMMU_API
config INTEL_IOMMU
bool "Support for Intel IOMMU using DMA Remapping Devices"
--
2.7.4
If the source device of a page request has its PASID table pointer
bound to a guest, the first level page tables are owned by the guest.
In this case, we shall let the guest OS manage the page fault.
This patch uses the IOMMU fault reporting API to send fault events,
possibly via VFIO, to the guest OS. Once guest pages are faulted in, the
guest will issue a page response, which will be passed down via the
passdown APIs.
Recoverable fault reporting, such as page requests, is not limited to
guest use; in-kernel drivers can also request to receive fault
notifications.
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Ashok Raj <[email protected]>
---
drivers/iommu/intel-svm.c | 73 ++++++++++++++++++++++++++++++++++++++++-------
include/linux/iommu.h | 1 +
2 files changed, 64 insertions(+), 10 deletions(-)
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index e8cd984..a8186f8 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -577,6 +577,58 @@ static bool is_canonical_address(u64 addr)
return (((saddr << shift) >> shift) == saddr);
}
+static int prq_to_iommu_prot(struct page_req_dsc *req)
+{
+ int prot = 0;
+
+ if (req->rd_req)
+ prot |= IOMMU_FAULT_READ;
+ if (req->wr_req)
+ prot |= IOMMU_FAULT_WRITE;
+ if (req->exe_req)
+ prot |= IOMMU_FAULT_EXEC;
+ if (req->priv_req)
+ prot |= IOMMU_FAULT_PRIV;
+
+ return prot;
+}
+
+static int intel_svm_prq_report(struct intel_iommu *iommu,
+ struct page_req_dsc *desc)
+{
+ int ret = 0;
+ struct iommu_fault_event event;
+ struct pci_dev *pdev;
+
+ memset(&event, 0, sizeof(struct iommu_fault_event));
+ pdev = pci_get_domain_bus_and_slot(iommu->segment,
+ desc->bus, desc->devfn);
+ if (!pdev) {
+ pr_err("No PCI device found for PRQ [%02x:%02x.%d]\n",
+ desc->bus, PCI_SLOT(desc->devfn),
+ PCI_FUNC(desc->devfn));
+ return -ENODEV;
+ }
+
+ /* Fill in event data for device specific processing */
+ event.type = IOMMU_FAULT_PAGE_REQ;
+ event.addr = (u64)desc->addr << VTD_PAGE_SHIFT;
+ event.pasid = desc->pasid;
+ event.page_req_group_id = desc->prg_index;
+ event.prot = prq_to_iommu_prot(desc);
+ event.last_req = desc->lpig;
+ event.pasid_valid = 1;
+ /* keep track of PRQ so that when the response comes back, we know
+ * whether we do group response or stream response. SRR[0] and
+ * private[54:32] bits in the descriptor are stored.
+ */
+ event.iommu_private = *(u64 *)desc;
+ ret = iommu_report_device_fault(&pdev->dev, &event);
+ pci_dev_put(pdev);
+
+ return ret;
+}
+
static irqreturn_t prq_event_thread(int irq, void *d)
{
struct intel_iommu *iommu = d;
@@ -625,6 +677,16 @@ static irqreturn_t prq_event_thread(int irq, void *d)
goto no_pasid;
}
}
+ /* If address is not canonical, return invalid response */
+ if (!is_canonical_address(address))
+ goto bad_req;
+
+ /*
+ * If prq is to be handled outside iommu driver via receiver of
+ * the fault notifiers, we skip the page response here.
+ */
+ if (!intel_svm_prq_report(iommu, req))
+ goto prq_advance;
result = QI_RESP_INVALID;
/* Since we're using init_mm.pgd directly, we should never take
@@ -635,9 +697,6 @@ static irqreturn_t prq_event_thread(int irq, void *d)
if (!mmget_not_zero(svm->mm))
goto bad_req;
- /* If address is not canonical, return invalid response */
- if (!is_canonical_address(address))
- goto bad_req;
down_read(&svm->mm->mmap_sem);
vma = find_extend_vma(svm->mm, address);
@@ -670,12 +729,6 @@ static irqreturn_t prq_event_thread(int irq, void *d)
if (WARN_ON(&sdev->list == &svm->devs))
sdev = NULL;
-
- if (sdev && sdev->ops && sdev->ops->fault_cb) {
- int rwxp = (req->rd_req << 3) | (req->wr_req << 2) |
- (req->exe_req << 1) | (req->priv_req);
- sdev->ops->fault_cb(sdev->dev, req->pasid, req->addr, req->private, rwxp, result);
- }
/* We get here in the error case where the PASID lookup failed,
and these can be NULL. Do not use them below this point! */
sdev = NULL;
@@ -701,7 +754,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
qi_submit_sync(&resp, iommu);
}
-
+ prq_advance:
head = (head + sizeof(*req)) & PRQ_RING_MASK;
}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 40088d6..0933f72 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -42,6 +42,7 @@
* if the IOMMU page table format is equivalent.
*/
#define IOMMU_PRIV (1 << 5)
+#define IOMMU_EXEC (1 << 6)
struct iommu_ops;
struct iommu_group;
--
2.7.4
IO page faults can be handled outside the IOMMU subsystem. For example,
when nested translation is turned on and the guest owns the
first level page tables, a device page request can be forwarded
to the guest for handling. When the page response is returned
by the guest, the IOMMU driver on the host needs to process the
response, which informs the device and completes the page request
transaction.
This patch introduces a generic API function for page response
passing from the guest or other in-kernel users. The definition of
the generic data is based on the PCI ATS specification and is not limited
to any vendor.
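As a hedged illustration of the intended flow, not part of this patch, a
pass-through layer such as VFIO could translate a response trapped from
the guest into the generic message; struct guest_prq_response and
forward_guest_page_response() below are invented names for the sketch.

#include <linux/iommu.h>
#include <linux/types.h>

/* Hypothetical representation of a page response trapped from the guest */
struct guest_prq_response {
	u64 addr;
	u32 pasid;
	u32 grpid;
	bool success;
};

static int forward_guest_page_response(struct device *dev,
				       struct guest_prq_response *gresp)
{
	struct page_response_msg msg = {
		.addr			= gresp->addr,
		.pasid			= gresp->pasid,
		.pasid_present		= 1,
		.page_req_group_id	= gresp->grpid,
		.resp_code		= gresp->success ?
					  IOMMU_PAGE_RESP_SUCCESS :
					  IOMMU_PAGE_RESP_INVALID,
	};

	/*
	 * private_data need not be set by the caller; the IOMMU core copies
	 * it from the matching pending fault before calling the model
	 * specific page_response op.
	 */
	return iommu_page_response(dev, &msg);
}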
Signed-off-by: Jean-Philippe Brucker <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
Link: https://lkml.org/lkml/2017/12/7/1725
---
drivers/iommu/iommu.c | 45 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/iommu.h | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 97 insertions(+)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index de19c33..628346c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1503,6 +1503,51 @@ int iommu_sva_invalidate(struct iommu_domain *domain,
}
EXPORT_SYMBOL_GPL(iommu_sva_invalidate);
+int iommu_page_response(struct device *dev,
+ struct page_response_msg *msg)
+{
+ struct iommu_param *param = dev->iommu_param;
+ int ret = -EINVAL;
+ struct iommu_fault_event *evt, *iter;
+ struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+
+ if (!domain || !domain->ops->page_response)
+ return -ENODEV;
+
+ /*
+ * Device iommu_param should have been allocated when device is
+ * added to its iommu_group.
+ */
+ if (!param || !param->fault_param)
+ return -EINVAL;
+
+ /* Only send response if there is a fault report pending */
+ mutex_lock(&param->fault_param->lock);
+ if (list_empty(&param->fault_param->faults)) {
+ pr_warn("no pending PRQ, drop response\n");
+ goto done_unlock;
+ }
+ /*
+ * Check if we have a matching page request pending to respond,
+ * otherwise return -EINVAL
+ */
+ list_for_each_entry_safe(evt, iter, &param->fault_param->faults, list) {
+ if (evt->pasid == msg->pasid &&
+ msg->page_req_group_id == evt->page_req_group_id) {
+ msg->private_data = evt->iommu_private;
+ ret = domain->ops->page_response(dev, msg);
+ list_del(&evt->list);
+ kfree(evt);
+ break;
+ }
+ }
+
+done_unlock:
+ mutex_unlock(&param->fault_param->lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_page_response);
+
static void __iommu_detach_device(struct iommu_domain *domain,
struct device *dev)
{
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32435f9..058b552 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -163,6 +163,55 @@ struct iommu_resv_region {
#ifdef CONFIG_IOMMU_API
/**
+ * enum page_response_code - Return status of fault handlers, telling the IOMMU
+ * driver how to proceed with the fault.
+ *
+ * @IOMMU_PAGE_RESP_SUCCESS: Fault has been handled and the page tables
+ * populated, retry the access. This is "Success" in PCI PRI.
+ * @IOMMU_PAGE_RESP_FAILURE: General error. Drop all subsequent faults from
+ * this device if possible. This is "Response Failure" in PCI PRI.
+ * @IOMMU_PAGE_RESP_INVALID: Could not handle this fault, don't retry the
+ * access. This is "Invalid Request" in PCI PRI.
+ */
+enum page_response_code {
+ IOMMU_PAGE_RESP_SUCCESS = 0,
+ IOMMU_PAGE_RESP_INVALID,
+ IOMMU_PAGE_RESP_FAILURE,
+};
+
+/**
+ * enum page_request_handle_t - Return page request/response handler status
+ *
+ * @IOMMU_PAGE_RESP_HANDLED: Stop processing the fault, and do not send a
+ * reply to the device.
+ * @IOMMU_PAGE_RESP_CONTINUE: Fault was not handled. Call the next handler,
+ * or terminate.
+ */
+enum page_request_handle_t {
+ IOMMU_PAGE_RESP_HANDLED = 0,
+ IOMMU_PAGE_RESP_CONTINUE,
+};
+
+/**
+ * Generic page response information based on PCI ATS and PASID spec.
+ * @addr: servicing page address
+ * @pasid: contains process address space ID
+ * @resp_code: response code
+ * @page_req_group_id: page request group index
+ * @pasid_present: the @pasid field is valid
+ * @private_data: uniquely identify device-specific private data for an
+ * individual page response
+ */
+struct page_response_msg {
+ u64 addr;
+ u32 pasid;
+ enum page_response_code resp_code;
+ u32 pasid_present:1;
+ u32 page_req_group_id;
+ u64 private_data;
+};
+
+/**
* struct iommu_ops - iommu ops and capabilities
* @capable: check capability
* @domain_alloc: allocate iommu domain
@@ -195,6 +244,7 @@ struct iommu_resv_region {
* @bind_pasid_table: bind pasid table pointer for guest SVM
* @unbind_pasid_table: unbind pasid table pointer and restore defaults
* @sva_invalidate: invalidate translation caches of shared virtual address
+ * @page_response: handle page request response
*/
struct iommu_ops {
bool (*capable)(enum iommu_cap);
@@ -250,6 +300,7 @@ struct iommu_ops {
struct device *dev);
int (*sva_invalidate)(struct iommu_domain *domain,
struct device *dev, struct tlb_invalidate_info *inv_info);
+ int (*page_response)(struct device *dev, struct page_response_msg *msg);
unsigned long pgsize_bitmap;
};
@@ -471,6 +522,7 @@ extern int iommu_unregister_device_fault_handler(struct device *dev);
extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
+extern int iommu_page_response(struct device *dev, struct page_response_msg *msg);
extern int iommu_group_id(struct iommu_group *group);
extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
--
2.7.4
Add Intel VT-d ops to the generic iommu_bind_pasid_table API
functions.
The primary use case is direct assignment of an SVM capable
device. The request originates from the emulated IOMMU in the guest and
goes through many layers (e.g. VFIO). Upon calling the host IOMMU driver,
the caller passes the guest PASID table pointer (GPA) and size.
The device context table entry is modified by the Intel IOMMU specific
bind_pasid_table function. This turns on nesting mode and the matching
translation type.
The unbind operation restores the default context mapping.
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Liu, Yi L <[email protected]>
Signed-off-by: Ashok Raj <[email protected]>
---
drivers/iommu/intel-iommu.c | 119 ++++++++++++++++++++++++++++++++++++++++++
include/linux/dma_remapping.h | 1 +
2 files changed, 120 insertions(+)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index a0f81a4..d8058be 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2409,6 +2409,7 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu,
info->ats_supported = info->pasid_supported = info->pri_supported = 0;
info->ats_enabled = info->pasid_enabled = info->pri_enabled = 0;
info->ats_qdep = 0;
+ info->pasid_table_bound = 0;
info->dev = dev;
info->domain = domain;
info->iommu = iommu;
@@ -5132,6 +5133,7 @@ static void intel_iommu_put_resv_regions(struct device *dev,
#ifdef CONFIG_INTEL_IOMMU_SVM
#define MAX_NR_PASID_BITS (20)
+#define MIN_NR_PASID_BITS (5)
static inline unsigned long intel_iommu_get_pts(struct intel_iommu *iommu)
{
/*
@@ -5258,6 +5260,119 @@ struct intel_iommu *intel_svm_device_to_iommu(struct device *dev)
return iommu;
}
+
+static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
+ struct device *dev, struct pasid_table_config *pasidt_binfo)
+{
+ struct intel_iommu *iommu;
+ struct context_entry *context;
+ struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+ struct device_domain_info *info;
+ struct pci_dev *pdev;
+ u8 bus, devfn, host_table_pasid_bits;
+ u16 did, sid;
+ int ret = 0;
+ unsigned long flags;
+ u64 ctx_lo;
+
+ iommu = device_to_iommu(dev, &bus, &devfn);
+ if (!iommu)
+ return -ENODEV;
+ /* VT-d spec section 9.4 says pasid table size is encoded as 2^(x+5) */
+ host_table_pasid_bits = intel_iommu_get_pts(iommu) + MIN_NR_PASID_BITS;
+ if (!pasidt_binfo || pasidt_binfo->pasid_bits > host_table_pasid_bits ||
+ pasidt_binfo->pasid_bits < MIN_NR_PASID_BITS) {
+ pr_err("Invalid gPASID bits %d, host range %d - %d\n",
+ pasidt_binfo->pasid_bits,
+ MIN_NR_PASID_BITS, host_table_pasid_bits);
+ return -ERANGE;
+ }
+ if (!ecap_nest(iommu->ecap)) {
+ dev_err(dev, "Cannot bind PASID table, no nested translation\n");
+ ret = -EINVAL;
+ goto out;
+ }
+ pdev = to_pci_dev(dev);
+ sid = PCI_DEVID(bus, devfn);
+ info = dev->archdata.iommu;
+
+ if (!info) {
+ dev_err(dev, "Invalid device domain info\n");
+ ret = -EINVAL;
+ goto out;
+ }
+ if (info->pasid_table_bound) {
+ dev_err(dev, "Device PASID table already bound\n");
+ ret = -EBUSY;
+ goto out;
+ }
+ if (!info->pasid_enabled) {
+ ret = pci_enable_pasid(pdev, info->pasid_supported & ~1);
+ if (ret) {
+ dev_err(dev, "Failed to enable PASID\n");
+ goto out;
+ }
+ }
+ spin_lock_irqsave(&iommu->lock, flags);
+ context = iommu_context_addr(iommu, bus, devfn, 0);
+ if (!context_present(context)) {
+ dev_err(dev, "Context not present\n");
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ /* Anticipate guest to use SVM and owns the first level, so we turn
+ * nested mode on
+ */
+ ctx_lo = context[0].lo;
+ ctx_lo |= CONTEXT_NESTE | CONTEXT_PRS | CONTEXT_PASIDE;
+ ctx_lo &= ~CONTEXT_TT_MASK;
+ ctx_lo |= CONTEXT_TT_DEV_IOTLB << 2;
+ context[0].lo = ctx_lo;
+
+ /* Assign guest PASID table pointer and size order */
+ ctx_lo = (pasidt_binfo->base_ptr & VTD_PAGE_MASK) |
+ (pasidt_binfo->pasid_bits - MIN_NR_PASID_BITS);
+ context[1].lo = ctx_lo;
+ /* make sure context entry is updated before flushing */
+ wmb();
+ did = dmar_domain->iommu_did[iommu->seq_id];
+ iommu->flush.flush_context(iommu, did,
+ (((u16)bus) << 8) | devfn,
+ DMA_CCMD_MASK_NOBIT,
+ DMA_CCMD_DEVICE_INVL);
+ iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
+ info->pasid_table_bound = 1;
+out_unlock:
+ spin_unlock_irqrestore(&iommu->lock, flags);
+out:
+ return ret;
+}
+
+static void intel_iommu_unbind_pasid_table(struct iommu_domain *domain,
+ struct device *dev)
+{
+ struct intel_iommu *iommu;
+ struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+ struct device_domain_info *info;
+ u8 bus, devfn;
+
+ info = dev->archdata.iommu;
+ if (!info) {
+ dev_err(dev, "Invalid device domain info\n");
+ return;
+ }
+ iommu = device_to_iommu(dev, &bus, &devfn);
+ if (!iommu) {
+ dev_err(dev, "No IOMMU for device to unbind PASID table\n");
+ return;
+ }
+
+ domain_context_clear(iommu, dev);
+
+ domain_context_mapping_one(dmar_domain, iommu, bus, devfn);
+ info->pasid_table_bound = 0;
+}
#endif /* CONFIG_INTEL_IOMMU_SVM */
const struct iommu_ops intel_iommu_ops = {
@@ -5266,6 +5381,10 @@ const struct iommu_ops intel_iommu_ops = {
.domain_free = intel_iommu_domain_free,
.attach_dev = intel_iommu_attach_device,
.detach_dev = intel_iommu_detach_device,
+#ifdef CONFIG_INTEL_IOMMU_SVM
+ .bind_pasid_table = intel_iommu_bind_pasid_table,
+ .unbind_pasid_table = intel_iommu_unbind_pasid_table,
+#endif
.map = intel_iommu_map,
.unmap = intel_iommu_unmap,
.map_sg = default_iommu_map_sg,
diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h
index 21b3e7d..db290b2 100644
--- a/include/linux/dma_remapping.h
+++ b/include/linux/dma_remapping.h
@@ -28,6 +28,7 @@
#define CONTEXT_DINVE (1ULL << 8)
#define CONTEXT_PRS (1ULL << 9)
+#define CONTEXT_NESTE (1ULL << 10)
#define CONTEXT_PASIDE (1ULL << 11)
struct intel_iommu;
--
2.7.4
From: "Liu, Yi L" <[email protected]>
When an SVM capable device is assigned to a guest, the first level page
tables are owned by the guest and the guest PASID table pointer is
linked to the device context entry of the physical IOMMU.
The host IOMMU driver has no knowledge of caching structure updates unless
the guest invalidation activities are passed down to the host. The
primary usage is derived from emulated IOMMU in the guest, where QEMU
can trap invalidation activities before passing them down to the
host/physical IOMMU.
Since the invalidation data are obtained from user space and will be
written into the physical IOMMU, we must allow security checks at various
layers. Therefore, a generic invalidation data format is proposed here;
model specific IOMMU drivers need to convert it into their own format.
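As a usage sketch only, a passdown path (e.g. VFIO relaying a guest IOTLB
flush) could fill the generic data as below; the wrapper function and its
arguments are assumptions, while the structures, flags and
iommu_sva_invalidate() are the interfaces added by this series.

#include <linux/iommu.h>

static int passdown_flush_one_page(struct iommu_domain *domain,
				   struct device *dev, u64 addr, u32 pasid)
{
	struct tlb_invalidate_info inv_info = {
		.hdr = {
			.version	= TLB_INV_HDR_VERSION_1,
			.type		= IOMMU_INV_TYPE_TLB,
		},
		.granularity	= IOMMU_INV_GRANU_PAGE_PASID,
		.flags		= IOMMU_INVALIDATE_PASID_TAGGED,
		.size		= 0,		/* 2^0 = one 4K page */
		.pasid		= pasid,
		.addr		= addr,
	};

	return iommu_sva_invalidate(domain, dev, &inv_info);
}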
Signed-off-by: Liu, Yi L <[email protected]>
Signed-off-by: Jean-Philippe Brucker <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Ashok Raj <[email protected]>
---
drivers/iommu/iommu.c | 14 ++++++++
include/linux/iommu.h | 12 +++++++
include/uapi/linux/iommu.h | 79 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 105 insertions(+)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3a69620..784e019 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1344,6 +1344,20 @@ void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
}
EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
+int iommu_sva_invalidate(struct iommu_domain *domain,
+ struct device *dev, struct tlb_invalidate_info *inv_info)
+{
+ int ret = 0;
+
+ if (unlikely(!domain->ops->sva_invalidate))
+ return -ENODEV;
+
+ ret = domain->ops->sva_invalidate(domain, dev, inv_info);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_sva_invalidate);
+
static void __iommu_detach_device(struct iommu_domain *domain,
struct device *dev)
{
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 8ad111f..e963dbd 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -190,6 +190,7 @@ struct iommu_resv_region {
* @pgsize_bitmap: bitmap of all possible supported page sizes
* @bind_pasid_table: bind pasid table pointer for guest SVM
* @unbind_pasid_table: unbind pasid table pointer and restore defaults
+ * @sva_invalidate: invalidate translation caches of shared virtual address
*/
struct iommu_ops {
bool (*capable)(enum iommu_cap);
@@ -243,6 +244,8 @@ struct iommu_ops {
struct pasid_table_config *pasidt_binfo);
void (*unbind_pasid_table)(struct iommu_domain *domain,
struct device *dev);
+ int (*sva_invalidate)(struct iommu_domain *domain,
+ struct device *dev, struct tlb_invalidate_info *inv_info);
unsigned long pgsize_bitmap;
};
@@ -309,6 +312,9 @@ extern int iommu_bind_pasid_table(struct iommu_domain *domain,
struct device *dev, struct pasid_table_config *pasidt_binfo);
extern void iommu_unbind_pasid_table(struct iommu_domain *domain,
struct device *dev);
+extern int iommu_sva_invalidate(struct iommu_domain *domain,
+ struct device *dev, struct tlb_invalidate_info *inv_info);
+
extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
phys_addr_t paddr, size_t size, int prot);
@@ -720,6 +726,12 @@ void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
{
}
+static inline int iommu_sva_invalidate(struct iommu_domain *domain,
+ struct device *dev, struct tlb_invalidate_info *inv_info)
+{
+ return -EINVAL;
+}
+
#endif /* CONFIG_IOMMU_API */
#endif /* __LINUX_IOMMU_H */
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 9f7a6bf..4447943 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -29,4 +29,83 @@ struct pasid_table_config {
__u8 pasid_bits;
};
+/**
+ * enum iommu_inv_granularity - Generic invalidation granularity
+ *
+ * When an invalidation request is sent to IOMMU to flush translation caches,
+ * it may carry different granularities. These granularity levels are shared
+ * across all types of translation caches, though not every level applies to
+ * every cache type. For example, PASID selective granularity is only
+ * applicable to PASID cache invalidation.
+ * This enum is a collection of granularities for all types of translation
+ * caches. The idea is to make it easy for an IOMMU model specific driver to
+ * convert from generic to model specific values.
+ */
+enum iommu_inv_granularity {
+ IOMMU_INV_GRANU_DOMAIN = 1, /* all TLBs associated with a domain */
+ IOMMU_INV_GRANU_DEVICE, /* caching structure associated with a
+ * device ID
+ */
+ IOMMU_INV_GRANU_DOMAIN_PAGE, /* address range with a domain */
+ IOMMU_INV_GRANU_ALL_PASID, /* cache of a given PASID */
+ IOMMU_INV_GRANU_PASID_SEL, /* only invalidate specified PASID */
+
+ IOMMU_INV_GRANU_NG_ALL_PASID, /* non-global within all PASIDs */
+ IOMMU_INV_GRANU_NG_PASID, /* non-global within a PASIDs */
+ IOMMU_INV_GRANU_PAGE_PASID, /* page-selective within a PASID */
+ IOMMU_INV_NR_GRANU,
+};
+
+/** enum iommu_inv_type - Generic translation cache types for invalidation
+ *
+ * Invalidation requests sent to IOMMU may indicate which translation cache
+ * to be operated on.
+ * Combined with enum iommu_inv_granularity, model specific driver can do a
+ * simple lookup to convert generic type to model specific value.
+ */
+enum iommu_inv_type {
+ IOMMU_INV_TYPE_DTLB, /* device IOTLB */
+ IOMMU_INV_TYPE_TLB, /* IOMMU paging structure cache */
+ IOMMU_INV_TYPE_PASID, /* PASID cache */
+ IOMMU_INV_TYPE_CONTEXT, /* device context entry cache */
+ IOMMU_INV_NR_TYPE
+};
+
+/**
+ * Translation cache invalidation header that contains mandatory meta data.
+ * @version: info format version, expecting future extensions
+ * @type: type of translation cache to be invalidated
+ */
+struct tlb_invalidate_hdr {
+ __u32 version;
+#define TLB_INV_HDR_VERSION_1 1
+ enum iommu_inv_type type;
+};
+
+/**
+ * Translation cache invalidation information, contains generic IOMMU
+ * data which can be parsed based on model ID by model specific drivers.
+ *
+ * @granularity: requested invalidation granularity, type dependent
+ * @size: 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
+ * @pasid: processor address space ID value per PCI spec.
+ * @addr: page address to be invalidated
+ * @flags IOMMU_INVALIDATE_PASID_TAGGED: DMA with PASID tagged,
+ * @pasid validity can be
+ * deduced from @granularity
+ * IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries
+ * IOMMU_INVALIDATE_GLOBAL_PAGE: global pages
+ *
+ */
+struct tlb_invalidate_info {
+ struct tlb_invalidate_hdr hdr;
+ enum iommu_inv_granularity granularity;
+ __u32 flags;
+#define IOMMU_INVALIDATE_NO_PASID (1 << 0)
+#define IOMMU_INVALIDATE_ADDR_LEAF (1 << 1)
+#define IOMMU_INVALIDATE_GLOBAL_PAGE (1 << 2)
+#define IOMMU_INVALIDATE_PASID_TAGGED (1 << 3)
+ __u8 size;
+ __u32 pasid;
+ __u64 addr;
+};
#endif /* _UAPI_IOMMU_H */
--
2.7.4
DMA faults can be detected by the IOMMU at the device level. Adding a pointer
to struct device allows the IOMMU subsystem to report relevant faults
back to the device driver for further handling.
For directly assigned devices (or user space drivers), the guest OS holds
responsibility to handle and respond to per device IOMMU faults.
Therefore we need a fault reporting mechanism to propagate faults beyond
the IOMMU subsystem.
There are two other IOMMU data pointers under struct device today; here
we introduce iommu_param as a parent pointer such that all device IOMMU
data can be consolidated there. The idea was suggested by Greg KH and Joerg.
The name iommu_param is chosen since iommu_data has already been used.
Suggested-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Jean-Philippe Brucker <[email protected]>
Link: https://lkml.org/lkml/2017/10/6/81
---
include/linux/device.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/linux/device.h b/include/linux/device.h
index 0059b99..7c79e4e 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -41,6 +41,7 @@ struct iommu_ops;
struct iommu_group;
struct iommu_fwspec;
struct dev_pin_info;
+struct iommu_param;
struct bus_attribute {
struct attribute attr;
@@ -897,6 +898,7 @@ struct dev_links_info {
* device (i.e. the bus driver that discovered the device).
* @iommu_group: IOMMU group the device belongs to.
* @iommu_fwspec: IOMMU-specific properties supplied by firmware.
+ * @iommu_param: Per device generic IOMMU runtime data
*
* @offline_disabled: If set, the device is permanently online.
* @offline: Set after successful invocation of bus type's .offline().
@@ -986,6 +988,7 @@ struct device {
void (*release)(struct device *dev);
struct iommu_group *iommu_group;
struct iommu_fwspec *iommu_fwspec;
+ struct iommu_param *iommu_param;
bool offline_disabled:1;
bool offline:1;
--
2.7.4
Add a flag in the device domain info to track whether a guest or
user PASID table is bound to a device.
Signed-off-by: Jacob Pan <[email protected]>
---
include/linux/intel-iommu.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 304afae..ddc7d79 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -473,6 +473,7 @@ struct device_domain_info {
u8 pri_enabled:1;
u8 ats_supported:1;
u8 ats_enabled:1;
+ u8 pasid_table_bound:1;
u8 ats_qdep;
u64 fault_mask; /* selected IOMMU faults to be reported */
struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
--
2.7.4
Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
use in the guest:
https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
As part of the proposed architecture, when an SVM capable PCI
device is assigned to a guest, nested mode is turned on. The guest owns the
first level page tables (requests with PASID), which perform GVA->GPA
translation. Second level page tables are owned by the host for GPA->HPA
translation, for requests both with and without PASID.
A new IOMMU driver interface is therefore needed to perform the following
tasks:
* Enable nested translation and the appropriate translation type
* Assign the guest PASID table pointer (in GPA) and size to the host IOMMU
This patch introduces new API functions to perform bind/unbind of guest PASID
tables. Based on common data, model specific IOMMU drivers can be extended
to perform the specific steps for binding the PASID table of assigned devices.
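A hedged sketch of the expected caller flow (e.g. VFIO acting on a request
from the emulated vIOMMU) follows; the wrapper functions and their arguments
are illustrative, while pasid_table_config and the bind/unbind calls are the
interfaces added here.

#include <linux/iommu.h>

static int bind_guest_pasid_table(struct iommu_domain *domain,
				  struct device *dev, u64 gpa_base, u8 bits)
{
	struct pasid_table_config pasidt_binfo = {
		.version	= PASID_TABLE_CFG_VERSION,
		.bytes		= sizeof(pasidt_binfo),
		.base_ptr	= gpa_base,	/* guest physical address */
		.pasid_bits	= bits,
	};

	return iommu_bind_pasid_table(domain, dev, &pasidt_binfo);
}

static void unbind_guest_pasid_table(struct iommu_domain *domain,
				     struct device *dev)
{
	/* restores the default context mapping for the device */
	iommu_unbind_pasid_table(domain, dev);
}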
Signed-off-by: Jean-Philippe Brucker <[email protected]>
Signed-off-by: Liu, Yi L <[email protected]>
Signed-off-by: Ashok Raj <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/iommu.c | 19 +++++++++++++++++++
include/linux/iommu.h | 24 ++++++++++++++++++++++++
include/uapi/linux/iommu.h | 32 ++++++++++++++++++++++++++++++++
3 files changed, 75 insertions(+)
create mode 100644 include/uapi/linux/iommu.h
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d2aa2320..3a69620 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1325,6 +1325,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
}
EXPORT_SYMBOL_GPL(iommu_attach_device);
+int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
+ struct pasid_table_config *pasidt_binfo)
+{
+ if (unlikely(!domain->ops->bind_pasid_table))
+ return -ENODEV;
+
+ return domain->ops->bind_pasid_table(domain, dev, pasidt_binfo);
+}
+EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
+
+void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
+{
+ if (unlikely(!domain->ops->unbind_pasid_table))
+ return;
+
+ domain->ops->unbind_pasid_table(domain, dev);
+}
+EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
+
static void __iommu_detach_device(struct iommu_domain *domain,
struct device *dev)
{
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 19938ee..8ad111f 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -25,6 +25,7 @@
#include <linux/errno.h>
#include <linux/err.h>
#include <linux/of.h>
+#include <uapi/linux/iommu.h>
#define IOMMU_READ (1 << 0)
#define IOMMU_WRITE (1 << 1)
@@ -187,6 +188,8 @@ struct iommu_resv_region {
* @domain_get_windows: Return the number of windows for a domain
* @of_xlate: add OF master IDs to iommu grouping
* @pgsize_bitmap: bitmap of all possible supported page sizes
+ * @bind_pasid_table: bind pasid table pointer for guest SVM
+ * @unbind_pasid_table: unbind pasid table pointer and restore defaults
*/
struct iommu_ops {
bool (*capable)(enum iommu_cap);
@@ -233,8 +236,14 @@ struct iommu_ops {
u32 (*domain_get_windows)(struct iommu_domain *domain);
int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
+
bool (*is_attach_deferred)(struct iommu_domain *domain, struct device *dev);
+ int (*bind_pasid_table)(struct iommu_domain *domain, struct device *dev,
+ struct pasid_table_config *pasidt_binfo);
+ void (*unbind_pasid_table)(struct iommu_domain *domain,
+ struct device *dev);
+
unsigned long pgsize_bitmap;
};
@@ -296,6 +305,10 @@ extern int iommu_attach_device(struct iommu_domain *domain,
struct device *dev);
extern void iommu_detach_device(struct iommu_domain *domain,
struct device *dev);
+extern int iommu_bind_pasid_table(struct iommu_domain *domain,
+ struct device *dev, struct pasid_table_config *pasidt_binfo);
+extern void iommu_unbind_pasid_table(struct iommu_domain *domain,
+ struct device *dev);
extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
phys_addr_t paddr, size_t size, int prot);
@@ -696,6 +709,17 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
return NULL;
}
+static inline
+int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
+ struct pasid_table_config *pasidt_binfo)
+{
+ return -EINVAL;
+}
+static inline
+void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
+{
+}
+
#endif /* CONFIG_IOMMU_API */
#endif /* __LINUX_IOMMU_H */
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
new file mode 100644
index 0000000..9f7a6bf
--- /dev/null
+++ b/include/uapi/linux/iommu.h
@@ -0,0 +1,32 @@
+/*
+ * IOMMU user API definitions
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef _UAPI_IOMMU_H
+#define _UAPI_IOMMU_H
+
+#include <linux/types.h>
+
+/**
+ * PASID table data used to bind guest PASID table to the host IOMMU. This will
+ * enable guest managed first level page tables.
+ * @version: for future extensions and identification of the data format
+ * @bytes: size of this structure
+ * @base_ptr: PASID table pointer
+ * @pasid_bits: number of bits supported in the guest PASID table, must be less
+ * than or equal to the host supported PASID size.
+ */
+struct pasid_table_config {
+ __u32 version;
+#define PASID_TABLE_CFG_VERSION 1
+ __u32 bytes;
+ __u64 base_ptr;
+ __u8 pasid_bits;
+};
+
+#endif /* _UAPI_IOMMU_H */
--
2.7.4
Allow both intel-iommu.c and dmar.c to access device_domain_info.
This prepares for additional per device arch data used in the TLB flush functions.
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel-iommu.c | 18 ------------------
include/linux/intel-iommu.h | 19 +++++++++++++++++++
2 files changed, 19 insertions(+), 18 deletions(-)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index d60b2fb..a0f81a4 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -391,24 +391,6 @@ struct dmar_domain {
iommu core */
};
-/* PCI domain-device relationship */
-struct device_domain_info {
- struct list_head link; /* link to domain siblings */
- struct list_head global; /* link to global list */
- u8 bus; /* PCI bus number */
- u8 devfn; /* PCI devfn number */
- u8 pasid_supported:3;
- u8 pasid_enabled:1;
- u8 pri_supported:1;
- u8 pri_enabled:1;
- u8 ats_supported:1;
- u8 ats_enabled:1;
- u8 ats_qdep;
- struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
- struct intel_iommu *iommu; /* IOMMU used by this device */
- struct dmar_domain *domain; /* pointer to domain */
-};
-
struct dmar_rmrr_unit {
struct list_head list; /* list of rmrr units */
struct acpi_dmar_header *hdr; /* ACPI header */
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index eec4827..304afae 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -461,6 +461,25 @@ struct intel_iommu {
u32 flags; /* Software defined flags */
};
+/* PCI domain-device relationship */
+struct device_domain_info {
+ struct list_head link; /* link to domain siblings */
+ struct list_head global; /* link to global list */
+ u8 bus; /* PCI bus number */
+ u8 devfn; /* PCI devfn number */
+ u8 pasid_supported:3;
+ u8 pasid_enabled:1;
+ u8 pri_supported:1;
+ u8 pri_enabled:1;
+ u8 ats_supported:1;
+ u8 ats_enabled:1;
+ u8 ats_qdep;
+ u64 fault_mask; /* selected IOMMU faults to be reported */
+ struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
+ struct intel_iommu *iommu; /* IOMMU used by this device */
+ struct dmar_domain *domain; /* pointer to domain */
+};
+
static inline void __iommu_flush_cache(
struct intel_iommu *iommu, void *addr, int size)
{
--
2.7.4
Traditionally, device specific faults are detected and handled within
their own device drivers. When an IOMMU is enabled, faults such as those on
DMA related transactions are detected by the IOMMU. There is no generic
mechanism to report these faults back to the in-kernel device
driver or the guest OS in case of assigned devices.
Faults detected by the IOMMU are based on the transaction's source ID, which
can be reported on a per device basis, regardless of whether the device is a
PCI device or not.
The fault types include recoverable faults (e.g. page requests) and
unrecoverable faults (e.g. access errors). In most cases, faults can be
handled internally by IOMMU drivers. The primary use cases are as
follows:
1. A page request fault originates from an SVM capable device that is
assigned to a guest via vIOMMU. In this case, the first level page tables
are owned by the guest. The page request must be propagated to the guest to
let the guest OS fault in the pages and then send a page response. In this
mechanism, the direct receiver of the IOMMU fault notification is VFIO,
which can relay notification events to QEMU or other user space
software.
2. Some faults need more subtle handling by device drivers. Rather than
simply invoking a reset function, there is a need to let the device driver
handle the fault with a smaller impact.
This patchset is intended to create a generic fault reporting API such
that it can scale as follows:
- all IOMMU types
- PCI and non-PCI devices
- recoverable and unrecoverable faults
- VFIO and other in-kernel users
- DMA & IRQ remapping (TBD)
The original idea was brought up by David Woodhouse and the discussions are
summarized at https://lwn.net/Articles/608914/.
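For illustration, the registration life cycle for a fault consumer such as
VFIO might look like the sketch below; relay_fault_to_user() and the wrapper
names are invented, while the register/unregister calls are the API
introduced here.

#include <linux/iommu.h>

/* Hypothetical handler that queues events for delivery to user space */
static int relay_fault_to_user(struct iommu_fault_event *evt, void *data)
{
	/* queue evt on the consumer context identified by data */
	return 0;
}

static int enable_fault_reporting(struct device *dev, void *cookie)
{
	/* only one handler is allowed per device, -EBUSY otherwise */
	return iommu_register_device_fault_handler(dev, relay_fault_to_user,
						   cookie);
}

static void disable_fault_reporting(struct device *dev)
{
	/* may fail while page requests are still awaiting responses */
	iommu_unregister_device_fault_handler(dev);
}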
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Ashok Raj <[email protected]>
Signed-off-by: Jean-Philippe Brucker <[email protected]>
---
drivers/iommu/iommu.c | 147 +++++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/iommu.h | 35 +++++++++++-
2 files changed, 179 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 784e019..de19c33 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -581,6 +581,13 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
goto err_free_name;
}
+ dev->iommu_param = kzalloc(sizeof(*dev->iommu_param), GFP_KERNEL);
+ if (!dev->iommu_param) {
+ ret = -ENOMEM;
+ goto err_free_name;
+ }
+ mutex_init(&dev->iommu_param->lock);
+
kobject_get(group->devices_kobj);
dev->iommu_group = group;
@@ -611,6 +618,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
mutex_unlock(&group->mutex);
dev->iommu_group = NULL;
kobject_put(group->devices_kobj);
+ kfree(dev->iommu_param);
err_free_name:
kfree(device->name);
err_remove_link:
@@ -657,7 +665,7 @@ void iommu_group_remove_device(struct device *dev)
sysfs_remove_link(&dev->kobj, "iommu_group");
trace_remove_device_from_group(group->id, dev);
-
+ kfree(dev->iommu_param);
kfree(device->name);
kfree(device);
dev->iommu_group = NULL;
@@ -792,6 +800,143 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
/**
+ * iommu_register_device_fault_handler() - Register a device fault handler
+ * @dev: the device
+ * @handler: the fault handler
+ * @data: private data passed as argument to the handler
+ *
+ * When an IOMMU fault event is received, call this handler with the fault event
+ * and data as argument. The handler should return 0. If the fault is
+ * recoverable (IOMMU_FAULT_PAGE_REQ), the handler must also complete
+ * the fault by calling iommu_page_response() with one of the following
+ * response code:
+ * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
+ * - IOMMU_PAGE_RESP_INVALID: terminate the fault
+ * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop reporting
+ * page faults if possible.
+ *
+ * Return 0 if the fault handler was installed successfully, or an error.
+ */
+int iommu_register_device_fault_handler(struct device *dev,
+ iommu_dev_fault_handler_t handler,
+ void *data)
+{
+ struct iommu_param *param = dev->iommu_param;
+
+ /*
+ * Device iommu_param should have been allocated when device is
+ * added to its iommu_group.
+ */
+ if (!param)
+ return -EINVAL;
+
+ /* Only allow one fault handler registered for each device */
+ if (param->fault_param)
+ return -EBUSY;
+
+ mutex_lock(&param->lock);
+ get_device(dev);
+ param->fault_param =
+ kzalloc(sizeof(struct iommu_fault_param), GFP_ATOMIC);
+ if (!param->fault_param) {
+ put_device(dev);
+ mutex_unlock(&param->lock);
+ return -ENOMEM;
+ }
+ mutex_init(&param->fault_param->lock);
+ param->fault_param->handler = handler;
+ param->fault_param->data = data;
+ INIT_LIST_HEAD(&param->fault_param->faults);
+
+ mutex_unlock(&param->lock);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
+
+/**
+ * iommu_unregister_device_fault_handler() - Unregister the device fault handler
+ * @dev: the device
+ *
+ * Remove the device fault handler installed with
+ * iommu_register_device_fault_handler().
+ *
+ * Return 0 on success, or an error.
+ */
+int iommu_unregister_device_fault_handler(struct device *dev)
+{
+ struct iommu_param *param = dev->iommu_param;
+ int ret = 0;
+
+ if (!param)
+ return -EINVAL;
+
+ mutex_lock(&param->lock);
+ /* we cannot unregister handler if there are pending faults */
+ if (!list_empty(&param->fault_param->faults)) {
+ ret = -EBUSY;
+ goto unlock;
+ }
+
+ list_del(&param->fault_param->faults);
+ kfree(param->fault_param);
+ param->fault_param = NULL;
+ put_device(dev);
+
+unlock:
+ mutex_unlock(&param->lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
+
+
+/**
+ * iommu_report_device_fault() - Report fault event to device
+ * @dev: the device
+ * @evt: fault event data
+ *
+ * Called by IOMMU model specific drivers when fault is detected, typically
+ * in a threaded IRQ handler.
+ *
+ * Return 0 on success, or an error.
+ */
+int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
+{
+ int ret = 0;
+ struct iommu_fault_event *evt_pending;
+ struct iommu_fault_param *fparam;
+
+ /* iommu_param is allocated when device is added to group */
+ if (!dev->iommu_param || !evt)
+ return -EINVAL;
+ /* we only report device fault if there is a handler registered */
+ mutex_lock(&dev->iommu_param->lock);
+ if (!dev->iommu_param->fault_param ||
+ !dev->iommu_param->fault_param->handler) {
+ ret = -EINVAL;
+ goto done_unlock;
+ }
+ fparam = dev->iommu_param->fault_param;
+ if (evt->type == IOMMU_FAULT_PAGE_REQ && evt->last_req) {
+ evt_pending = kzalloc(sizeof(*evt_pending), GFP_ATOMIC);
+ if (!evt_pending) {
+ ret = -ENOMEM;
+ goto done_unlock;
+ }
+ memcpy(evt_pending, evt, sizeof(struct iommu_fault_event));
+ mutex_lock(&fparam->lock);
+ list_add_tail(&evt_pending->list, &fparam->faults);
+ mutex_unlock(&fparam->lock);
+ }
+ ret = fparam->handler(evt, fparam->data);
+done_unlock:
+ mutex_unlock(&dev->iommu_param->lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_report_device_fault);
+
+/**
* iommu_group_id - Return ID for a group
* @group: the group to ID
*
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 8968933..32435f9 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -307,7 +307,8 @@ enum iommu_fault_reason {
* and PASID spec.
* - Un-recoverable faults of device interest
* - DMA remapping and IRQ remapping faults
-
+ *
+ * @list pending fault event list, used for tracking responses
* @type contains fault type.
* @reason fault reasons if relevant outside IOMMU driver, IOMMU driver internal
* faults are not reported
@@ -325,6 +326,7 @@ enum iommu_fault_reason {
* sending the fault response.
*/
struct iommu_fault_event {
+ struct list_head list;
enum iommu_fault_type type;
enum iommu_fault_reason reason;
u64 addr;
@@ -341,10 +343,13 @@ struct iommu_fault_event {
* struct iommu_fault_param - per-device IOMMU fault data
* @dev_fault_handler: Callback function to handle IOMMU faults at device level
* @data: handler private data
- *
+ * @faults: holds the pending faults which need a response, e.g. a page response.
+ * @lock: protect pending PRQ event list
*/
struct iommu_fault_param {
iommu_dev_fault_handler_t handler;
+ struct list_head faults;
+ struct mutex lock;
void *data;
};
@@ -358,6 +363,7 @@ struct iommu_fault_param {
* struct iommu_fwspec *iommu_fwspec;
*/
struct iommu_param {
+ struct mutex lock;
struct iommu_fault_param *fault_param;
};
@@ -457,6 +463,14 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
struct notifier_block *nb);
extern int iommu_group_unregister_notifier(struct iommu_group *group,
struct notifier_block *nb);
+extern int iommu_register_device_fault_handler(struct device *dev,
+ iommu_dev_fault_handler_t handler,
+ void *data);
+
+extern int iommu_unregister_device_fault_handler(struct device *dev);
+
+extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
+
extern int iommu_group_id(struct iommu_group *group);
extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
@@ -728,6 +742,23 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
return 0;
}
+static inline int iommu_register_device_fault_handler(struct device *dev,
+ iommu_dev_fault_handler_t handler,
+ void *data)
+{
+ return 0;
+}
+
+static inline int iommu_unregister_device_fault_handler(struct device *dev)
+{
+ return 0;
+}
+
+static inline int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
+{
+ return 0;
+}
+
static inline int iommu_group_id(struct iommu_group *group)
{
return -ENODEV;
--
2.7.4
On Mon, 16 Apr 2018 14:48:58 -0700
Jacob Pan <[email protected]> wrote:
> When Shared Virtual Address (SVA) is enabled for a guest OS via
> vIOMMU, we need to provide invalidation support at IOMMU API and driver
> level. This patch adds Intel VT-d specific function to implement
> iommu passdown invalidate API for shared virtual address.
>
> The use case is for supporting caching structure invalidation
> of assigned SVM capable devices. Emulated IOMMU exposes queue
> invalidation capability and passes down all descriptors from the guest
> to the physical IOMMU.
>
> The assumption is that guest to host device ID mapping should be
> resolved prior to calling IOMMU driver. Based on the device handle,
> host IOMMU driver can replace certain fields before submit to the
> invalidation queue.
>
> Signed-off-by: Liu, Yi L <[email protected]>
> Signed-off-by: Ashok Raj <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/intel-iommu.c | 170 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 170 insertions(+)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index cae4042..c765448 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -4973,6 +4973,175 @@ static void intel_iommu_detach_device(struct iommu_domain *domain,
> dmar_remove_one_dev_info(to_dmar_domain(domain), dev);
> }
>
> +/*
> + * 3D array for converting IOMMU generic type-granularity to VT-d granularity
> + * X indexed by enum iommu_inv_type
> + * Y indicates request without and with PASID
> + * Z indexed by enum iommu_inv_granularity
> + *
> + * For an example, if we want to find the VT-d granularity encoding for IOTLB
> + * type, DMA request with PASID, and page selective. The look up indices are:
> + * [1][1][8], where
> + * 1: IOMMU_INV_TYPE_TLB
> + * 1: with PASID
> + * 8: IOMMU_INV_GRANU_PAGE_PASID
> + *
> + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> + *
> + */
> +const static int inv_type_granu_map[IOMMU_INV_NR_TYPE][2][IOMMU_INV_NR_GRANU] = {
> + /* extended dev IOTLBs, for dev-IOTLB, only global is valid,
> + for dev-EXIOTLB, two valid granu */
> + {
> + {1},
> + {0, 0, 0, 0, 1, 1, 0, 0, 0}
> + },
> + /* IOTLB and EIOTLB */
> + {
> + {1, 1, 0, 1, 0, 0, 0, 0, 0},
> + {0, 0, 0, 0, 1, 0, 1, 1, 1}
> + },
> + /* PASID cache */
> + {
> + {0},
> + {0, 0, 0, 0, 1, 1, 0, 0, 0}
> + },
> + /* context cache */
> + {
> + {1, 1, 1}
> + }
> +};
> +
> +const static u64 inv_type_granu_table[IOMMU_INV_NR_TYPE][2][IOMMU_INV_NR_GRANU] = {
> + /* extended dev IOTLBs, only global is valid */
> + {
> + {QI_DEV_IOTLB_GRAN_ALL},
> + {0, 0, 0, 0, QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0, 0, 0}
> + },
> + /* IOTLB and EIOTLB */
> + {
> + {DMA_TLB_GLOBAL_FLUSH, DMA_TLB_DSI_FLUSH, 0, DMA_TLB_PSI_FLUSH},
> + {0, 0, 0, 0, QI_GRAN_ALL_ALL, 0, QI_GRAN_NONG_ALL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID}
> + },
> + /* PASID cache */
> + {
> + {0},
> + {0, 0, 0, 0, QI_PC_ALL_PASIDS, QI_PC_PASID_SEL}
> + },
> + /* context cache */
> + {
> + {DMA_CCMD_GLOBAL_INVL, DMA_CCMD_DOMAIN_INVL, DMA_CCMD_DEVICE_INVL}
> + }
> +};
> +
> +static inline int to_vtd_granularity(int type, int granu, int with_pasid, u64 *vtd_granu)
> +{
> + if (type >= IOMMU_INV_NR_TYPE || granu >= IOMMU_INV_NR_GRANU || with_pasid > 1)
> + return -EINVAL;
> +
> + if (inv_type_granu_map[type][with_pasid][granu] == 0)
> + return -EINVAL;
> +
> + *vtd_granu = inv_type_granu_table[type][with_pasid][granu];
> +
> + return 0;
> +}
> +
> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> + struct device *dev, struct tlb_invalidate_info *inv_info)
inv_info->hdr.version is never checked, why do we have these if they're
not used?
> +{
> + struct intel_iommu *iommu;
> + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> + struct device_domain_info *info;
> + u16 did, sid;
> + u8 bus, devfn;
> + int ret = 0;
> + u64 granu;
> + unsigned long flags;
> +
> + if (!inv_info || !dmar_domain)
> + return -EINVAL;
> +
> + iommu = device_to_iommu(dev, &bus, &devfn);
> + if (!iommu)
> + return -ENODEV;
> +
> + if (!dev || !dev_is_pci(dev))
> + return -ENODEV;
> +
> + did = dmar_domain->iommu_did[iommu->seq_id];
> + sid = PCI_DEVID(bus, devfn);
> + ret = to_vtd_granularity(inv_info->hdr.type, inv_info->granularity,
> + !!(inv_info->flags & IOMMU_INVALIDATE_PASID_TAGGED), &granu);
> + if (ret) {
> + pr_err("Invalid range type %d, granu %d\n", inv_info->hdr.type,
> + inv_info->granularity);
> + return ret;
> + }
> +
> + spin_lock(&iommu->lock);
> + spin_lock_irqsave(&device_domain_lock, flags);
> +
> + switch (inv_info->hdr.type) {
> + case IOMMU_INV_TYPE_CONTEXT:
> + iommu->flush.flush_context(iommu, did, sid,
> + DMA_CCMD_MASK_NOBIT, granu);
> + break;
> + case IOMMU_INV_TYPE_TLB:
> + /* We need to deal with two scenarios:
> + * - IOTLB for request w/o PASID
> + * - extended IOTLB for request with PASID.
> + */
> + if (inv_info->size &&
> + (inv_info->addr & ((1 << (VTD_PAGE_SHIFT + inv_info->size)) - 1))) {
> + pr_err("Addr out of range, addr 0x%llx, size order %d\n",
> + inv_info->addr, inv_info->size);
> + ret = -ERANGE;
> + goto out_unlock;
> + }
> +
> + if (inv_info->flags & IOMMU_INVALIDATE_PASID_TAGGED)
> + qi_flush_eiotlb(iommu, did, mm_to_dma_pfn(inv_info->addr),
> + inv_info->pasid,
> + inv_info->size, granu,
> + inv_info->flags & IOMMU_INVALIDATE_GLOBAL_PAGE);
> + else
> + qi_flush_iotlb(iommu, did, mm_to_dma_pfn(inv_info->addr),
> + inv_info->size, granu);
> + /**
> + * Always flush device IOTLB if ATS is enabled since guest
> + * vIOMMU exposes CM = 1, no device IOTLB flush will be passed
> + * down.
> + */
> + info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
> + if (info && info->ats_enabled) {
> + if (inv_info->flags & IOMMU_INVALIDATE_PASID_TAGGED)
> + qi_flush_dev_eiotlb(iommu, sid,
> + inv_info->pasid, info->ats_qdep,
> + inv_info->addr, inv_info->size,
> + granu);
> + else
> + qi_flush_dev_iotlb(iommu, sid, info->pfsid,
> + info->ats_qdep, inv_info->addr,
> + inv_info->size);
> + }
> + break;
> + case IOMMU_INV_TYPE_PASID:
> + qi_flush_pasid(iommu, did, granu, inv_info->pasid);
> +
> + break;
> + default:
> + dev_err(dev, "Unknown IOMMU invalidation type %d\n",
> + inv_info->hdr.type);
> + ret = -EINVAL;
> + }
More verbose logging, is vfio just passing these through allowing them
to be user reachable? Thanks,
Alex
> +out_unlock:
> + spin_unlock(&iommu->lock);
> + spin_unlock_irqrestore(&device_domain_lock, flags);
> +
> + return ret;
> +}
> +
> static int intel_iommu_map(struct iommu_domain *domain,
> unsigned long iova, phys_addr_t hpa,
> size_t size, int iommu_prot)
> @@ -5398,6 +5567,7 @@ const struct iommu_ops intel_iommu_ops = {
> #ifdef CONFIG_INTEL_IOMMU_SVM
> .bind_pasid_table = intel_iommu_bind_pasid_table,
> .unbind_pasid_table = intel_iommu_unbind_pasid_table,
> + .sva_invalidate = intel_iommu_sva_invalidate,
> #endif
> .map = intel_iommu_map,
> .unmap = intel_iommu_unmap,
On Mon, 16 Apr 2018 14:48:53 -0700
Jacob Pan <[email protected]> wrote:
> Add Intel VT-d ops to the generic iommu_bind_pasid_table API
> functions.
>
> The primary use case is for direct assignment of SVM capable
> device. Originated from emulated IOMMU in the guest, the request goes
> through many layers (e.g. VFIO). Upon calling host IOMMU driver, caller
> passes guest PASID table pointer (GPA) and size.
>
> Device context table entry is modified by Intel IOMMU specific
> bind_pasid_table function. This will turn on nesting mode and matching
> translation type.
>
> The unbind operation restores default context mapping.
>
> Signed-off-by: Jacob Pan <[email protected]>
> Signed-off-by: Liu, Yi L <[email protected]>
> Signed-off-by: Ashok Raj <[email protected]>
> ---
> drivers/iommu/intel-iommu.c | 119 ++++++++++++++++++++++++++++++++++++++++++
> include/linux/dma_remapping.h | 1 +
> 2 files changed, 120 insertions(+)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index a0f81a4..d8058be 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -2409,6 +2409,7 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu,
> info->ats_supported = info->pasid_supported = info->pri_supported = 0;
> info->ats_enabled = info->pasid_enabled = info->pri_enabled = 0;
> info->ats_qdep = 0;
> + info->pasid_table_bound = 0;
> info->dev = dev;
> info->domain = domain;
> info->iommu = iommu;
> @@ -5132,6 +5133,7 @@ static void intel_iommu_put_resv_regions(struct device *dev,
>
> #ifdef CONFIG_INTEL_IOMMU_SVM
> #define MAX_NR_PASID_BITS (20)
> +#define MIN_NR_PASID_BITS (5)
> static inline unsigned long intel_iommu_get_pts(struct intel_iommu *iommu)
> {
> /*
> @@ -5258,6 +5260,119 @@ struct intel_iommu *intel_svm_device_to_iommu(struct device *dev)
>
> return iommu;
> }
> +
> +static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
> + struct device *dev, struct pasid_table_config *pasidt_binfo)
> +{
Never validates pasidt_binfo->{version,bytes}
> + struct intel_iommu *iommu;
> + struct context_entry *context;
> + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> + struct device_domain_info *info;
> + struct pci_dev *pdev;
> + u8 bus, devfn, host_table_pasid_bits;
> + u16 did, sid;
> + int ret = 0;
> + unsigned long flags;
> + u64 ctx_lo;
> +
> + iommu = device_to_iommu(dev, &bus, &devfn);
> + if (!iommu)
> + return -ENODEV;
> + /* VT-d spec section 9.4 says pasid table size is encoded as 2^(x+5) */
> + host_table_pasid_bits = intel_iommu_get_pts(iommu) + MIN_NR_PASID_BITS;
> + if (!pasidt_binfo || pasidt_binfo->pasid_bits > host_table_pasid_bits ||
> + pasidt_binfo->pasid_bits < MIN_NR_PASID_BITS) {
> + pr_err("Invalid gPASID bits %d, host range %d - %d\n",
> + pasidt_binfo->pasid_bits,
> + MIN_NR_PASID_BITS, host_table_pasid_bits);
> + return -ERANGE;
> + }
> + if (!ecap_nest(iommu->ecap)) {
> + dev_err(dev, "Cannot bind PASID table, no nested translation\n");
> + ret = -EINVAL;
> + goto out;
> + }
Gratuitous use of pr_err, some of these look user reachable, for
instance can vfio know in advance the supported widths or can the user
trigger that pr_err at will? Some of these errno values are also maybe
not as descriptive as they could be. For instance if the iommu doesn't
support nesting, that's not a calling argument error, that's an
unsupported device error, right?
> + pdev = to_pci_dev(dev);
> + sid = PCI_DEVID(bus, devfn);
> + info = dev->archdata.iommu;
> +
> + if (!info) {
> + dev_err(dev, "Invalid device domain info\n");
> + ret = -EINVAL;
> + goto out;
> + }
> + if (info->pasid_table_bound) {
> + dev_err(dev, "Device PASID table already bound\n");
> + ret = -EBUSY;
> + goto out;
> + }
> + if (!info->pasid_enabled) {
> + ret = pci_enable_pasid(pdev, info->pasid_supported & ~1);
> + if (ret) {
> + dev_err(dev, "Failed to enable PASID\n");
> + goto out;
> + }
> + }
> + spin_lock_irqsave(&iommu->lock, flags);
> + context = iommu_context_addr(iommu, bus, devfn, 0);
> + if (!context_present(context)) {
> + dev_err(dev, "Context not present\n");
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
> + /* Anticipate guest to use SVM and owns the first level, so we turn
> + * nested mode on
> + */
> + ctx_lo = context[0].lo;
> + ctx_lo |= CONTEXT_NESTE | CONTEXT_PRS | CONTEXT_PASIDE;
> + ctx_lo &= ~CONTEXT_TT_MASK;
> + ctx_lo |= CONTEXT_TT_DEV_IOTLB << 2;
> + context[0].lo = ctx_lo;
> +
> + /* Assign guest PASID table pointer and size order */
> + ctx_lo = (pasidt_binfo->base_ptr & VTD_PAGE_MASK) |
> + (pasidt_binfo->pasid_bits - MIN_NR_PASID_BITS);
Where does this IOMMU API interface define that base_ptr is 4K
aligned or the format of the PASID table? Are these all standardized
or do they vary by host IOMMU? If they're standards, maybe we could
note that and the spec which defines them when we declare base_ptr. If
they're IOMMU specific then I don't understand how we'll match a user
provided PASID table to the requirements and format of the host IOMMU.
Thanks,
Alex
> + context[1].lo = ctx_lo;
> + /* make sure context entry is updated before flushing */
> + wmb();
> + did = dmar_domain->iommu_did[iommu->seq_id];
> + iommu->flush.flush_context(iommu, did,
> + (((u16)bus) << 8) | devfn,
> + DMA_CCMD_MASK_NOBIT,
> + DMA_CCMD_DEVICE_INVL);
> + iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
> + info->pasid_table_bound = 1;
> +out_unlock:
> + spin_unlock_irqrestore(&iommu->lock, flags);
> +out:
> + return ret;
> +}
> +
> +static void intel_iommu_unbind_pasid_table(struct iommu_domain *domain,
> + struct device *dev)
> +{
> + struct intel_iommu *iommu;
> + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> + struct device_domain_info *info;
> + u8 bus, devfn;
> +
> + info = dev->archdata.iommu;
> + if (!info) {
> + dev_err(dev, "Invalid device domain info\n");
> + return;
> + }
> + iommu = device_to_iommu(dev, &bus, &devfn);
> + if (!iommu) {
> + dev_err(dev, "No IOMMU for device to unbind PASID table\n");
> + return;
> + }
> +
> + domain_context_clear(iommu, dev);
> +
> + domain_context_mapping_one(dmar_domain, iommu, bus, devfn);
> + info->pasid_table_bound = 0;
> +}
> #endif /* CONFIG_INTEL_IOMMU_SVM */
>
> const struct iommu_ops intel_iommu_ops = {
> @@ -5266,6 +5381,10 @@ const struct iommu_ops intel_iommu_ops = {
> .domain_free = intel_iommu_domain_free,
> .attach_dev = intel_iommu_attach_device,
> .detach_dev = intel_iommu_detach_device,
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> + .bind_pasid_table = intel_iommu_bind_pasid_table,
> + .unbind_pasid_table = intel_iommu_unbind_pasid_table,
> +#endif
> .map = intel_iommu_map,
> .unmap = intel_iommu_unmap,
> .map_sg = default_iommu_map_sg,
> diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h
> index 21b3e7d..db290b2 100644
> --- a/include/linux/dma_remapping.h
> +++ b/include/linux/dma_remapping.h
> @@ -28,6 +28,7 @@
>
> #define CONTEXT_DINVE (1ULL << 8)
> #define CONTEXT_PRS (1ULL << 9)
> +#define CONTEXT_NESTE (1ULL << 10)
> #define CONTEXT_PASIDE (1ULL << 11)
>
> struct intel_iommu;
Hi Jacob,
On Mon, Apr 16, 2018 at 10:48:54PM +0100, Jacob Pan wrote:
[...]
> +/**
> + * enum iommu_inv_granularity - Generic invalidation granularity
> + *
> + * When an invalidation request is sent to IOMMU to flush translation caches,
> + * it may carry different granularity. These granularity levels are not specific
> + * to a type of translation cache. For an example, PASID selective granularity
> + * is only applicable to PASID cache invalidation.
I'm still confused by this, I think we should add more definitions
because architectures tend to use different names. What you call
"Translations caches" encompasses all caches that can be invalidated
with this request, right? So all of:
* "TLB" and "DTLB" that cache IOVA->GPA and GPA->PA (TLB is in the
IOMMU, DTLB is an ATC in an endpoint),
* "PASID cache" that cache PASID->Translation Table,
* "Context cache" that cache RID->PASID table
Does this match the model you're using?
The last name is a bit unfortunate. Since the Arm architecture uses the
name "context" for what a PASID points to, "Device cache" would suit us
better but it's not important.
I don't understand what you mean by "PASID selective granularity is only
applicable to PASID cache invalidation", it seems to contradict the
preceding sentence. What if user sends an invalidation with
IOMMU_INV_TYPE_TLB and IOMMU_INV_GRANU_ALL_PASID? Doesn't this remove
from the TLBs all entries with the given PASID?
> + * This enum is a collection of granularities for all types of translation
> + * caches. The idea is to make it easy for IOMMU model specific driver do
> + * conversion from generic to model specific value.
> + */
> +enum iommu_inv_granularity {
In patch 9, inv_type_granu_map has some valid fields with granularity ==
0. Does it mean "invalidate all caches"?
I don't think user should ever be allowed to invalidate caches entries
of devices and domains it doesn't own.
> + IOMMU_INV_GRANU_DOMAIN = 1, /* all TLBs associated with a domain */
> + IOMMU_INV_GRANU_DEVICE, /* caching structure associated with a
> + * device ID
> + */
> + IOMMU_INV_GRANU_DOMAIN_PAGE, /* address range with a domain */
> + IOMMU_INV_GRANU_ALL_PASID, /* cache of a given PASID */
If this corresponds to QI_GRAN_ALL_ALL in patch 9, the comment should be
"Cache of all PASIDs"? Or maybe "all entries for all PASIDs"? Is it
different from GRANU_DOMAIN then?
> + IOMMU_INV_GRANU_PASID_SEL, /* only invalidate specified PASID */
> +
> + IOMMU_INV_GRANU_NG_ALL_PASID, /* non-global within all PASIDs */
> + IOMMU_INV_GRANU_NG_PASID, /* non-global within a PASIDs */
Are the "NG" variant needed since there is a IOMMU_INVALIDATE_GLOBAL_PAGE
below? We should drop either flag or granule.
FWIW I'm starting to think more granule options is actually better than
flags, because it flattens the combinations and keeps them to two
dimensions, that we can understand and explain with a table.
> + IOMMU_INV_GRANU_PAGE_PASID, /* page-selective within a PASID */
Maybe this should be called "NG_PAGE_PASID", and "DOMAIN_PAGE" should
instead be "PAGE_PASID". If I understood their meaning correctly, it
would be more consistent with the rest.
> + IOMMU_INV_NR_GRANU,
> +};
> +
> +/** enum iommu_inv_type - Generic translation cache types for invalidation
> + *
> + * Invalidation requests sent to IOMMU may indicate which translation cache
> + * to be operated on.
> + * Combined with enum iommu_inv_granularity, model specific driver can do a
> + * simple lookup to convert generic type to model specific value.
> + */
> +enum iommu_inv_type {
These should be flags (1 << 0), (1 << 1), etc., since IOMMUs will want to
invalidate multiple caches at once (at least DTLB and TLB). You could
then do for_each_set_bit() in the driver.
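A rough sketch of what I mean, purely illustrative (the bit assignments and
the example helper below are not from the posted series, and
for_each_set_bit() needs linux/bitops.h):

enum iommu_inv_type {
	IOMMU_INV_TYPE_DTLB	= (1 << 0),	/* device IOTLB */
	IOMMU_INV_TYPE_TLB	= (1 << 1),	/* IOMMU paging structure cache */
	IOMMU_INV_TYPE_PASID	= (1 << 2),	/* PASID cache */
	IOMMU_INV_TYPE_CONTEXT	= (1 << 3),	/* device context entry cache */
};
#define IOMMU_INV_NR_TYPE	4

static void example_invalidate_types(unsigned long types)
{
	unsigned int type;

	/* One request can now name several caches; walk them in one pass */
	for_each_set_bit(type, &types, IOMMU_INV_NR_TYPE) {
		/* convert BIT(type) to the model-specific descriptor here */
	}
}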
> + IOMMU_INV_TYPE_DTLB, /* device IOTLB */
> + IOMMU_INV_TYPE_TLB, /* IOMMU paging structure cache */
> + IOMMU_INV_TYPE_PASID, /* PASID cache */
> + IOMMU_INV_TYPE_CONTEXT, /* device context entry cache */
> + IOMMU_INV_NR_TYPE
> +};
We need to summarize and explain valid combinations, because reading
inv_type_granu_map and inv_type_granu_table is a bit tedious. I tried to
reproduce inv_type_granu_map here (Cell format is PASID_TAGGED /
!PASID_TAGGED). Could you check if this matches your model?
type | DTLB | TLB | PASID | CONTEXT
granule | | | |
-----------------+-----------+-----------+-----------+-----------
- | / Y | / Y | | / Y
DOMAIN | | / Y | | / Y
DEVICE | | | | / Y
DOMAIN_PAGE | | / Y | |
ALL_PASID | Y | Y | |
PASID_SEL | Y | | Y |
NG_ALL_PASID | | Y | Y |
NG_PASID | | Y | |
PAGE_PASID | | Y | |
There is no intersection between PASID_TAGGED and !PASID_TAGGED (Y/Y),
so the flag might not be needed.
I think the API can be more relaxed. Each IOMMU driver can add more
restrictions, but I think the SMMU can support these combinations:
| DTLB | TLB | PASID | CONTEXT
--------------+-----------+-----------+-----------+-----------
DOMAIN | Y | Y | Y | Y
DEVICE | Y | Y | Y | Y
DOMAIN_PAGE | Y | Y | |
ALL_PASID | Y | Y | Y |
PASID_SEL | Y | Y | Y |
NG_ALL_PASID | Y | Y | Y |
NG_PASID | Y | Y | Y |
PAGE_PASID | Y | Y | |
Two are missing in the PASID column because it doesn't make any sense to
target the PASID cache with a page-selective invalidation. And for the
context cache, we can only invalidate per device or per domain. So I
think this is the biggest set of allowed combinations.
> +
> +/**
> + * Translation cache invalidation header that contains mandatory meta data.
> + * @version: info format version, expecting future extesions
> + * @type: type of translation cache to be invalidated
> + */
> +struct tlb_invalidate_hdr {
> + __u32 version;
> +#define TLB_INV_HDR_VERSION_1 1
> + enum iommu_inv_type type;
> +};
> +
> +/**
> + * Translation cache invalidation information, contains generic IOMMU
> + * data which can be parsed based on model ID by model specific drivers.
> + *
> + * @granularity: requested invalidation granularity, type dependent
> + * @size: 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
Maybe start the size at 1 byte, we don't know what sort of granularity
future architectures will offer.
> + * @pasid: processor address space ID value per PCI spec.
> + * @addr: page address to be invalidated
> + * @flags IOMMU_INVALIDATE_PASID_TAGGED: DMA with PASID tagged,
> + * @pasid validity can be
> + * deduced from @granularity
This is really hurting my brain... Two dimensions was already difficult,
but I can't follow anymore. What does PASID_TAGGED say if not "@pasid is
valid"? I thought VT-d mandated PASID for nested translation?
> + * IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries
> + * IOMMU_INVALIDATE_GLOBAL_PAGE: global pages
> + *
> + */
> +struct tlb_invalidate_info {
> + struct tlb_invalidate_hdr hdr;
> + enum iommu_inv_granularity granularity;
> + __u32 flags;
> +#define IOMMU_INVALIDATE_NO_PASID (1 << 0)
I suggested NO_PASID because Arm can have pasid-tagged and one no-pasid
address spaces within the same domain in DSS0 mode. AMD would also need
this for their GIoV mode, if I understood it correctly.
When specifying NO_PASID, the user invalidates mappings for the address
space that doesn't have a PASID, but the same granularities as PASID
contexts apply. I now think we can remove the NO_PASID flag and avoid a
lot of confusion.
The GIoV and DSS0 modes are implemented by reserving entry 0 of the
PASID table for NO_PASID translations. Given that the guest specifies
this mode at BIND_TABLE time, the host understands that when the guest
invalidates PASID 0, if GIoV or DSS0 was enabled, then the invalidation
applies to the NO_PASID context. So you can drop this flag in my
opinion.
Thanks,
Jean
On Tue, Apr 17, 2018 at 08:10:47PM +0100, Alex Williamson wrote:
[...]
> > + /* Assign guest PASID table pointer and size order */
> > + ctx_lo = (pasidt_binfo->base_ptr & VTD_PAGE_MASK) |
> > + (pasidt_binfo->pasid_bits - MIN_NR_PASID_BITS);
>
> Where does this IOMMU API interface define that base_ptr is 4K
> aligned or the format of the PASID table? Are these all standardized
> or do they vary by host IOMMU? If they're standards, maybe we could
> note that and the spec which defines them when we declare base_ptr. If
> they're IOMMU specific then I don't understand how we'll match a user
> provided PASID table to the requirements and format of the host IOMMU.
> Thanks,
On SMMUv3 the minimum alignment for base_ptr is 64 bytes, so a guest
under a vSMMU might pass a pointer that's not aligned on 4k.
Maybe this information could be part of the data passed to userspace
about IOMMU table formats and features? They're not part of this series,
but I think we wanted to communicate IOMMU-specific features via sysfs.
Thanks,
Jean
On Tue, 17 Apr 2018 13:10:45 -0600
Alex Williamson <[email protected]> wrote:
> On Mon, 16 Apr 2018 14:48:58 -0700
> Jacob Pan <[email protected]> wrote:
>
> > When Shared Virtual Address (SVA) is enabled for a guest OS via
> > vIOMMU, we need to provide invalidation support at IOMMU API and
> > driver level. This patch adds Intel VT-d specific function to
> > implement iommu passdown invalidate API for shared virtual address.
> >
> > The use case is for supporting caching structure invalidation
> > of assigned SVM capable devices. Emulated IOMMU exposes queue
> > invalidation capability and passes down all descriptors from the
> > guest to the physical IOMMU.
> >
> > The assumption is that guest to host device ID mapping should be
> > resolved prior to calling IOMMU driver. Based on the device handle,
> > host IOMMU driver can replace certain fields before submit to the
> > invalidation queue.
> >
> > Signed-off-by: Liu, Yi L <[email protected]>
> > Signed-off-by: Ashok Raj <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> >  drivers/iommu/intel-iommu.c | 170 ++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 170 insertions(+)
> >
> > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > index cae4042..c765448 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -4973,6 +4973,175 @@ static void intel_iommu_detach_device(struct iommu_domain *domain,
> >  	dmar_remove_one_dev_info(to_dmar_domain(domain), dev);
> >  }
> >
> > +/*
> > + * 3D array for converting IOMMU generic type-granularity to VT-d granularity
> > + * X indexed by enum iommu_inv_type
> > + * Y indicates request without and with PASID
> > + * Z indexed by enum iommu_inv_granularity
> > + *
> > + * For an example, if we want to find the VT-d granularity encoding for IOTLB
> > + * type, DMA request with PASID, and page selective. The look up indices are:
> > + * [1][1][8], where
> > + * 1: IOMMU_INV_TYPE_TLB
> > + * 1: with PASID
> > + * 8: IOMMU_INV_GRANU_PAGE_PASID
> > + *
> > + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> > + *
> > + */
> > +const static int inv_type_granu_map[IOMMU_INV_NR_TYPE][2][IOMMU_INV_NR_GRANU] = {
> > +	/* extended dev IOTLBs, for dev-IOTLB, only global is valid,
> > +	   for dev-EXIOTLB, two valid granu */
> > +	{
> > +		{1},
> > +		{0, 0, 0, 0, 1, 1, 0, 0, 0}
> > +	},
> > +	/* IOTLB and EIOTLB */
> > +	{
> > +		{1, 1, 0, 1, 0, 0, 0, 0, 0},
> > +		{0, 0, 0, 0, 1, 0, 1, 1, 1}
> > +	},
> > +	/* PASID cache */
> > +	{
> > +		{0},
> > +		{0, 0, 0, 0, 1, 1, 0, 0, 0}
> > +	},
> > +	/* context cache */
> > +	{
> > +		{1, 1, 1}
> > +	}
> > +};
> > +
> > +const static u64 inv_type_granu_table[IOMMU_INV_NR_TYPE][2][IOMMU_INV_NR_GRANU] = {
> > +	/* extended dev IOTLBs, only global is valid */
> > +	{
> > +		{QI_DEV_IOTLB_GRAN_ALL},
> > +		{0, 0, 0, 0, QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0, 0, 0}
> > +	},
> > +	/* IOTLB and EIOTLB */
> > +	{
> > +		{DMA_TLB_GLOBAL_FLUSH, DMA_TLB_DSI_FLUSH, 0, DMA_TLB_PSI_FLUSH},
> > +		{0, 0, 0, 0, QI_GRAN_ALL_ALL, 0, QI_GRAN_NONG_ALL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID}
> > +	},
> > +	/* PASID cache */
> > +	{
> > +		{0},
> > +		{0, 0, 0, 0, QI_PC_ALL_PASIDS, QI_PC_PASID_SEL}
> > +	},
> > +	/* context cache */
> > +	{
> > +		{DMA_CCMD_GLOBAL_INVL, DMA_CCMD_DOMAIN_INVL, DMA_CCMD_DEVICE_INVL}
> > +	}
> > +};
> > +
> > +static inline int to_vtd_granularity(int type, int granu, int with_pasid, u64 *vtd_granu)
> > +{
> > +	if (type >= IOMMU_INV_NR_TYPE || granu >= IOMMU_INV_NR_GRANU || with_pasid > 1)
> > +		return -EINVAL;
> > +
> > +	if (inv_type_granu_map[type][with_pasid][granu] == 0)
> > +		return -EINVAL;
> > +
> > +	*vtd_granu = inv_type_granu_table[type][with_pasid][granu];
> > +
> > +	return 0;
> > +}
> > +
> > +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> > +		struct device *dev, struct tlb_invalidate_info *inv_info)
>
> inv_info->hdr.version is never checked, why do we have these if
> they're not used?
>
The version was added to leave room for future extension. You are
right, it should be checked.
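For example, a minimal check at the top of intel_iommu_sva_invalidate()
could be (sketch only, reusing the TLB_INV_HDR_VERSION_1 define from the
proposed uapi):

	if (!inv_info || inv_info->hdr.version != TLB_INV_HDR_VERSION_1)
		return -EINVAL;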
> > +{
> > + struct intel_iommu *iommu;
> > + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > + struct device_domain_info *info;
> > + u16 did, sid;
> > + u8 bus, devfn;
> > + int ret = 0;
> > + u64 granu;
> > + unsigned long flags;
> > +
> > + if (!inv_info || !dmar_domain)
> > + return -EINVAL;
> > +
> > + iommu = device_to_iommu(dev, &bus, &devfn);
> > + if (!iommu)
> > + return -ENODEV;
> > +
> > + if (!dev || !dev_is_pci(dev))
> > + return -ENODEV;
> > +
> > + did = dmar_domain->iommu_did[iommu->seq_id];
> > + sid = PCI_DEVID(bus, devfn);
> > +	ret = to_vtd_granularity(inv_info->hdr.type, inv_info->granularity,
> > +			!!(inv_info->flags & IOMMU_INVALIDATE_PASID_TAGGED), &granu);
> > +	if (ret) {
> > +		pr_err("Invalid range type %d, granu %d\n", inv_info->hdr.type,
> > +			inv_info->granularity);
> > +		return ret;
> > +	}
> > +
> > +	spin_lock(&iommu->lock);
> > +	spin_lock_irqsave(&device_domain_lock, flags);
> > +
> > +	switch (inv_info->hdr.type) {
> > +	case IOMMU_INV_TYPE_CONTEXT:
> > +		iommu->flush.flush_context(iommu, did, sid,
> > +					DMA_CCMD_MASK_NOBIT, granu);
> > +		break;
> > +	case IOMMU_INV_TYPE_TLB:
> > +		/* We need to deal with two scenarios:
> > +		 * - IOTLB for request w/o PASID
> > +		 * - extended IOTLB for request with PASID.
> > +		 */
> > +		if (inv_info->size &&
> > +			(inv_info->addr & ((1 << (VTD_PAGE_SHIFT + inv_info->size)) - 1))) {
> > +			pr_err("Addr out of range, addr 0x%llx, size order %d\n",
> > +				inv_info->addr, inv_info->size);
> > +			ret = -ERANGE;
> > +			goto out_unlock;
> > +		}
> > +
> > +		if (inv_info->flags & IOMMU_INVALIDATE_PASID_TAGGED)
> > +			qi_flush_eiotlb(iommu, did, mm_to_dma_pfn(inv_info->addr),
> > +					inv_info->pasid,
> > +					inv_info->size, granu,
> > +					inv_info->flags & IOMMU_INVALIDATE_GLOBAL_PAGE);
> > +		else
> > +			qi_flush_iotlb(iommu, did, mm_to_dma_pfn(inv_info->addr),
> > +					inv_info->size, granu);
> > +		/**
> > +		 * Always flush device IOTLB if ATS is enabled since guest
> > +		 * vIOMMU exposes CM = 1, no device IOTLB flush will be passed
> > +		 * down.
> > +		 */
> > +		info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
> > +		if (info && info->ats_enabled) {
> > +			if (inv_info->flags & IOMMU_INVALIDATE_PASID_TAGGED)
> > +				qi_flush_dev_eiotlb(iommu, sid,
> > +						inv_info->pasid, info->ats_qdep,
> > +						inv_info->addr, inv_info->size,
> > +						granu);
> > +			else
> > +				qi_flush_dev_iotlb(iommu, sid, info->pfsid,
> > +						info->ats_qdep, inv_info->addr,
> > +						inv_info->size);
> > +		}
> > +		break;
> > +	case IOMMU_INV_TYPE_PASID:
> > +		qi_flush_pasid(iommu, did, granu, inv_info->pasid);
> > +
> > +		break;
> > +	default:
> > +		dev_err(dev, "Unknown IOMMU invalidation type %d\n",
> > +			inv_info->hdr.type);
> > +		ret = -EINVAL;
> > +	}
>
>
> More verbose logging,
You mean the dev_err is unnecessary? I will remove that.
> is vfio just passing these through allowing them
> to be user reachable? Thanks,
Yes, the invalidation types are in uapi; we expect QEMU to trap invalidations
from the vIOMMU and pass them down to the physical IOMMU.
>
> Alex
>
> [...]
>
[Jacob Pan]
On Tue, 17 Apr 2018 13:10:47 -0600
Alex Williamson <[email protected]> wrote:
> On Mon, 16 Apr 2018 14:48:53 -0700
> Jacob Pan <[email protected]> wrote:
>
> > Add Intel VT-d ops to the generic iommu_bind_pasid_table API
> > functions.
> >
> > The primary use case is for direct assignment of SVM capable
> > device. Originated from emulated IOMMU in the guest, the request
> > goes through many layers (e.g. VFIO). Upon calling host IOMMU
> > driver, caller passes guest PASID table pointer (GPA) and size.
> >
> > Device context table entry is modified by Intel IOMMU specific
> > bind_pasid_table function. This will turn on nesting mode and
> > matching translation type.
> >
> > The unbind operation restores default context mapping.
> >
> > Signed-off-by: Jacob Pan <[email protected]>
> > Signed-off-by: Liu, Yi L <[email protected]>
> > Signed-off-by: Ashok Raj <[email protected]>
> > ---
> >  drivers/iommu/intel-iommu.c   | 119 ++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/dma_remapping.h |   1 +
> >  2 files changed, 120 insertions(+)
> >
> > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > index a0f81a4..d8058be 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -2409,6 +2409,7 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu,
> >  	info->ats_supported = info->pasid_supported = info->pri_supported = 0;
> >  	info->ats_enabled = info->pasid_enabled = info->pri_enabled = 0;
> >  	info->ats_qdep = 0;
> > + info->pasid_table_bound = 0;
> > info->dev = dev;
> > info->domain = domain;
> > info->iommu = iommu;
> > @@ -5132,6 +5133,7 @@ static void intel_iommu_put_resv_regions(struct device *dev,
> >  #ifdef CONFIG_INTEL_IOMMU_SVM
> >  #define MAX_NR_PASID_BITS	(20)
> > +#define MIN_NR_PASID_BITS	(5)
> >  static inline unsigned long intel_iommu_get_pts(struct intel_iommu *iommu)
> >  {
> >  	/*
> > @@ -5258,6 +5260,119 @@ struct intel_iommu *intel_svm_device_to_iommu(struct device *dev)
> >  	return iommu;
> >  }
> > +
> > +static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
> > +		struct device *dev, struct pasid_table_config *pasidt_binfo)
> > +{
>
> Never validates pasidt_binfo->{version,bytes}
>
good catch, will do.
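Something along these lines at the top of intel_iommu_bind_pasid_table(),
assuming the uapi gains a PASID_TABLE_CFG_VERSION_1 style define (sketch only,
the define name is made up here):

	/* Reject unknown versions and truncated structures from user space */
	if (!pasidt_binfo || pasidt_binfo->version != PASID_TABLE_CFG_VERSION_1 ||
	    pasidt_binfo->bytes != sizeof(*pasidt_binfo))
		return -EINVAL;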
> > + struct intel_iommu *iommu;
> > + struct context_entry *context;
> > + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > + struct device_domain_info *info;
> > + struct pci_dev *pdev;
> > + u8 bus, devfn, host_table_pasid_bits;
> > + u16 did, sid;
> > + int ret = 0;
> > + unsigned long flags;
> > + u64 ctx_lo;
> > +
> > + iommu = device_to_iommu(dev, &bus, &devfn);
> > + if (!iommu)
> > + return -ENODEV;
> > +	/* VT-d spec section 9.4 says pasid table size is encoded as 2^(x+5) */
> > +	host_table_pasid_bits = intel_iommu_get_pts(iommu) + MIN_NR_PASID_BITS;
> > +	if (!pasidt_binfo || pasidt_binfo->pasid_bits > host_table_pasid_bits ||
> > +		pasidt_binfo->pasid_bits < MIN_NR_PASID_BITS) {
> > +		pr_err("Invalid gPASID bits %d, host range %d - %d\n",
> > +			pasidt_binfo->pasid_bits,
> > +			MIN_NR_PASID_BITS, host_table_pasid_bits);
> > +		return -ERANGE;
> > +	}
> > +	if (!ecap_nest(iommu->ecap)) {
> > +		dev_err(dev, "Cannot bind PASID table, no nested translation\n");
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
>
> Gratuitous use of pr_err, some of these look user reachable, for
> instance can vfio know in advance the supported widths or can the user
> trigger that pr_err at will?
Yes, the current IOMMU sysfs for VT-d does show the content of the
capability registers, so the user could know in advance whether nested
mode is supported. But I think we are in need of some generic interface
to enumerate IOMMU features. Here I am trying to prepare for the worst.
Are you concerned about security if the user can trigger that error at
will? Sorry, I didn't get the point.
> Some of these errno values are also
> maybe not as descriptive as they could be. For instance if the iommu
> doesn't support nesting, that's not a calling argument error, that's
> an unsupported device error, right?
>
You are right, that is not an invalid-argument error. You mean use -ENODEV?
> > + pdev = to_pci_dev(dev);
> > + sid = PCI_DEVID(bus, devfn);
> > + info = dev->archdata.iommu;
> > +
> > + if (!info) {
> > + dev_err(dev, "Invalid device domain info\n");
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > + if (info->pasid_table_bound) {
> > + dev_err(dev, "Device PASID table already bound\n");
> > + ret = -EBUSY;
> > + goto out;
> > + }
> > + if (!info->pasid_enabled) {
> > +		ret = pci_enable_pasid(pdev, info->pasid_supported & ~1);
> > + if (ret) {
> > + dev_err(dev, "Failed to enable PASID\n");
> > + goto out;
> > + }
> > + }
> > + spin_lock_irqsave(&iommu->lock, flags);
> > + context = iommu_context_addr(iommu, bus, devfn, 0);
> > + if (!context_present(context)) {
> > + dev_err(dev, "Context not present\n");
> > + ret = -EINVAL;
> > + goto out_unlock;
> > + }
> > +
> > +	/* Anticipate guest to use SVM and owns the first level, so we turn
> > +	 * nested mode on
> > +	 */
> > + ctx_lo = context[0].lo;
> > + ctx_lo |= CONTEXT_NESTE | CONTEXT_PRS | CONTEXT_PASIDE;
> > + ctx_lo &= ~CONTEXT_TT_MASK;
> > + ctx_lo |= CONTEXT_TT_DEV_IOTLB << 2;
> > + context[0].lo = ctx_lo;
> > +
> > + /* Assign guest PASID table pointer and size order */
> > + ctx_lo = (pasidt_binfo->base_ptr & VTD_PAGE_MASK) |
> > + (pasidt_binfo->pasid_bits - MIN_NR_PASID_BITS);
>
> Where does this IOMMU API interface define that base_ptr is 4K
> aligned or the format of the PASID table? Are these all standardized
> or do they vary by host IOMMU? If they're standards, maybe we could
> note that and the spec which defines them when we declare base_ptr.
> If they're IOMMU specific then I don't understand how we'll match a
> user provided PASID table to the requirements and format of the host
> IOMMU. Thanks,
>
follow up in the other thread with Jean.
Thanks for the review.
Jacob
> Alex
>
> > + context[1].lo = ctx_lo;
> > + /* make sure context entry is updated before flushing */
> > + wmb();
> > + did = dmar_domain->iommu_did[iommu->seq_id];
> > + iommu->flush.flush_context(iommu, did,
> > + (((u16)bus) << 8) | devfn,
> > + DMA_CCMD_MASK_NOBIT,
> > + DMA_CCMD_DEVICE_INVL);
> > +	iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
> > + info->pasid_table_bound = 1;
> > +out_unlock:
> > + spin_unlock_irqrestore(&iommu->lock, flags);
> > +out:
> > + return ret;
> > +}
> > +
> > +static void intel_iommu_unbind_pasid_table(struct iommu_domain *domain,
> > +					struct device *dev)
> > +{
> > + struct intel_iommu *iommu;
> > + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > + struct device_domain_info *info;
> > + u8 bus, devfn;
> > +
> > + info = dev->archdata.iommu;
> > + if (!info) {
> > + dev_err(dev, "Invalid device domain info\n");
> > + return;
> > + }
> > + iommu = device_to_iommu(dev, &bus, &devfn);
> > + if (!iommu) {
> > +		dev_err(dev, "No IOMMU for device to unbind PASID table\n");
> > + return;
> > + }
> > +
> > + domain_context_clear(iommu, dev);
> > +
> > + domain_context_mapping_one(dmar_domain, iommu, bus, devfn);
> > + info->pasid_table_bound = 0;
> > +}
> > #endif /* CONFIG_INTEL_IOMMU_SVM */
> >
> > const struct iommu_ops intel_iommu_ops = {
> > @@ -5266,6 +5381,10 @@ const struct iommu_ops intel_iommu_ops = {
> > .domain_free = intel_iommu_domain_free,
> > .attach_dev = intel_iommu_attach_device,
> > .detach_dev = intel_iommu_detach_device,
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > + .bind_pasid_table = intel_iommu_bind_pasid_table,
> > +	.unbind_pasid_table	= intel_iommu_unbind_pasid_table,
> > +#endif
> > .map = intel_iommu_map,
> > .unmap = intel_iommu_unmap,
> > .map_sg = default_iommu_map_sg,
> > diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h
> > index 21b3e7d..db290b2 100644
> > --- a/include/linux/dma_remapping.h
> > +++ b/include/linux/dma_remapping.h
> > @@ -28,6 +28,7 @@
> >
> > #define CONTEXT_DINVE (1ULL << 8)
> > #define CONTEXT_PRS (1ULL << 9)
> > +#define CONTEXT_NESTE (1ULL << 10)
> > #define CONTEXT_PASIDE (1ULL << 11)
> >
> > struct intel_iommu;
>
[Jacob Pan]
On Fri, 20 Apr 2018 19:25:34 +0100
Jean-Philippe Brucker <[email protected]> wrote:
> On Tue, Apr 17, 2018 at 08:10:47PM +0100, Alex Williamson wrote:
> [...]
> > > + /* Assign guest PASID table pointer and size order */
> > > + ctx_lo = (pasidt_binfo->base_ptr & VTD_PAGE_MASK) |
> > > + (pasidt_binfo->pasid_bits - MIN_NR_PASID_BITS);
> >
> > Where does this IOMMU API interface define that base_ptr is 4K
> > aligned or the format of the PASID table? Are these all
> > standardized or do they vary by host IOMMU? If they're standards,
> > maybe we could note that and the spec which defines them when we
> > declare base_ptr. If they're IOMMU specific then I don't
> > understand how we'll match a user provided PASID table to the
> > requirements and format of the host IOMMU. Thanks,
>
> On SMMUv3 the minimum alignment for base_ptr is 64 bytes, so a guest
> under a vSMMU might pass a pointer that's not aligned on 4k.
>
PASID table pointer for VT-d is 4K aligned.
> Maybe this information could be part of the data passed to userspace
> about IOMMU table formats and features? They're not part of this
> series, but I think we wanted to communicate IOMMU-specific features
> via sysfs.
>
Agreed, I believe Yi Liu is working on a sysfs interface such that QEMU
can match IOMMU model and features.
On Mon, Apr 16, 2018 at 10:48:59PM +0100, Jacob Pan wrote:
> +/**
> + * struct iommu_fault_event - Generic per device fault data
> + *
> + * - PCI and non-PCI devices
> + * - Recoverable faults (e.g. page request), information based on PCI ATS
> + * and PASID spec.
> + * - Un-recoverable faults of device interest
> + * - DMA remapping and IRQ remapping faults
> +
> + * @type contains fault type.
> + * @reason fault reasons if relevant outside IOMMU driver, IOMMU driver internal
> + * faults are not reported
> + * @addr: tells the offending page address
> + * @pasid: contains process address space ID, used in shared virtual memory(SVM)
> + * @rid: requestor ID
You can remove @rid from the comment
> + * @page_req_group_id: page request group index
> + * @last_req: last request in a page request group
> + * @pasid_valid: indicates if the PRQ has a valid PASID
> + * @prot: page access protection flag, e.g. IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
> + * @device_private: if present, uniquely identify device-specific
> + * private data for an individual page request.
> + * @iommu_private: used by the IOMMU driver for storing fault-specific
> + * data. Users should not modify this field before
> + * sending the fault response.
In my opinion you can remove @iommu_private entirely. I proposed this
field so that the IOMMU driver can store fault metadata when reporting
them, and read them back when completing the fault. I'm not using it in
SMMUv3 anymore (instead re-fetching the metadata) and it can't be used
anyway, because the value isn't copied into page_response_msg.
Thanks,
Jean
> + */
> +struct iommu_fault_event {
> + enum iommu_fault_type type;
> + enum iommu_fault_reason reason;
> + u64 addr;
> + u32 pasid;
> + u32 page_req_group_id;
> + u32 last_req : 1;
> + u32 pasid_valid : 1;
> + u32 prot;
> + u64 device_private;
> + u64 iommu_private;
> +};
On Mon, Apr 16, 2018 at 02:49:00PM -0700, Jacob Pan wrote:
> DMA faults can be detected by IOMMU at device level. Adding a pointer
> to struct device allows IOMMU subsystem to report relevant faults
> back to the device driver for further handling.
> For direct assigned device (or user space drivers), guest OS holds
> responsibility to handle and respond per device IOMMU fault.
> Therefore we need fault reporting mechanism to propagate faults beyond
> IOMMU subsystem.
>
> There are two other IOMMU data pointers under struct device today, here
> we introduce iommu_param as a parent pointer such that all device IOMMU
> data can be consolidated here. The idea was suggested here by Greg KH
> and Joerg. The name iommu_param is chosen here since iommu_data has been used.
>
> Suggested-by: Greg Kroah-Hartman <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> Signed-off-by: Jean-Philippe Brucker <[email protected]>
> Link: https://lkml.org/lkml/2017/10/6/81
Reviewed-by: Greg Kroah-Hartman <[email protected]>
On Mon, Apr 16, 2018 at 10:49:01PM +0100, Jacob Pan wrote:
[...]
> +int iommu_register_device_fault_handler(struct device *dev,
> + iommu_dev_fault_handler_t handler,
> + void *data)
> +{
> + struct iommu_param *param = dev->iommu_param;
> +
> + /*
> + * Device iommu_param should have been allocated when device is
> + * added to its iommu_group.
> + */
> + if (!param)
> + return -EINVAL;
> +
> + /* Only allow one fault handler registered for each device */
> + if (param->fault_param)
> + return -EBUSY;
> +
> +	mutex_lock(&param->lock);
> + get_device(dev);
> + param->fault_param =
> + kzalloc(sizeof(struct iommu_fault_param), GFP_ATOMIC);
This can be GFP_KERNEL
[...]
> +int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> +{
> + int ret = 0;
> + struct iommu_fault_event *evt_pending;
> + struct iommu_fault_param *fparam;
> +
> + /* iommu_param is allocated when device is added to group */
> + if (!dev->iommu_param | !evt)
> + return -EINVAL;
> + /* we only report device fault if there is a handler registered */
> + mutex_lock(&dev->iommu_param->lock);
> + if (!dev->iommu_param->fault_param ||
> + !dev->iommu_param->fault_param->handler) {
> + ret = -EINVAL;
> + goto done_unlock;
> + }
> + fparam = dev->iommu_param->fault_param;
> + if (evt->type == IOMMU_FAULT_PAGE_REQ && evt->last_req) {
> + evt_pending = kzalloc(sizeof(*evt_pending), GFP_ATOMIC);
We're expecting the caller to be in thread context at the moment, so this
could be GFP_KERNEL too. You could also use kmemdup() to remove the memcpy below.
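i.e. something like (sketch):

	evt_pending = kmemdup(evt, sizeof(*evt), GFP_KERNEL);
	if (!evt_pending) {
		ret = -ENOMEM;
		goto done_unlock;
	}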
[...]
> +static inline int iommu_register_device_fault_handler(struct device *dev,
> + iommu_dev_fault_handler_t handler,
> + void *data)
> +{
> + return 0;
Should return -ENODEV
Thanks,
Jean
On Mon, Apr 16, 2018 at 10:49:02PM +0100, Jacob Pan wrote:
[...]
> + /*
> + * Check if we have a matching page request pending to respond,
> + * otherwise return -EINVAL
> + */
> +	list_for_each_entry_safe(evt, iter, &param->fault_param->faults, list) {
I don't think you need the "_safe" iterator if you're exiting the loop
right after removing the event.
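i.e. the plain iterator is enough for this loop (sketch):

	list_for_each_entry(evt, &param->fault_param->faults, list) {
		if (evt->pasid == msg->pasid &&
		    msg->page_req_group_id == evt->page_req_group_id) {
			msg->private_data = evt->iommu_private;
			ret = domain->ops->page_response(dev, msg);
			list_del(&evt->list);
			kfree(evt);
			break;
		}
	}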
> + if (evt->pasid == msg->pasid &&
> + msg->page_req_group_id == evt->page_req_group_id) {
> + msg->private_data = evt->iommu_private;
Ah sorry, I missed this bit in my review of 10/22. I thought
private_data would be for evt->device_private. In this case I guess we
can drop device_private, or do you plan to use it?
> + ret = domain->ops->page_response(dev, msg);
> + list_del(&evt->list);
> + kfree(evt);
> + break;
> + }
> + }
> +
> +done_unlock:
> +	mutex_unlock(&param->fault_param->lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_page_response);
> +
> static void __iommu_detach_device(struct iommu_domain *domain,
> struct device *dev)
> {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 32435f9..058b552 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -163,6 +163,55 @@ struct iommu_resv_region {
> #ifdef CONFIG_IOMMU_API
>
> /**
> + * enum page_response_code - Return status of fault handlers, telling the IOMMU
> + * driver how to proceed with the fault.
> + *
> + * @IOMMU_FAULT_STATUS_SUCCESS: Fault has been handled and the page tables
> + * populated, retry the access. This is "Success" in PCI PRI.
> + * @IOMMU_FAULT_STATUS_FAILURE: General error. Drop all subsequent faults from
> + * this device if possible. This is "Response Failure" in PCI PRI.
> + * @IOMMU_FAULT_STATUS_INVALID: Could not handle this fault, don't retry the
> + * access. This is "Invalid Request" in PCI PRI.
> + */
> +enum page_response_code {
> + IOMMU_PAGE_RESP_SUCCESS = 0,
> + IOMMU_PAGE_RESP_INVALID,
> + IOMMU_PAGE_RESP_FAILURE,
> +};
Field names aren't consistent with the comment. I'd go with
IOMMU_PAGE_RESP_*
> +
> +/**
> + * enum page_request_handle_t - Return page request/response handler status
> + *
> + * @IOMMU_FAULT_STATUS_HANDLED: Stop processing the fault, and do not send a
> + * reply to the device.
> + * @IOMMU_FAULT_STATUS_CONTINUE: Fault was not handled. Call the next handler,
> + * or terminate.
> + */
> +enum page_request_handle_t {
> + IOMMU_PAGE_RESP_HANDLED = 0,
> + IOMMU_PAGE_RESP_CONTINUE,
Same here regarding the comment. Here I'd prefer "iommu_fault_status_t"
for the enum and IOMMU_FAULT_STATUS_* for the fields, because they can
be used for unrecoverable faults as well.
But since you're not using these values in your patches, I guess you can
drop this enum? At the moment the return value of fault handler is 0 (as
specified at iommu_register_device_fault_handler), meaning that the
handler always takes ownership of the fault.
It will be easy to extend once we introduce multiple fault handlers that
can either take the fault or pass it to the next one. Existing
implementations will still return 0 - HANDLED, and new ones will return
either HANDLED or CONTINUE.
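Just to illustrate the calling convention I have in mind, a hypothetical
future dispatch loop (the handler list and its fields below are made up, not
part of this series):

	/* Walk registered handlers until one claims the fault */
	list_for_each_entry(handler, &fparam->handlers, list) {
		ret = handler->handler(evt, handler->data);
		if (ret == IOMMU_FAULT_STATUS_HANDLED)
			break;	/* 0 == HANDLED, so existing handlers keep working */
		/* IOMMU_FAULT_STATUS_CONTINUE: try the next handler */
	}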
> +/**
> + * Generic page response information based on PCI ATS and PASID spec.
> + * @addr: servicing page address
> + * @pasid: contains process address space ID
> + * @resp_code: response code
> + * @page_req_group_id: page request group index
> + * @type: group or stream/single page response
@type isn't in the structure
> + * @private_data: uniquely identify device-specific private data for an
> + * individual page response
IOMMU-specific? If it is set by iommu.c, I think we should comment about
it, something like "This field is written by the IOMMU core". Maybe also
rename it to iommu_private to be consistent with iommu_fault_event
> + */
> +struct page_response_msg {
> + u64 addr;
> + u32 pasid;
> + enum page_response_code resp_code;
> + u32 pasid_present:1;
> + u32 page_req_group_id;
> + u64 private_data;
> +};
> +
> +/**
> * struct iommu_ops - iommu ops and capabilities
> * @capable: check capability
> * @domain_alloc: allocate iommu domain
> @@ -195,6 +244,7 @@ struct iommu_resv_region {
> * @bind_pasid_table: bind pasid table pointer for guest SVM
> * @unbind_pasid_table: unbind pasid table pointer and restore defaults
> * @sva_invalidate: invalidate translation caches of shared virtual address
> + * @page_response: handle page request response
> */
> struct iommu_ops {
> bool (*capable)(enum iommu_cap);
> @@ -250,6 +300,7 @@ struct iommu_ops {
> struct device *dev);
> int (*sva_invalidate)(struct iommu_domain *domain,
> struct device *dev, struct tlb_invalidate_info *inv_info);
> + int (*page_response)(struct device *dev, struct page_response_msg *msg);
>
> unsigned long pgsize_bitmap;
> };
> @@ -471,6 +522,7 @@ extern int iommu_unregister_device_fault_handler(struct device *dev);
>
> extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
>
> +extern int iommu_page_response(struct device *dev, struct page_response_msg *msg);
Please also define a -ENODEV function for !CONFIG_IOMMU_API, otherwise
it doesn't build.
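i.e. in include/linux/iommu.h, roughly (sketch):

#else /* !CONFIG_IOMMU_API */

static inline int iommu_page_response(struct device *dev,
				      struct page_response_msg *msg)
{
	return -ENODEV;
}

#endif /* CONFIG_IOMMU_API */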
And I think struct page_response_msg and the enums should be declared
outside #ifdef CONFIG_IOMMU_API. Same for struct iommu_fault_event and
the enums in patch 10/22. Otherwise device drivers will have to add
#ifdefs everywhere their code accesses these structures.
Thanks,
Jean
> extern int iommu_group_id(struct iommu_group *group);
> extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
> extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
> --
> 2.7.4
>
On Mon, 23 Apr 2018 11:11:41 +0100
Jean-Philippe Brucker <[email protected]> wrote:
> On Mon, Apr 16, 2018 at 10:48:59PM +0100, Jacob Pan wrote:
> > +/**
> > + * struct iommu_fault_event - Generic per device fault data
> > + *
> > + * - PCI and non-PCI devices
> > + * - Recoverable faults (e.g. page request), information based on
> > PCI ATS
> > + * and PASID spec.
> > + * - Un-recoverable faults of device interest
> > + * - DMA remapping and IRQ remapping faults
> > +
> > + * @type contains fault type.
> > + * @reason fault reasons if relevant outside IOMMU driver, IOMMU
> > driver internal
> > + * faults are not reported
> > + * @addr: tells the offending page address
> > + * @pasid: contains process address space ID, used in shared
> > virtual memory(SVM)
> > + * @rid: requestor ID
>
> You can remove @rid from the comment
>
thanks, will do.
> > + * @page_req_group_id: page request group index
> > + * @last_req: last request in a page request group
> > + * @pasid_valid: indicates if the PRQ has a valid PASID
> > + * @prot: page access protection flag, e.g. IOMMU_FAULT_READ,
> > IOMMU_FAULT_WRITE
> > + * @device_private: if present, uniquely identify device-specific
> > + * private data for an individual page request.
> > + * @iommu_private: used by the IOMMU driver for storing
> > fault-specific
> > + * data. Users should not modify this field before
> > + * sending the fault response.
>
> In my opinion you can remove @iommu_private entirely. I proposed this
> field so that the IOMMU driver can store fault metadata when reporting
> them, and read them back when completing the fault. I'm not using it
> in SMMUv3 anymore (instead re-fetching the metadata) and it can't be
> used anyway, because the value isn't copied into page_response_msg.
>
In the VT-d case, I use private data to preserve VT-d specific fault data
across request and response, e.g. VT-d has a streaming response type in
addition to the group response (standard). This way, generic code does not
need to know about it.
At device level, since we have to sanitize page response based on
pending page requests, I am doing the private data copy in iommu.c, in
the pending event list.
> Thanks,
> Jean
>
> > + */
> > +struct iommu_fault_event {
> > + enum iommu_fault_type type;
> > + enum iommu_fault_reason reason;
> > + u64 addr;
> > + u32 pasid;
> > + u32 page_req_group_id;
> > + u32 last_req : 1;
> > + u32 pasid_valid : 1;
> > + u32 prot;
> > + u64 device_private;
> > + u64 iommu_private;
> > +};
[Jacob Pan]
On Mon, 23 Apr 2018 12:47:10 +0100
Jean-Philippe Brucker <[email protected]> wrote:
> On Mon, Apr 16, 2018 at 10:49:02PM +0100, Jacob Pan wrote:
> [...]
> > + /*
> > + * Check if we have a matching page request pending to
> > respond,
> > + * otherwise return -EINVAL
> > + */
> > +	list_for_each_entry_safe(evt, iter, &param->fault_param->faults, list) {
>
> I don't think you need the "_safe" iterator if you're exiting the loop
> right after removing the event.
>
you are right, good catch!
> > + if (evt->pasid == msg->pasid &&
> > + msg->page_req_group_id ==
> > evt->page_req_group_id) {
> > + msg->private_data = evt->iommu_private;
>
> Ah sorry, I missed this bit in my review of 10/22. I thought
> private_data would be for evt->device_private. In this case I guess we
> can drop device_private, or do you plan to use it?
>
NP. VT-d still plans to use device_private for the gfx device.
> > + ret = domain->ops->page_response(dev, msg);
> > + list_del(&evt->list);
> > + kfree(evt);
> > + break;
> > + }
> > + }
> > +
> > +done_unlock:
> > +	mutex_unlock(&param->fault_param->lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_page_response);
> > +
> > static void __iommu_detach_device(struct iommu_domain *domain,
> > struct device *dev)
> > {
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 32435f9..058b552 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -163,6 +163,55 @@ struct iommu_resv_region {
> > #ifdef CONFIG_IOMMU_API
> >
> > /**
> > + * enum page_response_code - Return status of fault handlers,
> > telling the IOMMU
> > + * driver how to proceed with the fault.
> > + *
> > + * @IOMMU_FAULT_STATUS_SUCCESS: Fault has been handled and the
> > page tables
> > + * populated, retry the access. This is "Success" in PCI
> > PRI.
> > + * @IOMMU_FAULT_STATUS_FAILURE: General error. Drop all subsequent
> > faults from
> > + * this device if possible. This is "Response Failure" in
> > PCI PRI.
> > + * @IOMMU_FAULT_STATUS_INVALID: Could not handle this fault, don't
> > retry the
> > + * access. This is "Invalid Request" in PCI PRI.
> > + */
> > +enum page_response_code {
> > + IOMMU_PAGE_RESP_SUCCESS = 0,
> > + IOMMU_PAGE_RESP_INVALID,
> > + IOMMU_PAGE_RESP_FAILURE,
> > +};
>
> Field names aren't consistent with the comment. I'd go with
> IOMMU_PAGE_RESP_*
>
will do.
> > +
> > +/**
> > + * enum page_request_handle_t - Return page request/response
> > handler status
> > + *
> > + * @IOMMU_FAULT_STATUS_HANDLED: Stop processing the fault, and do
> > not send a
> > + * reply to the device.
> > + * @IOMMU_FAULT_STATUS_CONTINUE: Fault was not handled. Call the
> > next handler,
> > + * or terminate.
> > + */
> > +enum page_request_handle_t {
> > + IOMMU_PAGE_RESP_HANDLED = 0,
> > + IOMMU_PAGE_RESP_CONTINUE,
>
> Same here regarding the comment. Here I'd prefer
> "iommu_fault_status_t" for the enum and IOMMU_FAULT_STATUS_* for the
> fields, because they can be used for unrecoverable faults as well.
>
> But since you're not using these values in your patches, I guess you
> can drop this enum? At the moment the return value of fault handler
> is 0 (as specified at iommu_register_device_fault_handler), meaning
> that the handler always takes ownership of the fault.
>
> It will be easy to extend once we introduce multiple fault handlers
> that can either take the fault or pass it to the next one. Existing
> implementations will still return 0 - HANDLED, and new ones will
> return either HANDLED or CONTINUE.
>
I shall drop these; they were only put in here to match your patch. I am
looking into converting the VT-d SVM PRQ code to your queued fault patch.
I think it will give both functional and performance benefits.
> > +/**
> > + * Generic page response information based on PCI ATS and PASID
> > spec.
> > + * @addr: servicing page address
> > + * @pasid: contains process address space ID
> > + * @resp_code: response code
> > + * @page_req_group_id: page request group index
> > + * @type: group or stream/single page response
>
> @type isn't in the structure
>
Missed that; I moved it to the IOMMU private data since it is VT-d only.
> > + * @private_data: uniquely identify device-specific private data
> > for an
> > + * individual page response
>
> IOMMU-specific? If it is set by iommu.c, I think we should comment
> about it, something like "This field is written by the IOMMU core".
> Maybe also rename it to iommu_private to be consistent with
> iommu_fault_event
>
sounds good.
> > + */
> > +struct page_response_msg {
> > + u64 addr;
> > + u32 pasid;
> > + enum page_response_code resp_code;
> > + u32 pasid_present:1;
> > + u32 page_req_group_id;
> > + u64 private_data;
> > +};
> > +
> > +/**
> > * struct iommu_ops - iommu ops and capabilities
> > * @capable: check capability
> > * @domain_alloc: allocate iommu domain
> > @@ -195,6 +244,7 @@ struct iommu_resv_region {
> > * @bind_pasid_table: bind pasid table pointer for guest SVM
> > * @unbind_pasid_table: unbind pasid table pointer and restore
> > defaults
> > * @sva_invalidate: invalidate translation caches of shared
> > virtual address
> > + * @page_response: handle page request response
> > */
> > struct iommu_ops {
> > bool (*capable)(enum iommu_cap);
> > @@ -250,6 +300,7 @@ struct iommu_ops {
> > struct device *dev);
> > int (*sva_invalidate)(struct iommu_domain *domain,
> > struct device *dev, struct tlb_invalidate_info
> > *inv_info);
> > + int (*page_response)(struct device *dev, struct
> > page_response_msg *msg);
> > unsigned long pgsize_bitmap;
> > };
> > @@ -471,6 +522,7 @@ extern int
> > iommu_unregister_device_fault_handler(struct device *dev);
> > extern int iommu_report_device_fault(struct device *dev, struct
> > iommu_fault_event *evt);
> > +extern int iommu_page_response(struct device *dev, struct
> > page_response_msg *msg);
>
> Please also define a -ENODEV function for !CONFIG_IOMMU_API, otherwise
> it doesn't build.
>
> And I think struct page_response_msg and the enums should be declared
> outside #ifdef CONFIG_IOMMU_API. Same for struct iommu_fault_event and
> the enums in patch 10/22. Otherwise device drivers will have to add
> #ifdefs everywhere their code accesses these structures.
>
> Thanks,
> Jean
>
good point.
> > extern int iommu_group_id(struct iommu_group *group);
> > extern struct iommu_group *iommu_group_get_for_dev(struct device
> > *dev); extern struct iommu_domain
> > *iommu_group_default_domain(struct iommu_group *); --
> > 2.7.4
> >
[Jacob Pan]
On Mon, Apr 16, 2018 at 10:49:03PM +0100, Jacob Pan wrote:
> When IO page faults are reported outside IOMMU subsystem, the page
> request handler may fail for various reasons. E.g. a guest received
> page requests but did not have a chance to run for a long time. The
> irresponsive behavior could hold off limited resources on the pending
> device.
> There can be hardware or credit based software solutions as suggested
> in the PCI ATS Ch-4. To provide a basic safty net this patch
> introduces a per device deferrable timer which monitors the longest
> pending page fault that requires a response. Proper action such as
> sending failure response code could be taken when timer expires but not
> included in this patch. We need to consider the life cycle of page
> groupd ID to prevent confusion with reused group ID by a device.
> For now, a warning message provides clue of such failure.
>
> Signed-off-by: Jacob Pan <[email protected]>
> Signed-off-by: Ashok Raj <[email protected]>
> ---
> drivers/iommu/iommu.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++--
> include/linux/iommu.h | 4 ++++
> 2 files changed, 62 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 628346c..f6512692 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -799,6 +799,39 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
> }
> EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
>
> +/* Max time to wait for a pending page request */
> +#define IOMMU_PAGE_RESPONSE_MAXTIME (HZ * 10)
> +static void iommu_dev_fault_timer_fn(struct timer_list *t)
> +{
> + struct iommu_fault_param *fparam = from_timer(fparam, t, timer);
> + struct iommu_fault_event *evt, *iter;
> +
> + u64 now;
> +
> + now = get_jiffies_64();
> +
> + /* The goal is to ensure driver or guest page fault handler(via vfio)
> + * send page response on time. Otherwise, limited queue resources
> + * may be occupied by some irresponsive guests or drivers.
By "limited queue resources", do you mean the PRI fault queue in the
pIOMMU device, or something else?
I'm still uneasy about this timeout. We don't really know if the guest
doesn't respond because it is suspended, because it doesn't support PRI
or because it's attempting to kill the host. In the first case, then
receiving and responding to page requests later than 10s should be fine,
right?
Or maybe the guest is doing something weird like fetching pages from
network storage and it occasionally hits a latency oddity. This wouldn't
interrupt the fault queues, because other page requests for the same
device can be serviced in parallel, but if you implement a PRG timeout
it would still unfairly disable PRI.
In the other cases (unsupported PRI or rogue guest) then disabling PRI
using a FAILURE status might be the right thing to do. However, assuming
the device follows the PCI spec it will stop sending page requests once
there are as many PPRs in flight as the allocated credit.
Even though drivers set the PPR credit number arbitrarily (because
finding an ideal number is difficult or impossible), the device stops
issuing faults at some point if the guest is unresponsive, and it won't
grab any more shared resources, or use slots in shared queues. Resources
for pending faults can be cleaned when the device is reset and assigned
to a different guest.
That's for sane endpoints that follow the spec. If on the other hand, we
can't rely on the device implementation to respect our maximum credit
allocation, then we should do the accounting ourselves and reject
incoming faults with INVALID as fast as possible. Otherwise it's an easy
way for a guest to DoS the host and I don't think a timeout solves this
problem (The guest can wait 9 seconds before replying to faults and
meanwhile fill all the queues). In addition the timeout is done on PRGs
but not individual page faults, so a guest could overflow the queues by
triggering lots of page requests without setting the last bit.
If there isn't any possibility of memory leak or abusing resources, I
don't think it's our problem that the guest is excessively slow at
handling page requests. Setting an upper bound to page request latency
might do more harm than good. Ensuring that devices respect the number
of allocated in-flight PPRs is more important in my opinion.
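To illustrate the kind of accounting I mean, something like this in
iommu_report_device_fault() (sketch only; nr_inflight and max_inflight are
hypothetical fields, not in this series):

	mutex_lock(&fparam->lock);
	if (fparam->nr_inflight >= fparam->max_inflight) {
		/* Over the PPR credit: reject right away instead of queueing,
		 * and let the caller send IOMMU_PAGE_RESP_INVALID.
		 */
		mutex_unlock(&fparam->lock);
		return -EBUSY;
	}
	fparam->nr_inflight++;
	mutex_unlock(&fparam->lock);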
> + * When per device pending fault list is not empty, we periodically checks
> + * if any anticipated page response time has expired.
> + *
> + * TODO:
> + * We could do the following if response time expires:
> + * 1. send page response code FAILURE to all pending PRQ
> + * 2. inform device driver or vfio
> + * 3. drain in-flight page requests and responses for this device
> + * 4. clear pending fault list such that driver can unregister fault
> + * handler(otherwise blocked when pending faults are present).
> + */
> + list_for_each_entry_safe(evt, iter, &fparam->faults, list) {
> + if (time_after64(evt->expire, now))
> + pr_err("Page response time expired!, pasid %d gid %d exp %llu now %llu\n",
> + evt->pasid, evt->page_req_group_id, evt->expire, now);
> + }
> + mod_timer(t, now + IOMMU_PAGE_RESPONSE_MAXTIME);
> +}
> +
> /**
> * iommu_register_device_fault_handler() - Register a device fault handler
> * @dev: the device
> @@ -806,8 +839,8 @@ EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
> * @data: private data passed as argument to the handler
> *
> * When an IOMMU fault event is received, call this handler with the fault event
> - * and data as argument. The handler should return 0. If the fault is
> - * recoverable (IOMMU_FAULT_PAGE_REQ), the handler must also complete
> + * and data as argument. The handler should return 0 on success. If the fault is
> + * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also complete
This change might belong in patch 12/22
> * the fault by calling iommu_page_response() with one of the following
> * response code:
> * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
> @@ -848,6 +881,9 @@ int iommu_register_device_fault_handler(struct device *dev,
> param->fault_param->data = data;
> INIT_LIST_HEAD(¶m->fault_param->faults);
>
> +	timer_setup(&param->fault_param->timer, iommu_dev_fault_timer_fn,
> + TIMER_DEFERRABLE);
> +
> mutex_unlock(¶m->lock);
>
> return 0;
> @@ -905,6 +941,8 @@ int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> {
> int ret = 0;
> struct iommu_fault_event *evt_pending;
> + struct timer_list *tmr;
> + u64 exp;
> struct iommu_fault_param *fparam;
>
> /* iommu_param is allocated when device is added to group */
> @@ -925,6 +963,17 @@ int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> goto done_unlock;
> }
> memcpy(evt_pending, evt, sizeof(struct iommu_fault_event));
> + /* Keep track of response expiration time */
> + exp = get_jiffies_64() + IOMMU_PAGE_RESPONSE_MAXTIME;
> + evt_pending->expire = exp;
> +
> + if (list_empty(&fparam->faults)) {
The list_empty() and timer modification need to be inside fparam->lock,
otherwise we race with iommu_page_response
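i.e. roughly (sketch of the fixed ordering):

	mutex_lock(&fparam->lock);
	if (list_empty(&fparam->faults)) {
		/* First pending event: start the response timer under the same
		 * lock that iommu_page_response() takes.
		 */
		WARN_ON(timer_pending(&fparam->timer));
		mod_timer(&fparam->timer, exp);
	}
	list_add_tail(&evt_pending->list, &fparam->faults);
	mutex_unlock(&fparam->lock);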
Thanks,
Jean
> + /* First pending event, start timer */
> + tmr = &dev->iommu_param->fault_param->timer;
> + WARN_ON(timer_pending(tmr));
> + mod_timer(tmr, exp);
> + }
> +
> mutex_lock(&fparam->lock);
> list_add_tail(&evt_pending->list, &fparam->faults);
> mutex_unlock(&fparam->lock);
> @@ -1542,6 +1591,13 @@ int iommu_page_response(struct device *dev,
> }
> }
>
> + /* stop response timer if no more pending request */
> +	if (list_empty(&param->fault_param->faults) &&
> +		timer_pending(&param->fault_param->timer)) {
> +		pr_debug("no pending PRQ, stop timer\n");
> +		del_timer(&param->fault_param->timer);
> + }
On 23/04/18 13:16, Jacob Pan wrote:
> I shall drop these, only put in here to match your patch. i am looking
> into converting vt-d svm prq to your queued fault patch. I think it will
> give both functional and performance benefit.
Thanks, I just rebased my patches onto this series and am hoping to
re-send the IOMMU part early next month.
Jean
On Fri, 20 Apr 2018 19:19:54 +0100
Jean-Philippe Brucker <[email protected]> wrote:
> Hi Jacob,
>
> On Mon, Apr 16, 2018 at 10:48:54PM +0100, Jacob Pan wrote:
> [...]
> > +/**
> > + * enum iommu_inv_granularity - Generic invalidation granularity
> > + *
> > + * When an invalidation request is sent to IOMMU to flush
> > translation caches,
> > + * it may carry different granularity. These granularity levels
> > are not specific
> > + * to a type of translation cache. For an example, PASID selective
> > granularity
> > + * is only applicable to PASID cache invalidation.
>
> I'm still confused by this, I think we should add more definitions
> because architectures tend to use different names. What you call
> "Translations caches" encompasses all caches that can be invalidated
> with this request, right? So all of:
>
yes correct.
> * "TLB" and "DTLB" that cache IOVA->GPA and GPA->PA (TLB is in the
> IOMMU, DTLB is an ATC in an endpoint),
> * "PASID cache" that cache PASID->Translation Table,
> * "Context cache" that cache RID->PASID table
>
> Does this match the model you're using?
>
yes. PASID cache and context caches are in the IOMMU.
> The last name is a bit unfortunate. Since the Arm architecture uses
> the name "context" for what a PASID points to, "Device cache" would
> suit us better but it's not important.
>
Or call it the device context cache. Actually, so far the context cache is
here only for completeness; the expected use case is that QEMU traps the
guest device context cache flush and calls bind_pasid_table.
> I don't understand what you mean by "PASID selective granularity is
> only applicable to PASID cache invalidation", it seems to contradict
> the preceding sentence.
You are right. That was a mistake. I meant to say "These granularity
levels are specific to a type of"
> What if user sends an invalidation with
> IOMMU_INV_TYPE_TLB and IOMMU_INV_GRANU_ALL_PASID? Doesn't this remove
> from the TLBs all entries with the given PASID?
>
No, this is meant to invalidate all PASIDs of a given domain ID. I need to
correct the description.
The dilemma here is mapping model-specific fields into a generic list; not
all combinations are legal.
> > + * This enum is a collection of granularities for all types of
> > translation
> > + * caches. The idea is to make it easy for IOMMU model specific
> > driver do
> > + * conversion from generic to model specific value.
> > + */
> > +enum iommu_inv_granularity {
>
> In patch 9, inv_type_granu_map has some valid fields with granularity
> == 0. Does it mean "invalidate all caches"?
>
> I don't think user should ever be allowed to invalidate caches entries
> of devices and domains it doesn't own.
>
Agreed, I removed the global granu to avoid device invalidation beyond the
device itself. But I missed some of the fields in inv_type_granu_map[].
> > + IOMMU_INV_GRANU_DOMAIN = 1, /* all TLBs associated
> > with a domain */
> > + IOMMU_INV_GRANU_DEVICE, /* caching
> > structure associated with a
> > + * device ID
> > + */
> > + IOMMU_INV_GRANU_DOMAIN_PAGE, /* address range with
> > a domain */
>
> > + IOMMU_INV_GRANU_ALL_PASID, /* cache of a given
> > PASID */
>
> If this corresponds to QI_GRAN_ALL_ALL in patch 9, the comment should
> be "Cache of all PASIDs"? Or maybe "all entries for all PASIDs"? Is it
> different from GRANU_DOMAIN then?
QI_GRAN_ALL_ALL maps to VT-d spec 6.5.2.4, which invalidates all ext
TLB cache within a domain. It could reuse GRANU_DOMAIN but I was
also trying to match the naming convention in the spec.
> > + IOMMU_INV_GRANU_PASID_SEL, /* only invalidate
> > specified PASID */ +
> > + IOMMU_INV_GRANU_NG_ALL_PASID, /* non-global within
> > all PASIDs */
> > + IOMMU_INV_GRANU_NG_PASID, /* non-global within a
> > PASIDs */
>
> Are the "NG" variant needed since there is a
> IOMMU_INVALIDATE_GLOBAL_PAGE below? We should drop either flag or
> granule.
>
> FWIW I'm starting to think more granule options is actually better
> than flags, because it flattens the combinations and keeps them to two
> dimensions, that we can understand and explain with a table.
>
> > + IOMMU_INV_GRANU_PAGE_PASID, /* page-selective
> > within a PASID */
>
> Maybe this should be called "NG_PAGE_PASID",
Sure. I was thinking page range already implies non-global pages.
> and "DOMAIN_PAGE" should
> instead be "PAGE_PASID". If I understood their meaning correctly, it
> would be more consistent with the rest.
>
I am trying not to mix granu between requests w/ PASID and w/o.
DOMAIN_PAGE is meant to be for requests w/o PASID.
> > + IOMMU_INV_NR_GRANU,
> > +};
> > +
> > +/** enum iommu_inv_type - Generic translation cache types for
> > invalidation
> > + *
> > + * Invalidation requests sent to IOMMU may indicate which
> > translation cache
> > + * to be operated on.
> > + * Combined with enum iommu_inv_granularity, model specific driver
> > can do a
> > + * simple lookup to convert generic type to model specific value.
> > + */
> > +enum iommu_inv_type {
>
> These should be flags (1 << 0), (1 << 1) etc, since IOMMUs will want
> to invalidate multiple caches at once (at least DTLB and TLB). You
> could then do for_each_set_bit in the driver
>
I was thinking the invalidation would be inclusive, as we discussed earlier,
last year :).
TLB includes DTLB; PASID cache includes TLB and DTLB. I need to document it
better.
> > + IOMMU_INV_TYPE_DTLB, /* device IOTLB */
> > + IOMMU_INV_TYPE_TLB, /* IOMMU paging structure cache
> > */
> > + IOMMU_INV_TYPE_PASID, /* PASID cache */
> > + IOMMU_INV_TYPE_CONTEXT, /* device context entry
> > cache */
> > + IOMMU_INV_NR_TYPE
> > +};
>
> We need to summarize and explain valid combinations, because reading
> inv_type_granu_map and inv_type_granu_table is a bit tedious. I tried
> to reproduce inv_type_granu_map here (Cell format is PASID_TAGGED /
> !PASID_TAGGED). Could you check if this matches your model?
Great summary, thanks.
>
> type | DTLB | TLB | PASID | CONTEXT
> granule | | | |
> -----------------+-----------+-----------+-----------+-----------
> - | / Y | / Y | | / Y
what is this row?
> DOMAIN | | / Y | | / Y
> DEVICE | | | | / Y
> DOMAIN_PAGE | | / Y | |
> ALL_PASID | Y | Y | |
> PASID_SEL | Y | | Y |
> NG_ALL_PASID | | Y | Y |
> NG_PASID | | Y | |
> PAGE_PASID | | Y | |
>
Mostly matches what I intended for VT-d. Just one thing on the PASID
column: all PASIDs associated with a given domain ID can go either
NG_ALL_PASID (as in your table) or ALL_PASID.
Here is what I plan to change in the comments to reflect what you
have in the table above.
Can I also copy your table in the next version?
enum iommu_inv_granularity {
IOMMU_INV_GRANU_DOMAIN = 1, /* IOTLBs and device context
* cache associated with a
* domain ID
*/
IOMMU_INV_GRANU_DEVICE, /* device context cache
* associated with a device ID
*/
IOMMU_INV_GRANU_DOMAIN_PAGE, /* IOTLBs associated with
* address range of a
* given domain ID
*/
IOMMU_INV_GRANU_ALL_PASID, /* DTLB or IOTLB of all
* PASIDs associated to a
* given domain ID
*/
IOMMU_INV_GRANU_PASID_SEL, /* DTLB and PASID cache
* associated to a PASID
*/
IOMMU_INV_GRANU_NG_ALL_PASID, /* IOTLBs of non-global
* pages for all PASIDs for a
* given domain ID
*/
IOMMU_INV_GRANU_NG_PASID, /* IOTLBs of non-global
* pages for a given PASID
*/
IOMMU_INV_GRANU_PAGE_PASID, /* IOTLBs of selected page
* range within a PASID
*/
> There is no intersection between PASID_TAGGED and !PASID_TAGGED (Y/Y),
> so the flag might not be needed.
>
right
> I think the API can be more relaxed. Each IOMMU driver can add more
> restrictions, but I think the SMMU can support these combinations:
>
> | DTLB | TLB | PASID | CONTEXT
> --------------+-----------+-----------+-----------+-----------
> DOMAIN | Y | Y | Y | Y
> DEVICE | Y | Y | Y | Y
> DOMAIN_PAGE | Y | Y | |
> ALL_PASID | Y | Y | Y |
> PASID_SEL | Y | Y | Y |
> NG_ALL_PASID | Y | Y | Y |
> NG_PASID | Y | Y | Y |
> PAGE_PASID | Y | Y | |
>
> Two are missing in the PASID column because it doesn't make any sense
> to target the PASID cache with a page-selective invalidation. And for
> the context cache, we can only invalidate per device or per domain.
> So I think this is the biggest set of allowed combinations.
>
right, not all combinations are allowed, it is up to each IOMMU driver
to convert and sanitize based on a built-in valid map. e.g. in my vt-d
patch inv_type_granu_map[]
>
> > +
> > +/**
> > + * Translation cache invalidation header that contains mandatory
> > meta data.
> > + * @version: info format version, expecting future extesions
> > + * @type: type of translation cache to be invalidated
> > + */
> > +struct tlb_invalidate_hdr {
> > + __u32 version;
> > +#define TLB_INV_HDR_VERSION_1 1
> > + enum iommu_inv_type type;
> > +};
> > +
> > +/**
> > + * Translation cache invalidation information, contains generic
> > IOMMU
> > + * data which can be parsed based on model ID by model specific
> > drivers.
> > + *
> > + * @granularity: requested invalidation granularity, type
> > dependent
> > + * @size: 2^size of 4K pages, 0 for 4k, 9 for 2MB,
> > etc.
>
> Maybe start the size at 1 byte, we don't know what sort of granularity
> future architectures will offer.
>
I can't see any case where we would be operating at sub-page granularity.
Why would anyone cache a translation for 1 byte? That is too much overhead.
> > + * @pasid: processor address space ID value per PCI
> > spec.
> > + * @addr: page address to be invalidated
> > + * @flags IOMMU_INVALIDATE_PASID_TAGGED: DMA with PASID
> > tagged,
> > + * @pasid validity
> > can be
> > + * deduced from
> > @granularity
>
> This is really hurting my brain... Two dimensions was already
> difficult, but I can't follow anymore. What does PASID_TAGGED say if
> not "@pasid is valid"? I thought VT-d mandated PASID for nested
> translation?
>
You already have 3-D in your granu table :), this is the same as your
"Y" and "/Y" fields.
I need the PASID_TAGGED flag to differentiate different IOTLB types.
PASID_TAGGED is used only when @pasid is valid, which is implied by the
granularity, e.g. certain granularities are only allowed in the
PASID-tagged case.
> > + * IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries
> > + * IOMMU_INVALIDATE_GLOBAL_PAGE: global pages
> > + *
> > + */
> > +struct tlb_invalidate_info {
> > + struct tlb_invalidate_hdr hdr;
> > + enum iommu_inv_granularity granularity;
> > + __u32 flags;
> > +#define IOMMU_INVALIDATE_NO_PASID (1 << 0)
>
> I suggested NO_PASID because Arm can have pasid-tagged and one
> no-pasid address spaces within the same domain in DSS0 mode. AMD
> would also need this for their GIoV mode, if I understood it
> correctly.
>
> When specifying NO_PASID, the user invalidates mappings for the
> address space that doesn't have a PASID, but the same granularities
> as PASID contexts apply. I now think we can remove the NO_PASID flag
> and avoid a lot of confusion.
>
> The GIoV and DSS0 modes are implemented by reserving entry 0 of the
> PASID table for NO_PASID translations. Given that the guest specifies
> this mode at BIND_TABLE time, the host understands that when the guest
> invalidates PASID 0, if GIoV or DSS0 was enabled, then the
> invalidation applies to the NO_PASID context. So you can drop this
> flag in my opinion.
>
Sounds good. PASID 0 is used for requests w/o PASID, so both GIOVA and SVA
have PASIDs. Will drop.
> Thanks,
> Jean
[Jacob Pan]
On Mon, 23 Apr 2018 12:30:13 +0100
Jean-Philippe Brucker <[email protected]> wrote:
> On Mon, Apr 16, 2018 at 10:49:01PM +0100, Jacob Pan wrote:
> [...]
> > +int iommu_register_device_fault_handler(struct device *dev,
> > + iommu_dev_fault_handler_t
> > handler,
> > + void *data)
> > +{
> > + struct iommu_param *param = dev->iommu_param;
> > +
> > + /*
> > + * Device iommu_param should have been allocated when
> > device is
> > + * added to its iommu_group.
> > + */
> > + if (!param)
> > + return -EINVAL;
> > +
> > + /* Only allow one fault handler registered for each device
> > */
> > + if (param->fault_param)
> > + return -EBUSY;
> > +
> > + mutex_lock(¶m->lock);
> > + get_device(dev);
> > + param->fault_param =
> > + kzalloc(sizeof(struct iommu_fault_param),
> > GFP_ATOMIC);
>
> This can be GFP_KERNEL
>
yes, will change.
> [...]
> > +int iommu_report_device_fault(struct device *dev, struct
> > iommu_fault_event *evt) +{
> > + int ret = 0;
> > + struct iommu_fault_event *evt_pending;
> > + struct iommu_fault_param *fparam;
> > +
> > + /* iommu_param is allocated when device is added to group
> > */
> > + if (!dev->iommu_param | !evt)
> > + return -EINVAL;
> > + /* we only report device fault if there is a handler
> > registered */
> > + mutex_lock(&dev->iommu_param->lock);
> > + if (!dev->iommu_param->fault_param ||
> > + !dev->iommu_param->fault_param->handler) {
> > + ret = -EINVAL;
> > + goto done_unlock;
> > + }
> > + fparam = dev->iommu_param->fault_param;
> > + if (evt->type == IOMMU_FAULT_PAGE_REQ && evt->last_req) {
> > + evt_pending = kzalloc(sizeof(*evt_pending),
> > GFP_ATOMIC);
>
> We're expecting caller to be a thread at the moment, so this could be
> GFP_KERNEL too. You could also use kmemdup to remove the memcpy below
>
Good idea, will do.
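For illustration, a minimal sketch of that change in
iommu_report_device_fault(), assuming the surrounding code from this patch:

	if (evt->type == IOMMU_FAULT_PAGE_REQ && evt->last_req) {
		/*
		 * The caller is expected to be a thread, so GFP_KERNEL is
		 * fine, and kmemdup() replaces the kzalloc() + memcpy() pair.
		 */
		evt_pending = kmemdup(evt, sizeof(*evt_pending), GFP_KERNEL);
		if (!evt_pending) {
			ret = -ENOMEM;
			goto done_unlock;
		}
	}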
> [...]
> > +static inline int iommu_register_device_fault_handler(struct
> > device *dev,
> > +
> > iommu_dev_fault_handler_t handler,
> > + void *data)
> > +{
> > + return 0;
>
> Should return -ENODEV
>
right. thanks.
> Thanks,
> Jean
[Jacob Pan]
On Mon, 23 Apr 2018 16:36:23 +0100
Jean-Philippe Brucker <[email protected]> wrote:
> On Mon, Apr 16, 2018 at 10:49:03PM +0100, Jacob Pan wrote:
> > When IO page faults are reported outside IOMMU subsystem, the page
> > request handler may fail for various reasons. E.g. a guest received
> > page requests but did not have a chance to run for a long time. The
> > irresponsive behavior could hold off limited resources on the
> > pending device.
> > There can be hardware or credit based software solutions as
> > suggested in the PCI ATS Ch-4. To provide a basic safety net this
> > patch introduces a per device deferrable timer which monitors the
> > longest pending page fault that requires a response. Proper action
> > such as sending failure response code could be taken when timer
> > expires but not included in this patch. We need to consider the
> > life cycle of page group ID to prevent confusion with reused group
> > ID by a device. For now, a warning message provides clue of such
> > failure.
> >
> > Signed-off-by: Jacob Pan <[email protected]>
> > Signed-off-by: Ashok Raj <[email protected]>
> > ---
> > drivers/iommu/iommu.c | 60
> > +++++++++++++++++++++++++++++++++++++++++++++++++--
> > include/linux/iommu.h | 4 ++++ 2 files changed, 62 insertions(+),
> > 2 deletions(-)
> >
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 628346c..f6512692 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -799,6 +799,39 @@ int iommu_group_unregister_notifier(struct
> > iommu_group *group, }
> > EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
> >
> > +/* Max time to wait for a pending page request */
> > +#define IOMMU_PAGE_RESPONSE_MAXTIME (HZ * 10)
> > +static void iommu_dev_fault_timer_fn(struct timer_list *t)
> > +{
> > + struct iommu_fault_param *fparam = from_timer(fparam, t,
> > timer);
> > + struct iommu_fault_event *evt, *iter;
> > +
> > + u64 now;
> > +
> > + now = get_jiffies_64();
> > +
> > + /* The goal is to ensure driver or guest page fault
> > handler(via vfio)
> > + * send page response on time. Otherwise, limited queue
> > resources
> > + * may be occupied by some irresponsive guests or
> > drivers.
>
> By "limited queue resources", do you mean the PRI fault queue in the
> pIOMMU device, or something else?
>
I am referring to the device resource for tracking pending PRQs. Intel
IOMMU does not track pending PRQs.
>
> I'm still uneasy about this timeout. We don't really know if the guest
> doesn't respond because it is suspended, because it doesn't support
> PRI or because it's attempting to kill the host. In the first case,
> then receiving and responding to page requests later than 10s should
> be fine, right?
>
When a guest is going into system suspend, the suspend callbacks of the
assigned device driver and the vIOMMU should be called. I think the vIOMMU
should propagate the iommu_suspend call to the host IOMMU driver, thereby
terminating all the pending PRQs. We can make the timeout adjustable.
> Or maybe the guest is doing something weird like fetching pages from
> network storage and it occasionally hits a latency oddity. This
> wouldn't interrupt the fault queues, because other page requests for
> the same device can be serviced in parallel, but if you implement a
> PRG timeout it would still unfairly disable PRI.
>
The timeout here is intended as a broader, basic safety net at the
per-device level. We can implement a finer-grained safety mechanism, but I
suspect it is better done in HW.
> In the other cases (unsupported PRI or rogue guest) then disabling PRI
> using a FAILURE status might be the right thing to do. However,
> assuming the device follows the PCI spec it will stop sending page
> requests once there are as many PPRs in flight as the allocated
> credit.
>
Agreed, here I am not taking any action. There may be a need to drain
in-flight requests.
> Even though drivers set the PPR credit number arbitrarily (because
> finding an ideal number is difficult or impossible), the device stops
> issuing faults at some point if the guest is unresponsive, and it
> won't grab any more shared resources, or use slots in shared queues.
> Resources for pending faults can be cleaned when the device is reset
> and assigned to a different guest.
>
>
> That's for sane endpoints that follow the spec. If on the other hand,
> we can't rely on the device implementation to respect our maximum
> credit allocation, then we should do the accounting ourselves and
> reject incoming faults with INVALID as fast as possible. Otherwise
> it's an easy way for a guest to DoS the host and I don't think a
> timeout solves this problem (The guest can wait 9 seconds before
> replying to faults and meanwhile fill all the queues). In addition
> the timeout is done on PRGs but not individual page faults, so a
> guest could overflow the queues by triggering lots of page requests
> without setting the last bit.
>
>
> If there isn't any possibility of memory leak or abusing resources, I
> don't think it's our problem that the guest is excessively slow at
> handling page requests. Setting an upper bound to page request latency
> might do more harm than good. Ensuring that devices respect the number
> of allocated in-flight PPRs is more important in my opinion.
>
How about we have a really long timeout, e.g. 1 min, similar to the device
invalidate response timeout in the ATS spec, just for basic safety and
diagnosis. Optionally, we could have a quota in parallel.
> > + * When per device pending fault list is not empty, we
> > periodically checks
> > + * if any anticipated page response time has expired.
> > + *
> > + * TODO:
> > + * We could do the following if response time expires:
> > + * 1. send page response code FAILURE to all pending PRQ
> > + * 2. inform device driver or vfio
> > + * 3. drain in-flight page requests and responses for this
> > device
> > + * 4. clear pending fault list such that driver can
> > unregister fault
> > + * handler(otherwise blocked when pending faults are
> > present).
> > + */
> > + list_for_each_entry_safe(evt, iter, &fparam->faults, list)
> > {
> > + if (time_after64(evt->expire, now))
> > + pr_err("Page response time expired!, pasid
> > %d gid %d exp %llu now %llu\n",
> > + evt->pasid,
> > evt->page_req_group_id, evt->expire, now);
> > + }
> > + mod_timer(t, now + IOMMU_PAGE_RESPONSE_MAXTIME);
> > +}
> > +
> > /**
> > * iommu_register_device_fault_handler() - Register a device fault
> > handler
> > * @dev: the device
> > @@ -806,8 +839,8 @@
> > EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
> > * @data: private data passed as argument to the handler
> > *
> > * When an IOMMU fault event is received, call this handler with
> > the fault event
> > - * and data as argument. The handler should return 0. If the fault
> > is
> > - * recoverable (IOMMU_FAULT_PAGE_REQ), the handler must also
> > complete
> > + * and data as argument. The handler should return 0 on success.
> > If the fault is
> > + * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also
> > complete
>
> This change might belong in patch 12/22
>
Good point, will fix
> > * the fault by calling iommu_page_response() with one of the
> > following
> > * response code:
> > * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
> > @@ -848,6 +881,9 @@ int iommu_register_device_fault_handler(struct
> > device *dev, param->fault_param->data = data;
> > INIT_LIST_HEAD(¶m->fault_param->faults);
> >
> > + timer_setup(¶m->fault_param->timer,
> > iommu_dev_fault_timer_fn,
> > + TIMER_DEFERRABLE);
> > +
> > mutex_unlock(¶m->lock);
> >
> > return 0;
> > @@ -905,6 +941,8 @@ int iommu_report_device_fault(struct device
> > *dev, struct iommu_fault_event *evt) {
> > int ret = 0;
> > struct iommu_fault_event *evt_pending;
> > + struct timer_list *tmr;
> > + u64 exp;
> > struct iommu_fault_param *fparam;
> >
> > /* iommu_param is allocated when device is added to group
> > */ @@ -925,6 +963,17 @@ int iommu_report_device_fault(struct device
> > *dev, struct iommu_fault_event *evt) goto done_unlock;
> > }
> > memcpy(evt_pending, evt, sizeof(struct
> > iommu_fault_event));
> > + /* Keep track of response expiration time */
> > + exp = get_jiffies_64() +
> > IOMMU_PAGE_RESPONSE_MAXTIME;
> > + evt_pending->expire = exp;
> > +
> > + if (list_empty(&fparam->faults)) {
>
> The list_empty() and timer modification need to be inside
> fparam->lock, otherwise we race with iommu_page_response
>
Right, thanks.
> Thanks,
> Jean
>
> > + /* First pending event, start timer */
> > + tmr =
> > &dev->iommu_param->fault_param->timer;
> > + WARN_ON(timer_pending(tmr));
> > + mod_timer(tmr, exp);
> > + }
> > +
> > mutex_lock(&fparam->lock);
> > list_add_tail(&evt_pending->list, &fparam->faults);
> > mutex_unlock(&fparam->lock);
> > @@ -1542,6 +1591,13 @@ int iommu_page_response(struct device *dev,
> > }
> > }
> >
> > + /* stop response timer if no more pending request */
> > + if (list_empty(¶m->fault_param->faults) &&
> > + timer_pending(¶m->fault_param->timer)) {
> > + pr_debug("no pending PRQ, stop timer\n");
> > + del_timer(¶m->fault_param->timer);
> > + }
[Jacob Pan]
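A rough sketch of the reordering Jean suggests above, with the list_empty()
check and mod_timer() moved under fparam->lock in iommu_report_device_fault()
so it cannot race with iommu_page_response() (field names as in the patch):

	mutex_lock(&fparam->lock);
	if (list_empty(&fparam->faults)) {
		/* First pending event, start the response timer */
		WARN_ON(timer_pending(&fparam->timer));
		mod_timer(&fparam->timer, exp);
	}
	list_add_tail(&evt_pending->list, &fparam->faults);
	mutex_unlock(&fparam->lock);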
On 23/04/18 21:43, Jacob Pan wrote:
[...]
>> The last name is a bit unfortunate. Since the Arm architecture uses
>> the name "context" for what a PASID points to, "Device cache" would
>> suit us better but it's not important.
>>
> or call it device context cache. actually so far context cache is here
> only for completeness purpose. the expected use case is that QEMU traps
> guest device context cache flush and call bind_pasid_table.
Right, makes sense
[...]
>> If this corresponds to QI_GRAN_ALL_ALL in patch 9, the comment should
>> be "Cache of all PASIDs"? Or maybe "all entries for all PASIDs"? Is it
>> different from GRANU_DOMAIN then?
> QI_GRAN_ALL_ALL maps to VT-d spec 6.5.2.4, which invalidates all ext
> TLB cache within a domain. It could reuse GRANU_DOMAIN but I was
> also trying to match the naming convention in the spec.
Sorry I don't quite understand the difference between TLB and ext TLB
invalidation. Can an ext TLB invalidation do everything a TLB can do
plus some additional parameters (introduced in more recent version of
the spec), or do they have distinct purposes? I'm trying to understand
why it needs to be user-visible
>>> + IOMMU_INV_GRANU_PASID_SEL, /* only invalidate
>>> specified PASID */ +
>>> + IOMMU_INV_GRANU_NG_ALL_PASID, /* non-global within
>>> all PASIDs */
>>> + IOMMU_INV_GRANU_NG_PASID, /* non-global within a
>>> PASIDs */
>>
>> Are the "NG" variant needed since there is a
>> IOMMU_INVALIDATE_GLOBAL_PAGE below? We should drop either flag or
>> granule.
>>
>> FWIW I'm starting to think more granule options is actually better
>> than flags, because it flattens the combinations and keeps them to two
>> dimensions, that we can understand and explain with a table.
>>
>>> + IOMMU_INV_GRANU_PAGE_PASID, /* page-selective
>>> within a PASID */
>>
>> Maybe this should be called "NG_PAGE_PASID",
> Sure. I was thinking page range already implies non-global pages.
>> and "DOMAIN_PAGE" should
>> instead be "PAGE_PASID". If I understood their meaning correctly, it
>> would be more consistent with the rest.
>>
> I am trying not to mix granu between request w/ PASID and w/o.
> DOMAIN_PAGE meant to be for request w/o PASID.
Is the distinction necessary? I understand the IOMMU side might offer
many possibilities for invalidation, but the user probably doesn't need
all of them. It might be easier to document, upstream and maintain if we
only specify what's currently needed by users (what does QEMU VT-d use?)
Others can always extend it by increasing the version.
Do you think that this invalidation message will be used outside of
BIND_PASID_TABLE context? I can't see an other use but who knows. At the
moment requests w/o PASID are managed with VFIO_IOMMU_MAP/UNMAP_DMA,
which doesn't require invalidation. And in a BIND_PASID_TABLE context,
IOMMUs requests w/o PASID are just a special case using PASID 0 (for Arm
and AMD) so I suppose they'll use the same invalidation commands as
requests w/ PASID.
>>> + IOMMU_INV_NR_GRANU,
>>> +};
>>> +
>>> +/** enum iommu_inv_type - Generic translation cache types for
>>> invalidation
>>> + *
>>> + * Invalidation requests sent to IOMMU may indicate which
>>> translation cache
>>> + * to be operated on.
>>> + * Combined with enum iommu_inv_granularity, model specific driver
>>> can do a
>>> + * simple lookup to convert generic type to model specific value.
>>> + */
>>> +enum iommu_inv_type {
>>
>> These should be flags (1 << 0), (1 << 1) etc, since IOMMUs will want
>> to invalidate multiple caches at once (at least DTLB and TLB). You
>> could then do for_each_set_bit in the driver
>>
> I was thinking the invalidation to be inclusive as we discussed earlier
> ,last year :).
> TLB includes DLTB
> PASID cache includes TLB and DTLB. I need to document it better.
Ah right, I guess I was stuck on an old version :) Then the current
values make sense
>>> + IOMMU_INV_TYPE_DTLB, /* device IOTLB */
>>> + IOMMU_INV_TYPE_TLB, /* IOMMU paging structure cache
>>> */
>>> + IOMMU_INV_TYPE_PASID, /* PASID cache */
>>> + IOMMU_INV_TYPE_CONTEXT, /* device context entry
>>> cache */
>>> + IOMMU_INV_NR_TYPE
>>> +};
>>
>> We need to summarize and explain valid combinations, because reading
>> inv_type_granu_map and inv_type_granu_table is a bit tedious. I tried
>> to reproduce inv_type_granu_map here (Cell format is PASID_TAGGED /
>> !PASID_TAGGED). Could you check if this matches your model?
> great summary. thanks
>>
>> type | DTLB | TLB | PASID | CONTEXT
>> granule | | | |
>> -----------------+-----------+-----------+-----------+-----------
>> - | / Y | / Y | | / Y
> what is this row?
Hm, the arrays in patch 9 have 9 entries, this is entry 0 (for which I
asked if it corresponded to "invalidate all caches" in my previous
reply).
>> DOMAIN | | / Y | | / Y
>> DEVICE | | | | / Y
>> DOMAIN_PAGE | | / Y | |
>> ALL_PASID | Y | Y | |
>> PASID_SEL | Y | | Y |
>> NG_ALL_PASID | | Y | Y |
>> NG_PASID | | Y | |
>> PAGE_PASID | | Y | |
>>
> Mostly match what I intended for VT-d. Just one thing on the PASID
> column, all PASID associated with a given domain ID can go either
> NG_ALL_PASID (as in your table) or ALL_PASID.
>
> Here is what I plan to change in comments that can reflect what you
> have in the table above.
> Can I also copy your table in the next version?
Sure
(For the patch, putting all descriptions in a single comment at the top
of the enum would be better)
> enum iommu_inv_granularity {
> IOMMU_INV_GRANU_DOMAIN = 1, /* IOTLBs and device context
> * cache associated with a
> * domain ID
> */
>
> IOMMU_INV_GRANU_DEVICE, /* device context cache
> * associated with a device ID
> */
>
> IOMMU_INV_GRANU_DOMAIN_PAGE, /* IOTLBs associated with
> * address range of a
> * given domain ID
> */
Another nit: it might be easier to understand if we sort these values by
"coarseness". DOMAIN_PAGE seems finer than ALL_PASID or PASID_SEL
because it doesn't nuke all TLB entries of an address space, so might
make more sense to move it at the bottom. Though as said above, I don't
think we should distinguish between DOMAIN_PAGE and PAGE_PASID
>
> IOMMU_INV_GRANU_ALL_PASID, /* DTLB or IOTLB of all
> * PASIDs associated to a
> * given domain ID
> */
>
> IOMMU_INV_GRANU_PASID_SEL, /* DTLB and PASID cache
> * associated to a PASID
> */
This comment has "DTLB", the previous had "DTLB or IOTLB", and the first
one had "IOTLBs". But doesn't the TLB selection, either DTLB or
"DTLB+IOTLB", depend on iommu_inv_type? So maybe saying "TLB entries"
everywhere in the granule comments is good enough?
> IOMMU_INV_GRANU_NG_ALL_PASID, /* IOTLBs of non-global
> * pages for all PASIDs for a
> * given domain ID
> */
>
> IOMMU_INV_GRANU_NG_PASID, /* IOTLBs of non-global
> * pages for a given PASID
> */
>
> IOMMU_INV_GRANU_PAGE_PASID, /* IOTLBs of selected page
> * range within a PASID
> */
I think the other comments are fine
[...]
>>> + * @size: 2^size of 4K pages, 0 for 4k, 9 for 2MB,
>>> etc.
>>
>> Maybe start the size at 1 byte, we don't know what sort of granularity
>> future architectures will offer.
>>
> I can't see any case we are not operating at sub-page size. why would
> anyone cache translation for 1 byte, that is too much overhead.
1 byte is probably overkill, but why not 2048 for TCP packets... we
don't really know what strange ideas people will come up with. But
you're right, it's unlikely.
However I thought about this more and we are actually missing something.
Some architectures will have arbitrary ranges in their invalidation
commands, they might want to invalidate three pages at a time without
sending three invalidation commands. Having a page granularity is good,
because users might want to invalidate huge TLB, but we should also have
a number of pages.
Could you add a nr_pages parameter?
@size: one page is 2^size (*4k?) bytes
@nr_pages: number of pages to invalidate
u8 size
u64 nr_pages
Sorry about the late changes, I don't want to slow this down and I think
we're nearly there, but this last point seems important.
Thanks,
Jean
> From: Jean-Philippe Brucker [mailto:[email protected]]
> Sent: Saturday, April 28, 2018 2:08 AM
>
> [...]
> >> If this corresponds to QI_GRAN_ALL_ALL in patch 9, the comment should
> >> be "Cache of all PASIDs"? Or maybe "all entries for all PASIDs"? Is it
> >> different from GRANU_DOMAIN then?
> > QI_GRAN_ALL_ALL maps to VT-d spec 6.5.2.4, which invalidates all ext
> > TLB cache within a domain. It could reuse GRANU_DOMAIN but I was
> > also trying to match the naming convention in the spec.
>
> Sorry I don't quite understand the difference between TLB and ext TLB
> invalidation. Can an ext TLB invalidation do everything a TLB can do
> plus some additional parameters (introduced in more recent version of
> the spec), or do they have distinct purposes? I'm trying to understand
> why it needs to be user-visible
Distinct purposes, though with some overlapping effect:
IOTLB invalidate is mainly for the 2nd-level cache, with granularities
(global/domain/PSI) and a side effect on 1st-level and nested caches
(global/domain).
Extended IOTLB invalidate is specifically for 1st-level and nested
caches, with granularities (per-domain: all PASIDs/per PASID/PSI).
Thanks
Kevin
On 25/04/18 16:37, Jacob Pan wrote:
>> In the other cases (unsupported PRI or rogue guest) then disabling PRI
>> using a FAILURE status might be the right thing to do. However,
>> assuming the device follows the PCI spec it will stop sending page
>> requests once there are as many PPRs in flight as the allocated
>> credit.
>>
> Agreed, here I am not taking any actions. There may be need to drain
> in-fly requests.
Right, as long as we first ensure that no new fault is generated (by
using a Response Failure). Though in my opinion not taking action might
be the safest option :)
Another thought: currently the comment in iommu.h says
"@IOMMU_FAULT_STATUS_FAILURE: General error. Drop all subsequent faults
from this device if possible. This is "Response Failure" in PCI PRI."
I wonder if we should simply say "Drop all subsequent faults from the
device". Even if the PCI device doesn't properly implement PRI, the
IOMMU driver should set a "PRI disabled" bit in the device data that
prevents it from reporting new faults and flooding the queue.
Anyway, it's a small detail that could go in a future patch series.
>> If there isn't any possibility of memory leak or abusing resources, I
>> don't think it's our problem that the guest is excessively slow at
>> handling page requests. Setting an upper bound to page request latency
>> might do more harm than good. Ensuring that devices respect the number
>> of allocated in-flight PPRs is more important in my opinion.
>>
> How about we have a really long timeout, e.g. 1 min similar to device
> invalidate response timeout in ATS spec., just for basic safety and
> diagnosis. Optionally, we could have quota in parallel.
I agree that for development a timeout is useful. It might be worth
adding it as an option to the IOMMU module instead of a define. Perhaps
a number of seconds, 10 being the default and 0 disabling the timeout?
Otherwise we would probably end up with a succession of patches
incrementing the timeout by arbitrary values, if people find it
inconvenient.
Thanks,
Jean
Hi,
I noticed a couple issues when testing
On 16/04/18 22:49, Jacob Pan wrote:
> +int iommu_register_device_fault_handler(struct device *dev,
> + iommu_dev_fault_handler_t handler,
> + void *data)
> +{
> + struct iommu_param *param = dev->iommu_param;
> +
> + /*
> + * Device iommu_param should have been allocated when device is
> + * added to its iommu_group.
> + */
> + if (!param)
> + return -EINVAL;
> +
> + /* Only allow one fault handler registered for each device */
> + if (param->fault_param)
> + return -EBUSY;
Should this be inside the param lock? We probably don't expect
concurrent register/unregister but it seems cleaner
> +
> + mutex_lock(¶m->lock);
> + get_device(dev);
> + param->fault_param =
> + kzalloc(sizeof(struct iommu_fault_param), GFP_ATOMIC);
> + if (!param->fault_param) {
> + put_device(dev);
> + mutex_unlock(¶m->lock);
> + return -ENOMEM;
> + }
> + mutex_init(¶m->fault_param->lock);
> + param->fault_param->handler = handler;
> + param->fault_param->data = data;
> + INIT_LIST_HEAD(¶m->fault_param->faults);
> +
> + mutex_unlock(¶m->lock);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
> +
> +/**
> + * iommu_unregister_device_fault_handler() - Unregister the device fault handler
> + * @dev: the device
> + *
> + * Remove the device fault handler installed with
> + * iommu_register_device_fault_handler().
> + *
> + * Return 0 on success, or an error.
> + */
> +int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> + struct iommu_param *param = dev->iommu_param;
> + int ret = 0;
> +
> + if (!param)
> + return -EINVAL;
> +
> + mutex_lock(¶m->lock);
We should return EINVAL here, if fault_param is NULL. That way users can
call unregister_fault_handler unconditionally in their cleanup paths
> + /* we cannot unregister handler if there are pending faults */
> + if (list_empty(¶m->fault_param->faults)) {
if (!list_empty(...))
> + ret = -EBUSY;
> + goto unlock;
> + }
> +
> + list_del(¶m->fault_param->faults);
faults is the list head, no need for list_del
> + kfree(param->fault_param);
> + param->fault_param = NULL;
> + put_device(dev);
> +
> +unlock:
> + mutex_unlock(¶m->lock);
> +
> + return 0;
return ret
Thanks,
Jean
On Mon, 30 Apr 2018 11:58:10 +0100
Jean-Philippe Brucker <[email protected]> wrote:
> On 25/04/18 16:37, Jacob Pan wrote:
> >> In the other cases (unsupported PRI or rogue guest) then disabling
> >> PRI using a FAILURE status might be the right thing to do. However,
> >> assuming the device follows the PCI spec it will stop sending page
> >> requests once there are as many PPRs in flight as the allocated
> >> credit.
> >>
> > Agreed, here I am not taking any actions. There may be need to drain
> > in-fly requests.
>
> Right, as long as we first ensure that no new fault is generated (by
> using a Response Failure). Though in my opinion not taking action
> might be the safest option :)
>
> Another thought: currently the comment in iommu.h says
> "@IOMMU_FAULT_STATUS_FAILURE: General error. Drop all subsequent
> faults from this device if possible. This is "Response Failure" in
> PCI PRI."
>
> I wonder if we should simply say "Drop all subsequent faults from the
> device". Even if the PCI device doesn't properly implement PRI, the
> IOMMU driver should set a "PRI disabled" bit in the device data that
> prevents it from from reporting new faults and flooding the queue.
> Anyway, it's a small detail that could go in a future patch series.
>
Right, we should disable PRI and leave future PRQ responses pending until
PRI is re-enabled on the device. I will leave that to a future enhancement.
> >> If there isn't any possibility of memory leak or abusing
> >> resources, I don't think it's our problem that the guest is
> >> excessively slow at handling page requests. Setting an upper bound
> >> to page request latency might do more harm than good. Ensuring
> >> that devices respect the number of allocated in-flight PPRs is
> >> more important in my opinion.
> > How about we have a really long timeout, e.g. 1 min similar to
> > device invalidate response timeout in ATS spec., just for basic
> > safety and diagnosis. Optionally, we could have quota in parallel.
>
> I agree that for development a timeout is useful. It might be worth
> adding it as an option to the IOMMU module instead of a define.
> Perhaps a number of seconds, 10 being the default and 0 disabling the
> timeout? Otherwise we would probably end up with a succession of
> patches incrementing the timeout by arbitrary values, if people find
> it inconvenient.
>
make sense. will do.
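As a sketch of what that could look like (the parameter name is illustrative
only, not an existing option):

/* Page request response timeout in seconds; 0 disables the timer. */
static unsigned int prq_timeout = 10;
module_param(prq_timeout, uint, 0644);
MODULE_PARM_DESC(prq_timeout,
		 "Timeout in seconds to wait for a page response (0 = disabled)");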
On Mon, 30 Apr 2018 17:53:52 +0100
Jean-Philippe Brucker <[email protected]> wrote:
> Hi,
>
> I noticed a couple issues when testing
>
> On 16/04/18 22:49, Jacob Pan wrote:
> > +int iommu_register_device_fault_handler(struct device *dev,
> > + iommu_dev_fault_handler_t
> > handler,
> > + void *data)
> > +{
> > + struct iommu_param *param = dev->iommu_param;
> > +
> > + /*
> > + * Device iommu_param should have been allocated when
> > device is
> > + * added to its iommu_group.
> > + */
> > + if (!param)
> > + return -EINVAL;
> > +
> > + /* Only allow one fault handler registered for each device
> > */
> > + if (param->fault_param)
> > + return -EBUSY;
>
> Should this be inside the param lock? We probably don't expect
> concurrent register/unregister but it seems cleaner
Agreed, and the same goes for the corrections below. Thanks!
>
> [...]
>
> We should return EINVAL here, if fault_param is NULL. That way users
> can call unregister_fault_handler unconditionally in their cleanup
> paths
>
> > + /* we cannot unregister handler if there are pending
> > faults */
> > + if (list_empty(¶m->fault_param->faults)) {
>
> if (!list_empty(...))
>
> > + ret = -EBUSY;
> > + goto unlock;
> > + }
> > +
> > + list_del(¶m->fault_param->faults);
>
> faults is the list head, no need for list_del
>
> > + kfree(param->fault_param);
> > + param->fault_param = NULL;
> > + put_device(dev);
> > +
> > +unlock:
> > + mutex_unlock(¶m->lock);
> > +
> > + return 0;
>
> return ret
>
> Thanks,
> Jean
[Jacob Pan]
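Putting those corrections together, a sketch of the fixed unregister path
could look like the following (the register path would similarly check
param->fault_param under the lock):

int iommu_unregister_device_fault_handler(struct device *dev)
{
	struct iommu_param *param = dev->iommu_param;
	int ret = 0;

	if (!param)
		return -EINVAL;

	mutex_lock(&param->lock);
	/* Allow unconditional calls from cleanup paths */
	if (!param->fault_param) {
		ret = -EINVAL;
		goto unlock;
	}

	/* we cannot unregister the handler if there are pending faults */
	if (!list_empty(&param->fault_param->faults)) {
		ret = -EBUSY;
		goto unlock;
	}

	kfree(param->fault_param);
	param->fault_param = NULL;
	put_device(dev);

unlock:
	mutex_unlock(&param->lock);

	return ret;
}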
On Fri, 27 Apr 2018 19:07:43 +0100
Jean-Philippe Brucker <[email protected]> wrote:
> On 23/04/18 21:43, Jacob Pan wrote:
> [...]
> >> The last name is a bit unfortunate. Since the Arm architecture uses
> >> the name "context" for what a PASID points to, "Device cache" would
> >> suit us better but it's not important.
> >>
> > or call it device context cache. actually so far context cache is
> > here only for completeness purpose. the expected use case is that
> > QEMU traps guest device context cache flush and call
> > bind_pasid_table.
>
> Right, makes sense
>
> [...]
> >> If this corresponds to QI_GRAN_ALL_ALL in patch 9, the comment
> >> should be "Cache of all PASIDs"? Or maybe "all entries for all
> >> PASIDs"? Is it different from GRANU_DOMAIN then?
> > QI_GRAN_ALL_ALL maps to VT-d spec 6.5.2.4, which invalidates all ext
> > TLB cache within a domain. It could reuse GRANU_DOMAIN but I was
> > also trying to match the naming convention in the spec.
>
> Sorry I don't quite understand the difference between TLB and ext TLB
> invalidation. Can an ext TLB invalidation do everything a TLB can do
> plus some additional parameters (introduced in more recent version of
> the spec), or do they have distinct purposes? I'm trying to understand
> why it needs to be user-visible
>
> >>> + IOMMU_INV_GRANU_PASID_SEL, /* only invalidate
> >>> specified PASID */ +
> >>> + IOMMU_INV_GRANU_NG_ALL_PASID, /* non-global within
> >>> all PASIDs */
> >>> + IOMMU_INV_GRANU_NG_PASID, /* non-global within a
> >>> PASIDs */
> >>
> >> Are the "NG" variant needed since there is a
> >> IOMMU_INVALIDATE_GLOBAL_PAGE below? We should drop either flag or
> >> granule.
> >>
> >> FWIW I'm starting to think more granule options is actually better
> >> than flags, because it flattens the combinations and keeps them to
> >> two dimensions, that we can understand and explain with a table.
> >>
> >>> + IOMMU_INV_GRANU_PAGE_PASID, /* page-selective
> >>> within a PASID */
> >>
> >> Maybe this should be called "NG_PAGE_PASID",
> > Sure. I was thinking page range already implies non-global pages.
> >> and "DOMAIN_PAGE" should
> >> instead be "PAGE_PASID". If I understood their meaning correctly,
> >> it would be more consistent with the rest.
> >>
> > I am trying not to mix granu between request w/ PASID and w/o.
> > DOMAIN_PAGE meant to be for request w/o PASID.
>
> Is the distinction necessary? I understand the IOMMU side might offer
> many possibilities for invalidation, but the user probably doesn't
> need all of them. It might be easier to document, upstream and
> maintain if we only specify what's currently needed by users (what
> does QEMU VT-d use?) Others can always extend it by increasing the
> version.
>
> Do you think that this invalidation message will be used outside of
> BIND_PASID_TABLE context? I can't see an other use but who knows. At
> the moment requests w/o PASID are managed with
> VFIO_IOMMU_MAP/UNMAP_DMA, which doesn't require invalidation. And in
> a BIND_PASID_TABLE context, IOMMUs requests w/o PASID are just a
> special case using PASID 0 (for Arm and AMD) so I suppose they'll use
> the same invalidation commands as requests w/ PASID.
>
My understanding is that for the GIOVA use case, the VT-d vIOMMU creates
the GIOVA-GPA mapping and the host shadows the 2nd-level page tables to
create the GIOVA-HPA mapping. An assigned device in the guest can do both
DMA map/unmap and VFIO map/unmap; VFIO unmap is a one-time deal
(I guess invalidation can be captured in another code path), but guest
kernel use of DMA unmap will trigger invalidation. QEMU needs to trap those
invalidations and pass them down to the physical IOMMU. So we do need
invalidation w/o PASID.
> >>> + IOMMU_INV_NR_GRANU,
> >>> +};
> >>> +
> >>> +/** enum iommu_inv_type - Generic translation cache types for
> >>> invalidation
> >>> + *
> >>> + * Invalidation requests sent to IOMMU may indicate which
> >>> translation cache
> >>> + * to be operated on.
> >>> + * Combined with enum iommu_inv_granularity, model specific
> >>> driver can do a
> >>> + * simple lookup to convert generic type to model specific value.
> >>> + */
> >>> +enum iommu_inv_type {
> >>
> >> These should be flags (1 << 0), (1 << 1) etc, since IOMMUs will
> >> want to invalidate multiple caches at once (at least DTLB and
> >> TLB). You could then do for_each_set_bit in the driver
> >>
> > I was thinking the invalidation to be inclusive as we discussed
> > earlier ,last year :).
> > TLB includes DLTB
> > PASID cache includes TLB and DTLB. I need to document it better.
>
> Ah right, I guess I was stuck on an old version :) Then the current
> values make sense
>
> >>> + IOMMU_INV_TYPE_DTLB, /* device IOTLB */
> >>> + IOMMU_INV_TYPE_TLB, /* IOMMU paging structure
> >>> cache */
> >>> + IOMMU_INV_TYPE_PASID, /* PASID cache */
> >>> + IOMMU_INV_TYPE_CONTEXT, /* device context entry
> >>> cache */
> >>> + IOMMU_INV_NR_TYPE
> >>> +};
> >>
> >> We need to summarize and explain valid combinations, because
> >> reading inv_type_granu_map and inv_type_granu_table is a bit
> >> tedious. I tried to reproduce inv_type_granu_map here (Cell format
> >> is PASID_TAGGED / !PASID_TAGGED). Could you check if this matches
> >> your model?
> > great summary. thanks
> >>
> >> type | DTLB | TLB | PASID | CONTEXT
> >> granule | | | |
> >> -----------------+-----------+-----------+-----------+-----------
> >> - | / Y | / Y | | / Y
> > what is this row?
>
> Hm, the arrays in patch 9 have 9 entries, this is entry 0 (for which I
> asked if it corresponded to "invalidate all caches" in my previous
> reply).
>
I see, I have removed global invalidation. So we can remove this row.
> >> DOMAIN | | / Y | | / Y
> >> DEVICE | | | | / Y
> >> DOMAIN_PAGE | | / Y | |
> >> ALL_PASID | Y | Y | |
> >> PASID_SEL | Y | | Y |
> >> NG_ALL_PASID | | Y | Y |
> >> NG_PASID | | Y | |
> >> PAGE_PASID | | Y | |
> >>
> > Mostly match what I intended for VT-d. Just one thing on the PASID
> > column, all PASID associated with a given domain ID can go either
> > NG_ALL_PASID (as in your table) or ALL_PASID.
> >
> > Here is what I plan to change in comments that can reflect what you
> > have in the table above.
> > Can I also copy your table in the next version?
>
> Sure
>
> (For the patch, putting all descriptions in a single comment at the
> top of the enum would be better)
>
ok, will do.
> > enum iommu_inv_granularity {
> > IOMMU_INV_GRANU_DOMAIN = 1, /* IOTLBs and device
> > context
> > * cache associated with a
> > * domain ID
> > */
> >
> > IOMMU_INV_GRANU_DEVICE, /* device context
> > cache
> > * associated with a device
> > ID */
> >
> > IOMMU_INV_GRANU_DOMAIN_PAGE, /* IOTLBs associated
> > with
> > * address range of a
> > * given domain ID
> > */
>
> Another nit: it might be easier to understand if we sort these values
> by "coarseness". DOMAIN_PAGE seems finer than ALL_PASID or PASID_SEL
> because it doesn't nuke all TLB entries of an address space, so might
> make more sense to move it at the bottom. Though as said above, I
> don't think we should distinguish between DOMAIN_PAGE and PAGE_PASID
>
It is hard to sort by coarseness when we cross different types. If you
are convinced that we do need the w/o PASID case, can we keep both
DOMAIN_PAGE and PAGE_PASID?
> >
> > IOMMU_INV_GRANU_ALL_PASID, /* DTLB or IOTLB of all
> > * PASIDs associated to a
> > * given domain ID
> > */
> >
> > IOMMU_INV_GRANU_PASID_SEL, /* DTLB and PASID cache
> > * associated to a PASID
> > */
>
> This comment has "DTLB", the previous had "DTLB or IOTLB", and the
> first one had "IOTLBs". But doesn't the TLB selection, either DTLB or
> "DTLB+IOTLB", depend on iommu_inv_type? So maybe saying "TLB entries"
> everywhere in the granule comments is good enough?
>
Since not all inv_type and granularity combinations are valid, I was trying
to give additional info so that people can understand that certain
granularities only apply to certain types. I guess your truth table explains
the same information better; I will rename them to "TLB entries".
> > IOMMU_INV_GRANU_NG_ALL_PASID, /* IOTLBs of non-global
> > * pages for all PASIDs for
> > a
> > * given domain ID
> > */
> >
> > IOMMU_INV_GRANU_NG_PASID, /* IOTLBs of non-global
> > * pages for a given PASID
> > */
> >
> > IOMMU_INV_GRANU_PAGE_PASID, /* IOTLBs of selected
> > page
> > * range within a PASID
> > */
>
> I think the other comments are fine
>
> [...]
> >>> + * @size: 2^size of 4K pages, 0 for 4k, 9 for 2MB,
> >>> etc.
> >>
> >> Maybe start the size at 1 byte, we don't know what sort of
> >> granularity future architectures will offer.
> >>
> > I can't see any case we are not operating at sub-page size. why
> > would anyone cache translation for 1 byte, that is too much
> > overhead.
>
> 1 bytes is probably overkill, but why not 2048 for TCP packets... we
> don't really know what strange ideas people will come up with. But
> you're right, it's unlikely.
>
> However I thought about this more and we are actually missing
> something. Some architectures will have arbitrary ranges in their
> invalidation commands, they might want to invalidate three pages at a
> time without sending three invalidation commands. Having a page
> granularity is good, because users might want to invalidate huge TLB,
> but we should also have a number of pages.
>
> Could you add a nr_pages parameter?
>
> @size: one page is 2^size (*4k?) bytes
> @nr_pages: number of pages to invalidate
>
> u8 size
> u64 nr_pages
>
That sounds good. The VT-d DTLB size is implied in the address bits, but it
is better to encode it explicitly here to accommodate everyone.
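For example, a small sketch of encoding a range with the suggested fields
(the struct and helper names are illustrative only): invalidating three 4K
pages would be size = 0, nr_pages = 3; a single 2MB page would be size = 9,
nr_pages = 1.

struct inv_range {
	__u8  size;		/* one page is 2^size * 4K bytes */
	__u64 nr_pages;		/* number of pages to invalidate */
};

static void encode_inv_range(struct inv_range *r, u64 bytes,
			     unsigned int page_shift)
{
	r->size = page_shift - 12;	/* 12 = shift of a 4K page */
	r->nr_pages = DIV_ROUND_UP(bytes, 1ULL << page_shift);
}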
> Sorry about the late changes, I don't want to slow this down and I
> think we're nearly there, but this last point seems important.
>
> Thanks,
> Jean
On 01/05/18 23:58, Jacob Pan wrote:
>>>> Maybe this should be called "NG_PAGE_PASID",
>>> Sure. I was thinking page range already implies non-global pages.
>>>> and "DOMAIN_PAGE" should
>>>> instead be "PAGE_PASID". If I understood their meaning correctly,
>>>> it would be more consistent with the rest.
>>>>
>>> I am trying not to mix granu between request w/ PASID and w/o.
>>> DOMAIN_PAGE meant to be for request w/o PASID.
>>
>> Is the distinction necessary? I understand the IOMMU side might offer
>> many possibilities for invalidation, but the user probably doesn't
>> need all of them. It might be easier to document, upstream and
>> maintain if we only specify what's currently needed by users (what
>> does QEMU VT-d use?) Others can always extend it by increasing the
>> version.
>>
>> Do you think that this invalidation message will be used outside of
>> BIND_PASID_TABLE context? I can't see an other use but who knows. At
>> the moment requests w/o PASID are managed with
>> VFIO_IOMMU_MAP/UNMAP_DMA, which doesn't require invalidation. And in
>> a BIND_PASID_TABLE context, IOMMUs requests w/o PASID are just a
>> special case using PASID 0 (for Arm and AMD) so I suppose they'll use
>> the same invalidation commands as requests w/ PASID.
>>
> My understanding is that for GIOVA use case, VT-d vIOMMU creates
> GIOVA-GPA mapping and the host shadows the 2nd level page tables to
> create GIOVA-HPA mapping. So when assigned device in the guest can do
> both DMA map/unmap and VFIO map/unmap, VFIO unmap is one time deal
> (I guess invalidation can be captured in other code path), but guest
> kernel use of DMA unmap could will trigger invalidation. QEMU needs to
> trap those invalidation and passdown to physical IOMMU. So we do need
> invalidation w/o PASID.
Hm, isn't this all done by host userspace? Whether guest does DMA
map/unmap or VFIO map/unmap, it creates/removes IOVA-GPA mappings in the
vIOMMU. QEMU captures invalidation requests for these mappings from the
guest, finds GPA-HVA in the shadow map and sends a VFIO map/unmap
request for IOVA-HVA.
Thanks,
Jean
On Wed, 2 May 2018 10:31:50 +0100
Jean-Philippe Brucker <[email protected]> wrote:
> On 01/05/18 23:58, Jacob Pan wrote:
> >>>> Maybe this should be called "NG_PAGE_PASID",
> >>> Sure. I was thinking page range already implies non-global
> >>> pages.
> >>>> and "DOMAIN_PAGE" should
> >>>> instead be "PAGE_PASID". If I understood their meaning correctly,
> >>>> it would be more consistent with the rest.
> >>>>
> >>> I am trying not to mix granu between request w/ PASID and w/o.
> >>> DOMAIN_PAGE meant to be for request w/o PASID.
> >>
> >> Is the distinction necessary? I understand the IOMMU side might
> >> offer many possibilities for invalidation, but the user probably
> >> doesn't need all of them. It might be easier to document, upstream
> >> and maintain if we only specify what's currently needed by users
> >> (what does QEMU VT-d use?) Others can always extend it by
> >> increasing the version.
> >>
> >> Do you think that this invalidation message will be used outside of
> >> BIND_PASID_TABLE context? I can't see an other use but who knows.
> >> At the moment requests w/o PASID are managed with
> >> VFIO_IOMMU_MAP/UNMAP_DMA, which doesn't require invalidation. And
> >> in a BIND_PASID_TABLE context, IOMMUs requests w/o PASID are just a
> >> special case using PASID 0 (for Arm and AMD) so I suppose they'll
> >> use the same invalidation commands as requests w/ PASID.
> >>
> > My understanding is that for GIOVA use case, VT-d vIOMMU creates
> > GIOVA-GPA mapping and the host shadows the 2nd level page tables to
> > create GIOVA-HPA mapping. So when assigned device in the guest can
> > do both DMA map/unmap and VFIO map/unmap, VFIO unmap is one time
> > deal (I guess invalidation can be captured in other code path), but
> > guest kernel use of DMA unmap could will trigger invalidation. QEMU
> > needs to trap those invalidation and passdown to physical IOMMU. So
> > we do need invalidation w/o PASID.
>
> Hm, isn't this all done by host userspace? Whether guest does DMA
> map/unmap or VFIO map/unmap, it creates/removes IOVA-GPA mappings in
> the vIOMMU. QEMU captures invalidation requests for these mappings
> from the guest, finds GPA-HVA in the shadow map and sends a VFIO
> map/unmap request for IOVA-HVA.
>
Sorry for the delay, but you are right. I have also confirmed with Yi
that we don't need second-level invalidation. I will remove the IOTLB
invalidation w/o PASID case from the API.
Thanks,
> Thanks,
> Jean
>
[Jacob Pan]
On Thu, 3 May 2018 21:46:16 -0700
Jacob Pan <[email protected]> wrote:
> On Wed, 2 May 2018 10:31:50 +0100
> Jean-Philippe Brucker <[email protected]> wrote:
>
> > On 01/05/18 23:58, Jacob Pan wrote:
> > >>>> Maybe this should be called "NG_PAGE_PASID",
> > >>> Sure. I was thinking page range already implies non-global
> > >>> pages.
> > >>>> and "DOMAIN_PAGE" should
> > >>>> instead be "PAGE_PASID". If I understood their meaning
> > >>>> correctly, it would be more consistent with the rest.
> > >>>>
> > >>> I am trying not to mix granu between request w/ PASID and w/o.
> > >>> DOMAIN_PAGE meant to be for request w/o PASID.
> > >>
> > >> Is the distinction necessary? I understand the IOMMU side might
> > >> offer many possibilities for invalidation, but the user probably
> > >> doesn't need all of them. It might be easier to document,
> > >> upstream and maintain if we only specify what's currently needed
> > >> by users (what does QEMU VT-d use?) Others can always extend it
> > >> by increasing the version.
> > >>
> > >> Do you think that this invalidation message will be used outside
> > >> of BIND_PASID_TABLE context? I can't see an other use but who
> > >> knows. At the moment requests w/o PASID are managed with
> > >> VFIO_IOMMU_MAP/UNMAP_DMA, which doesn't require invalidation. And
> > >> in a BIND_PASID_TABLE context, IOMMUs requests w/o PASID are
> > >> just a special case using PASID 0 (for Arm and AMD) so I suppose
> > >> they'll use the same invalidation commands as requests w/ PASID.
> > >>
> > > My understanding is that for GIOVA use case, VT-d vIOMMU creates
> > > GIOVA-GPA mapping and the host shadows the 2nd level page tables
> > > to create GIOVA-HPA mapping. So when assigned device in the guest
> > > can do both DMA map/unmap and VFIO map/unmap, VFIO unmap is one
> > > time deal (I guess invalidation can be captured in other code
> > > path), but guest kernel use of DMA unmap could will trigger
> > > invalidation. QEMU needs to trap those invalidation and passdown
> > > to physical IOMMU. So we do need invalidation w/o PASID.
> >
> > Hm, isn't this all done by host userspace? Whether guest does DMA
> > map/unmap or VFIO map/unmap, it creates/removes IOVA-GPA mappings in
> > the vIOMMU. QEMU captures invalidation requests for these mappings
> > from the guest, finds GPA-HVA in the shadow map and sends a VFIO
> > map/unmap request for IOVA-HVA.
> >
> Sorry for the delay but you are right, I have also confirmed with Yi
> that we don't need second level invalidation. I will remove IOTLB
> invalidation w/o PASID case from the API.
>
Now the passdown invalidation granularities look like the following
(sorted by coarseness). I will send this out in the v5 patchset soon if
there are no issues.
/**
* enum iommu_inv_granularity - Generic invalidation granularity
*
* @IOMMU_INV_GRANU_DOMAIN: Device context cache associated with a
* domain ID.
* @IOMMU_INV_GRANU_DEVICE: Device context cache associated with a
* device ID
* @IOMMU_INV_GRANU_DOMAIN_ALL_PASID: TLB entries or PASID caches of all
* PASIDs associated with a domain ID
* @IOMMU_INV_GRANU_PASID_SEL: TLB entries or PASID cache associated
* with a PASID and a domain
* @IOMMU_INV_GRANU_PAGE_PASID: TLB entries of selected page range
* within a PASID
*
* When an invalidation request is passed down to IOMMU to flush translation
* caches, it may carry different granularity levels, which can be specific
* to certain types of translation caches. For an example, PASID selective
* granularity is only applicable PASID cache and IOTLB invalidation but for
* device context caches.
* This enum is a collection of granularities for all types of translation
* caches. The idea is to make it easy for IOMMU model specific driver to
* convert from generic to model specific value. Not all combinations between
* translation caches and granularity levels are valid. Each IOMMU driver
* can enforce check based on its own conversion table. The conversion is
* based on 2D look-up with inputs as follows:
* - translation cache types
* - granularity
* No global granularity is allowed in that passdown invalidation for an
* assigned device should only impact the device or domain itself.
*
* type | DTLB | TLB | PASID | CONTEXT
* granule | | | |
* -----------------+-----------+-----------+-----------+-----------
* DOMAIN | | | | Y
* DEVICE | | | | Y
* DN_ALL_PASID | Y | Y | Y |
* PASID_SEL | Y | Y | Y |
* PAGE_PASID | | Y | |
*
*/
enum iommu_inv_granularity {
IOMMU_INV_GRANU_DOMAIN,
IOMMU_INV_GRANU_DEVICE,
IOMMU_INV_GRANU_DOMAIN_ALL_PASID,
IOMMU_INV_GRANU_PASID_SEL,
IOMMU_INV_GRANU_PAGE_PASID,
IOMMU_INV_NR_GRANU,
};
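For reference, a sketch of the 2D validity check described in the comment,
derived directly from the table above. This is illustrative only, not the
actual inv_type_granu_map[] from the VT-d patch:

static const unsigned int inv_granu_valid[IOMMU_INV_NR_TYPE] = {
	[IOMMU_INV_TYPE_DTLB]    = BIT(IOMMU_INV_GRANU_DOMAIN_ALL_PASID) |
				   BIT(IOMMU_INV_GRANU_PASID_SEL),
	[IOMMU_INV_TYPE_TLB]     = BIT(IOMMU_INV_GRANU_DOMAIN_ALL_PASID) |
				   BIT(IOMMU_INV_GRANU_PASID_SEL) |
				   BIT(IOMMU_INV_GRANU_PAGE_PASID),
	[IOMMU_INV_TYPE_PASID]   = BIT(IOMMU_INV_GRANU_DOMAIN_ALL_PASID) |
				   BIT(IOMMU_INV_GRANU_PASID_SEL),
	[IOMMU_INV_TYPE_CONTEXT] = BIT(IOMMU_INV_GRANU_DOMAIN) |
				   BIT(IOMMU_INV_GRANU_DEVICE),
};

static bool inv_request_valid(enum iommu_inv_type type,
			      enum iommu_inv_granularity granu)
{
	if (type >= IOMMU_INV_NR_TYPE || granu >= IOMMU_INV_NR_GRANU)
		return false;
	return inv_granu_valid[type] & BIT(granu);
}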
> Thanks,
>
> > Thanks,
> > Jean
> >
>
> [Jacob Pan]
[Jacob Pan]
On Mon Apr 16 18, Jacob Pan wrote:
>From: "Liu, Yi L" <[email protected]>
>
>When an SVM capable device is assigned to a guest, the first level page
>tables are owned by the guest and the guest PASID table pointer is
>linked to the device context entry of the physical IOMMU.
>
>Host IOMMU driver has no knowledge of caching structure updates unless
>the guest invalidation activities are passed down to the host. The
>primary usage is derived from emulated IOMMU in the guest, where QEMU
>can trap invalidation activities before passing them down to the
>host/physical IOMMU.
>Since the invalidation data are obtained from user space and will be
>written into physical IOMMU, we must allow security check at various
>layers. Therefore, generic invalidation data format are proposed here,
>model specific IOMMU drivers need to convert them into their own format.
>
>Signed-off-by: Liu, Yi L <[email protected]>
>Signed-off-by: Jean-Philippe Brucker <[email protected]>
>Signed-off-by: Jacob Pan <[email protected]>
>Signed-off-by: Ashok Raj <[email protected]>
>---
> drivers/iommu/iommu.c | 14 ++++++++
> include/linux/iommu.h | 12 +++++++
> include/uapi/linux/iommu.h | 79 ++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 105 insertions(+)
>
>diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>index 3a69620..784e019 100644
>--- a/drivers/iommu/iommu.c
>+++ b/drivers/iommu/iommu.c
>@@ -1344,6 +1344,20 @@ void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
> }
> EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
>
>+int iommu_sva_invalidate(struct iommu_domain *domain,
>+ struct device *dev, struct tlb_invalidate_info *inv_info)
>+{
>+ int ret = 0;
>+
>+ if (unlikely(!domain->ops->sva_invalidate))
>+ return -ENODEV;
>+
>+ ret = domain->ops->sva_invalidate(domain, dev, inv_info);
>+
>+ return ret;
>+}
>+EXPORT_SYMBOL_GPL(iommu_sva_invalidate);
>+
> static void __iommu_detach_device(struct iommu_domain *domain,
> struct device *dev)
> {
>diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>index 8ad111f..e963dbd 100644
>--- a/include/linux/iommu.h
>+++ b/include/linux/iommu.h
>@@ -190,6 +190,7 @@ struct iommu_resv_region {
> * @pgsize_bitmap: bitmap of all possible supported page sizes
> * @bind_pasid_table: bind pasid table pointer for guest SVM
> * @unbind_pasid_table: unbind pasid table pointer and restore defaults
>+ * @sva_invalidate: invalidate translation caches of shared virtual address
> */
> struct iommu_ops {
> bool (*capable)(enum iommu_cap);
>@@ -243,6 +244,8 @@ struct iommu_ops {
> struct pasid_table_config *pasidt_binfo);
> void (*unbind_pasid_table)(struct iommu_domain *domain,
> struct device *dev);
>+ int (*sva_invalidate)(struct iommu_domain *domain,
>+ struct device *dev, struct tlb_invalidate_info *inv_info);
>
> unsigned long pgsize_bitmap;
> };
>@@ -309,6 +312,9 @@ extern int iommu_bind_pasid_table(struct iommu_domain *domain,
> struct device *dev, struct pasid_table_config *pasidt_binfo);
> extern void iommu_unbind_pasid_table(struct iommu_domain *domain,
> struct device *dev);
>+extern int iommu_sva_invalidate(struct iommu_domain *domain,
>+ struct device *dev, struct tlb_invalidate_info *inv_info);
>+
> extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
> extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
> phys_addr_t paddr, size_t size, int prot);
>@@ -720,6 +726,12 @@ void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
> {
> }
>
>+static inline int iommu_sva_invalidate(struct iommu_domain *domain,
>+ struct device *dev, struct tlb_invalidate_info *inv_info)
>+{
>+ return -EINVAL;
>+}
>+
Would -ENODEV make more sense here?
> #endif /* CONFIG_IOMMU_API */
>
> #endif /* __LINUX_IOMMU_H */
>diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>index 9f7a6bf..4447943 100644
>--- a/include/uapi/linux/iommu.h
>+++ b/include/uapi/linux/iommu.h
>@@ -29,4 +29,83 @@ struct pasid_table_config {
> __u8 pasid_bits;
> };
>
>+/**
>+ * enum iommu_inv_granularity - Generic invalidation granularity
>+ *
>+ * When an invalidation request is sent to IOMMU to flush translation caches,
>+ * it may carry different granularity. These granularity levels are not specific
>+ * to a type of translation cache. For an example, PASID selective granularity
>+ * is only applicable to PASID cache invalidation.
>+ * This enum is a collection of granularities for all types of translation
>+ * caches. The idea is to make it easy for IOMMU model specific driver do
>+ * conversion from generic to model specific value.
>+ */
>+enum iommu_inv_granularity {
>+ IOMMU_INV_GRANU_DOMAIN = 1, /* all TLBs associated with a domain */
>+ IOMMU_INV_GRANU_DEVICE, /* caching structure associated with a
>+ * device ID
>+ */
>+ IOMMU_INV_GRANU_DOMAIN_PAGE, /* address range with a domain */
>+ IOMMU_INV_GRANU_ALL_PASID, /* cache of a given PASID */
>+ IOMMU_INV_GRANU_PASID_SEL, /* only invalidate specified PASID */
>+
>+ IOMMU_INV_GRANU_NG_ALL_PASID, /* non-global within all PASIDs */
>+ IOMMU_INV_GRANU_NG_PASID, /* non-global within a PASIDs */
>+ IOMMU_INV_GRANU_PAGE_PASID, /* page-selective within a PASID */
>+ IOMMU_INV_NR_GRANU,
>+};
>+
>+/** enum iommu_inv_type - Generic translation cache types for invalidation
>+ *
>+ * Invalidation requests sent to IOMMU may indicate which translation cache
>+ * to be operated on.
>+ * Combined with enum iommu_inv_granularity, model specific driver can do a
>+ * simple lookup to convert generic type to model specific value.
>+ */
>+enum iommu_inv_type {
>+ IOMMU_INV_TYPE_DTLB, /* device IOTLB */
>+ IOMMU_INV_TYPE_TLB, /* IOMMU paging structure cache */
>+ IOMMU_INV_TYPE_PASID, /* PASID cache */
>+ IOMMU_INV_TYPE_CONTEXT, /* device context entry cache */
>+ IOMMU_INV_NR_TYPE
>+};
>+
>+/**
>+ * Translation cache invalidation header that contains mandatory meta data.
>+ * @version: info format version, expecting future extensions
>+ * @type: type of translation cache to be invalidated
>+ */
>+struct tlb_invalidate_hdr {
>+ __u32 version;
>+#define TLB_INV_HDR_VERSION_1 1
>+ enum iommu_inv_type type;
>+};
>+
>+/**
>+ * Translation cache invalidation information, contains generic IOMMU
>+ * data which can be parsed based on model ID by model specific drivers.
>+ *
>+ * @granularity: requested invalidation granularity, type dependent
>+ * @size: 2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
>+ * @pasid: processor address space ID value per PCI spec.
>+ * @addr: page address to be invalidated
>+ * @flags: IOMMU_INVALIDATE_PASID_TAGGED: DMA with PASID tagged,
>+ * @pasid validity can be
>+ * deduced from @granularity
>+ * IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries
>+ * IOMMU_INVALIDATE_GLOBAL_PAGE: global pages
>+ *
>+ */
>+struct tlb_invalidate_info {
>+ struct tlb_invalidate_hdr hdr;
>+ enum iommu_inv_granularity granularity;
>+ __u32 flags;
>+#define IOMMU_INVALIDATE_NO_PASID (1 << 0)
>+#define IOMMU_INVALIDATE_ADDR_LEAF (1 << 1)
>+#define IOMMU_INVALIDATE_GLOBAL_PAGE (1 << 2)
>+#define IOMMU_INVALIDATE_PASID_TAGGED (1 << 3)
>+ __u8 size;
>+ __u32 pasid;
>+ __u64 addr;
>+};
> #endif /* _UAPI_IOMMU_H */
>--
>2.7.4
>
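As a rough usage sketch (not part of the patch), assuming the structures above, a
caller such as VFIO acting on behalf of a guest could request a page-selective
invalidation like this; domain, dev, pasid and iova are placeholders:

	struct tlb_invalidate_info inv_info = {
		.hdr = {
			.version	= TLB_INV_HDR_VERSION_1,
			.type		= IOMMU_INV_TYPE_TLB,
		},
		.granularity	= IOMMU_INV_GRANU_PAGE_PASID,
		.flags		= IOMMU_INVALIDATE_PASID_TAGGED,
		.size		= 0,		/* 2^0 * 4K = one 4K page */
		.pasid		= pasid,
		.addr		= iova,
	};
	int ret = iommu_sva_invalidate(domain, dev, &inv_info);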
On Sat, 5 May 2018 15:19:43 -0700
Jerry Snitselaar <[email protected]> wrote:
> >
> >+static inline int iommu_sva_invalidate(struct iommu_domain *domain,
> >+		struct device *dev, struct tlb_invalidate_info *inv_info)
> >+{
> >+	return -EINVAL;
> >+}
> >+
>
> Would -ENODEV make more sense here?
>
Yes, makes sense. Thanks!
> [...]
Hi Jacob,
Looks mostly good to me, I just have a couple more comments
On 04/05/18 19:07, Jacob Pan wrote:
> Now the passdown invalidation granularities look like:
> (sorted by coarseness), will send out in v5 patchset soon if no issues.
>
> /**
> * enum iommu_inv_granularity - Generic invalidation granularity
> *
> * @IOMMU_INV_GRANU_DOMAIN: Device context cache associated with a
> * domain ID.
> * @IOMMU_INV_GRANU_DEVICE: Device context cache associated with a
> * device ID
> * @IOMMU_INV_GRANU_DOMAIN_ALL_PASID: TLB entries or PASID caches of all
> * PASIDs associated with a domain ID
> * @IOMMU_INV_GRANU_PASID_SEL: TLB entries or PASID cache associated
> * with a PASID and a domain
> * @IOMMU_INV_GRANU_PAGE_PASID: TLB entries of selected page range
> * within a PASID
> *
> * When an invalidation request is passed down to IOMMU to flush translation
> * caches, it may carry different granularity levels, which can be specific
> * to certain types of translation caches. For an example, PASID selective
> * granularity is only applicable PASID cache and IOTLB invalidation but for
> * device context caches.
Should it be "PASID selective granularity is only applicable to PASID
cache and IOTLB but not device context caches"?
> * This enum is a collection of granularities for all types of translation
> * caches. The idea is to make it easy for IOMMU model specific driver to
> * convert from generic to model specific value. Not all combinations between
> * translation caches and granularity levels are valid. Each IOMMU driver
> * can enforce check based on its own conversion table. The conversion is
> * based on 2D look-up with inputs as follows:
> * - translation cache types
> * - granularity
> * No global granularity is allowed in that passdown invalidation for an
> * assigned device should only impact the device or domain itself.
That last sentence is a bit confusing, because "global granularity"
might also refer to the "global" TLB flag which is allowed. In my
opinion you can leave this rationale out, I doubt userspace will ever
demand a mechanism for global invalidation.
> *
> * type | DTLB | TLB | PASID | CONTEXT
> * granule | | | |
> * -----------------+-----------+-----------+-----------+-----------
> * DOMAIN | | | | Y
> * DEVICE | | | | Y
I can't really see a use-case for DOMAIN and DEVICE. It might make more
sense to keep only DN_ALL_PASID, which would then also invalidate the
device context cache. But since they will be very rare events, factoring
them doesn't seem important.
> * DN_ALL_PASID | Y | Y | Y |
> * PASID_SEL | Y | Y | Y |
> * PAGE_PASID | | Y | |
Why not allow PAGE_PASID+DTLB? We need a way to invalidate individual
DTLB entries
Thanks,
Jean
On Tue, 8 May 2018 11:35:00 +0100
Jean-Philippe Brucker <[email protected]> wrote:
> Hi Jacob,
>
> Looks mostly good to me, I just have a couple more comments
>
> On 04/05/18 19:07, Jacob Pan wrote:
> > Now the passdown invalidation granularities look like:
> > (sorted by coarseness), will send out in v5 patchset soon if no
> > issues.
> >
> > /**
> > * enum iommu_inv_granularity - Generic invalidation granularity
> > *
> > * @IOMMU_INV_GRANU_DOMAIN: Device context cache
> > associated with a
> > * domain ID.
> > * @IOMMU_INV_GRANU_DEVICE: Device context cache
> > associated with a
> > * device ID
> > * @IOMMU_INV_GRANU_DOMAIN_ALL_PASID: TLB entries or PASID
> > caches of all
> > * PASIDs associated with a
> > domain ID
> > * @IOMMU_INV_GRANU_PASID_SEL: TLB entries or PASID
> > cache associated
> > * with a PASID and a domain
> > * @IOMMU_INV_GRANU_PAGE_PASID: TLB entries of
> > selected page range
> > * within a PASID
> > *
> > * When an invalidation request is passed down to IOMMU to flush
> > translation
> > * caches, it may carry different granularity levels, which can be
> > specific
> > * to certain types of translation caches. For an example, PASID
> > selective
> > * granularity is only applicable PASID cache and IOTLB
> > invalidation but for
> > * device context caches.
>
> Should it be "PASID selective granularity is only applicable to PASID
> cache and IOTLB but not device context caches"?
>
right, thanks!
> > * This enum is a collection of granularities for all types of
> > translation
> > * caches. The idea is to make it easy for IOMMU model specific
> > driver to
> > * convert from generic to model specific value. Not all
> > combinations between
> > * translation caches and granularity levels are valid. Each IOMMU
> > driver
> > * can enforce check based on its own conversion table. The
> > conversion is
> > * based on 2D look-up with inputs as follows:
> > * - translation cache types
> > * - granularity
> > * No global granularity is allowed in that passdown invalidation
> > for an
> > * assigned device should only impact the device or domain itself.
>
> That last sentence is a bit confusing, because "global granularity"
> might also refer to the "global" TLB flag which is allowed. In my
> opinion you can leave this rationale out, I doubt userspace will ever
> demand a mechanism for global invalidation.
>
Yes, I can leave the last sentence out.
> > *
> > * type | DTLB | TLB | PASID | CONTEXT
> > * granule | | | |
> > * -----------------+-----------+-----------+-----------+-----------
> > * DOMAIN | | | | Y
> > * DEVICE | | | | Y
>
> I can't really see a use-case for DOMAIN and DEVICE. It might make
> more sense to keep only DN_ALL_PASID, which would then also
> invalidate the device context cache. But since they will be very rare
> events, factoring them doesn't seem important.
>
OK. We have no use for them now either; they were there for completeness. I will
remove them for now.
> > * DN_ALL_PASID | Y | Y | Y |
> > * PASID_SEL | Y | Y | Y |
> > * PAGE_PASID | | Y | |
>
> Why not allow PAGE_PASID+DTLB? We need a way to invalidate individual
> DTLB entries
>
I was thinking PAGE_PASID+TLB includes PAGE_PASID+DTLB, but you are
right, DTLB should be a 'Y' here.
> Thanks,
> Jean
[Jacob Pan]
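To illustrate the conversion described above, here is a rough sketch (not from the
patchset) of how a model specific driver could implement the 2D lookup. The helper
name and the granularity identifiers follow the proposed v5 naming and are
hypothetical; the table reflects the matrix above with the DOMAIN/DEVICE rows
dropped and DTLB marked valid for PAGE_PASID:

/* Hypothetical per-model validity matrix, indexed by cache type and granularity */
static const bool inv_granu_valid[IOMMU_INV_NR_TYPE][IOMMU_INV_NR_GRANU] = {
	[IOMMU_INV_TYPE_DTLB] = {
		[IOMMU_INV_GRANU_DOMAIN_ALL_PASID]	= true,
		[IOMMU_INV_GRANU_PASID_SEL]		= true,
		[IOMMU_INV_GRANU_PAGE_PASID]		= true,
	},
	[IOMMU_INV_TYPE_TLB] = {
		[IOMMU_INV_GRANU_DOMAIN_ALL_PASID]	= true,
		[IOMMU_INV_GRANU_PASID_SEL]		= true,
		[IOMMU_INV_GRANU_PAGE_PASID]		= true,
	},
	[IOMMU_INV_TYPE_PASID] = {
		[IOMMU_INV_GRANU_DOMAIN_ALL_PASID]	= true,
		[IOMMU_INV_GRANU_PASID_SEL]		= true,
	},
};

/* Reject generic (type, granularity) pairs the model does not support
 * before converting them to hardware-specific descriptors.
 */
static bool inv_type_granu_supported(enum iommu_inv_type type,
				     enum iommu_inv_granularity granu)
{
	if (type >= IOMMU_INV_NR_TYPE || granu >= IOMMU_INV_NR_GRANU)
		return false;
	return inv_granu_valid[type][granu];
}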
Hi Jacob,
> From: Jacob Pan [mailto:[email protected]]
> Sent: Tuesday, April 17, 2018 5:49 AM
> include/linux/iommu.h | 102
> +++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 100 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index e963dbd..8968933 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -49,13 +49,17 @@ struct bus_type;
> struct device;
> struct iommu_domain;
> struct notifier_block;
> +struct iommu_fault_event;
>
> /* iommu fault flags */
> -#define IOMMU_FAULT_READ 0x0
> -#define IOMMU_FAULT_WRITE 0x1
> +#define IOMMU_FAULT_READ (1 << 0)
> +#define IOMMU_FAULT_WRITE (1 << 1)
> +#define IOMMU_FAULT_EXEC (1 << 2)
> +#define IOMMU_FAULT_PRIV (1 << 3)
>
> typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
> struct device *, unsigned long, int, void *);
> +typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *, void *);
>
> struct iommu_domain_geometry {
> dma_addr_t aperture_start; /* First address that can be mapped */
> @@ -264,6 +268,99 @@ struct iommu_device {
> struct device *dev;
> };
>
> +/* Generic fault types, can be expanded to include IRQ remapping faults */
> +enum iommu_fault_type {
> + IOMMU_FAULT_DMA_UNRECOV = 1, /* unrecoverable fault */
> + IOMMU_FAULT_PAGE_REQ, /* page request fault */
> +};
> +
> +enum iommu_fault_reason {
> + IOMMU_FAULT_REASON_UNKNOWN = 0,
> +
> + /* IOMMU internal error, no specific reason to report out */
> + IOMMU_FAULT_REASON_INTERNAL,
> +
> + /* Could not access the PASID table */
> + IOMMU_FAULT_REASON_PASID_FETCH,
> +
> + /*
> + * PASID is out of range (e.g. exceeds the maximum PASID
> + * supported by the IOMMU) or disabled.
> + */
> + IOMMU_FAULT_REASON_PASID_INVALID,
> +
> + /* Could not access the page directory (Invalid PASID entry) */
> + IOMMU_FAULT_REASON_PGD_FETCH,
> +
> + /* Could not access the page table entry (Bad address) */
> + IOMMU_FAULT_REASON_PTE_FETCH,
> +
> + /* Protection flag check failed */
> + IOMMU_FAULT_REASON_PERMISSION,
> +};
> +
> +/**
> + * struct iommu_fault_event - Generic per device fault data
> + *
> + * - PCI and non-PCI devices
> + * - Recoverable faults (e.g. page request), information based on PCI ATS
> + * and PASID spec.
> + * - Un-recoverable faults of device interest
> + * - DMA remapping and IRQ remapping faults
> +
> + * @type: contains fault type.
> + * @reason: fault reasons if relevant outside IOMMU driver, IOMMU driver internal
> + * faults are not reported
> + * @addr: tells the offending page address
> + * @pasid: contains process address space ID, used in shared virtual memory(SVM)
> + * @rid: requestor ID
> + * @page_req_group_id: page request group index
> + * @last_req: last request in a page request group
> + * @pasid_valid: indicates if the PRQ has a valid PASID
> + * @prot: page access protection flag, e.g. IOMMU_FAULT_READ,
> IOMMU_FAULT_WRITE
> + * @device_private: if present, uniquely identify device-specific
> + * private data for an individual page request.
> + * @iommu_private: used by the IOMMU driver for storing fault-specific
> + * data. Users should not modify this field before
> + * sending the fault response.
> + */
> +struct iommu_fault_event {
> + enum iommu_fault_type type;
> + enum iommu_fault_reason reason;
> + u64 addr;
> + u32 pasid;
> + u32 page_req_group_id;
> + u32 last_req : 1;
> + u32 pasid_valid : 1;
> + u32 prot;
I think userspace also needs to know the fault type, reason, pasid, addr, group_id,
prot. So the definition should be included in uapi/linux/iommu.h.
This comment also applies to "struct page_response_msg". QEMU also wants to
pass the page response to the host.
> + u64 device_private;
> + u64 iommu_private;
These two seem to be in-kernel driver specific data. Maybe split the iommu_fault_event
definition into two parts: one part for the data required by both the driver and
userspace, and one part for in-kernel driver specific data.
Thanks,
Yi Liu
On Sun, 20 May 2018 08:17:43 +0000
"Liu, Yi L" <[email protected]> wrote:
> Hi Jacob,
>
> > From: Jacob Pan [mailto:[email protected]]
> > Sent: Tuesday, April 17, 2018 5:49 AM
> > include/linux/iommu.h | 102
> > +++++++++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 100 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index e963dbd..8968933 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -49,13 +49,17 @@ struct bus_type;
> > struct device;
> > struct iommu_domain;
> > struct notifier_block;
> > +struct iommu_fault_event;
> >
> > /* iommu fault flags */
> > -#define IOMMU_FAULT_READ 0x0
> > -#define IOMMU_FAULT_WRITE 0x1
> > +#define IOMMU_FAULT_READ (1 << 0)
> > +#define IOMMU_FAULT_WRITE (1 << 1)
> > +#define IOMMU_FAULT_EXEC (1 << 2)
> > +#define IOMMU_FAULT_PRIV (1 << 3)
> >
> > typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
> > 			struct device *, unsigned long, int, void *);
> > +typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *, void *);
> >
> > struct iommu_domain_geometry {
> > dma_addr_t aperture_start; /* First address that can be mapped */
> > @@ -264,6 +268,99 @@ struct iommu_device {
> > struct device *dev;
> > };
> >
> > +/* Generic fault types, can be expanded IRQ remapping fault */
> > +enum iommu_fault_type {
> > + IOMMU_FAULT_DMA_UNRECOV = 1, /* unrecoverable fault */
> > + IOMMU_FAULT_PAGE_REQ, /* page request fault */
> > +};
> > +
> > +enum iommu_fault_reason {
> > + IOMMU_FAULT_REASON_UNKNOWN = 0,
> > +
> > + /* IOMMU internal error, no specific reason to report out
> > */
> > + IOMMU_FAULT_REASON_INTERNAL,
> > +
> > + /* Could not access the PASID table */
> > + IOMMU_FAULT_REASON_PASID_FETCH,
> > +
> > + /*
> > + * PASID is out of range (e.g. exceeds the maximum PASID
> > + * supported by the IOMMU) or disabled.
> > + */
> > + IOMMU_FAULT_REASON_PASID_INVALID,
> > +
> > + /* Could not access the page directory (Invalid PASID
> > entry) */
> > + IOMMU_FAULT_REASON_PGD_FETCH,
> > +
> > + /* Could not access the page table entry (Bad address) */
> > + IOMMU_FAULT_REASON_PTE_FETCH,
> > +
> > + /* Protection flag check failed */
> > + IOMMU_FAULT_REASON_PERMISSION,
> > +};
> > +
> > +/**
> > + * struct iommu_fault_event - Generic per device fault data
> > + *
> > + * - PCI and non-PCI devices
> > + * - Recoverable faults (e.g. page request), information based on
> > PCI ATS
> > + * and PASID spec.
> > + * - Un-recoverable faults of device interest
> > + * - DMA remapping and IRQ remapping faults
> > +
> > + * @type contains fault type.
> > + * @reason fault reasons if relevant outside IOMMU driver, IOMMU
> > driver internal
> > + * faults are not reported
> > + * @addr: tells the offending page address
> > + * @pasid: contains process address space ID, used in shared
> > virtual memory(SVM)
> > + * @rid: requestor ID
> > + * @page_req_group_id: page request group index
> > + * @last_req: last request in a page request group
> > + * @pasid_valid: indicates if the PRQ has a valid PASID
> > + * @prot: page access protection flag, e.g. IOMMU_FAULT_READ,
> > IOMMU_FAULT_WRITE
> > + * @device_private: if present, uniquely identify device-specific
> > + * private data for an individual page request.
> > + * @iommu_private: used by the IOMMU driver for storing
> > fault-specific
> > + * data. Users should not modify this field before
> > + * sending the fault response.
> > + */
> > +struct iommu_fault_event {
> > + enum iommu_fault_type type;
> > + enum iommu_fault_reason reason;
> > + u64 addr;
> > + u32 pasid;
> > + u32 page_req_group_id;
> > + u32 last_req : 1;
> > + u32 pasid_valid : 1;
> > + u32 prot;
>
> I think userspace also needs to know the fault type, reason, pasid,
> addr, goup_id, prot. So the definition should be included in
> uapi/Linux/iommu.h.
>
> This comment also applies to "struct page_response_msg". Qemu also
> wants to pass the page response to host.
>
Sounds good. I assume the VFIO layer would reuse this data.
> > + u64 device_private;
> > + u64 iommu_private;
>
> These two seems to be in kernel driver specific data. May split the
> iommu_fault_event definition into two parts. One part for the data
> required by both driver and userspace. One part for in kernel driver
> specific data.
>
Even device and IOMMU private data can potentially be consumed by the
guest kernel for some special processing. But for now, we just copy them
back into the response message; e.g. the VT-d streaming page response
request is embedded in the IOMMU private data.
> Thanks,
> Yi Liu
>
[Jacob Pan]
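To illustrate the split discussed above, a rough sketch only (struct and flag names
here are made up, not part of the patchset); the driver-private fields stay
kernel-internal and, as noted, are copied back into the page response:

/* uapi/linux/iommu.h: the part both the IOMMU driver and userspace
 * (e.g. QEMU via VFIO) need to see. Layout is illustrative only.
 */
struct iommu_fault {
	__u32	type;			/* enum iommu_fault_type */
	__u32	reason;			/* enum iommu_fault_reason */
	__u64	addr;
	__u32	pasid;
	__u32	page_req_group_id;
	__u32	prot;			/* IOMMU_FAULT_READ, IOMMU_FAULT_WRITE, ... */
	__u32	flags;
#define IOMMU_FAULT_LAST_REQ		(1 << 0)
#define IOMMU_FAULT_PASID_VALID		(1 << 1)
};

/* linux/iommu.h: kernel-internal wrapper adding driver-private data that
 * is copied back into the page response (e.g. VT-d streaming responses).
 */
struct iommu_fault_event {
	struct iommu_fault	fault;
	u64			device_private;
	u64			iommu_private;
};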
On Fri, 20 Apr 2018 16:42:51 -0700
Jacob Pan <[email protected]> wrote:
> On Fri, 20 Apr 2018 19:25:34 +0100
> Jean-Philippe Brucker <[email protected]> wrote:
>
> > On Tue, Apr 17, 2018 at 08:10:47PM +0100, Alex Williamson wrote:
> > [...]
> > > > + /* Assign guest PASID table pointer and size order */
> > > > + ctx_lo = (pasidt_binfo->base_ptr & VTD_PAGE_MASK) |
> > > > + (pasidt_binfo->pasid_bits - MIN_NR_PASID_BITS);
> > >
> > > Where does this IOMMU API interface define that base_ptr is 4K
> > > aligned or the format of the PASID table? Are these all
> > > standardized or do they vary by host IOMMU? If they're standards,
> > > maybe we could note that and the spec which defines them when we
> > > declare base_ptr. If they're IOMMU specific then I don't
> > > understand how we'll match a user provided PASID table to the
> > > requirements and format of the host IOMMU. Thanks,
> >
> > On SMMUv3 the minimum alignment for base_ptr is 64 bytes, so a guest
> > under a vSMMU might pass a pointer that's not aligned on 4k.
> >
> PASID table pointer for VT-d is 4K aligned.
> > Maybe this information could be part of the data passed to userspace
> > about IOMMU table formats and features? They're not part of this
> > series, but I think we wanted to communicate IOMMU-specific features
> > via sysfs.
> >
> Agreed, I believe Yi Liu is working on a sysfs interface such that QEMU
> can match IOMMU model and features.
Digging this up again since v5 still has this issue. The IOMMU API is
a kernel internal abstraction of the IOMMU. sysfs is a userspace
interface. Are we suggesting that the /only/ way to make use of the
internal IOMMU API here is to have a user provided opaque pasid table
that we can't even do minimal compatibility sanity testing on and we
simply hope that hardware covers all the fault conditions without
taking the host down with it? I guess we have to assume the latter
since the user has full control of the table, but I have a hard time
getting past lack of internal ability to use the interface and no
ability to provide even the slimmest sanity testing. Thanks,
Alex
> From: Alex Williamson [mailto:[email protected]]
> Sent: Wednesday, May 30, 2018 4:09 AM
>
> On Fri, 20 Apr 2018 16:42:51 -0700
> Jacob Pan <[email protected]> wrote:
>
> > On Fri, 20 Apr 2018 19:25:34 +0100
> > Jean-Philippe Brucker <[email protected]> wrote:
> >
> > > On Tue, Apr 17, 2018 at 08:10:47PM +0100, Alex Williamson wrote:
> > > [...]
> > > > > + /* Assign guest PASID table pointer and size order */
> > > > > + ctx_lo = (pasidt_binfo->base_ptr & VTD_PAGE_MASK) |
> > > > > + (pasidt_binfo->pasid_bits - MIN_NR_PASID_BITS);
> > > >
> > > > Where does this IOMMU API interface define that base_ptr is 4K
> > > > aligned or the format of the PASID table? Are these all
> > > > standardized or do they vary by host IOMMU? If they're standards,
> > > > maybe we could note that and the spec which defines them when we
> > > > declare base_ptr. If they're IOMMU specific then I don't
> > > > understand how we'll match a user provided PASID table to the
> > > > requirements and format of the host IOMMU. Thanks,
> > >
> > > On SMMUv3 the minimum alignment for base_ptr is 64 bytes, so a
> guest
> > > under a vSMMU might pass a pointer that's not aligned on 4k.
> > >
> > PASID table pointer for VT-d is 4K aligned.
> > > Maybe this information could be part of the data passed to userspace
> > > about IOMMU table formats and features? They're not part of this
> > > series, but I think we wanted to communicate IOMMU-specific features
> > > via sysfs.
> > >
> > Agreed, I believe Yi Liu is working on a sysfs interface such that QEMU
> > can match IOMMU model and features.
>
> Digging this up again since v5 still has this issue. The IOMMU API is
> a kernel internal abstraction of the IOMMU. sysfs is a userspace
> interface. Are we suggesting that the /only/ way to make use of the
> internal IOMMU API here is to have a user provided opaque pasid table
> that we can't even do minimal compatibility sanity testing on and we
> simply hope that hardware covers all the fault conditions without
> taking the host down with it? I guess we have to assume the latter
> since the user has full control of the table, but I have a hard time
> getting past lack of internal ability to use the interface and no
> ability to provide even the slimmest sanity testing. Thanks,
>
Checking size, alignment, etc. is OK, and I think that is already considered
by the vendor IOMMU driver. However, sanity testing the table format might
be difficult. The initial table provided by the guest is likely just all zeros;
whatever format violation there is may be caught only when a PASID entry
is updated...
Thanks
Kevin
On Wed, 30 May 2018 01:41:43 +0000
"Tian, Kevin" <[email protected]> wrote:
> > From: Alex Williamson [mailto:[email protected]]
> > Sent: Wednesday, May 30, 2018 4:09 AM
> >
> > On Fri, 20 Apr 2018 16:42:51 -0700
> > Jacob Pan <[email protected]> wrote:
> >
> > > On Fri, 20 Apr 2018 19:25:34 +0100
> > > Jean-Philippe Brucker <[email protected]> wrote:
> > >
> > > > On Tue, Apr 17, 2018 at 08:10:47PM +0100, Alex Williamson wrote:
> > > > [...]
> > > > > > + /* Assign guest PASID table pointer and size order */
> > > > > > + ctx_lo = (pasidt_binfo->base_ptr & VTD_PAGE_MASK) |
> > > > > > + (pasidt_binfo->pasid_bits - MIN_NR_PASID_BITS);
> > > > >
> > > > > Where does this IOMMU API interface define that base_ptr is 4K
> > > > > aligned or the format of the PASID table? Are these all
> > > > > standardized or do they vary by host IOMMU? If they're standards,
> > > > > maybe we could note that and the spec which defines them when we
> > > > > declare base_ptr. If they're IOMMU specific then I don't
> > > > > understand how we'll match a user provided PASID table to the
> > > > > requirements and format of the host IOMMU. Thanks,
> > > >
> > > > On SMMUv3 the minimum alignment for base_ptr is 64 bytes, so a
> > guest
> > > > under a vSMMU might pass a pointer that's not aligned on 4k.
> > > >
> > > PASID table pointer for VT-d is 4K aligned.
> > > > Maybe this information could be part of the data passed to userspace
> > > > about IOMMU table formats and features? They're not part of this
> > > > series, but I think we wanted to communicate IOMMU-specific features
> > > > via sysfs.
> > > >
> > > Agreed, I believe Yi Liu is working on a sysfs interface such that QEMU
> > > can match IOMMU model and features.
> >
> > Digging this up again since v5 still has this issue. The IOMMU API is
> > a kernel internal abstraction of the IOMMU. sysfs is a userspace
> > interface. Are we suggesting that the /only/ way to make use of the
> > internal IOMMU API here is to have a user provided opaque pasid table
> > that we can't even do minimal compatibility sanity testing on and we
> > simply hope that hardware covers all the fault conditions without
> > taking the host down with it? I guess we have to assume the latter
> > since the user has full control of the table, but I have a hard time
> > getting past lack of internal ability to use the interface and no
> > ability to provide even the slimmest sanity testing. Thanks,
> >
>
> checking size, alignment, ... is OK, which I think is already considered
> by vendor IOMMU driver. However sanity testing table format might
> be difficult. The initial table provided by guest is likely just all ZEROs.
> whatever format violation may be caught only when a PASID entry
> is updated...
There's sanity testing the actual contents of the table, which I agree
would be difficult and would likely require some sort of shadowing at
additional overhead, but what about even basic consistency checking?
For example, is it possible that due to hardware variations a user
might generate a table which works on some systems but not others? What
if two table formats are sufficiently similar that the IOMMU driver
puts an incompatible table in place but it continuously generates
faults, how do we debug that? As an intermediary in this whole process
I'd really rather be able to identify that the user claims to be
providing a TypeA table but the IOMMU only supports TypeB, so clearly
this won't work. I don't see that we have that capability. Thanks,
Alex
> From: Alex Williamson [mailto:[email protected]]
> Sent: Wednesday, May 30, 2018 11:18 AM
>
> On Wed, 30 May 2018 01:41:43 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Alex Williamson [mailto:[email protected]]
> > > Sent: Wednesday, May 30, 2018 4:09 AM
> > >
> > > On Fri, 20 Apr 2018 16:42:51 -0700
> > > Jacob Pan <[email protected]> wrote:
> > >
> > > > On Fri, 20 Apr 2018 19:25:34 +0100
> > > > Jean-Philippe Brucker <[email protected]> wrote:
> > > >
> > > > > On Tue, Apr 17, 2018 at 08:10:47PM +0100, Alex Williamson wrote:
> > > > > [...]
> > > > > > > + /* Assign guest PASID table pointer and size order */
> > > > > > > + ctx_lo = (pasidt_binfo->base_ptr & VTD_PAGE_MASK) |
> > > > > > > + (pasidt_binfo->pasid_bits - MIN_NR_PASID_BITS);
> > > > > >
> > > > > > Where does this IOMMU API interface define that base_ptr is 4K
> > > > > > aligned or the format of the PASID table? Are these all
> > > > > > standardized or do they vary by host IOMMU? If they're standards,
> > > > > > maybe we could note that and the spec which defines them when
> we
> > > > > > declare base_ptr. If they're IOMMU specific then I don't
> > > > > > understand how we'll match a user provided PASID table to the
> > > > > > requirements and format of the host IOMMU. Thanks,
> > > > >
> > > > > On SMMUv3 the minimum alignment for base_ptr is 64 bytes, so a
> > > guest
> > > > > under a vSMMU might pass a pointer that's not aligned on 4k.
> > > > >
> > > > PASID table pointer for VT-d is 4K aligned.
> > > > > Maybe this information could be part of the data passed to
> userspace
> > > > > about IOMMU table formats and features? They're not part of this
> > > > > series, but I think we wanted to communicate IOMMU-specific
> features
> > > > > via sysfs.
> > > > >
> > > > Agreed, I believe Yi Liu is working on a sysfs interface such that QEMU
> > > > can match IOMMU model and features.
> > >
> > > Digging this up again since v5 still has this issue. The IOMMU API is
> > > a kernel internal abstraction of the IOMMU. sysfs is a userspace
> > > interface. Are we suggesting that the /only/ way to make use of the
> > > internal IOMMU API here is to have a user provided opaque pasid table
> > > that we can't even do minimal compatibility sanity testing on and we
> > > simply hope that hardware covers all the fault conditions without
> > > taking the host down with it? I guess we have to assume the latter
> > > since the user has full control of the table, but I have a hard time
> > > getting past lack of internal ability to use the interface and no
> > > ability to provide even the slimmest sanity testing. Thanks,
> > >
> >
> > checking size, alignment, ... is OK, which I think is already considered
> > by vendor IOMMU driver. However sanity testing table format might
> > be difficult. The initial table provided by guest is likely just all ZEROs.
> > whatever format violation may be caught only when a PASID entry
> > is updated...
>
> There's sanity testing the actual contents of the table, which I agree
> would be difficult and would likely require some sort of shadowing at
> additional overhead, but what about even basic consistency checking?
> For example, is it possible that due to hardware variations a user
> might generate a table which works on some systems but not others?
> What
> if two table formats are sufficiently similar that the IOMMU driver
> puts an incompatible table in place but it continuously generates
> faults, how do we debug that? As an intermediary in this whole process
> I'd really rather be able to identify that the user claims to be
> providing a TypeA table but the IOMMU only supports TypeB, so clearly
> this won't work. I don't see that we have that capability. Thanks,
>
I remember we once discussed defining some vendor/model ID,
which can be retrieved by user space and then passed back when
doing the table binding. Then the simple model matching check above can
be done accordingly. It is actually a basic requirement when using
virtio-iommu, where the same driver is expected to work on all vendor IOMMUs.
However I don't remember whether/where that logic is implemented
in this series (especially when there are two tracks moving in parallel).
I'll leave it to Jacob/Jean to comment further.
Thanks
Kevin
On 30/05/18 04:45, Tian, Kevin wrote:
>>>>>> On SMMUv3 the minimum alignment for base_ptr is 64 bytes, so a
>>>> guest
>>>>>> under a vSMMU might pass a pointer that's not aligned on 4k.
>>>>>>
>>>>> PASID table pointer for VT-d is 4K aligned.
>>>>>> Maybe this information could be part of the data passed to
>> userspace
>>>>>> about IOMMU table formats and features? They're not part of this
>>>>>> series, but I think we wanted to communicate IOMMU-specific
>> features
>>>>>> via sysfs.
>>>>>>
>>>>> Agreed, I believe Yi Liu is working on a sysfs interface such that QEMU
>>>>> can match IOMMU model and features.
>>>>
>>>> Digging this up again since v5 still has this issue. The IOMMU API is
>>>> a kernel internal abstraction of the IOMMU. sysfs is a userspace
>>>> interface. Are we suggesting that the /only/ way to make use of the
>>>> internal IOMMU API here is to have a user provided opaque pasid table
>>>> that we can't even do minimal compatibility sanity testing on and we
>>>> simply hope that hardware covers all the fault conditions without
>>>> taking the host down with it? I guess we have to assume the latter
>>>> since the user has full control of the table, but I have a hard time
>>>> getting past lack of internal ability to use the interface and no
>>>> ability to provide even the slimmest sanity testing. Thanks,
>>>>
>>>
>>> checking size, alignment, ... is OK, which I think is already considered
>>> by vendor IOMMU driver. However sanity testing table format might
>>> be difficult. The initial table provided by guest is likely just all ZEROs.
>>> whatever format violation may be caught only when a PASID entry
>>> is updated...
>>
>> There's sanity testing the actual contents of the table, which I agree
>> would be difficult and would likely require some sort of shadowing at
>> additional overhead, but what about even basic consistency checking?
>> For example, is it possible that due to hardware variations a user
>> might generate a table which works on some systems but not others?
>> What
>> if two table formats are sufficiently similar that the IOMMU driver
>> puts an incompatible table in place but it continuously generates
>> faults, how do we debug that? As an intermediary in this whole process
>> I'd really rather be able to identify that the user claims to be
>> providing a TypeA table but the IOMMU only supports TypeB, so clearly
>> this won't work. I don't see that we have that capability. Thanks,
>
> I remember we ever discussed to define some vendor/model ID,
> which can be retrieved by user space and then passed back when
> doing table binding. Then above simple model matching check can
> be done accordingly. It is actually a basic requirement when using
> virtio-iommu, same driver expecting to work on all vendor IOMMUs.
>
> However I don't remember whether/where that logic is implemented
> in this series (especially when there are two tracks moving in parallel).
> I'll leave to Jacob/Jean to further comment.
For Arm we do need some form of sanity checking. As each architecture
version brings a new set of features that may be supported and enabled
individually, we need to communicate fine-grained features to users.
They describe the general capability of the physical IOMMU, and also
which fields are available in the PASID table (entries are 512 bits and
leave some space for future extensions).
In the past I briefly tried using an ioctl-based interface through VFIO
only, but it seemed more complicated to extend than sysfs for this kind
of probing.
Note that the following is from my own prototype. I'm not sure how much
Yi Liu's implementation differs but I think this was roughly what we
agreed on last time. In sysfs an IOMMU device is described with:
* A model number, for example intel-vtd=1, arm-smmu-v3=2.
* Properties and features, describing in detail what the pIOMMU device
and driver support.
/sys/class/iommu/<iommu-dev>/<model>/<property>
For example an SMMUv3:
The model number is described as a property
/sys/class/iommu/smmu.0x00000000e0600000/arm-smmu-v3/model = 2
A few feature bits and values:
.../arm-smmu-v3/asid_bits // max address space ID bits, %d
.../arm-smmu-v3/ssid_bits // max substream ID (PASID) bits, %d
.../arm-smmu-v3/input_bits // max input address size, %d
.../arm-smmu-v3/output_bits // max output address size, %d
.../arm-smmu-v3/btm // broadcast TLB maintenance, enabled/disabled
.../arm-smmu-v3/httu // Hardware table update, access+dirty/access/none
.../arm-smmu-v3/stall // transaction stalling, enabled/disabled/force
(Note that the base pointer alignment previously discussed could be
implied by the model number, or added explicitly here.)
Which page table formats are supported:
.../arm-smmu-v3/pgtable_format/lpae-64
.../arm-smmu-v3/pgtable_format/v7s
I'm not sure yet what values these will have, they might simply contain
arbitrary format numbers because fields available in the page tables can
be deduced from the above feature bits. (Out of laziness, in my
prototype I just describe a preferred format in a pgtable_format file)
As you can imagine I'd rather not pass the fine details back to the
kernel in bind_pasid_table. The list of features is growing, and
describing them is a pain. It could be done for debugging purpose, but
all we'd be achieving is telling the kernel that userspace has read the
values, not that the guest intends to use them. The guest selects
features by writing PASID table entries, which aren't read by the host.
If the guest writes invalid values in the PASID table then yes, we have
to rely on the hardware to contain the fault and not bring the host down
with it. If the IOMMU cannot do that, then the driver really shouldn't
implement bind_pasid_table... Otherwise, a fault while reading the PASID
table can be injected into the guest as an unrecoverable fault
(IOMMU_FAULT_REASON_PASID_INVALID or IOMMU_FAULT_REASON_PGD_FETCH in
patch 10) or printed by the host when debugging.
However I think the model number should be added to pasid_table_config.
For one thing it gives us a simple sanity-check, but it also tells which
other fields are valid in pasid_table_config. Arm-smmu-v3 needs at least
two additional 8-bit fields describing the PASID table format (number of
levels and PASID0 behaviour), which are written to device context tables
when installing the PASID table pointer.
Compatibility: new optional features are easy to add to a given model,
just add a new sysfs file. If in the future, the host describes a new
feature that is mandatory, or implements a different PASID table format,
how does it ensure that user understands it? Perhaps use a new model
number for this, e.g. "arm-smmu-v3-a=3", with similar features. I think
it would be the same if the host stops supporting a feature for a given
model, because they are ABI. But we can also define default values from
the start, for example "if ssid_bits file isn't present, default value
is 0 - PASID not supported"
Thanks,
Jean
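As a purely illustrative user-space sketch (paths and property names follow the
prototype layout above and may change), a user such as QEMU could probe these
attributes before attempting bind_pasid_table; a missing file means the default
applies, e.g. ssid_bits absent => 0, PASID not supported:

#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Read one pIOMMU property, e.g. prop = "model" or "ssid_bits".
 * Returns 0 on success and leaves the value as a string in val.
 */
static int read_iommu_prop(const char *iommu, const char *model,
			   const char *prop, char *val, size_t len)
{
	char path[PATH_MAX];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/class/iommu/%s/%s/%s",
		 iommu, model, prop);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(val, len, f)) {
		fclose(f);
		return -1;
	}
	val[strcspn(val, "\n")] = '\0';
	fclose(f);
	return 0;
}

The caller would compare the model number (and any mandatory features) against
what it supports and only then issue the bind.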
On Wed, 30 May 2018 12:53:53 +0100
Jean-Philippe Brucker <[email protected]> wrote:
> On 30/05/18 04:45, Tian, Kevin wrote:
> >>>>>> On SMMUv3 the minimum alignment for base_ptr is 64 bytes, so
> >>>>>> a
> >>>> guest
> >>>>>> under a vSMMU might pass a pointer that's not aligned on 4k.
> >>>>>>
> >>>>> PASID table pointer for VT-d is 4K aligned.
> >>>>>> Maybe this information could be part of the data passed to
> >> userspace
> >>>>>> about IOMMU table formats and features? They're not part of
> >>>>>> this series, but I think we wanted to communicate
> >>>>>> IOMMU-specific
> >> features
> >>>>>> via sysfs.
> >>>>>>
> >>>>> Agreed, I believe Yi Liu is working on a sysfs interface such
> >>>>> that QEMU can match IOMMU model and features.
> >>>>
> >>>> Digging this up again since v5 still has this issue. The IOMMU
> >>>> API is a kernel internal abstraction of the IOMMU. sysfs is a
> >>>> userspace interface. Are we suggesting that the /only/ way to
> >>>> make use of the internal IOMMU API here is to have a user
> >>>> provided opaque pasid table that we can't even do minimal
> >>>> compatibility sanity testing on and we simply hope that hardware
> >>>> covers all the fault conditions without taking the host down
> >>>> with it? I guess we have to assume the latter since the user
> >>>> has full control of the table, but I have a hard time getting
> >>>> past lack of internal ability to use the interface and no
> >>>> ability to provide even the slimmest sanity testing. Thanks,
> >>>
> >>> checking size, alignment, ... is OK, which I think is already
> >>> considered by vendor IOMMU driver. However sanity testing table
> >>> format might be difficult. The initial table provided by guest is
> >>> likely just all ZEROs. whatever format violation may be caught
> >>> only when a PASID entry is updated...
> >>
> >> There's sanity testing the actual contents of the table, which I
> >> agree would be difficult and would likely require some sort of
> >> shadowing at additional overhead, but what about even basic
> >> consistency checking? For example, is it possible that due to
> >> hardware variations a user might generate a table which works on
> >> some systems but not others? What
> >> if two table formats are sufficiently similar that the IOMMU driver
> >> puts an incompatible table in place but it continuously generates
> >> faults, how do we debug that? As an intermediary in this whole
> >> process I'd really rather be able to identify that the user claims
> >> to be providing a TypeA table but the IOMMU only supports TypeB,
> >> so clearly this won't work. I don't see that we have that
> >> capability. Thanks,
> >
> > I remember we ever discussed to define some vendor/model ID,
> > which can be retrieved by user space and then passed back when
> > doing table binding. Then above simple model matching check can
> > be done accordingly. It is actually a basic requirement when using
> > virtio-iommu, same driver expecting to work on all vendor IOMMUs.
> >
> > However I don't remember whether/where that logic is implemented
> > in this series (especially when there are two tracks moving in
> > parallel). I'll leave to Jacob/Jean to further comment.
>
> For Arm we do need some form of sanity checking. As each architecture
> version brings a new set of features that may be supported and enabled
> individually, we need to communicate fine-grained features to users.
> They describes the general capability of the physical IOMMU, and also
> which fields are available in the PASID table (entries are 512-bits
> and leave some space for future extensions).
>
> In the past I briefly tried using a ioctl-based interface through VFIO
> only, but it seemed more complicated to extend than sysfs for this
> kind of probing.
>
> Note that the following is from my own prototype. I'm not sure how
> much Yi Liu's implementation differs but I think this was roughly
> what we agreed on last time. In sysfs an IOMMU device is described
> with:
>
> * A model number, for example intel-vtd=1, arm-smmu-v3=2.
> * Properties and features, describing in detail what the pIOMMU device
> and driver support.
>
> /sys/class/iommu/<iommu-dev>/<model>/<property>
>
> For example an SMMUv3:
>
> The model number is described as a property
> /sys/class/iommu/smmu.0x00000000e0600000/arm-smmu-v3/model = 2
>
> A few feature bits and values:
> .../arm-smmu-v3/asid_bits // max address space ID bits, %d
> .../arm-smmu-v3/ssid_bits // max substream ID (PASID) bits, %d
> .../arm-smmu-v3/input_bits // max input address size, %d
> .../arm-smmu-v3/output_bits // max output address size, %d
> .../arm-smmu-v3/btm        // broadcast TLB maintenance, enabled/disabled
> .../arm-smmu-v3/httu       // Hardware table update, access+dirty/access/none
> .../arm-smmu-v3/stall      // transaction stalling, enabled/disabled/force
>
> (Note that the base pointer alignment previously discussed could be
> implied by the model number, or added explicitly here.)
>
> Which page table formats are supported:
> .../arm-smmu-v3/pgtable_format/lpae-64
> .../arm-smmu-v3/pgtable_format/v7s
> I'm not sure yet what values these will have, they might simply
> contain arbitrary format numbers because fields available in the page
> tables can be deduced from the above features bits. (Out of laziness,
> in my prototype I just describe a preferred format in a
> pgtable_format file)
>
> As you can imagine I'd rather not pass the fine details back to the
> kernel in bind_pasid_table. The list of features is growing, and
> describing them is a pain. It could be done for debugging purpose, but
> all we'd be achieving is telling the kernel that userspace has read
> the values, not that the guest intends to use them. The guest selects
> features by writing PASID table entries, which aren't read by the
> host.
>
> If the guest writes invalid values in the PASID table then yes, we
> have to rely on the hardware to contain the fault and not bring the
> host down with it. If the IOMMU cannot do that, then the driver
> really shouldn't implement bind_pasid_table... Otherwise, a fault
> while reading the PASID table can be injected into the guest as an
> unrecoverable fault (IOMMU_FAULT_REASON_PASID_INVALID or
> IOMMU_FAULT_REASON_PGD_FETCH in patch 10) or printed by the host when
> debugging.
>
> However I think the model number should be added to
> pasid_table_config. For one thing it gives us a simple sanity-check,
> but it also tells which other fields are valid in pasid_table_config.
> Arm-smmu-v3 needs at least two additional 8-bit fields describing the
> PASID table format (number of levels and PASID0 behaviour), which are
> written to device context tables when installing the PASID table
> pointer.
>
We had a model number field in v2 of this patchset. My thought was that
since the config info is meant to be generic, we shouldn't include
model info. But I also think a simple sanity check can be useful;
would that be sufficient to address Alex's concern? Of course we still
need sysfs for more specific IOMMU features.
Would this work?
enum pasid_table_model {
PASID_TABLE_FORMAT_HOST,
PASID_TABLE_FORMAT_ARM_1LVL,
PASID_TABLE_FORMAT_ARM_2LVL,
PASID_TABLE_FORMAT_AMD,
PASID_TABLE_FORMAT_INTEL,
};
/**
* PASID table data used to bind guest PASID table to the host IOMMU. This will
* enable guest managed first level page tables.
* @version: for future extensions and identification of the data format
* @bytes: size of this structure
* @model: PASID table format for different IOMMU models
* @base_ptr: PASID table pointer
* @pasid_bits: number of bits supported in the guest PASID table, must be less
* than or equal to the host supported PASID size.
*/
struct pasid_table_config {
__u32 version;
#define PASID_TABLE_CFG_VERSION_1 1
__u32 bytes;
enum pasid_table_model model;
__u64 base_ptr;
__u8 pasid_bits;
};
> Compatibility: new optional features are easy to add to a given model,
> just add a new sysfs file. If in the future, the host describes a new
> feature that is mandatory, or implements a different PASID table
> format, how does it ensure that user understands it? Perhaps use a
> new model number for this, e.g. "arm-smmu-v3-a=3", with similar
> features. I think it would be the same if the host stops supporting a
> feature for a given model, because they are ABI. But we can also
> define default values from the start, for example "if ssid_bits file
> isn't present, default value is 0 - PASID not supported"
>
> Thanks,
> Jean
[Jacob Pan]
On 30/05/18 20:52, Jacob Pan wrote:
>> However I think the model number should be added to
>> pasid_table_config. For one thing it gives us a simple sanity-check,
>> but it also tells which other fields are valid in pasid_table_config.
>> Arm-smmu-v3 needs at least two additional 8-bit fields describing the
>> PASID table format (number of levels and PASID0 behaviour), which are
>> written to device context tables when installing the PASID table
>> pointer.
>>
> We had model number field in v2 of this patchset. My thought was that
> since the config info is meant to be generic, we shouldn't include
> model info. But I also think a simple sanity check can be useful,
> would that be sufficient to address Alex's concern? Of course we still
> need sysfs for more specific IOMMU features.
>
> Would this work?
> enum pasid_table_model {
> PASID_TABLE_FORMAT_HOST,
> PASID_TABLE_FORMAT_ARM_1LVL,
> PASID_TABLE_FORMAT_ARM_2LVL,
I'd rather use a single PASID_TABLE_FORMAT_ARM, because "2LVL" may be
further split into 2LVL_4k or 2LVL_64k leaf tables... I think it's best
if I add an arch-specific field in pasid_table_config for that, and for
the PASID0 configuration, when adding FORMAT_ARM in a future patch
> PASID_TABLE_FORMAT_AMD,
> PASID_TABLE_FORMAT_INTEL,
> };
>
> /**
> * PASID table data used to bind guest PASID table to the host IOMMU. This will
> * enable guest managed first level page tables.
> * @version: for future extensions and identification of the data format
> * @bytes: size of this structure
> * @model: PASID table format for different IOMMU models
> * @base_ptr: PASID table pointer
> * @pasid_bits: number of bits supported in the guest PASID table, must be less
> * or equal than the host supported PASID size.
> */
> struct pasid_table_config {
> __u32 version;
> #define PASID_TABLE_CFG_VERSION_1 1
> __u32 bytes;
"bytes" could be passed by VFIO as argument to bind_pasid_table, since
it can deduce it from argsz
Thanks,
Jean
> enum pasid_table_model model;
> __u64 base_ptr;
> __u8 pasid_bits;
> };
>
>
>
>> Compatibility: new optional features are easy to add to a given model,
>> just add a new sysfs file. If in the future, the host describes a new
>> feature that is mandatory, or implements a different PASID table
>> format, how does it ensure that user understands it? Perhaps use a
>> new model number for this, e.g. "arm-smmu-v3-a=3", with similar
>> features. I think it would be the same if the host stops supporting a
>> feature for a given model, because they are ABI. But we can also
>> define default values from the start, for example "if ssid_bits file
>> isn't present, default value is 0 - PASID not supported"
>>
>> Thanks,
>> Jean
>
> [Jacob Pan]
>
On Thu, 31 May 2018 10:09:46 +0100
Jean-Philippe Brucker <[email protected]> wrote:
> On 30/05/18 20:52, Jacob Pan wrote:
> >> However I think the model number should be added to
> >> pasid_table_config. For one thing it gives us a simple
> >> sanity-check, but it also tells which other fields are valid in
> >> pasid_table_config. Arm-smmu-v3 needs at least two additional
> >> 8-bit fields describing the PASID table format (number of levels
> >> and PASID0 behaviour), which are written to device context tables
> >> when installing the PASID table pointer.
> >>
> > We had model number field in v2 of this patchset. My thought was
> > that since the config info is meant to be generic, we shouldn't
> > include model info. But I also think a simple sanity check can be
> > useful, would that be sufficient to address Alex's concern? Of
> > course we still need sysfs for more specific IOMMU features.
> >
> > Would this work?
> > enum pasid_table_model {
> > PASID_TABLE_FORMAT_HOST,
> > PASID_TABLE_FORMAT_ARM_1LVL,
> > PASID_TABLE_FORMAT_ARM_2LVL,
>
> I'd rather use a single PASID_TABLE_FORMAT_ARM, because "2LVL" may be
> further split into 2LVL_4k or 2LVL_64k leaf tables... I think it's
> best if I add an arch-specific field in pasid_table_config for that,
> and for the PASID0 configuration, when adding FORMAT_ARM in a future
> patch
>
sounds good. will only use PASID_TABLE_FORMAT_ARM.
> > PASID_TABLE_FORMAT_AMD,
> > PASID_TABLE_FORMAT_INTEL,
> > };
> >
> > /**
> > * PASID table data used to bind guest PASID table to the host
> > IOMMU. This will
> > * enable guest managed first level page tables.
> > * @version: for future extensions and identification of the data
> > format
> > * @bytes: size of this structure
> > * @model: PASID table format for different IOMMU models
> > * @base_ptr: PASID table pointer
> > * @pasid_bits: number of bits supported in the guest PASID
> > table, must be less
> > * or equal than the host supported PASID size.
> > */
> > struct pasid_table_config {
> > __u32 version;
> > #define PASID_TABLE_CFG_VERSION_1 1
> > __u32 bytes;
>
> "bytes" could be passed by VFIO as argument to bind_pasid_table, since
> it can deduce it from argsz
>
Are you suggesting we wrap this struct in a VFIO struct with argsz, or
do we use this struct directly?
I need to clarify how VFIO will use this.
- User program:
	struct pasid_table_config ptc = { .bytes = sizeof(ptc) };
	ptc.version = 1;
	ioctl(device, VFIO_DEVICE_BIND_PASID_TABLE, &ptc);
- Kernel:
	minsz = offsetofend(struct pasid_table_config, pasid_bits);
	if (ptc.bytes < minsz)
		return -EINVAL;
On 05/06/18 18:32, Jacob Pan wrote:
>> "bytes" could be passed by VFIO as argument to bind_pasid_table, since
>> it can deduce it from argsz
>>
> Are you suggesting we wrap this struct in a vfio struct with argsz? or
> we directly use this struct?
>
> I need to clarify how vfio will use this.
Right, I think we've diverged a bit since the last discussion :)
> - User program:
> struct pasid_table_config ptc = { .bytes = sizeof(ptc) };
> ptc.version = 1;
> ioctl(device, VFIO_DEVICE_BIND_PASID_TABLE, &ptc);
Any reason to do the ioctl on device instead of container? As we're
binding address spaces we probably want a consistent view for the whole
container, like the MAP/UNMAP ioctls do.
As I remember it the userspace interface would use a VFIO header and the
BIND ioctl. I can't find the email in my archive though, so I might be
imagining it. This is what I remember, on the user side:
struct {
struct vfio_iommu_type1_bind hdr;
struct pasid_table_config cfg;
} bind = {
.hdr.argsz = sizeof(bind),
.hdr.flags = VFIO_IOMMU_BIND_PASID_TABLE,
/* cfg data here */
};
ioctl(container, VFIO_DEVICE_BIND, &bind);
But I don't feel strongly about the interface. However I'd suggest
keeping incremental versioning like the rest of VFIO, with argsz and flags,
instead of version numbers, because it's more flexible.
Initially the PTC struct would look like:
struct pasid_table_config {
u32 argsz; /* sizeof(pasid_table_config) */
u32 flags; /* Should be zero */
u64 base_ptr;
u8 model;
u8 pasid_bits;
};
(Even though it doesn't use a version field let's call this version 1
for the sake of the example)
------
If someone wants to add a new field to the structure, then they also add
a flag (let's call this version 2):
struct pasid_table_config {
u32 argsz;
#define PASID_TABLE_CONFIG_EXTN (1 << 0)
u32 flags;
u64 base_ptr;
u8 model;
u8 pasid_bits;
u64 some_extension;
};
* Assume user has a version 2 header and kernel has a version 1 header.
* If user doesn't want the extension, it doesn't set the EXTN flag.
The ioctl succeeds because the kernel checks that argsz >=
offsetofend(pasid_bits) and that (flags == 0).
* If user wants to use the extension, it sets the EXTN flag. The ioctl
fails because the kernel doesn't recognize the flag.
* Assume user has version 1 and kernel has version 2.
* User doesn't use the extension. The kernel still checks that
argsz >= offsetofend(pasid_bits), but also that (flags &
~PASID_TABLE_CONFIG_EXTN), which succeeds.
* User wants the extension, sets PASID_TABLE_CONFIG_EXTN. When
seeing the flag, the kernel additionally checks that argsz >=
offsetofend(some_extension), which succeeds.
------
Adding model-specific fields is a bit more complicated, because I think
they should always stay at the end of the struct. One solution is to add
padding for common extensions:
struct pasid_table_config {
u32 argsz;
u32 flags;
u64 base_ptr;
u8 model;
u8 pasid_bits;
u8 padding[64];
union {
struct {
u8 s1dss;
u8 s1fmt;
} model_arm;
struct {
u64 foo;
} model_bar;
};
};
(we might call this version 3, but can be added before or after version
2, it doesn't matter)
A subsequent extension can still add the "some_extension" field and a
flag. If the kernel sees model "ARM", then it checks argsz >=
offsetofend(model_arm). If it sees model "BAR" then it checks argsz >=
offsetofend(model_bar). A model could also have flags to make the
model-specific structure extensible.
The problem is when we run out of space in the padding area, but we
might not need much extensibility in the common part.
Thanks,
Jean
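A minimal kernel-side sketch of the checks described above (the helper name is
made up; it assumes the "version 2" layout with PASID_TABLE_CONFIG_EXTN and
some_extension):

static int pasid_table_config_check(const struct pasid_table_config *cfg)
{
	/* Base fields, up to and including pasid_bits, must be present */
	if (cfg->argsz < offsetofend(struct pasid_table_config, pasid_bits))
		return -EINVAL;

	/* Reject any flag this kernel does not understand */
	if (cfg->flags & ~PASID_TABLE_CONFIG_EXTN)
		return -EINVAL;

	/* A set extension flag requires argsz to cover the extended field */
	if ((cfg->flags & PASID_TABLE_CONFIG_EXTN) &&
	    cfg->argsz < offsetofend(struct pasid_table_config, some_extension))
		return -EINVAL;

	return 0;
}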
On Wed, 6 Jun 2018 12:20:51 +0100
Jean-Philippe Brucker <[email protected]> wrote:
> On 05/06/18 18:32, Jacob Pan wrote:
> >> "bytes" could be passed by VFIO as argument to bind_pasid_table,
> >> since it can deduce it from argsz
> >>
> > Are you suggesting we wrap this struct in a vfio struct with argsz?
> > or we directly use this struct?
> >
> > I need to clarify how vfio will use this.
>
> Right, I think we've diverged a bit since the last discussion :)
>
> > - User program:
> > struct pasid_table_config ptc = { .bytes = sizeof(ptc) };
> > ptc.version = 1;
> > ioctl(device, VFIO_DEVICE_BIND_PASID_TABLE, &ptc);
>
> Any reason to do the ioctl on device instead of container? As we're
> binding address spaces we probably want a consistent view for the
> whole container, like the MAP/UNMAP ioctls do.
>
I was thinking that since the PASID table storage is per device, it would be
more secure if the PASID table is contained within the device. We
should have one device per container in most cases.
In case two or more devices in the same container share the same
PASID table, isolation may not be good, in that the second device can
DMA with PASIDs it does not own but which are in the shared PASID table.
> As I remember it the userspace interface would use a VFIO header and
> the BIND ioctl. I can't find the email in my archive though, so I
> might be imagining it. This is what I remember, on the user side:
>
> struct {
> struct vfio_iommu_type1_bind hdr;
> struct pasid_table_config cfg;
> } bind = {
> .hdr.argsz = sizeof(bind),
> .hdr.flags = VFIO_IOMMU_BIND_PASID_TABLE,
> /* cfg data here */
> };
>
> ioctl(container, VFIO_DEVICE_BIND, &bind);
>
Or maybe just use your VFIO_IOMMU_BIND command and vfio_iommu_type1_bind
with a new flag and the PTC as the data. There can be future extensions, and
"bind pasid table" alone can be too narrow. And I agree below that using argsz and
flags is more flexible.
i.e.
/* takes pasid_table_config as data for flag VFIO_IOMMU_BIND_PASIDTBL */
struct vfio_iommu_type1_bind {
__u32 argsz;
__u32 flags;
#define VFIO_IOMMU_BIND_PROCESS (1 << 0)
#define VFIO_IOMMU_BIND_PASIDTBL (1 << 1)
__u8 data[];
};
pseudo code in kernel:

switch (bind.flags) {
case VFIO_IOMMU_BIND_PROCESS:
	return vfio_iommu_type1_bind_process(iommu, (void *)arg, &bind);
case VFIO_IOMMU_BIND_PASIDTBL:
	return vfio_iommu_type1_bind_pasid_tbl(iommu, &bind);
}

vfio_iommu_type1_bind_pasid_tbl(iommu, bind)
{
	/* loop through domain list, group, device */
	struct pasid_table_config *ptc = (void *)bind->data;

	iommu_bind_pasid_table(domain, device, ptc);
}
>
> But I don't feel strongly about the interface. However I'd suggest to
> keep incremental versioning like the rest of VFIO, with argsz and
> flags, instead of version numbers, because it's more flexible.
>
> Initially the PTC struct would look like:
> struct pasid_table_config {
> u32 argsz; /* sizeof(pasid_table_config) */
> u32 flags; /* Should be zero */
> u64 base_ptr;
> u8 model;
> u8 pasid_bits;
> };
>
> (Even though it doesn't use a version field let's call this version 1
> for the sake of the example)
>
> ------
> If someone wants to add a new field to the structure, then they also
> add a flag (let's call this version 2):
>
> struct pasid_table_config {
> u32 argsz;
> #define PASID_TABLE_CONFIG_EXTN (1 << 0)
> u32 flags;
> u64 base_ptr;
> u8 model;
> u8 pasid_bits;
> u64 some_extension;
> };
>
> * Assume user has a version 2 header and kernel has a version 1
> header.
> * If user doesn't want the extension, it doesn't set the EXTN flag.
> The ioctl succeeds because the kernel checks that argsz >=
> offsetofend(pasid_bits) and that (flags == 0).
> * If user wants to use the extension, it sets the EXTN flag. The
> ioctl fails because the kernel doesn't recognize the flag.
> * Assume user has version 1 and kernel has version 2.
> * User doesn't use the extension. The kernel still checks that
> argsz >= offsetofend(pasid_bits), but also that no unknown flag is
> set, i.e. (flags & ~PASID_TABLE_CONFIG_EXTN) == 0, which succeeds.
> * User wants the extension, sets PASID_TABLE_CONFIG_EXTN. When
> seeing the flag, the kernel additionally checks that argsz >=
> offsetofend(some_extension), which succeeds.
>
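A minimal sketch of those checks on the kernel side, assuming the
version-2 layout above (check_ptc_compat is a made-up helper name,
offsetofend() is the kernel macro):

/*
 * Sketch only: a "version 2" kernel checking the example layout above.
 * The struct, flag, and helper name are illustrative, not existing code.
 */
static int check_ptc_compat(const struct pasid_table_config *ptc)
{
        /* fields up to and including pasid_bits are always required */
        if (ptc->argsz < offsetofend(struct pasid_table_config, pasid_bits))
                return -EINVAL;

        /* reject flags this kernel does not know about */
        if (ptc->flags & ~PASID_TABLE_CONFIG_EXTN)
                return -EINVAL;

        /* a flag may require additional fields to be present */
        if ((ptc->flags & PASID_TABLE_CONFIG_EXTN) &&
            ptc->argsz < offsetofend(struct pasid_table_config,
                                     some_extension))
                return -EINVAL;

        return 0;
}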
> ------
> Adding model-specific fields is a bit more complicated, because I
> think they should always stay at the end of the struct. One solution
> is to add padding for common extensions:
>
> struct pasid_table_config {
>         u32 argsz;
>         u32 flags;
>         u64 base_ptr;
>         u8  model;
>         u8  pasid_bits;
>         u8  padding[64];
>
>         union {
>                 struct {
>                         u8 s1dss;
>                         u8 s1fmt;
>                 } model_arm;
>                 struct {
>                         u64 foo;
>                 } model_bar;
>         };
> };
>
> (we might call this version 3, but it can be added before or after
> version 2; the order doesn't matter)
>
> A subsequent extension can still add the "some_extension" field and a
> flag. If the kernel sees model "ARM", then it checks argsz >=
> offsetofend(model_arm). If it sees model "BAR" then it checks argsz >=
> offsetofend(model_bar). A model could also have flags to make the
> model-specific structure extensible.
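
A sketch of that model-specific check, assuming the layout above (the
PASID_TABLE_MODEL_* constants are made up here for illustration):

/*
 * Sketch only: argsz check dispatched on the example models above.
 * The model constants are invented for illustration.
 */
static int check_ptc_model(const struct pasid_table_config *ptc)
{
        switch (ptc->model) {
        case PASID_TABLE_MODEL_ARM:
                if (ptc->argsz < offsetofend(struct pasid_table_config,
                                             model_arm))
                        return -EINVAL;
                break;
        case PASID_TABLE_MODEL_BAR:
                if (ptc->argsz < offsetofend(struct pasid_table_config,
                                             model_bar))
                        return -EINVAL;
                break;
        default:
                return -EINVAL;
        }

        return 0;
}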
>
> The problem is when we run out of space in the padding area, but we
> might not need much extensibility in the common part.
>
> Thanks,
> Jean
[Jacob Pan]
On 06/06/18 22:22, Jacob Pan wrote:
> On Wed, 6 Jun 2018 12:20:51 +0100
> Jean-Philippe Brucker <[email protected]> wrote:
>
>> On 05/06/18 18:32, Jacob Pan wrote:
>>>> "bytes" could be passed by VFIO as argument to bind_pasid_table,
>>>> since it can deduce it from argsz
>>>>
>>> Are you suggesting we wrap this struct in a vfio struct with argsz?
>>> or we directly use this struct?
>>>
>>> I need to clarify how vfio will use this.
>>
>> Right, I think we've diverged a bit since the last discussion :)
>>
>>> - User program:
>>> struct pasid_table_config ptc = { .bytes = sizeof(ptc) };
>>> ptc.version = 1;
>>> ioctl(device, VFIO_DEVICE_BIND_PASID_TABLE, &ptc);
>>
>> Any reason to do the ioctl on device instead of container? As we're
>> binding address spaces we probably want a consistent view for the
>> whole container, like the MAP/UNMAP ioctls do.
>>
> I was thinking the PASID table storage is per device; it would be
> more secure if the PASID table is contained within the device. We
> should have one device per container in most cases.
> If two or more devices in the same container share the same PASID
> table, isolation may suffer, in that the second device could do DMA
> with PASIDs it does not own but that are present in the shared table.
The situation seems similar to the map/unmap interface: if two devices
are in the same container, they are not isolated from each other; they
access the same address space. One device can access mappings that were
created for the other, and that's a feature rather than a security
issue. In a non-SVA configuration, if the user wants to isolate two
devices (the usual case), they will use different containers. With SVA
I think they should keep doing that. But that's probably a matter of
taste more than a technical problem.
My issue with doing the ioctl on a device, though, is that we tell
users we can isolate PASIDs at device granularity, which isn't
necessarily the case. If two PCI devices are in the same group because
they aren't isolated by ACS (they can do p2p), then a BIND_PASID_TABLE
call on one device might allow the other device to see the same address
spaces, even if that other device doesn't have a pasid table.
In my host-sva patches I don't allow bind if there's more than one
device in the group, but that's only to keep the series simple, and I
don't think we should prevent SVA support for multi-device groups from
being added later (some people might actually want p2p + PASID). So if
not on containers, the ioctl should at least be on groups. Otherwise
we'll make false promises to users and might run into trouble later.
>> As I remember it the userspace interface would use a VFIO header and
>> the BIND ioctl. I can't find the email in my archive though, so I
>> might be imagining it. This is what I remember, on the user side:
>>
>> struct {
>>         struct vfio_iommu_type1_bind hdr;
>>         struct pasid_table_config cfg;
>> } bind = {
>>         .hdr.argsz = sizeof(bind),
>>         .hdr.flags = VFIO_IOMMU_BIND_PASID_TABLE,
>>         /* cfg data here */
>> };
>>
>> ioctl(container, VFIO_DEVICE_BIND, &bind);
>>
> Or maybe just use your VFIO_IOMMU_BIND command and vfio_iommu_type1_bind
> with a new flag and the PTC as the data. That leaves room for future
> extensions; binding a PASID table alone may be too narrow. And I agree
> below that using argsz and flags is more flexible.
>
> i.e.
> /* takes pasid_table_config as data for flag VFIO_IOMMU_BIND_PASIDTBL */
> struct vfio_iommu_type1_bind {
>         __u32   argsz;
>         __u32   flags;
> #define VFIO_IOMMU_BIND_PROCESS         (1 << 0)
> #define VFIO_IOMMU_BIND_PASIDTBL        (1 << 1)
>         __u8    data[];
> };
>
> pseudo code in kernel:
>
> switch (bind.flags) {
> case VFIO_IOMMU_BIND_PROCESS:
>         return vfio_iommu_type1_bind_process(iommu, (void *)arg,
>                                              &bind);
> case VFIO_IOMMU_BIND_PASIDTBL:
>         return vfio_iommu_type1_bind_pasid_tbl(iommu, &bind);
> }
>
> vfio_iommu_type1_bind_pasid_tbl(iommu, bind)
> {
>         /* loop through domain list, group, device */
>         struct pasid_table_config *ptc = (void *)bind->data;
>
>         iommu_bind_pasid_table(domain, device, ptc);
> }
Seems sensible
Thanks,
Jean