2016-11-03 21:39:49

by Eric Auger

Subject: [RFC 0/8] KVM PCIe/MSI passthrough on ARM/ARM64 (Alt II)

Following Will & Robin's suggestions, this series attempts to propose
an alternative to [1] where the host would arbitrarily decide the
location of the IOVA MSI window and would be able to report to
userspace the list of reserved IOVA regions that cannot be used
with VFIO_IOMMU_MAP_DMA. This would allow userspace to react
in case of conflict.

Userspace can retrieve all the reserved regions through the VFIO_IOMMU_GET_INFO
IOCTL by querying the new RESV_IOVA_RANGE chained capability. Each reserved
IOVA range is put in a separate capability.
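
Purely as an illustration (not part of this series), userspace could walk
the proposed capability chain roughly as follows. Error handling and buffer
sizing are simplified, container_fd is assumed to be an already set-up VFIO
type1 container, and the struct/macro names are the ones introduced by the
vfio/type1 patch of this series:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static void dump_resv_iova_ranges(int container_fd)
{
	struct vfio_iommu_type1_info *info;
	struct vfio_info_cap_header *hdr;
	size_t argsz = sizeof(*info) + 1024;	/* room for the cap chain */
	__u32 off;

	info = calloc(1, argsz);
	if (!info)
		return;
	info->argsz = argsz;

	if (ioctl(container_fd, VFIO_IOMMU_GET_INFO, info) ||
	    !(info->flags & VFIO_IOMMU_INFO_CAPS))
		goto out;

	/* each 'next' field is an offset from the start of the info buffer */
	for (off = info->cap_offset; off; off = hdr->next) {
		hdr = (struct vfio_info_cap_header *)((char *)info + off);
		if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_RESV_IOVA_RANGE) {
			struct vfio_iommu_type1_info_cap_resv_iova_range *r =
				(void *)hdr;
			/* [r->start, r->end] must not be used with MAP_DMA */
			printf("reserved: 0x%llx-0x%llx\n",
			       (unsigned long long)r->start,
			       (unsigned long long)r->end);
		}
	}
out:
	free(info);
}

The chain is terminated by a next offset of 0; all offsets are relative to
the start of the info buffer, which is what the first fix below corrects in
the in-kernel iteration.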

At the IOMMU level, the reserved regions are stored in a per-iommu_domain
list which is populated on each device attachment. An IOMMU
add_reserved_regions callback specializes the registration of the
reserved regions.

On x86, the [FEE0_0000h - FEF0_0000h] MSI window is registered (NOT tested).

On ARM, the PCI host bridge windows (ACS check to be added?) + the MSI IOVA
reserved regions are populated by the arm-smmu driver. Currently the MSI
IOVA region is arbitrarily located at 0x8000000 and is 1MB in size. An IOVA
domain is created in the add_reserved_regions callback. Then MSIs are
transparently mapped using this IOVA domain.

This series currently does not address some features handled in [1]:
- MSI IOVA size requirement computation
- IRQ safety assessment

This RFC has only been tested on ARM Overdrive with QEMU and is sent to help
potential discussions at LPC. Additional development and testing are needed.

2 tentative fixes may be submitted separately:
- vfio: fix vfio_info_cap_add/shift
- iommu/iova: fix __alloc_and_insert_iova_range

Best Regards

Eric

[1] [PATCH v14 00/16] KVM PCIe/MSI passthrough on ARM/ARM64
https://lkml.org/lkml/2016/10/12/347

Git: complete series available at
https://github.com/eauger/linux/tree/v4.9-rc3-reserved-rfc


Eric Auger (7):
vfio: fix vfio_info_cap_add/shift
iommu/iova: fix __alloc_and_insert_iova_range
iommu: Add a list of iommu_reserved_region in iommu_domain
vfio/type1: Introduce RESV_IOVA_RANGE capability
iommu: Handle the list of reserved regions
iommu/vt-d: Implement add_reserved_regions callback
iommu/arm-smmu: implement add_reserved_regions callback

Robin Murphy (1):
iommu/dma: Allow MSI-only cookies

drivers/iommu/arm-smmu.c | 63 +++++++++++++++++++++++++++++++++++++++++
drivers/iommu/dma-iommu.c | 39 +++++++++++++++++++++++++
drivers/iommu/intel-iommu.c | 48 ++++++++++++++++++++++---------
drivers/iommu/iommu.c | 25 ++++++++++++++++
drivers/iommu/iova.c | 2 +-
drivers/vfio/vfio.c | 5 ++--
drivers/vfio/vfio_iommu_type1.c | 63 ++++++++++++++++++++++++++++++++++++++++-
include/linux/dma-iommu.h | 9 ++++++
include/linux/iommu.h | 23 +++++++++++++++
include/uapi/linux/vfio.h | 16 ++++++++++-
10 files changed, 275 insertions(+), 18 deletions(-)

--
1.9.1


2016-11-03 21:40:01

by Eric Auger

Subject: [RFC 2/8] iommu/iova: fix __alloc_and_insert_iova_range

Fix the size check against start_pfn and limit_pfn: limit_pfn is inclusive,
so an allocation fits as long as its last pfn,
start_pfn + size + pad_size - 1, does not exceed limit_pfn.

Signed-off-by: Eric Auger <[email protected]>

---

The issue was observed when playing with a 1-page IOVA domain with the
higher IOVAs reserved.
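
As a worked example (illustration only): for a domain covering a single pfn,
i.e. start_pfn == limit_pfn (limit_pfn being inclusive), a request with
size = 1 and pad_size = 0 hits the old check as start_pfn + 1 > limit_pfn
and returns -ENOMEM although the page fits; with the fix the check becomes
start_pfn + 1 - 1 > limit_pfn, which is false, so the allocation can proceed.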
---
drivers/iommu/iova.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index e23001b..ee29dbf 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -147,7 +147,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
if (!curr) {
if (size_aligned)
pad_size = iova_get_pad_size(size, limit_pfn);
- if ((iovad->start_pfn + size + pad_size) > limit_pfn) {
+ if ((iovad->start_pfn + size + pad_size - 1) > limit_pfn) {
spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
return -ENOMEM;
}
--
1.9.1

2016-11-03 21:40:05

by Eric Auger

Subject: [RFC 3/8] iommu/dma: Allow MSI-only cookies

From: Robin Murphy <[email protected]>

IOMMU domain users such as VFIO face a similar problem to DMA API ops
with regard to mapping MSI messages in systems where the MSI write is
subject to IOMMU translation. With the relevant infrastructure now in
place for managed DMA domains, it's actually really simple for other
users to piggyback off that and reap the benefits without giving up
their own IOVA management, and without having to reinvent their own
wheel in the MSI layer.

Allow such users to opt into automatic MSI remapping by dedicating a
region of their IOVA space to a managed cookie.
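
A minimal usage sketch (for illustration only; the base/size values are
simply the arbitrary ones used by the arm-smmu patch later in this series):

	/*
	 * 'domain' is an UNMANAGED domain whose IOVA space is otherwise
	 * managed by the caller (e.g. VFIO); carve out 1MB at 0x8000000
	 * for transparently mapped MSI doorbells.
	 */
	ret = iommu_get_dma_msi_region_cookie(domain, 0x8000000, 0x100000);
	if (ret)
		return ret;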

Signed-off-by: Robin Murphy <[email protected]>
Signed-off-by: Eric Auger <[email protected]>

---

v13 -> v14:
- restore reserve_iova for iova >= base + size

v1 -> v13 incorporation:
- compared to Robin's version
- add NULL last param to iommu_dma_init_domain
- set the msi_geometry aperture
- I removed
  if (base < U64_MAX - size)
          reserve_iova(iovad, iova_pfn(iovad, base + size), ULONG_MAX);
  because I don't get why we would reserve something out of the scope of
  the iova domain - what do I miss?
---
drivers/iommu/dma-iommu.c | 39 +++++++++++++++++++++++++++++++++++++++
include/linux/dma-iommu.h | 9 +++++++++
2 files changed, 48 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index c5ab866..d45f9a0 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -716,3 +716,42 @@ void iommu_dma_map_msi_msg(int irq, struct msi_msg *msg)
msg->address_lo += lower_32_bits(msi_page->iova);
}
}
+
+/**
+ * iommu_get_dma_msi_region_cookie - Configure a domain for MSI remapping only
+ * @domain: IOMMU domain to prepare
+ * @base: Base address of IOVA region to use as the MSI remapping aperture
+ * @size: Size of the desired MSI aperture
+ *
+ * Users who manage their own IOVA allocation and do not want DMA API support,
+ * but would still like to take advantage of automatic MSI remapping, can use
+ * this to initialise their own domain appropriately.
+ */
+int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
+ dma_addr_t base, u64 size)
+{
+ struct iommu_dma_cookie *cookie;
+ struct iova_domain *iovad;
+ int ret;
+
+ if (domain->type == IOMMU_DOMAIN_DMA)
+ return -EINVAL;
+
+ ret = iommu_get_dma_cookie(domain);
+ if (ret)
+ return ret;
+
+ ret = iommu_dma_init_domain(domain, base, size, NULL);
+ if (ret) {
+ iommu_put_dma_cookie(domain);
+ return ret;
+ }
+
+ cookie = domain->iova_cookie;
+ iovad = &cookie->iovad;
+ if (base < U64_MAX - size)
+ reserve_iova(iovad, iova_pfn(iovad, base + size), ULONG_MAX);
+
+ return 0;
+}
+EXPORT_SYMBOL(iommu_get_dma_msi_region_cookie);
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 32c5890..05ab5b4 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -67,6 +67,9 @@ void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
/* The DMA API isn't _quite_ the whole story, though... */
void iommu_dma_map_msi_msg(int irq, struct msi_msg *msg);

+int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
+ dma_addr_t base, u64 size);
+
#else

struct iommu_domain;
@@ -90,6 +93,12 @@ static inline void iommu_dma_map_msi_msg(int irq, struct msi_msg *msg)
{
}

+static inline int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
+ dma_addr_t base, u64 size)
+{
+ return -ENODEV;
+}
+
#endif /* CONFIG_IOMMU_DMA */
#endif /* __KERNEL__ */
#endif /* __DMA_IOMMU_H */
--
1.9.1

2016-11-03 21:40:10

by Eric Auger

Subject: [RFC 4/8] iommu: Add a list of iommu_reserved_region in iommu_domain

Introduce a new iommu_reserved_region struct. This embodies a
reserved IOVA region, i.e. one that cannot be used along with the
IOMMU API.

An iommu_domain now owns a list of those, protected by a dedicated mutex.

Signed-off-by: Eric Auger <[email protected]>

---
---
drivers/iommu/iommu.c | 2 ++
include/linux/iommu.h | 17 +++++++++++++++++
2 files changed, 19 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 9a2f196..0af07492 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1061,6 +1061,8 @@ static struct iommu_domain *__iommu_domain_alloc(struct bus_type *bus,

domain->ops = bus->iommu_ops;
domain->type = type;
+ INIT_LIST_HEAD(&domain->reserved_regions);
+ mutex_init(&domain->resv_mutex);
/* Assume all sizes by default; the driver may override this later */
domain->pgsize_bitmap = bus->iommu_ops->pgsize_bitmap;

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 436dc21..0f2eb64 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -84,6 +84,8 @@ struct iommu_domain {
void *handler_token;
struct iommu_domain_geometry geometry;
void *iova_cookie;
+ struct list_head reserved_regions;
+ struct mutex resv_mutex; /* protects the reserved region list */
};

enum iommu_cap {
@@ -131,6 +133,21 @@ struct iommu_dm_region {
int prot;
};

+/**
+ * struct iommu_reserved_region - descriptor for a reserved iova region
+ * @list: Linked list pointers
+ * @start: IOVA base address of the region
+ * @length: Length of the region in bytes
+ */
+struct iommu_reserved_region {
+ struct list_head list;
+ dma_addr_t start;
+ size_t length;
+};
+
+#define iommu_reserved_region_for_each(resv, d) \
+ list_for_each_entry(resv, &(d)->reserved_regions, list)
+
#ifdef CONFIG_IOMMU_API

/**
--
1.9.1

2016-11-03 21:40:15

by Eric Auger

Subject: [RFC 5/8] vfio/type1: Introduce RESV_IOVA_RANGE capability

This patch allows userspace to retrieve the reserved IOVA range(s),
if any. The implementation is based on capability chains, now also added
to VFIO_IOMMU_GET_INFO.

Signed-off-by: Eric Auger <[email protected]>

---
---
drivers/vfio/vfio_iommu_type1.c | 63 ++++++++++++++++++++++++++++++++++++++++-
include/uapi/linux/vfio.h | 16 ++++++++++-
2 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ba1942..0d53ac1 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -965,6 +965,42 @@ static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
return ret;
}

+static int add_resv_iova_range(struct vfio_info_cap *caps, u64 start, u64 end)
+{
+ struct vfio_iommu_type1_info_cap_resv_iova_range *cap;
+ struct vfio_info_cap_header *header;
+
+ header = vfio_info_cap_add(caps, sizeof(*cap),
+ VFIO_IOMMU_TYPE1_INFO_CAP_RESV_IOVA_RANGE, 1);
+
+ if (IS_ERR(header))
+ return PTR_ERR(header);
+
+ cap = container_of(header,
+ struct vfio_iommu_type1_info_cap_resv_iova_range,
+ header);
+
+ cap->start = start;
+ cap->end = end;
+ return 0;
+}
+
+static int build_resv_iova_range_caps(struct vfio_iommu *iommu,
+ struct vfio_info_cap *caps)
+{
+ struct vfio_domain *domain;
+ struct iommu_reserved_region *resv;
+
+ domain = list_first_entry(&iommu->domain_list,
+ struct vfio_domain, next);
+
+ iommu_reserved_region_for_each(resv, domain->domain)
+ add_resv_iova_range(caps, resv->start,
+ resv->start + resv->length - 1);
+
+ return 0;
+}
+
static long vfio_iommu_type1_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
{
@@ -986,8 +1022,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
}
} else if (cmd == VFIO_IOMMU_GET_INFO) {
struct vfio_iommu_type1_info info;
+ struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+ int ret;

- minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+ minsz = offsetofend(struct vfio_iommu_type1_info, cap_offset);

if (copy_from_user(&info, (void __user *)arg, minsz))
return -EFAULT;
@@ -999,6 +1037,29 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,

info.iova_pgsizes = vfio_pgsize_bitmap(iommu);

+ ret = build_resv_iova_range_caps(iommu, &caps);
+ if (ret)
+ return ret;
+
+ if (caps.size) {
+ info.flags |= VFIO_IOMMU_INFO_CAPS;
+ if (info.argsz < sizeof(info) + caps.size) {
+ info.argsz = sizeof(info) + caps.size;
+ info.cap_offset = 0;
+ } else {
+ vfio_info_cap_shift(&caps, sizeof(info));
+ if (copy_to_user((void __user *)arg +
+ sizeof(info), caps.buf,
+ caps.size)) {
+ kfree(caps.buf);
+ return -EFAULT;
+ }
+ info.cap_offset = sizeof(info);
+ }
+
+ kfree(caps.buf);
+ }
+
return copy_to_user((void __user *)arg, &info, minsz) ?
-EFAULT : 0;

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 255a211..ae96f2c 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -488,7 +488,21 @@ struct vfio_iommu_type1_info {
__u32 argsz;
__u32 flags;
#define VFIO_IOMMU_INFO_PGSIZES (1 << 0) /* supported page sizes info */
- __u64 iova_pgsizes; /* Bitmap of supported page sizes */
+#define VFIO_IOMMU_INFO_CAPS (1 << 1) /* Info supports caps */
+ __u64 iova_pgsizes; /* Bitmap of supported page sizes */
+ __u32 cap_offset; /* Offset within info struct of first cap */
+ __u32 __resv;
+};
+
+/*
+ * The RESV_IOVA_RANGE capability allows to report the reserved IOVA range(s),
+ * if any.
+ */
+#define VFIO_IOMMU_TYPE1_INFO_CAP_RESV_IOVA_RANGE 1
+struct vfio_iommu_type1_info_cap_resv_iova_range {
+ struct vfio_info_cap_header header;
+ __u64 start;
+ __u64 end;
};

#define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
--
1.9.1

2016-11-03 21:39:54

by Eric Auger

Subject: [RFC 1/8] vfio: fix vfio_info_cap_add/shift

The capability header's next field is an offset relative to the start of
the INFO buffer. tmp->next is assigned the proper value, but the iterations
implemented in vfio_info_cap_add and vfio_info_cap_shift use next
as an offset between the headers. When dealing with multiple capabilities
this leads to an Oops.
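
For example (for illustration only): with an info struct of S bytes followed
by two capabilities of C bytes each, the first capability sits at offset S in
the user buffer and its next field must be S + C (the absolute offset of the
second capability within the buffer), not C (the distance between the two
headers).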

Signed-off-by: Eric Auger <[email protected]>
---
drivers/vfio/vfio.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index d1d70e0..1e838d1 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1763,7 +1763,7 @@ struct vfio_info_cap_header *vfio_info_cap_add(struct vfio_info_cap *caps,
header->version = version;

/* Add to the end of the capability chain */
- for (tmp = caps->buf; tmp->next; tmp = (void *)tmp + tmp->next)
+ for (tmp = buf; tmp->next; tmp = buf + tmp->next)
; /* nothing */

tmp->next = caps->size;
@@ -1776,8 +1776,9 @@ struct vfio_info_cap_header *vfio_info_cap_add(struct vfio_info_cap *caps,
void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
{
struct vfio_info_cap_header *tmp;
+ void *buf = (void *)caps->buf;

- for (tmp = caps->buf; tmp->next; tmp = (void *)tmp + tmp->next - offset)
+ for (tmp = buf; tmp->next; tmp = buf + tmp->next - offset)
tmp->next += offset;
}
EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
--
1.9.1

2016-11-03 21:40:31

by Eric Auger

Subject: [RFC 8/8] iommu/arm-smmu: implement add_reserved_regions callback

The function populates the list of reserved regions with the
PCI host bridge windows and the MSI IOVA range.

At the moment an arbitrary 1MB MSI IOVA window is set at 0x8000000.

Signed-off-by: Eric Auger <[email protected]>
---
drivers/iommu/arm-smmu.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 63 insertions(+)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index c841eb7..9d1db19 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1533,6 +1533,68 @@ static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
return iommu_fwspec_add_ids(dev, &fwid, 1);
}

+static int add_pci_window_reserved_regions(struct iommu_domain *domain,
+ struct pci_dev *dev)
+{
+ struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
+ struct iommu_reserved_region *region;
+ struct resource_entry *window;
+ phys_addr_t start;
+ size_t length;
+
+ resource_list_for_each_entry(window, &bridge->windows) {
+ if (resource_type(window->res) != IORESOURCE_MEM &&
+ resource_type(window->res) != IORESOURCE_IO)
+ continue;
+
+ start = window->res->start - window->offset;
+ length = window->res->end - window->res->start + 1;
+
+ iommu_reserved_region_for_each(region, domain) {
+ if (region->start == start && region->length == length)
+ continue;
+ }
+ region = kzalloc(sizeof(*region), GFP_KERNEL);
+ if (!region)
+ return -ENOMEM;
+
+ region->start = start;
+ region->length = length;
+
+ list_add_tail(&region->list, &domain->reserved_regions);
+ }
+ return 0;
+}
+
+static int arm_smmu_add_reserved_regions(struct iommu_domain *domain,
+ struct device *device)
+{
+ struct iommu_reserved_region *region;
+ int ret = 0;
+
+ /* An arbitrary 1MB region starting at 0x8000000 is reserved for MSIs */
+ if (!domain->iova_cookie) {
+
+ region = kzalloc(sizeof(*region), GFP_KERNEL);
+ if (!region)
+ return -ENOMEM;
+
+ region->start = 0x8000000;
+ region->length = 0x100000;
+ list_add_tail(&region->list, &domain->reserved_regions);
+
+ ret = iommu_get_dma_msi_region_cookie(domain,
+ region->start, region->length);
+ if (ret)
+ return ret;
+ }
+
+ if (dev_is_pci(device))
+ ret = add_pci_window_reserved_regions(domain,
+ to_pci_dev(device));
+ return ret;
+}
+
static struct iommu_ops arm_smmu_ops = {
.capable = arm_smmu_capable,
.domain_alloc = arm_smmu_domain_alloc,
@@ -1548,6 +1610,7 @@ static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
.domain_get_attr = arm_smmu_domain_get_attr,
.domain_set_attr = arm_smmu_domain_set_attr,
.of_xlate = arm_smmu_of_xlate,
+ .add_reserved_regions = arm_smmu_add_reserved_regions,
.pgsize_bitmap = -1UL, /* Restricted during device attach */
};

--
1.9.1

2016-11-03 21:40:28

by Eric Auger

Subject: [RFC 7/8] iommu/vt-d: Implement add_reserved_regions callback

Implement the add_reserved_regions callback by registering
the [FEE0_0000h - FEF0_0000h] 1MB range as a reserved region
(MSI address space).

Signed-off-by: Eric Auger <[email protected]>
---
drivers/iommu/intel-iommu.c | 48 +++++++++++++++++++++++++++++++++------------
1 file changed, 35 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index a4407ea..fb64418 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5173,6 +5173,27 @@ static void intel_iommu_remove_device(struct device *dev)
iommu_device_unlink(iommu->iommu_dev, dev);
}

+static int intel_add_reserved_regions(struct iommu_domain *domain,
+ struct device *device)
+{
+ struct iommu_reserved_region *region;
+ const dma_addr_t msi_base = 0xfee00000;
+ const size_t msi_length = 0x00100000;
+
+ iommu_reserved_region_for_each(region, domain) {
+ if (region->start == msi_base && region->length == msi_length)
+ continue;
+ }
+ region = kzalloc(sizeof(*region), GFP_KERNEL);
+ if (!region)
+ return -ENOMEM;
+
+ region->start = msi_base;
+ region->length = msi_length;
+ list_add_tail(&region->list, &domain->reserved_regions);
+ return 0;
+}
+
#ifdef CONFIG_INTEL_IOMMU_SVM
int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct intel_svm_dev *sdev)
{
@@ -5282,19 +5303,20 @@ struct intel_iommu *intel_svm_device_to_iommu(struct device *dev)
#endif /* CONFIG_INTEL_IOMMU_SVM */

static const struct iommu_ops intel_iommu_ops = {
- .capable = intel_iommu_capable,
- .domain_alloc = intel_iommu_domain_alloc,
- .domain_free = intel_iommu_domain_free,
- .attach_dev = intel_iommu_attach_device,
- .detach_dev = intel_iommu_detach_device,
- .map = intel_iommu_map,
- .unmap = intel_iommu_unmap,
- .map_sg = default_iommu_map_sg,
- .iova_to_phys = intel_iommu_iova_to_phys,
- .add_device = intel_iommu_add_device,
- .remove_device = intel_iommu_remove_device,
- .device_group = pci_device_group,
- .pgsize_bitmap = INTEL_IOMMU_PGSIZES,
+ .capable = intel_iommu_capable,
+ .domain_alloc = intel_iommu_domain_alloc,
+ .domain_free = intel_iommu_domain_free,
+ .attach_dev = intel_iommu_attach_device,
+ .detach_dev = intel_iommu_detach_device,
+ .map = intel_iommu_map,
+ .unmap = intel_iommu_unmap,
+ .map_sg = default_iommu_map_sg,
+ .iova_to_phys = intel_iommu_iova_to_phys,
+ .add_device = intel_iommu_add_device,
+ .remove_device = intel_iommu_remove_device,
+ .add_reserved_regions = intel_add_reserved_regions,
+ .device_group = pci_device_group,
+ .pgsize_bitmap = INTEL_IOMMU_PGSIZES,
};

static void quirk_iommu_g4x_gfx(struct pci_dev *dev)
--
1.9.1

2016-11-03 21:40:26

by Eric Auger

Subject: [RFC 6/8] iommu: Handle the list of reserved regions

A new callback is introduced, add_reserved_regions. This function
aims at populating the iommu domain reserved region list with
regions associated with the device. The function is called on
device attachment. The list is freed on iommu_domain_free().

Signed-off-by: Eric Auger <[email protected]>
---
drivers/iommu/iommu.c | 23 +++++++++++++++++++++++
include/linux/iommu.h | 6 ++++++
2 files changed, 29 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0af07492..300af44 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1077,6 +1077,10 @@ struct iommu_domain *iommu_domain_alloc(struct bus_type *bus)

void iommu_domain_free(struct iommu_domain *domain)
{
+ struct iommu_reserved_region *entry, *next;
+
+ list_for_each_entry_safe(entry, next, &domain->reserved_regions, list)
+ kfree(entry);
domain->ops->domain_free(domain);
}
EXPORT_SYMBOL_GPL(iommu_domain_free);
@@ -1089,6 +1093,15 @@ static int __iommu_attach_device(struct iommu_domain *domain,
return -ENODEV;

ret = domain->ops->attach_dev(domain, dev);
+ if (ret)
+ return ret;
+
+ if (domain->type == IOMMU_DOMAIN_UNMANAGED) {
+ mutex_lock(&domain->resv_mutex);
+ ret = iommu_add_reserved_regions(domain, dev);
+ mutex_unlock(&domain->resv_mutex);
+ }
+
if (!ret)
trace_attach_device_to_domain(dev);
return ret;
@@ -1548,6 +1561,16 @@ int iommu_domain_set_attr(struct iommu_domain *domain,
}
EXPORT_SYMBOL_GPL(iommu_domain_set_attr);

+int iommu_add_reserved_regions(struct iommu_domain *domain, struct device *dev)
+{
+ const struct iommu_ops *ops = dev->bus->iommu_ops;
+
+ if (ops && ops->add_reserved_regions)
+ return ops->add_reserved_regions(domain, dev);
+ else
+ return 0;
+}
+
void iommu_get_dm_regions(struct device *dev, struct list_head *list)
{
const struct iommu_ops *ops = dev->bus->iommu_ops;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 0f2eb64..7748a1b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -170,6 +170,7 @@ struct iommu_reserved_region {
* @get_dm_regions: Request list of direct mapping requirements for a device
* @put_dm_regions: Free list of direct mapping requirements for a device
* @apply_dm_region: Temporary helper call-back for iova reserved ranges
+ * @add_reserved_regions: Populate reserved regions for a device
* @domain_window_enable: Configure and enable a particular window for a domain
* @domain_window_disable: Disable a particular window for a domain
* @domain_set_windows: Set the number of windows for a domain
@@ -207,6 +208,10 @@ struct iommu_ops {
void (*apply_dm_region)(struct device *dev, struct iommu_domain *domain,
struct iommu_dm_region *region);

+ /* Add reserved regions for a device */
+ int (*add_reserved_regions)(struct iommu_domain *domain,
+ struct device *device);
+
/* Window handling functions */
int (*domain_window_enable)(struct iommu_domain *domain, u32 wnd_nr,
phys_addr_t paddr, u64 size, int prot);
@@ -289,6 +294,7 @@ struct device *iommu_device_create(struct device *parent, void *drvdata,
void iommu_device_destroy(struct device *dev);
int iommu_device_link(struct device *dev, struct device *link);
void iommu_device_unlink(struct device *dev, struct device *link);
+int iommu_add_reserved_regions(struct iommu_domain *domain, struct device *dev);

/* Window handling function prototypes */
extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
--
1.9.1

2016-11-04 04:02:07

by Alex Williamson

Subject: Re: [RFC 0/8] KVM PCIe/MSI passthrough on ARM/ARM64 (Alt II)

On Thu, 3 Nov 2016 21:39:30 +0000
Eric Auger <[email protected]> wrote:

> Following Will & Robin's suggestions, this series attempts to propose
> an alternative to [1] where the host would arbitrarily decide the
> location of the IOVA MSI window and would be able to report to the
> userspace the list of reserved IOVA regions that cannot be used
> along with VFIO_IOMMU_MAP_DMA. This would allow the userspace to react
> in case of conflict.
>
> Userspace can retrieve all the reserved regions through the VFIO_IOMMU_GET_INFO
> IOCTL by querying the new RESV_IOVA_RANGE chained capability. Each reserved
> IOVA range is put in a separate capability.

Doesn't it make more sense to describe the non-holes (i.e. what I
can use for DMA) rather than the holes (what I can't use for DMA)? For
example on VT-d, the IOMMU not only has the block of MSI addresses
handled through interrupt remapping, but it also has a maximum address
width. Rather than describing the reserved space we could describe the
usable DMA ranges above and below that reserved block.

Anyway, there's also a pretty harsh problem that I came up with in
talking to Will. If the platform describes a fixed IOVA range as
reserved, that's great for the use case when a VM is instantiated with
a device attached, but it seems like it nearly excludes the case of
hotplugging a device. We can't dynamically decide that a set of RAM
pages in the VM cannot be used as a DMA target. Does the user need to
create the VM with a predefined hole that lines up with the reserved
regions for this platform? How do they know the reserved regions for
this platform? How would we handle migration where an assigned device
hot-add might not occur until after we've migrated to a slightly
different platform from the one we started on, that might have
different reserved memory requirements?

We can always have QEMU reject hot-adding the device if the reserved
region overlaps existing guest RAM, but I don't even really see how we
advise users to give them a reasonable chance of avoiding that
possibility. Apparently there are also ARM platforms where MSI pages
cannot be remapped to support the previous programmable user/VM
address, is it even worthwhile to support those platforms? Does that
decision influence whether user programmable MSI reserved regions are
really a second class citizen to fixed reserved regions? I expect
we'll be talking about this tomorrow morning, but I certainly haven't
come up with any viable solutions to this. Thanks,

Alex

> At IOMMU level, the reserved regions are stored in an iommu_domain list
> which is populated on each device attachment. An IOMMU add_reserved_regions
> callback specializes the registration of the reserved regions.
>
> On x86, the [FEE0_0000h - FEF0_000h] MSI window is registered (NOT tested).
>
> On ARM, the PCI host bridge windows (ACS check to be added?) + the MSI IOVA
> reserved regions are populated by the arm-smmu driver. Currently the MSI
> IOVA region is arbitrarily located at 0x8000000 and 1MB sized. An IOVA domain
> is created in add_reserved_regions callback. Then MSIs are transparently
> mapped using this IOVA domain.
>
> This series currently does not address some features addressed in [1]:
> - MSI IOVA size requirement computation
> - IRQ safety assessment
>
> This RFC was just tested on ARM Overdrive with QEMU and is sent to help
> potential discussions at LPC. Additionnal development + testing is needed.
>
> 2 tentative fixes may be submitted separately:
> - vfio: fix vfio_info_cap_add/shift
> - iommu/iova: fix __alloc_and_insert_iova_range
>
> Best Regards
>
> Eric
>
> [1] [PATCH v14 00/16] KVM PCIe/MSI passthrough on ARM/ARM64
> https://lkml.org/lkml/2016/10/12/347
>
> Git: complete series available at
> https://github.com/eauger/linux/tree/v4.9-rc3-reserved-rfc
>
>
> Eric Auger (7):
> vfio: fix vfio_info_cap_add/shift
> iommu/iova: fix __alloc_and_insert_iova_range
> iommu: Add a list of iommu_reserved_region in iommu_domain
> vfio/type1: Introduce RESV_IOVA_RANGE capability
> iommu: Handle the list of reserved regions
> iommu/vt-d: Implement add_reserved_regions callback
> iommu/arm-smmu: implement add_reserved_regions callback
>
> Robin Murphy (1):
> iommu/dma: Allow MSI-only cookies
>
> drivers/iommu/arm-smmu.c | 63 +++++++++++++++++++++++++++++++++++++++++
> drivers/iommu/dma-iommu.c | 39 +++++++++++++++++++++++++
> drivers/iommu/intel-iommu.c | 48 ++++++++++++++++++++++---------
> drivers/iommu/iommu.c | 25 ++++++++++++++++
> drivers/iommu/iova.c | 2 +-
> drivers/vfio/vfio.c | 5 ++--
> drivers/vfio/vfio_iommu_type1.c | 63 ++++++++++++++++++++++++++++++++++++++++-
> include/linux/dma-iommu.h | 9 ++++++
> include/linux/iommu.h | 23 +++++++++++++++
> include/uapi/linux/vfio.h | 16 ++++++++++-
> 10 files changed, 275 insertions(+), 18 deletions(-)
>

2016-11-08 02:45:59

by Will Deacon

Subject: Summary of LPC guest MSI discussion in Santa Fe (was: Re: [RFC 0/8] KVM PCIe/MSI passthrough on ARM/ARM64 (Alt II))

Hi all,

I figured this was a reasonable post to piggy-back on for the LPC minutes
relating to guest MSIs on arm64.

On Thu, Nov 03, 2016 at 10:02:05PM -0600, Alex Williamson wrote:
> We can always have QEMU reject hot-adding the device if the reserved
> region overlaps existing guest RAM, but I don't even really see how we
> advise users to give them a reasonable chance of avoiding that
> possibility. Apparently there are also ARM platforms where MSI pages
> cannot be remapped to support the previous programmable user/VM
> address, is it even worthwhile to support those platforms? Does that
> decision influence whether user programmable MSI reserved regions are
> really a second class citizen to fixed reserved regions? I expect
> we'll be talking about this tomorrow morning, but I certainly haven't
> come up with any viable solutions to this. Thanks,

At LPC last week, we discussed guest MSIs on arm64 as part of the PCI
microconference. I presented some slides to illustrate some of the issues
we're trying to solve:

http://www.willdeacon.ukfsn.org/bitbucket/lpc-16/msi-in-guest-arm64.pdf

Punit took some notes (thanks!) on the etherpad here:

https://etherpad.openstack.org/p/LPC2016_PCI

although the discussion was pretty lively and jumped about, so I've had
to go from memory where the notes didn't capture everything that was
said.

To summarise, arm64 platforms differ in their handling of MSIs when compared
to x86:

1. The physical memory map is not standardised (Jon pointed out that
this is something that was realised late on)
2. MSIs are usually treated the same as DMA writes, in that they must be
mapped by the SMMU page tables so that they target a physical MSI
doorbell
3. On some platforms, MSIs bypass the SMMU entirely (e.g. due to an MSI
doorbell built into the PCI RC)
4. Platforms typically have some set of addresses that abort before
reaching the SMMU (e.g. because the PCI identifies them as P2P).

All of this means that userspace (QEMU) needs to identify the memory
regions corresponding to points (3) and (4) and ensure that they are
not allocated in the guest physical (IPA) space. For platforms that can
remap the MSI doorbell as in (2), then some space also needs to be
allocated for that.

Rather than treat these as separate problems, a better interface is to
tell userspace about a set of reserved regions, and have this include
the MSI doorbell, irrespective of whether or not it can be remapped.
Don suggested that we statically pick an address for the doorbell in a
similar way to x86, and have the kernel map it there. We could even pick
0xfee00000. If it conflicts with a reserved region on the platform (due
to (4)), then we'd obviously have to (deterministically?) allocate it
somewhere else, but probably within the bottom 4G.

The next question is how to tell userspace about all of the reserved
regions. Initially, the idea was to extend VFIO, however Alex pointed
out a horrible scenario:

1. QEMU spawns a VM on system 0
2. VM is migrated to system 1
3. QEMU attempts to passthrough a device using PCI hotplug

In this scenario, the guest memory map is chosen at step (1), yet there
is no VFIO fd available to determine the reserved regions. Furthermore,
the reserved regions may vary between system 0 and system 1. This pretty
much rules out using VFIO to determine the reserved regions. Alex suggested
that the SMMU driver can advertise the regions via /sys/class/iommu/. This
would solve part of the problem, but migration between systems with
different memory maps can still cause problems if the reserved regions
of the new system conflict with the guest memory map chosen by QEMU.
Jon pointed out that most people are pretty conservative about hardware
choices when migrating between them -- that is, they may only migrate
between different revisions of the same SoC, or they know ahead of time
all of the memory maps they want to support and this could be communicated
by way of configuration to libvirt. It would be up to QEMU to fail the
hotplug if it detected a conflict. Alex asked if there was a security
issue with DMA bypassing the SMMU, but there aren't currently any systems
where that is known to happen. Such a system would surely not be safe for
passthrough.

Ben mused that a way to handle conflicts dynamically might be to hotplug
on the entire host bridge in the guest, passing firmware tables describing
the new reserved regions as a property of the host bridge. Whilst this
may well solve the issue, it was largely considered future work due to
its invasive nature and dependency on firmware tables (and guest support)
that do not currently exist.

Will

2016-11-08 14:27:33

by Eric Auger

Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

Hi Will,

On 08/11/2016 03:45, Will Deacon wrote:
> Hi all,
>
> I figured this was a reasonable post to piggy-back on for the LPC minutes
> relating to guest MSIs on arm64.
>
> On Thu, Nov 03, 2016 at 10:02:05PM -0600, Alex Williamson wrote:
>> We can always have QEMU reject hot-adding the device if the reserved
>> region overlaps existing guest RAM, but I don't even really see how we
>> advise users to give them a reasonable chance of avoiding that
>> possibility. Apparently there are also ARM platforms where MSI pages
>> cannot be remapped to support the previous programmable user/VM
>> address, is it even worthwhile to support those platforms? Does that
>> decision influence whether user programmable MSI reserved regions are
>> really a second class citizen to fixed reserved regions? I expect
>> we'll be talking about this tomorrow morning, but I certainly haven't
>> come up with any viable solutions to this. Thanks,
>
> At LPC last week, we discussed guest MSIs on arm64 as part of the PCI
> microconference. I presented some slides to illustrate some of the issues
> we're trying to solve:
>
> http://www.willdeacon.ukfsn.org/bitbucket/lpc-16/msi-in-guest-arm64.pdf
>
> Punit took some notes (thanks!) on the etherpad here:
>
> https://etherpad.openstack.org/p/LPC2016_PCI

Thanks to both of you for the minutes and slides. Unfortunately I could
not travel but my ears were burning ;-)
>
> although the discussion was pretty lively and jumped about, so I've had
> to go from memory where the notes didn't capture everything that was
> said.
>
> To summarise, arm64 platforms differ in their handling of MSIs when compared
> to x86:
>
> 1. The physical memory map is not standardised (Jon pointed out that
> this is something that was realised late on)
> 2. MSIs are usually treated the same as DMA writes, in that they must be
> mapped by the SMMU page tables so that they target a physical MSI
> doorbell
> 3. On some platforms, MSIs bypass the SMMU entirely (e.g. due to an MSI
> doorbell built into the PCI RC)
> 4. Platforms typically have some set of addresses that abort before
> reaching the SMMU (e.g. because the PCI identifies them as P2P).
>
> All of this means that userspace (QEMU) needs to identify the memory
> regions corresponding to points (3) and (4) and ensure that they are
> not allocated in the guest physical (IPA) space. For platforms that can
> remap the MSI doorbell as in (2), then some space also needs to be
> allocated for that.
>
> Rather than treat these as separate problems, a better interface is to
> tell userspace about a set of reserved regions, and have this include
> the MSI doorbell, irrespective of whether or not it can be remapped.
> Don suggested that we statically pick an address for the doorbell in a
> similar way to x86, and have the kernel map it there. We could even pick
> 0xfee00000. If it conflicts with a reserved region on the platform (due
> to (4)), then we'd obviously have to (deterministically?) allocate it
> somewhere else, but probably within the bottom 4G.

This is tentatively achieved now with
[1] [RFC v2 0/8] KVM PCIe/MSI passthrough on ARM/ARM64 - Alt II
(http://www.mail-archive.com/[email protected]/msg1264506.html)
>
> The next question is how to tell userspace about all of the reserved
> regions. Initially, the idea was to extend VFIO, however Alex pointed
> out a horrible scenario:
>
> 1. QEMU spawns a VM on system 0
> 2. VM is migrated to system 1
> 3. QEMU attempts to passthrough a device using PCI hotplug
>
> In this scenario, the guest memory map is chosen at step (1), yet there
> is no VFIO fd available to determine the reserved regions. Furthermore,
> the reserved regions may vary between system 0 and system 1. This pretty
> much rules out using VFIO to determine the reserved regions. Alex suggested
> that the SMMU driver can advertise the regions via /sys/class/iommu/. This
> would solve part of the problem, but migration between systems with
> different memory maps can still cause problems if the reserved regions
> of the new system conflict with the guest memory map chosen by QEMU.


OK, so I understand we no longer want the VFIO capability chain API
(patch 5 of the above series) but prefer a sysfs approach instead.

I understand the sysfs approach, which allows userspace to get the
info earlier and independently of VFIO. Keep in mind that current QEMU
virt - which is not the only userspace - will not do much with this info
until we bring upheavals into virt address space management. So if I am
not wrong, at the moment the main action to be undertaken is the
rejection of the PCI hotplug in case we detect a collision.

I can respin [1]:
- studying and taking into account Robin's comments about the
  similarities with dm_regions
- removing the VFIO capability chain and replacing it with a sysfs API

Would that be OK?

What about Alex's comment that we should report the usable memory ranges
instead of the unusable ones?

Also did you have a chance to discuss the following items:
1) the VFIO irq safety assessment
2) the MSI reserved size computation (is an arbitrary size OK?)

Thanks

Eric

> Jon pointed out that most people are pretty conservative about hardware
> choices when migrating between them -- that is, they may only migrate
> between different revisions of the same SoC, or they know ahead of time
> all of the memory maps they want to support and this could be communicated
> by way of configuration to libvirt. It would be up to QEMU to fail the
> hotplug if it detected a conflict. Alex asked if there was a security
> issue with DMA bypassing the SMMU, but there aren't currently any systems
> where that is known to happen. Such a system would surely not be safe for
> passthrough.
>
> Ben mused that a way to handle conflicts dynamically might be to hotplug
> on the entire host bridge in the guest, passing firmware tables describing
> the new reserved regions as a property of the host bridge. Whilst this
> may well solve the issue, it was largely considered future work due to
> its invasive nature and dependency on firmware tables (and guest support)
> that do not currently exist.
>
> Will
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>

2016-11-08 16:02:24

by Donald Dutile

Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On 11/07/2016 09:45 PM, Will Deacon wrote:
> Hi all,
>
> I figured this was a reasonable post to piggy-back on for the LPC minutes
> relating to guest MSIs on arm64.
>
> On Thu, Nov 03, 2016 at 10:02:05PM -0600, Alex Williamson wrote:
>> We can always have QEMU reject hot-adding the device if the reserved
>> region overlaps existing guest RAM, but I don't even really see how we
>> advise users to give them a reasonable chance of avoiding that
>> possibility. Apparently there are also ARM platforms where MSI pages
>> cannot be remapped to support the previous programmable user/VM
>> address, is it even worthwhile to support those platforms? Does that
>> decision influence whether user programmable MSI reserved regions are
>> really a second class citizen to fixed reserved regions? I expect
>> we'll be talking about this tomorrow morning, but I certainly haven't
>> come up with any viable solutions to this. Thanks,
>
> At LPC last week, we discussed guest MSIs on arm64 as part of the PCI
> microconference. I presented some slides to illustrate some of the issues
> we're trying to solve:
>
> http://www.willdeacon.ukfsn.org/bitbucket/lpc-16/msi-in-guest-arm64.pdf
>
> Punit took some notes (thanks!) on the etherpad here:
>
> https://etherpad.openstack.org/p/LPC2016_PCI
>
> although the discussion was pretty lively and jumped about, so I've had
> to go from memory where the notes didn't capture everything that was
> said.
>
> To summarise, arm64 platforms differ in their handling of MSIs when compared
> to x86:
>
> 1. The physical memory map is not standardised (Jon pointed out that
> this is something that was realised late on)
> 2. MSIs are usually treated the same as DMA writes, in that they must be
> mapped by the SMMU page tables so that they target a physical MSI
> doorbell
> 3. On some platforms, MSIs bypass the SMMU entirely (e.g. due to an MSI
> doorbell built into the PCI RC)
Chalk this case up to 'the learning curve'.
The Q35 chipset (the one being used for the x86-PCIe qemu model) had no
intr-remap hw, only DMA addrs destined for real memory. Assigned-device intrs
had to be caught by kvm & injected into guests, and yes, a DoS was possible...
and thus the intr-remap support was done after the initial iommu support.


> 4. Platforms typically have some set of addresses that abort before
> reaching the SMMU (e.g. because the PCI identifies them as P2P).
ARM platforms that don't implement the equivalent of ACS (in PCI bridges within
a PCIe switch) are either not device-assignment capable, or the IOMMU domain
expands across the entire peer-to-peer (sub-)tree.
ACS(-like) functionality is a fundamental component of the security model,
as is the IOMMU itself. Without it, it's equivalent to not having an IOMMU.

Dare I ask?: Can these systems, or parts of these systems, just be deemed
"incomplete" or "not 100% secure" wrt device assignment, and other systems
can or will be ???
Not much different than the first x86 systems that tried to get it right
the first few times... :-/
I'm hearing (parts of) systems that are just not properly designed
for the device-assignment use-case, probably b/c this (system) architecture
hasn't been pulled together from the various component architectures
(CPU, SMMU, IOT, etc.).

>
> All of this means that userspace (QEMU) needs to identify the memory
> regions corresponding to points (3) and (4) and ensure that they are
> not allocated in the guest physical (IPA) space. For platforms that can
> remap the MSI doorbell as in (2), then some space also needs to be
> allocated for that.
>
Again, proper ACS features/control eliminates this need.
A (multi-function) device should never be able to perform
IO to itself via its PCIe interface. Bridge-ACS pushes everything
up to the SMMU for destination resolution.
Without ACS, I don't see how a guest is migratable from one system to
another, unless the system-migration-group consists of systems that
are exactly the same (wrt IO) [less/more system memory &/or cpu does
not affect the VM system map].

Again, the initial Linux implementation did not have ACS, but this was
'resolved' by the default/common system mapping putting the PCI devices
into an area that was blocked from memory use (generally 3G->4G space).
ARM may not have that single, simple implementation, but a method to
indicate reserved regions, and then a check for matching/non-matching reserved
regions for guest migration, is the only way I see to resolve this issue
until ACS is sufficiently supported in the hw subsystems to be used
for device-assignment.

> Rather than treat these as separate problems, a better interface is to
> tell userspace about a set of reserved regions, and have this include
> the MSI doorbell, irrespective of whether or not it can be remapped.
> Don suggested that we statically pick an address for the doorbell in a
> similar way to x86, and have the kernel map it there. We could even pick
> 0xfee00000.
I suggest picking a 'relative-fixed' address: the last n-pages of system memory
address space, i.e.,
0xfff[....]fee00000. ... sign-extended 0xfee00000.
That way, no matter how large memory is, there is no hole, it's just the
last 2M of memory ... every system has an end of memory. :-p

> If it conflicts with a reserved region on the platform (due
>to (4)), then we'd obviously have to (deterministically?) allocate it
> somewhere else, but probably within the bottom 4G.
>
Why? It's more likely for a hw platform to use this <4G address range
for device mmio than the upper-most address space for device space. Even if
platforms use the upper-most address space now, it's not rocket science to
subtract the upper 2M from existing use, then allocate device space from
there (downward).
... and if you pull a 'there are systems that have hard-wired addresses
in upper-most 2M', then we can look at if we can quirk these systems, or
just not support them for this use-case.

> The next question is how to tell userspace about all of the reserved
> regions. Initially, the idea was to extend VFIO, however Alex pointed
Won't need to, if upper memory space is passed; take upper 2M and done. ;-)
For now, you'll need a whole new qemu paradigm, for a (hopefully) short-term
problem. I suggest coming up with a short-term, default 'safe' place for
devices & memory to avoid the qemu disruption.

> out a horrible scenario:
>
> 1. QEMU spawns a VM on system 0
> 2. VM is migrated to system 1
> 3. QEMU attempts to passthrough a device using PCI hotplug
>
> In this scenario, the guest memory map is chosen at step (1), yet there
> is no VFIO fd available to determine the reserved regions. Furthermore,
> the reserved regions may vary between system 0 and system 1. This pretty
> much rules out using VFIO to determine the reserved regions. Alex suggested
> that the SMMU driver can advertise the regions via /sys/class/iommu/. This
> would solve part of the problem, but migration between systems with
> different memory maps can still cause problems if the reserved regions
> of the new system conflict with the guest memory map chosen by QEMU.
> Jon pointed out that most people are pretty conservative about hardware
> choices when migrating between them -- that is, they may only migrate
> between different revisions of the same SoC, or they know ahead of time
> all of the memory maps they want to support and this could be communicated
> by way of configuration to libvirt. It would be up to QEMU to fail the
> hotplug if it detected a conflict. Alex asked if there was a security
> issue with DMA bypassing the SMMU, but there aren't currently any systems
> where that is known to happen. Such a system would surely not be safe for
> passthrough.
>
> Ben mused that a way to handle conflicts dynamically might be to hotplug
> on the entire host bridge in the guest, passing firmware tables describing
> the new reserved regions as a property of the host bridge. Whilst this
> may well solve the issue, it was largely considered future work due to
> its invasive nature and dependency on firmware tables (and guest support)
> that do not currently exist.
>
> Will
>

2016-11-08 17:55:53

by Will Deacon

Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Tue, Nov 08, 2016 at 03:27:23PM +0100, Auger Eric wrote:
> On 08/11/2016 03:45, Will Deacon wrote:
> > Rather than treat these as separate problems, a better interface is to
> > tell userspace about a set of reserved regions, and have this include
> > the MSI doorbell, irrespective of whether or not it can be remapped.
> > Don suggested that we statically pick an address for the doorbell in a
> > similar way to x86, and have the kernel map it there. We could even pick
> > 0xfee00000. If it conflicts with a reserved region on the platform (due
> > to (4)), then we'd obviously have to (deterministically?) allocate it
> > somewhere else, but probably within the bottom 4G.
>
> This is tentatively achieved now with
> [1] [RFC v2 0/8] KVM PCIe/MSI passthrough on ARM/ARM64 - Alt II
> (http://www.mail-archive.com/[email protected]/msg1264506.html)

Yup, I saw that fly by. Hopefully some of the internals can be reused
with the current thinking on user ABI.

> > The next question is how to tell userspace about all of the reserved
> > regions. Initially, the idea was to extend VFIO, however Alex pointed
> > out a horrible scenario:
> >
> > 1. QEMU spawns a VM on system 0
> > 2. VM is migrated to system 1
> > 3. QEMU attempts to passthrough a device using PCI hotplug
> >
> > In this scenario, the guest memory map is chosen at step (1), yet there
> > is no VFIO fd available to determine the reserved regions. Furthermore,
> > the reserved regions may vary between system 0 and system 1. This pretty
> > much rules out using VFIO to determine the reserved regions. Alex suggested
> > that the SMMU driver can advertise the regions via /sys/class/iommu/. This
> > would solve part of the problem, but migration between systems with
> > different memory maps can still cause problems if the reserved regions
> > of the new system conflict with the guest memory map chosen by QEMU.
>
>
> OK so I understand we do not want anymore the VFIO chain capability API
> (patch 5 of above series) but we prefer a sysfs approach instead.

Right.

> I understand the sysfs approach which allows the userspace to get the
> info earlier and independently on VFIO. Keeping in mind current QEMU
> virt - which is not the only userspace - will not do much from this info
> until we bring upheavals in virt address space management. So if I am
> not wrong, at the moment the main action to be undertaken is the
> rejection of the PCI hotplug in case we detect a collision.

I don't think so; it should be up to userspace to reject the hotplug.
If userspace doesn't have support for the regions, then that's fine --
you just end up in a situation where the CPU page table maps memory
somewhere that the device can't see. In other words, you'll end up with
spurious DMA failures, but that's exactly what happens with current systems
if you passthrough an overlapping region (Robin demonstrated this on Juno).

Additionally, you can imagine some future support where you can tell the
guest not to use certain regions of its memory for DMA. In this case, you
wouldn't want to refuse the hotplug in the case of overlapping regions.

Really, I think the kernel side just needs to enumerate the fixed reserved
regions, place the doorbell at a fixed address and then advertise these
via sysfs.

> I can respin [1]
> - studying and taking into account Robin's comments about dm_regions
> similarities
> - removing the VFIO capability chain and replacing this by a sysfs API

Ideally, this would be reusable between different SMMU drivers so the sysfs
entries have the same format etc.

> Would that be OK?

Sounds good to me. Are you in a position to prototype something on the qemu
side once we've got kernel-side agreement?

> What about Alex comments who wanted to report the usable memory ranges
> instead of unusable memory ranges?
>
> Also did you have a chance to discuss the following items:
> 1) the VFIO irq safety assessment

The discussion really focussed on system topology, as opposed to properties
of the doorbell. Regardless of how the device talks to the doorbell, if
the doorbell can't protect against things like MSI spoofing, then it's
unsafe. My opinion is that we shouldn't allow passthrough by default on
systems with unsafe doorbells (we could piggyback on allow_unsafe_interrupts
cmdline option to VFIO).

A first step would be making all this opt-in, and only supporting GICv3
ITS for now.

> 2) the MSI reserved size computation (is an arbitrary size OK?)

If we fix the base address, we could fix a size too. However, we'd still
need to enumerate the doorbells to check that they fit in the region we
have. If not, then we can warn during boot and treat it the same way as
a resource conflict (that is, reallocate the region in some deterministic
way).

Will

2016-11-08 19:02:47

by Donald Dutile

Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On 11/08/2016 12:54 PM, Will Deacon wrote:
> On Tue, Nov 08, 2016 at 03:27:23PM +0100, Auger Eric wrote:
>> On 08/11/2016 03:45, Will Deacon wrote:
>>> Rather than treat these as separate problems, a better interface is to
>>> tell userspace about a set of reserved regions, and have this include
>>> the MSI doorbell, irrespective of whether or not it can be remapped.
>>> Don suggested that we statically pick an address for the doorbell in a
>>> similar way to x86, and have the kernel map it there. We could even pick
>>> 0xfee00000. If it conflicts with a reserved region on the platform (due
>>> to (4)), then we'd obviously have to (deterministically?) allocate it
>>> somewhere else, but probably within the bottom 4G.
>> This is tentatively achieved now with
>> [1] [RFC v2 0/8] KVM PCIe/MSI passthrough on ARM/ARM64 - Alt II
>> (http://www.mail-archive.com/[email protected]/msg1264506.html)
> Yup, I saw that fly by. Hopefully some of the internals can be reused
> with the current thinking on user ABI.
>
>>> The next question is how to tell userspace about all of the reserved
>>> regions. Initially, the idea was to extend VFIO, however Alex pointed
>>> out a horrible scenario:
>>>
>>> 1. QEMU spawns a VM on system 0
>>> 2. VM is migrated to system 1
>>> 3. QEMU attempts to passthrough a device using PCI hotplug
>>>
>>> In this scenario, the guest memory map is chosen at step (1), yet there
>>> is no VFIO fd available to determine the reserved regions. Furthermore,
>>> the reserved regions may vary between system 0 and system 1. This pretty
>>> much rules out using VFIO to determine the reserved regions. Alex suggested
>>> that the SMMU driver can advertise the regions via /sys/class/iommu/. This
>>> would solve part of the problem, but migration between systems with
>>> different memory maps can still cause problems if the reserved regions
>>> of the new system conflict with the guest memory map chosen by QEMU.
>>
>> OK so I understand we do not want anymore the VFIO chain capability API
>> (patch 5 of above series) but we prefer a sysfs approach instead.
> Right.
>
>> I understand the sysfs approach which allows the userspace to get the
>> info earlier and independently on VFIO. Keeping in mind current QEMU
>> virt - which is not the only userspace - will not do much from this info
>> until we bring upheavals in virt address space management. So if I am
>> not wrong, at the moment the main action to be undertaken is the
>> rejection of the PCI hotplug in case we detect a collision.
> I don't think so; it should be up to userspace to reject the hotplug.
> If userspace doesn't have support for the regions, then that's fine --
> you just end up in a situation where the CPU page table maps memory
> somewhere that the device can't see. In other words, you'll end up with
> spurious DMA failures, but that's exactly what happens with current systems
> if you passthrough an overlapping region (Robin demonstrated this on Juno).
>
> Additionally, you can imagine some future support where you can tell the
> guest not to use certain regions of its memory for DMA. In this case, you
> wouldn't want to refuse the hotplug in the case of overlapping regions.
>
> Really, I think the kernel side just needs to enumerate the fixed reserved
> regions, place the doorbell at a fixed address and then advertise these
> via sysfs.
>
>> I can respin [1]
>> - studying and taking into account Robin's comments about dm_regions
>> similarities
>> - removing the VFIO capability chain and replacing this by a sysfs API
> Ideally, this would be reusable between different SMMU drivers so the sysfs
> entries have the same format etc.
>
>> Would that be OK?
> Sounds good to me. Are you in a position to prototype something on the qemu
> side once we've got kernel-side agreement?
>
>> What about Alex comments who wanted to report the usable memory ranges
>> instead of unusable memory ranges?
>>
>> Also did you have a chance to discuss the following items:
>> 1) the VFIO irq safety assessment
> The discussion really focussed on system topology, as opposed to properties
> of the doorbell. Regardless of how the device talks to the doorbell, if
> the doorbell can't protect against things like MSI spoofing, then it's
> unsafe. My opinion is that we shouldn't allow passthrough by default on
> systems with unsafe doorbells (we could piggyback on allow_unsafe_interrupts
> cmdline option to VFIO).
>
> A first step would be making all this opt-in, and only supporting GICv3
> ITS for now.
You're trying to support a config that is < GICv3 and no ITS ? ...
That would be the equiv. of x86 pre-intr-remap, and that's why allow_unsafe_interrupts
hook was created ... to enable devel/kick-the-tires.
>> 2) the MSI reserved size computation (is an arbitrary size OK?)
> If we fix the base address, we could fix a size too. However, we'd still
> need to enumerate the doorbells to check that they fit in the region we
> have. If not, then we can warn during boot and treat it the same way as
> a resource conflict (that is, reallocate the region in some deterministic
> way).
>
> Will

2016-11-08 19:10:23

by Will Deacon

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Tue, Nov 08, 2016 at 02:02:39PM -0500, Don Dutile wrote:
> On 11/08/2016 12:54 PM, Will Deacon wrote:
> >A first step would be making all this opt-in, and only supporting GICv3
> >ITS for now.
> You're trying to support a config that is < GICv3 and no ITS ? ...
> That would be the equiv. of x86 pre-intr-remap, and that's why allow_unsafe_interrupts
> hook was created ... to enable devel/kick-the-tires.

Yup, that's exactly what I was envisaging. For systems that can't do
passthrough safely, we'll always have to throw a devel switch.
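
For reference, the existing devel switch on x86 is VFIO's
allow_unsafe_interrupts module parameter, which is checked against
IOMMU_CAP_INTR_REMAP when a group is attached. Below is a simplified
sketch of that check (hypothetical helper name, not a verbatim copy of
vfio_iommu_type1.c); an ITS-only opt-in on ARM would presumably have to
feed the same capability:

#include <linux/iommu.h>
#include <linux/module.h>

/*
 * Sketch of the existing opt-in: refuse passthrough unless the IOMMU
 * claims interrupt-remapping-equivalent isolation, or the admin
 * explicitly overrides with allow_unsafe_interrupts=1.
 */
static bool allow_unsafe_interrupts;
module_param_named(allow_unsafe_interrupts,
		   allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR);

/* Hypothetical helper; the real check lives in the group attach path. */
static int vfio_check_irq_safety(struct bus_type *bus)
{
	if (!allow_unsafe_interrupts &&
	    !iommu_capable(bus, IOMMU_CAP_INTR_REMAP)) {
		pr_warn("No interrupt remapping support; set allow_unsafe_interrupts=1 to override\n");
		return -EPERM;
	}
	return 0;
}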

Will

2016-11-08 20:29:30

by Christoffer Dall

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe (was: Re: [RFC 0/8] KVM PCIe/MSI passthrough on ARM/ARM64 (Alt II))

Hi Will,

On Tue, Nov 08, 2016 at 02:45:59AM +0000, Will Deacon wrote:
> Hi all,
>
> I figured this was a reasonable post to piggy-back on for the LPC minutes
> relating to guest MSIs on arm64.
>
> On Thu, Nov 03, 2016 at 10:02:05PM -0600, Alex Williamson wrote:
> > We can always have QEMU reject hot-adding the device if the reserved
> > region overlaps existing guest RAM, but I don't even really see how we
> > advise users to give them a reasonable chance of avoiding that
> > possibility. Apparently there are also ARM platforms where MSI pages
> > cannot be remapped to support the previous programmable user/VM
> > address, is it even worthwhile to support those platforms? Does that
> > decision influence whether user programmable MSI reserved regions are
> > really a second class citizen to fixed reserved regions? I expect
> > we'll be talking about this tomorrow morning, but I certainly haven't
> > come up with any viable solutions to this. Thanks,
>
> At LPC last week, we discussed guest MSIs on arm64 as part of the PCI
> microconference. I presented some slides to illustrate some of the issues
> we're trying to solve:
>
> http://www.willdeacon.ukfsn.org/bitbucket/lpc-16/msi-in-guest-arm64.pdf
>
> Punit took some notes (thanks!) on the etherpad here:
>
> https://etherpad.openstack.org/p/LPC2016_PCI
>
> although the discussion was pretty lively and jumped about, so I've had
> to go from memory where the notes didn't capture everything that was
> said.
>
> To summarise, arm64 platforms differ in their handling of MSIs when compared
> to x86:
>
> 1. The physical memory map is not standardised (Jon pointed out that
> this is something that was realised late on)
> 2. MSIs are usually treated the same as DMA writes, in that they must be
> mapped by the SMMU page tables so that they target a physical MSI
> doorbell
> 3. On some platforms, MSIs bypass the SMMU entirely (e.g. due to an MSI
> doorbell built into the PCI RC)
> 4. Platforms typically have some set of addresses that abort before
> reaching the SMMU (e.g. because the PCI identifies them as P2P).
>
> All of this means that userspace (QEMU) needs to identify the memory
> regions corresponding to points (3) and (4) and ensure that they are
> not allocated in the guest physical (IPA) space. For platforms that can
> remap the MSI doorbell as in (2), then some space also needs to be
> allocated for that.
>
> Rather than treat these as separate problems, a better interface is to
> tell userspace about a set of reserved regions, and have this include
> the MSI doorbell, irrespective of whether or not it can be remapped.

Is my understanding correct, that you need to tell userspace about the
location of the doorbell (in the IOVA space) in case (2), because even
though the configuration of the device is handled by the (host) kernel
through trapping of the BARs, we have to avoid the VFIO user programming
the device to create other DMA transactions to this particular address,
since that will obviously conflict and either not produce the desired
DMA transactions or result in unintended weird interrupts?

Thanks,
Christoffer

2016-11-08 23:35:19

by Alex Williamson

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe (was: Re: [RFC 0/8] KVM PCIe/MSI passthrough on ARM/ARM64 (Alt II))

On Tue, 8 Nov 2016 21:29:22 +0100
Christoffer Dall <[email protected]> wrote:

> Hi Will,
>
> On Tue, Nov 08, 2016 at 02:45:59AM +0000, Will Deacon wrote:
> > Hi all,
> >
> > I figured this was a reasonable post to piggy-back on for the LPC minutes
> > relating to guest MSIs on arm64.
> >
> > On Thu, Nov 03, 2016 at 10:02:05PM -0600, Alex Williamson wrote:
> > > We can always have QEMU reject hot-adding the device if the reserved
> > > region overlaps existing guest RAM, but I don't even really see how we
> > > advise users to give them a reasonable chance of avoiding that
> > > possibility. Apparently there are also ARM platforms where MSI pages
> > > cannot be remapped to support the previous programmable user/VM
> > > address, is it even worthwhile to support those platforms? Does that
> > > decision influence whether user programmable MSI reserved regions are
> > > really a second class citizen to fixed reserved regions? I expect
> > > we'll be talking about this tomorrow morning, but I certainly haven't
> > > come up with any viable solutions to this. Thanks,
> >
> > At LPC last week, we discussed guest MSIs on arm64 as part of the PCI
> > microconference. I presented some slides to illustrate some of the issues
> > we're trying to solve:
> >
> > http://www.willdeacon.ukfsn.org/bitbucket/lpc-16/msi-in-guest-arm64.pdf
> >
> > Punit took some notes (thanks!) on the etherpad here:
> >
> > https://etherpad.openstack.org/p/LPC2016_PCI
> >
> > although the discussion was pretty lively and jumped about, so I've had
> > to go from memory where the notes didn't capture everything that was
> > said.
> >
> > To summarise, arm64 platforms differ in their handling of MSIs when compared
> > to x86:
> >
> > 1. The physical memory map is not standardised (Jon pointed out that
> > this is something that was realised late on)
> > 2. MSIs are usually treated the same as DMA writes, in that they must be
> > mapped by the SMMU page tables so that they target a physical MSI
> > doorbell
> > 3. On some platforms, MSIs bypass the SMMU entirely (e.g. due to an MSI
> > doorbell built into the PCI RC)
> > 4. Platforms typically have some set of addresses that abort before
> > reaching the SMMU (e.g. because the PCI identifies them as P2P).
> >
> > All of this means that userspace (QEMU) needs to identify the memory
> > regions corresponding to points (3) and (4) and ensure that they are
> > not allocated in the guest physical (IPA) space. For platforms that can
> > remap the MSI doorbell as in (2), then some space also needs to be
> > allocated for that.
> >
> > Rather than treat these as separate problems, a better interface is to
> > tell userspace about a set of reserved regions, and have this include
> > the MSI doorbell, irrespective of whether or not it can be remapped.
>
> Is my understanding correct, that you need to tell userspace about the
> location of the doorbell (in the IOVA space) in case (2), because even
> though the configuration of the device is handled by the (host) kernel
> through trapping of the BARs, we have to avoid the VFIO user programming
> the device to create other DMA transactions to this particular address,
> since that will obviously conflict and either not produce the desired
> DMA transactions or result in unintended weird interrupts?

Correct, if the MSI doorbell IOVA range overlaps RAM in the VM, then
it's potentially a DMA target and we'll get bogus data on DMA read from
the device, and lose data and potentially trigger spurious interrupts on
DMA write from the device. Thanks,

Alex

2016-11-09 02:52:45

by Donald Dutile

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On 11/08/2016 06:35 PM, Alex Williamson wrote:
> On Tue, 8 Nov 2016 21:29:22 +0100
> Christoffer Dall <[email protected]> wrote:
>
>> Hi Will,
>>
>> On Tue, Nov 08, 2016 at 02:45:59AM +0000, Will Deacon wrote:
>>> Hi all,
>>>
>>> I figured this was a reasonable post to piggy-back on for the LPC minutes
>>> relating to guest MSIs on arm64.
>>>
>>> On Thu, Nov 03, 2016 at 10:02:05PM -0600, Alex Williamson wrote:
>>>> We can always have QEMU reject hot-adding the device if the reserved
>>>> region overlaps existing guest RAM, but I don't even really see how we
>>>> advise users to give them a reasonable chance of avoiding that
>>>> possibility. Apparently there are also ARM platforms where MSI pages
>>>> cannot be remapped to support the previous programmable user/VM
>>>> address, is it even worthwhile to support those platforms? Does that
>>>> decision influence whether user programmable MSI reserved regions are
>>>> really a second class citizen to fixed reserved regions? I expect
>>>> we'll be talking about this tomorrow morning, but I certainly haven't
>>>> come up with any viable solutions to this. Thanks,
>>>
>>> At LPC last week, we discussed guest MSIs on arm64 as part of the PCI
>>> microconference. I presented some slides to illustrate some of the issues
>>> we're trying to solve:
>>>
>>> http://www.willdeacon.ukfsn.org/bitbucket/lpc-16/msi-in-guest-arm64.pdf
>>>
>>> Punit took some notes (thanks!) on the etherpad here:
>>>
>>> https://etherpad.openstack.org/p/LPC2016_PCI
>>>
>>> although the discussion was pretty lively and jumped about, so I've had
>>> to go from memory where the notes didn't capture everything that was
>>> said.
>>>
>>> To summarise, arm64 platforms differ in their handling of MSIs when compared
>>> to x86:
>>>
>>> 1. The physical memory map is not standardised (Jon pointed out that
>>> this is something that was realised late on)
>>> 2. MSIs are usually treated the same as DMA writes, in that they must be
>>> mapped by the SMMU page tables so that they target a physical MSI
>>> doorbell
>>> 3. On some platforms, MSIs bypass the SMMU entirely (e.g. due to an MSI
>>> doorbell built into the PCI RC)
>>> 4. Platforms typically have some set of addresses that abort before
>>> reaching the SMMU (e.g. because the PCI identifies them as P2P).
>>>
>>> All of this means that userspace (QEMU) needs to identify the memory
>>> regions corresponding to points (3) and (4) and ensure that they are
>>> not allocated in the guest physical (IPA) space. For platforms that can
>>> remap the MSI doorbell as in (2), then some space also needs to be
>>> allocated for that.
>>>
>>> Rather than treat these as separate problems, a better interface is to
>>> tell userspace about a set of reserved regions, and have this include
>>> the MSI doorbell, irrespective of whether or not it can be remapped.
>>
>> Is my understanding correct, that you need to tell userspace about the
>> location of the doorbell (in the IOVA space) in case (2), because even
>> though the configuration of the device is handled by the (host) kernel
>> through trapping of the BARs, we have to avoid the VFIO user programming
>> the device to create other DMA transactions to this particular address,
>> since that will obviously conflict and either not produce the desired
>> DMA transactions or result in unintended weird interrupts?
>
> Correct, if the MSI doorbell IOVA range overlaps RAM in the VM, then
> it's potentially a DMA target and we'll get bogus data on DMA read from
> the device, and lose data and potentially trigger spurious interrupts on
> DMA write from the device. Thanks,
>
> Alex
>
That's b/c the MSI doorbells are not positioned *above* the SMMU, i.e.,
they address match before the SMMU checks are done. if
all DMA addrs had to go through SMMU first, then the DMA access could
be ignored/rejected.
For bare-metal, memory can't be put in the same place as MSI addrs, or
DMA could never reach it. So it's only a virt issue, unless the VM's memory
address range mimics the host layout.

- Don

2016-11-09 07:43:22

by Eric Auger

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

Hi Will,
On 08/11/2016 20:02, Don Dutile wrote:
> On 11/08/2016 12:54 PM, Will Deacon wrote:
>> On Tue, Nov 08, 2016 at 03:27:23PM +0100, Auger Eric wrote:
>>> On 08/11/2016 03:45, Will Deacon wrote:
>>>> Rather than treat these as separate problems, a better interface is to
>>>> tell userspace about a set of reserved regions, and have this include
>>>> the MSI doorbell, irrespective of whether or not it can be remapped.
>>>> Don suggested that we statically pick an address for the doorbell in a
>>>> similar way to x86, and have the kernel map it there. We could even
>>>> pick
>>>> 0xfee00000. If it conflicts with a reserved region on the platform (due
>>>> to (4)), then we'd obviously have to (deterministically?) allocate it
>>>> somewhere else, but probably within the bottom 4G.
>>> This is tentatively achieved now with
>>> [1] [RFC v2 0/8] KVM PCIe/MSI passthrough on ARM/ARM64 - Alt II
>>> (http://www.mail-archive.com/[email protected]/msg1264506.html)
>>>
>> Yup, I saw that fly by. Hopefully some of the internals can be reused
>> with the current thinking on user ABI.
>>
>>>> The next question is how to tell userspace about all of the reserved
>>>> regions. Initially, the idea was to extend VFIO, however Alex pointed
>>>> out a horrible scenario:
>>>>
>>>> 1. QEMU spawns a VM on system 0
>>>> 2. VM is migrated to system 1
>>>> 3. QEMU attempts to passthrough a device using PCI hotplug
>>>>
>>>> In this scenario, the guest memory map is chosen at step (1), yet there
>>>> is no VFIO fd available to determine the reserved regions. Furthermore,
>>>> the reserved regions may vary between system 0 and system 1. This
>>>> pretty
>>>> much rules out using VFIO to determine the reserved regions. Alex
>>>> suggested
>>>> that the SMMU driver can advertise the regions via
>>>> /sys/class/iommu/. This
>>>> would solve part of the problem, but migration between systems with
>>>> different memory maps can still cause problems if the reserved regions
>>>> of the new system conflict with the guest memory map chosen by QEMU.
>>>
>>> OK so I understand we do not want anymore the VFIO chain capability API
>>> (patch 5 of above series) but we prefer a sysfs approach instead.
>> Right.
>>
>>> I understand the sysfs approach which allows the userspace to get the
>>> info earlier and independently on VFIO. Keeping in mind current QEMU
>>> virt - which is not the only userspace - will not do much from this info
>>> until we bring upheavals in virt address space management. So if I am
>>> not wrong, at the moment the main action to be undertaken is the
>>> rejection of the PCI hotplug in case we detect a collision.
>> I don't think so; it should be up to userspace to reject the hotplug.
>> If userspace doesn't have support for the regions, then that's fine --
>> you just end up in a situation where the CPU page table maps memory
>> somewhere that the device can't see. In other words, you'll end up with
>> spurious DMA failures, but that's exactly what happens with current
>> systems
>> if you passthrough an overlapping region (Robin demonstrated this on
>> Juno).
>>
>> Additionally, you can imagine some future support where you can tell the
>> guest not to use certain regions of its memory for DMA. In this case, you
>> wouldn't want to refuse the hotplug in the case of overlapping regions.
>>
>> Really, I think the kernel side just needs to enumerate the fixed
>> reserved
>> regions, place the doorbell at a fixed address and then advertise these
>> via sysfs.
>>
>>> I can respin [1]
>>> - studying and taking into account Robin's comments about dm_regions
>>> similarities
>>> - removing the VFIO capability chain and replacing this by a sysfs API
>> Ideally, this would be reusable between different SMMU drivers so the
>> sysfs
>> entries have the same format etc.
>>
>>> Would that be OK?
>> Sounds good to me. Are you in a position to prototype something on the
>> qemu
>> side once we've got kernel-side agreement?
yes sure.
>>
>>> What about Alex comments who wanted to report the usable memory ranges
>>> instead of unusable memory ranges?
>>>
>>> Also did you have a chance to discuss the following items:
>>> 1) the VFIO irq safety assessment
>> The discussion really focussed on system topology, as opposed to
>> properties
>> of the doorbell. Regardless of how the device talks to the doorbell, if
>> the doorbell can't protect against things like MSI spoofing, then it's
>> unsafe. My opinion is that we shouldn't allow passthrough by default on
>> systems with unsafe doorbells (we could piggyback on
>> allow_unsafe_interrupts
>> cmdline option to VFIO).
OK.
>>
>> A first step would be making all this opt-in, and only supporting GICv3
>> ITS for now.
> You're trying to support a config that is < GICv3 and no ITS ? ...
> That would be the equiv. of x86 pre-intr-remap, and that's why
> allow_unsafe_interrupts
> hook was created ... to enable devel/kick-the-tires.
>>> 2) the MSI reserved size computation (is an arbitrary size OK?)
>> If we fix the base address, we could fix a size too. However, we'd still
>> need to enumerate the doorbells to check that they fit in the region we
>> have. If not, then we can warn during boot and treat it the same way as
>> a resource conflict (that is, reallocate the region in some deterministic
>> way).
OK

Thanks

Eric
>>
>> Will
>
>

2016-11-09 17:03:26

by Will Deacon

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Tue, Nov 08, 2016 at 09:52:33PM -0500, Don Dutile wrote:
> On 11/08/2016 06:35 PM, Alex Williamson wrote:
> >On Tue, 8 Nov 2016 21:29:22 +0100
> >Christoffer Dall <[email protected]> wrote:
> >>Is my understanding correct, that you need to tell userspace about the
> >>location of the doorbell (in the IOVA space) in case (2), because even
> >>though the configuration of the device is handled by the (host) kernel
> >>through trapping of the BARs, we have to avoid the VFIO user programming
> >>the device to create other DMA transactions to this particular address,
> >>since that will obviously conflict and either not produce the desired
> >>DMA transactions or result in unintended weird interrupts?

Yes, that's the crux of the issue.

> >Correct, if the MSI doorbell IOVA range overlaps RAM in the VM, then
> >it's potentially a DMA target and we'll get bogus data on DMA read from
> >the device, and lose data and potentially trigger spurious interrupts on
> >DMA write from the device. Thanks,
> >
> That's b/c the MSI doorbells are not positioned *above* the SMMU, i.e.,
> they address match before the SMMU checks are done. if
> all DMA addrs had to go through SMMU first, then the DMA access could
> be ignored/rejected.

That's actually not true :( The SMMU can't generally distinguish between MSI
writes and DMA writes, so it would just see a write transaction to the
doorbell address, regardless of how it was generated by the endpoint.

Will

2016-11-09 18:59:12

by Donald Dutile

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On 11/09/2016 12:03 PM, Will Deacon wrote:
> On Tue, Nov 08, 2016 at 09:52:33PM -0500, Don Dutile wrote:
>> On 11/08/2016 06:35 PM, Alex Williamson wrote:
>>> On Tue, 8 Nov 2016 21:29:22 +0100
>>> Christoffer Dall <[email protected]> wrote:
>>>> Is my understanding correct, that you need to tell userspace about the
>>>> location of the doorbell (in the IOVA space) in case (2), because even
>>>> though the configuration of the device is handled by the (host) kernel
>>>> through trapping of the BARs, we have to avoid the VFIO user programming
>>>> the device to create other DMA transactions to this particular address,
>>>> since that will obviously conflict and either not produce the desired
>>>> DMA transactions or result in unintended weird interrupts?
>
> Yes, that's the crux of the issue.
>
>>> Correct, if the MSI doorbell IOVA range overlaps RAM in the VM, then
>>> it's potentially a DMA target and we'll get bogus data on DMA read from
>>> the device, and lose data and potentially trigger spurious interrupts on
>>> DMA write from the device. Thanks,
>>>
>> That's b/c the MSI doorbells are not positioned *above* the SMMU, i.e.,
>> they address match before the SMMU checks are done. if
>> all DMA addrs had to go through SMMU first, then the DMA access could
>> be ignored/rejected.
>
> That's actually not true :( The SMMU can't generally distinguish between MSI
> writes and DMA writes, so it would just see a write transaction to the
> doorbell address, regardless of how it was generated by the endpoint.
>
> Will
>
So, we have real systems where MSI doorbells are placed at the same IOVA
that could have memory for a guest, but not at the same IOVA as memory on real hw ?
How are memory holes passed to SMMU so it doesn't have this issue for bare-metal
(assign an IOVA that overlaps an MSI doorbell address)?




2016-11-09 19:23:07

by Christoffer Dall

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Wed, Nov 09, 2016 at 01:59:07PM -0500, Don Dutile wrote:
> On 11/09/2016 12:03 PM, Will Deacon wrote:
> >On Tue, Nov 08, 2016 at 09:52:33PM -0500, Don Dutile wrote:
> >>On 11/08/2016 06:35 PM, Alex Williamson wrote:
> >>>On Tue, 8 Nov 2016 21:29:22 +0100
> >>>Christoffer Dall <[email protected]> wrote:
> >>>>Is my understanding correct, that you need to tell userspace about the
> >>>>location of the doorbell (in the IOVA space) in case (2), because even
> >>>>though the configuration of the device is handled by the (host) kernel
> >>>>through trapping of the BARs, we have to avoid the VFIO user programming
> >>>>the device to create other DMA transactions to this particular address,
> >>>>since that will obviously conflict and either not produce the desired
> >>>>DMA transactions or result in unintended weird interrupts?
> >
> >Yes, that's the crux of the issue.
> >
> >>>Correct, if the MSI doorbell IOVA range overlaps RAM in the VM, then
> >>>it's potentially a DMA target and we'll get bogus data on DMA read from
> >>>the device, and lose data and potentially trigger spurious interrupts on
> >>>DMA write from the device. Thanks,
> >>>
> >>That's b/c the MSI doorbells are not positioned *above* the SMMU, i.e.,
> >>they address match before the SMMU checks are done. if
> >>all DMA addrs had to go through SMMU first, then the DMA access could
> >>be ignored/rejected.
> >
> >That's actually not true :( The SMMU can't generally distinguish between MSI
> >writes and DMA writes, so it would just see a write transaction to the
> >doorbell address, regardless of how it was generated by the endpoint.
> >
> >Will
> >
> So, we have real systems where MSI doorbells are placed at the same IOVA
> that could have memory for a guest

I don't think this is a property of a hardware system. The problem is
userspace not knowing where in the IOVA space the kernel is going to
place the doorbell, so you can end up (basically by chance) with some
IPA range of guest memory overlapping the IOVA space for the doorbell.


>, but not at the same IOVA as memory on real hw ?

On real hardware without an IOMMU the system designer would have to
separate the IOVA and RAM in the physical address space. With an IOMMU,
the SMMU driver just makes sure to allocate separate regions in the IOVA
space.

The challenge, as I understand it, happens with the VM, because the VM
doesn't allocate the IOVA for the MSI doorbell itself, but the host
kernel does this, independently from the attributes (e.g. memory map) of
the VM.

Because the IOVA is a single resource, but with two independent entities
allocating chunks of it (the host kernel for the MSI doorbell IOVA, and
the VFIO user for other DMA operations), you have to provide some
coordination between those two entities to avoid conflicts. In the case
of KVM, the two entities are the host kernel and the VFIO user (QEMU/the
VM), and the host kernel informs the VFIO user to never attempt to use
the doorbell IOVA already reserved by the host kernel for DMA.

One way to do that is to ensure that the IPA space of the VFIO user
corresponding to the doorbell IOVA is simply not valid, i.e. the reserved
regions prevent, for example, QEMU from allocating RAM there.

(I suppose it's technically possible to get around this issue by letting
QEMU place RAM wherever it wants but tell the guest to never use a
particular subset of its RAM for DMA, because that would conflict with
the doorbell IOVA or be seen as p2p transactions. But I think we all
probably agree that it's a disgusting idea.)

> How are memory holes passed to SMMU so it doesn't have this issue for bare-metal
> (assign an IOVA that overlaps an MSI doorbell address)?
>

As I understand it, the SMMU driver manages the whole IOVA space when
VFIO is *not* involved, so it simply allocates non-overlapping regions.

The problem occurs when you have two independent entities essentially
attempting to manage the same resource (and the problem is exacerbated by
the VM potentially allocating slots in the IOVA space which may have
other limitations it doesn't know about, for example the p2p regions,
because the VM doesn't know anything about the topology of the
underlying physical system).
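
To make the coordination concrete, here is a minimal userspace sketch of
the check the VFIO user would have to perform before mapping guest RAM.
The resv_region list is hypothetical - how it gets populated (VFIO
capability chain vs. sysfs) is exactly what is being debated in this
thread:

#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

struct resv_region {		/* hypothetical: filled from whatever */
	uint64_t start, size;	/* interface the kernel ends up exposing */
};

/* True if [iova, iova+size) avoids every reserved region. */
static bool iova_range_is_free(const struct resv_region *resv, int nr,
			       uint64_t iova, uint64_t size)
{
	for (int i = 0; i < nr; i++)
		if (iova < resv[i].start + resv[i].size &&
		    resv[i].start < iova + size)
			return false;
	return true;
}

static int map_guest_ram(int container_fd, void *vaddr,
			 uint64_t iova, uint64_t size,
			 const struct resv_region *resv, int nr)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)vaddr,
		.iova  = iova,
		.size  = size,
	};

	if (!iova_range_is_free(resv, nr, iova, size))
		return -1;	/* caller decides: abort hotplug, relocate RAM, ... */

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}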

Christoffer

2016-11-09 20:01:21

by Alex Williamson

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Wed, 9 Nov 2016 20:23:03 +0100
Christoffer Dall <[email protected]> wrote:

> On Wed, Nov 09, 2016 at 01:59:07PM -0500, Don Dutile wrote:
> > On 11/09/2016 12:03 PM, Will Deacon wrote:
> > >On Tue, Nov 08, 2016 at 09:52:33PM -0500, Don Dutile wrote:
> > >>On 11/08/2016 06:35 PM, Alex Williamson wrote:
> > >>>On Tue, 8 Nov 2016 21:29:22 +0100
> > >>>Christoffer Dall <[email protected]> wrote:
> > >>>>Is my understanding correct, that you need to tell userspace about the
> > >>>>location of the doorbell (in the IOVA space) in case (2), because even
> > >>>>though the configuration of the device is handled by the (host) kernel
> > >>>>through trapping of the BARs, we have to avoid the VFIO user programming
> > >>>>the device to create other DMA transactions to this particular address,
> > >>>>since that will obviously conflict and either not produce the desired
> > >>>>DMA transactions or result in unintended weird interrupts?
> > >
> > >Yes, that's the crux of the issue.
> > >
> > >>>Correct, if the MSI doorbell IOVA range overlaps RAM in the VM, then
> > >>>it's potentially a DMA target and we'll get bogus data on DMA read from
> > >>>the device, and lose data and potentially trigger spurious interrupts on
> > >>>DMA write from the device. Thanks,
> > >>>
> > >>That's b/c the MSI doorbells are not positioned *above* the SMMU, i.e.,
> > >>they address match before the SMMU checks are done. if
> > >>all DMA addrs had to go through SMMU first, then the DMA access could
> > >>be ignored/rejected.
> > >
> > >That's actually not true :( The SMMU can't generally distinguish between MSI
> > >writes and DMA writes, so it would just see a write transaction to the
> > >doorbell address, regardless of how it was generated by the endpoint.
> > >
> > >Will
> > >
> > So, we have real systems where MSI doorbells are placed at the same IOVA
> > that could have memory for a guest
>
> I don't think this is a property of a hardware system. The problem is
> userspace not knowing where in the IOVA space the kernel is going to
> place the doorbell, so you can end up (basically by chance) that some
> IPA range of guest memory overlaps with the IOVA space for the doorbell.
>
>
> >, but not at the same IOVA as memory on real hw ?
>
> On real hardware without an IOMMU the system designer would have to
> separate the IOVA and RAM in the physical address space. With an IOMMU,
> the SMMU driver just makes sure to allocate separate regions in the IOVA
> space.
>
> The challenge, as I understand it, happens with the VM, because the VM
> doesn't allocate the IOVA for the MSI doorbell itself, but the host
> kernel does this, independently from the attributes (e.g. memory map) of
> the VM.
>
> Because the IOVA is a single resource, but with two independent entities
> allocating chunks of it (the host kernel for the MSI doorbell IOVA, and
> the VFIO user for other DMA operations), you have to provide some
> coordination between those two entities to avoid conflicts. In the case
> of KVM, the two entities are the host kernel and the VFIO user (QEMU/the
> VM), and the host kernel informs the VFIO user to never attempt to use
> the doorbell IOVA already reserved by the host kernel for DMA.
>
> One way to do that is to ensure that the IPA space of the VFIO user
> corresponding to the doorbell IOVA is simply not valid, ie. the reserved
> regions that avoid for example QEMU to allocate RAM there.
>
> (I suppose it's technically possible to get around this issue by letting
> QEMU place RAM wherever it wants but tell the guest to never use a
> particular subset of its RAM for DMA, because that would conflict with
> the doorbell IOVA or be seen as p2p transactions. But I think we all
> probably agree that it's a disgusting idea.)

Well, it's not like QEMU or libvirt stumbling through sysfs to figure
out where holes could be in order to instantiate a VM with matching
holes, just in case someone might decide to hot-add a device into the
VM, at some point, and hopefully they don't migrate the VM to another
host with a different layout first, is all that much less disgusting or
foolproof. It's just that in order to dynamically remove a page as a
possible DMA target we require a paravirt channel, such as a balloon
driver that's able to pluck a specific page. In some ways it's
actually less disgusting, but it puts some prerequisites on
enlightening the guest OS. Thanks,

Alex

2016-11-09 20:11:24

by Robin Murphy

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On 09/11/16 18:59, Don Dutile wrote:
> On 11/09/2016 12:03 PM, Will Deacon wrote:
>> On Tue, Nov 08, 2016 at 09:52:33PM -0500, Don Dutile wrote:
>>> On 11/08/2016 06:35 PM, Alex Williamson wrote:
>>>> On Tue, 8 Nov 2016 21:29:22 +0100
>>>> Christoffer Dall <[email protected]> wrote:
>>>>> Is my understanding correct, that you need to tell userspace about the
>>>>> location of the doorbell (in the IOVA space) in case (2), because even
>>>>> though the configuration of the device is handled by the (host) kernel
>>>>> through trapping of the BARs, we have to avoid the VFIO user
>>>>> programming
>>>>> the device to create other DMA transactions to this particular
>>>>> address,
>>>>> since that will obviously conflict and either not produce the desired
>>>>> DMA transactions or result in unintended weird interrupts?
>>
>> Yes, that's the crux of the issue.
>>
>>>> Correct, if the MSI doorbell IOVA range overlaps RAM in the VM, then
>>>> it's potentially a DMA target and we'll get bogus data on DMA read from
>>>> the device, and lose data and potentially trigger spurious
>>>> interrupts on
>>>> DMA write from the device. Thanks,
>>>>
>>> That's b/c the MSI doorbells are not positioned *above* the SMMU, i.e.,
>>> they address match before the SMMU checks are done. if
>>> all DMA addrs had to go through SMMU first, then the DMA access could
>>> be ignored/rejected.
>>
>> That's actually not true :( The SMMU can't generally distinguish
>> between MSI
>> writes and DMA writes, so it would just see a write transaction to the
>> doorbell address, regardless of how it was generated by the endpoint.
>>
>> Will
>>
> So, we have real systems where MSI doorbells are placed at the same IOVA
> that could have memory for a guest, but not at the same IOVA as memory
> on real hw ?

MSI doorbells integral to PCIe root complexes (and thus untranslatable)
typically have a programmable address, so could be anywhere. In the more
general category of "special hardware addresses", QEMU's default ARM
guest memory map puts RAM starting at 0x40000000; on the ARM Juno
platform, that happens to be where PCI config space starts; as Juno's
PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign
the PCI bus to a guest (all of it, given the lack of ACS), the root
complex just sees the guest's attempts to DMA to "memory" as the device
attempting to access config space and aborts them.

> How are memory holes passed to SMMU so it doesn't have this issue for
> bare-metal
> (assign an IOVA that overlaps an MSI doorbell address)?

When we *are* in full control of the IOVA space, we just carve out what
we can find as best we can - see iova_reserve_pci_windows() in
dma-iommu.c, which isn't really all that different to what x86 does
(e.g. init_reserved_iova_ranges() in amd-iommu.c). Note that we don't
actually have any way currently to discover upstream MSI doorbells
(ponder dw_pcie_msi_init() in pcie-designware.c for an example of the
problem) - the specific MSI support we have in DMA ops at the moment
only covers GICv2m or GICv3 ITS downstream of translation, but
fortunately that's the typical relevant use-case on current platforms.
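
For reference, a simplified sketch of that carve-out, close in spirit to
iova_reserve_pci_windows() (trimmed, not a verbatim copy):

#include <linux/iova.h>
#include <linux/pci.h>
#include <linux/resource_ext.h>

/*
 * Punch the host bridge windows out of the IOVA space, so the DMA
 * allocator can never hand out an address that the root complex would
 * decode itself instead of forwarding through the SMMU.
 */
static void reserve_pci_windows(struct pci_dev *dev, struct iova_domain *iovad)
{
	struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
	struct resource_entry *window;
	unsigned long lo, hi;

	resource_list_for_each_entry(window, &bridge->windows) {
		if (resource_type(window->res) != IORESOURCE_MEM)
			continue;	/* sketch: only MMIO windows shown */

		lo = iova_pfn(iovad, window->res->start - window->offset);
		hi = iova_pfn(iovad, window->res->end - window->offset);
		reserve_iova(iovad, lo, hi);
	}
}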

Robin.

2016-11-09 20:31:45

by Will Deacon

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Wed, Nov 09, 2016 at 08:23:03PM +0100, Christoffer Dall wrote:
> On Wed, Nov 09, 2016 at 01:59:07PM -0500, Don Dutile wrote:
> > On 11/09/2016 12:03 PM, Will Deacon wrote:
> > >On Tue, Nov 08, 2016 at 09:52:33PM -0500, Don Dutile wrote:
> > >>On 11/08/2016 06:35 PM, Alex Williamson wrote:
> > >>>Correct, if the MSI doorbell IOVA range overlaps RAM in the VM, then
> > >>>it's potentially a DMA target and we'll get bogus data on DMA read from
> > >>>the device, and lose data and potentially trigger spurious interrupts on
> > >>>DMA write from the device. Thanks,
> > >>>
> > >>That's b/c the MSI doorbells are not positioned *above* the SMMU, i.e.,
> > >>they address match before the SMMU checks are done. if
> > >>all DMA addrs had to go through SMMU first, then the DMA access could
> > >>be ignored/rejected.
> > >
> > >That's actually not true :( The SMMU can't generally distinguish between MSI
> > >writes and DMA writes, so it would just see a write transaction to the
> > >doorbell address, regardless of how it was generated by the endpoint.
> > >
> > So, we have real systems where MSI doorbells are placed at the same IOVA
> > that could have memory for a guest
>
> I don't think this is a property of a hardware system. The problem is
> userspace not knowing where in the IOVA space the kernel is going to
> place the doorbell, so you can end up (basically by chance) that some
> IPA range of guest memory overlaps with the IOVA space for the doorbell.

I think the case that Don has in mind is where the host is using the SMMU
for DMA mapping. In that case, yes, the IOVAs assigned by things like
dma_map_single mustn't collide with any fixed MSI mapping. We currently take
care to avoid PCI windows, but nobody has added the code for the fixed MSI
mappings yet (I think we should put the onus on the people with the broken
systems for that). Depending on how the firmware describes the fixed MSI
address, either the irqchip driver can take care of it in compose_msi_msg,
or we could do something in the iommu_dma_map_msi_msg path to ensure that
the fixed region is preallocated in the msi_page_list.
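
One possible shape of that preallocation is simply to keep the fixed
window out of the IOVA allocator at domain setup; a sketch with a
hypothetical firmware-described base and size:

#include <linux/iova.h>

#define MSI_DOORBELL_BASE	0xfee00000UL	/* hypothetical, from firmware */
#define MSI_DOORBELL_SIZE	0x100000UL

/* Sketch: make sure dma_map_*() never hands out an IOVA that collides
 * with the fixed (untranslatable) MSI doorbell window. */
static void reserve_fixed_msi_window(struct iova_domain *iovad)
{
	reserve_iova(iovad,
		     iova_pfn(iovad, MSI_DOORBELL_BASE),
		     iova_pfn(iovad, MSI_DOORBELL_BASE + MSI_DOORBELL_SIZE - 1));
}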

I'm less fussed about this issue because there's not a user ABI involved,
so it can all be fixed later.

> >, but not at the same IOVA as memory on real hw ?
>
> On real hardware without an IOMMU the system designer would have to
> separate the IOVA and RAM in the physical address space. With an IOMMU,
> the SMMU driver just makes sure to allocate separate regions in the IOVA
> space.
>
> The challenge, as I understand it, happens with the VM, because the VM
> doesn't allocate the IOVA for the MSI doorbell itself, but the host
> kernel does this, independently from the attributes (e.g. memory map) of
> the VM.
>
> Because the IOVA is a single resource, but with two independent entities
> allocating chunks of it (the host kernel for the MSI doorbell IOVA, and
> the VFIO user for other DMA operations), you have to provide some
> coordination between those two entities to avoid conflicts. In the case
> of KVM, the two entities are the host kernel and the VFIO user (QEMU/the
> VM), and the host kernel informs the VFIO user to never attempt to use
> the doorbell IOVA already reserved by the host kernel for DMA.
>
> One way to do that is to ensure that the IPA space of the VFIO user
> corresponding to the doorbell IOVA is simply not valid, ie. the reserved
> regions that avoid for example QEMU to allocate RAM there.
>
> (I suppose it's technically possible to get around this issue by letting
> QEMU place RAM wherever it wants but tell the guest to never use a
> particular subset of its RAM for DMA, because that would conflict with
> the doorbell IOVA or be seen as p2p transactions. But I think we all
> probably agree that it's a disgusting idea.)

Disgusting, yes, but Ben's idea of hotplugging on the host controller with
firmware tables describing the reserved regions is something that we could
do in the distant future. In the meantime, I don't think that VFIO should
explicitly reject overlapping mappings if userspace asks for them.

Will

2016-11-09 22:17:15

by Alex Williamson

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Wed, 9 Nov 2016 20:31:45 +0000
Will Deacon <[email protected]> wrote:
> On Wed, Nov 09, 2016 at 08:23:03PM +0100, Christoffer Dall wrote:
> >
> > (I suppose it's technically possible to get around this issue by letting
> > QEMU place RAM wherever it wants but tell the guest to never use a
> > particular subset of its RAM for DMA, because that would conflict with
> > the doorbell IOVA or be seen as p2p transactions. But I think we all
> > probably agree that it's a disgusting idea.)
>
> Disgusting, yes, but Ben's idea of hotplugging on the host controller with
> firmware tables describing the reserved regions is something that we could
> do in the distant future. In the meantime, I don't think that VFIO should
> explicitly reject overlapping mappings if userspace asks for them.

I'm confused by the last sentence here, rejecting user mappings that
overlap reserved ranges, such as MSI doorbell pages, is exactly how
we'd reject hot-adding a device when we meet such a conflict. If we
don't reject such a mapping, we're knowingly creating a situation that
potentially leads to data loss. Minimally, QEMU would need to know
about the reserved region, map around it through VFIO, and take
responsibility (somehow) for making sure that region is never used for
DMA. Thanks,

Alex

2016-11-09 22:25:27

by Will Deacon

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Wed, Nov 09, 2016 at 03:17:09PM -0700, Alex Williamson wrote:
> On Wed, 9 Nov 2016 20:31:45 +0000
> Will Deacon <[email protected]> wrote:
> > On Wed, Nov 09, 2016 at 08:23:03PM +0100, Christoffer Dall wrote:
> > >
> > > (I suppose it's technically possible to get around this issue by letting
> > > QEMU place RAM wherever it wants but tell the guest to never use a
> > > particular subset of its RAM for DMA, because that would conflict with
> > > the doorbell IOVA or be seen as p2p transactions. But I think we all
> > > probably agree that it's a disgusting idea.)
> >
> > Disgusting, yes, but Ben's idea of hotplugging on the host controller with
> > firmware tables describing the reserved regions is something that we could
> > do in the distant future. In the meantime, I don't think that VFIO should
> > explicitly reject overlapping mappings if userspace asks for them.
>
> I'm confused by the last sentence here, rejecting user mappings that
> overlap reserved ranges, such as MSI doorbell pages, is exactly how
> we'd reject hot-adding a device when we meet such a conflict. If we
> don't reject such a mapping, we're knowingly creating a situation that
> potentially leads to data loss. Minimally, QEMU would need to know
> about the reserved region, map around it through VFIO, and take
> responsibility (somehow) for making sure that region is never used for
> DMA. Thanks,

Yes, but my point is that it should be up to QEMU to abort the hotplug, not
the host kernel, since there may be ways in which a guest can tolerate the
overlapping region (e.g. by avoiding that range of memory for DMA).

Will

2016-11-09 23:25:04

by Alex Williamson

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Wed, 9 Nov 2016 22:25:22 +0000
Will Deacon <[email protected]> wrote:

> On Wed, Nov 09, 2016 at 03:17:09PM -0700, Alex Williamson wrote:
> > On Wed, 9 Nov 2016 20:31:45 +0000
> > Will Deacon <[email protected]> wrote:
> > > On Wed, Nov 09, 2016 at 08:23:03PM +0100, Christoffer Dall wrote:
> > > >
> > > > (I suppose it's technically possible to get around this issue by letting
> > > > QEMU place RAM wherever it wants but tell the guest to never use a
> > > > particular subset of its RAM for DMA, because that would conflict with
> > > > the doorbell IOVA or be seen as p2p transactions. But I think we all
> > > > probably agree that it's a disgusting idea.)
> > >
> > > Disgusting, yes, but Ben's idea of hotplugging on the host controller with
> > > firmware tables describing the reserved regions is something that we could
> > > do in the distant future. In the meantime, I don't think that VFIO should
> > > explicitly reject overlapping mappings if userspace asks for them.
> >
> > I'm confused by the last sentence here, rejecting user mappings that
> > overlap reserved ranges, such as MSI doorbell pages, is exactly how
> > we'd reject hot-adding a device when we meet such a conflict. If we
> > don't reject such a mapping, we're knowingly creating a situation that
> > potentially leads to data loss. Minimally, QEMU would need to know
> > about the reserved region, map around it through VFIO, and take
> > responsibility (somehow) for making sure that region is never used for
> > DMA. Thanks,
>
> Yes, but my point is that it should be up to QEMU to abort the hotplug, not
> the host kernel, since there may be ways in which a guest can tolerate the
> overlapping region (e.g. by avoiding that range of memory for DMA).

The VFIO_IOMMU_MAP_DMA ioctl is a contract: the user asks to map a range
of IOVAs to a range of virtual addresses for a given device. If VFIO
cannot reasonably fulfill that contract, it must fail. It's up to QEMU
how to manage the hotplug and what memory regions it asks VFIO to map
for a device, but VFIO must reject mappings that it (or the SMMU by
virtue of using the IOMMU API) know to overlap reserved ranges. So I
still disagree with the referenced statement. Thanks,

Alex
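
For concreteness, the check argued for above amounts to a few lines in the
type1 map path. A sketch with hypothetical structure and helper names (not
actual vfio_iommu_type1 code):

#include <linux/errno.h>
#include <linux/list.h>
#include <linux/types.h>

/* Hypothetical per-container list of reserved IOVA ranges. */
struct vfio_resv_region {
	struct list_head	node;
	dma_addr_t		start;
	size_t			size;
};

/*
 * Sketch: called from the VFIO_IOMMU_MAP_DMA path before inserting the
 * mapping; fail the ioctl if the requested IOVA range intersects any
 * reserved region reported by the IOMMU layer.
 */
static int vfio_check_resv_overlap(struct list_head *resv_list,
				   dma_addr_t iova, size_t size)
{
	struct vfio_resv_region *r;

	list_for_each_entry(r, resv_list, node)
		if (iova < r->start + r->size && r->start < iova + size)
			return -EINVAL;

	return 0;
}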

2016-11-09 23:38:50

by Will Deacon

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Wed, Nov 09, 2016 at 04:24:58PM -0700, Alex Williamson wrote:
> On Wed, 9 Nov 2016 22:25:22 +0000
> Will Deacon <[email protected]> wrote:
>
> > On Wed, Nov 09, 2016 at 03:17:09PM -0700, Alex Williamson wrote:
> > > On Wed, 9 Nov 2016 20:31:45 +0000
> > > Will Deacon <[email protected]> wrote:
> > > > On Wed, Nov 09, 2016 at 08:23:03PM +0100, Christoffer Dall wrote:
> > > > >
> > > > > (I suppose it's technically possible to get around this issue by letting
> > > > > QEMU place RAM wherever it wants but tell the guest to never use a
> > > > > particular subset of its RAM for DMA, because that would conflict with
> > > > > the doorbell IOVA or be seen as p2p transactions. But I think we all
> > > > > probably agree that it's a disgusting idea.)
> > > >
> > > > Disgusting, yes, but Ben's idea of hotplugging on the host controller with
> > > > firmware tables describing the reserved regions is something that we could
> > > > do in the distant future. In the meantime, I don't think that VFIO should
> > > > explicitly reject overlapping mappings if userspace asks for them.
> > >
> > > I'm confused by the last sentence here, rejecting user mappings that
> > > overlap reserved ranges, such as MSI doorbell pages, is exactly how
> > > we'd reject hot-adding a device when we meet such a conflict. If we
> > > don't reject such a mapping, we're knowingly creating a situation that
> > > potentially leads to data loss. Minimally, QEMU would need to know
> > > about the reserved region, map around it through VFIO, and take
> > > responsibility (somehow) for making sure that region is never used for
> > > DMA. Thanks,
> >
> > Yes, but my point is that it should be up to QEMU to abort the hotplug, not
> > the host kernel, since there may be ways in which a guest can tolerate the
> > overlapping region (e.g. by avoiding that range of memory for DMA).
>
> The VFIO_IOMMU_MAP_DMA ioctl is a contract, the user ask to map a range
> of IOVAs to a range of virtual addresses for a given device. If VFIO
> cannot reasonably fulfill that contract, it must fail. It's up to QEMU
> how to manage the hotplug and what memory regions it asks VFIO to map
> for a device, but VFIO must reject mappings that it (or the SMMU by
> virtue of using the IOMMU API) know to overlap reserved ranges. So I
> still disagree with the referenced statement. Thanks,

I think that's a pity. Not only does it mean that both QEMU and the kernel
have more work to do (the former has to carve up its mapping requests,
whilst the latter has to check that it is indeed doing this), but it also
precludes the use of hugepage mappings on the IOMMU because of reserved
regions. For example, a 4k hole someplace may mean we can't put down 1GB
table entries for the guest memory in the SMMU.

All this seems to do is add complexity and decrease performance. For what?
QEMU has to go read the reserved regions from someplace anyway. It's also
the way that VFIO works *today* on arm64 wrt reserved regions, it just has
no way to identify those holes at present.

Will

2016-11-10 00:00:02

by Alex Williamson

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Wed, 9 Nov 2016 23:38:50 +0000
Will Deacon <[email protected]> wrote:

> On Wed, Nov 09, 2016 at 04:24:58PM -0700, Alex Williamson wrote:
> > On Wed, 9 Nov 2016 22:25:22 +0000
> > Will Deacon <[email protected]> wrote:
> >
> > > On Wed, Nov 09, 2016 at 03:17:09PM -0700, Alex Williamson wrote:
> > > > On Wed, 9 Nov 2016 20:31:45 +0000
> > > > Will Deacon <[email protected]> wrote:
> > > > > On Wed, Nov 09, 2016 at 08:23:03PM +0100, Christoffer Dall wrote:
> > > > > >
> > > > > > (I suppose it's technically possible to get around this issue by letting
> > > > > > QEMU place RAM wherever it wants but tell the guest to never use a
> > > > > > particular subset of its RAM for DMA, because that would conflict with
> > > > > > the doorbell IOVA or be seen as p2p transactions. But I think we all
> > > > > > probably agree that it's a disgusting idea.)
> > > > >
> > > > > Disgusting, yes, but Ben's idea of hotplugging on the host controller with
> > > > > firmware tables describing the reserved regions is something that we could
> > > > > do in the distant future. In the meantime, I don't think that VFIO should
> > > > > explicitly reject overlapping mappings if userspace asks for them.
> > > >
> > > > I'm confused by the last sentence here, rejecting user mappings that
> > > > overlap reserved ranges, such as MSI doorbell pages, is exactly how
> > > > we'd reject hot-adding a device when we meet such a conflict. If we
> > > > don't reject such a mapping, we're knowingly creating a situation that
> > > > potentially leads to data loss. Minimally, QEMU would need to know
> > > > about the reserved region, map around it through VFIO, and take
> > > > responsibility (somehow) for making sure that region is never used for
> > > > DMA. Thanks,
> > >
> > > Yes, but my point is that it should be up to QEMU to abort the hotplug, not
> > > the host kernel, since there may be ways in which a guest can tolerate the
> > > overlapping region (e.g. by avoiding that range of memory for DMA).
> >
> > The VFIO_IOMMU_MAP_DMA ioctl is a contract, the user ask to map a range
> > of IOVAs to a range of virtual addresses for a given device. If VFIO
> > cannot reasonably fulfill that contract, it must fail. It's up to QEMU
> > how to manage the hotplug and what memory regions it asks VFIO to map
> > for a device, but VFIO must reject mappings that it (or the SMMU by
> > virtue of using the IOMMU API) know to overlap reserved ranges. So I
> > still disagree with the referenced statement. Thanks,
>
> I think that's a pity. Not only does it mean that both QEMU and the kernel
> have more work to do (the former has to carve up its mapping requests,
> whilst the latter has to check that it is indeed doing this), but it also
> precludes the use of hugepage mappings on the IOMMU because of reserved
> regions. For example, a 4k hole someplace may mean we can't put down 1GB
> table entries for the guest memory in the SMMU.
>
> All this seems to do is add complexity and decrease performance. For what?
> QEMU has to go read the reserved regions from someplace anyway. It's also
> the way that VFIO works *today* on arm64 wrt reserved regions, it just has
> no way to identify those holes at present.

Sure, that sucks, but how is the alternative even an option? The user
asked to map something, we can't, if we allow that to happen now it's a
bug. Put the MSI doorbells somewhere that this won't be an issue. If
the platform has it fixed somewhere that this is an issue, don't use
that platform. The correctness of the interface is more important than
catering to a poorly designed system layout IMO. Thanks,

Alex

2016-11-10 00:14:51

by Eric Auger

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

Hi,

On 10/11/2016 00:59, Alex Williamson wrote:
> On Wed, 9 Nov 2016 23:38:50 +0000
> Will Deacon <[email protected]> wrote:
>
>> On Wed, Nov 09, 2016 at 04:24:58PM -0700, Alex Williamson wrote:
>>> On Wed, 9 Nov 2016 22:25:22 +0000
>>> Will Deacon <[email protected]> wrote:
>>>
>>>> On Wed, Nov 09, 2016 at 03:17:09PM -0700, Alex Williamson wrote:
>>>>> On Wed, 9 Nov 2016 20:31:45 +0000
>>>>> Will Deacon <[email protected]> wrote:
>>>>>> On Wed, Nov 09, 2016 at 08:23:03PM +0100, Christoffer Dall wrote:
>>>>>>>
>>>>>>> (I suppose it's technically possible to get around this issue by letting
>>>>>>> QEMU place RAM wherever it wants but tell the guest to never use a
>>>>>>> particular subset of its RAM for DMA, because that would conflict with
>>>>>>> the doorbell IOVA or be seen as p2p transactions. But I think we all
>>>>>>> probably agree that it's a disgusting idea.)
>>>>>>
>>>>>> Disgusting, yes, but Ben's idea of hotplugging on the host controller with
>>>>>> firmware tables describing the reserved regions is something that we could
>>>>>> do in the distant future. In the meantime, I don't think that VFIO should
>>>>>> explicitly reject overlapping mappings if userspace asks for them.
>>>>>
>>>>> I'm confused by the last sentence here, rejecting user mappings that
>>>>> overlap reserved ranges, such as MSI doorbell pages, is exactly how
>>>>> we'd reject hot-adding a device when we meet such a conflict. If we
>>>>> don't reject such a mapping, we're knowingly creating a situation that
>>>>> potentially leads to data loss. Minimally, QEMU would need to know
>>>>> about the reserved region, map around it through VFIO, and take
>>>>> responsibility (somehow) for making sure that region is never used for
>>>>> DMA. Thanks,
>>>>
>>>> Yes, but my point is that it should be up to QEMU to abort the hotplug, not
>>>> the host kernel, since there may be ways in which a guest can tolerate the
>>>> overlapping region (e.g. by avoiding that range of memory for DMA).
>>>
>>> The VFIO_IOMMU_MAP_DMA ioctl is a contract, the user ask to map a range
>>> of IOVAs to a range of virtual addresses for a given device. If VFIO
>>> cannot reasonably fulfill that contract, it must fail. It's up to QEMU
>>> how to manage the hotplug and what memory regions it asks VFIO to map
>>> for a device, but VFIO must reject mappings that it (or the SMMU by
>>> virtue of using the IOMMU API) know to overlap reserved ranges. So I
>>> still disagree with the referenced statement. Thanks,
>>
>> I think that's a pity. Not only does it mean that both QEMU and the kernel
>> have more work to do (the former has to carve up its mapping requests,
>> whilst the latter has to check that it is indeed doing this), but it also
>> precludes the use of hugepage mappings on the IOMMU because of reserved
>> regions. For example, a 4k hole someplace may mean we can't put down 1GB
>> table entries for the guest memory in the SMMU.
>>
>> All this seems to do is add complexity and decrease performance. For what?
>> QEMU has to go read the reserved regions from someplace anyway. It's also
>> the way that VFIO works *today* on arm64 wrt reserved regions, it just has
>> no way to identify those holes at present.
>
> Sure, that sucks, but how is the alternative even an option? The user
> asked to map something, we can't, if we allow that to happen now it's a
> bug. Put the MSI doorbells somewhere that this won't be an issue. If
> the platform has it fixed somewhere that this is an issue, don't use
> that platform. The correctness of the interface is more important than
> catering to a poorly designed system layout IMO. Thanks,

Besides the above problem, I started to prototype the sysfs API. A first
issue I face is that the reserved regions become global to the iommu instead
of characterizing the iommu_domain, i.e. the "reserved_regions" attribute
file sits below an iommu instance (~
/sys/class/iommu/dmar0/intel-iommu/reserved_regions ||
/sys/class/iommu/arm-smmu0/arm-smmu/reserved_regions).
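
For discussion, a rough sketch of what such an attribute could look like
(hypothetical structure and helper names; it assumes the driver already
registers a device under /sys/class/iommu and keeps a driver-level list
of reserved regions):

#include <linux/device.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

struct iommu_resv_entry {		/* hypothetical */
	struct list_head	list;
	phys_addr_t		start;
	size_t			length;
};

struct my_iommu {			/* hypothetical driver state */
	spinlock_t		resv_lock;
	struct list_head	resv_list;
};

static ssize_t reserved_regions_show(struct device *dev,
				     struct device_attribute *attr, char *buf)
{
	struct my_iommu *iommu = dev_get_drvdata(dev);
	struct iommu_resv_entry *e;
	ssize_t len = 0;

	spin_lock(&iommu->resv_lock);
	list_for_each_entry(e, &iommu->resv_list, list)
		len += scnprintf(buf + len, PAGE_SIZE - len,
				 "%pa 0x%zx\n", &e->start, e->length);
	spin_unlock(&iommu->resv_lock);

	return len;
}
static DEVICE_ATTR_RO(reserved_regions);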

The MSI reserved window can be considered global to the IOMMU. However,
the PCIe host bridge P2P regions are rather per iommu_domain.

Do you confirm the attribute file should contain both global reserved
regions and all per iommu_domain reserved regions?

Thoughts?

Thanks

Eric
>
> Alex
>

2016-11-10 00:55:26

by Alex Williamson

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Thu, 10 Nov 2016 01:14:42 +0100
Auger Eric <[email protected]> wrote:

> Hi,
>
> On 10/11/2016 00:59, Alex Williamson wrote:
> > On Wed, 9 Nov 2016 23:38:50 +0000
> > Will Deacon <[email protected]> wrote:
> >
> >> On Wed, Nov 09, 2016 at 04:24:58PM -0700, Alex Williamson wrote:
> >>> On Wed, 9 Nov 2016 22:25:22 +0000
> >>> Will Deacon <[email protected]> wrote:
> >>>
> >>>> On Wed, Nov 09, 2016 at 03:17:09PM -0700, Alex Williamson wrote:
> >>>>> On Wed, 9 Nov 2016 20:31:45 +0000
> >>>>> Will Deacon <[email protected]> wrote:
> >>>>>> On Wed, Nov 09, 2016 at 08:23:03PM +0100, Christoffer Dall wrote:
> >>>>>>>
> >>>>>>> (I suppose it's technically possible to get around this issue by letting
> >>>>>>> QEMU place RAM wherever it wants but tell the guest to never use a
> >>>>>>> particular subset of its RAM for DMA, because that would conflict with
> >>>>>>> the doorbell IOVA or be seen as p2p transactions. But I think we all
> >>>>>>> probably agree that it's a disgusting idea.)
> >>>>>>
> >>>>>> Disgusting, yes, but Ben's idea of hotplugging on the host controller with
> >>>>>> firmware tables describing the reserved regions is something that we could
> >>>>>> do in the distant future. In the meantime, I don't think that VFIO should
> >>>>>> explicitly reject overlapping mappings if userspace asks for them.
> >>>>>
> >>>>> I'm confused by the last sentence here, rejecting user mappings that
> >>>>> overlap reserved ranges, such as MSI doorbell pages, is exactly how
> >>>>> we'd reject hot-adding a device when we meet such a conflict. If we
> >>>>> don't reject such a mapping, we're knowingly creating a situation that
> >>>>> potentially leads to data loss. Minimally, QEMU would need to know
> >>>>> about the reserved region, map around it through VFIO, and take
> >>>>> responsibility (somehow) for making sure that region is never used for
> >>>>> DMA. Thanks,
> >>>>
> >>>> Yes, but my point is that it should be up to QEMU to abort the hotplug, not
> >>>> the host kernel, since there may be ways in which a guest can tolerate the
> >>>> overlapping region (e.g. by avoiding that range of memory for DMA).
> >>>
> >>> The VFIO_IOMMU_MAP_DMA ioctl is a contract, the user ask to map a range
> >>> of IOVAs to a range of virtual addresses for a given device. If VFIO
> >>> cannot reasonably fulfill that contract, it must fail. It's up to QEMU
> >>> how to manage the hotplug and what memory regions it asks VFIO to map
> >>> for a device, but VFIO must reject mappings that it (or the SMMU by
> >>> virtue of using the IOMMU API) know to overlap reserved ranges. So I
> >>> still disagree with the referenced statement. Thanks,
> >>
> >> I think that's a pity. Not only does it mean that both QEMU and the kernel
> >> have more work to do (the former has to carve up its mapping requests,
> >> whilst the latter has to check that it is indeed doing this), but it also
> >> precludes the use of hugepage mappings on the IOMMU because of reserved
> >> regions. For example, a 4k hole someplace may mean we can't put down 1GB
> >> table entries for the guest memory in the SMMU.
> >>
> >> All this seems to do is add complexity and decrease performance. For what?
> >> QEMU has to go read the reserved regions from someplace anyway. It's also
> >> the way that VFIO works *today* on arm64 wrt reserved regions, it just has
> >> no way to identify those holes at present.
> >
> > Sure, that sucks, but how is the alternative even an option? The user
> > asked to map something, we can't, if we allow that to happen now it's a
> > bug. Put the MSI doorbells somewhere that this won't be an issue. If
> > the platform has it fixed somewhere that this is an issue, don't use
> > that platform. The correctness of the interface is more important than
> > catering to a poorly designed system layout IMO. Thanks,
>
> Besides above problematic, I started to prototype the sysfs API. A first
> issue I face is the reserved regions become global to the iommu instead
> of characterizing the iommu_domain, ie. the "reserved_regions" attribute
> file sits below an iommu instance (~
> /sys/class/iommu/dmar0/intel-iommu/reserved_regions ||
> /sys/class/iommu/arm-smmu0/arm-smmu/reserved_regions).
>
> MSI reserved window can be considered global to the IOMMU. However PCIe
> host bridge P2P regions rather are per iommu-domain.
>
> Do you confirm the attribute file should contain both global reserved
> regions and all per iommu_domain reserved regions?
>
> Thoughts?

I don't think we have any business describing IOVA addresses consumed
by peer devices in an IOMMU sysfs file. If it's a separate device it
should be exposed by examining the rest of the topology. Regions
consumed by PCI endpoints and interconnects are already exposed in
sysfs. In fact, is this perhaps a more accurate model for these MSI
controllers too? Perhaps they should be exposed in the bus topology
somewhere as consuming the IOVA range. If DMA to an IOVA is consumed
by an intermediate device before it hits the IOMMU vs not being
translated as specified by the user at the IOMMU, I'm less inclined to
call that something VFIO should reject. However, instantiating a VM
with holes to account for every potential peer device seems like it
borders on insanity. Thanks,

Alex

2016-11-10 02:01:30

by Will Deacon

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Wed, Nov 09, 2016 at 05:55:17PM -0700, Alex Williamson wrote:
> On Thu, 10 Nov 2016 01:14:42 +0100
> Auger Eric <[email protected]> wrote:
> > On 10/11/2016 00:59, Alex Williamson wrote:
> > > On Wed, 9 Nov 2016 23:38:50 +0000
> > > Will Deacon <[email protected]> wrote:
> > >> On Wed, Nov 09, 2016 at 04:24:58PM -0700, Alex Williamson wrote:
> > >>> The VFIO_IOMMU_MAP_DMA ioctl is a contract, the user ask to map a range
> > >>> of IOVAs to a range of virtual addresses for a given device. If VFIO
> > >>> cannot reasonably fulfill that contract, it must fail. It's up to QEMU
> > >>> how to manage the hotplug and what memory regions it asks VFIO to map
> > >>> for a device, but VFIO must reject mappings that it (or the SMMU by
> > >>> virtue of using the IOMMU API) know to overlap reserved ranges. So I
> > >>> still disagree with the referenced statement. Thanks,
> > >>
> > >> I think that's a pity. Not only does it mean that both QEMU and the kernel
> > >> have more work to do (the former has to carve up its mapping requests,
> > >> whilst the latter has to check that it is indeed doing this), but it also
> > >> precludes the use of hugepage mappings on the IOMMU because of reserved
> > >> regions. For example, a 4k hole someplace may mean we can't put down 1GB
> > >> table entries for the guest memory in the SMMU.
> > >>
> > >> All this seems to do is add complexity and decrease performance. For what?
> > >> QEMU has to go read the reserved regions from someplace anyway. It's also
> > >> the way that VFIO works *today* on arm64 wrt reserved regions, it just has
> > >> no way to identify those holes at present.
> > >
> > > Sure, that sucks, but how is the alternative even an option? The user
> > > asked to map something, we can't, if we allow that to happen now it's a
> > > bug. Put the MSI doorbells somewhere that this won't be an issue. If
> > > the platform has it fixed somewhere that this is an issue, don't use
> > > that platform. The correctness of the interface is more important than
> > > catering to a poorly designed system layout IMO. Thanks,
> >
> > Besides above problematic, I started to prototype the sysfs API. A first
> > issue I face is the reserved regions become global to the iommu instead
> > of characterizing the iommu_domain, ie. the "reserved_regions" attribute
> > file sits below an iommu instance (~
> > /sys/class/iommu/dmar0/intel-iommu/reserved_regions ||
> > /sys/class/iommu/arm-smmu0/arm-smmu/reserved_regions).
> >
> > MSI reserved window can be considered global to the IOMMU. However PCIe
> > host bridge P2P regions rather are per iommu-domain.

I don't think we can treat them as per-domain, given that we want to
enumerate this stuff before we've decided to do a hotplug (and therefore
don't have a domain).

> >
> > Do you confirm the attribute file should contain both global reserved
> > regions and all per iommu_domain reserved regions?
> >
> > Thoughts?
>
> I don't think we have any business describing IOVA addresses consumed
> by peer devices in an IOMMU sysfs file. If it's a separate device it
> should be exposed by examining the rest of the topology. Regions
> consumed by PCI endpoints and interconnects are already exposed in
> sysfs. In fact, is this perhaps a more accurate model for these MSI
> controllers too? Perhaps they should be exposed in the bus topology
> somewhere as consuming the IOVA range. If DMA to an IOVA is consumed
> by an intermediate device before it hits the IOMMU vs not being
> translated as specified by the user at the IOMMU, I'm less inclined to
> call that something VFIO should reject.

Oh, so perhaps we've been talking past each other. In all of these cases,
the SMMU can translate the access if it makes it that far. The issue is
that not all accesses do make it that far, because they may be "consumed"
by another device, such as an MSI doorbell or another endpoint. In other
words, I don't envisage a scenario where e.g. some address range just
bypasses the SMMU straight to memory. I realise now that that's not clear
from the slides I presented.

> However, instantiating a VM
> with holes to account for every potential peer device seems like it
> borders on insanity. Thanks,

Ok, so rather than having a list of reserved regions under the iommu node,
you're proposing that each region is attributed to the device that "owns"
(consumes) it? I think that can work, but we need to make sure that:

(a) The topology is walkable from userspace (where do you start?)

(b) It also works for platform (non-PCI) devices, that lack much in the
way of bus hierarchy

(c) It doesn't require Linux to have a driver bound to a device in order
for the ranges consumed by that device to be advertised (again,
more of an issue for non-PCI).

How is this currently advertised for PCI? I'd really like to use the same
scheme irrespective of the bus type.

Will

2016-11-10 11:14:50

by Eric Auger

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

Hi Will, Alex,

On 10/11/2016 03:01, Will Deacon wrote:
> On Wed, Nov 09, 2016 at 05:55:17PM -0700, Alex Williamson wrote:
>> On Thu, 10 Nov 2016 01:14:42 +0100
>> Auger Eric <[email protected]> wrote:
>>> On 10/11/2016 00:59, Alex Williamson wrote:
>>>> On Wed, 9 Nov 2016 23:38:50 +0000
>>>> Will Deacon <[email protected]> wrote:
>>>>> On Wed, Nov 09, 2016 at 04:24:58PM -0700, Alex Williamson wrote:
>>>>>> The VFIO_IOMMU_MAP_DMA ioctl is a contract, the user ask to map a range
>>>>>> of IOVAs to a range of virtual addresses for a given device. If VFIO
>>>>>> cannot reasonably fulfill that contract, it must fail. It's up to QEMU
>>>>>> how to manage the hotplug and what memory regions it asks VFIO to map
>>>>>> for a device, but VFIO must reject mappings that it (or the SMMU by
>>>>>> virtue of using the IOMMU API) know to overlap reserved ranges. So I
>>>>>> still disagree with the referenced statement. Thanks,
>>>>>
>>>>> I think that's a pity. Not only does it mean that both QEMU and the kernel
>>>>> have more work to do (the former has to carve up its mapping requests,
>>>>> whilst the latter has to check that it is indeed doing this), but it also
>>>>> precludes the use of hugepage mappings on the IOMMU because of reserved
>>>>> regions. For example, a 4k hole someplace may mean we can't put down 1GB
>>>>> table entries for the guest memory in the SMMU.
>>>>>
>>>>> All this seems to do is add complexity and decrease performance. For what?
>>>>> QEMU has to go read the reserved regions from someplace anyway. It's also
>>>>> the way that VFIO works *today* on arm64 wrt reserved regions, it just has
>>>>> no way to identify those holes at present.
>>>>
>>>> Sure, that sucks, but how is the alternative even an option? The user
>>>> asked to map something, we can't, if we allow that to happen now it's a
>>>> bug. Put the MSI doorbells somewhere that this won't be an issue. If
>>>> the platform has it fixed somewhere that this is an issue, don't use
>>>> that platform. The correctness of the interface is more important than
>>>> catering to a poorly designed system layout IMO. Thanks,
>>>
>>> Besides above problematic, I started to prototype the sysfs API. A first
>>> issue I face is the reserved regions become global to the iommu instead
>>> of characterizing the iommu_domain, ie. the "reserved_regions" attribute
>>> file sits below an iommu instance (~
>>> /sys/class/iommu/dmar0/intel-iommu/reserved_regions ||
>>> /sys/class/iommu/arm-smmu0/arm-smmu/reserved_regions).
>>>
>>> MSI reserved window can be considered global to the IOMMU. However PCIe
>>> host bridge P2P regions rather are per iommu-domain.
>
> I don't think we can treat them as per-domain, given that we want to
> enumerate this stuff before we've decided to do a hotplug (and therefore
> don't have a domain).
That's the issue indeed. We need to wait for the PCIe device to be
connected to the IOMMU; only at VFIO SET_IOMMU time do we get the
comprehensive list of P2P regions that can impact IOVA mapping for this
IOMMU. This removes any advantage of the sysfs API over the previous VFIO
capability chain API for early P2P IOVA range enumeration.

>
>>>
>>> Do you confirm the attribute file should contain both global reserved
>>> regions and all per iommu_domain reserved regions?
>>>
>>> Thoughts?
>>
>> I don't think we have any business describing IOVA addresses consumed
>> by peer devices in an IOMMU sysfs file. If it's a separate device it
>> should be exposed by examining the rest of the topology. Regions
>> consumed by PCI endpoints and interconnects are already exposed in
>> sysfs. In fact, is this perhaps a more accurate model for these MSI
>> controllers too? Perhaps they should be exposed in the bus topology
>> somewhere as consuming the IOVA range.
Currently on x86 the P2P regions are not checked when allowing
passthrough. Aren't we more papist than the pope? As Don mentioned,
shouldn't we simply consider that a platform that does not support
proper ACS, like Juno, is not a candidate for safe passthrough?

At least we can state that the feature also is missing on x86, and it
would be nice to report the risk to userspace and urge it to opt in.

To me, taking those P2P regions into account is still controversial and
induces the bulk of the complexity. Considering the migration use case
discussed at LPC, handling only the MSI problem looks much easier: the
host can choose an MSI base that is QEMU mach-virt friendly, i.e. a
non-RAM region. The problem is to satisfy all potential uses, though.
When migrating, mach-virt is still being used, so there should not be
any collision. Am I missing some weird migration use cases here? Of
course, if we take new host PCIe P2P regions into consideration, this
becomes completely different.

We still have the good old v14 where userspace chose where the MSI
IOVAs are put, without any risk of collision ;-)

>> If DMA to an IOVA is consumed
>> by an intermediate device before it hits the IOMMU vs not being
>> translated as specified by the user at the IOMMU, I'm less inclined to
>> call that something VFIO should reject.
>
> Oh, so perhaps we've been talking past each other. In all of these cases,
> the SMMU can translate the access if it makes it that far. The issue is
> that not all accesses do make it that far, because they may be "consumed"
> by another device, such as an MSI doorbell or another endpoint. In other
> words, I don't envisage a scenario where e.g. some address range just
> bypasses the SMMU straight to memory. I realise now that that's not clear
> from the slides I presented.
>
>> However, instantiating a VM
>> with holes to account for every potential peer device seems like it
>> borders on insanity. Thanks,
>
> Ok, so rather than having a list of reserved regions under the iommu node,
> you're proposing that each region is attributed to the device that "owns"
> (consumes) it? I think that can work, but we need to make sure that:
>
> (a) The topology is walkable from userspace (where do you start?)
>
> (b) It also works for platform (non-PCI) devices, that lack much in the
> way of bus hierarchy
>
> (c) It doesn't require Linux to have a driver bound to a device in order
> for the ranges consumed by that device to be advertised (again,
> more of an issue for non-PCI).
Looks like a long way to go ...

Thanks

Eric

>
> How is this currently advertised for PCI? I'd really like to use the same
> scheme irrespective of the bus type.
>
> Will
>

2016-11-10 14:40:11

by Joerg Roedel

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Wed, Nov 09, 2016 at 01:01:14PM -0700, Alex Williamson wrote:
> Well, it's not like QEMU or libvirt stumbling through sysfs to figure
> out where holes could be in order to instantiate a VM with matching
> holes, just in case someone might decide to hot-add a device into the
> VM, at some point, and hopefully they don't migrate the VM to another
> host with a different layout first, is all that much less disgusting or
> foolproof. It's just that in order to dynamically remove a page as a
> possible DMA target we require a paravirt channel, such as a balloon
> driver that's able to pluck a specific page. In some ways it's
> actually less disgusting, but it puts some prerequisites on
> enlightening the guest OS. Thanks,

I think it is much simpler if libvirt/qemu just go through all
potentially assignable devices on a system and pre-exclude any addresses
from guest RAM beforehand, rather than doing something like this with
paravirt/ballooning when a device is hot-added. There is no guarantee
that you can take a page away from a linux-guest.


Joerg

2016-11-10 14:52:04

by Joerg Roedel

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Thu, Nov 10, 2016 at 01:14:42AM +0100, Auger Eric wrote:
> Besides above problematic, I started to prototype the sysfs API. A first
> issue I face is the reserved regions become global to the iommu instead
> of characterizing the iommu_domain, ie. the "reserved_regions" attribute
> file sits below an iommu instance (~
> /sys/class/iommu/dmar0/intel-iommu/reserved_regions ||
> /sys/class/iommu/arm-smmu0/arm-smmu/reserved_regions).
>
> MSI reserved window can be considered global to the IOMMU. However PCIe
> host bridge P2P regions rather are per iommu-domain.
>
> Do you confirm the attribute file should contain both global reserved
> regions and all per iommu_domain reserved regions?
>
> Thoughts?

This information is best attached to the sysfs-representation of
iommu-groups. The file should then contain the superset of all reserved
regions of the devices the group contains. This makes it usable later to
also describe RMRR/Unity-mapped regions on x86 there and make them
assignable to guests as well.
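
A sketch of what such a group-level attribute could look like inside
drivers/iommu/iommu.c (the region type and the aggregation helper below are
hypothetical, just to show the shape; collecting the superset from the
devices of the group is the real work):

	/* Sketch only, not a patch.  iommu_group_get_resv_regions() is a
	 * hypothetical helper that would collect the superset of the
	 * reserved regions of all devices in the group into the list
	 * passed in.
	 */
	static ssize_t iommu_group_show_resv_regions(struct iommu_group *group,
						     char *buf)
	{
		struct iommu_reserved_region *region;	/* hypothetical type */
		struct list_head head;
		ssize_t len = 0;

		INIT_LIST_HEAD(&head);
		iommu_group_get_resv_regions(group, &head);	/* hypothetical */

		list_for_each_entry(region, &head, list)
			len += scnprintf(buf + len, PAGE_SIZE - len,
					 "0x%016llx 0x%016llx\n",
					 (unsigned long long)region->start,
					 (unsigned long long)(region->start +
							      region->length - 1));
		/* freeing of the temporary list omitted for brevity */
		return len;
	}
	static IOMMU_GROUP_ATTR(reserved_regions, 0444,
				iommu_group_show_resv_regions, NULL);

The attribute would then be registered with iommu_group_create_file() when
the group is set up, so it shows up under /sys/kernel/iommu_groups/<N>/.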


Joerg

2016-11-10 15:18:53

by Joerg Roedel

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Wed, Nov 09, 2016 at 08:11:16PM +0000, Robin Murphy wrote:
> When we *are* in full control of the IOVA space, we just carve out what
> we can find as best we can - see iova_reserve_pci_windows() in
> dma-iommu.c, which isn't really all that different to what x86 does
> (e.g. init_reserved_iova_ranges() in amd-iommu.c).

Yeah, that code was actually written with a look at what the Intel
driver does. I don't really like that it goes over all resources and
reserves them individually (not only because it is not hotplug-safe).

I have to check whether there is a nice and generic way to find out the
root-bridge windows and just reserve them in the iova-space. That would
be easier and more reliable.
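
An untested sketch of that kind of generic reservation (essentially lifting
what iova_reserve_pci_windows() in dma-iommu.c does, per bridge window,
to wherever the iova_domain is owned):

	/* Untested sketch: reserve the host bridge windows of a PCI
	 * device's root bus in an iova_domain, instead of iterating the
	 * individual device resources.
	 */
	static void reserve_bridge_windows(struct pci_dev *pdev,
					   struct iova_domain *iovad)
	{
		struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
		struct resource_entry *window;

		resource_list_for_each_entry(window, &bridge->windows) {
			if (resource_type(window->res) != IORESOURCE_MEM)
				continue;
			reserve_iova(iovad,
				     iova_pfn(iovad, window->res->start - window->offset),
				     iova_pfn(iovad, window->res->end - window->offset));
		}
	}

Reserving at the bridge level would also cover devices hot-added below that
bridge later, which is presumably the hotplug-safety concern above.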


Joerg

2016-11-10 17:07:09

by Alex Williamson

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Thu, 10 Nov 2016 15:40:07 +0100
Joerg Roedel <[email protected]> wrote:

> On Wed, Nov 09, 2016 at 01:01:14PM -0700, Alex Williamson wrote:
> > Well, it's not like QEMU or libvirt stumbling through sysfs to figure
> > out where holes could be in order to instantiate a VM with matching
> > holes, just in case someone might decide to hot-add a device into the
> > VM, at some point, and hopefully they don't migrate the VM to another
> > host with a different layout first, is all that much less disgusting or
> > foolproof. It's just that in order to dynamically remove a page as a
> > possible DMA target we require a paravirt channel, such as a balloon
> > driver that's able to pluck a specific page. In some ways it's
> > actually less disgusting, but it puts some prerequisites on
> > enlightening the guest OS. Thanks,
>
> I think it is much simpler if libvirt/qemu just go through all
> potentially assignable devices on a system and pre-exclude any addresses
> from guest RAM beforehand, rather than doing something like this with
> paravirt/ballooning when a device is hot-added. There is no guarantee
> that you can take a page away from a linux-guest.

Right, I'm not advocating a paravirt ballooning approach, just pointing
out that they're both terrible, just in different ways. Thanks,

Alex

2016-11-10 17:46:33

by Alex Williamson

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Thu, 10 Nov 2016 12:14:40 +0100
Auger Eric <[email protected]> wrote:

> Hi Will, Alex,
>
> On 10/11/2016 03:01, Will Deacon wrote:
> > On Wed, Nov 09, 2016 at 05:55:17PM -0700, Alex Williamson wrote:
> >> On Thu, 10 Nov 2016 01:14:42 +0100
> >> Auger Eric <[email protected]> wrote:
> >>> On 10/11/2016 00:59, Alex Williamson wrote:
> >>>> On Wed, 9 Nov 2016 23:38:50 +0000
> >>>> Will Deacon <[email protected]> wrote:
> >>>>> On Wed, Nov 09, 2016 at 04:24:58PM -0700, Alex Williamson wrote:
> >>>>>> The VFIO_IOMMU_MAP_DMA ioctl is a contract, the user ask to map a range
> >>>>>> of IOVAs to a range of virtual addresses for a given device. If VFIO
> >>>>>> cannot reasonably fulfill that contract, it must fail. It's up to QEMU
> >>>>>> how to manage the hotplug and what memory regions it asks VFIO to map
> >>>>>> for a device, but VFIO must reject mappings that it (or the SMMU by
> >>>>>> virtue of using the IOMMU API) know to overlap reserved ranges. So I
> >>>>>> still disagree with the referenced statement. Thanks,
> >>>>>
> >>>>> I think that's a pity. Not only does it mean that both QEMU and the kernel
> >>>>> have more work to do (the former has to carve up its mapping requests,
> >>>>> whilst the latter has to check that it is indeed doing this), but it also
> >>>>> precludes the use of hugepage mappings on the IOMMU because of reserved
> >>>>> regions. For example, a 4k hole someplace may mean we can't put down 1GB
> >>>>> table entries for the guest memory in the SMMU.
> >>>>>
> >>>>> All this seems to do is add complexity and decrease performance. For what?
> >>>>> QEMU has to go read the reserved regions from someplace anyway. It's also
> >>>>> the way that VFIO works *today* on arm64 wrt reserved regions, it just has
> >>>>> no way to identify those holes at present.
> >>>>
> >>>> Sure, that sucks, but how is the alternative even an option? The user
> >>>> asked to map something, we can't, if we allow that to happen now it's a
> >>>> bug. Put the MSI doorbells somewhere that this won't be an issue. If
> >>>> the platform has it fixed somewhere that this is an issue, don't use
> >>>> that platform. The correctness of the interface is more important than
> >>>> catering to a poorly designed system layout IMO. Thanks,
> >>>
> >>> Besides above problematic, I started to prototype the sysfs API. A first
> >>> issue I face is the reserved regions become global to the iommu instead
> >>> of characterizing the iommu_domain, ie. the "reserved_regions" attribute
> >>> file sits below an iommu instance (~
> >>> /sys/class/iommu/dmar0/intel-iommu/reserved_regions ||
> >>> /sys/class/iommu/arm-smmu0/arm-smmu/reserved_regions).
> >>>
> >>> MSI reserved window can be considered global to the IOMMU. However PCIe
> >>> host bridge P2P regions rather are per iommu-domain.
> >
> > I don't think we can treat them as per-domain, given that we want to
> > enumerate this stuff before we've decided to do a hotplug (and therefore
> > don't have a domain).
> That's the issue indeed. We need to wait for the PCIe device to be
> connected to the iommu. Only on the VFIO SET_IOMMU we get the
> comprehensive list of P2P regions that can impact IOVA mapping for this
> iommu. This removes any advantage of sysfs API over previous VFIO
> capability chain API for P2P IOVA range enumeration at early stage.

For use through vfio we know that an iommu_domain is minimally composed
of an iommu_group and we can find all the p2p resources of that group
referencing /proc/iomem, at least for PCI-based groups. This is the
part that I don't think any sort of iommu sysfs attributes should be
duplicating.

> >>> Do you confirm the attribute file should contain both global reserved
> >>> regions and all per iommu_domain reserved regions?
> >>>
> >>> Thoughts?
> >>
> >> I don't think we have any business describing IOVA addresses consumed
> >> by peer devices in an IOMMU sysfs file. If it's a separate device it
> >> should be exposed by examining the rest of the topology. Regions
> >> consumed by PCI endpoints and interconnects are already exposed in
> >> sysfs. In fact, is this perhaps a more accurate model for these MSI
> >> controllers too? Perhaps they should be exposed in the bus topology
> >> somewhere as consuming the IOVA range.
> Currently on x86 the P2P regions are not checked when allowing
> passthrough. Aren't we more papist that the pope? As Don mentioned,
> shouldn't we simply consider that a platform that does not support
> proper ACS is not candidate for safe passthrough, like Juno?

There are two sides here, there's the kernel side vfio and there's how
QEMU makes use of vfio. On the kernel side, we create iommu groups as
the set of devices we consider isolated, that doesn't necessarily mean
that there isn't p2p within the group, in fact that potential often
determines the composition of the group. It's the user's problem how
to deal with that potential. When I talk about the contract with
userspace, I consider that to be at the iommu mapping, ie. for
transactions that actually make it to the iommu. In the case of x86,
we know that DMA mappings overlapping the MSI doorbells won't be
translated correctly, it's not a valid mapping for that range, and
therefore the iommu driver backing the IOMMU API should describe that
reserved range and reject mappings to it. For devices downstream of
the IOMMU, whether they be p2p or MSI controllers consuming fixed IOVA
space, I consider these to be problems beyond the scope of the IOMMU
API, and maybe that's where we've been going wrong all along.

Users like QEMU can currently discover potential p2p conflicts by
looking at the composition of an iommu group and taking into account
the host PCI resources of each device. We don't currently do this,
though we probably should. The reason we typically don't run into
problems with this is that (again) x86 has a fairly standard memory
layout. Potential p2p regions are typically in an MMIO hole in the
host that sufficiently matches an MMIO hole in the guest. So we don't
often have VM RAM, which could be a DMA target, matching those p2p
addresses. We also hope that any serious device assignment users have
singleton iommu groups, ie. the IO subsystem is designed to support
proper, fine grained isolation.

> At least we can state the feature also is missing on x86 and it would be
> nice to report the risk to the userspace and urge him to opt-in.

Sure, but the information is already there, it's "just" a matter of
QEMU taking it into account, which has some implications that VMs with
any potential of doing device assignment need to be instantiated with
address maps compatible with the host system, which is not an easy feat
for something often considered the ugly step-child of virtualization.

> To me taking into account those P2P still is controversial and induce
> the bulk of the complexity. Considering the migration use case discussed
> at LPC while only handling the MSI problem looks much easier.
> host can choose an MSI base that is QEMU mach-virt friendly, ie. non RAM
> region. Problem is to satisfy all potential uses though. When migrating,
> mach-virt still is being used so there should not be any collision. Am I
> missing some migration weird use cases here? Of course if we take into
> consideration new host PCIe P2P regions this becomes completely different.

Yep, x86 having a standard MSI range is a nice happenstance, so long as
we're running an x86 VM, we don't worry about that being a DMA target.
Running non-x86 VMs on x86 hosts hits this problem, but is several
orders of magnitude lower priority.

> We still have the good old v14 where the user space chose where MSI
> IOVA's are put without any risk of collision ;-)
>
> >> If DMA to an IOVA is consumed
> >> by an intermediate device before it hits the IOMMU vs not being
> >> translated as specified by the user at the IOMMU, I'm less inclined to
> >> call that something VFIO should reject.
> >
> > Oh, so perhaps we've been talking past each other. In all of these cases,
> > the SMMU can translate the access if it makes it that far. The issue is
> > that not all accesses do make it that far, because they may be "consumed"
> > by another device, such as an MSI doorbell or another endpoint. In other
> > words, I don't envisage a scenario where e.g. some address range just
> > bypasses the SMMU straight to memory. I realise now that that's not clear
> > from the slides I presented.

As above, so long as a transaction that does make it to the iommu is
translated as prescribed by the user, I have no basis for rejecting a
user requested translation. Downstream MSI controllers consuming IOVA
space is no different than the existing p2p problem that vfio considers
a userspace issue.

> >> However, instantiating a VM
> >> with holes to account for every potential peer device seems like it
> >> borders on insanity. Thanks,
> >
> > Ok, so rather than having a list of reserved regions under the iommu node,
> > you're proposing that each region is attributed to the device that "owns"
> > (consumes) it? I think that can work, but we need to make sure that:
> >
> > (a) The topology is walkable from userspace (where do you start?)

For PCI devices userspace can examine the topology of the iommu group
and exclude MMIO ranges of peer devices based on the BARs, which are
exposed in various places, pci-sysfs as well as /proc/iomem. For
non-PCI or MSI controllers... ???
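
For the PCI side, a rough userspace sketch of that enumeration -- just
reading the pci-sysfs "resource" file of every device listed under the
group, nothing QEMU-specific -- could look like:

	/* Userspace sketch: print the BAR ranges of every device in an
	 * IOMMU group by reading the pci-sysfs "resource" files.
	 */
	#include <stdio.h>
	#include <dirent.h>

	static void dump_group_bars(int group)
	{
		char path[256];
		struct dirent *d;
		DIR *dir;

		snprintf(path, sizeof(path),
			 "/sys/kernel/iommu_groups/%d/devices", group);
		dir = opendir(path);
		if (!dir)
			return;

		while ((d = readdir(dir))) {
			char res[512];
			unsigned long long start, end, flags;
			FILE *f;

			if (d->d_name[0] == '.')
				continue;
			snprintf(res, sizeof(res), "%s/%s/resource",
				 path, d->d_name);
			f = fopen(res, "r");
			if (!f)
				continue;	/* no pci-sysfs resource file */
			while (fscanf(f, "%llx %llx %llx",
				      &start, &end, &flags) == 3)
				if (flags)	/* unimplemented BARs read as zeroes */
					printf("%s: 0x%llx-0x%llx\n",
					       d->d_name, start, end);
			fclose(f);
		}
		closedir(dir);
	}

Non-PCI devices and MSI controllers have no such file to read, which is
exactly the open question.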

> > (b) It also works for platform (non-PCI) devices, that lack much in the
> > way of bus hierarchy

No idea here, without a userspace visible topology the user is in the
dark as to what devices potentially sit between them and the iommu.

> > (c) It doesn't require Linux to have a driver bound to a device in order
> > for the ranges consumed by that device to be advertised (again,
> > more of an issue for non-PCI).

Right, PCI has this problem solved, be more like PCI ;)

> > How is this currently advertised for PCI? I'd really like to use the same
> > scheme irrespective of the bus type.

For all devices within an IOMMU group, /proc/iomem might be the
solution, but I don't know how the MSI controller works. If the MSI
controller belongs to the group, then maybe it's a matter of creating a
struct device for it that consumes resources and making it show up in
both the iommu group and /proc/iomem. An MSI controller shared between
groups, which already sounds like a bit of a violation of iommu groups,
would need to be discoverable some other way. Thanks,

Alex

2016-11-11 11:19:48

by Joerg Roedel

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Thu, Nov 10, 2016 at 10:46:01AM -0700, Alex Williamson wrote:
> In the case of x86, we know that DMA mappings overlapping the MSI
> doorbells won't be translated correctly, it's not a valid mapping for
> that range, and therefore the iommu driver backing the IOMMU API
> should describe that reserved range and reject mappings to it.

The drivers actually allow mappings to the MSI region via the IOMMU-API,
and I think it should stay this way also for other reserved ranges.
Address space management is done by the IOMMU-API user already (and has
to be done there nowadays), be it a DMA-API implementation which just
reserves these regions in its address space allocator or be it VFIO with
QEMU, which don't map RAM there anyway. So there is no point of checking
this again in the IOMMU drivers and we can keep that out of the
mapping/unmapping fast-path.

> For PCI devices userspace can examine the topology of the iommu group
> and exclude MMIO ranges of peer devices based on the BARs, which are
> exposed in various places, pci-sysfs as well as /proc/iomem. For
> non-PCI or MSI controllers... ???

Right, the hardware resources can be examined. But maybe this can be
extended to also cover RMRR ranges? Then we would be able to assign
devices with RMRR mappings to guests.



Joerg

2016-11-11 15:51:01

by Alex Williamson

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Fri, 11 Nov 2016 12:19:44 +0100
Joerg Roedel <[email protected]> wrote:

> On Thu, Nov 10, 2016 at 10:46:01AM -0700, Alex Williamson wrote:
> > In the case of x86, we know that DMA mappings overlapping the MSI
> > doorbells won't be translated correctly, it's not a valid mapping for
> > that range, and therefore the iommu driver backing the IOMMU API
> > should describe that reserved range and reject mappings to it.
>
> The drivers actually allow mappings to the MSI region via the IOMMU-API,
> and I think it should stay this way also for other reserved ranges.
> Address space management is done by the IOMMU-API user already (and has
> to be done there nowadays), be it a DMA-API implementation which just
> reserves these regions in its address space allocator or be it VFIO with
> QEMU, which don't map RAM there anyway. So there is no point of checking
> this again in the IOMMU drivers and we can keep that out of the
> mapping/unmapping fast-path.

It's really just a happenstance that we don't map RAM over the x86 MSI
range though. That property really can't be guaranteed once we mix
architectures, such as running an aarch64 VM on an x86 host via TCG.
AIUI, the MSI range is actually handled differently than other DMA
ranges, so an iommu_map() overlapping a range that the iommu cannot map
should fail just like an attempt to map beyond the address width of the
iommu.
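
As an illustration only (the reserved-region list and its fields here are
hypothetical, and Joerg's objection above is precisely that such a check
should stay out of the mapping fast path), the kind of rejection meant is:

	/* Hypothetical: assumes the domain carries a list of reserved regions. */
	static bool overlaps_reserved_region(struct iommu_domain *domain,
					     unsigned long iova, size_t size)
	{
		struct iommu_reserved_region *r;	/* hypothetical type */

		list_for_each_entry(r, &domain->reserved_regions, list)	/* hypothetical */
			if (iova < r->start + r->length && r->start < iova + size)
				return true;
		return false;
	}

	/* ... then, early in iommu_map():
	 *
	 *	if (overlaps_reserved_region(domain, iova, size))
	 *		return -EINVAL;
	 */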

> > For PCI devices userspace can examine the topology of the iommu group
> > and exclude MMIO ranges of peer devices based on the BARs, which are
> > exposed in various places, pci-sysfs as well as /proc/iomem. For
> > non-PCI or MSI controllers... ???
>
> Right, the hardware resources can be examined. But maybe this can be
> extended to also cover RMRR ranges? Then we would be able to assign
> devices with RMRR mappings to guests.

RMRRs are special in a different way, the VT-d spec requires that the
OS honor RMRRs, the user has no responsibility (and currently no
visibility) to make that same arrangement. In order to potentially
protect the physical host platform, the iommu drivers should prevent a
user from remapping RMRRs. Maybe there needs to be a different
interface used by untrusted users vs in-kernel drivers, but I think the
kernel really needs to be defensive in the case of user mappings, which
is where the IOMMU API is rooted. Thanks,

Alex

2016-11-11 16:00:47

by Donald Dutile

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On 11/11/2016 06:19 AM, Joerg Roedel wrote:
> On Thu, Nov 10, 2016 at 10:46:01AM -0700, Alex Williamson wrote:
>> In the case of x86, we know that DMA mappings overlapping the MSI
>> doorbells won't be translated correctly, it's not a valid mapping for
>> that range, and therefore the iommu driver backing the IOMMU API
>> should describe that reserved range and reject mappings to it.
>
> The drivers actually allow mappings to the MSI region via the IOMMU-API,
> and I think it should stay this way also for other reserved ranges.
> Address space management is done by the IOMMU-API user already (and has
> to be done there nowadays), be it a DMA-API implementation which just
> reserves these regions in its address space allocator or be it VFIO with
> QEMU, which don't map RAM there anyway. So there is no point of checking
> this again in the IOMMU drivers and we can keep that out of the
> mapping/unmapping fast-path.
>
>> For PCI devices userspace can examine the topology of the iommu group
>> and exclude MMIO ranges of peer devices based on the BARs, which are
>> exposed in various places, pci-sysfs as well as /proc/iomem. For
>> non-PCI or MSI controllers... ???
>
> Right, the hardware resources can be examined. But maybe this can be
> extended to also cover RMRR ranges? Then we would be able to assign
> devices with RMRR mappings to guests.
>
Egads, no!

Assigning devices with RMRRs is a security issue waiting to happen, if
it doesn't crash the system before the guest even gets the device --
reset the device before assignment, and if part of the device was
gathering system environmental data, the BIOS/SMM support stops getting
its environmental-data updates and NMIs the system ... for fear that it
may overheat ...


>
>
> Joerg
>

2016-11-11 16:05:47

by Alex Williamson

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Fri, 11 Nov 2016 08:50:56 -0700
Alex Williamson <[email protected]> wrote:

> On Fri, 11 Nov 2016 12:19:44 +0100
> Joerg Roedel <[email protected]> wrote:
>
> > On Thu, Nov 10, 2016 at 10:46:01AM -0700, Alex Williamson wrote:
> > > In the case of x86, we know that DMA mappings overlapping the MSI
> > > doorbells won't be translated correctly, it's not a valid mapping for
> > > that range, and therefore the iommu driver backing the IOMMU API
> > > should describe that reserved range and reject mappings to it.
> >
> > The drivers actually allow mappings to the MSI region via the IOMMU-API,
> > and I think it should stay this way also for other reserved ranges.
> > Address space management is done by the IOMMU-API user already (and has
> > to be done there nowadays), be it a DMA-API implementation which just
> > reserves these regions in its address space allocator or be it VFIO with
> > QEMU, which don't map RAM there anyway. So there is no point of checking
> > this again in the IOMMU drivers and we can keep that out of the
> > mapping/unmapping fast-path.
>
> It's really just a happenstance that we don't map RAM over the x86 MSI
> range though. That property really can't be guaranteed once we mix
> architectures, such as running an aarch64 VM on x86 host via TCG.
> AIUI, the MSI range is actually handled differently than other DMA
> ranges, so a iommu_map() overlapping a range that the iommu cannot map
> should fail just like an attempt to map beyond the address width of the
> iommu.

(clarification, this is x86 specific, the MSI controller - interrupt
remapper - is embedded in the iommu AIUI, so the iommu is actually not
able to provide DMA translation for this range. In architectures where
the MSI controller is separate from the iommu, I agree that the iommu
has no responsibility to fault mapping of iova ranges in the shadow of
an external MSI controller)

> > > For PCI devices userspace can examine the topology of the iommu group
> > > and exclude MMIO ranges of peer devices based on the BARs, which are
> > > exposed in various places, pci-sysfs as well as /proc/iomem. For
> > > non-PCI or MSI controllers... ???
> >
> > Right, the hardware resources can be examined. But maybe this can be
> > extended to also cover RMRR ranges? Then we would be able to assign
> > devices with RMRR mappings to guests.
>
> RMRRs are special in a different way, the VT-d spec requires that the
> OS honor RMRRs, the user has no responsibility (and currently no
> visibility) to make that same arrangement. In order to potentially
> protect the physical host platform, the iommu drivers should prevent a
> user from remapping RMRRS. Maybe there needs to be a different
> interface used by untrusted users vs in-kernel drivers, but I think the
> kernel really needs to be defensive in the case of user mappings, which
> is where the IOMMU API is rooted. Thanks,
>
> Alex

2016-11-11 16:25:31

by Donald Dutile

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On 11/11/2016 10:50 AM, Alex Williamson wrote:
> On Fri, 11 Nov 2016 12:19:44 +0100
> Joerg Roedel <[email protected]> wrote:
>
>> On Thu, Nov 10, 2016 at 10:46:01AM -0700, Alex Williamson wrote:
>>> In the case of x86, we know that DMA mappings overlapping the MSI
>>> doorbells won't be translated correctly, it's not a valid mapping for
>>> that range, and therefore the iommu driver backing the IOMMU API
>>> should describe that reserved range and reject mappings to it.
>>
>> The drivers actually allow mappings to the MSI region via the IOMMU-API,
>> and I think it should stay this way also for other reserved ranges.
>> Address space management is done by the IOMMU-API user already (and has
>> to be done there nowadays), be it a DMA-API implementation which just
>> reserves these regions in its address space allocator or be it VFIO with
>> QEMU, which don't map RAM there anyway. So there is no point of checking
>> this again in the IOMMU drivers and we can keep that out of the
>> mapping/unmapping fast-path.
>
> It's really just a happenstance that we don't map RAM over the x86 MSI
> range though. That property really can't be guaranteed once we mix
> architectures, such as running an aarch64 VM on x86 host via TCG.
> AIUI, the MSI range is actually handled differently than other DMA
> ranges, so a iommu_map() overlapping a range that the iommu cannot map
> should fail just like an attempt to map beyond the address width of the
> iommu.
>
+1. As was stated at Plumbers, x86 MSI is at a fixed, hw location, so:
1) that memory space is never a valid page for the system to use as IOVA,
therefore there is nothing to micro-manage in the iommu mapping (fast) path.
2) migration between different systems isn't an issue because all x86 systems
have this mapping.
3) ACS resolves DMA writes to mem going to a device(-mmio space).

For aarch64, without such a 'fixed' MSI location, whatever hole/used-space-struct
concept is contrived for MSI (DMA) writes on aarch64 won't prevent migration
failures across mixed aarch64 systems (migrate guest-G from sys-vendor-A to
sys-vendor-B; sys-vendor-A has MSI at addr-A; sys-vendor-B has MSI at addr-B).
Without agreement, migration is only possible across the same systems (it can
even be broken between two systems from the same vendor). ACS in the PCIe path
handles the iova->dev-mmio collision problem. q.e.d.

ergo, my proposal to put the MSI space as the upper-most space of any system ...
FFFF.FFFF.FFFE0.0000 ... and hw drops the upper 1's/F's and uses that for MSI.
That allows it to vary on each system based on max memory: pseudo-fixed, but not
right smack in the middle of mem-space.

There is an inverse scenario for host phys addrs as well:
wiring the upper-most bit of the HPA to be 1==mmio, 0==mem simplifies a lot of
design issues in the cores & chipsets as well. Alpha EV6, case in point
(an 18+ yr old design decision). Another q.e.d.

I hate to admit it, but jcm has it right wrt a 'fixed sys addr map', at least in
this IO area.


>>> For PCI devices userspace can examine the topology of the iommu group
>>> and exclude MMIO ranges of peer devices based on the BARs, which are
>>> exposed in various places, pci-sysfs as well as /proc/iomem. For
>>> non-PCI or MSI controllers... ???
>>
>> Right, the hardware resources can be examined. But maybe this can be
>> extended to also cover RMRR ranges? Then we would be able to assign
>> devices with RMRR mappings to guests.
>
> RMRRs are special in a different way, the VT-d spec requires that the
> OS honor RMRRs, the user has no responsibility (and currently no
> visibility) to make that same arrangement. In order to potentially
> protect the physical host platform, the iommu drivers should prevent a
> user from remapping RMRRS. Maybe there needs to be a different
> interface used by untrusted users vs in-kernel drivers, but I think the
> kernel really needs to be defensive in the case of user mappings, which
> is where the IOMMU API is rooted. Thanks,
>
> Alex
>

2016-11-14 15:19:28

by Joerg Roedel

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On Fri, Nov 11, 2016 at 09:05:43AM -0700, Alex Williamson wrote:
> On Fri, 11 Nov 2016 08:50:56 -0700
> Alex Williamson <[email protected]> wrote:
> >
> > It's really just a happenstance that we don't map RAM over the x86 MSI
> > range though. That property really can't be guaranteed once we mix
> > architectures, such as running an aarch64 VM on x86 host via TCG.
> > AIUI, the MSI range is actually handled differently than other DMA
> > ranges, so a iommu_map() overlapping a range that the iommu cannot map
> > should fail just like an attempt to map beyond the address width of the
> > iommu.
>
> (clarification, this is x86 specific, the MSI controller - interrupt
> remapper - is embedded in the iommu AIUI, so the iommu is actually not
> able to provide DMA translation for this range.

Right, on x86 the MSI range can be covered by page-tables, but those are
ignored by the IOMMU hardware. But what I am trying to say is that
checking for these ranges already happens at a higher level (in the
DMA-API implementations, by marking these regions as allocated), so
there is no need to check for them again in the iommu_map/unmap path.



Joerg

2016-11-21 05:52:16

by Jon Masters

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On 11/07/2016 07:45 PM, Will Deacon wrote:

> I figured this was a reasonable post to piggy-back on for the LPC minutes
> relating to guest MSIs on arm64.

Thanks for this Will. I'm still digging out post-LPC and SC16, but the
summary was much appreciated, and I'm glad the conversation is helping.

> 1. The physical memory map is not standardised (Jon pointed out that
> this is something that was realised late on)

Just to note, we discussed this one about 3-4 years ago. I recall making
a vigorous slideshow at a committee meeting in defense of having a
single memory map for ARMv8 servers and requiring everyone to follow it.
I was weak. I listened to the comments that this was "unreasonable".
Instead, I consider it was unreasonable of me to not get with the other
OS vendors and force things to be done one way. The lack of a "map at
zero" RAM location on ARMv8 has been annoying enough for 32-bit DMA only
devices on 64-bit (behind an SMMU but in passthrough mode it doesn't
help) and other issues beyond fixing the MSI doorbell regions. If I ever
have a time machine, I tried harder.

> Jon pointed out that most people are pretty conservative about hardware
> choices when migrating between them -- that is, they may only migrate
> between different revisions of the same SoC, or they know ahead of time
> all of the memory maps they want to support and this could be communicated
> by way of configuration to libvirt.

I think it's certainly reasonable to assume this in an initial
implementation and fix it later. Currently, we're very conservative
about host CPU passthrough anyway and can't migrate from one microarch
to another revision of the same microarch even. And on x86, nobody
really supports e.g. Intel to AMD and back again. I've always been of
the mind that we should ensure the architecture can handle this, but
then cautiously approach this with a default to not doing it.

> Alex asked if there was a security
> issue with DMA bypassing the SMMU, but there aren't currently any systems
> where that is known to happen. Such a system would surely not be safe for
> passthrough.

There are other potential security issues that came up but don't need to
be noted here (yet). I have wanted to clarify the SBSA for a long time
when it comes to how IOMMUs should be implemented. It's past time that
we went back and had a few conversations about that. I've poked.

> Ben mused that a way to handle conflicts dynamically might be to hotplug
> on the entire host bridge in the guest, passing firmware tables describing
> the new reserved regions as a property of the host bridge. Whilst this
> may well solve the issue, it was largely considered future work due to
> its invasive nature and dependency on firmware tables (and guest support)
> that do not currently exist.

Indeed. It's an elegant solution (thanks Ben) that I gather POWER
already does (good for them). We've obviously got a few things to clean
up after we get the basics in place. Again, I think we can consider it
reasonable that the MSI doorbell regions are predetermined on system A
well ahead of any potential migration (that may or may not then work)
for the moment. Vendors will want to loosen this later, and they can
drive the work to do that, for example by hotplugging a host bridge.

Jon.

2016-11-23 20:12:50

by Donald Dutile

[permalink] [raw]
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On 11/21/2016 12:13 AM, Jon Masters wrote:
> On 11/07/2016 07:45 PM, Will Deacon wrote:
>
>> I figured this was a reasonable post to piggy-back on for the LPC minutes
>> relating to guest MSIs on arm64.
>
> Thanks for this Will. I'm still digging out post-LPC and SC16, but the
> summary was much appreciated, and I'm glad the conversation is helping.
>
>> 1. The physical memory map is not standardised (Jon pointed out that
>> this is something that was realised late on)
>
> Just to note, we discussed this one about 3-4 years ago. I recall making
> a vigorous slideshow at a committee meeting in defense of having a
> single memory map for ARMv8 servers and requiring everyone to follow it.
> I was weak. I listened to the comments that this was "unreasonable".
> Instead, I consider it was unreasonable of me to not get with the other
> OS vendors and force things to be done one way. The lack of a "map at
> zero" RAM location on ARMv8 has been annoying enough for 32-bit DMA only
> devices on 64-bit (behind an SMMU but in passthrough mode it doesn't
> help) and other issues beyond fixing the MSI doorbell regions. If I ever
> have a time machine, I tried harder.
>
>> Jon pointed out that most people are pretty conservative about hardware
>> choices when migrating between them -- that is, they may only migrate
>> between different revisions of the same SoC, or they know ahead of time
>> all of the memory maps they want to support and this could be communicated
>> by way of configuration to libvirt.
>
> I think it's certainly reasonable to assume this in an initial
> implementation and fix it later. Currently, we're very conservative
> about host CPU passthrough anyway and can't migrate from one microarch
> to another revision of the same microarch even. And on x86, nobody
> really supports e.g. Intel to AMD and back again. I've always been of
That's primarily due to different virt implementations ... VMX vs SVM.
1) that's not the case here; one can argue cross-arch variations .. gicv2 vs gicv3 vs gicv4 ...
but cross-vendor should work if the arch and a common feature map (like x86 libvirt can resolve)
are provided
2) second chance to do it better; look and learn!
And the common thread I am trying to drive home here ... learn from past mistakes *and* good choices.


> the mind that we should ensure the architecture can handle this, but
> then cautiously approach this with a default to not doing it.
>
>> Alex asked if there was a security
>> issue with DMA bypassing the SMMU, but there aren't currently any systems
>> where that is known to happen. Such a system would surely not be safe for
>> passthrough.
>
> There are other potential security issues that came up but don't need to
> be noted here (yet). I have wanted to clarify the SBSA for a long time
> when it comes to how IOMMUs should be implemented. It's past time that
> we went back and had a few conversations about that. I've poked.
>
>> Ben mused that a way to handle conflicts dynamically might be to hotplug
>> on the entire host bridge in the guest, passing firmware tables describing
>> the new reserved regions as a property of the host bridge. Whilst this
>> may well solve the issue, it was largely considered future work due to
>> its invasive nature and dependency on firmware tables (and guest support)
>> that do not currently exist.
>
> Indeed. It's an elegant solution (thanks Ben) that I gather POWER
> already does (good for them). We've obviously got a few things to clean
> up after we get the basics in place. Again, I think we can consider it
> reasonable that the MSI doorbell regions are predetermined on system A
> well ahead of any potential migration (that may or may not then work)
> for the moment. Vendors will want to loosen this later, and they can
> drive the work to do that, for example by hotplugging a host bridge.
>
> Jon.
>