This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for mmapping MSI-X table"
http://www.spinics.net/lists/kvm/msg152232.html
This time it is using "caps" in IOMMU groups. The main question is whether
PCI bus flags or IOMMU domains would still be a better fit (and which one).
Here is some background:
The current vfio-pci implementation disallows mmapping of the page
containing the MSI-X table, since users could otherwise write directly
to the MSI-X table and generate incorrect MSIs.
However, this causes a performance problem when performance-critical
device registers live in the same page as the MSI-X table: MMIO accesses
to those registers have to be emulated in QEMU rather than handled
directly in the guest.
To solve this, this series allows the MSI-X table to be exposed to
userspace when the hardware provides interrupt remapping, which ensures
that a given PCI device can only generate the MSIs assigned to it.
Earlier revisions introduced a new bus flag, PCI_BUS_FLAGS_MSI_REMAP,
to test this capability on the PCI side across different archs; this
version uses IOMMU group capabilities instead.
This is based on sha1
26c5cebfdb6c "Merge branch 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"
Please comment. Thanks.
Changelog:
v5:
* redid the whole thing via so-called IOMMU group capabilities
v4:
* rebased on recent upstream
* got all 6 patches from v2 (v3 was missing some)
Alexey Kardashevskiy (5):
iommu: Add capabilities to a group
iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
remapping
iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
enabled
powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
vfio-pci: Allow to expose MSI-X table to userspace when safe
include/linux/iommu.h | 20 ++++++++++++++++++++
include/linux/vfio.h | 1 +
arch/powerpc/kernel/iommu.c | 1 +
drivers/iommu/amd_iommu.c | 3 +++
drivers/iommu/intel-iommu.c | 3 +++
drivers/iommu/iommu.c | 35 +++++++++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci.c | 20 +++++++++++++++++---
drivers/vfio/pci/vfio_pci_rdwr.c | 5 ++++-
drivers/vfio/vfio.c | 15 +++++++++++++++
9 files changed, 99 insertions(+), 4 deletions(-)
--
2.11.0
This sets IOMMU_GROUP_CAP_ISOLATE_MSIX on a group if IRQ remapping
is enabled. For Intel, this additionally checks disable_sourceid_checking;
AMD ignores the "nosid" kernel parameter.
Here is some background on how the isolation works:
On Intel VT-d [1], there is an Interrupt Remapping Table with one entry
per interrupt; each entry carries the source-id (i.e. the BDFN) of the
device allowed to signal that interrupt.
On AMD IOMMU [2], there is a Device Table in which each entry is indexed
by DeviceID, which is the BDFN.
[1] 9.10 Interrupt Remapping Table Entry (IRTE) for Remapped Interrupts
https://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf
[2] "2.2 Data Structures" and "2.2.5 Interrupt Remapping Tables"
https://support.amd.com/TechDocs/48882_IOMMU.pdf
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
drivers/iommu/amd_iommu.c | 3 +++
drivers/iommu/intel-iommu.c | 3 +++
2 files changed, 6 insertions(+)
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 688e77576e5a..d55fcbf3267e 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -470,6 +470,9 @@ static void init_iommu_group(struct device *dev)
if (IS_ERR(group))
return;
+ if (irq_remapping_enabled)
+ iommu_group_set_caps(group, 0, IOMMU_GROUP_CAP_ISOLATE_MSIX);
+
iommu_group_put(group);
}
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index b7e670a7c243..3896c2d44bfa 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5233,6 +5233,9 @@ static int intel_iommu_add_device(struct device *dev)
if (IS_ERR(group))
return PTR_ERR(group);
+ if (irq_remapping_enabled && !disable_sourceid_checking)
+ iommu_group_set_caps(group, 0, IOMMU_GROUP_CAP_ISOLATE_MSIX);
+
iommu_group_put(group);
return 0;
}
--
2.11.0
This introduces capabilities to IOMMU groups. The first defined
capability is IOMMU_GROUP_CAP_ISOLATE_MSIX, which tells IOMMU group
users that a particular IOMMU group is capable of MSI-X message
filtering; this is useful when deciding whether or not to allow mapping
of the MSI-X table to userspace. Various architectures will enable it
when they decide that it is safe to do so.
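As a usage illustration only (a minimal sketch, not part of this patch;
example_msix_isolated() is a hypothetical consumer), an IOMMU driver sets
the capability on a group it manages and a consumer such as VFIO queries
it via the device's group:

  /* IOMMU/arch driver side: mark the group as capable of MSI-X isolation. */
  iommu_group_set_caps(group, 0, IOMMU_GROUP_CAP_ISOLATE_MSIX);

  /* Consumer side: decide whether direct MSI-X table access is safe. */
  static bool example_msix_isolated(struct device *dev)
  {
          struct iommu_group *group = iommu_group_get(dev);
          bool isolated = false;

          if (group) {
                  isolated = iommu_group_is_capable(group,
                                  IOMMU_GROUP_CAP_ISOLATE_MSIX);
                  iommu_group_put(group);
          }

          return isolated;
  }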
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
include/linux/iommu.h | 20 ++++++++++++++++++++
drivers/iommu/iommu.c | 28 ++++++++++++++++++++++++++++
2 files changed, 48 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 2cb54adc4a33..6b6f3c2f4904 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -155,6 +155,9 @@ struct iommu_resv_region {
enum iommu_resv_type type;
};
+/* IOMMU group capabilities */
+#define IOMMU_GROUP_CAP_ISOLATE_MSIX (1U)
+
#ifdef CONFIG_IOMMU_API
/**
@@ -312,6 +315,11 @@ extern void *iommu_group_get_iommudata(struct iommu_group *group);
extern void iommu_group_set_iommudata(struct iommu_group *group,
void *iommu_data,
void (*release)(void *iommu_data));
+extern void iommu_group_set_caps(struct iommu_group *group,
+ unsigned long clearcaps,
+ unsigned long setcaps);
+extern bool iommu_group_is_capable(struct iommu_group *group,
+ unsigned long cap);
extern int iommu_group_set_name(struct iommu_group *group, const char *name);
extern int iommu_group_add_device(struct iommu_group *group,
struct device *dev);
@@ -513,6 +521,18 @@ static inline void iommu_group_set_iommudata(struct iommu_group *group,
{
}
+static inline void iommu_group_set_caps(struct iommu_group *group,
+ unsigned long clearcaps,
+ unsigned long setcaps)
+{
+}
+
+static inline bool iommu_group_is_capable(struct iommu_group *group,
+ unsigned long cap)
+{
+ return false;
+}
+
static inline int iommu_group_set_name(struct iommu_group *group,
const char *name)
{
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3f6ea160afed..6b2c34fe2c3d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -52,6 +52,7 @@ struct iommu_group {
void (*iommu_data_release)(void *iommu_data);
char *name;
int id;
+ unsigned long caps;
struct iommu_domain *default_domain;
struct iommu_domain *domain;
};
@@ -447,6 +448,33 @@ void iommu_group_set_iommudata(struct iommu_group *group, void *iommu_data,
EXPORT_SYMBOL_GPL(iommu_group_set_iommudata);
/**
+ * iommu_group_set_caps - Change the group capabilities
+ * @group: the group
+ * @clearcaps: capabilities mask to remove
+ * @setcaps: capabilities mask to add
+ *
+ * IOMMU groups can be capable of various features which device drivers
+ * may query and use to adjust their behavior.
+ */
+void iommu_group_set_caps(struct iommu_group *group,
+ unsigned long clearcaps, unsigned long setcaps)
+{
+ group->caps &= ~clearcaps;
+ group->caps |= setcaps;
+}
+EXPORT_SYMBOL_GPL(iommu_group_set_caps);
+
+/**
+ * iommu_group_is_capable - Returns if a group capability is present
+ * @group: the group
+ */
+bool iommu_group_is_capable(struct iommu_group *group, unsigned long cap)
+{
+ return !!(group->caps & cap);
+}
+EXPORT_SYMBOL_GPL(iommu_group_is_capable);
+
+/**
* iommu_group_set_name - set name for a group
* @group: the group
* @name: name
--
2.11.0
This sets IOMMU_GROUP_CAP_ISOLATE_MSIX on a group if MSI remapping is
enabled on its IRQ domain; this is the path expected to set the capability
on ARM.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
drivers/iommu/iommu.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 6b2c34fe2c3d..e720e90fa93c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -32,6 +32,7 @@
#include <linux/pci.h>
#include <linux/bitops.h>
#include <linux/property.h>
+#include <linux/irqdomain.h>
#include <trace/events/iommu.h>
static struct kset *iommu_group_kset;
@@ -1028,6 +1029,7 @@ struct iommu_group *iommu_group_get_for_dev(struct device *dev)
const struct iommu_ops *ops = dev->bus->iommu_ops;
struct iommu_group *group;
int ret;
+ struct irq_domain *d = dev_get_msi_domain(dev);
group = iommu_group_get(dev);
if (group)
@@ -1070,6 +1072,11 @@ struct iommu_group *iommu_group_get_for_dev(struct device *dev)
return ERR_PTR(ret);
}
+ if (d && irq_domain_is_msi(d) &&
+ irq_domain_hierarchical_is_msi_remap(d))
+ iommu_group_set_caps(group, 0,
+ IOMMU_GROUP_CAP_ISOLATE_MSIX);
+
return group;
}
--
2.11.0
This sets IOMMU_GROUP_CAP_ISOLATE_MSIX on a group unconditionally as
there is no IOMMU-capable hardware on this platform without such a feature.
On IBM POWERPC (POWER8) [1], the PCI host bridge maintains a BDFN-to-PE
translation (PE stands for "partitionable endpoint"), and the PE index is
used to look up the Interrupt Vector Table (IVT) to identify the interrupt
server. Without these translations in place, MSI-X messages won't pass the PHB.
[1] 3.2.4. MSI Design
http://openpowerfoundation.org/wp-content/uploads/resources/IODA2Spec/IODA2WGSpec-1.0.0-20160217.pdf
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/kernel/iommu.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 233ca3fe4754..dca0a83f1560 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -936,6 +936,7 @@ void iommu_register_group(struct iommu_table_group *table_group,
if (!name)
return;
iommu_group_set_name(grp, name);
+ iommu_group_set_caps(grp, 0, IOMMU_GROUP_CAP_ISOLATE_MSIX);
kfree(name);
}
--
2.11.0
Some devices have an MSI-X BAR that is not aligned to a system page size
larger than 4K (such as 64K on ppc64), which at the moment prevents
such MMIO pages from being mapped to userspace in order to protect
the MSI-X BAR contents. If such a page happens to share
the same system page with frequently accessed registers,
accesses to the entire system page have to be emulated, which can
seriously hurt performance.
This allows mapping of MSI-X tables to userspace if the hardware provides
MSI-X isolation via interrupt remapping or filtering; in other words,
when allowing direct access to the MSI-X BAR cannot harm other devices
or cause spurious interrupts visible to the kernel.
This adds a wrapping helper to check whether a capability is supported by
an IOMMU group.
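For illustration only, here is a hypothetical userspace sketch (not part
of this patch; device_fd, bar and len are assumptions, and error handling
plus the sparse-mmap capability walk are omitted) of how a VFIO user could
map a whole BAR once a single sparse-mmap area is reported:

  #include <sys/types.h>
  #include <sys/mman.h>
  #include <linux/vfio.h>

  static void *example_map_whole_bar(int device_fd, unsigned int bar, size_t len)
  {
          /*
           * vfio-pci encodes the region index in the top bits of the file
           * offset (VFIO_PCI_OFFSET_SHIFT is 40 in the current driver).
           */
          off_t offset = (off_t)(VFIO_PCI_BAR0_REGION_INDEX + bar) << 40;

          return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                      device_fd, offset);
  }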
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
include/linux/vfio.h | 1 +
drivers/vfio/pci/vfio_pci.c | 20 +++++++++++++++++---
drivers/vfio/pci/vfio_pci_rdwr.c | 5 ++++-
drivers/vfio/vfio.c | 15 +++++++++++++++
4 files changed, 37 insertions(+), 4 deletions(-)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 586809abb273..7110bca2fb60 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -46,6 +46,7 @@ struct vfio_device_ops {
extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
extern void vfio_iommu_group_put(struct iommu_group *group, struct device *dev);
+extern bool vfio_iommu_group_is_capable(struct device *dev, unsigned long cap);
extern int vfio_add_group_dev(struct device *dev,
const struct vfio_device_ops *ops,
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index d87a0a3cda14..c4c39ed64b1e 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -561,11 +561,17 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
struct vfio_region_info_cap_sparse_mmap *sparse;
size_t end, size;
int nr_areas = 2, i = 0, ret;
+ bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
+ IOMMU_GROUP_CAP_ISOLATE_MSIX);
end = pci_resource_len(vdev->pdev, vdev->msix_bar);
- /* If MSI-X table is aligned to the start or end, only one area */
- if (((vdev->msix_offset & PAGE_MASK) == 0) ||
+ /*
+ * If MSI-X table is allowed to mmap because of the capability
+ * of IRQ remapping or aligned to the start or end, only one area
+ */
+ if (is_msix_isolated ||
+ ((vdev->msix_offset & PAGE_MASK) == 0) ||
(PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) >= end))
nr_areas = 1;
@@ -577,6 +583,12 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
sparse->nr_areas = nr_areas;
+ if (is_msix_isolated) {
+ sparse->areas[i].offset = 0;
+ sparse->areas[i].size = end;
+ return 0;
+ }
+
if (vdev->msix_offset & PAGE_MASK) {
sparse->areas[i].offset = 0;
sparse->areas[i].size = vdev->msix_offset & PAGE_MASK;
@@ -1094,6 +1106,8 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
unsigned int index;
u64 phys_len, req_len, pgoff, req_start;
int ret;
+ bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
+ IOMMU_GROUP_CAP_ISOLATE_MSIX);
index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
@@ -1115,7 +1129,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
if (req_start + req_len > phys_len)
return -EINVAL;
- if (index == vdev->msix_bar) {
+ if (index == vdev->msix_bar && !is_msix_isolated) {
/*
* Disallow mmaps overlapping the MSI-X table; users don't
* get to touch this directly. We could find somewhere
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 357243d76f10..7514206a5ea7 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -18,6 +18,7 @@
#include <linux/uaccess.h>
#include <linux/io.h>
#include <linux/vgaarb.h>
+#include <linux/vfio.h>
#include "vfio_pci_private.h"
@@ -123,6 +124,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
resource_size_t end;
void __iomem *io;
ssize_t done;
+ bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
+ IOMMU_GROUP_CAP_ISOLATE_MSIX);
if (pci_resource_start(pdev, bar))
end = pci_resource_len(pdev, bar);
@@ -164,7 +167,7 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
} else
io = vdev->barmap[bar];
- if (bar == vdev->msix_bar) {
+ if (bar == vdev->msix_bar && !is_msix_isolated) {
x_start = vdev->msix_offset;
x_end = vdev->msix_offset + vdev->msix_size;
}
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 330d50582f40..5292c4a5ae8f 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -169,6 +169,21 @@ void vfio_iommu_group_put(struct iommu_group *group, struct device *dev)
}
EXPORT_SYMBOL_GPL(vfio_iommu_group_put);
+bool vfio_iommu_group_is_capable(struct device *dev, unsigned long cap)
+{
+ bool ret = false;
+ struct iommu_group *group = vfio_iommu_group_get(dev);
+
+ if (group) {
+ ret = iommu_group_is_capable(group, cap);
+
+ vfio_iommu_group_put(group, dev);
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_group_is_capable);
+
#ifdef CONFIG_VFIO_NOIOMMU
static void *vfio_noiommu_open(unsigned long arg)
{
--
2.11.0
On Mon, Aug 07, 2017 at 05:25:48PM +1000, Alexey Kardashevskiy wrote:
> Some devices have a MSIX BAR not aligned to the system page size
> greater than 4K (like 64k for ppc64) which at the moment prevents
> such MMIO pages from being mapped to the userspace for the sake of
> the MSIX BAR content protection. If such page happens to share
> the same system page with some frequently accessed registers,
> the entire system page will be emulated which can seriously affect
> performance.
>
> This allows mapping of MSI-X tables to userspace if hardware provides
> MSIX isolation via interrupt remapping or filtering; in other words
> allowing direct access to the MSIX BAR won't do any harm to other devices
> or cause spurious interrupts visible to the kernel.
>
> This adds a wrapping helper to check if a capability is supported by
> an IOMMU group.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
> ---
> include/linux/vfio.h | 1 +
> drivers/vfio/pci/vfio_pci.c | 20 +++++++++++++++++---
> drivers/vfio/pci/vfio_pci_rdwr.c | 5 ++++-
> drivers/vfio/vfio.c | 15 +++++++++++++++
> 4 files changed, 37 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 586809abb273..7110bca2fb60 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -46,6 +46,7 @@ struct vfio_device_ops {
>
> extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
> extern void vfio_iommu_group_put(struct iommu_group *group, struct device *dev);
> +extern bool vfio_iommu_group_is_capable(struct device *dev, unsigned long cap);
This diff probably belongs in the earlier patch adding the function,
rather than here where it's first used. Not worth respinning just for
that, though.
> extern int vfio_add_group_dev(struct device *dev,
> const struct vfio_device_ops *ops,
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index d87a0a3cda14..c4c39ed64b1e 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -561,11 +561,17 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
> struct vfio_region_info_cap_sparse_mmap *sparse;
> size_t end, size;
> int nr_areas = 2, i = 0, ret;
> + bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
> + IOMMU_GROUP_CAP_ISOLATE_MSIX);
>
> end = pci_resource_len(vdev->pdev, vdev->msix_bar);
>
> - /* If MSI-X table is aligned to the start or end, only one area */
> - if (((vdev->msix_offset & PAGE_MASK) == 0) ||
> + /*
> + * If MSI-X table is allowed to mmap because of the capability
> + * of IRQ remapping or aligned to the start or end, only one area
> + */
> + if (is_msix_isolated ||
> + ((vdev->msix_offset & PAGE_MASK) == 0) ||
> (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) >= end))
> nr_areas = 1;
>
> @@ -577,6 +583,12 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
>
> sparse->nr_areas = nr_areas;
>
> + if (is_msix_isolated) {
> + sparse->areas[i].offset = 0;
> + sparse->areas[i].size = end;
> + return 0;
> + }
> +
> if (vdev->msix_offset & PAGE_MASK) {
> sparse->areas[i].offset = 0;
> sparse->areas[i].size = vdev->msix_offset & PAGE_MASK;
> @@ -1094,6 +1106,8 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> unsigned int index;
> u64 phys_len, req_len, pgoff, req_start;
> int ret;
> + bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
> + IOMMU_GROUP_CAP_ISOLATE_MSIX);
>
> index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
>
> @@ -1115,7 +1129,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> if (req_start + req_len > phys_len)
> return -EINVAL;
>
> - if (index == vdev->msix_bar) {
> + if (index == vdev->msix_bar && !is_msix_isolated) {
> /*
> * Disallow mmaps overlapping the MSI-X table; users don't
> * get to touch this directly. We could find somewhere
> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
> index 357243d76f10..7514206a5ea7 100644
> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> @@ -18,6 +18,7 @@
> #include <linux/uaccess.h>
> #include <linux/io.h>
> #include <linux/vgaarb.h>
> +#include <linux/vfio.h>
>
> #include "vfio_pci_private.h"
>
> @@ -123,6 +124,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
> resource_size_t end;
> void __iomem *io;
> ssize_t done;
> + bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
> + IOMMU_GROUP_CAP_ISOLATE_MSIX);
>
> if (pci_resource_start(pdev, bar))
> end = pci_resource_len(pdev, bar);
> @@ -164,7 +167,7 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
> } else
> io = vdev->barmap[bar];
>
> - if (bar == vdev->msix_bar) {
> + if (bar == vdev->msix_bar && !is_msix_isolated) {
> x_start = vdev->msix_offset;
> x_end = vdev->msix_offset + vdev->msix_size;
> }
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 330d50582f40..5292c4a5ae8f 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -169,6 +169,21 @@ void vfio_iommu_group_put(struct iommu_group *group, struct device *dev)
> }
> EXPORT_SYMBOL_GPL(vfio_iommu_group_put);
>
> +bool vfio_iommu_group_is_capable(struct device *dev, unsigned long cap)
> +{
> + bool ret = false;
> + struct iommu_group *group = vfio_iommu_group_get(dev);
> +
> + if (group) {
> + ret = iommu_group_is_capable(group, cap);
> +
> + vfio_iommu_group_put(group, dev);
> + }
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommu_group_is_capable);
> +
> #ifdef CONFIG_VFIO_NOIOMMU
> static void *vfio_noiommu_open(unsigned long arg)
> {
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Mon, Aug 07, 2017 at 05:25:44PM +1000, Alexey Kardashevskiy wrote:
> This introduces capabilities to IOMMU groups. The first defined
> capability is IOMMU_GROUP_CAP_ISOLATE_MSIX which tells the IOMMU
> group users that a particular IOMMU group is capable of MSIX message
> filtering; this is useful when deciding whether or not to allow mapping
> of MSIX table to the userspace. Various architectures will enable it
> when they decide that it is safe to do so.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
This seems like a reasonable concept that's probably useful for
something, whether or not it's the best approach for the problem at
hand.
> ---
> include/linux/iommu.h | 20 ++++++++++++++++++++
> drivers/iommu/iommu.c | 28 ++++++++++++++++++++++++++++
> 2 files changed, 48 insertions(+)
>
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 2cb54adc4a33..6b6f3c2f4904 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -155,6 +155,9 @@ struct iommu_resv_region {
> enum iommu_resv_type type;
> };
>
> +/* IOMMU group capabilities */
> +#define IOMMU_GROUP_CAP_ISOLATE_MSIX (1U)
> +
> #ifdef CONFIG_IOMMU_API
>
> /**
> @@ -312,6 +315,11 @@ extern void *iommu_group_get_iommudata(struct iommu_group *group);
> extern void iommu_group_set_iommudata(struct iommu_group *group,
> void *iommu_data,
> void (*release)(void *iommu_data));
> +extern void iommu_group_set_caps(struct iommu_group *group,
> + unsigned long clearcaps,
> + unsigned long setcaps);
> +extern bool iommu_group_is_capable(struct iommu_group *group,
> + unsigned long cap);
> extern int iommu_group_set_name(struct iommu_group *group, const char *name);
> extern int iommu_group_add_device(struct iommu_group *group,
> struct device *dev);
> @@ -513,6 +521,18 @@ static inline void iommu_group_set_iommudata(struct iommu_group *group,
> {
> }
>
> +static inline void iommu_group_set_caps(struct iommu_group *group,
> + unsigned long clearcaps,
> + unsigned long setcaps)
> +{
> +}
> +
> +static inline bool iommu_group_is_capable(struct iommu_group *group,
> + unsigned long cap)
> +{
> + return false;
> +}
> +
> static inline int iommu_group_set_name(struct iommu_group *group,
> const char *name)
> {
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3f6ea160afed..6b2c34fe2c3d 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -52,6 +52,7 @@ struct iommu_group {
> void (*iommu_data_release)(void *iommu_data);
> char *name;
> int id;
> + unsigned long caps;
> struct iommu_domain *default_domain;
> struct iommu_domain *domain;
> };
> @@ -447,6 +448,33 @@ void iommu_group_set_iommudata(struct iommu_group *group, void *iommu_data,
> EXPORT_SYMBOL_GPL(iommu_group_set_iommudata);
>
> /**
> + * iommu_group_set_caps - Change the group capabilities
> + * @group: the group
> + * @clearcaps: capabilities mask to remove
> + * @setcaps: capabilities mask to add
> + *
> + * IOMMU groups can be capable of various features which device drivers
> + * may read and adjust the behavior.
> + */
> +void iommu_group_set_caps(struct iommu_group *group,
> + unsigned long clearcaps, unsigned long setcaps)
> +{
> + group->caps &= ~clearcaps;
> + group->caps |= setcaps;
> +}
> +EXPORT_SYMBOL_GPL(iommu_group_set_caps);
> +
> +/**
> + * iommu_group_is_capable - Returns if a group capability is present
> + * @group: the group
> + */
> +bool iommu_group_is_capable(struct iommu_group *group, unsigned long cap)
> +{
> + return !!(group->caps & cap);
> +}
> +EXPORT_SYMBOL_GPL(iommu_group_is_capable);
> +
> +/**
> * iommu_group_set_name - set name for a group
> * @group: the group
> * @name: name
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
Folks,
Is there anything to change besides those compiler errors and David's
comment in 5/5? Or is the whole patchset just too bad? Thanks.
On 07/08/17 17:25, Alexey Kardashevskiy wrote:
> This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for mmapping MSI-X table"
> http://www.spinics.net/lists/kvm/msg152232.html
>
> This time it is using "caps" in IOMMU groups. The main question is if PCI
> bus flags or IOMMU domains are still better (and which one).
>
>
>
> Here is some background:
>
> Current vfio-pci implementation disallows to mmap the page
> containing MSI-X table in case that users can write directly
> to MSI-X table and generate an incorrect MSIs.
>
> However, this will cause some performance issue when there
> are some critical device registers in the same page as the
> MSI-X table. We have to handle the mmio access to these
> registers in QEMU emulation rather than in guest.
>
> To solve this issue, this series allows to expose MSI-X table
> to userspace when hardware enables the capability of interrupt
> remapping which can ensure that a given PCI device can only
> shoot the MSIs assigned for it. And we introduce a new bus_flags
> PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side
> for different archs.
>
>
> This is based on sha1
> 26c5cebfdb6c "Merge branch 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"
>
> Please comment. Thanks.
>
> Changelog:
>
> v5:
> * redid the whole thing via so-called IOMMU group capabilities
>
> v4:
> * rebased on recent upstream
> * got all 6 patches from v2 (v3 was missing some)
>
>
>
>
> Alexey Kardashevskiy (5):
> iommu: Add capabilities to a group
> iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
> remapping
> iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
> enabled
> powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
> vfio-pci: Allow to expose MSI-X table to userspace when safe
>
> include/linux/iommu.h | 20 ++++++++++++++++++++
> include/linux/vfio.h | 1 +
> arch/powerpc/kernel/iommu.c | 1 +
> drivers/iommu/amd_iommu.c | 3 +++
> drivers/iommu/intel-iommu.c | 3 +++
> drivers/iommu/iommu.c | 35 +++++++++++++++++++++++++++++++++++
> drivers/vfio/pci/vfio_pci.c | 20 +++++++++++++++++---
> drivers/vfio/pci/vfio_pci_rdwr.c | 5 ++++-
> drivers/vfio/vfio.c | 15 +++++++++++++++
> 9 files changed, 99 insertions(+), 4 deletions(-)
>
--
Alexey
On 14/08/17 10:45, Alexey Kardashevskiy wrote:
> Folks,
>
> Is there anything to change besides those compiler errors and David's
> comment in 5/5? Or the while patchset is too bad? Thanks.
While I now understand it's not the low-level thing I first thought it
was (so my reasoning has changed), personally I don't like this approach
any more than the previous one - it still smells of abusing external
APIs to pass information from one part of VFIO to another (and it has
the same conceptual problem of attributing something to interrupt
sources that is actually a property of the interrupt target).
Taking a step back, though, why does vfio-pci perform this check in the
first place? If a malicious guest already has control of a device, any
kind of interrupt spoofing it could do by fiddling with the MSI-X
message address/data it could simply do with a DMA write anyway, so the
security argument doesn't stand up in general (sure, not all PCIe
devices may be capable of arbitrary DMA, but that seems like more of a
tenuous security-by-obscurity angle to me). Besides, with Type1 IOMMU
the fact that we've let a device be assigned at all means that this is
already a non-issue (because either the hardware provides isolation or
the user has explicitly accepted the consequences of an unsafe
configuration) - from patch #4 that's apparently the same for SPAPR TCE,
in which case it seems this flag doesn't even need to be propagated and
could simply be assumed always.
On the other hand, if the check is not so much to mitigate malicious
guests attacking the system as to prevent dumb guests breaking
themselves (e.g. if some or all of the MSI-X capability is actually
emulated), then allowing things to sometimes go wrong on the grounds of
an irrelevant hardware feature doesn't seem correct :/
Robin.
> On 07/08/17 17:25, Alexey Kardashevskiy wrote:
>> This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for mmapping MSI-X table"
>> http://www.spinics.net/lists/kvm/msg152232.html
>>
>> This time it is using "caps" in IOMMU groups. The main question is if PCI
>> bus flags or IOMMU domains are still better (and which one).
>
>>
>>
>>
>> Here is some background:
>>
>> Current vfio-pci implementation disallows to mmap the page
>> containing MSI-X table in case that users can write directly
>> to MSI-X table and generate an incorrect MSIs.
>>
>> However, this will cause some performance issue when there
>> are some critical device registers in the same page as the
>> MSI-X table. We have to handle the mmio access to these
>> registers in QEMU emulation rather than in guest.
>>
>> To solve this issue, this series allows to expose MSI-X table
>> to userspace when hardware enables the capability of interrupt
>> remapping which can ensure that a given PCI device can only
>> shoot the MSIs assigned for it. And we introduce a new bus_flags
>> PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side
>> for different archs.
>>
>>
>> This is based on sha1
>> 26c5cebfdb6c "Merge branch 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"
>>
>> Please comment. Thanks.
>>
>> Changelog:
>>
>> v5:
>> * redid the whole thing via so-called IOMMU group capabilities
>>
>> v4:
>> * rebased on recent upstream
>> * got all 6 patches from v2 (v3 was missing some)
>>
>>
>>
>>
>> Alexey Kardashevskiy (5):
>> iommu: Add capabilities to a group
>> iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
>> remapping
>> iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
>> enabled
>> powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
>> vfio-pci: Allow to expose MSI-X table to userspace when safe
>>
>> include/linux/iommu.h | 20 ++++++++++++++++++++
>> include/linux/vfio.h | 1 +
>> arch/powerpc/kernel/iommu.c | 1 +
>> drivers/iommu/amd_iommu.c | 3 +++
>> drivers/iommu/intel-iommu.c | 3 +++
>> drivers/iommu/iommu.c | 35 +++++++++++++++++++++++++++++++++++
>> drivers/vfio/pci/vfio_pci.c | 20 +++++++++++++++++---
>> drivers/vfio/pci/vfio_pci_rdwr.c | 5 ++++-
>> drivers/vfio/vfio.c | 15 +++++++++++++++
>> 9 files changed, 99 insertions(+), 4 deletions(-)
>>
>
>
On 08/14/2017 09:12 PM, Robin Murphy wrote:
> On 14/08/17 10:45, Alexey Kardashevskiy wrote:
>> Folks,
>>
>> Is there anything to change besides those compiler errors and David's
>> comment in 5/5? Or the while patchset is too bad? Thanks.
>
> While I now understand it's not the low-level thing I first thought it
> was, so my reasoning has changed, personally I don't like this approach
> any more than the previous one - it still smells of abusing external
> APIs to pass information from one part of VFIO to another (and it has
> the same conceptual problem of attributing something to interrupt
> sources that is actually a property of the interrupt target).
>
> Taking a step back, though, why does vfio-pci perform this check in the
> first place? If a malicious guest already has control of a device, any
> kind of interrupt spoofing it could do by fiddling with the MSI-X
> message address/data it could simply do with a DMA write anyway, so the
> security argument doesn't stand up in general (sure, not all PCIe
> devices may be capable of arbitrary DMA, but that seems like more of a
> tenuous security-by-obscurity angle to me).
Hi Robin,
DMA writes will be translated (thereby censored) by DMA Remapping hardware,
while MSI/MSI-X will not. Is this different for non-x86?
--
Thanks,
Jike
> Besides, with Type1 IOMMU
> the fact that we've let a device be assigned at all means that this is
> already a non-issue (because either the hardware provides isolation or
> the user has explicitly accepted the consequences of an unsafe
> configuration) - from patch #4 that's apparently the same for SPAPR TCE,
> in which case it seems this flag doesn't even need to be propagated and
> could simply be assumed always.
>
> On the other hand, if the check is not so much to mitigate malicious
> guests attacking the system as to prevent dumb guests breaking
> themselves (e.g. if some or all of the MSI-X capability is actually
> emulated), then allowing things to sometimes go wrong on the grounds of
> an irrelevant hardware feature doesn't seem correct :/
>
> Robin.
>
>> On 07/08/17 17:25, Alexey Kardashevskiy wrote:
>>> This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for mmapping MSI-X table"
>>> http://www.spinics.net/lists/kvm/msg152232.html
>>>
>>> This time it is using "caps" in IOMMU groups. The main question is if PCI
>>> bus flags or IOMMU domains are still better (and which one).
>>
>>>
>>>
>>>
>>> Here is some background:
>>>
>>> Current vfio-pci implementation disallows to mmap the page
>>> containing MSI-X table in case that users can write directly
>>> to MSI-X table and generate an incorrect MSIs.
>>>
>>> However, this will cause some performance issue when there
>>> are some critical device registers in the same page as the
>>> MSI-X table. We have to handle the mmio access to these
>>> registers in QEMU emulation rather than in guest.
>>>
>>> To solve this issue, this series allows to expose MSI-X table
>>> to userspace when hardware enables the capability of interrupt
>>> remapping which can ensure that a given PCI device can only
>>> shoot the MSIs assigned for it. And we introduce a new bus_flags
>>> PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side
>>> for different archs.
>>>
>>>
>>> This is based on sha1
>>> 26c5cebfdb6c "Merge branch 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"
>>>
>>> Please comment. Thanks.
>>>
>>> Changelog:
>>>
>>> v5:
>>> * redid the whole thing via so-called IOMMU group capabilities
>>>
>>> v4:
>>> * rebased on recent upstream
>>> * got all 6 patches from v2 (v3 was missing some)
>>>
>>>
>>>
>>>
>>> Alexey Kardashevskiy (5):
>>> iommu: Add capabilities to a group
>>> iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
>>> remapping
>>> iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
>>> enabled
>>> powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
>>> vfio-pci: Allow to expose MSI-X table to userspace when safe
>>>
>>> include/linux/iommu.h | 20 ++++++++++++++++++++
>>> include/linux/vfio.h | 1 +
>>> arch/powerpc/kernel/iommu.c | 1 +
>>> drivers/iommu/amd_iommu.c | 3 +++
>>> drivers/iommu/intel-iommu.c | 3 +++
>>> drivers/iommu/iommu.c | 35 +++++++++++++++++++++++++++++++++++
>>> drivers/vfio/pci/vfio_pci.c | 20 +++++++++++++++++---
>>> drivers/vfio/pci/vfio_pci_rdwr.c | 5 ++++-
>>> drivers/vfio/vfio.c | 15 +++++++++++++++
>>> 9 files changed, 99 insertions(+), 4 deletions(-)
>>>
>>
>>
>
On 08/15/2017 09:33 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
>>> Taking a step back, though, why does vfio-pci perform this check in the
>>> first place? If a malicious guest already has control of a device, any
>>> kind of interrupt spoofing it could do by fiddling with the MSI-X
>>> message address/data it could simply do with a DMA write anyway, so the
>>> security argument doesn't stand up in general (sure, not all PCIe
>>> devices may be capable of arbitrary DMA, but that seems like more of a
>>> tenuous security-by-obscurity angle to me).
>
> I tried to make that point for years, thanks for re-iterating it :-)
>
>> Hi Robin,
>>
>> DMA writes will be translated (thereby censored) by DMA Remapping hardware,
>> while MSI/MSI-X will not. Is this different for non-x86?
>
> There is no way your DMA remapping HW can differenciate. The only
> difference between a DMA write and an MSI is ... the address. So if I
> can make my device DMA to the MSI address range, I've defeated your
> security.
I don't think that, with IRQ remapping enabled, you can make your device DMA
to the MSI address without it being treated as an IRQ and remapped. If so,
the IRQ remapping hardware is simply broken :)
--
Thanks,
Jike
On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
> > Taking a step back, though, why does vfio-pci perform this check in the
> > first place? If a malicious guest already has control of a device, any
> > kind of interrupt spoofing it could do by fiddling with the MSI-X
> > message address/data it could simply do with a DMA write anyway, so the
> > security argument doesn't stand up in general (sure, not all PCIe
> > devices may be capable of arbitrary DMA, but that seems like more of a
> > tenuous security-by-obscurity angle to me).
I tried to make that point for years, thanks for re-iterating it :-)
> Hi Robin,
>
> DMA writes will be translated (thereby censored) by DMA Remapping hardware,
> while MSI/MSI-X will not. Is this different for non-x86?
There is no way your DMA remapping HW can differentiate. The only
difference between a DMA write and an MSI is ... the address. So if I
can make my device DMA to the MSI address range, I've defeated your
security.
The table obfuscation in qemu is only useful as an insecure way of
"making things sort-of-work" for HW that doesn't have proper remapping
or filtering.
On pseries we don't have that problem because:
1) Our hypervisor (which is qemu) provides the DMA address for MSIs/X,
so there is no need for "magic remapping" to give the guest a value
that works.
2) Our HW (configured by VFIO/KVM) filters which device can DMA to
what address (including which MSIs/X), thus even if the guest doesn't
use the address passed and messes around, it can only shoot itself in
the foot.
So all we need is a way to tell qemu to stop doing that filtering on
our platform. This is *one bit* of information, yet it's taken 3 years of
arguments and we still don't have a solution. In the meantime,
workloads *are* being hurt by significant performance degradation due
to the MSI-X table sharing a 64K page (our page size) with other MMIOs.
Yay !
Ben.
On Tue, 2017-08-15 at 09:47 +0800, Jike Song wrote:
> On 08/15/2017 09:33 AM, Benjamin Herrenschmidt wrote:
> > On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
> > > > Taking a step back, though, why does vfio-pci perform this check in the
> > > > first place? If a malicious guest already has control of a device, any
> > > > kind of interrupt spoofing it could do by fiddling with the MSI-X
> > > > message address/data it could simply do with a DMA write anyway, so the
> > > > security argument doesn't stand up in general (sure, not all PCIe
> > > > devices may be capable of arbitrary DMA, but that seems like more of a
> > > > tenuous security-by-obscurity angle to me).
> >
> > I tried to make that point for years, thanks for re-iterating it :-)
> >
> > > Hi Robin,
> > >
> > > DMA writes will be translated (thereby censored) by DMA Remapping hardware,
> > > while MSI/MSI-X will not. Is this different for non-x86?
> >
> > There is no way your DMA remapping HW can differenciate. The only
> > difference between a DMA write and an MSI is ... the address. So if I
> > can make my device DMA to the MSI address range, I've defeated your
> > security.
>
> I don't think with IRQ remapping enabled, you can make your device DMA to
> MSI address, without being treated as an IRQ and remapped. If so, the IRQ
> remapping hardware is simply broken :)
You are mixing things here.
Robin's point is that there is no security provided by the obfuscation
of the MSI-X table by qemu because whatever qemu does to "filter" the
MSI-X target addresses can be worked around by making the device DMA
wherever you want.
None of what you say invalidates that basic fact.
Now, as far as your remapping HW goes, either it filters interrupts or
it doesn't. If it does then yes, it can't be spoofed, and thus you
don't need the filtering of the table in qemu.
If it doesn't, then the guest can spoof any interrupt using DMAs and
whatever qemu does to filter the table is not going to fix it.
Thus the point remains that the only value in qemu filtering the table
is to enable already insecure use cases to work, without actually
making them any more secure.
Ben.
On Mon, 2017-08-14 at 14:12 +0100, Robin Murphy wrote:
> On the other hand, if the check is not so much to mitigate malicious
> guests attacking the system as to prevent dumb guests breaking
> themselves (e.g. if some or all of the MSI-X capability is actually
> emulated), then allowing things to sometimes go wrong on the grounds of
> an irrelevant hardware feature doesn't seem correct :/
There is 0 value in trying to prevent the guest kernel from shooting
itself in the foot. There are so many other ways it can do it that I
fail to see the point of even attempting it here.
In addition, this actually harms performance on some devices. There
are cases where the MSI-X table shares a page with other registers
that are used during normal device operation. This is especially
problematic on architectures such as powerpc that use 64K pages.
Those devices thus suffer a massive performance loss, for the sake of
something that never happens in practice (especially on pseries where
the MSI configuration is done by paravirt calls, thus by qemu itself).
Cheers,
Ben.
From: Benjamin Herrenschmidt
> Sent: 15 August 2017 02:34
> On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
> > > Taking a step back, though, why does vfio-pci perform this check in the
> > > first place? If a malicious guest already has control of a device, any
> > > kind of interrupt spoofing it could do by fiddling with the MSI-X
> > > message address/data it could simply do with a DMA write anyway, so the
> > > security argument doesn't stand up in general (sure, not all PCIe
> > > devices may be capable of arbitrary DMA, but that seems like more of a
> > > tenuous security-by-obscurity angle to me).
>
> I tried to make that point for years, thanks for re-iterating it :-)
Indeed, we have an FPGA based PCIe card where the MSI-X table is just a
piece of PCIe accessible memory.
The device driver has to read the MSI-X table and write the address+data
values to other registers which are then used to raise the interrupt.
(Ok, I've written a better interrupt generator so we don't do that
any more.)
David
On Mon, 14 Aug 2017 14:12:33 +0100
Robin Murphy <[email protected]> wrote:
> On 14/08/17 10:45, Alexey Kardashevskiy wrote:
> > Folks,
> >
> > Is there anything to change besides those compiler errors and David's
> > comment in 5/5? Or the while patchset is too bad? Thanks.
>
> While I now understand it's not the low-level thing I first thought it
> was, so my reasoning has changed, personally I don't like this approach
> any more than the previous one - it still smells of abusing external
> APIs to pass information from one part of VFIO to another (and it has
> the same conceptual problem of attributing something to interrupt
> sources that is actually a property of the interrupt target).
>
> Taking a step back, though, why does vfio-pci perform this check in the
> first place? If a malicious guest already has control of a device, any
> kind of interrupt spoofing it could do by fiddling with the MSI-X
> message address/data it could simply do with a DMA write anyway, so the
> security argument doesn't stand up in general (sure, not all PCIe
> devices may be capable of arbitrary DMA, but that seems like more of a
> tenuous security-by-obscurity angle to me). Besides, with Type1 IOMMU
> the fact that we've let a device be assigned at all means that this is
> already a non-issue (because either the hardware provides isolation or
> the user has explicitly accepted the consequences of an unsafe
> configuration) - from patch #4 that's apparently the same for SPAPR TCE,
> in which case it seems this flag doesn't even need to be propagated and
> could simply be assumed always.
>
> On the other hand, if the check is not so much to mitigate malicious
> guests attacking the system as to prevent dumb guests breaking
> themselves (e.g. if some or all of the MSI-X capability is actually
> emulated), then allowing things to sometimes go wrong on the grounds of
> an irrelevant hardware feature doesn't seem correct :/
While the theoretical security provided by preventing direct access to
the MSI-X vector table may be mostly a matter of obfuscation, in
practice, I think it changes the problem of creating arbitrary DMA
writes from a generic, trivial, PCI spec based exercise to a more device
specific challenge. I do however have evidence that there are
consumers of the vfio API who would have attempted to program device
interrupts by directly manipulating the vector table had they not been
prevented from doing so, and who then contacted me to learn about the SET_IRQ
ioctl. Therefore I think the behavior also contributes to making the
overall API more difficult to use incorrectly.
Of course I don't think either of those are worth imposing a
performance penalty where we don't otherwise need one. However, if we
look at a VM scenario where the guest is following the PCI standard for
programming MSI-X interrupts (ie. not POWER), we need some mechanism to
intercept those MMIO writes to the vector table and configure the host
interrupt domain of the device rather than allowing the guest direct
access. This is simply part of virtualizing the device to the guest.
So even if the kernel allows mmap'ing the vector table, the hypervisor
needs to trap it, so the mmap isn't required or used anyway. It's only
when you define a non-PCI standard for your guest to program
interrupts, as POWER has done, and can therefore trust that the
hypervisor does not need to trap on the vector table that having that
mmap'able vector table becomes fully useful. AIUI, ARM supports 64k
pages too... does ARM have any strategy that would actually make it
possible to make use of an mmap covering the vector table? Thanks,
Alex
On Tue, 2017-08-15 at 10:37 -0600, Alex Williamson wrote:
> Of course I don't think either of those are worth imposing a
> performance penalty where we don't otherwise need one. However, if we
> look at a VM scenario where the guest is following the PCI standard for
> programming MSI-X interrupts (ie. not POWER), we need some mechanism to
> intercept those MMIO writes to the vector table and configure the host
> interrupt domain of the device rather than allowing the guest direct
> access. This is simply part of virtualizing the device to the guest.
> So even if the kernel allows mmap'ing the vector table, the hypervisor
> needs to trap it, so the mmap isn't required or used anyway. It's only
> when you define a non-PCI standard for your guest to program
> interrupts, as POWER has done, and can therefore trust that the
> hypervisor does not need to trap on the vector table that having that
> mmap'able vector table becomes fully useful. AIUI, ARM supports 64k
> pages too... does ARM have any strategy that would actually make it
> possible to make use of an mmap covering the vector table? Thanks,
WTF ???? Alex, can you stop once and for all with all that "POWER is
not standard" bullshit please ? It's completely wrong.
This has nothing to do with PCIe standard !
The PCIe standard says strictly *nothing* whatsoever about how an OS
obtains the magic address/values to put in the device and how the PCIe
host bridge may do appropriate filtering.
There is nothing on POWER that prevents the guest from writing the MSI-
X address/data by hand. The problem isn't who writes the values or even
how. The problem breaks down into these two things that are NOT covered
by any aspect of the PCIe standard:
1- The OS needs to obtain address/data values for an MSI that will
"work" for the device.
2- The HW+HV needs to prevent collateral damage caused by a device
issuing stores to incorrect address or with incorrect data. Now *this*
is necessary for *ANY* kind of DMA whether it's an MSI or something
else anyway.
Now, the filtering done by qemu is NOT a reasonable way to handle 2)
and whatever excuse about "making it harder" doesn't fly a meter when
it comes to security. Making it "harder to break accidentally" I also
don't buy, people don't just randomly put things in their MSI-X tables
"accidentally", that stuff works or doesn't.
That leaves us with 1). Now this is purely a platform-specific matter,
not a spec matter. Once the HW has a way to enforce that you can only
generate "allowed" MSIs, it becomes a matter of having some FW mechanism
that can be used to inform the OS what address/values to use for a
given interrupt.
This is provided on POWER by a combination of device-tree and RTAS. It
could be that x86/ARM64 doesn't provide good enough mechanisms via ACPI
but this is no way a problem of standard compliance, just inferior
firmware interfaces.
So again, for the 234789246th time in years, can we get that 1-bit-of-
information sorted one way or another so we can fix our massive
performance issue instead of adding yet another dozen layers of paint
on that shed ?
Ben.
On Wed, 16 Aug 2017 10:35:49 +1000
Benjamin Herrenschmidt <[email protected]> wrote:
> On Tue, 2017-08-15 at 10:37 -0600, Alex Williamson wrote:
> > Of course I don't think either of those are worth imposing a
> > performance penalty where we don't otherwise need one. However, if we
> > look at a VM scenario where the guest is following the PCI standard for
> > programming MSI-X interrupts (ie. not POWER), we need some mechanism to
> > intercept those MMIO writes to the vector table and configure the host
> > interrupt domain of the device rather than allowing the guest direct
> > access. This is simply part of virtualizing the device to the guest.
> > So even if the kernel allows mmap'ing the vector table, the hypervisor
> > needs to trap it, so the mmap isn't required or used anyway. It's only
> > when you define a non-PCI standard for your guest to program
> > interrupts, as POWER has done, and can therefore trust that the
> > hypervisor does not need to trap on the vector table that having that
> > mmap'able vector table becomes fully useful. AIUI, ARM supports 64k
> > pages too... does ARM have any strategy that would actually make it
> > possible to make use of an mmap covering the vector table? Thanks,
>
> WTF ???? Alex, can you stop once and for all with all that "POWER is
> not standard" bullshit please ? It's completely wrong.
As you've stated, the MSI-X vector table on POWER is currently updated
via a hypercall. POWER is overall PCI compliant (I assume), but the
guest does not directly modify the vector table in MMIO space of the
device. This is important...
> This has nothing to do with PCIe standard !
Yes, it actually does, because if the guest relies on the vector table
to be virtualized then it doesn't particularly matter whether the
vfio-pci kernel driver allows that portion of device MMIO space to be
directly accessed or mapped because QEMU needs for it to be trapped in
order to provide that virtualization.
I'm not knocking POWER, it's a smart thing for virtualization to have
defined this hypercall which negates the need for vector table
virtualization and allows efficient mapping of the device. On other
platforms, it's not necessarily practical given the broad base of legacy
guests supported where we'd never get agreement to implement this as
part of the platform spec... if there even was such a thing. Maybe we
could provide the hypercall and dynamically enable direct vector table
mapping (disabling vector table virtualization) only if the hypercall
is used.
> The PCIe standard says strictly *nothing* whatsoever about how an OS
> obtains the magic address/values to put in the device and how the PCIe
> host bridge may do appropriate fitering.
And now we've jumped the tracks... The only way the platform specific
address/data values become important is if we allow direct access to
the vector table AND now we're formulating how the user/guest might
write to it directly. Otherwise the virtualization of the vector
table, or paravirtualization via hypercall provides the translation
where the host and guest address/data pairs can operate in completely
different address spaces.
> There is nothing on POWER that prevents the guest from writing the MSI-
> X address/data by hand. The problem isn't who writes the values or even
> how. The problem breaks down into these two things that are NOT covered
> by any aspect of the PCIe standard:
You've moved on to a different problem, I think everyone aside from
POWER is still back at the problem where who writes the vector table
values is a forefront problem.
> 1- The OS needs to obtain address/data values for an MSI that will
> "work" for the device.
>
> 2- The HW+HV needs to prevent collateral damage caused by a device
> issuing stores to incorrect address or with incorrect data. Now *this*
> is necessary for *ANY* kind of DMA whether it's an MSI or something
> else anyway.
>
> Now, the filtering done by qemu is NOT a reasonable way to handle 2)
> and whatever excluse about "making it harder" doesn't fly a meter when
> it comes to security. Making it "harder to break accidentally" I also
> don't buy, people don't just randomly put things in their MSI-X tables
> "accidentally", that stuff works or doesn't.
As I said before, I'm not willing to preserve the weak attributes that
blocking direct vector table access provides over pursuing a more
performant interface, but I also don't think their value is absolute
zero either.
> That leaves us with 1). Now this is purely a platform specific matters,
> not a spec matter. Once the HW has a way to enforce you can only
> generate "allowed" MSIs it becomes a matter of having some FW mechanism
> that can be used to informed the OS what address/values to use for a
> given interrupts.
>
> This is provided on POWER by a combination of device-tree and RTAS. It
> could be that x86/ARM64 doesn't provide good enough mechanisms via ACPI
> but this is no way a problem of standard compliance, just inferior
> firmware interfaces.
Firmware pissing match... Processors running with 8k or less page size
fall within the recommendations of the PCI spec for register alignment
of MMIO regions of the device and this whole problem becomes less of an
issue.
> So again, for the 234789246th time in years, can we get that 1-bit-of-
> information sorted one way or another so we can fix our massive
> performance issue instead of adding yet another dozen layers of paint
> on that shed ?
TBH, I'm not even sure which bikeshed we're looking at with this latest
distraction of interfaces through which the user/guest could discover
viable address/data values to write the vector table directly. Thanks,
Alex
On Wed, 2017-08-16 at 10:56 -0600, Alex Williamson wrote:
>
> > WTF ???? Alex, can you stop once and for all with all that "POWER is
> > not standard" bullshit please ? It's completely wrong.
>
> As you've stated, the MSI-X vector table on POWER is currently updated
> via a hypercall. POWER is overall PCI compliant (I assume), but the
> guest does not directly modify the vector table in MMIO space of the
> device. This is important...
Well no. On qemu the guest doesn't always (but it can save/restore it),
but on PowerVM this is done by the FW running inside the partition
itself. And that firmware just does normal stores to the device table.
IE. The problem here isn't so much who does the actual stores to the
device table but where they get the address and data values from, which
isn't covered by the spec.
The added fact that qemu hijacks the stores not just to "remap" them
but also to do the whole requesting of the interrupt etc. in the host
system is a qemu design choice which also has no relation to the
spec (and arguably isn't a great choice for our systems).
For example, on PowerVM, the HV assigns a pile of MSIs to the guest to
assign to its devices. The FW inside the guest does a default
assignment but that can be changed.
Thus the interrupts are effectively "hooked up" at the HV level at the
point where the PCI bridge is mapped into the guest.
> > This has nothing to do with PCIe standard !
>
> Yes, it actually does, because if the guest relies on the vector table
> to be virtualized then it doesn't particularly matter whether the
> vfio-pci kernel driver allows that portion of device MMIO space to be
> directly accessed or mapped because QEMU needs for it to be trapped in
> order to provide that virtualization.
And this has nothing to do with the PCIe standard... this has
everything to do with a combination of qemu design choices and
deficient FW interfaces on x86 platforms.
> I'm not knocking POWER; it's a smart thing for virtualization to have
> defined this hypercall which negates the need for vector table
> virtualization and allows efficient mapping of the device. On other
> platforms, it's not necessarily practical given the broad base of legacy
> part of the platform spec... if there even was such a thing. Maybe we
> could provide the hypercall and dynamically enable direct vector table
> mapping (disabling vector table virtualization) only if the hypercall
> is used.
No, I think a better approach would be to provide the guest with a pile
of MSIs to use with devices and have FW (such as ACPI) tell the guest
about them.
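
Purely as an illustration of the single piece of information such a
firmware interface would need to hand out per device; no such table exists
today, and every name below is invented:

#include <stdint.h>

/* Hypothetical description of the MSIs a platform allots to one device:
 * a doorbell address plus a range of data values the device is allowed
 * to use, as the firmware (device-tree, RTAS, ACPI, ...) would report it. */
struct fw_msi_allotment {
	uint64_t msi_addr;	/* doorbell address the device may target */
	uint32_t msi_data_base;	/* first data value assigned to the device */
	uint32_t msi_count;	/* number of consecutive vectors allowed */
};
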
> > The PCIe standard says strictly *nothing* whatsoever about how an OS
> > obtains the magic address/values to put in the device and how the PCIe
> > host bridge may do appropriate filtering.
>
> And now we've jumped the tracks... The only way the platform specific
> address/data values become important is if we allow direct access to
> the vector table AND now we're formulating how the user/guest might
> write to it directly. Otherwise the virtualization of the vector
> table, or paravirtualization via hypercall provides the translation
> where the host and guest address/data pairs can operate in completely
> different address spaces.
They can regardless if things are done properly :-)
> > There is nothing on POWER that prevents the guest from writing the MSI-
> > X address/data by hand. The problem isn't who writes the values or even
> > how. The problem breaks down into these two things that are NOT covered
> > by any aspect of the PCIe standard:
>
> You've moved on to a different problem, I think everyone aside from
> POWER is still back at the problem where who writes the vector table
> values is a forefront problem.
>
> > 1- The OS needs to obtain address/data values for an MSI that will
> > "work" for the device.
> >
> > 2- The HW+HV needs to prevent collateral damage caused by a device
> > issuing stores to incorrect address or with incorrect data. Now *this*
> > is necessary for *ANY* kind of DMA whether it's an MSI or something
> > else anyway.
> >
> ...
From: Alex Williamson
> Sent: 16 August 2017 17:56
...
> Firmware pissing match... Processors running with 8k or less page size
> fall within the recommendations of the PCI spec for register alignment
> of MMIO regions of the device and this whole problem becomes less of an
> issue.
Actually if qemu is causing the MSI-X table accesses to fault, why doesn't
it just lie to the guest about the physical address of the MSI-X table?
Then mmio access to anything in the same physical page will just work.
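
For illustration, a hypothetical sketch of what "lying" about the table
location could look like on the config space side: the Table Offset/BIR
dword of the MSI-X capability (at cap + 4 per the PCI spec) is virtualized
so the guest sees the table at a relocated offset while the BIR is kept.
The relocated offset and the hook itself are invented for the example, and
a real implementation still has to find (or grow) space for the emulated
copy of the table:

#include <stdint.h>

#define MSIX_TABLE_REG		0x04	/* Table Offset/BIR dword, at cap + 4 */
#define MSIX_TABLE_BIR_MASK	0x7	/* low 3 bits select the BAR */

/* Where the emulated copy of the table is claimed to live within the same
 * BAR (hypothetical value for the example). */
#define VIRT_MSIX_TABLE_OFFSET	0x10000

/* Called on guest config space reads; "real_value" is what the device
 * actually reports for the dword at "reg". */
static uint32_t msix_cap_read(uint32_t cap_offset, uint32_t reg,
			      uint32_t real_value)
{
	if (reg == cap_offset + MSIX_TABLE_REG) {
		uint32_t bir = real_value & MSIX_TABLE_BIR_MASK;

		/* Keep the BIR, report a different offset. */
		return VIRT_MSIX_TABLE_OFFSET | bir;
	}
	return real_value;
}
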
It has already been pointed out that you can't actually police the
interrupts that are raised without host hardware support.
Actually, putting other vectors in the MSI-X table is boring; most
drivers will ignore unexpected interrupts.
Much more interesting are physical memory addresses and accessible IO
addresses.
Of course, a lot of boards have PCI master capability and can probably
be persuaded to do writes to a specific location anyway.
David
On Thu, 17 Aug 2017 10:56:35 +0000
David Laight <[email protected]> wrote:
> From: Alex Williamson
> > Sent: 16 August 2017 17:56
> ...
> > Firmware pissing match... Processors running with 8k or less page size
> > fall within the recommendations of the PCI spec for register alignment
> > of MMIO regions of the device and this whole problem becomes less of an
> > issue.
>
> Actually if qemu is causing the MSI-X table accesses to fault, why doesn't
> it just lie to the guest about the physical address of the MSI-X table?
> Then mmio access to anything in the same physical page will just work.
That's an interesting idea, but now you need to add a BAR for the
virtualized vector table, and you'll also need to support extending an
existing BAR because there won't necessarily be a free BAR to add. Of
course PCI requires natural alignment of BARs, so an extra few bytes
on the end doubles the BAR size. We'd also have to hope that if we need to
extend a BAR there's a relatively small one available. In either
case you're changing the layout of the device from what the driver might
expect. We try pretty hard with device assignment to leave things in
the same place as they appear on bare metal, perhaps removing things,
but not actually moving things. It might work in the majority of
cases, but it seems a bit precarious overall. Thanks,
Alex
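
As a worked example of the natural-alignment point above (the sizes are
illustrative, not taken from any particular device): BAR sizes are powers
of two, so appending even a 4KB virtual vector table to a 16KB BAR rounds
the BAR up to 32KB, i.e. it doubles:

#include <stdint.h>
#include <stdio.h>

/* Round a BAR that grew by "vtable_size" bytes up to the next power of
 * two, as PCI BAR sizing requires. */
static uint64_t bar_size_with_vtable(uint64_t bar_size, uint64_t vtable_size)
{
	uint64_t needed = bar_size + vtable_size;
	uint64_t rounded = 1;

	while (rounded < needed)
		rounded <<= 1;
	return rounded;
}

int main(void)
{
	/* 16KB BAR + 4KB table -> 32KB BAR, i.e. the BAR doubles. */
	printf("%llu\n", (unsigned long long)
	       bar_size_with_vtable(16 << 10, 4 << 10));
	return 0;
}
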
Folks,
Ok, people did talk, exchanged ideas, lovely :) What happens now? Do I
repost this or go back to PCI bus flags or something else? Thanks.
On 14/08/17 19:45, Alexey Kardashevskiy wrote:
> Folks,
>
> Is there anything to change besides those compiler errors and David's
> comment in 5/5? Or is the whole patchset too bad? Thanks.
>
>
>
> On 07/08/17 17:25, Alexey Kardashevskiy wrote:
>> ...
--
Alexey
On 21/08/17 12:47, Alexey Kardashevskiy wrote:
> Folks,
>
> Ok, people did talk, exchanged ideas, lovely :) What happens now? Do I
> repost this or go back to PCI bus flags or something else? Thanks.
Anyone, any help? How do we proceed with this? Thanks.
>
>
>
> On 14/08/17 19:45, Alexey Kardashevskiy wrote:
>> Folks,
>>
>> Is there anything to change besides those compiler errors and David's
>> comment in 5/5? Or is the whole patchset too bad? Thanks.
>>
>>
>>
>> On 07/08/17 17:25, Alexey Kardashevskiy wrote:
>>> ...
--
Alexey
On 29/08/17 12:58, Alexey Kardashevskiy wrote:
> On 21/08/17 12:47, Alexey Kardashevskiy wrote:
>> Folks,
>>
>> Ok, people did talk, exchanged ideas, lovely :) What happens now? Do I
>> repost this or go back to PCI bus flags or something else? Thanks.
>
>
> Anyone, any help? How do we proceed with this? Thanks.
Ping, anyone?
Or is everybody just waiting for the kvm forum to discuss it in person? Thanks.
>
>
>
>>
>>
>>
>> On 14/08/17 19:45, Alexey Kardashevskiy wrote:
>>> Folks,
>>>
>>> Is there anything to change besides those compiler errors and David's
>>> comment in 5/5? Or is the whole patchset too bad? Thanks.
>>>
>>>
>>>
>>> On 07/08/17 17:25, Alexey Kardashevskiy wrote:
>>>> ...
--
Alexey