2012-05-22 05:05:03

by Alex Williamson

Subject: [PATCH v2 00/13] IOMMU Groups + VFIO

Version 2 incorporates acks and feedback from v1. The PCI DMA quirk
and ACS check are reworked, sysfs iommu groups ABI documentation is
added, and numerous other fixes are included, among them patches from
Alexey Kardashevskiy towards supporting POWER usage of VFIO and IOMMU
groups.

This series can be found here on top of 3.4:

git://github.com/awilliam/linux-vfio.git iommu-group-vfio-20120521

The Qemu tree has also been updated to Qemu 1.1 and can be found here:

git://github.com/awilliam/qemu-vfio.git iommu-group-vfio

I'd really like to make a push to get this in for 3.5, so let's talk
about how to do that across iommu, pci, and the new driver. Joerg, are
you sufficiently happy with the IOMMU group concept and code? We'll
also need David Woodhouse's buy-in on the intel-iommu changes in patches
3 & 6. Who needs to approve VFIO as a new driver, GregKH? Bjorn,
I'd be happy to send the PCI changes as a series for you, but I
wonder if it makes sense to collect acks for them if you approve and
bundle them in with the associated code that needs them so you're
not left with unused code. Let me know which you prefer. If there
are better ways to do it, please let me know. Thanks,

Alex

---

Alex Williamson (13):
vfio: Add PCI device driver
pci: Misc pci_reg additions
pci: Create common pcibios_err_to_errno
pci: export pci_user functions for use by other drivers
vfio: x86 IOMMU implementation
vfio: Add documentation
vfio: VFIO core
iommu: Make use of DMA quirking and ACS enabled check for groups
pci: Add ACS validation utility
pci: Add PCI DMA source ID quirk
iommu: IOMMU groups for VT-d and AMD-Vi
iommu: IOMMU Groups
driver core: Add iommu_group tracking to struct device


.../ABI/testing/sysfs-kernel-iommu_groups | 14
Documentation/ioctl/ioctl-number.txt | 1
Documentation/vfio.txt | 315 ++++
MAINTAINERS | 8
drivers/Kconfig | 2
drivers/Makefile | 1
drivers/iommu/amd_iommu.c | 67 +
drivers/iommu/intel-iommu.c | 87 +
drivers/iommu/iommu.c | 578 +++++++-
drivers/pci/access.c | 6
drivers/pci/pci.c | 76 +
drivers/pci/pci.h | 7
drivers/pci/quirks.c | 69 +
drivers/vfio/Kconfig | 16
drivers/vfio/Makefile | 3
drivers/vfio/pci/Kconfig | 8
drivers/vfio/pci/Makefile | 4
drivers/vfio/pci/vfio_pci.c | 557 +++++++
drivers/vfio/pci/vfio_pci_config.c | 1522 ++++++++++++++++++++
drivers/vfio/pci/vfio_pci_intrs.c | 724 ++++++++++
drivers/vfio/pci/vfio_pci_private.h | 91 +
drivers/vfio/pci/vfio_pci_rdwr.c | 269 ++++
drivers/vfio/vfio.c | 1413 +++++++++++++++++++
drivers/vfio/vfio_iommu_x86.c | 743 ++++++++++
drivers/xen/xen-pciback/conf_space.c | 6
include/linux/device.h | 2
include/linux/iommu.h | 104 +
include/linux/pci.h | 49 +
include/linux/pci_regs.h | 112 +
include/linux/vfio.h | 444 ++++++
30 files changed, 7182 insertions(+), 116 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-iommu_groups
create mode 100644 Documentation/vfio.txt
create mode 100644 drivers/vfio/Kconfig
create mode 100644 drivers/vfio/Makefile
create mode 100644 drivers/vfio/pci/Kconfig
create mode 100644 drivers/vfio/pci/Makefile
create mode 100644 drivers/vfio/pci/vfio_pci.c
create mode 100644 drivers/vfio/pci/vfio_pci_config.c
create mode 100644 drivers/vfio/pci/vfio_pci_intrs.c
create mode 100644 drivers/vfio/pci/vfio_pci_private.h
create mode 100644 drivers/vfio/pci/vfio_pci_rdwr.c
create mode 100644 drivers/vfio/vfio.c
create mode 100644 drivers/vfio/vfio_iommu_x86.c
create mode 100644 include/linux/vfio.h


2012-05-22 05:05:35

by Alex Williamson

Subject: [PATCH v2 04/13] pci: Add PCI DMA source ID quirk

DMA transactions are tagged with the source ID of the device making
the request. Occasionally hardware screws this up and uses the
source ID of a different device (often the wrong function number of
a multifunction device). A specific Ricoh multifunction device is
a prime example of this problem and is included in this patch. Given
a pci_dev, pci_dma_source() returns the pci_dev to use as the source
ID for DMA. When hardware works correctly, this returns the input
device. For the components of the Ricoh multifunction device, it
returns the pci_dev for function 0.

This will be used by IOMMU drivers for determining the boundaries
of IOMMU groups as multiple devices using the same source ID must
be contained within the same group. This can also be used by
existing streaming DMA paths for the same purpose.
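
As a rough illustration of the lookup described above, here is a
userspace sketch of the table-driven match. It is not the kernel code:
struct mock_pci_dev, the mock_* names and the func0 back-pointer are
hypothetical stand-ins for struct pci_dev and pci_get_slot(), and only
one of the Ricoh device IDs from the patch is listed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct pci_dev; not the kernel type. */
struct mock_pci_dev {
	uint16_t vendor;
	uint16_t device;
	struct mock_pci_dev *func0;	/* function 0 of the same slot */
};

#define MOCK_ANY_ID		0xffffu
#define MOCK_VENDOR_RICOH	0x1180u

/* Quirk handler: DMA from this device is tagged with function 0's source ID. */
static struct mock_pci_dev *mock_func_0_dma_source(struct mock_pci_dev *dev)
{
	assert(dev->func0 != NULL);
	return dev->func0;
}

static const struct mock_dma_quirk {
	uint16_t vendor;
	uint16_t device;
	struct mock_pci_dev *(*dma_source)(struct mock_pci_dev *dev);
} mock_dma_quirks[] = {
	{ MOCK_VENDOR_RICOH, 0xe822, mock_func_0_dma_source },
	{ 0, 0, NULL }	/* a NULL handler terminates the table */
};

/* Return the device whose source ID tags DMA issued by dev. */
static struct mock_pci_dev *mock_dma_source(struct mock_pci_dev *dev)
{
	const struct mock_dma_quirk *q;

	for (q = mock_dma_quirks; q->dma_source; q++) {
		if ((q->vendor == dev->vendor || q->vendor == MOCK_ANY_ID) &&
		    (q->device == dev->device || q->device == MOCK_ANY_ID))
			return q->dma_source(dev);
	}

	return dev;	/* well-behaved hardware: the device itself */
}
```

A device with no matching table entry comes back unchanged, so callers
can use the result unconditionally when deciding group membership.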

Signed-off-by: Alex Williamson <[email protected]>
---

drivers/pci/quirks.c | 40 ++++++++++++++++++++++++++++++++++++++++
include/linux/pci.h | 5 +++++
2 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 4bf7102..a2dd77f 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3109,3 +3109,43 @@ int pci_dev_specific_reset(struct pci_dev *dev, int probe)

return -ENOTTY;
}
+
+static struct pci_dev *pci_func_0_dma_source(struct pci_dev *dev)
+{
+ return pci_get_slot(dev->bus, PCI_DEVFN(PCI_SLOT(dev->devfn), 0));
+}
+
+static const struct pci_dev_dma_source {
+ u16 vendor;
+ u16 device;
+ struct pci_dev *(*dma_source)(struct pci_dev *dev);
+} pci_dev_dma_source[] = {
+ /*
+ * https://bugzilla.redhat.com/show_bug.cgi?id=605888
+ *
+ * Some Ricoh devices use the function 0 source ID for DMA on
+ * other functions of a multifunction device. The DMA device
+ * is therefore function 0, which has implications for the
+ * iommu grouping of these devices.
+ */
+ { PCI_VENDOR_ID_RICOH, 0xe822, pci_func_0_dma_source },
+ { PCI_VENDOR_ID_RICOH, 0xe230, pci_func_0_dma_source },
+ { PCI_VENDOR_ID_RICOH, 0xe832, pci_func_0_dma_source },
+ { PCI_VENDOR_ID_RICOH, 0xe832, pci_func_0_dma_source },
+ { 0 }
+};
+
+struct pci_dev *pci_dma_source(struct pci_dev *dev)
+{
+ const struct pci_dev_dma_source *i;
+
+ for (i = pci_dev_dma_source; i->dma_source; i++) {
+ if ((i->vendor == dev->vendor ||
+ i->vendor == (u16)PCI_ANY_ID) &&
+ (i->device == dev->device ||
+ i->device == (u16)PCI_ANY_ID))
+ return i->dma_source(dev);
+ }
+
+ return dev;
+}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index e444f5b..02dbfed 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1479,9 +1479,14 @@ enum pci_fixup_pass {

#ifdef CONFIG_PCI_QUIRKS
void pci_fixup_device(enum pci_fixup_pass pass, struct pci_dev *dev);
+struct pci_dev *pci_dma_source(struct pci_dev *dev);
#else
static inline void pci_fixup_device(enum pci_fixup_pass pass,
struct pci_dev *dev) {}
+static inline struct pci_dev *pci_dma_source(struct pci_dev *dev)
+{
+ return dev;
+}
#endif

void __iomem *pcim_iomap(struct pci_dev *pdev, int bar, unsigned long maxlen);

2012-05-22 05:05:33

by Alex Williamson

Subject: [PATCH v2 03/13] iommu: IOMMU groups for VT-d and AMD-Vi

Add back group support for AMD & Intel. amd_iommu already tracks
devices and has init and uninit routines to manage groups.
intel-iommu does this on the fly, so we make use of the notifier
support built into iommu groups to create and remove groups.
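
Both drivers follow the same pattern: find the device that acts as the
DMA source, reuse its group if it has one, otherwise allocate a new
group, then add the device. A minimal userspace sketch of that pattern
follows. The mock_* types and the fixed-size pool are hypothetical, and
reference counting is left out. The sketch also assumes the DMA source
device is added before the devices that alias to it.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_GROUPS 8

/* Hypothetical stand-in for struct iommu_group. */
struct mock_group {
	int id;
	int ndevs;
};

/* Hypothetical stand-in for a device tracked by the IOMMU driver. */
struct mock_dev {
	struct mock_group *group;	/* NULL until the device is added */
	struct mock_dev *dma_source;	/* NULL if the device is its own source */
};

static struct mock_group mock_groups[MOCK_MAX_GROUPS];
static int mock_ngroups;

/* Hand out the next unused group from the pool, or NULL when exhausted. */
static struct mock_group *mock_group_alloc(void)
{
	struct mock_group *g;

	if (mock_ngroups >= MOCK_MAX_GROUPS)
		return NULL;

	g = &mock_groups[mock_ngroups];
	g->id = mock_ngroups++;
	g->ndevs = 0;
	return g;
}

/* Returns 0 on success, -1 if no group could be allocated. */
static int mock_add_device(struct mock_dev *dev)
{
	struct mock_dev *dma = dev->dma_source ? dev->dma_source : dev;
	struct mock_group *group = dma->group;

	assert(dev->group == NULL);

	if (!group) {
		/* The DMA source has no group yet: start a new one. */
		group = mock_group_alloc();
		if (!group)
			return -1;
	}

	dev->group = group;
	group->ndevs++;
	return 0;
}

/* Drop the device from its group, if it is in one. */
static void mock_remove_device(struct mock_dev *dev)
{
	if (dev->group) {
		dev->group->ndevs--;
		dev->group = NULL;
	}
}
```

Two functions that share a DMA source end up in one group, while an
unrelated device gets a group of its own.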

Signed-off-by: Alex Williamson <[email protected]>
---

drivers/iommu/amd_iommu.c | 28 +++++++++++++++++++++++++-
drivers/iommu/intel-iommu.c | 46 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 73 insertions(+), 1 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 32c00cd..b7e5ddf 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -256,9 +256,11 @@ static bool check_device(struct device *dev)

static int iommu_init_device(struct device *dev)
{
- struct pci_dev *pdev = to_pci_dev(dev);
+ struct pci_dev *dma_pdev, *pdev = to_pci_dev(dev);
struct iommu_dev_data *dev_data;
+ struct iommu_group *group;
u16 alias;
+ int ret;

if (dev->archdata.iommu)
return 0;
@@ -279,8 +281,30 @@ static int iommu_init_device(struct device *dev)
return -ENOTSUPP;
}
dev_data->alias_data = alias_data;
+
+ dma_pdev = pci_get_bus_and_slot(alias >> 8, alias & 0xff);
+ } else
+ dma_pdev = pdev;
+
+ if (!pdev->is_virtfn && PCI_FUNC(pdev->devfn) && iommu_group_mf &&
+ pdev->hdr_type == PCI_HEADER_TYPE_NORMAL)
+ dma_pdev = pci_get_slot(pdev->bus,
+ PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
+
+ group = iommu_group_get(&dma_pdev->dev);
+ if (!group) {
+ group = iommu_group_alloc();
+ if (IS_ERR(group))
+ return PTR_ERR(group);
}

+ ret = iommu_group_add_device(group, dev);
+
+ iommu_group_put(group);
+
+ if (ret)
+ return ret;
+
if (pci_iommuv2_capable(pdev)) {
struct amd_iommu *iommu;

@@ -309,6 +333,8 @@ static void iommu_ignore_device(struct device *dev)

static void iommu_uninit_device(struct device *dev)
{
+ iommu_group_remove_device(dev);
+
/*
* Nothing to do here - we keep dev_data around for unplugged devices
* and reuse it when the device is re-plugged - not doing so would
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index d4a0ff7..e63b33b 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4087,6 +4087,50 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
return 0;
}

+static int intel_iommu_add_device(struct device *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ struct pci_dev *bridge, *dma_pdev = pdev;
+ struct iommu_group *group;
+ int ret;
+
+ if (!device_to_iommu(pci_domain_nr(pdev->bus),
+ pdev->bus->number, pdev->devfn))
+ return -ENODEV;
+
+ bridge = pci_find_upstream_pcie_bridge(pdev);
+ if (bridge) {
+ if (pci_is_pcie(bridge))
+ dma_pdev = pci_get_domain_bus_and_slot(
+ pci_domain_nr(pdev->bus),
+ bridge->subordinate->number, 0);
+ else
+ dma_pdev = bridge;
+ }
+
+ if (!pdev->is_virtfn && PCI_FUNC(pdev->devfn) && iommu_group_mf &&
+ pdev->hdr_type == PCI_HEADER_TYPE_NORMAL)
+ dma_pdev = pci_get_slot(pdev->bus,
+ PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
+
+ group = iommu_group_get(&dma_pdev->dev);
+ if (!group) {
+ group = iommu_group_alloc();
+ if (IS_ERR(group))
+ return PTR_ERR(group);
+ }
+
+ ret = iommu_group_add_device(group, dev);
+
+ iommu_group_put(group);
+ return ret;
+}
+
+static void intel_iommu_remove_device(struct device *dev)
+{
+ iommu_group_remove_device(dev);
+}
+
static struct iommu_ops intel_iommu_ops = {
.domain_init = intel_iommu_domain_init,
.domain_destroy = intel_iommu_domain_destroy,
@@ -4096,6 +4140,8 @@ static struct iommu_ops intel_iommu_ops = {
.unmap = intel_iommu_unmap,
.iova_to_phys = intel_iommu_iova_to_phys,
.domain_has_cap = intel_iommu_domain_has_cap,
+ .add_device = intel_iommu_add_device,
+ .remove_device = intel_iommu_remove_device,
.pgsize_bitmap = INTEL_IOMMU_PGSIZES,
};

2012-05-22 05:05:57

by Alex Williamson

Subject: [PATCH v2 07/13] vfio: VFIO core

VFIO is a secure user level driver for use with both virtual machines
and user level drivers. VFIO makes use of IOMMU groups to ensure the
isolation of devices in use, allowing unprivileged user access. It's
intended that VFIO will replace KVM device assignment and UIO drivers
(in cases where the target platform includes a sufficiently capable
IOMMU).

New in this version of VFIO is support for IOMMU groups managed
through the IOMMU core as well as a rework of the API, removing the
group merge interface. We now go back to a model more similar to
original VFIO with UIOMMU support where the file descriptor obtained
from /dev/vfio/vfio allows access to the IOMMU, but only after a
group is added, avoiding the previous privilege issues with this type
of model. IOMMU support is also now fully modular as IOMMUs have
vastly different interface requirements on different platforms. VFIO
users are able to query and initialize the IOMMU model of their
choice.
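
The privilege rule described above (the container may enable an IOMMU
only once it holds a group) can be modelled in a few lines. This is a
userspace sketch with hypothetical mock_* names, not the VFIO
implementation. It assumes, following the comment on the set-iommu path
in vfio.c, that removing the last group returns the container to the
unset state.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of a container: a group count and an IOMMU flag. */
struct mock_container {
	int ngroups;
	int iommu_set;
};

static void mock_container_add_group(struct mock_container *c)
{
	c->ngroups++;
}

static void mock_container_del_group(struct mock_container *c)
{
	assert(c->ngroups > 0);

	/* Losing the last group de-privileges the container. */
	if (--c->ngroups == 0)
		c->iommu_set = 0;
}

/* Returns 0 on success, -EINVAL without a group or if already set. */
static int mock_set_iommu(struct mock_container *c)
{
	if (c->ngroups == 0 || c->iommu_set)
		return -EINVAL;

	c->iommu_set = 1;
	return 0;
}
```

An empty container is refused, a container with a group succeeds once,
and a second attempt is refused until the container has been emptied
and repopulated.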

Please see the follow-on Documentation commit for further description
and usage example.

Signed-off-by: Alex Williamson <[email protected]>
---

Documentation/ioctl/ioctl-number.txt | 1
MAINTAINERS | 8
drivers/Kconfig | 2
drivers/Makefile | 1
drivers/vfio/Kconfig | 8
drivers/vfio/Makefile | 1
drivers/vfio/vfio.c | 1406 ++++++++++++++++++++++++++++++++++
include/linux/vfio.h | 366 +++++++++
8 files changed, 1793 insertions(+), 0 deletions(-)
create mode 100644 drivers/vfio/Kconfig
create mode 100644 drivers/vfio/Makefile
create mode 100644 drivers/vfio/vfio.c
create mode 100644 include/linux/vfio.h

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index e34b531..111e30a 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -88,6 +88,7 @@ Code Seq#(hex) Include File Comments
and kernel/power/user.c
'8' all SNP8023 advanced NIC card
<mailto:[email protected]>
+';' 64-6F linux/vfio.h
'@' 00-0F linux/radeonfb.h conflict!
'@' 00-0F drivers/video/aty/aty128fb.c conflict!
'A' 00-1F linux/apm_bios.h conflict!
diff --git a/MAINTAINERS b/MAINTAINERS
index b362709..5aca4ff 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7229,6 +7229,14 @@ S: Maintained
F: Documentation/filesystems/vfat.txt
F: fs/fat/

+VFIO DRIVER
+M: Alex Williamson <[email protected]>
+L: [email protected]
+S: Maintained
+F: Documentation/vfio.txt
+F: drivers/vfio/
+F: include/linux/vfio.h
+
VIDEOBUF2 FRAMEWORK
M: Pawel Osciak <[email protected]>
M: Marek Szyprowski <[email protected]>
diff --git a/drivers/Kconfig b/drivers/Kconfig
index d236aef..46eb115 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -112,6 +112,8 @@ source "drivers/auxdisplay/Kconfig"

source "drivers/uio/Kconfig"

+source "drivers/vfio/Kconfig"
+
source "drivers/vlynq/Kconfig"

source "drivers/virtio/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 95952c8..fe1880a 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -59,6 +59,7 @@ obj-$(CONFIG_ATM) += atm/
obj-$(CONFIG_FUSION) += message/
obj-y += firewire/
obj-$(CONFIG_UIO) += uio/
+obj-$(CONFIG_VFIO) += vfio/
obj-y += cdrom/
obj-y += auxdisplay/
obj-$(CONFIG_PCCARD) += pcmcia/
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
new file mode 100644
index 0000000..9acb1e7
--- /dev/null
+++ b/drivers/vfio/Kconfig
@@ -0,0 +1,8 @@
+menuconfig VFIO
+ tristate "VFIO Non-Privileged userspace driver framework"
+ depends on IOMMU_API
+ help
+ VFIO provides a framework for secure userspace device drivers.
+ See Documentation/vfio.txt for more details.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
new file mode 100644
index 0000000..7500a67
--- /dev/null
+++ b/drivers/vfio/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_VFIO) += vfio.o
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
new file mode 100644
index 0000000..6558eef
--- /dev/null
+++ b/drivers/vfio/vfio.c
@@ -0,0 +1,1406 @@
+/*
+ * VFIO core
+ *
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ */
+
+#include <linux/cdev.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+#include <linux/fs.h>
+#include <linux/idr.h>
+#include <linux/iommu.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/wait.h>
+
+#define DRIVER_VERSION "0.3"
+#define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
+#define DRIVER_DESC "VFIO - User Level meta-driver"
+
+static struct vfio {
+ struct class *class;
+ struct list_head iommu_drivers_list;
+ struct mutex iommu_drivers_lock;
+ struct list_head group_list;
+ struct idr group_idr;
+ struct mutex group_lock;
+ struct cdev group_cdev;
+ struct device *dev;
+ dev_t devt;
+ struct cdev cdev;
+ wait_queue_head_t release_q;
+} vfio;
+
+struct vfio_iommu_driver {
+ const struct vfio_iommu_driver_ops *ops;
+ struct list_head vfio_next;
+};
+
+struct vfio_container {
+ struct kref kref;
+ struct list_head group_list;
+ struct mutex group_lock;
+ struct vfio_iommu_driver *iommu_driver;
+ void *iommu_data;
+};
+
+struct vfio_group {
+ struct kref kref;
+ int minor;
+ atomic_t container_users;
+ struct iommu_group *iommu_group;
+ struct vfio_container *container;
+ struct list_head device_list;
+ struct mutex device_lock;
+ struct device *dev;
+ struct notifier_block nb;
+ struct list_head vfio_next;
+ struct list_head container_next;
+};
+
+struct vfio_device {
+ struct kref kref;
+ struct device *dev;
+ const struct vfio_device_ops *ops;
+ struct vfio_group *group;
+ struct list_head group_next;
+ void *device_data;
+};
+
+/**
+ * IOMMU driver registration
+ */
+int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops)
+{
+ struct vfio_iommu_driver *driver, *tmp;
+
+ driver = kzalloc(sizeof(*driver), GFP_KERNEL);
+ if (!driver)
+ return -ENOMEM;
+
+ driver->ops = ops;
+
+ mutex_lock(&vfio.iommu_drivers_lock);
+
+ /* Check for duplicates */
+ list_for_each_entry(tmp, &vfio.iommu_drivers_list, vfio_next) {
+ if (tmp->ops == ops) {
+ mutex_unlock(&vfio.iommu_drivers_lock);
+ kfree(driver);
+ return -EINVAL;
+ }
+ }
+
+ list_add(&driver->vfio_next, &vfio.iommu_drivers_list);
+
+ mutex_unlock(&vfio.iommu_drivers_lock);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_register_iommu_driver);
+
+void vfio_unregister_iommu_driver(const struct vfio_iommu_driver_ops *ops)
+{
+ struct vfio_iommu_driver *driver;
+
+ mutex_lock(&vfio.iommu_drivers_lock);
+ list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
+ if (driver->ops == ops) {
+ list_del(&driver->vfio_next);
+ mutex_unlock(&vfio.iommu_drivers_lock);
+ kfree(driver);
+ return;
+ }
+ }
+ mutex_unlock(&vfio.iommu_drivers_lock);
+}
+EXPORT_SYMBOL_GPL(vfio_unregister_iommu_driver);
+
+/**
+ * Group minor allocation/free - both called with vfio.group_lock held
+ */
+static int vfio_alloc_group_minor(struct vfio_group *group)
+{
+ int ret, minor;
+
+again:
+ if (unlikely(idr_pre_get(&vfio.group_idr, GFP_KERNEL) == 0))
+ return -ENOMEM;
+
+ /* index 0 is used by /dev/vfio/vfio */
+ ret = idr_get_new_above(&vfio.group_idr, group, 1, &minor);
+ if (ret == -EAGAIN)
+ goto again;
+ if (ret || minor > MINORMASK) {
+ if (minor > MINORMASK)
+ idr_remove(&vfio.group_idr, minor);
+ return -ENOSPC;
+ }
+
+ return minor;
+}
+
+static void vfio_free_group_minor(int minor)
+{
+ idr_remove(&vfio.group_idr, minor);
+}
+
+static int vfio_iommu_group_notifier(struct notifier_block *nb,
+ unsigned long action, void *data);
+static void vfio_group_get(struct vfio_group *group);
+
+/**
+ * Container objects - containers are created when /dev/vfio/vfio is
+ * opened, but their lifecycle extends until the last user is done, so
+ * they're freed via kref. Must support container/group/device being
+ * closed in any order.
+ */
+static void vfio_container_get(struct vfio_container *container)
+{
+ kref_get(&container->kref);
+}
+
+static void vfio_container_release(struct kref *kref)
+{
+ struct vfio_container *container;
+ container = container_of(kref, struct vfio_container, kref);
+
+ kfree(container);
+}
+
+static void vfio_container_put(struct vfio_container *container)
+{
+ kref_put(&container->kref, vfio_container_release);
+}
+
+/**
+ * Group objects - create, release, get, put, search
+ */
+static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group)
+{
+ struct vfio_group *group, *tmp;
+ struct device *dev;
+ int ret, minor;
+
+ group = kzalloc(sizeof(*group), GFP_KERNEL);
+ if (!group)
+ return ERR_PTR(-ENOMEM);
+
+ kref_init(&group->kref);
+ INIT_LIST_HEAD(&group->device_list);
+ mutex_init(&group->device_lock);
+ atomic_set(&group->container_users, 0);
+ group->iommu_group = iommu_group;
+
+ group->nb.notifier_call = vfio_iommu_group_notifier;
+
+ /*
+ * blocking notifiers acquire a rwsem around registering and hold
+ * it around callback. Therefore, need to register outside of
+ * vfio.group_lock to avoid A-B/B-A contention. Our callback won't
+ * do anything unless it can find the group in vfio.group_list, so
+ * no harm in registering early.
+ */
+ ret = iommu_group_register_notifier(iommu_group, &group->nb);
+ if (ret) {
+ kfree(group);
+ return ERR_PTR(ret);
+ }
+
+ mutex_lock(&vfio.group_lock);
+
+ minor = vfio_alloc_group_minor(group);
+ if (minor < 0) {
+ mutex_unlock(&vfio.group_lock);
+ kfree(group);
+ return ERR_PTR(minor);
+ }
+
+ /* Did we race creating this group? */
+ list_for_each_entry(tmp, &vfio.group_list, vfio_next) {
+ if (tmp->iommu_group == iommu_group) {
+ vfio_group_get(tmp);
+ vfio_free_group_minor(minor);
+ mutex_unlock(&vfio.group_lock);
+ kfree(group);
+ return tmp;
+ }
+ }
+
+ dev = device_create(vfio.class, NULL, MKDEV(MAJOR(vfio.devt), minor),
+ group, "%d", iommu_group_id(iommu_group));
+ if (IS_ERR(dev)) {
+ vfio_free_group_minor(minor);
+ mutex_unlock(&vfio.group_lock);
+ kfree(group);
+ return (struct vfio_group *)dev; /* ERR_PTR */
+ }
+
+ group->minor = minor;
+ group->dev = dev;
+
+ list_add(&group->vfio_next, &vfio.group_list);
+
+ mutex_unlock(&vfio.group_lock);
+
+ return group;
+}
+
+static void vfio_group_release(struct kref *kref)
+{
+ struct vfio_group *group = container_of(kref, struct vfio_group, kref);
+
+ WARN_ON(!list_empty(&group->device_list));
+
+ device_destroy(vfio.class, MKDEV(MAJOR(vfio.devt), group->minor));
+ list_del(&group->vfio_next);
+ vfio_free_group_minor(group->minor);
+
+ mutex_unlock(&vfio.group_lock);
+
+ /*
+ * Unregister outside of lock. A spurious callback is harmless now
+ * that the group is no longer in vfio.group_list.
+ */
+ iommu_group_unregister_notifier(group->iommu_group, &group->nb);
+
+ kfree(group);
+}
+
+static void vfio_group_put(struct vfio_group *group)
+{
+ mutex_lock(&vfio.group_lock);
+ /*
+ * Release needs to unlock to unregister the notifier, so only
+ * unlock if not released.
+ */
+ if (!kref_put(&group->kref, vfio_group_release))
+ mutex_unlock(&vfio.group_lock);
+}
+
+/* Assume group_lock or group reference is held */
+static void vfio_group_get(struct vfio_group *group)
+{
+ kref_get(&group->kref);
+}
+
+/*
+ * Not really a try as we will sleep for mutex, but we need to make
+ * sure the group pointer is valid under lock and get a reference.
+ */
+static struct vfio_group *vfio_group_try_get(struct vfio_group *group)
+{
+ struct vfio_group *target = group;
+
+ mutex_lock(&vfio.group_lock);
+ list_for_each_entry(group, &vfio.group_list, vfio_next) {
+ if (group == target) {
+ vfio_group_get(group);
+ mutex_unlock(&vfio.group_lock);
+ return group;
+ }
+ }
+ mutex_unlock(&vfio.group_lock);
+
+ return NULL;
+}
+
+static
+struct vfio_group *vfio_group_get_from_iommu(struct iommu_group *iommu_group)
+{
+ struct vfio_group *group;
+
+ mutex_lock(&vfio.group_lock);
+ list_for_each_entry(group, &vfio.group_list, vfio_next) {
+ if (group->iommu_group == iommu_group) {
+ vfio_group_get(group);
+ mutex_unlock(&vfio.group_lock);
+ return group;
+ }
+ }
+ mutex_unlock(&vfio.group_lock);
+
+ return NULL;
+}
+
+static struct vfio_group *vfio_group_get_from_minor(int minor)
+{
+ struct vfio_group *group;
+
+ mutex_lock(&vfio.group_lock);
+ group = idr_find(&vfio.group_idr, minor);
+ if (!group) {
+ mutex_unlock(&vfio.group_lock);
+ return NULL;
+ }
+ vfio_group_get(group);
+ mutex_unlock(&vfio.group_lock);
+
+ return group;
+}
+
+/**
+ * Device objects - create, release, get, put, search
+ */
+static
+struct vfio_device *vfio_group_create_device(struct vfio_group *group,
+ struct device *dev,
+ const struct vfio_device_ops *ops,
+ void *device_data)
+{
+ struct vfio_device *device;
+ int ret;
+
+ device = kzalloc(sizeof(*device), GFP_KERNEL);
+ if (!device)
+ return ERR_PTR(-ENOMEM);
+
+ kref_init(&device->kref);
+ device->dev = dev;
+ device->group = group;
+ device->ops = ops;
+ device->device_data = device_data;
+
+ ret = dev_set_drvdata(dev, device);
+ if (ret) {
+ kfree(device);
+ return ERR_PTR(ret);
+ }
+
+ /* No need to get group_lock, caller has group reference */
+ vfio_group_get(group);
+
+ mutex_lock(&group->device_lock);
+ list_add(&device->group_next, &group->device_list);
+ mutex_unlock(&group->device_lock);
+
+ return device;
+}
+
+static void vfio_device_release(struct kref *kref)
+{
+ struct vfio_device *device = container_of(kref,
+ struct vfio_device, kref);
+ struct vfio_group *group = device->group;
+
+ mutex_lock(&group->device_lock);
+ list_del(&device->group_next);
+ mutex_unlock(&group->device_lock);
+
+ dev_set_drvdata(device->dev, NULL);
+
+ kfree(device);
+
+ /* vfio_del_group_dev may be waiting for this device */
+ wake_up(&vfio.release_q);
+}
+
+/* Device reference always implies a group reference */
+static void vfio_device_put(struct vfio_device *device)
+{
+ kref_put(&device->kref, vfio_device_release);
+ vfio_group_put(device->group);
+}
+
+static void vfio_device_get(struct vfio_device *device)
+{
+ vfio_group_get(device->group);
+ kref_get(&device->kref);
+}
+
+static struct vfio_device *vfio_group_get_device(struct vfio_group *group,
+ struct device *dev)
+{
+ struct vfio_device *device;
+
+ mutex_lock(&group->device_lock);
+ list_for_each_entry(device, &group->device_list, group_next) {
+ if (device->dev == dev) {
+ vfio_device_get(device);
+ mutex_unlock(&group->device_lock);
+ return device;
+ }
+ }
+ mutex_unlock(&group->device_lock);
+ return NULL;
+}
+
+/**
+ * Async device support
+ */
+static int vfio_group_nb_add_dev(struct vfio_group *group, struct device *dev)
+{
+ struct vfio_device *device;
+
+ /* Do we already know about it? We shouldn't */
+ device = vfio_group_get_device(group, dev);
+ if (WARN_ON_ONCE(device)) {
+ vfio_device_put(device);
+ return 0;
+ }
+
+ /* Nothing to do for idle groups */
+ if (!atomic_read(&group->container_users))
+ return 0;
+
+ /* TODO Prevent device auto probing */
+ WARN(1, "Device %s added to live group %d!\n", dev_name(dev),
+ iommu_group_id(group->iommu_group));
+
+ return 0;
+}
+
+static int vfio_group_nb_del_dev(struct vfio_group *group, struct device *dev)
+{
+ struct vfio_device *device;
+
+ /*
+ * Expect to fall out here. If a device was in use, it would
+ * have been bound to a vfio sub-driver, which would have blocked
+ * in .remove at vfio_del_group_dev. Sanity check that we no
+ * longer track the device, so it's safe to remove.
+ */
+ device = vfio_group_get_device(group, dev);
+ if (likely(!device))
+ return 0;
+
+ WARN(1, "Device %s removed from live group %d!\n", dev_name(dev),
+ iommu_group_id(group->iommu_group));
+
+ vfio_device_put(device);
+ return 0;
+}
+
+static int vfio_group_nb_verify(struct vfio_group *group, struct device *dev)
+{
+ struct vfio_device *device;
+
+ /* We don't care what happens when the group isn't in use */
+ if (!atomic_read(&group->container_users))
+ return 0;
+
+ device = vfio_group_get_device(group, dev);
+ if (device)
+ vfio_device_put(device);
+
+ return device ? 0 : -EINVAL;
+}
+
+static int vfio_iommu_group_notifier(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ struct vfio_group *group = container_of(nb, struct vfio_group, nb);
+ struct device *dev = data;
+
+ /*
+ * Need to go through a group_lock lookup to get a reference or
+ * we risk racing a group being removed. Leave a WARN_ON for
+ * debugging, but if the group no longer exists, a spurious notify
+ * is harmless.
+ */
+ group = vfio_group_try_get(group);
+ if (WARN_ON(!group))
+ return NOTIFY_OK;
+
+ switch (action) {
+ case IOMMU_GROUP_NOTIFY_ADD_DEVICE:
+ vfio_group_nb_add_dev(group, dev);
+ break;
+ case IOMMU_GROUP_NOTIFY_DEL_DEVICE:
+ vfio_group_nb_del_dev(group, dev);
+ break;
+ case IOMMU_GROUP_NOTIFY_BIND_DRIVER:
+ printk(KERN_DEBUG
+ "%s: Device %s, group %d binding to driver\n", __func__,
+ dev_name(dev), iommu_group_id(group->iommu_group));
+ break;
+ case IOMMU_GROUP_NOTIFY_BOUND_DRIVER:
+ printk(KERN_DEBUG
+ "%s: Device %s, group %d bound to driver %s\n", __func__,
+ dev_name(dev), iommu_group_id(group->iommu_group),
+ dev->driver->name);
+ BUG_ON(vfio_group_nb_verify(group, dev));
+ break;
+ case IOMMU_GROUP_NOTIFY_UNBIND_DRIVER:
+ printk(KERN_DEBUG
+ "%s: Device %s, group %d unbinding from driver %s\n",
+ __func__, dev_name(dev),
+ iommu_group_id(group->iommu_group), dev->driver->name);
+ break;
+ case IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER:
+ printk(KERN_DEBUG
+ "%s: Device %s, group %d unbound from driver\n",
+ __func__, dev_name(dev),
+ iommu_group_id(group->iommu_group));
+ /*
+ * XXX An unbound device in a live group is ok, but we'd
+ * really like to avoid the above BUG_ON by preventing other
+ * drivers from binding to it. Once that occurs, we have to
+ * stop the system to maintain isolation. At a minimum, we'd
+ * want a toggle to disable driver auto probe for this device.
+ */
+ break;
+ }
+
+ vfio_group_put(group);
+ return NOTIFY_OK;
+}
+
+/**
+ * VFIO driver API
+ */
+int vfio_add_group_dev(struct device *dev,
+ const struct vfio_device_ops *ops, void *device_data)
+{
+ struct iommu_group *iommu_group;
+ struct vfio_group *group;
+ struct vfio_device *device;
+
+ iommu_group = iommu_group_get(dev);
+ if (!iommu_group)
+ return -EINVAL;
+
+ group = vfio_group_get_from_iommu(iommu_group);
+ if (!group) {
+ group = vfio_create_group(iommu_group);
+ if (IS_ERR(group)) {
+ iommu_group_put(iommu_group);
+ return PTR_ERR(group);
+ }
+ }
+
+ device = vfio_group_get_device(group, dev);
+ if (device) {
+ WARN(1, "Device %s already exists on group %d\n",
+ dev_name(dev), iommu_group_id(iommu_group));
+ vfio_device_put(device);
+ vfio_group_put(group);
+ iommu_group_put(iommu_group);
+ return -EBUSY;
+ }
+
+ device = vfio_group_create_device(group, dev, ops, device_data);
+ if (IS_ERR(device)) {
+ vfio_group_put(group);
+ iommu_group_put(iommu_group);
+ return PTR_ERR(device);
+ }
+
+ /*
+ * Added device holds reference to iommu_group and vfio_device
+ * (which in turn holds reference to vfio_group). Drop extra
+ * group reference used while acquiring device.
+ */
+ vfio_group_put(group);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_add_group_dev);
+
+/* Test whether a struct device is present in our tracking */
+static bool vfio_dev_present(struct device *dev)
+{
+ struct iommu_group *iommu_group;
+ struct vfio_group *group;
+ struct vfio_device *device;
+
+ iommu_group = iommu_group_get(dev);
+ if (!iommu_group)
+ return false;
+
+ group = vfio_group_get_from_iommu(iommu_group);
+ if (!group) {
+ iommu_group_put(iommu_group);
+ return false;
+ }
+
+ device = vfio_group_get_device(group, dev);
+ if (!device) {
+ vfio_group_put(group);
+ iommu_group_put(iommu_group);
+ return false;
+ }
+
+ vfio_device_put(device);
+ vfio_group_put(group);
+ iommu_group_put(iommu_group);
+ return true;
+}
+
+/*
+ * Decrement the device reference count and wait for the device to be
+ * removed. Open file descriptors for the device... */
+void *vfio_del_group_dev(struct device *dev)
+{
+ struct vfio_device *device = dev_get_drvdata(dev);
+ struct vfio_group *group = device->group;
+ struct iommu_group *iommu_group = group->iommu_group;
+ void *device_data = device->device_data;
+
+ vfio_device_put(device);
+
+ /* TODO send a signal to encourage this to be released */
+ wait_event(vfio.release_q, !vfio_dev_present(dev));
+
+ iommu_group_put(iommu_group);
+
+ return device_data;
+}
+EXPORT_SYMBOL_GPL(vfio_del_group_dev);
+
+/**
+ * VFIO base fd, /dev/vfio/vfio
+ */
+static long vfio_ioctl_check_extension(struct vfio_container *container,
+ unsigned long arg)
+{
+ struct vfio_iommu_driver *driver = container->iommu_driver;
+ long ret = 0;
+
+ switch (arg) {
+ /* No base extensions yet */
+ default:
+ /*
+ * If no driver is set, poll all registered drivers for
+ * extensions and return the first positive result. If
+ * a driver is already set, further queries will be passed
+ * only to that driver.
+ */
+ if (!driver) {
+ mutex_lock(&vfio.iommu_drivers_lock);
+ list_for_each_entry(driver, &vfio.iommu_drivers_list,
+ vfio_next) {
+ if (!try_module_get(driver->ops->owner))
+ continue;
+
+ ret = driver->ops->ioctl(NULL,
+ VFIO_CHECK_EXTENSION,
+ arg);
+ module_put(driver->ops->owner);
+ if (ret > 0)
+ break;
+ }
+ mutex_unlock(&vfio.iommu_drivers_lock);
+ } else
+ ret = driver->ops->ioctl(container->iommu_data,
+ VFIO_CHECK_EXTENSION, arg);
+ }
+
+ return ret;
+}
+
+/* hold container->group_lock */
+static int __vfio_container_attach_groups(struct vfio_container *container,
+ struct vfio_iommu_driver *driver,
+ void *data)
+{
+ struct vfio_group *group;
+ int ret = -ENODEV;
+
+ list_for_each_entry(group, &container->group_list, container_next) {
+ ret = driver->ops->attach_group(data, group->iommu_group);
+ if (ret)
+ goto unwind;
+ }
+
+ return ret;
+
+unwind:
+ list_for_each_entry_continue_reverse(group, &container->group_list,
+ container_next) {
+ driver->ops->detach_group(data, group->iommu_group);
+ }
+
+ return ret;
+}
+
+static long vfio_ioctl_set_iommu(struct vfio_container *container,
+ unsigned long arg)
+{
+ struct vfio_iommu_driver *driver;
+ long ret = -ENODEV;
+
+ mutex_lock(&container->group_lock);
+
+ /*
+ * The container is designed to be an unprivileged interface while
+ * the group can be assigned to specific users. Therefore, only by
+ * adding a group to a container does the user get the privilege of
+ * enabling the iommu, which may allocate finite resources. There
+ * is no unset_iommu, but by removing all the groups from a container,
+ * the container is deprivileged and returns to an unset state.
+ */
+ if (list_empty(&container->group_list) || container->iommu_driver) {
+ mutex_unlock(&container->group_lock);
+ return -EINVAL;
+ }
+
+ mutex_lock(&vfio.iommu_drivers_lock);
+ list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
+ void *data;
+
+ if (!try_module_get(driver->ops->owner))
+ continue;
+
+ /*
+ * The arg magic for SET_IOMMU is the same as CHECK_EXTENSION,
+ * so test which iommu driver reported support for this
+ * extension and call open on them. We also pass them the
+ * magic, allowing a single driver to support multiple
+ * interfaces if they'd like.
+ */
+ if (driver->ops->ioctl(NULL, VFIO_CHECK_EXTENSION, arg) <= 0) {
+ module_put(driver->ops->owner);
+ continue;
+ }
+
+ /* module reference holds the driver we're working on */
+ mutex_unlock(&vfio.iommu_drivers_lock);
+
+ data = driver->ops->open(arg);
+ if (IS_ERR(data)) {
+ ret = PTR_ERR(data);
+ /* drop the module reference taken above */
+ module_put(driver->ops->owner);
+ goto skip_drivers_unlock;
+ }
+
+ ret = __vfio_container_attach_groups(container, driver, data);
+ if (!ret) {
+ container->iommu_driver = driver;
+ container->iommu_data = data;
+ } else {
+ driver->ops->release(data);
+ module_put(driver->ops->owner);
+ }
+
+ goto skip_drivers_unlock;
+ }
+
+ mutex_unlock(&vfio.iommu_drivers_lock);
+skip_drivers_unlock:
+ mutex_unlock(&container->group_lock);
+
+ return ret;
+}
+
+static long vfio_fops_unl_ioctl(struct file *filep,
+ unsigned int cmd, unsigned long arg)
+{
+ struct vfio_container *container = filep->private_data;
+ struct vfio_iommu_driver *driver;
+ void *data;
+ long ret = -EINVAL;
+
+ if (!container)
+ return ret;
+
+ driver = container->iommu_driver;
+ data = container->iommu_data;
+
+ switch (cmd) {
+ case VFIO_GET_API_VERSION:
+ ret = VFIO_API_VERSION;
+ break;
+ case VFIO_CHECK_EXTENSION:
+ ret = vfio_ioctl_check_extension(container, arg);
+ break;
+ case VFIO_SET_IOMMU:
+ ret = vfio_ioctl_set_iommu(container, arg);
+ break;
+ default:
+ if (driver) /* passthrough all unrecognized ioctls */
+ ret = driver->ops->ioctl(data, cmd, arg);
+ }
+
+ return ret;
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_fops_compat_ioctl(struct file *filep,
+ unsigned int cmd, unsigned long arg)
+{
+ arg = (unsigned long)compat_ptr(arg);
+ return vfio_fops_unl_ioctl(filep, cmd, arg);
+}
+#endif /* CONFIG_COMPAT */
+
+static int vfio_fops_open(struct inode *inode, struct file *filep)
+{
+ struct vfio_container *container;
+
+ container = kzalloc(sizeof(*container), GFP_KERNEL);
+ if (!container)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&container->group_list);
+ mutex_init(&container->group_lock);
+ kref_init(&container->kref);
+
+ filep->private_data = container;
+
+ return 0;
+}
+
+static int vfio_fops_release(struct inode *inode, struct file *filep)
+{
+ struct vfio_container *container = filep->private_data;
+
+ filep->private_data = NULL;
+
+ vfio_container_put(container);
+
+ return 0;
+}
+
+/*
+ * Once an iommu driver is set, we optionally pass read/write/mmap
+ * on to the driver, allowing management interfaces beyond ioctl.
+ */
+static ssize_t vfio_fops_read(struct file *filep, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vfio_container *container = filep->private_data;
+ struct vfio_iommu_driver *driver = container->iommu_driver;
+
+ if (unlikely(!driver || !driver->ops->read))
+ return -EINVAL;
+
+ return driver->ops->read(container->iommu_data, buf, count, ppos);
+}
+
+static ssize_t vfio_fops_write(struct file *filep, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vfio_container *container = filep->private_data;
+ struct vfio_iommu_driver *driver = container->iommu_driver;
+
+ if (unlikely(!driver || !driver->ops->write))
+ return -EINVAL;
+
+ return driver->ops->write(container->iommu_data, buf, count, ppos);
+}
+
+static int vfio_fops_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+ struct vfio_container *container = filep->private_data;
+ struct vfio_iommu_driver *driver = container->iommu_driver;
+
+ if (unlikely(!driver || !driver->ops->mmap))
+ return -EINVAL;
+
+ return driver->ops->mmap(container->iommu_data, vma);
+}
+
+static const struct file_operations vfio_fops = {
+ .owner = THIS_MODULE,
+ .open = vfio_fops_open,
+ .release = vfio_fops_release,
+ .read = vfio_fops_read,
+ .write = vfio_fops_write,
+ .unlocked_ioctl = vfio_fops_unl_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = vfio_fops_compat_ioctl,
+#endif
+ .mmap = vfio_fops_mmap,
+};
+
+/**
+ * VFIO Group fd, /dev/vfio/$GROUP
+ */
+static void __vfio_group_unset_container(struct vfio_group *group)
+{
+ struct vfio_container *container = group->container;
+ struct vfio_iommu_driver *driver;
+
+ mutex_lock(&container->group_lock);
+
+ driver = container->iommu_driver;
+ if (driver)
+ driver->ops->detach_group(container->iommu_data,
+ group->iommu_group);
+
+ group->container = NULL;
+ list_del(&group->container_next);
+
+ /* Detaching the last group deprivileges a container, remove iommu */
+ if (driver && list_empty(&container->group_list)) {
+ driver->ops->release(container->iommu_data);
+ module_put(driver->ops->owner);
+ container->iommu_driver = NULL;
+ container->iommu_data = NULL;
+ }
+
+ mutex_unlock(&container->group_lock);
+
+ vfio_container_put(container);
+}
+
+/*
+ * VFIO_GROUP_UNSET_CONTAINER should fail if there are other users or
+ * if there was no container to unset. Since the ioctl is called on
+ * the group, we know that still exists, therefore the only valid
+ * transition here is 1->0.
+ */
+static int vfio_group_unset_container(struct vfio_group *group)
+{
+ int users = atomic_cmpxchg(&group->container_users, 1, 0);
+
+ if (!users)
+ return -EINVAL;
+ if (users != 1)
+ return -EBUSY;
+
+ __vfio_group_unset_container(group);
+
+ return 0;
+}
+
+/*
+ * When removing container users, anything that removes the last user
+ * implicitly removes the group from the container. That is, if the
+ * group file descriptor is closed, as well as any device file descriptors,
+ * the group is free.
+ */
+static void vfio_group_try_dissolve_container(struct vfio_group *group)
+{
+ if (0 == atomic_dec_if_positive(&group->container_users))
+ __vfio_group_unset_container(group);
+}
+
+static int vfio_group_set_container(struct vfio_group *group, int container_fd)
+{
+ struct file *filep;
+ struct vfio_container *container;
+ struct vfio_iommu_driver *driver;
+ int ret = 0;
+
+ if (atomic_read(&group->container_users))
+ return -EINVAL;
+
+ filep = fget(container_fd);
+ if (!filep)
+ return -EBADF;
+
+ /* Sanity check, is this really our fd? */
+ if (filep->f_op != &vfio_fops) {
+ fput(filep);
+ return -EINVAL;
+ }
+
+ container = filep->private_data;
+ WARN_ON(!container); /* fget ensures we don't race vfio_fops_release */
+
+ mutex_lock(&container->group_lock);
+
+ driver = container->iommu_driver;
+ if (driver) {
+ ret = driver->ops->attach_group(container->iommu_data,
+ group->iommu_group);
+ if (ret)
+ goto unlock_out;
+ }
+
+ group->container = container;
+ list_add(&group->container_next, &container->group_list);
+
+ /* Get a reference on the container and mark a user within the group */
+ vfio_container_get(container);
+ atomic_inc(&group->container_users);
+
+unlock_out:
+ mutex_unlock(&container->group_lock);
+ fput(filep);
+
+ return ret;
+}
+
+/*
+ * A vfio group is viable for use by userspace if all devices are either
+ * driver-less or bound to a vfio driver. We test the latter by the
+ * existence of a struct vfio_device matching the dev.
+ */
+static int vfio_dev_viable(struct device *dev, void *data)
+{
+ struct vfio_group *group = data;
+ struct vfio_device *device;
+
+ if (!dev->driver)
+ return 0;
+
+ device = vfio_group_get_device(group, dev);
+ if (device) {
+ vfio_device_put(device);
+ return 0;
+ }
+
+ /*
+ * XXX IOMMU grouping restraints such as PCI ACS may create groups
+ * with devices that it's not practical to either unload the driver
+ * or bind it to a vfio driver (pcieport, shpchp). We probably want
+ * a whitelist of drivers or device types that we consider ok.
+ */
+
+ return -EINVAL;
+}
+
+static bool vfio_group_viable(struct vfio_group *group)
+{
+ return (iommu_group_for_each_dev(group->iommu_group,
+ group, vfio_dev_viable) == 0);
+}
+
+static const struct file_operations vfio_device_fops;
+
+static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
+{
+ struct vfio_device *device;
+ struct file *filep;
+ int ret = -ENODEV;
+
+ if (0 == atomic_read(&group->container_users) ||
+ !group->container->iommu_driver || !vfio_group_viable(group))
+ return -EINVAL;
+
+ mutex_lock(&group->device_lock);
+ list_for_each_entry(device, &group->device_list, group_next) {
+ if (strcmp(dev_name(device->dev), buf))
+ continue;
+
+ ret = device->ops->open(device->device_data);
+ if (ret)
+ break;
+ /*
+ * We can't use anon_inode_getfd() because we need to modify
+ * the f_mode flags directly to allow more than just ioctls
+ */
+ ret = get_unused_fd();
+ if (ret < 0) {
+ device->ops->release(device->device_data);
+ break;
+ }
+
+ filep = anon_inode_getfile("[vfio-device]", &vfio_device_fops,
+ device, O_RDWR);
+ if (IS_ERR(filep)) {
+ put_unused_fd(ret);
+ ret = PTR_ERR(filep);
+ device->ops->release(device->device_data);
+ break;
+ }
+
+ /*
+ * TODO: add an anon_inode interface to do this.
+ * Appears to be missing by lack of need rather than
+ * explicitly prevented. Now there's need.
+ */
+ filep->f_mode |= (FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
+
+ fd_install(ret, filep);
+
+ vfio_device_get(device);
+ atomic_inc(&group->container_users);
+ break;
+ }
+ mutex_unlock(&group->device_lock);
+
+ return ret;
+}
+
+static long vfio_group_fops_unl_ioctl(struct file *filep,
+ unsigned int cmd, unsigned long arg)
+{
+ struct vfio_group *group = filep->private_data;
+ long ret = -ENOTTY;
+
+ switch (cmd) {
+ case VFIO_GROUP_GET_STATUS:
+ {
+ struct vfio_group_status status;
+ unsigned long minsz;
+
+ minsz = offsetofend(struct vfio_group_status, flags);
+
+ if (copy_from_user(&status, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (status.argsz < minsz)
+ return -EINVAL;
+
+ status.flags = 0;
+
+ if (vfio_group_viable(group))
+ status.flags |= VFIO_GROUP_FLAGS_VIABLE;
+
+ if (group->container)
+ status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET;
+
+ /* copy_to_user() returns bytes not copied, not an errno */
+ ret = copy_to_user((void __user *)arg, &status, minsz) ?
+ -EFAULT : 0;
+
+ break;
+ }
+ case VFIO_GROUP_SET_CONTAINER:
+ {
+ int fd;
+
+ if (get_user(fd, (int __user *)arg))
+ return -EFAULT;
+
+ if (fd < 0)
+ return -EINVAL;
+
+ ret = vfio_group_set_container(group, fd);
+ break;
+ }
+ case VFIO_GROUP_UNSET_CONTAINER:
+ ret = vfio_group_unset_container(group);
+ break;
+ case VFIO_GROUP_GET_DEVICE_FD:
+ {
+ char *buf;
+
+ buf = strndup_user((const char __user *)arg, PAGE_SIZE);
+ if (IS_ERR(buf))
+ return PTR_ERR(buf);
+
+ ret = vfio_group_get_device_fd(group, buf);
+ kfree(buf);
+ break;
+ }
+ }
+
+ return ret;
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_group_fops_compat_ioctl(struct file *filep,
+ unsigned int cmd, unsigned long arg)
+{
+ arg = (unsigned long)compat_ptr(arg);
+ return vfio_group_fops_unl_ioctl(filep, cmd, arg);
+}
+#endif /* CONFIG_COMPAT */
+
+static int vfio_group_fops_open(struct inode *inode, struct file *filep)
+{
+ struct vfio_group *group;
+
+ group = vfio_group_get_from_minor(iminor(inode));
+ if (!group)
+ return -ENODEV;
+
+ if (group->container) {
+ vfio_group_put(group);
+ return -EBUSY;
+ }
+
+ filep->private_data = group;
+
+ return 0;
+}
+
+static int vfio_group_fops_release(struct inode *inode, struct file *filep)
+{
+ struct vfio_group *group = filep->private_data;
+
+ filep->private_data = NULL;
+
+ vfio_group_try_dissolve_container(group);
+
+ vfio_group_put(group);
+
+ return 0;
+}
+
+static const struct file_operations vfio_group_fops = {
+ .owner = THIS_MODULE,
+ .unlocked_ioctl = vfio_group_fops_unl_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = vfio_group_fops_compat_ioctl,
+#endif
+ .open = vfio_group_fops_open,
+ .release = vfio_group_fops_release,
+};
+
+/**
+ * VFIO Device fd
+ */
+static int vfio_device_fops_release(struct inode *inode, struct file *filep)
+{
+ struct vfio_device *device = filep->private_data;
+
+ device->ops->release(device->device_data);
+
+ vfio_group_try_dissolve_container(device->group);
+
+ vfio_device_put(device);
+
+ return 0;
+}
+
+static long vfio_device_fops_unl_ioctl(struct file *filep,
+ unsigned int cmd, unsigned long arg)
+{
+ struct vfio_device *device = filep->private_data;
+
+ if (unlikely(!device->ops->ioctl))
+ return -EINVAL;
+
+ return device->ops->ioctl(device->device_data, cmd, arg);
+}
+
+static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vfio_device *device = filep->private_data;
+
+ if (unlikely(!device->ops->read))
+ return -EINVAL;
+
+ return device->ops->read(device->device_data, buf, count, ppos);
+}
+
+static ssize_t vfio_device_fops_write(struct file *filep,
+ const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vfio_device *device = filep->private_data;
+
+ if (unlikely(!device->ops->write))
+ return -EINVAL;
+
+ return device->ops->write(device->device_data, buf, count, ppos);
+}
+
+static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+ struct vfio_device *device = filep->private_data;
+
+ if (unlikely(!device->ops->mmap))
+ return -EINVAL;
+
+ return device->ops->mmap(device->device_data, vma);
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_device_fops_compat_ioctl(struct file *filep,
+ unsigned int cmd, unsigned long arg)
+{
+ arg = (unsigned long)compat_ptr(arg);
+ return vfio_device_fops_unl_ioctl(filep, cmd, arg);
+}
+#endif /* CONFIG_COMPAT */
+
+static const struct file_operations vfio_device_fops = {
+ .owner = THIS_MODULE,
+ .release = vfio_device_fops_release,
+ .read = vfio_device_fops_read,
+ .write = vfio_device_fops_write,
+ .unlocked_ioctl = vfio_device_fops_unl_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = vfio_device_fops_compat_ioctl,
+#endif
+ .mmap = vfio_device_fops_mmap,
+};
+
+/**
+ * Module/class support
+ */
+static char *vfio_devnode(struct device *dev, umode_t *mode)
+{
+ return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
+}
+
+static int __init vfio_init(void)
+{
+ int ret;
+
+ idr_init(&vfio.group_idr);
+ mutex_init(&vfio.group_lock);
+ mutex_init(&vfio.iommu_drivers_lock);
+ INIT_LIST_HEAD(&vfio.group_list);
+ INIT_LIST_HEAD(&vfio.iommu_drivers_list);
+ init_waitqueue_head(&vfio.release_q);
+
+ vfio.class = class_create(THIS_MODULE, "vfio");
+ if (IS_ERR(vfio.class)) {
+ ret = PTR_ERR(vfio.class);
+ goto err_class;
+ }
+
+ vfio.class->devnode = vfio_devnode;
+
+ ret = alloc_chrdev_region(&vfio.devt, 0, MINORMASK, "vfio");
+ if (ret)
+ goto err_base_chrdev;
+
+ cdev_init(&vfio.cdev, &vfio_fops);
+ ret = cdev_add(&vfio.cdev, vfio.devt, 1);
+ if (ret)
+ goto err_base_cdev;
+
+ vfio.dev = device_create(vfio.class, NULL, vfio.devt, NULL, "vfio");
+ if (IS_ERR(vfio.dev)) {
+ ret = PTR_ERR(vfio.dev);
+ goto err_base_dev;
+ }
+
+ /* /dev/vfio/$GROUP */
+ cdev_init(&vfio.group_cdev, &vfio_group_fops);
+ ret = cdev_add(&vfio.group_cdev,
+ MKDEV(MAJOR(vfio.devt), 1), MINORMASK - 1);
+ if (ret)
+ goto err_groups_cdev;
+
+ pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
+
+ return 0;
+
+err_groups_cdev:
+ device_destroy(vfio.class, vfio.devt);
+err_base_dev:
+ cdev_del(&vfio.cdev);
+err_base_cdev:
+ unregister_chrdev_region(vfio.devt, MINORMASK);
+err_base_chrdev:
+ class_destroy(vfio.class);
+ vfio.class = NULL;
+err_class:
+ return ret;
+}
+
+static void __exit vfio_cleanup(void)
+{
+ WARN_ON(!list_empty(&vfio.group_list));
+
+ idr_destroy(&vfio.group_idr);
+ cdev_del(&vfio.group_cdev);
+ device_destroy(vfio.class, vfio.devt);
+ cdev_del(&vfio.cdev);
+ unregister_chrdev_region(vfio.devt, MINORMASK);
+ class_destroy(vfio.class);
+ vfio.class = NULL;
+}
+
+module_init(vfio_init);
+module_exit(vfio_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
new file mode 100644
index 0000000..76a8f97
--- /dev/null
+++ b/include/linux/vfio.h
@@ -0,0 +1,366 @@
+/*
+ * VFIO API definition
+ *
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef VFIO_H
+#define VFIO_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+#include <linux/iommu.h>
+#include <linux/mm.h>
+
+#define VFIO_API_VERSION 0
+
+#ifdef __KERNEL__ /* Internal VFIO-core/bus driver API */
+
+/**
+ * struct vfio_device_ops - VFIO bus driver device callbacks
+ *
+ * @open: Called when userspace creates new file descriptor for device
+ * @release: Called when userspace releases file descriptor for device
+ * @read: Perform read(2) on device file descriptor
+ * @write: Perform write(2) on device file descriptor
+ * @ioctl: Perform ioctl(2) on device file descriptor, supporting VFIO_DEVICE_*
+ * operations documented below
+ * @mmap: Perform mmap(2) on a region of the device file descriptor
+ */
+struct vfio_device_ops {
+ char *name;
+ int (*open)(void *device_data);
+ void (*release)(void *device_data);
+ ssize_t (*read)(void *device_data, char __user *buf,
+ size_t count, loff_t *ppos);
+ ssize_t (*write)(void *device_data, const char __user *buf,
+ size_t count, loff_t *ppos);
+ long (*ioctl)(void *device_data, unsigned int cmd,
+ unsigned long arg);
+ int (*mmap)(void *device_data, struct vm_area_struct *vma);
+};
+
+extern int vfio_add_group_dev(struct device *dev,
+ const struct vfio_device_ops *ops,
+ void *device_data);
+
+extern void *vfio_del_group_dev(struct device *dev);
+
+/**
+ * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
+ */
+struct vfio_iommu_driver_ops {
+ char *name;
+ struct module *owner;
+ void *(*open)(unsigned long arg);
+ void (*release)(void *iommu_data);
+ ssize_t (*read)(void *iommu_data, char __user *buf,
+ size_t count, loff_t *ppos);
+ ssize_t (*write)(void *iommu_data, const char __user *buf,
+ size_t count, loff_t *ppos);
+ long (*ioctl)(void *iommu_data, unsigned int cmd,
+ unsigned long arg);
+ int (*mmap)(void *iommu_data, struct vm_area_struct *vma);
+ int (*attach_group)(void *iommu_data,
+ struct iommu_group *group);
+ void (*detach_group)(void *iommu_data,
+ struct iommu_group *group);
+
+};
+
+extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
+
+extern void vfio_unregister_iommu_driver(
+ const struct vfio_iommu_driver_ops *ops);
+
+/**
+ * offsetofend(TYPE, MEMBER)
+ *
+ * @TYPE: The type of the structure
+ * @MEMBER: The member within the structure to get the end offset of
+ *
+ * Simple helper macro for dealing with variable sized structures passed
+ * from user space. This allows us to easily determine if the provided
+ * structure is sized to include various fields.
+ */
+#define offsetofend(TYPE, MEMBER) ({ \
+ TYPE tmp; \
+ offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); })
+
+#endif /* __KERNEL__ */
+
+/* Kernel & User level defines for VFIO IOCTLs. */
+
+/* Extensions */
+
+#define VFIO_X86_IOMMU 1
+
+/*
+ * The IOCTL interface is designed for extensibility by embedding the
+ * structure length (argsz) and flags into structures passed between
+ * kernel and userspace. We therefore use the _IO() macro for these
+ * defines to avoid implicitly embedding a size into the ioctl request.
+ * As structure fields are added, argsz will increase to match and flag
+ * bits will be defined to indicate additional fields with valid data.
+ * It's *always* the caller's responsibility to indicate the size of
+ * the structure passed by setting argsz appropriately.
+ */
+
+#define VFIO_TYPE (';')
+#define VFIO_BASE 100
+
+/* -------- IOCTLs for VFIO file descriptor (/dev/vfio/vfio) -------- */
+
+/**
+ * VFIO_GET_API_VERSION - _IO(VFIO_TYPE, VFIO_BASE + 0)
+ *
+ * Report the version of the VFIO API. This allows us to bump the entire
+ * API version should we later need to add or change features in incompatible
+ * ways.
+ * Return: VFIO_API_VERSION
+ * Availability: Always
+ */
+#define VFIO_GET_API_VERSION _IO(VFIO_TYPE, VFIO_BASE + 0)
+
+/**
+ * VFIO_CHECK_EXTENSION - _IOW(VFIO_TYPE, VFIO_BASE + 1, __s32)
+ *
+ * Check whether an extension is supported.
+ * Return: 0 if not supported, 1 (or some other positive integer) if supported.
+ * Availability: Always
+ */
+#define VFIO_CHECK_EXTENSION _IO(VFIO_TYPE, VFIO_BASE + 1)
+
+/**
+ * VFIO_SET_IOMMU - _IOW(VFIO_TYPE, VFIO_BASE + 2, __s32)
+ *
+ * Set the iommu to the given type. The type must be supported by an
+ * iommu driver as verified by calling CHECK_EXTENSION using the same
+ * type. A group must be set to this file descriptor before this
+ * ioctl is available. The IOMMU interfaces enabled by this call are
+ * specific to the value set.
+ * Return: 0 on success, -errno on failure
+ * Availability: When VFIO group attached
+ */
+#define VFIO_SET_IOMMU _IO(VFIO_TYPE, VFIO_BASE + 2)
+
+/* -------- IOCTLs for GROUP file descriptors (/dev/vfio/$GROUP) -------- */
+
+/**
+ * VFIO_GROUP_GET_STATUS - _IOR(VFIO_TYPE, VFIO_BASE + 3,
+ * struct vfio_group_status)
+ *
+ * Retrieve information about the group. Fills in provided
+ * struct vfio_group_status. Caller sets argsz.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+struct vfio_group_status {
+ __u32 argsz;
+ __u32 flags;
+#define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
+#define VFIO_GROUP_FLAGS_CONTAINER_SET (1 << 1)
+};
+#define VFIO_GROUP_GET_STATUS _IO(VFIO_TYPE, VFIO_BASE + 3)
+
+/**
+ * VFIO_GROUP_SET_CONTAINER - _IOW(VFIO_TYPE, VFIO_BASE + 4, __s32)
+ *
+ * Set the container for the VFIO group to the open VFIO file
+ * descriptor provided. Groups may only belong to a single
+ * container. Containers may, at their discretion, support multiple
+ * groups. Only when a container is set are all of the interfaces
+ * of the VFIO file descriptor and the VFIO group file descriptor
+ * available to the user.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+#define VFIO_GROUP_SET_CONTAINER _IO(VFIO_TYPE, VFIO_BASE + 4)
+
+/**
+ * VFIO_GROUP_UNSET_CONTAINER - _IO(VFIO_TYPE, VFIO_BASE + 5)
+ *
+ * Remove the group from the attached container. This is the
+ * opposite of the SET_CONTAINER call and returns the group to
+ * an initial state. All device file descriptors must be released
+ * prior to calling this interface. When removing the last group
+ * from a container, the IOMMU will be disabled and all state lost,
+ * effectively also returning the VFIO file descriptor to an initial
+ * state.
+ * Return: 0 on success, -errno on failure.
+ * Availability: When attached to container
+ */
+#define VFIO_GROUP_UNSET_CONTAINER _IO(VFIO_TYPE, VFIO_BASE + 5)
+
+/**
+ * VFIO_GROUP_GET_DEVICE_FD - _IOW(VFIO_TYPE, VFIO_BASE + 6, char)
+ *
+ * Return a new file descriptor for the device object described by
+ * the provided string. The string should match a device listed in
+ * the devices subdirectory of the IOMMU group sysfs entry. The
+ * group containing the device must already be added to this context.
+ * Return: new file descriptor on success, -errno on failure.
+ * Availability: When attached to container
+ */
+#define VFIO_GROUP_GET_DEVICE_FD _IO(VFIO_TYPE, VFIO_BASE + 6)
+
+/* --------------- IOCTLs for DEVICE file descriptors --------------- */
+
+/**
+ * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
+ * struct vfio_device_info)
+ *
+ * Retrieve information about the device. Fills in provided
+ * struct vfio_device_info. Caller sets argsz.
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_info {
+ __u32 argsz;
+ __u32 flags;
+#define VFIO_DEVICE_FLAGS_RESET (1 << 0) /* Device supports reset */
+ __u32 num_regions; /* Max region index + 1 */
+ __u32 num_irqs; /* Max IRQ index + 1 */
+};
+#define VFIO_DEVICE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 7)
+
+/**
+ * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
+ * struct vfio_region_info)
+ *
+ * Retrieve information about a device region. Caller provides
+ * struct vfio_region_info with index value set. Caller sets argsz.
+ * Implementation of region mapping is bus driver specific. This is
+ * intended to describe MMIO, I/O port, as well as bus specific
+ * regions (ex. PCI config space). Zero sized regions may be used
+ * to describe unimplemented regions (ex. unimplemented PCI BARs).
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_region_info {
+ __u32 argsz;
+ __u32 flags;
+#define VFIO_REGION_INFO_FLAG_READ (1 << 0) /* Region supports read */
+#define VFIO_REGION_INFO_FLAG_WRITE (1 << 1) /* Region supports write */
+#define VFIO_REGION_INFO_FLAG_MMAP (1 << 2) /* Region supports mmap */
+ __u32 index; /* Region index */
+ __u32 resv; /* Reserved for alignment */
+ __u64 size; /* Region size (bytes) */
+ __u64 offset; /* Region offset from start of device fd */
+};
+#define VFIO_DEVICE_GET_REGION_INFO _IO(VFIO_TYPE, VFIO_BASE + 8)
+
+/**
+ * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
+ * struct vfio_irq_info)
+ *
+ * Retrieve information about a device IRQ. Caller provides
+ * struct vfio_irq_info with index value set. Caller sets argsz.
+ * Implementation of IRQ mapping is bus driver specific. Indexes
+ * using multiple IRQs are primarily intended to support MSI-like
+ * interrupt blocks. Zero count irq blocks may be used to describe
+ * unimplemented interrupt types.
+ *
+ * The EVENTFD flag indicates the interrupt index supports eventfd based
+ * signaling.
+ *
+ * The MASKABLE flag indicates the index supports MASK and UNMASK
+ * actions described below.
+ *
+ * AUTOMASKED indicates that after signaling, the interrupt line is
+ * automatically masked by VFIO and the user needs to unmask the line
+ * to receive new interrupts. This is primarily intended to distinguish
+ * level triggered interrupts.
+ *
+ * The NORESIZE flag indicates that the interrupt lines within the index
+ * are set up as a set and new subindexes cannot be enabled without first
+ * disabling the entire index. This is used for interrupts like PCI MSI
+ * and MSI-X where the driver may only use a subset of the available
+ * indexes, but VFIO needs to enable a specific number of vectors
+ * upfront. In the case of MSI-X, where the user can enable MSI-X and
+ * then add and unmask vectors, it's up to userspace to make the decision
+ * whether to allocate the maximum supported number of vectors or tear
+ * down setup and incrementally increase the vectors as each is enabled.
+ */
+struct vfio_irq_info {
+ __u32 argsz;
+ __u32 flags;
+#define VFIO_IRQ_INFO_EVENTFD (1 << 0)
+#define VFIO_IRQ_INFO_MASKABLE (1 << 1)
+#define VFIO_IRQ_INFO_AUTOMASKED (1 << 2)
+#define VFIO_IRQ_INFO_NORESIZE (1 << 3)
+ __u32 index; /* IRQ index */
+ __s32 count; /* Number of IRQs within this index */
+};
+#define VFIO_DEVICE_GET_IRQ_INFO _IO(VFIO_TYPE, VFIO_BASE + 9)
+
+/**
+ * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
+ *
+ * Set signaling, masking, and unmasking of interrupts. Caller provides
+ * struct vfio_irq_set with all fields set. 'start' and 'count' indicate
+ * the range of subindexes being specified.
+ *
+ * The DATA flags specify the type of data provided. If DATA_NONE, the
+ * operation performs the specified action immediately on the specified
+ * interrupt(s). For example, to unmask AUTOMASKED interrupt [0,0]:
+ * flags = (DATA_NONE|ACTION_UNMASK), index = 0, start = 0, count = 1.
+ *
+ * DATA_BOOL allows sparse support for the same on arrays of interrupts.
+ * For example, to mask interrupts [0,1] and [0,3] (but not [0,2]):
+ * flags = (DATA_BOOL|ACTION_MASK), index = 0, start = 1, count = 3,
+ * data = {1,0,1}
+ *
+ * DATA_EVENTFD binds the specified ACTION to the provided __s32 eventfd.
+ * A value of -1 can be used to either de-assign interrupts if already
+ * assigned or skip un-assigned interrupts. For example, to set an eventfd
+ * to be the trigger for interrupts [0,0] and [0,2]:
+ * flags = (DATA_EVENTFD|ACTION_TRIGGER), index = 0, start = 0, count = 3,
+ * data = {fd1, -1, fd2}
+ * If index [0,1] is previously set, two count = 1 ioctl calls would be
+ * required to set [0,0] and [0,2] without changing [0,1].
+ *
+ * Once a signaling mechanism is set, DATA_BOOL or DATA_NONE can be used
+ * with ACTION_TRIGGER to perform kernel level interrupt loopback testing
+ * from userspace (ie. simulate hardware triggering).
+ *
+ * Setting of an event triggering mechanism to userspace for ACTION_TRIGGER
+ * enables the interrupt index for the device. Individual subindex interrupts
+ * can be disabled using the -1 value for DATA_EVENTFD or the index can be
+ * disabled as a whole with: flags = (DATA_NONE|ACTION_TRIGGER), count = 0.
+ *
+ * Note that ACTION_[UN]MASK specify user->kernel signaling (irqfds) while
+ * ACTION_TRIGGER specifies kernel->user signaling.
+ */
+struct vfio_irq_set {
+ __u32 argsz;
+ __u32 flags;
+#define VFIO_IRQ_SET_DATA_NONE (1 << 0) /* Data not present */
+#define VFIO_IRQ_SET_DATA_BOOL (1 << 1) /* Data is bool (u8) */
+#define VFIO_IRQ_SET_DATA_EVENTFD (1 << 2) /* Data is eventfd (s32) */
+#define VFIO_IRQ_SET_ACTION_MASK (1 << 3) /* Mask interrupt */
+#define VFIO_IRQ_SET_ACTION_UNMASK (1 << 4) /* Unmask interrupt */
+#define VFIO_IRQ_SET_ACTION_TRIGGER (1 << 5) /* Trigger interrupt */
+ __u32 index;
+ __s32 start;
+ __s32 count;
+ __u8 data[];
+};
+#define VFIO_DEVICE_SET_IRQS _IO(VFIO_TYPE, VFIO_BASE + 10)
+
+#define VFIO_IRQ_SET_DATA_TYPE_MASK (VFIO_IRQ_SET_DATA_NONE | \
+ VFIO_IRQ_SET_DATA_BOOL | \
+ VFIO_IRQ_SET_DATA_EVENTFD)
+#define VFIO_IRQ_SET_ACTION_TYPE_MASK (VFIO_IRQ_SET_ACTION_MASK | \
+ VFIO_IRQ_SET_ACTION_UNMASK | \
+ VFIO_IRQ_SET_ACTION_TRIGGER)
+/**
+ * VFIO_DEVICE_RESET - _IO(VFIO_TYPE, VFIO_BASE + 11)
+ *
+ * Reset a device.
+ */
+#define VFIO_DEVICE_RESET _IO(VFIO_TYPE, VFIO_BASE + 11)
+
+#endif /* VFIO_H */
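
Not part of the patch: a minimal userspace sketch of the argsz convention described in the header comment above. The names group_status_v1, group_status_v2, the "extra" field and get_status() are hypothetical stand-ins, and memcpy() stands in for copy_from_user()/copy_to_user(). It only illustrates how a minsz computed with an offsetofend-style macro lets a handler that knows the short layout accept a caller passing a longer, newer structure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct vfio_group_status as defined in the patch. */
struct group_status_v1 {
	uint32_t argsz;
	uint32_t flags;
};

/* Hypothetical later revision that appends one field (assumption). */
struct group_status_v2 {
	uint32_t argsz;
	uint32_t flags;
	uint32_t extra;
};

/* Portable equivalent of the patch's offsetofend() helper. */
#define OFFSETOFEND(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

#define GROUP_FLAGS_VIABLE (1u << 0)

/*
 * Stand-in for the GET_STATUS handler: read only the prefix this
 * version understands, reject callers whose argsz is too small, and
 * write back only that same prefix.
 */
static int get_status(void *user_buf, int viable)
{
	struct group_status_v1 status;
	size_t minsz = OFFSETOFEND(struct group_status_v1, flags);

	assert(user_buf != NULL);

	memcpy(&status, user_buf, minsz);	/* copy_from_user() stand-in */
	if (status.argsz < minsz)
		return -EINVAL;

	status.flags = viable ? GROUP_FLAGS_VIABLE : 0;
	memcpy(user_buf, &status, minsz);	/* copy_to_user() stand-in */
	return 0;
}
```

A caller built against the longer layout sets argsz = sizeof(struct group_status_v2); the handler fills in flags and leaves the trailing field untouched, while a caller reporting an argsz below minsz gets -EINVAL.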

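Also not part of the patch: a small sketch of how a user might lay out the variable-length argument for the DATA_BOOL example in the VFIO_DEVICE_SET_IRQS comment (mask [0,1] and [0,3] but not [0,2]). The struct and flag values are copied locally from the header above; irq_set_build_bool() is a hypothetical helper, and the ioctl call itself is left out since it needs a real device fd.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local copies of the flag bits from the patch's struct vfio_irq_set. */
#define IRQ_SET_DATA_BOOL	(1u << 1)
#define IRQ_SET_ACTION_MASK	(1u << 3)

/* Local mirror of struct vfio_irq_set. */
struct irq_set {
	uint32_t argsz;
	uint32_t flags;
	uint32_t index;
	int32_t start;
	int32_t count;
	uint8_t data[];		/* one u8 per subindex for DATA_BOOL */
};

/*
 * Hypothetical helper: allocate and fill a DATA_BOOL request covering
 * subindexes [start, start + count). The caller frees the result.
 */
static struct irq_set *irq_set_build_bool(uint32_t action, uint32_t index,
					  int32_t start, const uint8_t *vals,
					  int32_t count)
{
	struct irq_set *set;
	size_t sz;

	assert(vals != NULL);
	if (count < 0)
		return NULL;

	sz = sizeof(*set) + (size_t)count;
	set = calloc(1, sz);
	if (!set)
		return NULL;

	set->argsz = (uint32_t)sz;	/* header plus trailing data bytes */
	set->flags = IRQ_SET_DATA_BOOL | action;
	set->index = index;
	set->start = start;
	set->count = count;
	memcpy(set->data, vals, (size_t)count);
	return set;
}
```

With vals = {1, 0, 1}, index = 0, start = 1 and count = 3, the resulting buffer matches the example in the comment and would be passed as the argument to VFIO_DEVICE_SET_IRQS on the device fd.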
2012-05-22 05:06:21

by Alex Williamson

[permalink] [raw]
Subject: [PATCH v2 11/13] pci: Create common pcibios_err_to_errno

Add a common pcibios_err_to_errno() for returning errors out to non-PCI
code. Rename xen's version so it does not clash with the new helper.

Signed-off-by: Alex Williamson <[email protected]>
Acked-by: Konrad Rzeszutek Wilk <[email protected]>
---

drivers/xen/xen-pciback/conf_space.c | 6 +++---
include/linux/pci.h | 26 ++++++++++++++++++++++++++
2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/xen/xen-pciback/conf_space.c b/drivers/xen/xen-pciback/conf_space.c
index 30d7be0..46ae0f9 100644
--- a/drivers/xen/xen-pciback/conf_space.c
+++ b/drivers/xen/xen-pciback/conf_space.c
@@ -124,7 +124,7 @@ static inline u32 merge_value(u32 val, u32 new_val, u32 new_val_mask,
return val;
}

-static int pcibios_err_to_errno(int err)
+static int xen_pcibios_err_to_errno(int err)
{
switch (err) {
case PCIBIOS_SUCCESSFUL:
@@ -202,7 +202,7 @@ out:
pci_name(dev), size, offset, value);

*ret_val = value;
- return pcibios_err_to_errno(err);
+ return xen_pcibios_err_to_errno(err);
}

int xen_pcibk_config_write(struct pci_dev *dev, int offset, int size, u32 value)
@@ -290,7 +290,7 @@ int xen_pcibk_config_write(struct pci_dev *dev, int offset, int size, u32 value)
}
}

- return pcibios_err_to_errno(err);
+ return xen_pcibios_err_to_errno(err);
}

void xen_pcibk_config_free_dyn_fields(struct pci_dev *dev)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 0cf57d5..b0f79b3 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -467,6 +467,32 @@ static inline bool pci_dev_msi_enabled(struct pci_dev *pci_dev) { return false;
#define PCIBIOS_SET_FAILED 0x88
#define PCIBIOS_BUFFER_TOO_SMALL 0x89

+/*
+ * Translate the PCIBIOS_* codes above to a generic errno for non-PCI callers.
+ */
+static inline int pcibios_err_to_errno(int err)
+{
+ if (err <= PCIBIOS_SUCCESSFUL)
+ return err; /* Assume already errno */
+
+ switch (err) {
+ case PCIBIOS_FUNC_NOT_SUPPORTED:
+ return -ENOENT;
+ case PCIBIOS_BAD_VENDOR_ID:
+ return -EINVAL;
+ case PCIBIOS_DEVICE_NOT_FOUND:
+ return -ENODEV;
+ case PCIBIOS_BAD_REGISTER_NUMBER:
+ return -EFAULT;
+ case PCIBIOS_SET_FAILED:
+ return -EIO;
+ case PCIBIOS_BUFFER_TOO_SMALL:
+ return -ENOSPC;
+ }
+
+ return -ENOTTY;
+}
+
/* Low-level architecture-dependent routines */

struct pci_ops {
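
To make the pass-through behaviour of the new helper concrete, here is a small user-space sketch (not part of the patch). The SKETCH_PCIBIOS_* values mirror the PCIBIOS_* constants in include/linux/pci.h, and sketch_finish_access() is a hypothetical caller invented for this example: it folds a config accessor's status and a byte count into a single return value, the way a read/write handler might.

```c
#include <assert.h>
#include <errno.h>

/* Mirrors of the PCIBIOS_* return codes from include/linux/pci.h. */
#define SKETCH_PCIBIOS_SUCCESSFUL		0x00
#define SKETCH_PCIBIOS_FUNC_NOT_SUPPORTED	0x81
#define SKETCH_PCIBIOS_BAD_VENDOR_ID		0x83
#define SKETCH_PCIBIOS_DEVICE_NOT_FOUND		0x86
#define SKETCH_PCIBIOS_BAD_REGISTER_NUMBER	0x87
#define SKETCH_PCIBIOS_SET_FAILED		0x88
#define SKETCH_PCIBIOS_BUFFER_TOO_SMALL		0x89

/* Same mapping as the inline helper added by this patch. */
static int sketch_pcibios_err_to_errno(int err)
{
	if (err <= SKETCH_PCIBIOS_SUCCESSFUL)
		return err;	/* zero or already a negative errno */

	switch (err) {
	case SKETCH_PCIBIOS_FUNC_NOT_SUPPORTED:
		return -ENOENT;
	case SKETCH_PCIBIOS_BAD_VENDOR_ID:
		return -EINVAL;
	case SKETCH_PCIBIOS_DEVICE_NOT_FOUND:
		return -ENODEV;
	case SKETCH_PCIBIOS_BAD_REGISTER_NUMBER:
		return -EFAULT;
	case SKETCH_PCIBIOS_SET_FAILED:
		return -EIO;
	case SKETCH_PCIBIOS_BUFFER_TOO_SMALL:
		return -ENOSPC;
	}

	return -ENOTTY;	/* unknown positive code */
}

/*
 * Hypothetical caller: return the byte count on success, or the
 * translated errno if the config accessor reported a failure.
 */
static long sketch_finish_access(int pcibios_status, long nbytes)
{
	int err;

	assert(nbytes >= 0);
	err = sketch_pcibios_err_to_errno(pcibios_status);

	return err ? err : nbytes;
}
```

Because negative values are passed through untouched, the helper can be applied to accessors that sometimes return PCIBIOS_* codes and sometimes return an errno already.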

2012-05-22 05:06:39

by Alex Williamson

Subject: [PATCH v2 12/13] pci: Misc pci_reg additions

Fill in many missing definitions and add sizeof fields for many
sections allowing for more extensive config parsing.

Signed-off-by: Alex Williamson <[email protected]>
---

include/linux/pci_regs.h | 112 +++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 100 insertions(+), 12 deletions(-)

diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index 4b608f5..379be84 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -26,6 +26,7 @@
* Under PCI, each device has 256 bytes of configuration address space,
* of which the first 64 bytes are standardized as follows:
*/
+#define PCI_STD_HEADER_SIZEOF 64
#define PCI_VENDOR_ID 0x00 /* 16 bits */
#define PCI_DEVICE_ID 0x02 /* 16 bits */
#define PCI_COMMAND 0x04 /* 16 bits */
@@ -209,9 +210,12 @@
#define PCI_CAP_ID_SHPC 0x0C /* PCI Standard Hot-Plug Controller */
#define PCI_CAP_ID_SSVID 0x0D /* Bridge subsystem vendor/device ID */
#define PCI_CAP_ID_AGP3 0x0E /* AGP Target PCI-PCI bridge */
+#define PCI_CAP_ID_SECDEV 0x0F /* Secure Device */
#define PCI_CAP_ID_EXP 0x10 /* PCI Express */
#define PCI_CAP_ID_MSIX 0x11 /* MSI-X */
+#define PCI_CAP_ID_SATA 0x12 /* SATA Data/Index Conf. */
#define PCI_CAP_ID_AF 0x13 /* PCI Advanced Features */
+#define PCI_CAP_ID_MAX PCI_CAP_ID_AF
#define PCI_CAP_LIST_NEXT 1 /* Next capability in the list */
#define PCI_CAP_FLAGS 2 /* Capability defined flags (16 bits) */
#define PCI_CAP_SIZEOF 4
@@ -276,6 +280,7 @@
#define PCI_VPD_ADDR_MASK 0x7fff /* Address mask */
#define PCI_VPD_ADDR_F 0x8000 /* Write 0, 1 indicates completion */
#define PCI_VPD_DATA 4 /* 32-bits of data returned here */
+#define PCI_CAP_VPD_SIZEOF 8

/* Slot Identification */

@@ -297,8 +302,10 @@
#define PCI_MSI_ADDRESS_HI 8 /* Upper 32 bits (if PCI_MSI_FLAGS_64BIT set) */
#define PCI_MSI_DATA_32 8 /* 16 bits of data for 32-bit devices */
#define PCI_MSI_MASK_32 12 /* Mask bits register for 32-bit devices */
+#define PCI_MSI_PENDING_32 16 /* Pending intrs for 32-bit devices */
#define PCI_MSI_DATA_64 12 /* 16 bits of data for 64-bit devices */
#define PCI_MSI_MASK_64 16 /* Mask bits register for 64-bit devices */
+#define PCI_MSI_PENDING_64 20 /* Pending intrs for 64-bit devices */

/* MSI-X registers */
#define PCI_MSIX_FLAGS 2
@@ -308,6 +315,7 @@
#define PCI_MSIX_TABLE 4
#define PCI_MSIX_PBA 8
#define PCI_MSIX_FLAGS_BIRMASK (7 << 0)
+#define PCI_CAP_MSIX_SIZEOF 12 /* size of MSIX registers */

/* MSI-X entry's format */
#define PCI_MSIX_ENTRY_SIZE 16
@@ -338,6 +346,7 @@
#define PCI_AF_CTRL_FLR 0x01
#define PCI_AF_STATUS 5
#define PCI_AF_STATUS_TP 0x01
+#define PCI_CAP_AF_SIZEOF 6 /* size of AF registers */

/* PCI-X registers */

@@ -374,6 +383,9 @@
#define PCI_X_STATUS_SPL_ERR 0x20000000 /* Rcvd Split Completion Error Msg */
#define PCI_X_STATUS_266MHZ 0x40000000 /* 266 MHz capable */
#define PCI_X_STATUS_533MHZ 0x80000000 /* 533 MHz capable */
+#define PCI_X_ECC_CSR 8 /* ECC control and status */
+#define PCI_CAP_PCIX_SIZEOF_V0 8 /* size of registers for Version 0 */
+#define PCI_CAP_PCIX_SIZEOF_V12 24 /* size for Version 1 & 2 */

/* PCI Bridge Subsystem ID registers */

@@ -462,6 +474,7 @@
#define PCI_EXP_LNKSTA_DLLLA 0x2000 /* Data Link Layer Link Active */
#define PCI_EXP_LNKSTA_LBMS 0x4000 /* Link Bandwidth Management Status */
#define PCI_EXP_LNKSTA_LABS 0x8000 /* Link Autonomous Bandwidth Status */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V1 20 /* v1 endpoints end here */
#define PCI_EXP_SLTCAP 20 /* Slot Capabilities */
#define PCI_EXP_SLTCAP_ABP 0x00000001 /* Attention Button Present */
#define PCI_EXP_SLTCAP_PCP 0x00000002 /* Power Controller Present */
@@ -521,6 +534,7 @@
#define PCI_EXP_OBFF_MSGA_EN 0x2000 /* OBFF enable with Message type A */
#define PCI_EXP_OBFF_MSGB_EN 0x4000 /* OBFF enable with Message type B */
#define PCI_EXP_OBFF_WAKE_EN 0x6000 /* OBFF using WAKE# signaling */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 44 /* v2 endpoints end here */
#define PCI_EXP_LNKCTL2 48 /* Link Control 2 */
#define PCI_EXP_SLTCTL2 56 /* Slot Control 2 */

@@ -529,23 +543,43 @@
#define PCI_EXT_CAP_VER(header) ((header >> 16) & 0xf)
#define PCI_EXT_CAP_NEXT(header) ((header >> 20) & 0xffc)

-#define PCI_EXT_CAP_ID_ERR 1
-#define PCI_EXT_CAP_ID_VC 2
-#define PCI_EXT_CAP_ID_DSN 3
-#define PCI_EXT_CAP_ID_PWR 4
-#define PCI_EXT_CAP_ID_VNDR 11
-#define PCI_EXT_CAP_ID_ACS 13
-#define PCI_EXT_CAP_ID_ARI 14
-#define PCI_EXT_CAP_ID_ATS 15
-#define PCI_EXT_CAP_ID_SRIOV 16
-#define PCI_EXT_CAP_ID_PRI 19
-#define PCI_EXT_CAP_ID_LTR 24
-#define PCI_EXT_CAP_ID_PASID 27
+#define PCI_EXT_CAP_ID_ERR 0x01 /* Advanced Error Reporting */
+#define PCI_EXT_CAP_ID_VC 0x02 /* Virtual Channel Capability */
+#define PCI_EXT_CAP_ID_DSN 0x03 /* Device Serial Number */
+#define PCI_EXT_CAP_ID_PWR 0x04 /* Power Budgeting */
+#define PCI_EXT_CAP_ID_RCLD 0x05 /* Root Complex Link Declaration */
+#define PCI_EXT_CAP_ID_RCILC 0x06 /* Root Complex Internal Link Control */
+#define PCI_EXT_CAP_ID_RCEC 0x07 /* Root Complex Event Collector */
+#define PCI_EXT_CAP_ID_MFVC 0x08 /* Multi-Function VC Capability */
+#define PCI_EXT_CAP_ID_VC9 0x09 /* same as _VC */
+#define PCI_EXT_CAP_ID_RCRB 0x0A /* Root Complex RB? */
+#define PCI_EXT_CAP_ID_VNDR 0x0B /* Vendor Specific */
+#define PCI_EXT_CAP_ID_CAC 0x0C /* Config Access - obsolete */
+#define PCI_EXT_CAP_ID_ACS 0x0D /* Access Control Services */
+#define PCI_EXT_CAP_ID_ARI 0x0E /* Alternate Routing ID */
+#define PCI_EXT_CAP_ID_ATS 0x0F /* Address Translation Services */
+#define PCI_EXT_CAP_ID_SRIOV 0x10 /* Single Root I/O Virtualization */
+#define PCI_EXT_CAP_ID_MRIOV 0x11 /* Multi Root I/O Virtualization */
+#define PCI_EXT_CAP_ID_MCAST 0x12 /* Multicast */
+#define PCI_EXT_CAP_ID_PRI 0x13 /* Page Request Interface */
+#define PCI_EXT_CAP_ID_AMD_XXX 0x14 /* reserved for AMD */
+#define PCI_EXT_CAP_ID_REBAR 0x15 /* resizable BAR */
+#define PCI_EXT_CAP_ID_DPA 0x16 /* dynamic power alloc */
+#define PCI_EXT_CAP_ID_TPH 0x17 /* TPH request */
+#define PCI_EXT_CAP_ID_LTR 0x18 /* latency tolerance reporting */
+#define PCI_EXT_CAP_ID_SECPCI 0x19 /* Secondary PCIe */
+#define PCI_EXT_CAP_ID_PMUX 0x1A /* Protocol Multiplexing */
+#define PCI_EXT_CAP_ID_PASID 0x1B /* Process Address Space ID */
+#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_PASID
+
+#define PCI_EXT_CAP_DSN_SIZEOF 12
+#define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40

/* Advanced Error Reporting */
#define PCI_ERR_UNCOR_STATUS 4 /* Uncorrectable Error Status */
#define PCI_ERR_UNC_TRAIN 0x00000001 /* Training */
#define PCI_ERR_UNC_DLP 0x00000010 /* Data Link Protocol */
+#define PCI_ERR_UNC_SURPDN 0x00000020 /* Surprise Down */
#define PCI_ERR_UNC_POISON_TLP 0x00001000 /* Poisoned TLP */
#define PCI_ERR_UNC_FCP 0x00002000 /* Flow Control Protocol */
#define PCI_ERR_UNC_COMP_TIME 0x00004000 /* Completion Timeout */
@@ -555,6 +589,11 @@
#define PCI_ERR_UNC_MALF_TLP 0x00040000 /* Malformed TLP */
#define PCI_ERR_UNC_ECRC 0x00080000 /* ECRC Error Status */
#define PCI_ERR_UNC_UNSUP 0x00100000 /* Unsupported Request */
+#define PCI_ERR_UNC_ACSV 0x00200000 /* ACS Violation */
+#define PCI_ERR_UNC_INTN 0x00400000 /* internal error */
+#define PCI_ERR_UNC_MCBTLP 0x00800000 /* MC blocked TLP */
+#define PCI_ERR_UNC_ATOMEG 0x01000000 /* Atomic egress blocked */
+#define PCI_ERR_UNC_TLPPRE 0x02000000 /* TLP prefix blocked */
#define PCI_ERR_UNCOR_MASK 8 /* Uncorrectable Error Mask */
/* Same bits as above */
#define PCI_ERR_UNCOR_SEVER 12 /* Uncorrectable Error Severity */
@@ -565,6 +604,9 @@
#define PCI_ERR_COR_BAD_DLLP 0x00000080 /* Bad DLLP Status */
#define PCI_ERR_COR_REP_ROLL 0x00000100 /* REPLAY_NUM Rollover */
#define PCI_ERR_COR_REP_TIMER 0x00001000 /* Replay Timer Timeout */
+#define PCI_ERR_COR_ADV_NFAT 0x00002000 /* Advisory Non-Fatal */
+#define PCI_ERR_COR_INTERNAL 0x00004000 /* Corrected Internal */
+#define PCI_ERR_COR_LOG_OVER 0x00008000 /* Header Log Overflow */
#define PCI_ERR_COR_MASK 20 /* Correctable Error Mask */
/* Same bits as above */
#define PCI_ERR_CAP 24 /* Advanced Error Capabilities */
@@ -596,12 +638,18 @@

/* Virtual Channel */
#define PCI_VC_PORT_REG1 4
+#define PCI_VC_REG1_EVCC 0x7 /* extended vc count */
#define PCI_VC_PORT_REG2 8
+#define PCI_VC_REG2_32_PHASE 0x2
+#define PCI_VC_REG2_64_PHASE 0x4
+#define PCI_VC_REG2_128_PHASE 0x8
#define PCI_VC_PORT_CTRL 12
#define PCI_VC_PORT_STATUS 14
#define PCI_VC_RES_CAP 16
#define PCI_VC_RES_CTRL 20
#define PCI_VC_RES_STATUS 26
+#define PCI_CAP_VC_BASE_SIZEOF 0x10
+#define PCI_CAP_VC_PER_VC_SIZEOF 0x0C

/* Power Budgeting */
#define PCI_PWR_DSR 4 /* Data Select Register */
@@ -614,6 +662,7 @@
#define PCI_PWR_DATA_RAIL(x) (((x) >> 18) & 7) /* Power Rail */
#define PCI_PWR_CAP 12 /* Capability */
#define PCI_PWR_CAP_BUDGET(x) ((x) & 1) /* Included in system budget */
+#define PCI_EXT_CAP_PWR_SIZEOF 16

/*
* Hypertransport sub capability types
@@ -646,6 +695,8 @@
#define HT_CAPTYPE_ERROR_RETRY 0xC0 /* Retry on error configuration */
#define HT_CAPTYPE_GEN3 0xD0 /* Generation 3 hypertransport configuration */
#define HT_CAPTYPE_PM 0xE0 /* Hypertransport powermanagement configuration */
+#define HT_CAP_SIZEOF_LONG 28 /* slave & primary */
+#define HT_CAP_SIZEOF_SHORT 24 /* host & secondary */

/* Alternative Routing-ID Interpretation */
#define PCI_ARI_CAP 0x04 /* ARI Capability Register */
@@ -656,6 +707,7 @@
#define PCI_ARI_CTRL_MFVC 0x0001 /* MFVC Function Groups Enable */
#define PCI_ARI_CTRL_ACS 0x0002 /* ACS Function Groups Enable */
#define PCI_ARI_CTRL_FG(x) (((x) >> 4) & 7) /* Function Group */
+#define PCI_EXT_CAP_ARI_SIZEOF 8

/* Address Translation Service */
#define PCI_ATS_CAP 0x04 /* ATS Capability Register */
@@ -665,6 +717,7 @@
#define PCI_ATS_CTRL_ENABLE 0x8000 /* ATS Enable */
#define PCI_ATS_CTRL_STU(x) ((x) & 0x1f) /* Smallest Translation Unit */
#define PCI_ATS_MIN_STU 12 /* shift of minimum STU block */
+#define PCI_EXT_CAP_ATS_SIZEOF 8

/* Page Request Interface */
#define PCI_PRI_CTRL 0x04 /* PRI control register */
@@ -676,6 +729,7 @@
#define PCI_PRI_STATUS_STOPPED 0x100 /* PRI Stopped */
#define PCI_PRI_MAX_REQ 0x08 /* PRI max reqs supported */
#define PCI_PRI_ALLOC_REQ 0x0c /* PRI max reqs allowed */
+#define PCI_EXT_CAP_PRI_SIZEOF 16

/* PASID capability */
#define PCI_PASID_CAP 0x04 /* PASID feature register */
@@ -685,6 +739,7 @@
#define PCI_PASID_CTRL_ENABLE 0x01 /* Enable bit */
#define PCI_PASID_CTRL_EXEC 0x02 /* Exec permissions Enable */
#define PCI_PASID_CTRL_PRIV 0x04 /* Priviledge Mode Enable */
+#define PCI_EXT_CAP_PASID_SIZEOF 8

/* Single Root I/O Virtualization */
#define PCI_SRIOV_CAP 0x04 /* SR-IOV Capabilities */
@@ -716,12 +771,14 @@
#define PCI_SRIOV_VFM_MI 0x1 /* Dormant.MigrateIn */
#define PCI_SRIOV_VFM_MO 0x2 /* Active.MigrateOut */
#define PCI_SRIOV_VFM_AV 0x3 /* Active.Available */
+#define PCI_EXT_CAP_SRIOV_SIZEOF 64

#define PCI_LTR_MAX_SNOOP_LAT 0x4
#define PCI_LTR_MAX_NOSNOOP_LAT 0x6
#define PCI_LTR_VALUE_MASK 0x000003ff
#define PCI_LTR_SCALE_MASK 0x00001c00
#define PCI_LTR_SCALE_SHIFT 10
+#define PCI_EXT_CAP_LTR_SIZEOF 8

/* Access Control Service */
#define PCI_ACS_CAP 0x04 /* ACS Capability Register */
@@ -732,7 +789,38 @@
#define PCI_ACS_UF 0x10 /* Upstream Forwarding */
#define PCI_ACS_EC 0x20 /* P2P Egress Control */
#define PCI_ACS_DT 0x40 /* Direct Translated P2P */
+#define PCI_ACS_EGRESS_BITS 0x05 /* ACS Egress Control Vector Size */
#define PCI_ACS_CTRL 0x06 /* ACS Control Register */
#define PCI_ACS_EGRESS_CTL_V 0x08 /* ACS Egress Control Vector */

+#define PCI_VSEC_HDR 4 /* extended cap - vendor specific */
+#define PCI_VSEC_HDR_LEN_SHIFT 20 /* shift for length field */
+
+/* sata capability */
+#define PCI_SATA_REGS 4 /* SATA REGs specifier */
+#define PCI_SATA_REGS_MASK 0xF /* location - BAR#/inline */
+#define PCI_SATA_REGS_INLINE 0xF /* REGS in config space */
+#define PCI_SATA_SIZEOF_SHORT 8
+#define PCI_SATA_SIZEOF_LONG 16
+
+/* resizable BARs */
+#define PCI_REBAR_CTRL 8 /* control register */
+#define PCI_REBAR_CTRL_NBAR_MASK (7 << 5) /* mask for # bars */
+#define PCI_REBAR_CTRL_NBAR_SHIFT 5 /* shift for # bars */
+
+/* dynamic power allocation */
+#define PCI_DPA_CAP 4 /* capability register */
+#define PCI_DPA_CAP_SUBSTATE_MASK 0x1F /* # substates - 1 */
+#define PCI_DPA_BASE_SIZEOF 16 /* size with 0 substates */
+
+/* TPH Requester */
+#define PCI_TPH_CAP 4 /* capability register */
+#define PCI_TPH_CAP_LOC_MASK 0x600 /* location mask */
+#define PCI_TPH_LOC_NONE 0x000 /* no location */
+#define PCI_TPH_LOC_CAP 0x200 /* in capability */
+#define PCI_TPH_LOC_MSIX 0x400 /* in MSI-X */
+#define PCI_TPH_CAP_ST_MASK 0x07FF0000 /* st table mask */
+#define PCI_TPH_CAP_ST_SHIFT 16 /* st table shift */
+#define PCI_TPH_BASE_SIZEOF 12 /* size with no st table */
+
#endif /* LINUX_PCI_REGS_H */
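
For readers wondering how the extended capability IDs and the PCI_EXT_CAP_ID/NEXT macros fit together during config parsing, the following user-space sketch (not part of the patch) walks the extended capability list in a snapshot of config space. The SKETCH_* macros copy the existing pci_regs.h header-decoding macros; the function names, the in-memory snapshot, and the little-endian read helper are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies of the header-decoding macros from pci_regs.h. */
#define SKETCH_EXT_CAP_ID(header)	((header) & 0x0000ffff)
#define SKETCH_EXT_CAP_NEXT(header)	(((header) >> 20) & 0xffc)

#define SKETCH_CFG_SPACE_SIZE		256	/* extended space starts here */
#define SKETCH_CFG_SPACE_EXP_SIZE	4096

/* Read a little-endian 32-bit value from a config space snapshot. */
static uint32_t sketch_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Return the offset of extended capability `cap_id` in the snapshot,
 * or 0 if it is not present. The ttl bound guards against looped lists,
 * assuming each capability occupies at least 8 bytes.
 */
static uint16_t sketch_find_ext_cap(const uint8_t *cfg, size_t len,
				    uint16_t cap_id)
{
	size_t pos = SKETCH_CFG_SPACE_SIZE;
	int ttl = (SKETCH_CFG_SPACE_EXP_SIZE - SKETCH_CFG_SPACE_SIZE) / 8;

	assert(cfg != NULL);

	while (ttl-- > 0 && pos >= SKETCH_CFG_SPACE_SIZE && pos + 4 <= len) {
		uint32_t header = sketch_le32(cfg + pos);

		if (header == 0)	/* empty list */
			break;
		if (SKETCH_EXT_CAP_ID(header) == cap_id)
			return (uint16_t)pos;

		pos = SKETCH_EXT_CAP_NEXT(header);	/* 0 terminates */
	}

	return 0;
}
```

Once a capability is located, a parser could use the matching *_SIZEOF constant added by this patch to decide how many bytes belong to it.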

2012-05-22 05:06:55

by Alex Williamson

Subject: [PATCH v2 13/13] vfio: Add PCI device driver

Add PCI device support for VFIO. PCI devices expose regions
for accessing config space, I/O port space, and MMIO areas
of the device. PCI config access is virtualized in the kernel,
which lets us protect the integrity of the system by blocking
unsafe accesses and avoids duplicating that support in each
userspace driver. I/O port space supports read/write access, while
MMIO also supports mmap of sufficiently sized regions. Support
for INTx, MSI, and MSI-X interrupts is provided using eventfds to
userspace.

Signed-off-by: Alex Williamson <[email protected]>
---

drivers/vfio/Kconfig | 2
drivers/vfio/pci/Kconfig | 8
drivers/vfio/pci/Makefile | 4
drivers/vfio/pci/vfio_pci.c | 557 +++++++++++++
drivers/vfio/pci/vfio_pci_config.c | 1522 +++++++++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_intrs.c | 724 +++++++++++++++++
drivers/vfio/pci/vfio_pci_private.h | 91 ++
drivers/vfio/pci/vfio_pci_rdwr.c | 269 ++++++
include/linux/vfio.h | 26 +
9 files changed, 3203 insertions(+), 0 deletions(-)
create mode 100644 drivers/vfio/pci/Kconfig
create mode 100644 drivers/vfio/pci/Makefile
create mode 100644 drivers/vfio/pci/vfio_pci.c
create mode 100644 drivers/vfio/pci/vfio_pci_config.c
create mode 100644 drivers/vfio/pci/vfio_pci_intrs.c
create mode 100644 drivers/vfio/pci/vfio_pci_private.h
create mode 100644 drivers/vfio/pci/vfio_pci_rdwr.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index bd88a30..77b754c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -12,3 +12,5 @@ menuconfig VFIO
See Documentation/vfio.txt for more details.

If you don't know what to do here, say N.
+
+source "drivers/vfio/pci/Kconfig"
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
new file mode 100644
index 0000000..cc7db62
--- /dev/null
+++ b/drivers/vfio/pci/Kconfig
@@ -0,0 +1,8 @@
+config VFIO_PCI
+ tristate "VFIO support for PCI devices"
+ depends on VFIO && PCI
+ help
+ Support for the PCI VFIO bus driver. This is required to make
+ use of PCI drivers using the VFIO framework.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
new file mode 100644
index 0000000..1310792
--- /dev/null
+++ b/drivers/vfio/pci/Makefile
@@ -0,0 +1,4 @@
+
+vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+
+obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
new file mode 100644
index 0000000..b2f1f3a
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -0,0 +1,557 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+#include "vfio_pci_private.h"
+
+#define DRIVER_VERSION "0.1.9"
+#define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
+#define DRIVER_DESC "VFIO PCI - User Level meta-driver"
+
+static int vfio_pci_enable(struct vfio_pci_device *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int ret;
+ u16 cmd;
+ u8 msix_pos;
+
+ vdev->reset_works = (pci_reset_function(pdev) == 0);
+ pci_save_state(pdev);
+ vdev->pci_saved_state = pci_store_saved_state(pdev);
+ if (!vdev->pci_saved_state)
+ printk(KERN_DEBUG "%s: Couldn't store %s saved state\n",
+ __func__, dev_name(&pdev->dev));
+
+ ret = vfio_config_init(vdev);
+ if (ret)
+ goto out;
+
+ vdev->pci_2_3 = pci_intx_mask_supported(pdev);
+
+ pci_read_config_word(pdev, PCI_COMMAND, &cmd);
+ if (vdev->pci_2_3 && (cmd & PCI_COMMAND_INTX_DISABLE)) {
+ cmd &= ~PCI_COMMAND_INTX_DISABLE;
+ pci_write_config_word(pdev, PCI_COMMAND, cmd);
+ }
+
+ msix_pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
+ if (msix_pos) {
+ u16 flags;
+ u32 table;
+
+ pci_read_config_word(pdev, msix_pos + PCI_MSIX_FLAGS, &flags);
+ pci_read_config_dword(pdev, msix_pos + PCI_MSIX_TABLE, &table);
+
+ vdev->msix_bar = table & PCI_MSIX_FLAGS_BIRMASK;
+ vdev->msix_offset = table & ~PCI_MSIX_FLAGS_BIRMASK;
+ vdev->msix_size = ((flags & PCI_MSIX_FLAGS_QSIZE) + 1) * 16;
+ } else
+ vdev->msix_bar = 0xFF;
+
+ ret = pci_enable_device(pdev);
+ if (ret)
+ goto out;
+
+ return ret;
+
+out:
+ kfree(vdev->pci_saved_state);
+ vdev->pci_saved_state = NULL;
+ vfio_config_free(vdev);
+ return ret;
+}
+
+static void vfio_pci_disable(struct vfio_pci_device *vdev)
+{
+ int bar;
+
+ pci_disable_device(vdev->pdev);
+
+ vfio_pci_set_irqs_ioctl(vdev, VFIO_IRQ_SET_DATA_NONE |
+ VFIO_IRQ_SET_ACTION_TRIGGER,
+ vdev->irq_type, 0, 0, NULL);
+
+ vdev->virq_disabled = false;
+
+ vfio_config_free(vdev);
+
+ if (pci_reset_function(vdev->pdev) == 0) {
+ if (pci_load_and_free_saved_state(vdev->pdev,
+ &vdev->pci_saved_state) == 0)
+ pci_restore_state(vdev->pdev);
+ else
+ printk(KERN_INFO "%s: Couldn't reload %s saved state\n",
+ __func__, dev_name(&vdev->pdev->dev));
+ }
+
+ for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
+ if (!vdev->barmap[bar])
+ continue;
+ pci_iounmap(vdev->pdev, vdev->barmap[bar]);
+ pci_release_selected_regions(vdev->pdev, 1 << bar);
+ vdev->barmap[bar] = NULL;
+ }
+}
+
+static void vfio_pci_release(void *device_data)
+{
+ struct vfio_pci_device *vdev = device_data;
+
+ if (atomic_dec_and_test(&vdev->refcnt))
+ vfio_pci_disable(vdev);
+
+ module_put(THIS_MODULE);
+}
+
+static int vfio_pci_open(void *device_data)
+{
+ struct vfio_pci_device *vdev = device_data;
+
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
+ if (atomic_inc_return(&vdev->refcnt) == 1) {
+ int ret = vfio_pci_enable(vdev);
+ if (ret) {
+ module_put(THIS_MODULE);
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
+{
+ if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
+ u8 pin;
+ pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
+ if (pin)
+ return 1;
+
+ } else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
+ u8 pos;
+ u16 flags;
+
+ pos = pci_find_capability(vdev->pdev, PCI_CAP_ID_MSI);
+ if (pos) {
+ pci_read_config_word(vdev->pdev,
+ pos + PCI_MSI_FLAGS, &flags);
+
+ return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
+ }
+ } else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
+ u8 pos;
+ u16 flags;
+
+ pos = pci_find_capability(vdev->pdev, PCI_CAP_ID_MSIX);
+ if (pos) {
+ pci_read_config_word(vdev->pdev,
+ pos + PCI_MSIX_FLAGS, &flags);
+
+ return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
+ }
+ }
+
+ return 0;
+}
+
+static long vfio_pci_ioctl(void *device_data,
+ unsigned int cmd, unsigned long arg)
+{
+ struct vfio_pci_device *vdev = device_data;
+ unsigned long minsz;
+
+ if (cmd == VFIO_DEVICE_GET_INFO) {
+ struct vfio_device_info info;
+
+ minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (info.argsz < minsz)
+ return -EINVAL;
+
+ info.flags = VFIO_DEVICE_FLAGS_PCI;
+
+ if (vdev->reset_works)
+ info.flags |= VFIO_DEVICE_FLAGS_RESET;
+
+ info.num_regions = VFIO_PCI_NUM_REGIONS;
+ info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+ return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+
+ } else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+ struct pci_dev *pdev = vdev->pdev;
+ struct vfio_region_info info;
+
+ minsz = offsetofend(struct vfio_region_info, offset);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_REGIONS)
+ return -EINVAL;
+
+ info.flags = 0;
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+
+ if (info.index == VFIO_PCI_CONFIG_REGION_INDEX) {
+ info.size = pdev->cfg_size;
+ } else if (pci_resource_start(pdev, info.index)) {
+ unsigned long flags;
+
+ flags = pci_resource_flags(pdev, info.index);
+
+ info.flags |= VFIO_REGION_INFO_FLAG_READ;
+
+ /* Report the actual ROM size instead of the BAR size,
+ * this gives the user an easy way to determine whether
+ * there's anything here w/o trying to read it. */
+ if (info.index == VFIO_PCI_ROM_REGION_INDEX) {
+ void __iomem *io;
+ size_t size;
+
+ io = pci_map_rom(pdev, &size);
+ info.size = io ? size : 0;
+ pci_unmap_rom(pdev, io);
+ } else if (flags & IORESOURCE_MEM) {
+ info.size = pci_resource_len(pdev, info.index);
+ info.flags |= (VFIO_REGION_INFO_FLAG_WRITE |
+ VFIO_REGION_INFO_FLAG_MMAP);
+ } else {
+ info.size = pci_resource_len(pdev, info.index);
+ info.flags |= VFIO_REGION_INFO_FLAG_WRITE;
+ }
+ } else
+ info.size = 0;
+
+ return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+
+ } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
+ struct vfio_irq_info info;
+
+ minsz = offsetofend(struct vfio_irq_info, count);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+ return -EINVAL;
+
+ info.flags = VFIO_IRQ_INFO_EVENTFD;
+
+ info.count = vfio_pci_get_irq_count(vdev, info.index);
+
+ if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+ info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+ VFIO_IRQ_INFO_AUTOMASKED);
+ else
+ info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+ return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+
+ } else if (cmd == VFIO_DEVICE_SET_IRQS) {
+ struct vfio_irq_set hdr;
+ u8 *data = NULL;
+ int ret = 0;
+
+ minsz = offsetofend(struct vfio_irq_set, count);
+
+ if (copy_from_user(&hdr, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+ hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+ VFIO_IRQ_SET_ACTION_TYPE_MASK))
+ return -EINVAL;
+
+ if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+ size_t size;
+
+ if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+ size = sizeof(uint8_t);
+ else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+ size = sizeof(int32_t);
+ else
+ return -EINVAL;
+
+ if (hdr.argsz - minsz < hdr.count * size ||
+ hdr.count > vfio_pci_get_irq_count(vdev, hdr.index))
+ return -EINVAL;
+
+ data = kmalloc(hdr.count * size, GFP_KERNEL);
+ if (!data)
+ return -ENOMEM;
+
+ if (copy_from_user(data, (void __user *)(arg + minsz),
+ hdr.count * size)) {
+ kfree(data);
+ return -EFAULT;
+ }
+ }
+
+ mutex_lock(&vdev->igate);
+
+ ret = vfio_pci_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
+ hdr.start, hdr.count, data);
+
+ mutex_unlock(&vdev->igate);
+ kfree(data);
+
+ return ret;
+
+ } else if (cmd == VFIO_DEVICE_RESET)
+ return vdev->reset_works ?
+ pci_reset_function(vdev->pdev) : -EINVAL;
+
+ return -ENOTTY;
+}
+
+static ssize_t vfio_pci_read(void *device_data, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ struct vfio_pci_device *vdev = device_data;
+ struct pci_dev *pdev = vdev->pdev;
+
+ if (index >= VFIO_PCI_NUM_REGIONS)
+ return -EINVAL;
+
+ if (index == VFIO_PCI_CONFIG_REGION_INDEX)
+ return vfio_pci_config_readwrite(vdev, buf, count, ppos, false);
+ else if (index == VFIO_PCI_ROM_REGION_INDEX)
+ return vfio_pci_mem_readwrite(vdev, buf, count, ppos, false);
+ else if (pci_resource_flags(pdev, index) & IORESOURCE_IO)
+ return vfio_pci_io_readwrite(vdev, buf, count, ppos, false);
+ else if (pci_resource_flags(pdev, index) & IORESOURCE_MEM)
+ return vfio_pci_mem_readwrite(vdev, buf, count, ppos, false);
+
+ return -EINVAL;
+}
+
+static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ struct vfio_pci_device *vdev = device_data;
+ struct pci_dev *pdev = vdev->pdev;
+
+ if (index >= VFIO_PCI_NUM_REGIONS)
+ return -EINVAL;
+
+ if (index == VFIO_PCI_CONFIG_REGION_INDEX)
+ return vfio_pci_config_readwrite(vdev, (char __user *)buf,
+ count, ppos, true);
+ else if (index == VFIO_PCI_ROM_REGION_INDEX)
+ return -EINVAL;
+ else if (pci_resource_flags(pdev, index) & IORESOURCE_IO)
+ return vfio_pci_io_readwrite(vdev, (char __user *)buf,
+ count, ppos, true);
+ else if (pci_resource_flags(pdev, index) & IORESOURCE_MEM) {
+ return vfio_pci_mem_readwrite(vdev, (char __user *)buf,
+ count, ppos, true);
+ }
+
+ return -EINVAL;
+}
+
+static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
+{
+ struct vfio_pci_device *vdev = device_data;
+ struct pci_dev *pdev = vdev->pdev;
+ unsigned int index;
+ u64 phys_len, req_len, pgoff, req_start, phys;
+ int ret;
+
+ index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+
+ if (vma->vm_end < vma->vm_start)
+ return -EINVAL;
+ if ((vma->vm_flags & VM_SHARED) == 0)
+ return -EINVAL;
+ if (index >= VFIO_PCI_ROM_REGION_INDEX)
+ return -EINVAL;
+ if (!(pci_resource_flags(pdev, index) & IORESOURCE_MEM))
+ return -EINVAL;
+
+ phys_len = pci_resource_len(pdev, index);
+ req_len = vma->vm_end - vma->vm_start;
+ pgoff = vma->vm_pgoff &
+ ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+ req_start = pgoff << PAGE_SHIFT;
+
+ if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
+ return -EINVAL;
+
+ if (index == vdev->msix_bar) {
+ /*
+ * Disallow mmaps overlapping the MSI-X table; users don't
+ * get to touch this directly. We could find somewhere
+ * else to map the overlap, but page granularity is only
+ * a recommendation, not a requirement, so the user needs
+ * to know which bits are real. Requiring them to mmap
+ * around the table makes that clear.
+ */
+
+ /* If neither entirely above nor below, then it overlaps */
+ if (!(req_start >= vdev->msix_offset + vdev->msix_size ||
+ req_start + req_len <= vdev->msix_offset))
+ return -EINVAL;
+ }
+
+ /*
+ * Even though we don't make use of the barmap for the mmap,
+ * we need to request the region and the barmap tracks that.
+ */
+ if (!vdev->barmap[index]) {
+ ret = pci_request_selected_regions(pdev,
+ 1 << index, "vfio-pci");
+ if (ret)
+ return ret;
+
+ vdev->barmap[index] = pci_iomap(pdev, index, 0);
+ }
+
+ vma->vm_private_data = vdev;
+ vma->vm_flags |= (VM_IO | VM_RESERVED);
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+ phys = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+
+ return remap_pfn_range(vma, vma->vm_start, phys,
+ req_len, vma->vm_page_prot);
+}
+
+static const struct vfio_device_ops vfio_pci_ops = {
+ .name = "vfio-pci",
+ .open = vfio_pci_open,
+ .release = vfio_pci_release,
+ .ioctl = vfio_pci_ioctl,
+ .read = vfio_pci_read,
+ .write = vfio_pci_write,
+ .mmap = vfio_pci_mmap,
+};
+
+static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+ u8 type;
+ struct vfio_pci_device *vdev;
+ struct iommu_group *group;
+ int ret;
+
+ pci_read_config_byte(pdev, PCI_HEADER_TYPE, &type);
+ if ((type & 0x7f) != PCI_HEADER_TYPE_NORMAL)
+ return -EINVAL;
+
+ group = iommu_group_get(&pdev->dev);
+ if (!group)
+ return -EINVAL;
+
+ vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+ if (!vdev) {
+ iommu_group_put(group);
+ return -ENOMEM;
+ }
+
+ vdev->pdev = pdev;
+ vdev->irq_type = VFIO_PCI_NUM_IRQS;
+ mutex_init(&vdev->igate);
+ spin_lock_init(&vdev->irqlock);
+ atomic_set(&vdev->refcnt, 0);
+
+ ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
+ if (ret) {
+ iommu_group_put(group);
+ kfree(vdev);
+ }
+
+ return ret;
+}
+
+static void vfio_pci_remove(struct pci_dev *pdev)
+{
+ struct vfio_pci_device *vdev;
+
+ vdev = vfio_del_group_dev(&pdev->dev);
+ if (!vdev)
+ return;
+
+ iommu_group_put(pdev->dev.iommu_group);
+ kfree(vdev);
+}
+
+static struct pci_driver vfio_pci_driver = {
+ .name = "vfio-pci",
+ .id_table = NULL, /* only dynamic ids */
+ .probe = vfio_pci_probe,
+ .remove = vfio_pci_remove,
+};
+
+void __exit vfio_pci_cleanup(void)
+{
+ pci_unregister_driver(&vfio_pci_driver);
+ vfio_pci_virqfd_exit();
+ vfio_pci_uninit_perm_bits();
+}
+
+int __init vfio_pci_init(void)
+{
+ int ret;
+
+ /* Allocate shared config space permission data used by all devices */
+ ret = vfio_pci_init_perm_bits();
+ if (ret)
+ return ret;
+
+ /* Start the virqfd cleanup handler */
+ ret = vfio_pci_virqfd_init();
+ if (ret)
+ goto out_virqfd;
+
+ /* Register and scan for devices */
+ ret = pci_register_driver(&vfio_pci_driver);
+ if (ret)
+ goto out_driver;
+
+ return 0;
+
+out_driver:
+ vfio_pci_virqfd_exit();
+out_virqfd:
+ vfio_pci_uninit_perm_bits();
+ return ret;
+}
+
+module_init(vfio_pci_init);
+module_exit(vfio_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
new file mode 100644
index 0000000..2c2e9a7
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -0,0 +1,1522 @@
+/*
+ * VFIO PCI config space virtualization
+ *
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ */
+
+/*
+ * This code handles reading and writing of PCI configuration registers.
+ * This is hairy because we want to allow a lot of flexibility to the
+ * user driver, but cannot trust it with all of the config fields.
+ * Tables determine which fields can be read and written, as well as
+ * which fields are 'virtualized' - special actions and translations to
+ * make it appear to the user that he has control, when in fact things
+ * must be negotiated with the underlying OS.
+ */
+
+#include <linux/fs.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+#include "vfio_pci_private.h"
+
+#define PCI_CFG_SPACE_SIZE 256
+
+/* Useful "pseudo" capabilities */
+#define PCI_CAP_ID_BASIC 0
+#define PCI_CAP_ID_INVALID 0xFF
+
+#define is_bar(offset) \
+ ((offset >= PCI_BASE_ADDRESS_0 && offset < PCI_BASE_ADDRESS_5 + 4) || \
+ (offset >= PCI_ROM_ADDRESS && offset < PCI_ROM_ADDRESS + 4))
+
+/*
+ * Lengths of PCI Config Capabilities
+ * 0: Removed from the user visible capability list
+ * FF: Variable length
+ */
+static u8 pci_cap_length[] = {
+ [PCI_CAP_ID_BASIC] = PCI_STD_HEADER_SIZEOF, /* pci config header */
+ [PCI_CAP_ID_PM] = PCI_PM_SIZEOF,
+ [PCI_CAP_ID_AGP] = PCI_AGP_SIZEOF,
+ [PCI_CAP_ID_VPD] = PCI_CAP_VPD_SIZEOF,
+ [PCI_CAP_ID_SLOTID] = 0, /* bridge - don't care */
+ [PCI_CAP_ID_MSI] = 0xFF, /* 10, 14, 20, or 24 */
+ [PCI_CAP_ID_CHSWP] = 0, /* cpci - not yet */
+ [PCI_CAP_ID_PCIX] = 0xFF, /* 8 or 24 */
+ [PCI_CAP_ID_HT] = 0xFF, /* hypertransport */
+ [PCI_CAP_ID_VNDR] = 0xFF, /* variable */
+ [PCI_CAP_ID_DBG] = 0, /* debug - don't care */
+ [PCI_CAP_ID_CCRC] = 0, /* cpci - not yet */
+ [PCI_CAP_ID_SHPC] = 0, /* hotswap - not yet */
+ [PCI_CAP_ID_SSVID] = 0, /* bridge - don't care */
+ [PCI_CAP_ID_AGP3] = 0, /* AGP8x - not yet */
+ [PCI_CAP_ID_SECDEV] = 0, /* secure device not yet */
+ [PCI_CAP_ID_EXP] = 0xFF, /* 20 or 44 */
+ [PCI_CAP_ID_MSIX] = PCI_CAP_MSIX_SIZEOF,
+ [PCI_CAP_ID_SATA] = 0xFF,
+ [PCI_CAP_ID_AF] = PCI_CAP_AF_SIZEOF,
+};
+
+/*
+ * Lengths of PCIe/PCI-X Extended Config Capabilities
+ * 0: Removed or masked from the user visible capability list
+ * FF: Variable length
+ */
+static u16 pci_ext_cap_length[] = {
+ [PCI_EXT_CAP_ID_ERR] = PCI_ERR_ROOT_COMMAND,
+ [PCI_EXT_CAP_ID_VC] = 0xFF,
+ [PCI_EXT_CAP_ID_DSN] = PCI_EXT_CAP_DSN_SIZEOF,
+ [PCI_EXT_CAP_ID_PWR] = PCI_EXT_CAP_PWR_SIZEOF,
+ [PCI_EXT_CAP_ID_RCLD] = 0, /* root only - don't care */
+ [PCI_EXT_CAP_ID_RCILC] = 0, /* root only - don't care */
+ [PCI_EXT_CAP_ID_RCEC] = 0, /* root only - don't care */
+ [PCI_EXT_CAP_ID_MFVC] = 0xFF,
+ [PCI_EXT_CAP_ID_VC9] = 0xFF, /* same as CAP_ID_VC */
+ [PCI_EXT_CAP_ID_RCRB] = 0, /* root only - don't care */
+ [PCI_EXT_CAP_ID_VNDR] = 0xFF,
+ [PCI_EXT_CAP_ID_CAC] = 0, /* obsolete */
+ [PCI_EXT_CAP_ID_ACS] = 0xFF,
+ [PCI_EXT_CAP_ID_ARI] = PCI_EXT_CAP_ARI_SIZEOF,
+ [PCI_EXT_CAP_ID_ATS] = PCI_EXT_CAP_ATS_SIZEOF,
+ [PCI_EXT_CAP_ID_SRIOV] = PCI_EXT_CAP_SRIOV_SIZEOF,
+ [PCI_EXT_CAP_ID_MRIOV] = 0, /* not yet */
+ [PCI_EXT_CAP_ID_MCAST] = PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF,
+ [PCI_EXT_CAP_ID_PRI] = PCI_EXT_CAP_PRI_SIZEOF,
+ [PCI_EXT_CAP_ID_AMD_XXX] = 0, /* not yet */
+ [PCI_EXT_CAP_ID_REBAR] = 0xFF,
+ [PCI_EXT_CAP_ID_DPA] = 0xFF,
+ [PCI_EXT_CAP_ID_TPH] = 0xFF,
+ [PCI_EXT_CAP_ID_LTR] = PCI_EXT_CAP_LTR_SIZEOF,
+ [PCI_EXT_CAP_ID_SECPCI] = 0, /* not yet */
+ [PCI_EXT_CAP_ID_PMUX] = 0, /* not yet */
+ [PCI_EXT_CAP_ID_PASID] = 0, /* not yet */
+};
+
+/*
+ * Read/Write Permission Bits - one bit for each bit in capability
+ * Any field can be read if it exists, but what is read depends on
+ * whether the field is 'virtualized', or just pass thru to the
+ * hardware. Any virtualized field is also virtualized for writes.
+ * Writes are only permitted if they have a 1 bit here.
+ */
+struct perm_bits {
+ u8 *virt; /* read/write virtual data, not hw */
+ u8 *write; /* writeable bits */
+ int (*readfn)(struct vfio_pci_device *vdev, int pos, int count,
+ struct perm_bits *perm, int offset, u32 *val);
+ int (*writefn)(struct vfio_pci_device *vdev, int pos, int count,
+ struct perm_bits *perm, int offset, u32 val);
+};
+
+#define NO_VIRT 0
+#define ALL_VIRT 0xFFFFFFFFU
+#define NO_WRITE 0
+#define ALL_WRITE 0xFFFFFFFFU
+
+static int vfio_user_config_read(struct pci_dev *pdev, int offset,
+ u32 *val, int count)
+{
+ int ret = -EINVAL;
+
+ *val = 0;
+
+ switch (count) {
+ case 1:
+ ret = pci_user_read_config_byte(pdev, offset, (u8 *)val);
+ break;
+ case 2:
+ ret = pci_user_read_config_word(pdev, offset, (u16 *)val);
+ break;
+ case 4:
+ ret = pci_user_read_config_dword(pdev, offset, val);
+ break;
+ }
+
+ *val = cpu_to_le32(*val);
+
+ return pcibios_err_to_errno(ret);
+}
+
+static int vfio_user_config_write(struct pci_dev *pdev, int offset,
+ u32 val, int count)
+{
+ int ret = -EINVAL;
+
+ val = le32_to_cpu(val);
+
+ switch (count) {
+ case 1:
+ ret = pci_user_write_config_byte(pdev, offset, val);
+ break;
+ case 2:
+ ret = pci_user_write_config_word(pdev, offset, val);
+ break;
+ case 4:
+ ret = pci_user_write_config_dword(pdev, offset, val);
+ break;
+ }
+
+ return pcibios_err_to_errno(ret);
+}
+
+static int vfio_default_config_read(struct vfio_pci_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, u32 *val)
+{
+ u32 virt = 0;
+
+ memcpy(val, vdev->vconfig + pos, count);
+
+ memcpy(&virt, perm->virt + offset, count);
+
+ /* Any non-virtualized bits? */
+ if (cpu_to_le32(~0U >> (32 - (count * 8))) != virt) {
+ struct pci_dev *pdev = vdev->pdev;
+ u32 phys_val = 0;
+ int ret;
+
+ ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+ if (ret)
+ return ret;
+
+ *val = (phys_val & ~virt) | (*val & virt);
+ }
+
+ return count;
+}
+
+static int vfio_default_config_write(struct vfio_pci_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, u32 val)
+{
+ u32 virt = 0, write = 0;
+
+ memcpy(&write, perm->write + offset, count);
+
+ if (!write)
+ return count; /* drop, no writable bits */
+
+ memcpy(&virt, perm->virt + offset, count);
+
+ /* Virtualized and writable bits go to vconfig */
+ if (write & virt) {
+ u32 virt_val = 0;
+
+ memcpy(&virt_val, vdev->vconfig + pos, count);
+
+ virt_val &= ~(write & virt);
+ virt_val |= (val & (write & virt));
+
+ memcpy(vdev->vconfig + pos, &virt_val, count);
+ }
+
+ /* Non-virtualized and writable bits go to hardware */
+ if (write & ~virt) {
+ struct pci_dev *pdev = vdev->pdev;
+ u32 phys_val = 0;
+ int ret;
+
+ ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+ if (ret)
+ return ret;
+
+ phys_val &= ~(write & ~virt);
+ phys_val |= (val & (write & ~virt));
+
+ ret = vfio_user_config_write(pdev, pos, phys_val, count);
+ if (ret)
+ return ret;
+ }
+
+ return count;
+}
+
+/* Allow direct read from hardware, except for capability next pointer */
+static int vfio_direct_config_read(struct vfio_pci_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, u32 *val)
+{
+ int ret;
+
+ ret = vfio_user_config_read(vdev->pdev, pos, val, count);
+ if (ret)
+ return ret;
+
+ if (pos >= PCI_CFG_SPACE_SIZE) { /* Extended cap header mangling */
+ if (offset < 4)
+ memcpy(val, vdev->vconfig + pos, count);
+ } else if (pos >= PCI_STD_HEADER_SIZEOF) { /* Std cap mangling */
+ if (offset == PCI_CAP_LIST_ID && count > 1)
+ memcpy(val, vdev->vconfig + pos,
+ min(PCI_CAP_FLAGS, count));
+ else if (offset == PCI_CAP_LIST_NEXT)
+ memcpy(val, vdev->vconfig + pos, 1);
+ }
+
+ return count;
+}
+
+static int vfio_direct_config_write(struct vfio_pci_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, u32 val)
+{
+ int ret;
+
+ ret = vfio_user_config_write(vdev->pdev, pos, val, count);
+ if (ret)
+ return ret;
+
+ return count;
+}
+
+/* Default all regions to read-only, no-virtualization */
+static struct perm_bits cap_perms[PCI_CAP_ID_MAX + 1] = {
+ [0 ... PCI_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
+};
+static struct perm_bits ecap_perms[PCI_EXT_CAP_ID_MAX + 1] = {
+ [0 ... PCI_EXT_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
+};
+
+static void free_perm_bits(struct perm_bits *perm)
+{
+ kfree(perm->virt);
+ kfree(perm->write);
+ perm->virt = NULL;
+ perm->write = NULL;
+}
+
+static int alloc_perm_bits(struct perm_bits *perm, int size)
+{
+ /*
+ * Round up all permission bits to the next dword, this lets us
+ * ignore whether a read/write exceeds the defined capability
+ * structure. We can do this because:
+ * - Standard config space is already dword aligned
+ * - Capabilities are all dword aligned (bits 0:1 of next reserved)
+ * - Express capabilities defined as dword aligned
+ */
+ size = round_up(size, 4);
+
+ /*
+ * Zero state is
+ * - All Readable, None Writeable, None Virtualized
+ */
+ perm->virt = kzalloc(size, GFP_KERNEL);
+ perm->write = kzalloc(size, GFP_KERNEL);
+ if (!perm->virt || !perm->write) {
+ free_perm_bits(perm);
+ return -ENOMEM;
+ }
+
+ perm->readfn = vfio_default_config_read;
+ perm->writefn = vfio_default_config_write;
+
+ return 0;
+}
+
+/*
+ * Helper functions for filling in permission tables
+ */
+static inline void p_setb(struct perm_bits *p, int off, u8 virt, u8 write)
+{
+ p->virt[off] = virt;
+ p->write[off] = write;
+}
+
+/* Handle endian-ness - pci and tables are little-endian */
+static inline void p_setw(struct perm_bits *p, int off, u16 virt, u16 write)
+{
+ *(u16 *)(&p->virt[off]) = cpu_to_le16(virt);
+ *(u16 *)(&p->write[off]) = cpu_to_le16(write);
+}
+
+/* Handle endian-ness - pci and tables are little-endian */
+static inline void p_setd(struct perm_bits *p, int off, u32 virt, u32 write)
+{
+ *(u32 *)(&p->virt[off]) = cpu_to_le32(virt);
+ *(u32 *)(&p->write[off]) = cpu_to_le32(write);
+}
+
+/*
+ * Restore the *real* BARs after we detect a FLR or backdoor reset.
+ * (backdoor = some device specific technique that we didn't catch)
+ */
+static void vfio_bar_restore(struct vfio_pci_device *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u32 *rbar = vdev->rbar;
+ int i;
+
+ if (pdev->is_virtfn)
+ return;
+
+ printk(KERN_INFO "%s: %s reset recovery - restoring bars\n",
+ __func__, dev_name(&pdev->dev));
+
+ for (i = PCI_BASE_ADDRESS_0; i <= PCI_BASE_ADDRESS_5; i += 4, rbar++)
+ pci_user_write_config_dword(pdev, i, *rbar);
+
+ pci_user_write_config_dword(pdev, PCI_ROM_ADDRESS, *rbar);
+}
+
+static u32 vfio_generate_bar_flags(struct pci_dev *pdev, int bar)
+{
+ unsigned long flags = pci_resource_flags(pdev, bar);
+ u32 val;
+
+ if (flags & IORESOURCE_IO)
+ return cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
+
+ val = PCI_BASE_ADDRESS_SPACE_MEMORY;
+
+ if (flags & IORESOURCE_PREFETCH)
+ val |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+
+ if (flags & IORESOURCE_MEM_64)
+ val |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+
+ return cpu_to_le32(val);
+}
+
+/*
+ * Pretend we're hardware and tweak the values of the *virtual* PCI BARs
+ * to reflect the hardware capabilities. This implements BAR sizing.
+ */
+static void vfio_bar_fixup(struct vfio_pci_device *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int i;
+ u32 *bar;
+ u64 mask;
+
+ bar = (u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_0];
+
+ for (i = PCI_STD_RESOURCES; i <= PCI_STD_RESOURCE_END; i++, bar++) {
+ if (!pci_resource_start(pdev, i)) {
+ *bar = 0; /* Unmapped by host = unimplemented to user */
+ continue;
+ }
+
+ mask = ~(pci_resource_len(pdev, i) - 1);
+
+ *bar &= cpu_to_le32((u32)mask);
+ *bar |= vfio_generate_bar_flags(pdev, i);
+
+ if (*bar & cpu_to_le32(PCI_BASE_ADDRESS_MEM_TYPE_64)) {
+ bar++;
+ *bar &= cpu_to_le32((u32)(mask >> 32));
+ i++;
+ }
+ }
+
+ bar = (u32 *)&vdev->vconfig[PCI_ROM_ADDRESS];
+
+ /*
+ * NB. we expose the actual BAR size here, regardless of whether
+ * we can read it. When we report the REGION_INFO for the ROM
+ * we report what PCI tells us is the actual ROM size.
+ */
+ if (pci_resource_start(pdev, PCI_ROM_RESOURCE)) {
+ mask = ~(pci_resource_len(pdev, PCI_ROM_RESOURCE) - 1);
+ mask |= PCI_ROM_ADDRESS_ENABLE;
+ *bar &= cpu_to_le32((u32)mask);
+ } else
+ *bar = 0;
+
+ vdev->bardirty = false;
+}
+
+static int vfio_basic_config_read(struct vfio_pci_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, u32 *val)
+{
+ if (is_bar(offset)) /* pos == offset for basic config */
+ vfio_bar_fixup(vdev);
+
+ count = vfio_default_config_read(vdev, pos, count, perm, offset, val);
+
+ /* Mask in virtual memory enable for SR-IOV devices */
+ if (offset == PCI_COMMAND && vdev->pdev->is_virtfn) {
+ u16 cmd = le16_to_cpu(*(u16 *)&vdev->vconfig[PCI_COMMAND]);
+ *val = (le32_to_cpu(*val) | (cmd & PCI_COMMAND_MEMORY));
+ *val = cpu_to_le32(*val);
+ }
+
+ return count;
+}
+
+static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, u32 val)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u16 phys_cmd, *virt_cmd, new_cmd = 0;
+ int ret;
+
+ virt_cmd = (u16 *)&vdev->vconfig[PCI_COMMAND];
+
+ if (offset == PCI_COMMAND) {
+ bool phys_mem, virt_mem, new_mem, phys_io, virt_io, new_io;
+
+ ret = pci_user_read_config_word(pdev, PCI_COMMAND, &phys_cmd);
+ if (ret)
+ return ret;
+
+ new_cmd = le32_to_cpu(val);
+
+ phys_mem = !!(phys_cmd & PCI_COMMAND_MEMORY);
+ virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
+ new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
+
+ phys_io = !!(phys_cmd & PCI_COMMAND_IO);
+ virt_io = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_IO);
+ new_io = !!(new_cmd & PCI_COMMAND_IO);
+
+ /*
+ * If the user is writing mem/io enable (new_mem/io) and we
+ * think it's already enabled (virt_mem/io), but the hardware
+ * shows it disabled (phys_mem/io), then the device has
+ * undergone some kind of backdoor reset and needs to be
+ * restored before we allow it to enable the bars.
+ * SR-IOV devices will trigger this, but we catch them later
+ */
+ if ((new_mem && virt_mem && !phys_mem) ||
+ (new_io && virt_io && !phys_io))
+ vfio_bar_restore(vdev);
+ }
+
+ count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
+ if (count < 0)
+ return count;
+
+ /*
+ * Save current memory/io enable bits in vconfig to allow for
+ * the test above next time.
+ */
+ if (offset == PCI_COMMAND) {
+ u16 mask = PCI_COMMAND_MEMORY | PCI_COMMAND_IO;
+
+ *virt_cmd &= cpu_to_le16(~mask);
+ *virt_cmd |= cpu_to_le16(new_cmd & mask);
+ }
+
+ /* Emulate INTx disable */
+ if (offset >= PCI_COMMAND && offset <= PCI_COMMAND + 1) {
+ bool virt_intx_disable;
+
+ virt_intx_disable = !!(le16_to_cpu(*virt_cmd) &
+ PCI_COMMAND_INTX_DISABLE);
+
+ if (virt_intx_disable && !vdev->virq_disabled) {
+ vdev->virq_disabled = true;
+ vfio_pci_intx_mask(vdev);
+ } else if (!virt_intx_disable && vdev->virq_disabled) {
+ vdev->virq_disabled = false;
+ vfio_pci_intx_unmask(vdev);
+ }
+ }
+
+ if (is_bar(offset))
+ vdev->bardirty = true;
+
+ return count;
+}
+
+/* Permissions for the Basic PCI Header */
+static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
+{
+ if (alloc_perm_bits(perm, PCI_STD_HEADER_SIZEOF))
+ return -ENOMEM;
+
+ perm->readfn = vfio_basic_config_read;
+ perm->writefn = vfio_basic_config_write;
+
+ /* Virtualized for SR-IOV functions, which just have FFFF */
+ p_setw(perm, PCI_VENDOR_ID, (u16)ALL_VIRT, NO_WRITE);
+ p_setw(perm, PCI_DEVICE_ID, (u16)ALL_VIRT, NO_WRITE);
+
+ /*
+ * Virtualize INTx disable, we use it internally for interrupt
+ * control and can emulate it for non-PCI 2.3 devices.
+ */
+ p_setw(perm, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE, (u16)ALL_WRITE);
+
+ /* Virtualize capability list, we might want to skip/disable */
+ p_setw(perm, PCI_STATUS, PCI_STATUS_CAP_LIST, NO_WRITE);
+
+ /* No harm to write */
+ p_setb(perm, PCI_CACHE_LINE_SIZE, NO_VIRT, (u8)ALL_WRITE);
+ p_setb(perm, PCI_LATENCY_TIMER, NO_VIRT, (u8)ALL_WRITE);
+ p_setb(perm, PCI_BIST, NO_VIRT, (u8)ALL_WRITE);
+
+ /* Virtualize all bars, can't touch the real ones */
+ p_setd(perm, PCI_BASE_ADDRESS_0, ALL_VIRT, ALL_WRITE);
+ p_setd(perm, PCI_BASE_ADDRESS_1, ALL_VIRT, ALL_WRITE);
+ p_setd(perm, PCI_BASE_ADDRESS_2, ALL_VIRT, ALL_WRITE);
+ p_setd(perm, PCI_BASE_ADDRESS_3, ALL_VIRT, ALL_WRITE);
+ p_setd(perm, PCI_BASE_ADDRESS_4, ALL_VIRT, ALL_WRITE);
+ p_setd(perm, PCI_BASE_ADDRESS_5, ALL_VIRT, ALL_WRITE);
+ p_setd(perm, PCI_ROM_ADDRESS, ALL_VIRT, ALL_WRITE);
+
+ /* Allow us to adjust capability chain */
+ p_setb(perm, PCI_CAPABILITY_LIST, (u8)ALL_VIRT, NO_WRITE);
+
+ /* Sometimes used by sw, just virtualize */
+ p_setb(perm, PCI_INTERRUPT_LINE, (u8)ALL_VIRT, (u8)ALL_WRITE);
+ return 0;
+}
+
+/* Permissions for the Power Management capability */
+static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
+{
+ if (alloc_perm_bits(perm, pci_cap_length[PCI_CAP_ID_PM]))
+ return -ENOMEM;
+
+ /*
+ * We always virtualize the next field so we can remove
+ * capabilities from the chain if we want to.
+ */
+ p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+ /*
+ * Power management is defined *per function*,
+ * so we let the user write this
+ */
+ p_setd(perm, PCI_PM_CTRL, NO_VIRT, ALL_WRITE);
+ return 0;
+}
+
+/* Permissions for PCI-X capability */
+static int __init init_pci_cap_pcix_perm(struct perm_bits *perm)
+{
+ /* Alloc 24, but only 8 are used in v0 */
+ if (alloc_perm_bits(perm, PCI_CAP_PCIX_SIZEOF_V12))
+ return -ENOMEM;
+
+ p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+ p_setw(perm, PCI_X_CMD, NO_VIRT, (u16)ALL_WRITE);
+ p_setd(perm, PCI_X_ECC_CSR, NO_VIRT, ALL_WRITE);
+ return 0;
+}
+
+/* Permissions for PCI Express capability */
+static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
+{
+ /* Alloc larger of two possible sizes */
+ if (alloc_perm_bits(perm, PCI_CAP_EXP_ENDPOINT_SIZEOF_V2))
+ return -ENOMEM;
+
+ p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+ /*
+ * Allow writes to device control fields (includes FLR!)
+ * but not to devctl_phantom which could confuse IOMMU
+ * or to the ARI bit in devctl2 which is set at probe time
+ */
+ p_setw(perm, PCI_EXP_DEVCTL, NO_VIRT, ~PCI_EXP_DEVCTL_PHANTOM);
+ p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);
+ return 0;
+}
+
+/* Permissions for Advanced Function capability */
+static int __init init_pci_cap_af_perm(struct perm_bits *perm)
+{
+ if (alloc_perm_bits(perm, pci_cap_length[PCI_CAP_ID_AF]))
+ return -ENOMEM;
+
+ p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+ p_setb(perm, PCI_AF_CTRL, NO_VIRT, PCI_AF_CTRL_FLR);
+ return 0;
+}
+
+/* Permissions for Advanced Error Reporting extended capability */
+static int __init init_pci_ext_cap_err_perm(struct perm_bits *perm)
+{
+ u32 mask;
+
+ if (alloc_perm_bits(perm, pci_ext_cap_length[PCI_EXT_CAP_ID_ERR]))
+ return -ENOMEM;
+
+ /*
+ * Virtualize the first dword of all express capabilities
+ * because it includes the next pointer. This lets us later
+ * remove capabilities from the chain if we need to.
+ */
+ p_setd(perm, 0, ALL_VIRT, NO_WRITE);
+
+ /* Writable bits mask */
+ mask = PCI_ERR_UNC_TRAIN | /* Training */
+ PCI_ERR_UNC_DLP | /* Data Link Protocol */
+ PCI_ERR_UNC_SURPDN | /* Surprise Down */
+ PCI_ERR_UNC_POISON_TLP | /* Poisoned TLP */
+ PCI_ERR_UNC_FCP | /* Flow Control Protocol */
+ PCI_ERR_UNC_COMP_TIME | /* Completion Timeout */
+ PCI_ERR_UNC_COMP_ABORT | /* Completer Abort */
+ PCI_ERR_UNC_UNX_COMP | /* Unexpected Completion */
+ PCI_ERR_UNC_RX_OVER | /* Receiver Overflow */
+ PCI_ERR_UNC_MALF_TLP | /* Malformed TLP */
+ PCI_ERR_UNC_ECRC | /* ECRC Error Status */
+ PCI_ERR_UNC_UNSUP | /* Unsupported Request */
+ PCI_ERR_UNC_ACSV | /* ACS Violation */
+ PCI_ERR_UNC_INTN | /* internal error */
+ PCI_ERR_UNC_MCBTLP | /* MC blocked TLP */
+ PCI_ERR_UNC_ATOMEG | /* Atomic egress blocked */
+ PCI_ERR_UNC_TLPPRE; /* TLP prefix blocked */
+ p_setd(perm, PCI_ERR_UNCOR_STATUS, NO_VIRT, mask);
+ p_setd(perm, PCI_ERR_UNCOR_MASK, NO_VIRT, mask);
+ p_setd(perm, PCI_ERR_UNCOR_SEVER, NO_VIRT, mask);
+
+ mask = PCI_ERR_COR_RCVR | /* Receiver Error Status */
+ PCI_ERR_COR_BAD_TLP | /* Bad TLP Status */
+ PCI_ERR_COR_BAD_DLLP | /* Bad DLLP Status */
+ PCI_ERR_COR_REP_ROLL | /* REPLAY_NUM Rollover */
+ PCI_ERR_COR_REP_TIMER | /* Replay Timer Timeout */
+ PCI_ERR_COR_ADV_NFAT | /* Advisory Non-Fatal */
+ PCI_ERR_COR_INTERNAL | /* Corrected Internal */
+ PCI_ERR_COR_LOG_OVER; /* Header Log Overflow */
+ p_setd(perm, PCI_ERR_COR_STATUS, NO_VIRT, mask);
+ p_setd(perm, PCI_ERR_COR_MASK, NO_VIRT, mask);
+
+ mask = PCI_ERR_CAP_ECRC_GENE | /* ECRC Generation Enable */
+ PCI_ERR_CAP_ECRC_CHKE; /* ECRC Check Enable */
+ p_setd(perm, PCI_ERR_CAP, NO_VIRT, mask);
+ return 0;
+}
+
+/* Permissions for Power Budgeting extended capability */
+static int __init init_pci_ext_cap_pwr_perm(struct perm_bits *perm)
+{
+ if (alloc_perm_bits(perm, pci_ext_cap_length[PCI_EXT_CAP_ID_PWR]))
+ return -ENOMEM;
+
+ p_setd(perm, 0, ALL_VIRT, NO_WRITE);
+
+ /* Writing the data selector is OK, the info is still read-only */
+ p_setb(perm, PCI_PWR_DATA, NO_VIRT, (u8)ALL_WRITE);
+ return 0;
+}
+
+/*
+ * Free and initialize the shared permission tables
+ */
+void vfio_pci_uninit_perm_bits(void)
+{
+ free_perm_bits(&cap_perms[PCI_CAP_ID_BASIC]);
+
+ free_perm_bits(&cap_perms[PCI_CAP_ID_PM]);
+ free_perm_bits(&cap_perms[PCI_CAP_ID_PCIX]);
+ free_perm_bits(&cap_perms[PCI_CAP_ID_EXP]);
+ free_perm_bits(&cap_perms[PCI_CAP_ID_AF]);
+
+ free_perm_bits(&ecap_perms[PCI_EXT_CAP_ID_ERR]);
+ free_perm_bits(&ecap_perms[PCI_EXT_CAP_ID_PWR]);
+}
+
+int __init vfio_pci_init_perm_bits(void)
+{
+ int ret;
+
+ /* Basic config space */
+ ret = init_pci_cap_basic_perm(&cap_perms[PCI_CAP_ID_BASIC]);
+
+ /* Capabilities */
+ ret |= init_pci_cap_pm_perm(&cap_perms[PCI_CAP_ID_PM]);
+ cap_perms[PCI_CAP_ID_VPD].writefn = vfio_direct_config_write;
+ ret |= init_pci_cap_pcix_perm(&cap_perms[PCI_CAP_ID_PCIX]);
+ cap_perms[PCI_CAP_ID_VNDR].writefn = vfio_direct_config_write;
+ ret |= init_pci_cap_exp_perm(&cap_perms[PCI_CAP_ID_EXP]);
+ ret |= init_pci_cap_af_perm(&cap_perms[PCI_CAP_ID_AF]);
+
+ /* Extended capabilities */
+ ret |= init_pci_ext_cap_err_perm(&ecap_perms[PCI_EXT_CAP_ID_ERR]);
+ ret |= init_pci_ext_cap_pwr_perm(&ecap_perms[PCI_EXT_CAP_ID_PWR]);
+ ecap_perms[PCI_EXT_CAP_ID_VNDR].writefn = vfio_direct_config_write;
+
+ if (ret)
+ vfio_pci_uninit_perm_bits();
+
+ return ret;
+}
+
+static int vfio_find_cap_start(struct vfio_pci_device *vdev, int pos)
+{
+ u8 cap;
+ int base = (pos >= PCI_CFG_SPACE_SIZE) ? PCI_CFG_SPACE_SIZE :
+ PCI_STD_HEADER_SIZEOF;
+ base /= 4;
+ pos /= 4;
+
+ cap = vdev->pci_config_map[pos];
+
+ if (cap == PCI_CAP_ID_BASIC)
+ return 0;
+
+ /* XXX Can we have two abutting capabilities of the same type? */
+ while (pos - 1 >= base && vdev->pci_config_map[pos - 1] == cap)
+ pos--;
+
+ return pos * 4;
+}
+
+static int vfio_msi_config_read(struct vfio_pci_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, u32 *val)
+{
+ /* Update max available queue size from msi_qmax */
+ if (offset <= PCI_MSI_FLAGS && offset + count >= PCI_MSI_FLAGS) {
+ u16 *flags;
+ int start;
+
+ start = vfio_find_cap_start(vdev, pos);
+
+ flags = (u16 *)&vdev->vconfig[start + PCI_MSI_FLAGS];
+
+ *flags &= cpu_to_le16(~PCI_MSI_FLAGS_QMASK);
+ *flags |= cpu_to_le16(vdev->msi_qmax << 1);
+ }
+
+ return vfio_default_config_read(vdev, pos, count, perm, offset, val);
+}
+
+static int vfio_msi_config_write(struct vfio_pci_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, u32 val)
+{
+ count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
+ if (count < 0)
+ return count;
+
+ /* Fixup and write configured queue size and enable to hardware */
+ if (offset <= PCI_MSI_FLAGS && offset + count >= PCI_MSI_FLAGS) {
+ u16 *pflags, flags;
+ int start, ret;
+
+ start = vfio_find_cap_start(vdev, pos);
+
+ pflags = (u16 *)&vdev->vconfig[start + PCI_MSI_FLAGS];
+
+ flags = le16_to_cpu(*pflags);
+
+ /* MSI is enabled via ioctl */
+ if (!is_msi(vdev))
+ flags &= ~PCI_MSI_FLAGS_ENABLE;
+
+ /* Check queue size */
+ if ((flags & PCI_MSI_FLAGS_QSIZE) >> 4 > vdev->msi_qmax) {
+ flags &= ~PCI_MSI_FLAGS_QSIZE;
+ flags |= vdev->msi_qmax << 4;
+ }
+
+ /* Write back to virt and to hardware */
+ *pflags = cpu_to_le16(flags);
+ ret = pci_user_write_config_word(vdev->pdev,
+ start + PCI_MSI_FLAGS,
+ flags);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+ }
+
+ return count;
+}
+
+/*
+ * MSI determination is per-device, so this routine gets used beyond
+ * initialization time. Don't add __init
+ */
+static int init_pci_cap_msi_perm(struct perm_bits *perm, int len, u16 flags)
+{
+ if (alloc_perm_bits(perm, len))
+ return -ENOMEM;
+
+ perm->readfn = vfio_msi_config_read;
+ perm->writefn = vfio_msi_config_write;
+
+ p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+ /*
+ * The upper byte of the control register is reserved,
+ * just setup the lower byte.
+ */
+ p_setb(perm, PCI_MSI_FLAGS, (u8)ALL_VIRT, (u8)ALL_WRITE);
+ p_setd(perm, PCI_MSI_ADDRESS_LO, ALL_VIRT, ALL_WRITE);
+ if (flags & PCI_MSI_FLAGS_64BIT) {
+ p_setd(perm, PCI_MSI_ADDRESS_HI, ALL_VIRT, ALL_WRITE);
+ p_setw(perm, PCI_MSI_DATA_64, (u16)ALL_VIRT, (u16)ALL_WRITE);
+ if (flags & PCI_MSI_FLAGS_MASKBIT) {
+ p_setd(perm, PCI_MSI_MASK_64, NO_VIRT, ALL_WRITE);
+ p_setd(perm, PCI_MSI_PENDING_64, NO_VIRT, ALL_WRITE);
+ }
+ } else {
+ p_setw(perm, PCI_MSI_DATA_32, (u16)ALL_VIRT, (u16)ALL_WRITE);
+ if (flags & PCI_MSI_FLAGS_MASKBIT) {
+ p_setd(perm, PCI_MSI_MASK_32, NO_VIRT, ALL_WRITE);
+ p_setd(perm, PCI_MSI_PENDING_32, NO_VIRT, ALL_WRITE);
+ }
+ }
+ return 0;
+}
+
+/* Determine MSI CAP field length; initialize msi_perms on 1st call per vdev */
+static int vfio_msi_cap_len(struct vfio_pci_device *vdev, u8 pos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int len, ret;
+ u16 flags;
+
+ ret = pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &flags);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+
+ len = 10; /* Minimum size */
+ if (flags & PCI_MSI_FLAGS_64BIT)
+ len += 4;
+ if (flags & PCI_MSI_FLAGS_MASKBIT)
+ len += 10;
+
+ if (vdev->msi_perm)
+ return len;
+
+ vdev->msi_perm = kmalloc(sizeof(struct perm_bits), GFP_KERNEL);
+ if (!vdev->msi_perm)
+ return -ENOMEM;
+
+ ret = init_pci_cap_msi_perm(vdev->msi_perm, len, flags);
+ if (ret)
+ return ret;
+
+ return len;
+}
+
+/* Determine extended capability length for VC (2 & 9) and MFVC */
+static int vfio_vc_cap_len(struct vfio_pci_device *vdev, u16 pos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u32 tmp;
+ int ret, evcc, phases, vc_arb;
+ int len = PCI_CAP_VC_BASE_SIZEOF;
+
+ ret = pci_read_config_dword(pdev, pos + PCI_VC_PORT_REG1, &tmp);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+
+ evcc = tmp & PCI_VC_REG1_EVCC; /* extended vc count */
+ ret = pci_read_config_dword(pdev, pos + PCI_VC_PORT_REG2, &tmp);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+
+ if (tmp & PCI_VC_REG2_128_PHASE)
+ phases = 128;
+ else if (tmp & PCI_VC_REG2_64_PHASE)
+ phases = 64;
+ else if (tmp & PCI_VC_REG2_32_PHASE)
+ phases = 32;
+ else
+ phases = 0;
+
+ vc_arb = phases * 4;
+
+ /*
+ * Port arbitration tables are root & switch only;
+ * function arbitration tables are function 0 only.
+ * In either case, we'll never let user write them so
+ * we don't care how big they are
+ */
+ len += (1 + evcc) * PCI_CAP_VC_PER_VC_SIZEOF;
+ if (vc_arb) {
+ len = round_up(len, 16);
+ len += vc_arb / 8;
+ }
+ return len;
+}
+
+static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u16 word;
+ u8 byte;
+ int ret;
+
+ switch (cap) {
+ case PCI_CAP_ID_MSI:
+ return vfio_msi_cap_len(vdev, pos);
+ case PCI_CAP_ID_PCIX:
+ ret = pci_read_config_word(pdev, pos + PCI_X_CMD, &word);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+
+ if (PCI_X_CMD_VERSION(word)) {
+ vdev->extended_caps = true;
+ return PCI_CAP_PCIX_SIZEOF_V12;
+ } else
+ return PCI_CAP_PCIX_SIZEOF_V0;
+ case PCI_CAP_ID_VNDR:
+ /* length follows next field */
+ ret = pci_read_config_byte(pdev, pos + PCI_CAP_FLAGS, &byte);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+
+ return byte;
+ case PCI_CAP_ID_EXP:
+ /* length based on version */
+ ret = pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, &word);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+
+ if ((word & PCI_EXP_FLAGS_VERS) == 1)
+ return PCI_CAP_EXP_ENDPOINT_SIZEOF_V1;
+ else {
+ vdev->extended_caps = true;
+ return PCI_CAP_EXP_ENDPOINT_SIZEOF_V2;
+ }
+ case PCI_CAP_ID_HT:
+ ret = pci_read_config_byte(pdev, pos + 3, &byte);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+
+ return (byte & HT_3BIT_CAP_MASK) ?
+ HT_CAP_SIZEOF_SHORT : HT_CAP_SIZEOF_LONG;
+ case PCI_CAP_ID_SATA:
+ ret = pci_read_config_byte(pdev, pos + PCI_SATA_REGS, &byte);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+
+ byte &= PCI_SATA_REGS_MASK;
+ if (byte == PCI_SATA_REGS_INLINE)
+ return PCI_SATA_SIZEOF_LONG;
+ else
+ return PCI_SATA_SIZEOF_SHORT;
+ default:
+ printk(KERN_WARNING
+ "%s: %s unknown length for pci cap 0x%x@0x%x\n",
+ dev_name(&pdev->dev), __func__, cap, pos);
+ }
+
+ return 0;
+}
+
+static int vfio_ext_cap_len(struct vfio_pci_device *vdev, u16 ecap, u16 epos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u8 byte;
+ u32 dword;
+ int ret;
+
+ switch (ecap) {
+ case PCI_EXT_CAP_ID_VNDR:
+ ret = pci_read_config_dword(pdev, epos + PCI_VSEC_HDR, &dword);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+
+ return dword >> PCI_VSEC_HDR_LEN_SHIFT;
+ case PCI_EXT_CAP_ID_VC:
+ case PCI_EXT_CAP_ID_VC9:
+ case PCI_EXT_CAP_ID_MFVC:
+ return vfio_vc_cap_len(vdev, epos);
+ case PCI_EXT_CAP_ID_ACS:
+ ret = pci_read_config_byte(pdev, epos + PCI_ACS_CAP, &byte);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+
+ if (byte & PCI_ACS_EC) {
+ int bits;
+
+ ret = pci_read_config_byte(pdev,
+ epos + PCI_ACS_EGRESS_BITS,
+ &byte);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+
+ bits = byte ? round_up(byte, 32) : 256;
+ return 8 + (bits / 8);
+ }
+ return 8;
+
+ case PCI_EXT_CAP_ID_REBAR:
+ ret = pci_read_config_byte(pdev, epos + PCI_REBAR_CTRL, &byte);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+
+ byte &= PCI_REBAR_CTRL_NBAR_MASK;
+ byte >>= PCI_REBAR_CTRL_NBAR_SHIFT;
+
+ return 4 + (byte * 8);
+ case PCI_EXT_CAP_ID_DPA:
+ ret = pci_read_config_byte(pdev, epos + PCI_DPA_CAP, &byte);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+
+ byte &= PCI_DPA_CAP_SUBSTATE_MASK;
+ byte = round_up(byte + 1, 4);
+ return PCI_DPA_BASE_SIZEOF + byte;
+ case PCI_EXT_CAP_ID_TPH:
+ ret = pci_read_config_dword(pdev, epos + PCI_TPH_CAP, &dword);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+
+ if ((dword & PCI_TPH_CAP_LOC_MASK) == PCI_TPH_LOC_CAP) {
+ int sts;
+
+ sts = dword & PCI_TPH_CAP_ST_MASK;
+ sts >>= PCI_TPH_CAP_ST_SHIFT;
+ return PCI_TPH_BASE_SIZEOF + round_up(sts * 2, 4);
+ }
+ return PCI_TPH_BASE_SIZEOF;
+ default:
+ printk(KERN_WARNING
+ "%s: %s unknown length for pci ecap 0x%x@0x%x\n",
+ dev_name(&pdev->dev), __func__, ecap, epos);
+ }
+
+ return 0;
+}
+
+static int vfio_fill_vconfig_bytes(struct vfio_pci_device *vdev,
+ int offset, int size)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int ret = 0;
+
+ /*
+ * We try to read physical config space in the largest chunks
+ * we can, assuming that all of the fields support dword access.
+ * pci_save_state() makes this same assumption and seems to do ok.
+ */
+ while (size) {
+ int filled;
+
+ if (size >= 4 && !(offset % 4)) {
+ u32 *dword = (u32 *)&vdev->vconfig[offset];
+ ret = pci_read_config_dword(pdev, offset, dword);
+ if (ret)
+ return ret;
+ *dword = cpu_to_le32(*dword);
+ filled = 4;
+ } else if (size >= 2 && !(offset % 2)) {
+ u16 *word = (u16 *)&vdev->vconfig[offset];
+ ret = pci_read_config_word(pdev, offset, word);
+ if (ret)
+ return ret;
+ *word = cpu_to_le16(*word);
+ filled = 2;
+ } else {
+ u8 *byte = &vdev->vconfig[offset];
+ ret = pci_read_config_byte(pdev, offset, byte);
+ if (ret)
+ return ret;
+ filled = 1;
+ }
+
+ offset += filled;
+ size -= filled;
+ }
+
+ return ret;
+}
+
+static int vfio_cap_init(struct vfio_pci_device *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u8 *map = vdev->pci_config_map;
+ u16 status;
+ u8 pos, *prev, cap;
+ int loops, ret, caps = 0;
+
+ /* Any capabilities? */
+ ret = pci_read_config_word(pdev, PCI_STATUS, &status);
+ if (ret)
+ return ret;
+
+ if (!(status & PCI_STATUS_CAP_LIST))
+ return 0; /* Done */
+
+ ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos);
+ if (ret)
+ return ret;
+
+ /* Mark the previous position in case we want to skip a capability */
+ prev = &vdev->vconfig[PCI_CAPABILITY_LIST];
+
+ /* We can bound our loop, capabilities are dword aligned */
+ loops = (PCI_CFG_SPACE_SIZE - PCI_STD_HEADER_SIZEOF) / PCI_CAP_SIZEOF;
+ while (pos && loops--) {
+ u8 next;
+ int i, len = 0;
+
+ ret = pci_read_config_byte(pdev, pos, &cap);
+ if (ret)
+ return ret;
+
+ ret = pci_read_config_byte(pdev,
+ pos + PCI_CAP_LIST_NEXT, &next);
+ if (ret)
+ return ret;
+
+ if (cap <= PCI_CAP_ID_MAX) {
+ len = pci_cap_length[cap];
+ if (len == 0xFF) { /* Variable length */
+ len = vfio_cap_len(vdev, cap, pos);
+ if (len < 0)
+ return len;
+ }
+ }
+
+ if (!len) {
+ printk(KERN_INFO "%s: %s hiding cap 0x%x\n",
+ __func__, dev_name(&pdev->dev), cap);
+ *prev = next;
+ pos = next;
+ continue;
+ }
+
+ /* Sanity check, do we overlap other capabilities? */
+ for (i = 0; i < len; i += 4) {
+ if (likely(map[(pos + i) / 4] == PCI_CAP_ID_INVALID))
+ continue;
+
+ printk(KERN_WARNING
+ "%s: %s pci config conflict @0x%x, was cap 0x%x now cap 0x%x\n",
+ __func__, dev_name(&pdev->dev), pos + i,
+ map[(pos + i) / 4], cap);
+ }
+
+ memset(map + (pos / 4), cap, len / 4);
+ ret = vfio_fill_vconfig_bytes(vdev, pos, len);
+ if (ret)
+ return ret;
+
+ prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
+ pos = next;
+ caps++;
+ }
+
+ /* If we didn't fill any capabilities, clear the status flag */
+ if (!caps) {
+ u16 *vstatus = (u16 *)&vdev->vconfig[PCI_STATUS];
+ *vstatus &= ~cpu_to_le16(PCI_STATUS_CAP_LIST);
+ }
+
+ return 0;
+}
+
+static int vfio_ecap_init(struct vfio_pci_device *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u8 *map = vdev->pci_config_map;
+ u16 epos;
+ u32 *prev = NULL;
+ int loops, ret, ecaps = 0;
+
+ if (!vdev->extended_caps)
+ return 0;
+
+ epos = PCI_CFG_SPACE_SIZE;
+
+ loops = (pdev->cfg_size - PCI_CFG_SPACE_SIZE) / PCI_CAP_SIZEOF;
+
+ while (loops-- && epos >= PCI_CFG_SPACE_SIZE) {
+ u32 header;
+ u16 ecap;
+ int i, len = 0;
+ bool hidden = false;
+
+ ret = pci_read_config_dword(pdev, epos, &header);
+ if (ret)
+ return ret;
+
+ ecap = PCI_EXT_CAP_ID(header);
+
+ if (ecap <= PCI_EXT_CAP_ID_MAX) {
+ len = pci_ext_cap_length[ecap];
+ if (len == 0xFF) {
+ len = vfio_ext_cap_len(vdev, ecap, epos);
+ if (len < 0)
+ return len;
+ }
+ }
+
+ if (!len) {
+ printk(KERN_INFO "%s: %s hiding ecap 0x%x@0x%x\n",
+ __func__, dev_name(&pdev->dev), ecap, epos);
+
+ /* If not the first in the chain, we can skip over it */
+ if (prev) {
+ epos = PCI_EXT_CAP_NEXT(header);
+ *prev &= cpu_to_le32(~((u32)0xffc << 20));
+ *prev |= cpu_to_le32((u32)epos << 20);
+ continue;
+ }
+
+ /*
+ * Otherwise, fill in a placeholder, the direct
+ * readfn will virtualize this automatically
+ */
+ len = PCI_CAP_SIZEOF;
+ hidden = true;
+ }
+
+ for (i = 0; i < len; i += 4) {
+ if (likely(map[(epos + i) / 4] == PCI_CAP_ID_INVALID))
+ continue;
+
+ printk(KERN_WARNING
+ "%s: %s pci config conflict @0x%x, was ecap 0x%x now ecap 0x%x\n",
+ __func__, dev_name(&pdev->dev),
+ epos + i, map[(epos + i) / 4], ecap);
+ }
+
+ /*
+ * Even though ecap is 2 bytes, we're currently a long way
+ * from exceeding 1 byte capabilities. If we ever make it
+ * up to 0xFF we'll need to up this to a two-byte, byte map.
+ */
+ BUILD_BUG_ON(PCI_EXT_CAP_ID_MAX >= PCI_CAP_ID_INVALID);
+
+ memset(map + (epos / 4), ecap, len / 4);
+ ret = vfio_fill_vconfig_bytes(vdev, epos, len);
+ if (ret)
+ return ret;
+
+ /*
+ * If we're just using this capability to anchor the list,
+ * hide the real ID. Only count real ecaps. XXX PCI spec
+ * indicates to use cap id = 0, version = 0, next = 0 if
+ * ecaps are absent, hope users check all the way to next.
+ */
+ if (hidden)
+ *(u32 *)&vdev->vconfig[epos] &=
+ cpu_to_le32(((u32)0xffc << 20));
+ else
+ ecaps++;
+
+ prev = (u32 *)&vdev->vconfig[epos];
+ epos = PCI_EXT_CAP_NEXT(header);
+ }
+
+ if (!ecaps)
+ *(u32 *)&vdev->vconfig[PCI_CFG_SPACE_SIZE] = 0;
+
+ return 0;
+}
+
+/*
+ * For each device we allocate a pci_config_map that indicates the
+ * capability occupying each dword and thus the struct perm_bits we
+ * use for read and write. We also allocate a virtualized config
+ * space which tracks reads and writes to bits that we emulate for
+ * the user. Initial values filled from device.
+ *
+ * Using shared struct perm_bits between all vfio-pci devices saves
+ * us from allocating cfg_size buffers for virt and write for every
+ * device. We could remove vconfig and allocate individual buffers
+ * for each area requiring emulated bits, but the array of pointers
+ * would be comparable in size (at least for standard config space).
+ */
+int vfio_config_init(struct vfio_pci_device *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u8 *map, *vconfig;
+ int ret;
+
+ /*
+ * Config space, caps and ecaps are all dword aligned, so we can
+ * use one byte per dword to record the type.
+ */
+ map = kmalloc(pdev->cfg_size / 4, GFP_KERNEL);
+ if (!map)
+ return -ENOMEM;
+
+ vconfig = kmalloc(pdev->cfg_size, GFP_KERNEL);
+ if (!vconfig) {
+ kfree(map);
+ return -ENOMEM;
+ }
+
+ vdev->pci_config_map = map;
+ vdev->vconfig = vconfig;
+
+ memset(map, PCI_CAP_ID_BASIC, PCI_STD_HEADER_SIZEOF / 4);
+ memset(map + (PCI_STD_HEADER_SIZEOF / 4), PCI_CAP_ID_INVALID,
+ (pdev->cfg_size - PCI_STD_HEADER_SIZEOF) / 4);
+
+ ret = vfio_fill_vconfig_bytes(vdev, 0, PCI_STD_HEADER_SIZEOF);
+ if (ret)
+ goto out;
+
+ vdev->bardirty = true;
+
+ /*
+ * XXX can we just pci_load_saved_state/pci_restore_state?
+ * may need to rebuild vconfig after that
+ */
+
+ /* For restore after reset */
+ vdev->rbar[0] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_0];
+ vdev->rbar[1] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_1];
+ vdev->rbar[2] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_2];
+ vdev->rbar[3] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_3];
+ vdev->rbar[4] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_4];
+ vdev->rbar[5] = *(u32 *)&vconfig[PCI_BASE_ADDRESS_5];
+ vdev->rbar[6] = *(u32 *)&vconfig[PCI_ROM_ADDRESS];
+
+ if (pdev->is_virtfn) {
+ *(u16 *)&vconfig[PCI_VENDOR_ID] = cpu_to_le16(pdev->vendor);
+ *(u16 *)&vconfig[PCI_DEVICE_ID] = cpu_to_le16(pdev->device);
+ }
+
+ ret = vfio_cap_init(vdev);
+ if (ret)
+ goto out;
+
+ ret = vfio_ecap_init(vdev);
+ if (ret)
+ goto out;
+
+ return 0;
+
+out:
+ kfree(map);
+ vdev->pci_config_map = NULL;
+ kfree(vconfig);
+ vdev->vconfig = NULL;
+ return pcibios_err_to_errno(ret);
+}
+
+void vfio_config_free(struct vfio_pci_device *vdev)
+{
+ kfree(vdev->vconfig);
+ vdev->vconfig = NULL;
+ kfree(vdev->pci_config_map);
+ vdev->pci_config_map = NULL;
+ kfree(vdev->msi_perm);
+ vdev->msi_perm = NULL;
+}
+
+ssize_t vfio_config_do_rw(struct vfio_pci_device *vdev, char __user *buf,
+ size_t count, loff_t *ppos, bool iswrite)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ struct perm_bits *perm;
+ u32 val = 0;
+ int cap_start = 0, offset;
+ u8 cap_id;
+
+
+ if (*ppos < 0 || *ppos + count > pdev->cfg_size)
+ return -EFAULT;
+
+ cap_id = vdev->pci_config_map[*ppos / 4];
+
+ if (cap_id == PCI_CAP_ID_INVALID) {
+ if (iswrite)
+ return count; /* drop */
+
+ /*
+ * Per PCI spec 3.0, section 6.1, reads from reserved and
+ * unimplemented registers return 0
+ */
+ if (copy_to_user(buf, &val, count))
+ return -EFAULT;
+
+ return count;
+ }
+
+ /*
+ * All capabilities are minimum 4 bytes and aligned on dword
+ * boundaries. Since we don't support unaligned accesses, we're
+ * only ever accessing a single capability.
+ */
+ if (*ppos >= PCI_CFG_SPACE_SIZE) {
+ WARN_ON(cap_id > PCI_EXT_CAP_ID_MAX);
+
+ perm = &ecap_perms[cap_id];
+ cap_start = vfio_find_cap_start(vdev, *ppos);
+
+ } else {
+ WARN_ON(cap_id > PCI_CAP_ID_MAX);
+
+ perm = &cap_perms[cap_id];
+
+ if (cap_id == PCI_CAP_ID_MSI)
+ perm = vdev->msi_perm;
+
+ if (cap_id > PCI_CAP_ID_BASIC)
+ cap_start = vfio_find_cap_start(vdev, *ppos);
+ }
+
+ WARN_ON(!cap_start && cap_id != PCI_CAP_ID_BASIC);
+ WARN_ON(cap_start > *ppos);
+
+ offset = *ppos - cap_start;
+
+ if (iswrite) {
+ if (perm->writefn) {
+ if (copy_from_user(&val, buf, count))
+ return -EFAULT;
+
+ count = perm->writefn(vdev, *ppos, count,
+ perm, offset, val);
+ }
+ } else {
+ if (perm->readfn) {
+ count = perm->readfn(vdev, *ppos, count,
+ perm, offset, &val);
+ if ((ssize_t)count < 0)
+ return count;
+ }
+
+ if (copy_to_user(buf, &val, count))
+ return -EFAULT;
+ }
+
+ return count;
+}
+
+ssize_t vfio_pci_config_readwrite(struct vfio_pci_device *vdev,
+ char __user *buf, size_t count,
+ loff_t *ppos, bool iswrite)
+{
+ size_t done = 0;
+ int ret = 0;
+ loff_t pos = *ppos;
+
+ pos &= VFIO_PCI_OFFSET_MASK;
+
+ /*
+ * We want to both keep the access size the caller uses as well as
+ * support reading large chunks of config space in a single call.
+ * PCI doesn't support unaligned accesses, so we can safely break
+ * those apart.
+ */
+ while (count) {
+ if (count >= 4 && !(pos % 4))
+ ret = vfio_config_do_rw(vdev, buf, 4, &pos, iswrite);
+ else if (count >= 2 && !(pos % 2))
+ ret = vfio_config_do_rw(vdev, buf, 2, &pos, iswrite);
+ else
+ ret = vfio_config_do_rw(vdev, buf, 1, &pos, iswrite);
+
+ if (ret < 0)
+ return ret;
+
+ count -= ret;
+ done += ret;
+ buf += ret;
+ pos += ret;
+ }
+
+ *ppos += done;
+
+ return done;
+}
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
new file mode 100644
index 0000000..2996f37
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -0,0 +1,724 @@
+/*
+ * VFIO PCI interrupt handling
+ *
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ */
+
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/eventfd.h>
+#include <linux/pci.h>
+#include <linux/file.h>
+#include <linux/poll.h>
+#include <linux/vfio.h>
+#include <linux/wait.h>
+#include <linux/workqueue.h>
+
+#include "vfio_pci_private.h"
+
+/*
+ * IRQfd - generic
+ */
+struct virqfd {
+ struct vfio_pci_device *vdev;
+ void *data;
+ struct eventfd_ctx *eventfd;
+ poll_table pt;
+ wait_queue_t wait;
+ struct work_struct inject;
+ struct work_struct shutdown;
+ struct virqfd **pvirqfd;
+};
+
+static struct workqueue_struct *vfio_irqfd_cleanup_wq;
+
+int __init vfio_pci_virqfd_init(void)
+{
+ vfio_irqfd_cleanup_wq =
+ create_singlethread_workqueue("vfio-irqfd-cleanup");
+ if (!vfio_irqfd_cleanup_wq)
+ return -ENOMEM;
+
+ return 0;
+}
+
+void vfio_pci_virqfd_exit(void)
+{
+ destroy_workqueue(vfio_irqfd_cleanup_wq);
+}
+
+static void virqfd_deactivate(struct virqfd *virqfd)
+{
+ queue_work(vfio_irqfd_cleanup_wq, &virqfd->shutdown);
+}
+
+static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+ struct virqfd *virqfd = container_of(wait, struct virqfd, wait);
+ unsigned long flags = (unsigned long)key;
+
+ if (flags & POLLIN)
+ /* An event has been signaled, inject an interrupt */
+ schedule_work(&virqfd->inject);
+
+ if (flags & POLLHUP)
+ /* The eventfd is closing, detach from VFIO */
+ virqfd_deactivate(virqfd);
+
+ return 0;
+}
+
+static void virqfd_ptable_queue_proc(struct file *file,
+ wait_queue_head_t *wqh, poll_table *pt)
+{
+ struct virqfd *virqfd = container_of(pt, struct virqfd, pt);
+ add_wait_queue(wqh, &virqfd->wait);
+}
+
+static void virqfd_shutdown(struct work_struct *work)
+{
+ u64 cnt;
+ struct virqfd *virqfd = container_of(work, struct virqfd, shutdown);
+ struct virqfd **pvirqfd = virqfd->pvirqfd;
+
+ eventfd_ctx_remove_wait_queue(virqfd->eventfd, &virqfd->wait, &cnt);
+ flush_work(&virqfd->inject);
+ eventfd_ctx_put(virqfd->eventfd);
+
+ kfree(virqfd);
+ *pvirqfd = NULL;
+}
+
+static int virqfd_enable(struct vfio_pci_device *vdev,
+ void (*inject)(struct work_struct *work),
+ void *data, struct virqfd **pvirqfd, int fd)
+{
+ struct file *file = NULL;
+ struct eventfd_ctx *ctx = NULL;
+ struct virqfd *virqfd;
+ int ret = 0;
+ unsigned int events;
+
+ if (*pvirqfd)
+ return -EBUSY;
+
+ virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
+ if (!virqfd)
+ return -ENOMEM;
+
+ virqfd->vdev = vdev;
+ virqfd->data = data;
+ virqfd->pvirqfd = pvirqfd;
+ *pvirqfd = virqfd;
+
+ INIT_WORK(&virqfd->inject, inject);
+ INIT_WORK(&virqfd->shutdown, virqfd_shutdown);
+
+ file = eventfd_fget(fd);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ goto fail;
+ }
+
+ ctx = eventfd_ctx_fileget(file);
+ if (IS_ERR(ctx)) {
+ ret = PTR_ERR(ctx);
+ goto fail;
+ }
+
+ virqfd->eventfd = ctx;
+
+ /*
+ * Install our own custom wake-up handling so we are notified via
+ * a callback whenever someone signals the underlying eventfd.
+ */
+ init_waitqueue_func_entry(&virqfd->wait, virqfd_wakeup);
+ init_poll_funcptr(&virqfd->pt, virqfd_ptable_queue_proc);
+
+ events = file->f_op->poll(file, &virqfd->pt);
+
+ /*
+ * Check if there was an event already pending on the eventfd
+ * before we registered and trigger it as if we didn't miss it.
+ */
+ if (events & POLLIN)
+ schedule_work(&virqfd->inject);
+
+ /*
+ * Do not drop the file until the irqfd is fully initialized,
+ * otherwise we might race against the POLLHUP.
+ */
+ fput(file);
+
+ return 0;
+
+fail:
+ if (ctx && !IS_ERR(ctx))
+ eventfd_ctx_put(ctx);
+
+ if (!IS_ERR(file))
+ fput(file);
+
+ kfree(virqfd);
+ *pvirqfd = NULL;
+
+ return ret;
+}
+
+static void virqfd_disable(struct virqfd *virqfd)
+{
+ if (!virqfd)
+ return;
+
+ virqfd_deactivate(virqfd);
+
+ /* Block until we know all outstanding shutdown jobs have completed. */
+ flush_workqueue(vfio_irqfd_cleanup_wq);
+}
+
+/*
+ * INTx
+ */
+static inline void vfio_send_intx_eventfd(struct vfio_pci_device *vdev)
+{
+ if (likely(is_intx(vdev) && !vdev->virq_disabled))
+ eventfd_signal(vdev->ctx[0].trigger, 1);
+}
+
+void vfio_pci_intx_mask(struct vfio_pci_device *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+
+ spin_lock_irq(&vdev->irqlock);
+
+ /*
+ * Masking can come from interrupt, ioctl, or config space
+ * via INTx disable. The latter means this can get called
+ * even when not using intx delivery. In this case, just
+ * try to have the physical bit follow the virtual bit.
+ */
+ if (unlikely(!is_intx(vdev))) {
+ if (vdev->pci_2_3)
+ pci_intx(pdev, 0);
+ } else if (!vdev->ctx[0].masked) {
+ /*
+ * Can't use check_and_mask here because we always want to
+ * mask, not just when something is pending.
+ */
+ if (vdev->pci_2_3)
+ pci_intx(pdev, 0);
+ else
+ disable_irq_nosync(pdev->irq);
+
+ vdev->ctx[0].masked = true;
+ }
+
+ spin_unlock_irq(&vdev->irqlock);
+}
+
+void vfio_pci_intx_unmask(struct vfio_pci_device *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ bool signal = false;
+
+ spin_lock_irq(&vdev->irqlock);
+
+ /*
+ * Unmasking comes from ioctl or config, so again, have the
+ * physical bit follow the virtual even when not using INTx.
+ */
+ if (unlikely(!is_intx(vdev))) {
+ if (vdev->pci_2_3)
+ pci_intx(pdev, 1);
+ } else if (vdev->ctx[0].masked && !vdev->virq_disabled) {
+ /*
+ * A pending interrupt here would immediately trigger,
+ * but we can avoid that overhead by just re-sending
+ * the interrupt to the user.
+ */
+ if (vdev->pci_2_3) {
+ if (!pci_check_and_unmask_intx(pdev))
+ signal = true;
+ } else
+ enable_irq(pdev->irq);
+
+ vdev->ctx[0].masked = signal;
+ }
+
+ spin_unlock_irq(&vdev->irqlock);
+
+ if (signal)
+ vfio_send_intx_eventfd(vdev);
+}
+
+static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
+{
+ struct vfio_pci_device *vdev = dev_id;
+ struct pci_dev *pdev = vdev->pdev;
+ irqreturn_t ret = IRQ_NONE;
+ unsigned long flags;
+
+ spin_lock_irqsave(&vdev->irqlock, flags);
+
+ /* Non-PCI 2.3 devices don't use this hard handler */
+ if (pci_check_and_mask_intx(pdev)) {
+ ret = IRQ_WAKE_THREAD;
+ vdev->ctx[0].masked = true;
+ }
+
+ spin_unlock_irqrestore(&vdev->irqlock, flags);
+
+ return ret;
+}
+
+static irqreturn_t vfio_intx_thread(int irq, void *dev_id)
+{
+ struct vfio_pci_device *vdev = dev_id;
+ int ret = IRQ_HANDLED;
+
+ if (unlikely(!vdev->pci_2_3)) {
+ spin_lock_irq(&vdev->irqlock);
+ if (!vdev->ctx[0].masked) {
+ disable_irq_nosync(vdev->pdev->irq);
+ vdev->ctx[0].masked = true;
+ } else
+ ret = IRQ_NONE;
+ spin_unlock_irq(&vdev->irqlock);
+ }
+
+ if (ret == IRQ_HANDLED)
+ vfio_send_intx_eventfd(vdev);
+
+ return ret;
+}
+
+static int vfio_intx_enable(struct vfio_pci_device *vdev)
+{
+ if (!is_irq_none(vdev))
+ return -EINVAL;
+
+ if (!vdev->pdev->irq)
+ return -ENODEV;
+
+ vdev->ctx = kzalloc(sizeof(struct vfio_pci_irq_ctx), GFP_KERNEL);
+ if (!vdev->ctx)
+ return -ENOMEM;
+
+ vdev->num_ctx = 1;
+ vdev->irq_type = VFIO_PCI_INTX_IRQ_INDEX;
+
+ return 0;
+}
+
+static int vfio_intx_set_signal(struct vfio_pci_device *vdev, int fd)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ irq_handler_t handler = vfio_intx_handler;
+ unsigned long irqflags = IRQF_SHARED;
+ int ret;
+
+ if (vdev->ctx[0].trigger) {
+ free_irq(pdev->irq, vdev);
+ kfree(vdev->ctx[0].name);
+ eventfd_ctx_put(vdev->ctx[0].trigger);
+ vdev->ctx[0].trigger = NULL;
+ }
+
+ if (fd >= 0) {
+ vdev->ctx[0].name = kasprintf(GFP_KERNEL, "vfio-intx(%s)",
+ pci_name(pdev));
+ if (!vdev->ctx[0].name)
+ return -ENOMEM;
+
+ vdev->ctx[0].trigger = eventfd_ctx_fdget(fd);
+ if (!vdev->ctx[0].trigger) {
+ kfree(vdev->ctx[0].name);
+ return -EINVAL;
+ }
+
+ if (!vdev->pci_2_3) {
+ handler = NULL;
+ irqflags = IRQF_ONESHOT;
+ }
+
+ ret = request_threaded_irq(pdev->irq, handler, vfio_intx_thread,
+ irqflags, vdev->ctx[0].name, vdev);
+ if (ret) {
+ eventfd_ctx_put(vdev->ctx[0].trigger);
+ kfree(vdev->ctx[0].name);
+ return ret;
+ }
+
+ /*
+ * INTx disable will stick across the new irq setup,
+ * disable_irq won't.
+ */
+ if (!vdev->pci_2_3)
+ if (vdev->ctx[0].masked || vdev->virq_disabled)
+ disable_irq_nosync(pdev->irq);
+ }
+ return 0;
+}
+
+static void vfio_intx_disable(struct vfio_pci_device *vdev)
+{
+ vfio_intx_set_signal(vdev, -1);
+ virqfd_disable(vdev->ctx[0].unmask);
+ virqfd_disable(vdev->ctx[0].mask);
+ vdev->irq_type = VFIO_PCI_NUM_IRQS;
+ vdev->num_ctx = 0;
+ kfree(vdev->ctx);
+}
+
+static void vfio_intx_unmask_inject(struct work_struct *work)
+{
+ struct virqfd *virqfd = container_of(work, struct virqfd, inject);
+ vfio_pci_intx_unmask(virqfd->vdev);
+}
+
+/*
+ * MSI/MSI-X
+ */
+static irqreturn_t vfio_msihandler(int irq, void *arg)
+{
+ struct eventfd_ctx *trigger = arg;
+
+ eventfd_signal(trigger, 1);
+ return IRQ_HANDLED;
+}
+
+static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int ret;
+
+ if (!is_irq_none(vdev))
+ return -EINVAL;
+
+ vdev->ctx = kzalloc(nvec * sizeof(struct vfio_pci_irq_ctx), GFP_KERNEL);
+ if (!vdev->ctx)
+ return -ENOMEM;
+
+ if (msix) {
+ int i;
+
+ vdev->msix = kzalloc(nvec * sizeof(struct msix_entry),
+ GFP_KERNEL);
+ if (!vdev->msix) {
+ kfree(vdev->ctx);
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < nvec; i++)
+ vdev->msix[i].entry = i;
+
+ ret = pci_enable_msix(pdev, vdev->msix, nvec);
+ if (ret) {
+ kfree(vdev->msix);
+ kfree(vdev->ctx);
+ return ret;
+ }
+ } else {
+ ret = pci_enable_msi_block(pdev, nvec);
+ if (ret) {
+ kfree(vdev->ctx);
+ return ret;
+ }
+ }
+
+ vdev->num_ctx = nvec;
+ vdev->irq_type = msix ? VFIO_PCI_MSIX_IRQ_INDEX :
+ VFIO_PCI_MSI_IRQ_INDEX;
+
+ if (!msix) {
+ /*
+ * Compute the virtual hardware field for max msi vectors -
+ * it is the log base 2 of the number of vectors.
+ */
+ vdev->msi_qmax = fls(nvec * 2 - 1) - 1;
+ }
+
+ return 0;
+}
+
+static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
+ int vector, int fd, bool msix)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int irq = msix ? vdev->msix[vector].vector : pdev->irq + vector;
+ char *name = msix ? "vfio-msix" : "vfio-msi";
+
+ if (vector >= vdev->num_ctx)
+ return -EINVAL;
+
+ if (vdev->ctx[vector].trigger) {
+ free_irq(irq, vdev->ctx[vector].trigger);
+ kfree(vdev->ctx[vector].name);
+ eventfd_ctx_put(vdev->ctx[vector].trigger);
+ vdev->ctx[vector].trigger = NULL;
+ }
+
+ if (fd >= 0) {
+ struct eventfd_ctx *trigger;
+ int ret;
+
+ vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "%s[%d](%s)",
+ name, vector,
+ pci_name(pdev));
+ if (!vdev->ctx[vector].name)
+ return -ENOMEM;
+
+ trigger = eventfd_ctx_fdget(fd);
+ if (IS_ERR(trigger)) {
+ kfree(vdev->ctx[vector].name);
+ return PTR_ERR(trigger);
+ }
+
+ ret = request_threaded_irq(irq, NULL, vfio_msihandler, 0,
+ vdev->ctx[vector].name, trigger);
+ if (ret) {
+ eventfd_ctx_put(trigger);
+ kfree(vdev->ctx[vector].name);
+ return ret;
+ }
+
+ vdev->ctx[vector].trigger = trigger;
+ }
+
+ return 0;
+}
+
+static int vfio_msi_set_block(struct vfio_pci_device *vdev, int start,
+ int count, int32_t *fds, bool msix)
+{
+ int i, j, ret = 0;
+
+ if (start + count > vdev->num_ctx)
+ return -EINVAL;
+
+ for (i = 0, j = start; i < count && !ret; i++, j++) {
+ int fd = fds ? fds[i] : -1;
+ ret = vfio_msi_set_vector_signal(vdev, j, fd, msix);
+ }
+
+ if (ret) {
+ for (--j; j >= start; j--)
+ vfio_msi_set_vector_signal(vdev, j, -1, msix);
+ }
+
+ return ret;
+}
+
+static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int i;
+
+ vfio_msi_set_block(vdev, 0, vdev->num_ctx, NULL, msix);
+
+ for (i = 0; i < vdev->num_ctx; i++) {
+ virqfd_disable(vdev->ctx[i].unmask);
+ virqfd_disable(vdev->ctx[i].mask);
+ }
+
+ if (msix) {
+ pci_disable_msix(vdev->pdev);
+ kfree(vdev->msix);
+ } else
+ pci_disable_msi(pdev);
+
+ vdev->irq_type = VFIO_PCI_NUM_IRQS;
+ vdev->num_ctx = 0;
+ kfree(vdev->ctx);
+}
+
+/*
+ * IOCTL support
+ */
+static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev,
+ int index, int start, int count,
+ uint32_t flags, void *data)
+{
+ if (!is_intx(vdev) || start != 0 || count != 1)
+ return -EINVAL;
+
+ if (flags & VFIO_IRQ_SET_DATA_NONE) {
+ vfio_pci_intx_unmask(vdev);
+ } else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+ uint8_t unmask = *(uint8_t *)data;
+ if (unmask)
+ vfio_pci_intx_unmask(vdev);
+ } else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+ int32_t fd = *(int32_t *)data;
+ if (fd >= 0)
+ return virqfd_enable(vdev, vfio_intx_unmask_inject,
+ NULL, &vdev->ctx[0].unmask, fd);
+
+ virqfd_disable(vdev->ctx[0].unmask);
+ }
+
+ return 0;
+}
+
+static int vfio_pci_set_intx_mask(struct vfio_pci_device *vdev,
+ int index, int start, int count,
+ uint32_t flags, void *data)
+{
+ if (!is_intx(vdev) || start != 0 || count != 1)
+ return -EINVAL;
+
+ if (flags & VFIO_IRQ_SET_DATA_NONE) {
+ vfio_pci_intx_mask(vdev);
+ } else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+ uint8_t mask = *(uint8_t *)data;
+ if (mask)
+ vfio_pci_intx_mask(vdev);
+ } else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+ return -ENOTTY; /* XXX implement me */
+ }
+
+ return 0;
+}
+
+static int vfio_pci_set_intx_trigger(struct vfio_pci_device *vdev,
+ int index, int start, int count,
+ uint32_t flags, void *data)
+{
+ if (is_intx(vdev) && !count && (flags & VFIO_IRQ_SET_DATA_NONE)) {
+ vfio_intx_disable(vdev);
+ return 0;
+ }
+
+ if (!(is_intx(vdev) || is_irq_none(vdev)) || start != 0 || count != 1)
+ return -EINVAL;
+
+ if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+ int32_t fd = *(int32_t *)data;
+ int ret;
+
+ if (is_intx(vdev))
+ return vfio_intx_set_signal(vdev, fd);
+
+ ret = vfio_intx_enable(vdev);
+ if (ret)
+ return ret;
+
+ ret = vfio_intx_set_signal(vdev, fd);
+ if (ret)
+ vfio_intx_disable(vdev);
+
+ return ret;
+ }
+
+ if (!is_intx(vdev))
+ return -EINVAL;
+
+ if (flags & VFIO_IRQ_SET_DATA_NONE) {
+ vfio_send_intx_eventfd(vdev);
+ } else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+ uint8_t trigger = *(uint8_t *)data;
+ if (trigger)
+ vfio_send_intx_eventfd(vdev);
+ }
+ return 0;
+}
+
+static int vfio_pci_set_msi_trigger(struct vfio_pci_device *vdev,
+ int index, int start, int count,
+ uint32_t flags, void *data)
+{
+ int i;
+ bool msix = (index == VFIO_PCI_MSIX_IRQ_INDEX) ? true : false;
+
+ if (irq_is(vdev, index) && !count && (flags & VFIO_IRQ_SET_DATA_NONE)) {
+ vfio_msi_disable(vdev, msix);
+ return 0;
+ }
+
+ if (!(irq_is(vdev, index) || is_irq_none(vdev)))
+ return -EINVAL;
+
+ if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+ int32_t *fds = data;
+ int ret;
+
+ if (vdev->irq_type == index)
+ return vfio_msi_set_block(vdev, start, count,
+ fds, msix);
+
+ ret = vfio_msi_enable(vdev, start + count, msix);
+ if (ret)
+ return ret;
+
+ ret = vfio_msi_set_block(vdev, start, count, fds, msix);
+ if (ret)
+ vfio_msi_disable(vdev, msix);
+
+ return ret;
+ }
+
+ if (!irq_is(vdev, index) || start + count > vdev->num_ctx)
+ return -EINVAL;
+
+ for (i = start; i < start + count; i++) {
+ if (!vdev->ctx[i].trigger)
+ continue;
+ if (flags & VFIO_IRQ_SET_DATA_NONE) {
+ eventfd_signal(vdev->ctx[i].trigger, 1);
+ } else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+ uint8_t *bools = data;
+ if (bools[i - start])
+ eventfd_signal(vdev->ctx[i].trigger, 1);
+ }
+ }
+ return 0;
+}
+
+int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
+ int index, int start, int count, void *data)
+{
+ int (*func)(struct vfio_pci_device *vdev, int index, int start,
+ int count, uint32_t flags, void *data) = NULL;
+
+ switch (index) {
+ case VFIO_PCI_INTX_IRQ_INDEX:
+ switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+ case VFIO_IRQ_SET_ACTION_MASK:
+ func = vfio_pci_set_intx_mask;
+ break;
+ case VFIO_IRQ_SET_ACTION_UNMASK:
+ func = vfio_pci_set_intx_unmask;
+ break;
+ case VFIO_IRQ_SET_ACTION_TRIGGER:
+ func = vfio_pci_set_intx_trigger;
+ break;
+ }
+ break;
+ case VFIO_PCI_MSI_IRQ_INDEX:
+ case VFIO_PCI_MSIX_IRQ_INDEX:
+ switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+ case VFIO_IRQ_SET_ACTION_MASK:
+ case VFIO_IRQ_SET_ACTION_UNMASK:
+ /* XXX Need masking support exported */
+ break;
+ case VFIO_IRQ_SET_ACTION_TRIGGER:
+ func = vfio_pci_set_msi_trigger;
+ break;
+ }
+ break;
+ }
+
+ if (!func)
+ return -ENOTTY;
+
+ return func(vdev, index, start, count, flags, data);
+}
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
new file mode 100644
index 0000000..a4a3678
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -0,0 +1,91 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ */
+
+#include <linux/mutex.h>
+#include <linux/pci.h>
+
+#ifndef VFIO_PCI_PRIVATE_H
+#define VFIO_PCI_PRIVATE_H
+
+#define VFIO_PCI_OFFSET_SHIFT 40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off) ((off) >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+struct vfio_pci_irq_ctx {
+ struct eventfd_ctx *trigger;
+ struct virqfd *unmask;
+ struct virqfd *mask;
+ char *name;
+ bool masked;
+};
+
+struct vfio_pci_device {
+ struct pci_dev *pdev;
+ void __iomem *barmap[PCI_STD_RESOURCE_END + 1];
+ u8 *pci_config_map;
+ u8 *vconfig;
+ struct perm_bits *msi_perm;
+ spinlock_t irqlock;
+ struct mutex igate;
+ struct msix_entry *msix;
+ struct vfio_pci_irq_ctx *ctx;
+ int num_ctx;
+ int irq_type;
+ u8 msi_qmax;
+ u8 msix_bar;
+ u16 msix_size;
+ u32 msix_offset;
+ u32 rbar[7];
+ bool pci_2_3;
+ bool virq_disabled;
+ bool reset_works;
+ bool extended_caps;
+ bool bardirty;
+ struct pci_saved_state *pci_saved_state;
+ atomic_t refcnt;
+};
+
+#define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
+#define is_msi(vdev) (vdev->irq_type == VFIO_PCI_MSI_IRQ_INDEX)
+#define is_msix(vdev) (vdev->irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
+#define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
+#define irq_is(vdev, type) (vdev->irq_type == type)
+
+extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
+extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
+
+extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev,
+ uint32_t flags, int index, int start,
+ int count, void *data);
+
+extern ssize_t vfio_pci_config_readwrite(struct vfio_pci_device *vdev,
+ char __user *buf, size_t count,
+ loff_t *ppos, bool iswrite);
+extern ssize_t vfio_pci_mem_readwrite(struct vfio_pci_device *vdev,
+ char __user *buf, size_t count,
+ loff_t *ppos, bool iswrite);
+extern ssize_t vfio_pci_io_readwrite(struct vfio_pci_device *vdev,
+ char __user *buf, size_t count,
+ loff_t *ppos, bool iswrite);
+
+extern int vfio_pci_init_perm_bits(void);
+extern void vfio_pci_uninit_perm_bits(void);
+
+extern int vfio_pci_virqfd_init(void);
+extern void vfio_pci_virqfd_exit(void);
+
+extern int vfio_config_init(struct vfio_pci_device *vdev);
+extern void vfio_config_free(struct vfio_pci_device *vdev);
+#endif /* VFIO_PCI_PRIVATE_H */
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
new file mode 100644
index 0000000..883cd33
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -0,0 +1,269 @@
+/*
+ * VFIO PCI I/O Port & MMIO access
+ *
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ */
+
+#include <linux/fs.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/io.h>
+
+#include "vfio_pci_private.h"
+
+/* I/O Port BAR access */
+ssize_t vfio_pci_io_readwrite(struct vfio_pci_device *vdev, char __user *buf,
+ size_t count, loff_t *ppos, bool iswrite)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ int bar = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ void __iomem *io;
+ size_t done = 0;
+
+ if (!pci_resource_start(pdev, bar))
+ return -EINVAL;
+
+ if (pos + count > pci_resource_len(pdev, bar))
+ return -EINVAL;
+
+ if (!vdev->barmap[bar]) {
+ int ret;
+
+ ret = pci_request_selected_regions(pdev, 1 << bar, "vfio");
+ if (ret)
+ return ret;
+
+ vdev->barmap[bar] = pci_iomap(pdev, bar, 0);
+
+ if (!vdev->barmap[bar]) {
+ pci_release_selected_regions(pdev, 1 << bar);
+ return -EINVAL;
+ }
+ }
+
+ io = vdev->barmap[bar];
+
+ while (count) {
+ int filled;
+
+ if (count >= 4 && !(pos % 4)) {
+ u32 val;
+
+ if (iswrite) {
+ if (copy_from_user(&val, buf, 4))
+ return -EFAULT;
+
+ iowrite32(le32_to_cpu(val), io + pos);
+ } else {
+ val = cpu_to_le32(ioread32(io + pos));
+
+ if (copy_to_user(buf, &val, 4))
+ return -EFAULT;
+ }
+
+ filled = 4;
+
+ } else if ((pos % 2) == 0 && count >= 2) {
+ u16 val;
+
+ if (iswrite) {
+ if (copy_from_user(&val, buf, 2))
+ return -EFAULT;
+
+ iowrite16(le16_to_cpu(val), io + pos);
+ } else {
+ val = cpu_to_le16(ioread16(io + pos));
+
+ if (copy_to_user(buf, &val, 2))
+ return -EFAULT;
+ }
+
+ filled = 2;
+ } else {
+ u8 val;
+
+ if (iswrite) {
+ if (copy_from_user(&val, buf, 1))
+ return -EFAULT;
+
+ iowrite8(val, io + pos);
+ } else {
+ val = ioread8(io + pos);
+
+ if (copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ }
+
+ filled = 1;
+ }
+
+ count -= filled;
+ done += filled;
+ buf += filled;
+ pos += filled;
+ }
+
+ *ppos += done;
+
+ return done;
+}
+
+/*
+ * MMIO BAR access
+ * We handle two excluded ranges here as well: if the user tries to read
+ * the ROM beyond what PCI tells us is available, or the MSI-X table region,
+ * we return 0xFF and writes are dropped.
+ */
+ssize_t vfio_pci_mem_readwrite(struct vfio_pci_device *vdev, char __user *buf,
+ size_t count, loff_t *ppos, bool iswrite)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ int bar = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ void __iomem *io;
+ resource_size_t end;
+ size_t done = 0;
+ size_t x_start = 0, x_end = 0; /* excluded range */
+
+ if (!pci_resource_start(pdev, bar))
+ return -EINVAL;
+
+ end = pci_resource_len(pdev, bar);
+
+ if (pos > end)
+ return -EINVAL;
+
+ if (pos == end)
+ return 0;
+
+ if (pos + count > end)
+ count = end - pos;
+
+ if (bar == PCI_ROM_RESOURCE) {
+ io = pci_map_rom(pdev, &x_start);
+ x_end = end;
+ } else {
+ if (!vdev->barmap[bar]) {
+ int ret;
+
+ ret = pci_request_selected_regions(pdev, 1 << bar,
+ "vfio");
+ if (ret)
+ return ret;
+
+ vdev->barmap[bar] = pci_iomap(pdev, bar, 0);
+
+ if (!vdev->barmap[bar]) {
+ pci_release_selected_regions(pdev, 1 << bar);
+ return -EINVAL;
+ }
+ }
+
+ io = vdev->barmap[bar];
+
+ if (bar == vdev->msix_bar) {
+ x_start = vdev->msix_offset;
+ x_end = vdev->msix_offset + vdev->msix_size;
+ }
+ }
+
+ if (!io)
+ return -EINVAL;
+
+ while (count) {
+ size_t fillable, filled;
+
+ if (pos < x_start)
+ fillable = x_start - pos;
+ else if (pos >= x_end)
+ fillable = end - pos;
+ else
+ fillable = 0;
+
+ if (fillable >= 4 && !(pos % 4) && (count >= 4)) {
+ u32 val;
+
+ if (iswrite) {
+ if (copy_from_user(&val, buf, 4))
+ goto out;
+
+ iowrite32(le32_to_cpu(val), io + pos);
+ } else {
+ val = cpu_to_le32(ioread32(io + pos));
+
+ if (copy_to_user(buf, &val, 4))
+ goto out;
+ }
+
+ filled = 4;
+ } else if (fillable >= 2 && !(pos % 2) && (count >= 2)) {
+ u16 val;
+
+ if (iswrite) {
+ if (copy_from_user(&val, buf, 2))
+ goto out;
+
+ iowrite16(le16_to_cpu(val), io + pos);
+ } else {
+ val = cpu_to_le16(ioread16(io + pos));
+
+ if (copy_to_user(buf, &val, 2))
+ goto out;
+ }
+
+ filled = 2;
+ } else if (fillable) {
+ u8 val;
+
+ if (iswrite) {
+ if (copy_from_user(&val, buf, 1))
+ goto out;
+
+ iowrite8(val, io + pos);
+ } else {
+ val = ioread8(io + pos);
+
+ if (copy_to_user(buf, &val, 1))
+ goto out;
+ }
+
+ filled = 1;
+ } else {
+ /* Drop writes, fill reads with FF */
+ filled = min((size_t)(x_end - pos), count);
+
+ if (!iswrite) {
+ char val = 0xFF;
+ size_t i;
+
+ for (i = 0; i < filled; i++) {
+ if (put_user(val, buf + i))
+ goto out;
+ }
+ }
+ }
+
+ count -= filled;
+ done += filled;
+ buf += filled;
+ pos += filled;
+ }
+
+ *ppos += done;
+
+out:
+ if (bar == PCI_ROM_RESOURCE)
+ pci_unmap_rom(pdev, io);
+
+ return count ? -EFAULT : done;
+}
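
The access loop in vfio_pci_mem_readwrite() above picks, on each pass, the widest naturally aligned access that fits both the remaining request and the distance to the excluded range. A minimal userspace sketch of that decision follows. The helper names fillable_bytes() and access_width() are hypothetical, introduced only for illustration, and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bytes that may be touched starting at pos before running into the
 * excluded range [x_start, x_end) or the end of the region.  A result
 * of 0 means pos lies inside the excluded range.
 */
size_t fillable_bytes(uint64_t pos, uint64_t end,
		      uint64_t x_start, uint64_t x_end)
{
	assert(pos <= end);

	if (pos < x_start)
		return (size_t)(x_start - pos);
	if (pos >= x_end)
		return (size_t)(end - pos);
	return 0;
}

/*
 * Width in bytes of the next access: 4, 2 or 1, limited by alignment,
 * by the bytes left in the request (count) and by fillable.  Returns 0
 * when the position is excluded, in which case the caller would drop
 * writes and fill reads with 0xFF.
 */
size_t access_width(uint64_t pos, size_t count, size_t fillable)
{
	if (fillable >= 4 && count >= 4 && (pos % 4) == 0)
		return 4;
	if (fillable >= 2 && count >= 2 && (pos % 2) == 0)
		return 2;
	if (fillable)
		return 1;
	return 0;
}
```

With no excluded range (x_start == x_end == 0) fillable_bytes() simply returns the distance to the end of the region, which corresponds to the plain BAR case.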
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index b3e4583..3c2deac 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -222,6 +222,7 @@ struct vfio_device_info {
__u32 argsz;
__u32 flags;
#define VFIO_DEVICE_FLAGS_RESET (1 << 0) /* Device supports reset */
+#define VFIO_DEVICE_FLAGS_PCI (1 << 1) /* vfio-pci device */
__u32 num_regions; /* Max region index + 1 */
__u32 num_irqs; /* Max IRQ index + 1 */
};
@@ -363,6 +364,31 @@ struct vfio_irq_set {
*/
#define VFIO_DEVICE_RESET _IO(VFIO_TYPE, VFIO_BASE + 11)

+/*
+ * The VFIO-PCI bus driver makes use of the following fixed region and
+ * IRQ index mapping. Unimplemented regions return a size of zero.
+ * Unimplemented IRQ types return a count of zero.
+ */
+
+enum {
+ VFIO_PCI_BAR0_REGION_INDEX,
+ VFIO_PCI_BAR1_REGION_INDEX,
+ VFIO_PCI_BAR2_REGION_INDEX,
+ VFIO_PCI_BAR3_REGION_INDEX,
+ VFIO_PCI_BAR4_REGION_INDEX,
+ VFIO_PCI_BAR5_REGION_INDEX,
+ VFIO_PCI_ROM_REGION_INDEX,
+ VFIO_PCI_CONFIG_REGION_INDEX,
+ VFIO_PCI_NUM_REGIONS
+};
+
+enum {
+ VFIO_PCI_INTX_IRQ_INDEX,
+ VFIO_PCI_MSI_IRQ_INDEX,
+ VFIO_PCI_MSIX_IRQ_INDEX,
+ VFIO_PCI_NUM_IRQS
+};
+
/* -------- API for x86 VFIO IOMMU -------- */

/**

2012-05-22 05:07:31

by Alex Williamson

Subject: [PATCH v2 10/13] pci: export pci_user functions for use by other drivers

VFIO PCI support will make use of these for user initiated
PCI config accesses.

Signed-off-by: Alex Williamson <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
---

drivers/pci/access.c | 6 ++++--
drivers/pci/pci.h | 7 -------
include/linux/pci.h | 8 ++++++++
3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 2a58164..ba91a7e 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -162,7 +162,8 @@ int pci_user_read_config_##size \
if (ret > 0) \
ret = -EINVAL; \
return ret; \
-}
+} \
+EXPORT_SYMBOL_GPL(pci_user_read_config_##size);

/* Returns 0 on success, negative values indicate error. */
#define PCI_USER_WRITE_CONFIG(size,type) \
@@ -181,7 +182,8 @@ int pci_user_write_config_##size \
if (ret > 0) \
ret = -EINVAL; \
return ret; \
-}
+} \
+EXPORT_SYMBOL_GPL(pci_user_write_config_##size);

PCI_USER_READ_CONFIG(byte, u8)
PCI_USER_READ_CONFIG(word, u16)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index e494347..f2dcc46 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -86,13 +86,6 @@ static inline bool pci_is_bridge(struct pci_dev *pci_dev)
return !!(pci_dev->subordinate);
}

-extern int pci_user_read_config_byte(struct pci_dev *dev, int where, u8 *val);
-extern int pci_user_read_config_word(struct pci_dev *dev, int where, u16 *val);
-extern int pci_user_read_config_dword(struct pci_dev *dev, int where, u32 *val);
-extern int pci_user_write_config_byte(struct pci_dev *dev, int where, u8 val);
-extern int pci_user_write_config_word(struct pci_dev *dev, int where, u16 val);
-extern int pci_user_write_config_dword(struct pci_dev *dev, int where, u32 val);
-
struct pci_vpd_ops {
ssize_t (*read)(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
ssize_t (*write)(struct pci_dev *dev, loff_t pos, size_t count, const void *buf);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2559735..0cf57d5 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -770,6 +770,14 @@ static inline int pci_write_config_dword(const struct pci_dev *dev, int where,
return pci_bus_write_config_dword(dev->bus, dev->devfn, where, val);
}

+/* user-space driven config access */
+int pci_user_read_config_byte(struct pci_dev *dev, int where, u8 *val);
+int pci_user_read_config_word(struct pci_dev *dev, int where, u16 *val);
+int pci_user_read_config_dword(struct pci_dev *dev, int where, u32 *val);
+int pci_user_write_config_byte(struct pci_dev *dev, int where, u8 val);
+int pci_user_write_config_word(struct pci_dev *dev, int where, u16 val);
+int pci_user_write_config_dword(struct pci_dev *dev, int where, u32 val);
+
int __must_check pci_enable_device(struct pci_dev *dev);
int __must_check pci_enable_device_io(struct pci_dev *dev);
int __must_check pci_enable_device_mem(struct pci_dev *dev);
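
The patch above relies on token pasting: PCI_USER_READ_CONFIG(size, type) expands to one function per access width, and the added EXPORT_SYMBOL_GPL line inside the macro exports each generated function. The userspace sketch below shows only the token-pasting part, since EXPORT_SYMBOL_GPL has no userspace equivalent. struct fake_dev, FAKE_READ_CONFIG and the fake_read_config_* functions are hypothetical names for illustration. The alignment and bounds checks are assumptions of this sketch, not a copy of the kernel's implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a device's 256-byte config space. */
struct fake_dev {
	uint8_t cfg[256];
};

/*
 * Generate fake_read_config_<size>() for one access width.  Returns 0
 * on success and -EINVAL for a misaligned or out-of-range offset.
 * Bytes are assembled little-endian, as PCI config space is defined.
 */
#define FAKE_READ_CONFIG(size, type) \
int fake_read_config_##size(const struct fake_dev *dev, int where, \
			    type *val) \
{ \
	uint32_t v = 0; \
	size_t i; \
	if (where < 0 || ((size_t)where & (sizeof(type) - 1)) || \
	    (size_t)where + sizeof(type) > sizeof(dev->cfg)) \
		return -EINVAL; \
	for (i = 0; i < sizeof(type); i++) \
		v |= (uint32_t)dev->cfg[where + i] << (8 * i); \
	*val = (type)v; \
	return 0; \
}

FAKE_READ_CONFIG(byte, uint8_t)
FAKE_READ_CONFIG(word, uint16_t)
FAKE_READ_CONFIG(dword, uint32_t)
```

Each invocation yields an ordinary function, so a caller outside the defining file needs only the prototypes, which is why the patch moves the declarations into include/linux/pci.h.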

2012-05-22 05:08:00

by Alex Williamson

Subject: [PATCH v2 08/13] vfio: Add documentation

Signed-off-by: Alex Williamson <[email protected]>
---

Documentation/vfio.txt | 315 ++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 315 insertions(+), 0 deletions(-)
create mode 100644 Documentation/vfio.txt

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
new file mode 100644
index 0000000..1240874
--- /dev/null
+++ b/Documentation/vfio.txt
@@ -0,0 +1,315 @@
+VFIO - "Virtual Function I/O"[1]
+-------------------------------------------------------------------------------
+Many modern systems now provide DMA and interrupt remapping facilities
+to help ensure I/O devices behave within the boundaries they've been
+allotted. This includes x86 hardware with AMD-Vi and Intel VT-d,
+POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
+systems such as Freescale PAMU. The VFIO driver is an IOMMU/device
+agnostic framework for exposing direct device access to userspace, in
+a secure, IOMMU protected environment. In other words, this allows
+safe[2], non-privileged, userspace drivers.
+
+Why do we want that? Virtual machines often make use of direct device
+access ("device assignment") when configured for the highest possible
+I/O performance. From a device and host perspective, this simply
+turns the VM into a userspace driver, with the benefits of
+significantly reduced latency, higher bandwidth, and direct use of
+bare-metal device drivers[3].
+
+Some applications, particularly in the high performance computing
+field, also benefit from low-overhead, direct device access from
+userspace. Examples include network adapters (often non-TCP/IP based)
+and compute accelerators. Prior to VFIO, these drivers had to either
+go through the full development cycle to become a proper upstream
+driver, be maintained out of tree, or make use of the UIO framework,
+which has no notion of IOMMU protection, limited interrupt support,
+and requires root privileges to access things like PCI configuration
+space.
+
+The VFIO driver framework intends to unify these, replacing the KVM
+PCI-specific device assignment code and providing a more secure, more
+featureful userspace driver environment than UIO.
+
+Groups, Devices, and IOMMUs
+-------------------------------------------------------------------------------
+
+Devices are the main target of any I/O driver. Devices typically
+create a programming interface made up of I/O access, interrupts,
+and DMA. Without going into the details of each of these, DMA is
+by far the most critical aspect for maintaining a secure environment
+as allowing a device read-write access to system memory imposes the
+greatest risk to the overall system integrity.
+
+To help mitigate this risk, many modern IOMMUs now incorporate
+isolation properties into what was, in many cases, an interface only
+meant for translation (ie. solving the addressing problems of devices
+with limited address spaces). With this, devices can now be isolated
+from each other and from arbitrary memory access, thus allowing
+things like secure direct assignment of devices into virtual machines.
+
+This isolation is not always at the granularity of a single device
+though. Even when an IOMMU is capable of this, properties of devices,
+interconnects, and IOMMU topologies can each reduce this isolation.
+For instance, an individual device may be part of a larger multi-
+function enclosure. While the IOMMU may be able to distinguish
+between devices within the enclosure, the enclosure may not require
+transactions between devices to reach the IOMMU. Examples of this
+could be anything from a multi-function PCI device with backdoors
+between functions to a non-PCI-ACS (Access Control Services) capable
+bridge allowing redirection without reaching the IOMMU. Topology
+can also play a role in terms of hiding devices. A PCIe-to-PCI
+bridge masks the devices behind it, making transactions appear as if
+from the bridge itself. Obviously IOMMU design plays a major role
+as well.
+
+Therefore, while for the most part an IOMMU may have device level
+granularity, any system is susceptible to reduced granularity. The
+IOMMU API therefore supports a notion of IOMMU groups. A group is
+a set of devices which is isolatable from all other devices in the
+system. Groups are therefore the unit of ownership used by VFIO.
+
+While the group is the minimum granularity that must be used to
+ensure secure user access, it's not necessarily the preferred
+granularity. In IOMMUs which make use of page tables, it may be
+possible to share a set of page tables between different groups,
+reducing the overhead both to the platform (reduced TLB thrashing,
+reduced duplicate page tables), and to the user (programming only
+a single set of translations). For this reason, VFIO makes use of
+a container class, which may hold one or more groups. A container
+is created by simply opening the /dev/vfio/vfio character device.
+
+On its own, the container provides little functionality, with all
+but a couple of version and extension query interfaces locked away.
+The user needs to add a group into the container for the next level
+of functionality. To do this, the user first needs to identify the
+group associated with the desired device. This can be done using
+the sysfs links described in the example below. By unbinding the
+device from the host driver and binding it to a VFIO driver, a new
+VFIO group will appear for the group as /dev/vfio/$GROUP, where
+$GROUP is the IOMMU group number of which the device is a member.
+If the IOMMU group contains multiple devices, each will need to
+be bound to a VFIO driver before operations on the VFIO group
+are allowed (it's also sufficient to only unbind the device from
+host drivers if a VFIO driver is unavailable; this will make the
+group available, but not that particular device). TBD - interface
+for disabling driver probing/locking a device.
+
+Once the group is ready, it may be added to the container by opening
+the VFIO group character device (/dev/vfio/$GROUP) and using the
+VFIO_GROUP_SET_CONTAINER ioctl, passing the file descriptor of the
+previously opened container file. If desired and if the IOMMU driver
+supports sharing the IOMMU context between groups, multiple groups may
+be set to the same container. If a group fails to set to a container
+with existing groups, a new empty container will need to be used
+instead.
+
+With a group (or groups) attached to a container, the remaining
+ioctls become available, enabling access to the VFIO IOMMU interfaces.
+Additionally, it now becomes possible to get file descriptors for each
+device within a group using an ioctl on the VFIO group file descriptor.
+
+The VFIO device API includes ioctls for describing the device, the I/O
+regions and their read/write/mmap offsets on the device descriptor, as
+well as mechanisms for describing and registering interrupt
+notifications.
+
+VFIO Usage Example
+-------------------------------------------------------------------------------
+
+Assume user wants to access PCI device 0000:06:0d.0
+
+$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
+../../../../kernel/iommu_groups/26
+
+This device is therefore in IOMMU group 26. This device is on the
+pci bus, therefore the user will make use of vfio-pci to manage the
+group:
+
+# modprobe vfio-pci
+
+Binding this device to the vfio-pci driver creates the VFIO group
+character devices for this group:
+
+$ lspci -n -s 0000:06:0d.0
+06:0d.0 0401: 1102:0002 (rev 08)
+# echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
+# echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
+# echo 0000:06:0d.0 > /sys/bus/pci/drivers/vfio-pci/bind
+
+Now we need to look at what other devices are in the group to free
+it for use by VFIO:
+
+$ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
+total 0
+lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
+ ../../../../devices/pci0000:00/0000:00:1e.0
+lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
+ ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
+lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
+ ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
+
+This device is behind a PCIe-to-PCI bridge[4], therefore we also
+need to add device 0000:06:0d.1 to the group following the same
+procedure as above. Device 0000:00:1e.0 is a bridge that does
+not currently have a host driver, therefore it's not required to
+bind this device to the vfio-pci driver (vfio-pci does not currently
+support PCI bridges).
+
+The final step is to provide the user with access to the group if
+unprivileged operation is desired (note that /dev/vfio/vfio provides
+no capabilities on its own and is therefore expected to be set to
+mode 0666 by the system).
+
+# chown user:user /dev/vfio/26
+
+The user now has full access to all the devices and the iommu for this
+group and can access them as follows:
+
+ int container, group, device, i;
+ struct vfio_group_status group_status =
+ { .argsz = sizeof(group_status) };
+ struct vfio_iommu_x86_info iommu_info = { .argsz = sizeof(iommu_info) };
+ struct vfio_iommu_x86_dma_map dma_map = { .argsz = sizeof(dma_map) };
+ struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+
+ /* Create a new container */
+ container = open("/dev/vfio/vfio", O_RDWR);
+
+ if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
+ /* Unknown API version */
+
+ if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_X86_IOMMU))
+ /* Doesn't support the IOMMU driver we want. */
+
+ /* Open the group */
+ group = open("/dev/vfio/26", O_RDWR);
+
+ /* Test the group is viable and available */
+ ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);
+
+ if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE))
+ /* Group is not viable (ie, not all devices bound for vfio) */
+
+ /* Add the group to the container */
+ ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
+
+ /* Enable the IOMMU model we want */
+ ioctl(container, VFIO_SET_IOMMU, VFIO_X86_IOMMU);
+
+ /* Get additional IOMMU info */
+ ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);
+
+ /* Allocate some space and setup a DMA mapping */
+ dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+ dma_map.size = 1024 * 1024;
+ dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
+ dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+
+ ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
+
+ /* Get a file descriptor for the device */
+ device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
+
+ /* Test and setup the device */
+ ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
+
+ for (i = 0; i < device_info.num_regions; i++) {
+ struct vfio_region_info reg = { .argsz = sizeof(reg) };
+
+ reg.index = i;
+
+ ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
+
+ /* Setup mappings... read/write offsets, mmaps
+ * For PCI devices, config space is a region */
+ }
+
+ for (i = 0; i < device_info.num_irqs; i++) {
+ struct vfio_irq_info irq = { .argsz = sizeof(irq) };
+
+ irq.index = i;
+
+ ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);
+
+ /* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
+ }
+
+ /* Gratuitous device reset and go... */
+ ioctl(device, VFIO_DEVICE_RESET);
+
+VFIO User API
+-------------------------------------------------------------------------------
+
+Please see include/linux/vfio.h for complete API documentation.
+
+VFIO bus driver API
+-------------------------------------------------------------------------------
+
+VFIO bus drivers, such as vfio-pci, make use of only a few interfaces
+into VFIO core. When devices are bound and unbound to the driver,
+the driver should call vfio_add_group_dev() and vfio_del_group_dev()
+respectively:
+
+extern int vfio_add_group_dev(struct iommu_group *iommu_group,
+ struct device *dev,
+ const struct vfio_device_ops *ops,
+ void *device_data);
+
+extern void *vfio_del_group_dev(struct device *dev);
+
+vfio_add_group_dev() indicates to the core to begin tracking the
+specified iommu_group and register the specified dev as owned by
+a VFIO bus driver. The driver provides an ops structure for callbacks
+similar to a file operations structure:
+
+struct vfio_device_ops {
+ int (*open)(void *device_data);
+ void (*release)(void *device_data);
+ ssize_t (*read)(void *device_data, char __user *buf,
+ size_t count, loff_t *ppos);
+ ssize_t (*write)(void *device_data, const char __user *buf,
+ size_t size, loff_t *ppos);
+ long (*ioctl)(void *device_data, unsigned int cmd,
+ unsigned long arg);
+ int (*mmap)(void *device_data, struct vm_area_struct *vma);
+};
+
+Each function is passed the device_data that was originally registered
+in the vfio_add_group_dev() call above. This allows the bus driver
+an easy place to store its opaque, private data. The open/release
+callbacks are issued when a new file descriptor is created for a
+device (via VFIO_GROUP_GET_DEVICE_FD). The ioctl interface provides
+a direct pass through for VFIO_DEVICE_* ioctls. The read/write/mmap
+interfaces implement the device region access defined by the device's
+own VFIO_DEVICE_GET_REGION_INFO ioctl.
+
+-------------------------------------------------------------------------------
+
+[1] VFIO was originally an acronym for "Virtual Function I/O" in its
+initial implementation by Tom Lyon while at Cisco. We've since
+outgrown the acronym, but it's catchy.
+
+[2] "safe" also depends upon a device being "well behaved". It's
+possible for multi-function devices to have backdoors between
+functions and even for single function devices to have alternative
+access to things like PCI config space through MMIO registers. To
+guard against the former we can include additional precautions in the
+IOMMU driver to group multi-function PCI devices together
+(iommu=group_mf). The latter we can't prevent, but the IOMMU should
+still provide isolation. For PCI, SR-IOV Virtual Functions are the
+best indicator of "well behaved", as these are designed for
+virtualization usage models.
+
+[3] As always there are trade-offs to virtual machine device
+assignment that are beyond the scope of VFIO. It's expected that
+future IOMMU technologies will reduce some, but maybe not all, of
+these trade-offs.
+
+[4] In this case the device is below a PCI bridge, so transactions
+from either function of the device are indistinguishable to the iommu:
+
+-[0000:00]-+-1e.0-[06]--+-0d.0
+ \-0d.1
+
+00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
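
The usage example in the documentation above finds the group by reading the iommu_group symlink and then opens /dev/vfio/$GROUP. A small userspace sketch of those two string steps follows. The helper names iommu_group_from_link() and vfio_group_path() are hypothetical and appear neither in the patch nor in any VFIO header. In a real program the link target would come from readlink(2) on /sys/bus/pci/devices/<device>/iommu_group.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract the IOMMU group number from a symlink target such as
 * "../../../../kernel/iommu_groups/26".  Returns -1 if the last path
 * component is not a plain decimal number.
 */
long iommu_group_from_link(const char *target)
{
	const char *base = strrchr(target, '/');
	char *end;
	long group;

	base = base ? base + 1 : target;
	if (*base < '0' || *base > '9')
		return -1;

	errno = 0;
	group = strtol(base, &end, 10);
	if (errno || *end != '\0')
		return -1;

	return group;
}

/*
 * Format the group character device path, "/dev/vfio/<group>".
 * Returns 0 on success, -1 on a negative group or a too-small buffer.
 */
int vfio_group_path(long group, char *buf, size_t len)
{
	int n;

	if (group < 0)
		return -1;

	n = snprintf(buf, len, "/dev/vfio/%ld", group);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting path is what the documentation's example passes to open() before issuing VFIO_GROUP_GET_STATUS and VFIO_GROUP_SET_CONTAINER.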

2012-05-22 05:07:57

by Alex Williamson

Subject: [PATCH v2 09/13] vfio: x86 IOMMU implementation

x86 is probably the wrong name for this VFIO IOMMU driver, but x86
is the primary target for it. This driver supports a very simple
usage model using the existing IOMMU API. The IOMMU is expected to
support the full host address space with no special IOVA windows,
number of mappings restrictions, or unique processor target options.

Signed-off-by: Alex Williamson <[email protected]>
---

Documentation/ioctl/ioctl-number.txt | 2
drivers/vfio/Kconfig | 6
drivers/vfio/Makefile | 2
drivers/vfio/vfio.c | 7
drivers/vfio/vfio_iommu_x86.c | 743 ++++++++++++++++++++++++++++++++++
include/linux/vfio.h | 52 ++
6 files changed, 811 insertions(+), 1 deletions(-)
create mode 100644 drivers/vfio/vfio_iommu_x86.c

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 111e30a..9d1694e 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -88,7 +88,7 @@ Code Seq#(hex) Include File Comments
and kernel/power/user.c
'8' all SNP8023 advanced NIC card
<mailto:[email protected]>
-';' 64-6F linux/vfio.h
+';' 64-72 linux/vfio.h
'@' 00-0F linux/radeonfb.h conflict!
'@' 00-0F drivers/video/aty/aty128fb.c conflict!
'A' 00-1F linux/apm_bios.h conflict!
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 9acb1e7..bd88a30 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -1,6 +1,12 @@
+config VFIO_IOMMU_X86
+ tristate
+ depends on VFIO && X86
+ default n
+
menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
+ select VFIO_IOMMU_X86 if X86
help
VFIO provides a framework for secure userspace device drivers.
See Documentation/vfio.txt for more details.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7500a67..1f1abee 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1 +1,3 @@
obj-$(CONFIG_VFIO) += vfio.o
+obj-$(CONFIG_VFIO_IOMMU_X86) += vfio_iommu_x86.o
+obj-$(CONFIG_VFIO_PCI) += pci/
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6558eef..89899a8 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1369,6 +1369,13 @@ static int __init vfio_init(void)

pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");

+ /*
+ * Attempt to load known iommu-drivers. This gives us a working
+ * environment without the user needing to explicitly load iommu
+ * drivers.
+ */
+ request_module_nowait("vfio_iommu_x86");
+
return 0;

err_groups_cdev:
diff --git a/drivers/vfio/vfio_iommu_x86.c b/drivers/vfio/vfio_iommu_x86.c
new file mode 100644
index 0000000..a52391d
--- /dev/null
+++ b/drivers/vfio/vfio_iommu_x86.c
@@ -0,0 +1,743 @@
+/*
+ * VFIO: IOMMU DMA mapping support for x86 (Intel VT-d & AMD-Vi)
+ *
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ */
+
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/pci.h> /* pci_bus_type */
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/workqueue.h>
+
+#define DRIVER_VERSION "0.2"
+#define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
+#define DRIVER_DESC "x86 IOMMU driver for VFIO"
+
+static bool allow_unsafe_interrupts;
+module_param_named(allow_unsafe_interrupts,
+ allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(allow_unsafe_interrupts,
+ "Enable VFIO IOMMU support on platforms without interrupt remapping support.");
+
+struct vfio_iommu {
+ struct iommu_domain *domain;
+ struct mutex lock;
+ struct list_head dma_list;
+ struct list_head group_list;
+ bool cache;
+};
+
+struct vfio_dma {
+ struct list_head next;
+ dma_addr_t iova; /* Device address */
+ unsigned long vaddr; /* Process virtual addr */
+ long npage; /* Number of pages */
+ int prot; /* IOMMU_READ/WRITE */
+};
+
+struct vfio_group {
+ struct iommu_group *iommu_group;
+ struct list_head next;
+};
+
+/*
+ * This code handles mapping and unmapping of user data buffers
+ * into DMA'ble space using the IOMMU
+ */
+
+#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT)
+
+struct vwork {
+ struct mm_struct *mm;
+ long npage;
+ struct work_struct work;
+};
+
+/* delayed decrement/increment for locked_vm */
+static void vfio_lock_acct_bg(struct work_struct *work)
+{
+ struct vwork *vwork = container_of(work, struct vwork, work);
+ struct mm_struct *mm;
+
+ mm = vwork->mm;
+ down_write(&mm->mmap_sem);
+ mm->locked_vm += vwork->npage;
+ up_write(&mm->mmap_sem);
+ mmput(mm);
+ kfree(vwork);
+}
+
+static void vfio_lock_acct(long npage)
+{
+ struct vwork *vwork;
+ struct mm_struct *mm;
+
+ if (!current->mm)
+ return; /* process exited */
+
+ if (down_write_trylock(&current->mm->mmap_sem)) {
+ current->mm->locked_vm += npage;
+ up_write(&current->mm->mmap_sem);
+ return;
+ }
+
+ /*
+ * Couldn't get mmap_sem lock, so must setup to update
+ * mm->locked_vm later. If locked_vm were atomic, we
+ * wouldn't need this silliness
+ */
+ vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
+ if (!vwork)
+ return;
+ mm = get_task_mm(current);
+ if (!mm) {
+ kfree(vwork);
+ return;
+ }
+ INIT_WORK(&vwork->work, vfio_lock_acct_bg);
+ vwork->mm = mm;
+ vwork->npage = npage;
+ schedule_work(&vwork->work);
+}
+
+/*
+ * Some mappings aren't backed by a struct page, for example an mmap'd
+ * MMIO range for our own or another device. These use a different
+ * pfn conversion and shouldn't be tracked as locked pages.
+ */
+static bool is_invalid_reserved_pfn(unsigned long pfn)
+{
+ if (pfn_valid(pfn)) {
+ bool reserved;
+ struct page *tail = pfn_to_page(pfn);
+ struct page *head = compound_trans_head(tail);
+ reserved = !!(PageReserved(head));
+ if (head != tail) {
+ /*
+ * "head" is not a dangling pointer
+ * (compound_trans_head takes care of that)
+ * but the hugepage may have been split
+ * from under us (and we may not hold a
+ * reference count on the head page so it can
+ * be reused before we run PageReserved), so
+ * we've to check PageTail before returning
+ * what we just read.
+ */
+ smp_rmb();
+ if (PageTail(tail))
+ return reserved;
+ }
+ return PageReserved(tail);
+ }
+
+ return true;
+}
+
+static int put_pfn(unsigned long pfn, int prot)
+{
+ if (!is_invalid_reserved_pfn(pfn)) {
+ struct page *page = pfn_to_page(pfn);
+ if (prot & IOMMU_WRITE)
+ SetPageDirty(page);
+ put_page(page);
+ return 1;
+ }
+ return 0;
+}
+
+/* Unmap DMA region */
+static long __vfio_dma_do_unmap(struct vfio_iommu *iommu, dma_addr_t iova,
+ long npage, int prot)
+{
+ long i, unlocked = 0;
+
+ for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
+ unsigned long pfn;
+
+ pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
+ if (pfn) {
+ iommu_unmap(iommu->domain, iova, PAGE_SIZE);
+ unlocked += put_pfn(pfn, prot);
+ }
+ }
+ return unlocked;
+}
+
+static void vfio_dma_unmap(struct vfio_iommu *iommu, dma_addr_t iova,
+ long npage, int prot)
+{
+ long unlocked;
+
+ unlocked = __vfio_dma_do_unmap(iommu, iova, npage, prot);
+ vfio_lock_acct(-unlocked);
+}
+
+static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+{
+ struct page *page[1];
+ struct vm_area_struct *vma;
+ int ret = -EFAULT;
+
+ if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+ *pfn = page_to_pfn(page[0]);
+ return 0;
+ }
+
+ down_read(&current->mm->mmap_sem);
+
+ vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+
+ if (vma && vma->vm_flags & VM_PFNMAP) {
+ *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ if (is_invalid_reserved_pfn(*pfn))
+ ret = 0;
+ }
+
+ up_read(&current->mm->mmap_sem);
+
+ return ret;
+}
+
+/* Map DMA region */
+static int __vfio_dma_map(struct vfio_iommu *iommu, dma_addr_t iova,
+ unsigned long vaddr, long npage, int prot)
+{
+ dma_addr_t start = iova;
+ long i, locked = 0;
+ int ret;
+
+ /* Verify that pages are not already mapped */
+ for (i = 0; i < npage; i++, iova += PAGE_SIZE)
+ if (iommu_iova_to_phys(iommu->domain, iova))
+ return -EBUSY;
+
+ iova = start;
+
+ if (iommu->cache)
+ prot |= IOMMU_CACHE;
+
+ /*
+ * XXX We break mappings into pages and use get_user_pages_fast to
+ * pin the pages in memory. It's been suggested that mlock might
+ * provide a more efficient mechanism, but nothing prevents the
+ * user from munlocking the pages, which could then allow the user
+ * access to random host memory. We also have no guarantee from the
+ * IOMMU API that the iommu driver can unmap sub-pages of previous
+ * mappings. This means we might lose an entire range if a single
+ * page within it is unmapped. Single page mappings are inefficient,
+ * but provide the most flexibility for now.
+ */
+ for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
+ unsigned long pfn = 0;
+
+ ret = vaddr_get_pfn(vaddr, prot, &pfn);
+ if (ret) {
+ __vfio_dma_do_unmap(iommu, start, i, prot);
+ return ret;
+ }
+
+ /*
+ * Only add actual locked pages to accounting
+ * XXX We're effectively marking a page locked for every
+ * IOVA page even though it's possible the user could be
+ * backing multiple IOVAs with the same vaddr. This over-
+ * penalizes the user process, but we currently have no
+ * easy way to do this properly.
+ */
+ if (!is_invalid_reserved_pfn(pfn))
+ locked++;
+
+ ret = iommu_map(iommu->domain, iova,
+ (phys_addr_t)pfn << PAGE_SHIFT,
+ PAGE_SIZE, prot);
+ if (ret) {
+ /* Back out mappings on error */
+ put_pfn(pfn, prot);
+ __vfio_dma_do_unmap(iommu, start, i, prot);
+ return ret;
+ }
+ }
+ vfio_lock_acct(locked);
+ return 0;
+}
+
+static inline bool ranges_overlap(dma_addr_t start1, size_t size1,
+ dma_addr_t start2, size_t size2)
+{
+ if (start1 < start2)
+ return (start2 - start1 < size1);
+ else if (start2 < start1)
+ return (start1 - start2 < size2);
+ return (size1 > 0 && size2 > 0);
+}
+
+static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
+ dma_addr_t start, size_t size)
+{
+ struct vfio_dma *dma;
+
+ list_for_each_entry(dma, &iommu->dma_list, next) {
+ if (ranges_overlap(dma->iova, NPAGE_TO_SIZE(dma->npage),
+ start, size))
+ return dma;
+ }
+ return NULL;
+}
+
+static long vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
+ size_t size, struct vfio_dma *dma)
+{
+ struct vfio_dma *split;
+ long npage_lo, npage_hi;
+
+ /* Existing dma region is completely covered, unmap all */
+ if (start <= dma->iova &&
+ start + size >= dma->iova + NPAGE_TO_SIZE(dma->npage)) {
+ vfio_dma_unmap(iommu, dma->iova, dma->npage, dma->prot);
+ list_del(&dma->next);
+ npage_lo = dma->npage;
+ kfree(dma);
+ return npage_lo;
+ }
+
+ /* Overlap low address of existing range */
+ if (start <= dma->iova) {
+ size_t overlap;
+
+ overlap = start + size - dma->iova;
+ npage_lo = overlap >> PAGE_SHIFT;
+
+ vfio_dma_unmap(iommu, dma->iova, npage_lo, dma->prot);
+ dma->iova += overlap;
+ dma->vaddr += overlap;
+ dma->npage -= npage_lo;
+ return npage_lo;
+ }
+
+ /* Overlap high address of existing range */
+ if (start + size >= dma->iova + NPAGE_TO_SIZE(dma->npage)) {
+ size_t overlap;
+
+ overlap = dma->iova + NPAGE_TO_SIZE(dma->npage) - start;
+ npage_hi = overlap >> PAGE_SHIFT;
+
+ vfio_dma_unmap(iommu, start, npage_hi, dma->prot);
+ dma->npage -= npage_hi;
+ return npage_hi;
+ }
+
+ /* Split existing */
+ npage_lo = (start - dma->iova) >> PAGE_SHIFT;
+ npage_hi = dma->npage - (size >> PAGE_SHIFT) - npage_lo;
+
+ split = kzalloc(sizeof *split, GFP_KERNEL);
+ if (!split)
+ return -ENOMEM;
+
+ vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, dma->prot);
+
+ dma->npage = npage_lo;
+
+ split->npage = npage_hi;
+ split->iova = start + size;
+ split->vaddr = dma->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
+ split->prot = dma->prot;
+ list_add(&split->next, &iommu->dma_list);
+ return size >> PAGE_SHIFT;
+}
+
+static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
+ struct vfio_iommu_x86_dma_unmap *unmap)
+{
+ long ret = 0, npage = unmap->size >> PAGE_SHIFT;
+ struct vfio_dma *dma, *tmp;
+ uint64_t mask;
+
+ mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) - 1;
+
+ if (unmap->iova & mask)
+ return -EINVAL;
+ if (unmap->size & mask)
+ return -EINVAL;
+
+ /* XXX We still break these down into PAGE_SIZE */
+ WARN_ON(mask & PAGE_MASK);
+
+ mutex_lock(&iommu->lock);
+
+ list_for_each_entry_safe(dma, tmp, &iommu->dma_list, next) {
+ if (ranges_overlap(dma->iova, NPAGE_TO_SIZE(dma->npage),
+ unmap->iova, unmap->size)) {
+ ret = vfio_remove_dma_overlap(iommu, unmap->iova,
+ unmap->size, dma);
+ if (ret > 0)
+ npage -= ret;
+ if (ret < 0 || npage == 0)
+ break;
+ }
+ }
+ mutex_unlock(&iommu->lock);
+ return ret > 0 ? 0 : (int)ret;
+}
+
+static int vfio_dma_do_map(struct vfio_iommu *iommu,
+ struct vfio_iommu_x86_dma_map *map)
+{
+ struct vfio_dma *dma, *pdma = NULL;
+ dma_addr_t iova = map->iova;
+ unsigned long locked, lock_limit, vaddr = map->vaddr;
+ size_t size = map->size;
+ int ret = 0, prot = 0;
+ uint64_t mask;
+ long npage;
+
+ mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) - 1;
+
+ /* READ/WRITE from device perspective */
+ if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
+ prot |= IOMMU_WRITE;
+ if (map->flags & VFIO_DMA_MAP_FLAG_READ)
+ prot |= IOMMU_READ;
+
+ if (!prot)
+ return -EINVAL; /* No READ/WRITE? */
+
+ if (vaddr & mask)
+ return -EINVAL;
+ if (iova & mask)
+ return -EINVAL;
+ if (size & mask)
+ return -EINVAL;
+
+ /* XXX We still break these down into PAGE_SIZE */
+ WARN_ON(mask & PAGE_MASK);
+
+ /* Don't allow IOVA wrap */
+ if (iova + size && iova + size < iova)
+ return -EINVAL;
+
+ /* Don't allow virtual address wrap */
+ if (vaddr + size && vaddr + size < vaddr)
+ return -EINVAL;
+
+ npage = size >> PAGE_SHIFT;
+ if (!npage)
+ return -EINVAL;
+
+ mutex_lock(&iommu->lock);
+
+ if (vfio_find_dma(iommu, iova, size)) {
+ ret = -EBUSY;
+ goto out_lock;
+ }
+
+ /* account for locked pages */
+ locked = current->mm->locked_vm + npage;
+ lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+ if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+ printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
+ __func__, rlimit(RLIMIT_MEMLOCK));
+ ret = -ENOMEM;
+ goto out_lock;
+ }
+
+ ret = __vfio_dma_map(iommu, iova, vaddr, npage, prot);
+ if (ret)
+ goto out_lock;
+
+ /* Check if we abut a region below - nothing below 0 */
+ if (iova) {
+ dma = vfio_find_dma(iommu, iova - 1, 1);
+ if (dma && dma->prot == prot &&
+ dma->vaddr + NPAGE_TO_SIZE(dma->npage) == vaddr) {
+
+ dma->npage += npage;
+ iova = dma->iova;
+ vaddr = dma->vaddr;
+ npage = dma->npage;
+ size = NPAGE_TO_SIZE(npage);
+
+ pdma = dma;
+ }
+ }
+
+ /* Check if we abut a region above - nothing above ~0 + 1 */
+ if (iova + size) {
+ dma = vfio_find_dma(iommu, iova + size, 1);
+ if (dma && dma->prot == prot &&
+ dma->vaddr == vaddr + size) {
+
+ dma->npage += npage;
+ dma->iova = iova;
+ dma->vaddr = vaddr;
+
+ /*
+ * If merged above and below, remove previously
+ * merged entry. New entry covers it.
+ */
+ if (pdma) {
+ list_del(&pdma->next);
+ kfree(pdma);
+ }
+ pdma = dma;
+ }
+ }
+
+ /* Isolated, new region */
+ if (!pdma) {
+ dma = kzalloc(sizeof *dma, GFP_KERNEL);
+ if (!dma) {
+ ret = -ENOMEM;
+ vfio_dma_unmap(iommu, iova, npage, prot);
+ goto out_lock;
+ }
+
+ dma->npage = npage;
+ dma->iova = iova;
+ dma->vaddr = vaddr;
+ dma->prot = prot;
+ list_add(&dma->next, &iommu->dma_list);
+ }
+
+out_lock:
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
+static int vfio_iommu_x86_attach_group(void *iommu_data,
+ struct iommu_group *iommu_group)
+{
+ struct vfio_iommu *iommu = iommu_data;
+ struct vfio_group *group, *tmp;
+ int ret;
+
+ group = kzalloc(sizeof(*group), GFP_KERNEL);
+ if (!group)
+ return -ENOMEM;
+
+ mutex_lock(&iommu->lock);
+
+ list_for_each_entry(tmp, &iommu->group_list, next) {
+ if (tmp->iommu_group == iommu_group) {
+ mutex_unlock(&iommu->lock);
+ kfree(group);
+ return -EINVAL;
+ }
+ }
+
+ /*
+ * TODO: Domains have capabilities that might change as we add
+ * groups (see iommu->cache, currently never set). Check for
+ * them and potentially disallow groups to be attached when it
+ * would change capabilities (ugh).
+ */
+ ret = iommu_attach_group(iommu->domain, iommu_group);
+ if (ret) {
+ mutex_unlock(&iommu->lock);
+ kfree(group);
+ return ret;
+ }
+
+ group->iommu_group = iommu_group;
+ list_add(&group->next, &iommu->group_list);
+
+ mutex_unlock(&iommu->lock);
+
+ return 0;
+}
+
+static void vfio_iommu_x86_detach_group(void *iommu_data,
+ struct iommu_group *iommu_group)
+{
+ struct vfio_iommu *iommu = iommu_data;
+ struct vfio_group *group;
+
+ mutex_lock(&iommu->lock);
+
+ list_for_each_entry(group, &iommu->group_list, next) {
+ if (group->iommu_group == iommu_group) {
+ iommu_detach_group(iommu->domain, iommu_group);
+ list_del(&group->next);
+ kfree(group);
+ break;
+ }
+ }
+
+ mutex_unlock(&iommu->lock);
+}
+
+static void *vfio_iommu_x86_open(unsigned long arg)
+{
+ struct vfio_iommu *iommu;
+
+ if (arg != VFIO_X86_IOMMU)
+ return ERR_PTR(-EINVAL);
+
+ iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
+ if (!iommu)
+ return ERR_PTR(-ENOMEM);
+
+ INIT_LIST_HEAD(&iommu->group_list);
+ INIT_LIST_HEAD(&iommu->dma_list);
+ mutex_init(&iommu->lock);
+
+ /*
+ * Wish we didn't have to know about bus_type here.
+ */
+ iommu->domain = iommu_domain_alloc(&pci_bus_type);
+ if (!iommu->domain) {
+ kfree(iommu);
+ return ERR_PTR(-EIO);
+ }
+
+ /*
+ * Wish we could specify required capabilities rather than create
+ * a domain, see what comes out and hope it doesn't change along
+ * the way. Fortunately we know interrupt remapping is global for
+ * our iommus.
+ */
+ if (!allow_unsafe_interrupts &&
+ !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
+ printk(KERN_WARNING
+ "%s: No interrupt remapping support. Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
+ __func__);
+ iommu_domain_free(iommu->domain);
+ kfree(iommu);
+ return ERR_PTR(-EPERM);
+ }
+
+ return iommu;
+}
+
+static void vfio_iommu_x86_release(void *iommu_data)
+{
+ struct vfio_iommu *iommu = iommu_data;
+ struct vfio_group *group, *group_tmp;
+ struct vfio_dma *dma, *dma_tmp;
+
+ list_for_each_entry_safe(group, group_tmp, &iommu->group_list, next) {
+ iommu_detach_group(iommu->domain, group->iommu_group);
+ list_del(&group->next);
+ kfree(group);
+ }
+
+ list_for_each_entry_safe(dma, dma_tmp, &iommu->dma_list, next) {
+ vfio_dma_unmap(iommu, dma->iova, dma->npage, dma->prot);
+ list_del(&dma->next);
+ kfree(dma);
+ }
+
+ iommu_domain_free(iommu->domain);
+ iommu->domain = NULL;
+ kfree(iommu);
+}
+
+static long vfio_iommu_x86_ioctl(void *iommu_data,
+ unsigned int cmd, unsigned long arg)
+{
+ struct vfio_iommu *iommu = iommu_data;
+ unsigned long minsz;
+
+ if (cmd == VFIO_CHECK_EXTENSION) {
+ switch (arg) {
+ case VFIO_X86_IOMMU:
+ return 1;
+ default:
+ return 0;
+ }
+ } else if (cmd == VFIO_IOMMU_GET_INFO) {
+ struct vfio_iommu_x86_info info;
+
+ minsz = offsetofend(struct vfio_iommu_x86_info, iova_pgsizes);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (info.argsz < minsz)
+ return -EINVAL;
+
+ info.flags = 0;
+
+ info.iova_pgsizes = iommu->domain->ops->pgsize_bitmap;
+
+ return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+
+ } else if (cmd == VFIO_IOMMU_MAP_DMA) {
+ struct vfio_iommu_x86_dma_map map;
+ uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
+ VFIO_DMA_MAP_FLAG_WRITE;
+
+ minsz = offsetofend(struct vfio_iommu_x86_dma_map, size);
+
+ if (copy_from_user(&map, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (map.argsz < minsz || map.flags & ~mask)
+ return -EINVAL;
+
+ return vfio_dma_do_map(iommu, &map);
+
+ } else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
+ struct vfio_iommu_x86_dma_unmap unmap;
+
+ minsz = offsetofend(struct vfio_iommu_x86_dma_unmap, size);
+
+ if (copy_from_user(&unmap, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (unmap.argsz < minsz || unmap.flags)
+ return -EINVAL;
+
+ return vfio_dma_do_unmap(iommu, &unmap);
+ }
+
+ return -ENOTTY;
+}
+
+const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_x86 = {
+ .name = "vfio-iommu-x86",
+ .owner = THIS_MODULE,
+ .open = vfio_iommu_x86_open,
+ .release = vfio_iommu_x86_release,
+ .ioctl = vfio_iommu_x86_ioctl,
+ .attach_group = vfio_iommu_x86_attach_group,
+ .detach_group = vfio_iommu_x86_detach_group,
+};
+
+static int __init vfio_iommu_x86_init(void)
+{
+ if (!iommu_present(&pci_bus_type))
+ return -ENODEV;
+
+ return vfio_register_iommu_driver(&vfio_iommu_driver_ops_x86);
+}
+
+static void __exit vfio_iommu_x86_cleanup(void)
+{
+ vfio_unregister_iommu_driver(&vfio_iommu_driver_ops_x86);
+}
+
+module_init(vfio_iommu_x86_init);
+module_exit(vfio_iommu_x86_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 76a8f97..b3e4583 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -363,4 +363,56 @@ struct vfio_irq_set {
*/
#define VFIO_DEVICE_RESET _IO(VFIO_TYPE, VFIO_BASE + 11)

+/* -------- API for x86 VFIO IOMMU -------- */
+
+/**
+ * VFIO_IOMMU_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 12, struct vfio_iommu_x86_info)
+ *
+ * Retrieve information about the IOMMU object. Fills in provided
+ * struct vfio_iommu_x86_info. Caller sets argsz.
+ *
+ * XXX Should we do these by CHECK_EXTENSION too?
+ */
+struct vfio_iommu_x86_info {
+ __u32 argsz;
+ __u32 flags;
+#define VFIO_IOMMU_INFO_PGSIZES (1 << 0) /* supported page sizes info */
+ __u64 iova_pgsizes; /* Bitmap of supported page sizes */
+};
+
+#define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
+
+/**
+ * VFIO_IOMMU_MAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 13, struct vfio_iommu_x86_dma_map)
+ *
+ * Map process virtual addresses to IO virtual addresses using the
+ * provided struct vfio_iommu_x86_dma_map. Caller sets argsz. READ and/or WRITE required.
+ */
+struct vfio_iommu_x86_dma_map {
+ __u32 argsz;
+ __u32 flags;
+#define VFIO_DMA_MAP_FLAG_READ (1 << 0) /* readable from device */
+#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1) /* writable from device */
+ __u64 vaddr; /* Process virtual address */
+ __u64 iova; /* IO virtual address */
+ __u64 size; /* Size of mapping (bytes) */
+};
+
+#define VFIO_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
+
+/**
+ * VFIO_IOMMU_UNMAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 14, struct vfio_iommu_x86_dma_unmap)
+ *
+ * Unmap IO virtual addresses using the provided struct
+ * vfio_iommu_x86_dma_unmap. Caller sets argsz.
+ */
+struct vfio_iommu_x86_dma_unmap {
+ __u32 argsz;
+ __u32 flags;
+ __u64 iova; /* IO virtual address */
+ __u64 size; /* Size of mapping (bytes) */
+};
+
+#define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
+
#endif /* VFIO_H */

2012-05-22 05:08:51

by Alex Williamson

Subject: [PATCH v2 06/13] iommu: Make use of DMA quirking and ACS enabled check for groups

Incorporate DMA quirking and ACS checking into amd_iommu and
intel-iommu. Note that IOMMU groups are not yet used for
streaming DMA, so this doesn't immediately solve the problems
with broken Ricoh devices. This is a very strict implementation of
ACS checking, which will often result in multifunction devices
being grouped together. This is actually a good thing as we
generally have no reason to trust isolation between functions,
but I won't be surprised if we later add a boot option to relax
this if a user wants to opt-in to a less secure grouping.
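
To make the upstream walk concrete, here is a minimal userspace sketch
of the loop added below. It is only a model: struct toy_dev,
toy_acs_path_enabled() and toy_group_owner() are hypothetical names
invented for illustration, a single bool stands in for the four ACS
bits in PCI_ACS_ENABLED, and the multifunction step is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a PCI device; not a kernel structure. */
struct toy_dev {
	struct toy_dev *upstream;	/* bridge above us, NULL on the root bus */
	bool acs;			/* all required ACS bits are enabled */
};

/* True if every bridge from @dev up to the root has ACS enabled. */
bool toy_acs_path_enabled(const struct toy_dev *dev)
{
	for (; dev; dev = dev->upstream)
		if (!dev->acs)
			return false;
	return true;
}

/*
 * Walk upstream until the remaining path to the root is fully ACS
 * protected. The device we stop at is the one whose group is used.
 */
struct toy_dev *toy_group_owner(struct toy_dev *dev)
{
	assert(dev != NULL);

	while (dev->upstream) {
		if (toy_acs_path_enabled(dev->upstream))
			break;
		dev = dev->upstream;
	}
	return dev;
}
```

With an endpoint behind a switch port that lacks ACS, the model stops at
the switch port, so everything below that port would share a group. If
the whole path has ACS enabled, the endpoint keeps a group of its own.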

Signed-off-by: Alex Williamson <[email protected]>
---

drivers/iommu/amd_iommu.c | 18 ++++++++++++++++++
drivers/iommu/intel-iommu.c | 18 ++++++++++++++++++
2 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index b7e5ddf..be72d6d 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -254,6 +254,8 @@ static bool check_device(struct device *dev)
return true;
}

+#define PCI_ACS_ENABLED (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)
+
static int iommu_init_device(struct device *dev)
{
struct pci_dev *dma_pdev, *pdev = to_pci_dev(dev);
@@ -291,6 +293,22 @@ static int iommu_init_device(struct device *dev)
dma_pdev = pci_get_slot(pdev->bus,
PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));

+ dma_pdev = pci_dma_source(dma_pdev);
+
+ if (dma_pdev->multifunction &&
+ !pci_acs_enabled(dma_pdev, PCI_ACS_ENABLED))
+ dma_pdev = pci_get_slot(dma_pdev->bus,
+ PCI_DEVFN(PCI_SLOT(dma_pdev->devfn),
+ 0));
+
+ while (!pci_is_root_bus(dma_pdev->bus)) {
+ if (pci_acs_path_enabled(dma_pdev->bus->self,
+ NULL, PCI_ACS_ENABLED))
+ break;
+
+ dma_pdev = dma_pdev->bus->self;
+ }
+
group = iommu_group_get(&dma_pdev->dev);
if (!group) {
group = iommu_group_alloc();
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index e63b33b..cf2a650 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4087,6 +4087,8 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
return 0;
}

+#define PCI_ACS_ENABLED (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)
+
static int intel_iommu_add_device(struct device *dev)
{
struct pci_dev *pdev = to_pci_dev(dev);
@@ -4113,6 +4115,22 @@ static int intel_iommu_add_device(struct device *dev)
dma_pdev = pci_get_slot(pdev->bus,
PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));

+ dma_pdev = pci_dma_source(dma_pdev);
+
+ if (dma_pdev->multifunction &&
+ !pci_acs_enabled(dma_pdev, PCI_ACS_ENABLED))
+ dma_pdev = pci_get_slot(dma_pdev->bus,
+ PCI_DEVFN(PCI_SLOT(dma_pdev->devfn),
+ 0));
+
+ while (!pci_is_root_bus(dma_pdev->bus)) {
+ if (pci_acs_path_enabled(dma_pdev->bus->self,
+ NULL, PCI_ACS_ENABLED))
+ break;
+
+ dma_pdev = dma_pdev->bus->self;
+ }
+
group = iommu_group_get(&dma_pdev->dev);
if (!group) {
group = iommu_group_alloc();

2012-05-22 05:05:26

by Alex Williamson

Subject: [PATCH v2 02/13] iommu: IOMMU Groups

IOMMU device groups are currently a rather vague associative notion
with assembly required by the user or user level driver provider to
do anything useful. This patch intends to grow the IOMMU group concept
into something a bit more consumable.

To do this, we first create an object representing the group, struct
iommu_group. This structure is allocated (iommu_group_alloc) and
filled (iommu_group_add_device) by the iommu driver. The iommu driver
is free to add devices to the group using its own set of policies.
This allows inclusion of devices based on physical hardware or topology
limitations of the platform, as well as soft requirements, such as
multi-function trust levels or peer-to-peer protection of the
interconnects. Each device may only belong to a single iommu group,
which is linked from struct device.iommu_group. IOMMU groups are
maintained using kobject reference counting, allowing for automatic
removal of empty, unreferenced groups. It is the responsibility of
the iommu driver to remove devices from the group
(iommu_group_remove_device).
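
The lifetime rules above can be modelled in userspace roughly as
follows. This is a sketch under stated assumptions, not the kernel
code: struct toy_group and the toy_group_*() helpers are hypothetical,
and a plain integer stands in for the kobject reference count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for struct iommu_group. */
struct toy_group {
	int refs;	/* models the kobject reference count */
	int ndevices;	/* devices currently in the group */
	bool *released;	/* set to true when the group is freed */
};

/* Allocate a group; the caller holds the initial reference. */
struct toy_group *toy_group_alloc(bool *released)
{
	struct toy_group *g = calloc(1, sizeof(*g));

	if (!g)
		return NULL;
	g->refs = 1;
	g->released = released;
	*released = false;
	return g;
}

/* Drop one reference; the group is freed when none remain. */
void toy_group_put(struct toy_group *g)
{
	assert(g != NULL && g->refs > 0);
	if (--g->refs == 0) {
		*g->released = true;
		free(g);
	}
}

/* Each device in the group pins it with its own reference. */
void toy_group_add_device(struct toy_group *g)
{
	g->refs++;
	g->ndevices++;
}

/* Removing a device drops the reference taken when it was added. */
void toy_group_remove_device(struct toy_group *g)
{
	g->ndevices--;
	toy_group_put(g);
}
```

In this model a driver would allocate, add its device, then put its own
reference. The group survives on the device's reference and goes away
once the last device is removed.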

IOMMU groups also include a userspace representation in sysfs under
/sys/kernel/iommu_groups. When allocated, each group is given a
dynamically assigned ID (int). The ID is managed by the core IOMMU group
code to support multiple heterogeneous iommu drivers, which could
potentially collide in group naming/numbering. This also keeps group
IDs to small, easily managed values. A directory is created under
/sys/kernel/iommu_groups for each group. A further subdirectory named
"devices" contains links to each device within the group. The iommu_group
file in the device's sysfs directory, which formerly contained a group
number when read, is now a link to the iommu group. Example:

$ ls -l /sys/kernel/iommu_groups/26/devices/
total 0
lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:00:1e.0 ->
../../../../devices/pci0000:00/0000:00:1e.0
lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.0 ->
../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.1 ->
../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1

$ ls -l /sys/kernel/iommu_groups/26/devices/*/iommu_group
[truncating perms/owner/timestamp]
/sys/kernel/iommu_groups/26/devices/0000:00:1e.0/iommu_group ->
../../../kernel/iommu_groups/26
/sys/kernel/iommu_groups/26/devices/0000:06:0d.0/iommu_group ->
../../../../kernel/iommu_groups/26
/sys/kernel/iommu_groups/26/devices/0000:06:0d.1/iommu_group ->
../../../../kernel/iommu_groups/26

Groups also include several exported functions for use by user level
driver providers, for example VFIO. These include:

iommu_group_get(): Acquires a reference to a group from a device
iommu_group_put(): Releases reference
iommu_group_for_each_dev(): Iterates over group devices using callback
iommu_group_[un]register_notifier(): Allows notification of device add
and remove operations relevant to the group
iommu_group_id(): Return the group number

This patch also extends the IOMMU API to allow attaching groups to
domains. This is currently a simple wrapper for iterating through
devices within a group, but it's expected that the IOMMU API may
eventually make groups a more integral part of domains.

Groups intentionally do not try to manage group ownership. A user
level driver provider must independently acquire ownership for each
device within a group before making use of the group as a whole.
This may change in the future if group usage becomes more pervasive
across both DMA and IOMMU ops.

Groups intentionally do not provide a mechanism for driver locking
or otherwise manipulating driver matching/probing of devices within
the group. Such interfaces are generic to devices and beyond the
scope of IOMMU groups. If implemented, user level providers have
ready access via iommu_group_for_each_dev and group notifiers.

iommu_device_group() is removed here as it has no users. The
replacement is:

group = iommu_group_get(dev);
id = iommu_group_id(group);
iommu_group_put(group);

AMD-Vi & Intel VT-d support re-added in following patches.

Signed-off-by: Alex Williamson <[email protected]>
---

.../ABI/testing/sysfs-kernel-iommu_groups | 14
drivers/iommu/amd_iommu.c | 21 -
drivers/iommu/intel-iommu.c | 49 --
drivers/iommu/iommu.c | 578 +++++++++++++++++++-
include/linux/iommu.h | 104 +++-
5 files changed, 663 insertions(+), 103 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-iommu_groups

diff --git a/Documentation/ABI/testing/sysfs-kernel-iommu_groups b/Documentation/ABI/testing/sysfs-kernel-iommu_groups
new file mode 100644
index 0000000..9b31556
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-iommu_groups
@@ -0,0 +1,14 @@
+What: /sys/kernel/iommu_groups/
+Date: May 2012
+KernelVersion: v3.5
+Contact: Alex Williamson <[email protected]>
+Description: /sys/kernel/iommu_groups/ contains a number of sub-
+ directories, each representing an IOMMU group. The
+ name of the sub-directory matches the iommu_group_id()
+ for the group, which is an integer value. Within each
+ subdirectory is another directory named "devices" with
+ links to the sysfs devices contained in this group.
+ The group directory also optionally contains a "name"
+ file if the IOMMU driver has chosen to register a more
+ common name for the group.
+Users:
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index a5bee8e..32c00cd 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -3193,26 +3193,6 @@ static int amd_iommu_domain_has_cap(struct iommu_domain *domain,
return 0;
}

-static int amd_iommu_device_group(struct device *dev, unsigned int *groupid)
-{
- struct iommu_dev_data *dev_data = dev->archdata.iommu;
- struct pci_dev *pdev = to_pci_dev(dev);
- u16 devid;
-
- if (!dev_data)
- return -ENODEV;
-
- if (pdev->is_virtfn || !iommu_group_mf)
- devid = dev_data->devid;
- else
- devid = calc_devid(pdev->bus->number,
- PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
-
- *groupid = amd_iommu_alias_table[devid];
-
- return 0;
-}
-
static struct iommu_ops amd_iommu_ops = {
.domain_init = amd_iommu_domain_init,
.domain_destroy = amd_iommu_domain_destroy,
@@ -3222,7 +3202,6 @@ static struct iommu_ops amd_iommu_ops = {
.unmap = amd_iommu_unmap,
.iova_to_phys = amd_iommu_iova_to_phys,
.domain_has_cap = amd_iommu_domain_has_cap,
- .device_group = amd_iommu_device_group,
.pgsize_bitmap = AMD_IOMMU_PGSIZES,
};

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index f93d5ac..d4a0ff7 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4087,54 +4087,6 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
return 0;
}

-/*
- * Group numbers are arbitrary. Device with the same group number
- * indicate the iommu cannot differentiate between them. To avoid
- * tracking used groups we just use the seg|bus|devfn of the lowest
- * level we're able to differentiate devices
- */
-static int intel_iommu_device_group(struct device *dev, unsigned int *groupid)
-{
- struct pci_dev *pdev = to_pci_dev(dev);
- struct pci_dev *bridge;
- union {
- struct {
- u8 devfn;
- u8 bus;
- u16 segment;
- } pci;
- u32 group;
- } id;
-
- if (iommu_no_mapping(dev))
- return -ENODEV;
-
- id.pci.segment = pci_domain_nr(pdev->bus);
- id.pci.bus = pdev->bus->number;
- id.pci.devfn = pdev->devfn;
-
- if (!device_to_iommu(id.pci.segment, id.pci.bus, id.pci.devfn))
- return -ENODEV;
-
- bridge = pci_find_upstream_pcie_bridge(pdev);
- if (bridge) {
- if (pci_is_pcie(bridge)) {
- id.pci.bus = bridge->subordinate->number;
- id.pci.devfn = 0;
- } else {
- id.pci.bus = bridge->bus->number;
- id.pci.devfn = bridge->devfn;
- }
- }
-
- if (!pdev->is_virtfn && iommu_group_mf)
- id.pci.devfn = PCI_DEVFN(PCI_SLOT(id.pci.devfn), 0);
-
- *groupid = id.group;
-
- return 0;
-}
-
static struct iommu_ops intel_iommu_ops = {
.domain_init = intel_iommu_domain_init,
.domain_destroy = intel_iommu_domain_destroy,
@@ -4144,7 +4096,6 @@ static struct iommu_ops intel_iommu_ops = {
.unmap = intel_iommu_unmap,
.iova_to_phys = intel_iommu_iova_to_phys,
.domain_has_cap = intel_iommu_domain_has_cap,
- .device_group = intel_iommu_device_group,
.pgsize_bitmap = INTEL_IOMMU_PGSIZES,
};

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 2198b2d..9fdda5e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -26,60 +26,535 @@
#include <linux/slab.h>
#include <linux/errno.h>
#include <linux/iommu.h>
+#include <linux/idr.h>
+#include <linux/notifier.h>
+#include <linux/err.h>
+
+static struct kset *iommu_group_kset;
+static struct ida iommu_group_ida;
+static struct mutex iommu_group_mutex;
+
+struct iommu_group {
+ struct kobject kobj;
+ struct kobject *devices_kobj;
+ struct list_head devices;
+ struct mutex mutex;
+ struct blocking_notifier_head notifier;
+ void *iommu_data;
+ void (*iommu_data_release)(void *iommu_data);
+ char *name;
+ int id;
+};
+
+struct iommu_device {
+ struct list_head list;
+ struct device *dev;
+ char *name;
+};
+
+struct iommu_group_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct iommu_group *group, char *buf);
+ ssize_t (*store)(struct iommu_group *group,
+ const char *buf, size_t count);
+};
+
+#define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \
+struct iommu_group_attribute iommu_group_attr_##_name = \
+ __ATTR(_name, _mode, _show, _store)

-static ssize_t show_iommu_group(struct device *dev,
- struct device_attribute *attr, char *buf)
+#define to_iommu_group_attr(_attr) \
+ container_of(_attr, struct iommu_group_attribute, attr)
+#define to_iommu_group(_kobj) \
+ container_of(_kobj, struct iommu_group, kobj)
+
+static ssize_t iommu_group_attr_show(struct kobject *kobj,
+ struct attribute *__attr, char *buf)
{
- unsigned int groupid;
+ struct iommu_group_attribute *attr = to_iommu_group_attr(__attr);
+ struct iommu_group *group = to_iommu_group(kobj);
+ ssize_t ret = -EIO;

- if (iommu_device_group(dev, &groupid))
- return 0;
+ if (attr->show)
+ ret = attr->show(group, buf);
+ return ret;
+}
+
+static ssize_t iommu_group_attr_store(struct kobject *kobj,
+ struct attribute *__attr,
+ const char *buf, size_t count)
+{
+ struct iommu_group_attribute *attr = to_iommu_group_attr(__attr);
+ struct iommu_group *group = to_iommu_group(kobj);
+ ssize_t ret = -EIO;

- return sprintf(buf, "%u", groupid);
+ if (attr->store)
+ ret = attr->store(group, buf, count);
+ return ret;
}
-static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);

-static int add_iommu_group(struct device *dev, void *data)
+static const struct sysfs_ops iommu_group_sysfs_ops = {
+ .show = iommu_group_attr_show,
+ .store = iommu_group_attr_store,
+};
+
+static int iommu_group_create_file(struct iommu_group *group,
+ struct iommu_group_attribute *attr)
+{
+ return sysfs_create_file(&group->kobj, &attr->attr);
+}
+
+static void iommu_group_remove_file(struct iommu_group *group,
+ struct iommu_group_attribute *attr)
+{
+ sysfs_remove_file(&group->kobj, &attr->attr);
+}
+
+static ssize_t iommu_group_show_name(struct iommu_group *group, char *buf)
+{
+ return sprintf(buf, "%s\n", group->name);
+}
+
+static IOMMU_GROUP_ATTR(name, S_IRUGO, iommu_group_show_name, NULL);
+
+static void iommu_group_release(struct kobject *kobj)
+{
+ struct iommu_group *group = to_iommu_group(kobj);
+
+ if (group->iommu_data_release)
+ group->iommu_data_release(group->iommu_data);
+
+ mutex_lock(&iommu_group_mutex);
+ ida_remove(&iommu_group_ida, group->id);
+ mutex_unlock(&iommu_group_mutex);
+
+ kfree(group->name);
+ kfree(group);
+}
+
+static struct kobj_type iommu_group_ktype = {
+ .sysfs_ops = &iommu_group_sysfs_ops,
+ .release = iommu_group_release,
+};
+
+/**
+ * iommu_group_alloc - Allocate a new group
+ * @name: Optional name to associate with group, visible in sysfs
+ *
+ * This function is called by an iommu driver to allocate a new iommu
+ * group. The iommu group represents the minimum granularity of the iommu.
+ * Upon successful return, the caller holds a reference to the supplied
+ * group in order to hold the group until devices are added. Use
+ * iommu_group_put() to release this extra reference count, allowing the
+ * group to be automatically reclaimed once it has no devices or external
+ * references.
+ */
+struct iommu_group *iommu_group_alloc(void)
{
- unsigned int groupid;
+ struct iommu_group *group;
+ int ret;
+
+ group = kzalloc(sizeof(*group), GFP_KERNEL);
+ if (!group)
+ return ERR_PTR(-ENOMEM);
+
+ group->kobj.kset = iommu_group_kset;
+ mutex_init(&group->mutex);
+ INIT_LIST_HEAD(&group->devices);
+ BLOCKING_INIT_NOTIFIER_HEAD(&group->notifier);
+
+ mutex_lock(&iommu_group_mutex);
+
+again:
+ if (unlikely(0 == ida_pre_get(&iommu_group_ida, GFP_KERNEL))) {
+ kfree(group);
+ mutex_unlock(&iommu_group_mutex);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ if (-EAGAIN == ida_get_new(&iommu_group_ida, &group->id))
+ goto again;
+
+ mutex_unlock(&iommu_group_mutex);

- if (iommu_device_group(dev, &groupid) == 0)
- return device_create_file(dev, &dev_attr_iommu_group);
+ ret = kobject_init_and_add(&group->kobj, &iommu_group_ktype,
+ NULL, "%d", group->id);
+ if (ret) {
+ mutex_lock(&iommu_group_mutex);
+ ida_remove(&iommu_group_ida, group->id);
+ mutex_unlock(&iommu_group_mutex);
+ kfree(group);
+ return ERR_PTR(ret);
+ }
+
+ group->devices_kobj = kobject_create_and_add("devices", &group->kobj);
+ if (!group->devices_kobj) {
+ kobject_put(&group->kobj); /* triggers .release & free */
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /*
+ * The devices_kobj holds a reference on the group kobject, so
+ * as long as that exists so will the group. We can therefore
+ * use the devices_kobj for reference counting.
+ */
+ kobject_put(&group->kobj);
+
+ return group;
+}
+EXPORT_SYMBOL_GPL(iommu_group_alloc);
+
+/**
+ * iommu_group_get_iommudata - retrieve iommu_data registered for a group
+ * @group: the group
+ *
+ * iommu drivers can store data in the group for use when doing iommu
+ * operations. This function provides a way to retrieve it. Caller
+ * should hold a group reference.
+ */
+void *iommu_group_get_iommudata(struct iommu_group *group)
+{
+ return group->iommu_data;
+}
+EXPORT_SYMBOL_GPL(iommu_group_get_iommudata);
+
+/**
+ * iommu_group_set_iommudata - set iommu_data for a group
+ * @group: the group
+ * @iommu_data: new data
+ * @release: release function for iommu_data
+ *
+ * iommu drivers can store data in the group for use when doing iommu
+ * operations. This function provides a way to set the data after
+ * the group has been allocated. Caller should hold a group reference.
+ */
+void iommu_group_set_iommudata(struct iommu_group *group, void *iommu_data,
+ void (*release)(void *iommu_data))
+{
+ group->iommu_data = iommu_data;
+ group->iommu_data_release = release;
+}
+EXPORT_SYMBOL_GPL(iommu_group_set_iommudata);
+
+/**
+ * iommu_group_set_name - set name for a group
+ * @group: the group
+ * @name: name
+ *
+ * Allow iommu driver to set a name for a group. When set it will
+ * appear in a name attribute file under the group in sysfs.
+ */
+int iommu_group_set_name(struct iommu_group *group, const char *name)
+{
+ int ret;
+
+ if (group->name) {
+ iommu_group_remove_file(group, &iommu_group_attr_name);
+ kfree(group->name);
+ group->name = NULL;
+ if (!name)
+ return 0;
+ }
+
+ group->name = kstrdup(name, GFP_KERNEL);
+ if (!group->name)
+ return -ENOMEM;
+
+ ret = iommu_group_create_file(group, &iommu_group_attr_name);
+ if (ret) {
+ kfree(group->name);
+ group->name = NULL;
+ return ret;
+ }

return 0;
}
+EXPORT_SYMBOL_GPL(iommu_group_set_name);

-static int remove_iommu_group(struct device *dev)
+/**
+ * iommu_group_add_device - add a device to an iommu group
+ * @group: the group into which to add the device (reference should be held)
+ * @dev: the device
+ *
+ * This function is called by an iommu driver to add a device into a
+ * group. Adding a device increments the group reference count.
+ */
+int iommu_group_add_device(struct iommu_group *group, struct device *dev)
{
- unsigned int groupid;
+ int ret, i = 0;
+ struct iommu_device *device;
+
+ device = kzalloc(sizeof(*device), GFP_KERNEL);
+ if (!device)
+ return -ENOMEM;
+
+ device->dev = dev;

- if (iommu_device_group(dev, &groupid) == 0)
- device_remove_file(dev, &dev_attr_iommu_group);
+ ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
+ if (ret) {
+ kfree(device);
+ return ret;
+ }
+
+ device->name = kasprintf(GFP_KERNEL, "%s", kobject_name(&dev->kobj));
+rename:
+ if (!device->name) {
+ sysfs_remove_link(&dev->kobj, "iommu_group");
+ kfree(device);
+ return -ENOMEM;
+ }

+ ret = sysfs_create_link_nowarn(group->devices_kobj,
+ &dev->kobj, device->name);
+ if (ret) {
+ kfree(device->name);
+ if (ret == -EEXIST && i >= 0) {
+ /*
+ * Account for the slim chance of collision
+ * and append an instance to the name.
+ */
+ device->name = kasprintf(GFP_KERNEL, "%s.%d",
+ kobject_name(&dev->kobj), i++);
+ goto rename;
+ }
+
+ sysfs_remove_link(&dev->kobj, "iommu_group");
+ kfree(device);
+ return ret;
+ }
+
+ kobject_get(group->devices_kobj);
+
+ dev->iommu_group = group;
+
+ mutex_lock(&group->mutex);
+ list_add_tail(&device->list, &group->devices);
+ mutex_unlock(&group->mutex);
+
+ /* Notify any listeners about change to group. */
+ blocking_notifier_call_chain(&group->notifier,
+ IOMMU_GROUP_NOTIFY_ADD_DEVICE, dev);
return 0;
}
+EXPORT_SYMBOL_GPL(iommu_group_add_device);

-static int iommu_device_notifier(struct notifier_block *nb,
- unsigned long action, void *data)
+/**
+ * iommu_group_remove_device - remove a device from its current group
+ * @dev: device to be removed
+ *
+ * This function is called by an iommu driver to remove the device from
+ * its current group. This decrements the iommu group reference count.
+ */
+void iommu_group_remove_device(struct device *dev)
+{
+ struct iommu_group *group = dev->iommu_group;
+ struct iommu_device *tmp_device, *device = NULL;
+
+ /* Pre-notify listeners that a device is being removed. */
+ blocking_notifier_call_chain(&group->notifier,
+ IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);
+
+ mutex_lock(&group->mutex);
+ list_for_each_entry(tmp_device, &group->devices, list) {
+ if (tmp_device->dev == dev) {
+ device = tmp_device;
+ list_del(&device->list);
+ break;
+ }
+ }
+ mutex_unlock(&group->mutex);
+
+ if (!device)
+ return;
+
+ sysfs_remove_link(group->devices_kobj, device->name);
+ sysfs_remove_link(&dev->kobj, "iommu_group");
+
+ kfree(device->name);
+ kfree(device);
+ dev->iommu_group = NULL;
+ kobject_put(group->devices_kobj);
+}
+EXPORT_SYMBOL_GPL(iommu_group_remove_device);
+
+/**
+ * iommu_group_for_each_dev - iterate over each device in the group
+ * @group: the group
+ * @data: caller opaque data to be passed to callback function
+ * @fn: caller supplied callback function
+ *
+ * This function is called by group users to iterate over group devices.
+ * Callers should hold a reference count to the group during callback.
+ * The group->mutex is held across callbacks, which will block calls to
+ * iommu_group_add/remove_device.
+ */
+int iommu_group_for_each_dev(struct iommu_group *group, void *data,
+ int (*fn)(struct device *, void *))
+{
+ struct iommu_device *device;
+ int ret = 0;
+
+ mutex_lock(&group->mutex);
+ list_for_each_entry(device, &group->devices, list) {
+ ret = fn(device->dev, data);
+ if (ret)
+ break;
+ }
+ mutex_unlock(&group->mutex);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_group_for_each_dev);
+
+/**
+ * iommu_group_get - Return the group for a device and increment reference
+ * @dev: get the group that this device belongs to
+ *
+ * This function is called by iommu drivers and users to get the group
+ * for the specified device. If found, the group is returned and the group
+ * reference is incremented, else NULL.
+ */
+struct iommu_group *iommu_group_get(struct device *dev)
+{
+ struct iommu_group *group = dev->iommu_group;
+
+ if (group)
+ kobject_get(group->devices_kobj);
+
+ return group;
+}
+EXPORT_SYMBOL_GPL(iommu_group_get);
+
+/**
+ * iommu_group_put - Decrement group reference
+ * @group: the group to use
+ *
+ * This function is called by iommu drivers and users to release the
+ * iommu group. Once the reference count is zero, the group is released.
+ */
+void iommu_group_put(struct iommu_group *group)
+{
+ if (group)
+ kobject_put(group->devices_kobj);
+}
+EXPORT_SYMBOL_GPL(iommu_group_put);
+
+/**
+ * iommu_group_register_notifier - Register a notifier for group changes
+ * @group: the group to watch
+ * @nb: notifier block to signal
+ *
+ * This function allows iommu group users to track changes in a group.
+ * See include/linux/iommu.h for actions sent via this notifier. Caller
+ * should hold a reference to the group throughout notifier registration.
+ */
+int iommu_group_register_notifier(struct iommu_group *group,
+ struct notifier_block *nb)
+{
+ return blocking_notifier_chain_register(&group->notifier, nb);
+}
+EXPORT_SYMBOL_GPL(iommu_group_register_notifier);
+
+/**
+ * iommu_group_unregister_notifier - Unregister a notifier
+ * @group: the group to watch
+ * @nb: notifier block to signal
+ *
+ * Unregister a previously registered group notifier block.
+ */
+int iommu_group_unregister_notifier(struct iommu_group *group,
+ struct notifier_block *nb)
+{
+ return blocking_notifier_chain_unregister(&group->notifier, nb);
+}
+EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
+
+/**
+ * iommu_group_id - Return ID for a group
+ * @group: the group to ID
+ *
+ * Return the unique ID for the group matching the sysfs group number.
+ */
+int iommu_group_id(struct iommu_group *group)
+{
+ return group->id;
+}
+EXPORT_SYMBOL_GPL(iommu_group_id);
+
+static int add_iommu_group(struct device *dev, void *data)
+{
+ struct iommu_ops *ops = data;
+
+ if (!ops->add_device)
+ return -ENODEV;
+
+ WARN_ON(dev->iommu_group);
+
+ ops->add_device(dev);
+
+ return 0;
+}
+
+static int iommu_bus_notifier(struct notifier_block *nb,
+ unsigned long action, void *data)
{
struct device *dev = data;
+ struct iommu_ops *ops = dev->bus->iommu_ops;
+ struct iommu_group *group;
+ unsigned long group_action = 0;
+
+ /*
+ * ADD/DEL call into iommu driver ops if provided, which may
+ * result in ADD/DEL notifiers to group->notifier
+ */
+ if (action == BUS_NOTIFY_ADD_DEVICE) {
+ if (ops->add_device)
+ return ops->add_device(dev);
+ } else if (action == BUS_NOTIFY_DEL_DEVICE) {
+ if (ops->remove_device && dev->iommu_group) {
+ ops->remove_device(dev);
+ return 0;
+ }
+ }

- if (action == BUS_NOTIFY_ADD_DEVICE)
- return add_iommu_group(dev, NULL);
- else if (action == BUS_NOTIFY_DEL_DEVICE)
- return remove_iommu_group(dev);
+ /*
+ * Remaining BUS_NOTIFYs get filtered and republished to the
+ * group, if anyone is listening
+ */
+ group = iommu_group_get(dev);
+ if (!group)
+ return 0;

+ switch (action) {
+ case BUS_NOTIFY_BIND_DRIVER:
+ group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
+ break;
+ case BUS_NOTIFY_BOUND_DRIVER:
+ group_action = IOMMU_GROUP_NOTIFY_BOUND_DRIVER;
+ break;
+ case BUS_NOTIFY_UNBIND_DRIVER:
+ group_action = IOMMU_GROUP_NOTIFY_UNBIND_DRIVER;
+ break;
+ case BUS_NOTIFY_UNBOUND_DRIVER:
+ group_action = IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER;
+ break;
+ }
+
+ if (group_action)
+ blocking_notifier_call_chain(&group->notifier,
+ group_action, dev);
+
+ iommu_group_put(group);
return 0;
}

-static struct notifier_block iommu_device_nb = {
- .notifier_call = iommu_device_notifier,
+static struct notifier_block iommu_bus_nb = {
+ .notifier_call = iommu_bus_notifier,
};

static void iommu_bus_init(struct bus_type *bus, struct iommu_ops *ops)
{
- bus_register_notifier(bus, &iommu_device_nb);
- bus_for_each_dev(bus, NULL, NULL, add_iommu_group);
+ bus_register_notifier(bus, &iommu_bus_nb);
+ bus_for_each_dev(bus, NULL, ops, add_iommu_group);
}

/**
@@ -189,6 +664,45 @@ void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
}
EXPORT_SYMBOL_GPL(iommu_detach_device);

+/*
+ * IOMMU groups are really the natural working unit of the IOMMU, but
+ * the IOMMU API works on domains and devices. Bridge that gap by
+ * iterating over the devices in a group. Ideally we'd have a single
+ * device which represents the requestor ID of the group, but we also
+ * allow IOMMU drivers to create policy defined minimum sets, where
+ * the physical hardware may be able to distinguish members, but we
+ * wish to group them at a higher level (ex. untrusted multi-function
+ * PCI devices). Thus we attach each device.
+ */
+static int iommu_group_do_attach_device(struct device *dev, void *data)
+{
+ struct iommu_domain *domain = data;
+
+ return iommu_attach_device(domain, dev);
+}
+
+int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+ return iommu_group_for_each_dev(group, domain,
+ iommu_group_do_attach_device);
+}
+EXPORT_SYMBOL_GPL(iommu_attach_group);
+
+static int iommu_group_do_detach_device(struct device *dev, void *data)
+{
+ struct iommu_domain *domain = data;
+
+ iommu_detach_device(domain, dev);
+
+ return 0;
+}
+
+void iommu_detach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+ iommu_group_for_each_dev(group, domain, iommu_group_do_detach_device);
+}
+EXPORT_SYMBOL_GPL(iommu_detach_group);
+
phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
unsigned long iova)
{
@@ -333,11 +847,15 @@ size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
}
EXPORT_SYMBOL_GPL(iommu_unmap);

-int iommu_device_group(struct device *dev, unsigned int *groupid)
+static int __init iommu_init(void)
{
- if (iommu_present(dev->bus) && dev->bus->iommu_ops->device_group)
- return dev->bus->iommu_ops->device_group(dev, groupid);
+ iommu_group_kset = kset_create_and_add("iommu_groups",
+ NULL, kernel_kobj);
+ ida_init(&iommu_group_ida);
+ mutex_init(&iommu_group_mutex);

- return -ENODEV;
+ BUG_ON(!iommu_group_kset);
+
+ return 0;
}
-EXPORT_SYMBOL_GPL(iommu_device_group);
+subsys_initcall(iommu_init);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d937580..c18ad33 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -26,6 +26,7 @@
#define IOMMU_CACHE (4) /* DMA cache coherency */

struct iommu_ops;
+struct iommu_group;
struct bus_type;
struct device;
struct iommu_domain;
@@ -59,6 +60,8 @@ struct iommu_domain {
* @iova_to_phys: translate iova to physical address
* @domain_has_cap: domain capabilities query
* @commit: commit iommu domain
+ * @add_device: add device to iommu grouping
+ * @remove_device: remove device from iommu grouping
* @pgsize_bitmap: bitmap of supported page sizes
*/
struct iommu_ops {
@@ -74,10 +77,18 @@ struct iommu_ops {
unsigned long iova);
int (*domain_has_cap)(struct iommu_domain *domain,
unsigned long cap);
- int (*device_group)(struct device *dev, unsigned int *groupid);
+ int (*add_device)(struct device *dev);
+ void (*remove_device)(struct device *dev);
unsigned long pgsize_bitmap;
};

+#define IOMMU_GROUP_NOTIFY_ADD_DEVICE 1 /* Device added */
+#define IOMMU_GROUP_NOTIFY_DEL_DEVICE 2 /* Pre Device removed */
+#define IOMMU_GROUP_NOTIFY_BIND_DRIVER 3 /* Pre Driver bind */
+#define IOMMU_GROUP_NOTIFY_BOUND_DRIVER 4 /* Post Driver bind */
+#define IOMMU_GROUP_NOTIFY_UNBIND_DRIVER 5 /* Pre Driver unbind */
+#define IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER 6 /* Post Driver unbind */
+
extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
extern bool iommu_present(struct bus_type *bus);
extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
@@ -96,7 +107,29 @@ extern int iommu_domain_has_cap(struct iommu_domain *domain,
unsigned long cap);
extern void iommu_set_fault_handler(struct iommu_domain *domain,
iommu_fault_handler_t handler);
-extern int iommu_device_group(struct device *dev, unsigned int *groupid);
+
+extern int iommu_attach_group(struct iommu_domain *domain,
+ struct iommu_group *group);
+extern void iommu_detach_group(struct iommu_domain *domain,
+ struct iommu_group *group);
+extern struct iommu_group *iommu_group_alloc(void);
+extern void *iommu_group_get_iommudata(struct iommu_group *group);
+extern void iommu_group_set_iommudata(struct iommu_group *group,
+ void *iommu_data,
+ void (*release)(void *iommu_data));
+extern int iommu_group_set_name(struct iommu_group *group, const char *name);
+extern int iommu_group_add_device(struct iommu_group *group,
+ struct device *dev);
+extern void iommu_group_remove_device(struct device *dev);
+extern int iommu_group_for_each_dev(struct iommu_group *group, void *data,
+ int (*fn)(struct device *, void *));
+extern struct iommu_group *iommu_group_get(struct device *dev);
+extern void iommu_group_put(struct iommu_group *group);
+extern int iommu_group_register_notifier(struct iommu_group *group,
+ struct notifier_block *nb);
+extern int iommu_group_unregister_notifier(struct iommu_group *group,
+ struct notifier_block *nb);
+extern int iommu_group_id(struct iommu_group *group);

/**
* report_iommu_fault() - report about an IOMMU fault to the IOMMU framework
@@ -140,6 +173,7 @@ static inline int report_iommu_fault(struct iommu_domain *domain,
#else /* CONFIG_IOMMU_API */

struct iommu_ops {};
+struct iommu_group {};

static inline bool iommu_present(struct bus_type *bus)
{
@@ -195,11 +229,75 @@ static inline void iommu_set_fault_handler(struct iommu_domain *domain,
{
}

-static inline int iommu_device_group(struct device *dev, unsigned int *groupid)
+int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+ return -ENODEV;
+}
+
+void iommu_detach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+}
+
+struct iommu_group *iommu_group_alloc(void)
+{
+ return ERR_PTR(-ENODEV);
+}
+
+void *iommu_group_get_iommudata(struct iommu_group *group)
+{
+ return NULL;
+}
+
+void iommu_group_set_iommudata(struct iommu_group *group, void *iommu_data,
+ void (*release)(void *iommu_data))
+{
+}
+
+int iommu_group_set_name(struct iommu_group *group, const char *name)
+{
+ return -ENODEV;
+}
+
+int iommu_group_add_device(struct iommu_group *group, struct device *dev)
+{
+ return -ENODEV;
+}
+
+void iommu_group_remove_device(struct device *dev)
+{
+}
+
+int iommu_group_for_each_dev(struct iommu_group *group, void *data,
+ int (*fn)(struct device *, void *))
+{
+ return -ENODEV;
+}
+
+struct iommu_group *iommu_group_get(struct device *dev)
+{
+ return NULL;
+}
+
+void iommu_group_put(struct iommu_group *group)
+{
+}
+
+int iommu_group_register_notifier(struct iommu_group *group,
+ struct notifier_block *nb)
{
return -ENODEV;
}

+int iommu_group_unregister_notifier(struct iommu_group *group,
+ struct notifier_block *nb)
+{
+ return 0;
+}
+
+int iommu_group_id(struct iommu_group *group)
+{
+ return -ENODEV;
+}
#endif /* CONFIG_IOMMU_API */

#endif /* __LINUX_IOMMU_H */

2012-05-22 05:09:34

by Alex Williamson

Subject: [PATCH v2 05/13] pci: Add ACS validation utility

In a PCI environment, transactions aren't always required to reach
the root bus before being re-routed. Intermediate switches between
an endpoint and the root bus can redirect DMA back downstream before
things like IOMMUs have a chance to intervene. Legacy PCI is always
susceptible to this as it operates on a shared bus. PCIe added a
new capability to describe and control this behavior, Access Control
Services, or ACS. The utility function pci_acs_enabled() allows us
to test the ACS capabilities of an individual device against a set
of flags while pci_acs_path_enabled() tests a complete path from
a given downstream device up to the specified upstream device. We
also include the ability to add device specific tests as it's
likely we'll see devices that do not implement ACS, but want to
indicate support for various capabilities in this space.
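
The per-device test described above can be sketched in userspace as follows. This is a minimal model of the decision pci_acs_enabled() makes in the diff below, not kernel code: struct model_dev, the MODEL_* names and model_acs_enabled() are hypothetical stand-ins, the device specific quirk hook is left out, and the flag bit values are assumed for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag bits; hypothetical stand-ins for the PCI_ACS_* flags. */
#define MODEL_ACS_SV 0x01 /* source validation */
#define MODEL_ACS_RR 0x04 /* P2P request redirect */
#define MODEL_ACS_CR 0x08 /* P2P completion redirect */
#define MODEL_ACS_EC 0x20 /* P2P egress control */
#define MODEL_ACS_DT 0x40 /* direct translated P2P */

enum model_port_type { MODEL_ROOT_PORT, MODEL_DOWNSTREAM, MODEL_OTHER };

/* Hypothetical, simplified view of the fields the test looks at. */
struct model_dev {
	bool is_pcie;              /* legacy PCI devices always fail the test */
	enum model_port_type type; /* PCIe port type */
	bool multifunction;        /* part of a multifunction device */
	bool has_acs_cap;          /* ACS extended capability present */
	uint16_t acs_ctrl;         /* contents of the ACS control register */
};

/* Return true if @dev satisfies every flag in @acs_flags that applies to it. */
bool model_acs_enabled(const struct model_dev *dev, uint16_t acs_flags)
{
	bool is_port;

	assert(dev != NULL);

	if (!dev->is_pcie)
		return false;

	is_port = dev->type == MODEL_ROOT_PORT ||
		  dev->type == MODEL_DOWNSTREAM;

	/* Anything that is neither a port nor multifunction has nothing to test. */
	if (!is_port && !dev->multifunction)
		return true;

	/* Multifunction devices are only held to the flags that apply to them. */
	if (!is_port)
		acs_flags &= (MODEL_ACS_RR | MODEL_ACS_CR |
			      MODEL_ACS_EC | MODEL_ACS_DT);

	if (!dev->has_acs_cap)
		return false;

	return (dev->acs_ctrl & acs_flags) == acs_flags;
}
```

In this model a multifunction endpoint asked for source validation plus request redirect is only tested for request redirect, while a root port must have every requested bit set in its control register.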

Signed-off-by: Alex Williamson <[email protected]>
---

drivers/pci/pci.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/pci/quirks.c | 29 +++++++++++++++++++
include/linux/pci.h | 10 ++++++-
3 files changed, 114 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 111569c..ab6c2a6 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2359,6 +2359,82 @@ void pci_enable_acs(struct pci_dev *dev)
}

/**
+ * pci_acs_enabled - test ACS against required flags for a given device
+ * @pdev: device to test
+ * @acs_flags: required PCI ACS flags
+ *
+ * Return true if the device supports the provided flags. Automatically
+ * filters out flags that are not implemented on multifunction devices.
+ */
+bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags)
+{
+ int pos;
+ u16 ctrl;
+
+ if (pci_dev_specific_acs_enabled(pdev, acs_flags))
+ return true;
+
+ if (!pci_is_pcie(pdev))
+ return false;
+
+ if (pdev->pcie_type == PCI_EXP_TYPE_DOWNSTREAM ||
+ pdev->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+ pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
+ if (!pos)
+ return false;
+
+ pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+ if ((ctrl & acs_flags) != acs_flags)
+ return false;
+ } else if (pdev->multifunction) {
+ /* Filter out flags not applicable to multifunction */
+ acs_flags &= (PCI_ACS_RR | PCI_ACS_CR |
+ PCI_ACS_EC | PCI_ACS_DT);
+
+ pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
+ if (!pos)
+ return false;
+
+ pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+ if ((ctrl & acs_flags) != acs_flags)
+ return false;
+ }
+
+ return true;
+}
+EXPORT_SYMBOL_GPL(pci_acs_enabled);
+
+/**
+ * pci_acs_path_enabled - test ACS flags from start to end in a hierarchy
+ * @start: starting downstream device
+ * @end: ending upstream device or NULL to search to the root bus
+ * @acs_flags: required flags
+ *
+ * Walk up a device tree from start to end testing PCI ACS support. If
+ * any step along the way does not support the required flags, return false.
+ */
+bool pci_acs_path_enabled(struct pci_dev *start,
+ struct pci_dev *end, u16 acs_flags)
+{
+ struct pci_dev *pdev, *parent = start;
+
+ do {
+ pdev = parent;
+
+ if (!pci_acs_enabled(pdev, acs_flags))
+ return false;
+
+ if (pci_is_root_bus(pdev->bus))
+ return (end == NULL);
+
+ parent = pdev->bus->self;
+ } while (pdev != end);
+
+ return true;
+}
+EXPORT_SYMBOL_GPL(pci_acs_path_enabled);
+
+/**
* pci_swizzle_interrupt_pin - swizzle INTx for device behind bridge
* @dev: the PCI device
* @pin: the INTx pin (1=INTA, 2=INTB, 3=INTD, 4=INTD)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index a2dd77f..4ed6aa6 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3149,3 +3149,32 @@ struct pci_dev *pci_dma_source(struct pci_dev *dev)

return dev;
}
+
+static const struct pci_dev_acs_enabled {
+ u16 vendor;
+ u16 device;
+ bool (*acs_enabled)(struct pci_dev *dev, u16 acs_flags);
+} pci_dev_acs_enabled[] = {
+ { 0 }
+};
+
+bool pci_dev_specific_acs_enabled(struct pci_dev *dev, u16 acs_flags)
+{
+ const struct pci_dev_acs_enabled *i;
+
+ /*
+ * Allow devices that do not expose standard PCI ACS capabilities
+ * or control to indicate their support here. Multi-function devices
+ * which do not allow internal peer-to-peer between functions, but
+ * do not implement PCI ACS may wish to return true here.
+ */
+ for (i = pci_dev_acs_enabled; i->acs_enabled; i++) {
+ if ((i->vendor == dev->vendor ||
+ i->vendor == (u16)PCI_ANY_ID) &&
+ (i->device == dev->device ||
+ i->device == (u16)PCI_ANY_ID))
+ return i->acs_enabled(dev, acs_flags);
+ }
+
+ return false;
+}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 02dbfed..2559735 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1480,6 +1480,7 @@ enum pci_fixup_pass {
#ifdef CONFIG_PCI_QUIRKS
void pci_fixup_device(enum pci_fixup_pass pass, struct pci_dev *dev);
struct pci_dev *pci_dma_source(struct pci_dev *dev);
+bool pci_dev_specific_acs_enabled(struct pci_dev *dev, u16 acs_flags);
#else
static inline void pci_fixup_device(enum pci_fixup_pass pass,
struct pci_dev *dev) {}
@@ -1487,6 +1488,11 @@ static inline struct pci_dev *pci_dma_source(struct pci_dev *dev)
{
return dev;
}
+static inline bool pci_dev_specific_acs_enabled(struct pci_dev *dev,
+ u16 acs_flags)
+{
+ return false;
+}
#endif

void __iomem *pcim_iomap(struct pci_dev *pdev, int bar, unsigned long maxlen);
@@ -1589,7 +1595,9 @@ static inline bool pci_is_pcie(struct pci_dev *dev)
}

void pci_request_acs(void);
-
+bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
+bool pci_acs_path_enabled(struct pci_dev *start,
+ struct pci_dev *end, u16 acs_flags);

#define PCI_VPD_LRDT 0x80 /* Large Resource Data Type */
#define PCI_VPD_LRDT_ID(x) (x | PCI_VPD_LRDT)

2012-05-22 05:05:00

by Alex Williamson

Subject: [PATCH v2 01/13] driver core: Add iommu_group tracking to struct device

IOMMU groups allow IOMMU drivers to represent DMA visibility
and isolation of devices. Multiple devices may be grouped
together for the purposes of DMA. Placing a pointer on
struct device enables easy access for things like streaming
DMA programming and drivers like VFIO.

Signed-off-by: Alex Williamson <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
---

include/linux/device.h | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 5ad17cc..13dd26b 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -35,6 +35,7 @@ struct subsys_private;
struct bus_type;
struct device_node;
struct iommu_ops;
+struct iommu_group;

struct bus_attribute {
struct attribute attr;
@@ -677,6 +678,7 @@ struct device {
const struct attribute_group **groups; /* optional groups */

void (*release)(struct device *dev);
+ struct iommu_group *iommu_group;
};

/* Get the wakeup routines, which depend on struct device */

2012-05-24 21:01:43

by Donald Dutile

Subject: Re: [PATCH v2 03/13] iommu: IOMMU groups for VT-d and AMD-Vi

On 05/22/2012 01:04 AM, Alex Williamson wrote:
> Add back group support for AMD & Intel. amd_iommu already tracks
> devices and has init and uninit routines to manage groups.
> intel-iommu does this on the fly, so we make use of the notifier
> support built into iommu groups to create and remove groups.
>
> Signed-off-by: Alex Williamson <[email protected]>
> ---
>
> drivers/iommu/amd_iommu.c | 28 +++++++++++++++++++++++++-
> drivers/iommu/intel-iommu.c | 46 +++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 73 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 32c00cd..b7e5ddf 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -256,9 +256,11 @@ static bool check_device(struct device *dev)
>
> static int iommu_init_device(struct device *dev)
> {
> - struct pci_dev *pdev = to_pci_dev(dev);
> + struct pci_dev *dma_pdev, *pdev = to_pci_dev(dev);
> struct iommu_dev_data *dev_data;
> + struct iommu_group *group;
> u16 alias;
> + int ret;
>
> if (dev->archdata.iommu)
> return 0;
> @@ -279,8 +281,30 @@ static int iommu_init_device(struct device *dev)
> return -ENOTSUPP;
> }
> dev_data->alias_data = alias_data;
> +
> + dma_pdev = pci_get_bus_and_slot(alias >> 8, alias & 0xff);
> + } else
> + dma_pdev = pdev;
> +
> + if (!pdev->is_virtfn && PCI_FUNC(pdev->devfn) && iommu_group_mf &&
> + pdev->hdr_type == PCI_HEADER_TYPE_NORMAL)
> + dma_pdev = pci_get_slot(pdev->bus,
> + PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
> +
> + group = iommu_group_get(&dma_pdev->dev);
> + if (!group) {
> + group = iommu_group_alloc();
> + if (IS_ERR(group))
> + return PTR_ERR(group);
> }
>
> + ret = iommu_group_add_device(group, dev);
> +
> + iommu_group_put(group);
> +
do you want to do a put if there is a failure in the iommu_group_add_device()?
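
One way to reason about this question is with a small userspace model of the reference counting. This is a sketch under stated assumptions, not the kernel implementation: it assumes iommu_group_alloc() hands back a group holding one reference for the caller (that code is not quoted here), and it relies on iommu_group_add_device() taking its own reference only on success, as the kobject_get() in the 02/13 diff suggests. All model_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct iommu_group: just a counter. */
struct model_group {
	int refs;        /* references currently held */
	bool *released;  /* set to true when the last reference is dropped */
};

/* Assumed behaviour: a new group starts with one reference for the caller. */
struct model_group *model_group_alloc(bool *released)
{
	struct model_group *group = malloc(sizeof(*group));

	if (!group)
		return NULL;

	group->refs = 1;
	group->released = released;
	*released = false;
	return group;
}

/* Drop one reference; free the group when none remain. */
void model_group_put(struct model_group *group)
{
	assert(group->refs > 0);

	if (--group->refs == 0) {
		*group->released = true;
		free(group);
	}
}

/* Take a reference on behalf of the device, but only on success. */
int model_group_add_device(struct model_group *group, bool fail)
{
	if (fail)
		return -ENOMEM;

	group->refs++;
	return 0;
}

/* The calling pattern from the patch: add, then put unconditionally. */
int model_attach(struct model_group *group, bool fail_add)
{
	int ret = model_group_add_device(group, fail_add);

	model_group_put(group);
	return ret;
}
```

Under those assumptions the unconditional put looks intentional: on success the group survives on the device's reference, and on failure the put releases a freshly allocated group (or drops the reference taken by iommu_group_get()) so nothing leaks.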
> + if (ret)
> + return ret;
> +
> if (pci_iommuv2_capable(pdev)) {
> struct amd_iommu *iommu;
>
> @@ -309,6 +333,8 @@ static void iommu_ignore_device(struct device *dev)
>
> static void iommu_uninit_device(struct device *dev)
> {
> + iommu_group_remove_device(dev);
> +
> /*
> * Nothing to do here - we keep dev_data around for unplugged devices
> * and reuse it when the device is re-plugged - not doing so would
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index d4a0ff7..e63b33b 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -4087,6 +4087,50 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
> return 0;
> }
>
> +static int intel_iommu_add_device(struct device *dev)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + struct pci_dev *bridge, *dma_pdev = pdev;
> + struct iommu_group *group;
> + int ret;
> +
> + if (!device_to_iommu(pci_domain_nr(pdev->bus),
> + pdev->bus->number, pdev->devfn))
> + return -ENODEV;
> +
> + bridge = pci_find_upstream_pcie_bridge(pdev);
> + if (bridge) {
> + if (pci_is_pcie(bridge))
> + dma_pdev = pci_get_domain_bus_and_slot(
> + pci_domain_nr(pdev->bus),
> + bridge->subordinate->number, 0);
> + else
> + dma_pdev = bridge;
> + }
> +
> + if (!pdev->is_virtfn && PCI_FUNC(pdev->devfn) && iommu_group_mf &&
> + pdev->hdr_type == PCI_HEADER_TYPE_NORMAL)
> + dma_pdev = pci_get_slot(pdev->bus,
> + PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
> +
> + group = iommu_group_get(&dma_pdev->dev);
> + if (!group) {
> + group = iommu_group_alloc();
> + if (IS_ERR(group))
> + return PTR_ERR(group);
> + }
> +
> + ret = iommu_group_add_device(group, dev);
> +
ditto.
> + iommu_group_put(group);
> + return ret;
> +}
> +
> +static void intel_iommu_remove_device(struct device *dev)
> +{
> + iommu_group_remove_device(dev);
> +}
> +
> static struct iommu_ops intel_iommu_ops = {
> .domain_init = intel_iommu_domain_init,
> .domain_destroy = intel_iommu_domain_destroy,
> @@ -4096,6 +4140,8 @@ static struct iommu_ops intel_iommu_ops = {
> .unmap = intel_iommu_unmap,
> .iova_to_phys = intel_iommu_iova_to_phys,
> .domain_has_cap = intel_iommu_domain_has_cap,
> + .add_device = intel_iommu_add_device,
> + .remove_device = intel_iommu_remove_device,
> .pgsize_bitmap = INTEL_IOMMU_PGSIZES,
> };
>
>

2012-05-24 21:31:14

by Donald Dutile

Subject: Re: [PATCH v2 05/13] pci: Add ACS validation utility

On 05/22/2012 01:05 AM, Alex Williamson wrote:
> In a PCI environment, transactions aren't always required to reach
> the root bus before being re-routed. Intermediate switches between
> an endpoint and the root bus can redirect DMA back downstream before
> things like IOMMUs have a chance to intervene. Legacy PCI is always
> susceptible to this as it operates on a shared bus. PCIe added a
> new capability to describe and control this behavior, Access Control
> Services, or ACS. The utility function pci_acs_enabled() allows us
> to test the ACS capabilities of an individual device against a set
> of flags while pci_acs_path_enabled() tests a complete path from
> a given downstream device up to the specified upstream device. We
> also include the ability to add device specific tests as it's
> likely we'll see devices that do not implement ACS, but want to
> indicate support for various capabilities in this space.
>
> Signed-off-by: Alex Williamson <[email protected]>
> ---
>
> drivers/pci/pci.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++
> drivers/pci/quirks.c | 29 +++++++++++++++++++
> include/linux/pci.h | 10 ++++++-
> 3 files changed, 114 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 111569c..ab6c2a6 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2359,6 +2359,82 @@ void pci_enable_acs(struct pci_dev *dev)
> }
>
> /**
> + * pci_acs_enable - test ACS against required flags for a given device
typo: ^^^ missing 'd'

> + * @pdev: device to test
> + * @acs_flags: required PCI ACS flags
> + *
> + * Return true if the device supports the provided flags. Automatically
> + * filters out flags that are not implemented on multifunction devices.
> + */
> +bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags)
> +{
> + int pos;
> + u16 ctrl;
> +
> + if (pci_dev_specific_acs_enabled(pdev, acs_flags))
> + return true;
> +
> + if (!pci_is_pcie(pdev))
> + return false;
> +
> + if (pdev->pcie_type == PCI_EXP_TYPE_DOWNSTREAM ||
> + pdev->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> + if (!pos)
> + return false;
> +
> + pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> + if ((ctrl & acs_flags) != acs_flags)
> + return false;
> + } else if (pdev->multifunction) {
> + /* Filter out flags not applicable to multifunction */
> + acs_flags &= (PCI_ACS_RR | PCI_ACS_CR |
> + PCI_ACS_EC | PCI_ACS_DT);
> +
> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> + if (!pos)
> + return false;
> +
> + pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> + if ((ctrl & acs_flags) != acs_flags)
> + return false;
> + }
> +
> + return true;
or, to reduce duplicated code (which compiler may do?):

/* Filter out flags not applicable to multifunction */
if (pdev->multifunction)
acs_flags &= (PCI_ACS_RR | PCI_ACS_CR |
PCI_ACS_EC | PCI_ACS_DT);

if (pdev->pcie_type == PCI_EXP_TYPE_DOWNSTREAM ||
pdev->pcie_type == PCI_EXP_TYPE_ROOT_PORT ||
pdev->multifunction) {
pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
if (!pos)
return false;
pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
if ((ctrl & acs_flags) != acs_flags)
return false;
}

return true;
> +}

But the above doesn't handle the case where the RC does not do
peer-to-peer btwn root ports. Per ACS spec, such a RC's root ports
don't need to provide an ACS cap, since peer-to-peer port xfers aren't
allowed/enabled/supported, so by design, the root port is ACS compliant.
ATM, an IOMMU-capable system is a pre-req for VFIO,
and all such systems have an ACS cap, but that may not always be true.

> +EXPORT_SYMBOL_GPL(pci_acs_enabled);
> +
> +/**
> + * pci_acs_path_enabled - test ACS flags from start to end in a hierarchy
> + * @start: starting downstream device
> + * @end: ending upstream device or NULL to search to the root bus
> + * @acs_flags: required flags
> + *
> + * Walk up a device tree from start to end testing PCI ACS support. If
> + * any step along the way does not support the required flags, return false.
> + */
> +bool pci_acs_path_enabled(struct pci_dev *start,
> + struct pci_dev *end, u16 acs_flags)
> +{
> + struct pci_dev *pdev, *parent = start;
> +
> + do {
> + pdev = parent;
> +
> + if (!pci_acs_enabled(pdev, acs_flags))
> + return false;
> +
> + if (pci_is_root_bus(pdev->bus))
> + return (end == NULL);
doesn't this mean that a caller can't pass the pdev of the root port?
I would think that is a valid call, albeit not the common one.
I'm also worried that the above code may hold on Intel machines, but not on AMD
machines (the latter represents its IOMMU as a pdev on the root bus, doesn't it?)
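
The root port case can be checked with a small userspace model of the loop. This is a sketch only: struct model_dev and both functions are hypothetical, "upstream == NULL" stands in for pci_is_root_bus(pdev->bus), and acs_ok stands in for the result of pci_acs_enabled(). The second function is one possible reordering for comparison, not something proposed in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev. */
struct model_dev {
	struct model_dev *upstream; /* bridge above; NULL if on the root bus */
	bool acs_ok;                /* result the per-device ACS test would give */
};

/* Same control flow as the quoted loop. */
bool model_path_enabled(struct model_dev *start, struct model_dev *end)
{
	struct model_dev *pdev, *parent = start;

	assert(start != NULL);

	do {
		pdev = parent;

		if (!pdev->acs_ok)
			return false;

		/* Reaching the root bus only succeeds when no end was given. */
		if (pdev->upstream == NULL)
			return end == NULL;

		parent = pdev->upstream;
	} while (pdev != end);

	return true;
}

/* Hypothetical variant: test for the end device before the root bus. */
bool model_path_enabled_end_first(struct model_dev *start,
				  struct model_dev *end)
{
	struct model_dev *pdev = start;

	assert(start != NULL);

	for (;;) {
		if (!pdev->acs_ok)
			return false;

		if (pdev == end)
			return true;

		if (pdev->upstream == NULL)
			return end == NULL;

		pdev = pdev->upstream;
	}
}
```

With an endpoint directly below a root port, the first function returns false when the root port itself is passed as the end, because the root bus test fires before the loop condition is evaluated. The reordered variant accepts that call and still behaves the same for a NULL end.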

> +
> + parent = pdev->bus->self;
> + } while (pdev != end);
> +
> + return true;
> +}
> +EXPORT_SYMBOL_GPL(pci_acs_path_enabled);
> +
> +/**
> * pci_swizzle_interrupt_pin - swizzle INTx for device behind bridge
> * @dev: the PCI device
> * @pin: the INTx pin (1=INTA, 2=INTB, 3=INTD, 4=INTD)
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index a2dd77f..4ed6aa6 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3149,3 +3149,32 @@ struct pci_dev *pci_dma_source(struct pci_dev *dev)
>
> return dev;
> }
> +
> +static const struct pci_dev_acs_enabled {
> + u16 vendor;
> + u16 device;
> + bool (*acs_enabled)(struct pci_dev *dev, u16 acs_flags);
> +} pci_dev_acs_enabled[] = {
> + { 0 }
> +};
> +
> +bool pci_dev_specific_acs_enabled(struct pci_dev *dev, u16 acs_flags)
> +{
> + const struct pci_dev_acs_enabled *i;
> +
> + /*
> + * Allow devices that do not expose standard PCI ACS capabilities
> + * or control to indicate their support here. Multi-function devices
> + * which do not allow internal peer-to-peer between functions, but
> + * do not implement PCI ACS may wish to return true here.
> + */
> + for (i = pci_dev_acs_enabled; i->acs_enabled; i++) {
> + if ((i->vendor == dev->vendor ||
> + i->vendor == (u16)PCI_ANY_ID) &&
> + (i->device == dev->device ||
> + i->device == (u16)PCI_ANY_ID))
> + return i->acs_enabled(dev, acs_flags);
> + }
> +
> + return false;
> +}
I can't wait until these quirks are filled in! :)

> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 02dbfed..2559735 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1480,6 +1480,7 @@ enum pci_fixup_pass {
> #ifdef CONFIG_PCI_QUIRKS
> void pci_fixup_device(enum pci_fixup_pass pass, struct pci_dev *dev);
> struct pci_dev *pci_dma_source(struct pci_dev *dev);
> +bool pci_dev_specific_acs_enabled(struct pci_dev *dev, u16 acs_flags);
> #else
> static inline void pci_fixup_device(enum pci_fixup_pass pass,
> struct pci_dev *dev) {}
> @@ -1487,6 +1488,11 @@ static inline struct pci_dev *pci_dma_source(struct pci_dev *dev)
> {
> return dev;
> }
> +static inline bool pci_dev_specific_acs_enabled(struct pci_dev *dev,
> + u16 acs_flags)
> +{
> + return false;
> +}
> #endif
>
> void __iomem *pcim_iomap(struct pci_dev *pdev, int bar, unsigned long maxlen);
> @@ -1589,7 +1595,9 @@ static inline bool pci_is_pcie(struct pci_dev *dev)
> }
>
> void pci_request_acs(void);
> -
> +bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
> +bool pci_acs_path_enabled(struct pci_dev *start,
> + struct pci_dev *end, u16 acs_flags);
>
> #define PCI_VPD_LRDT 0x80 /* Large Resource Data Type */
> #define PCI_VPD_LRDT_ID(x) (x | PCI_VPD_LRDT)
>

2012-05-24 21:39:09

by Donald Dutile

Subject: Re: [PATCH v2 09/13] vfio: x86 IOMMU implementation

On 05/22/2012 01:05 AM, Alex Williamson wrote:
> x86 is probably the wrong name for this VFIO IOMMU driver, but x86
> is the primary target for it. This driver supports a very simple
> usage model using the existing IOMMU API. The IOMMU is expected to
> support the full host address space with no special IOVA windows,
> number of mappings restrictions, or unique processor target options.
>
> Signed-off-by: Alex Williamson<[email protected]>
> ---
>
> Documentation/ioctl/ioctl-number.txt | 2
> drivers/vfio/Kconfig | 6
> drivers/vfio/Makefile | 2
> drivers/vfio/vfio.c | 7
> drivers/vfio/vfio_iommu_x86.c | 743 ++++++++++++++++++++++++++++++++++
> include/linux/vfio.h | 52 ++
> 6 files changed, 811 insertions(+), 1 deletions(-)
> create mode 100644 drivers/vfio/vfio_iommu_x86.c
>
> diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
> index 111e30a..9d1694e 100644
> --- a/Documentation/ioctl/ioctl-number.txt
> +++ b/Documentation/ioctl/ioctl-number.txt
> @@ -88,7 +88,7 @@ Code Seq#(hex) Include File Comments
> and kernel/power/user.c
> '8' all SNP8023 advanced NIC card
> <mailto:[email protected]>
> -';' 64-6F linux/vfio.h
> +';' 64-72 linux/vfio.h
> '@' 00-0F linux/radeonfb.h conflict!
> '@' 00-0F drivers/video/aty/aty128fb.c conflict!
> 'A' 00-1F linux/apm_bios.h conflict!
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 9acb1e7..bd88a30 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -1,6 +1,12 @@
> +config VFIO_IOMMU_X86
> + tristate
> + depends on VFIO && X86
> + default n
> +
> menuconfig VFIO
> tristate "VFIO Non-Privileged userspace driver framework"
> depends on IOMMU_API
> + select VFIO_IOMMU_X86 if X86
> help
> VFIO provides a framework for secure userspace device drivers.
> See Documentation/vfio.txt for more details.

So a future refactoring that reuses some chunk of this support
on a non-x86 machine could involve a lot of pointless renaming.

Why not rename vfio_iommu_x86 to something like vfio_iommu_no_iova
and just compile it conditionally on X86 (as you've done above in Kconfig)?
Then if another arch can use it, or someone refactors the file to share
some of it and splits x86 vs <other-arch> into separate per-arch or
per-IOVA-scheme files, the name is more descriptive and the change less disruptive.

> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 7500a67..1f1abee 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -1 +1,3 @@
> obj-$(CONFIG_VFIO) += vfio.o
> +obj-$(CONFIG_VFIO_IOMMU_X86) += vfio_iommu_x86.o
> +obj-$(CONFIG_VFIO_PCI) += pci/
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 6558eef..89899a8 100644

2012-05-24 21:49:52

by Alex Williamson

Subject: Re: [PATCH v2 03/13] iommu: IOMMU groups for VT-d and AMD-Vi

On Thu, 2012-05-24 at 17:01 -0400, Don Dutile wrote:
> On 05/22/2012 01:04 AM, Alex Williamson wrote:
> > Add back group support for AMD & Intel. amd_iommu already tracks
> > devices and has init and uninit routines to manage groups.
> > intel-iommu does this on the fly, so we make use of the notifier
> > support built into iommu groups to create and remove groups.
> >
> > Signed-off-by: Alex Williamson<[email protected]>
> > ---
> >
> > drivers/iommu/amd_iommu.c | 28 +++++++++++++++++++++++++-
> > drivers/iommu/intel-iommu.c | 46 +++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 73 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> > index 32c00cd..b7e5ddf 100644
> > --- a/drivers/iommu/amd_iommu.c
> > +++ b/drivers/iommu/amd_iommu.c
> > @@ -256,9 +256,11 @@ static bool check_device(struct device *dev)
> >
> > static int iommu_init_device(struct device *dev)
> > {
> > - struct pci_dev *pdev = to_pci_dev(dev);
> > + struct pci_dev *dma_pdev, *pdev = to_pci_dev(dev);
> > struct iommu_dev_data *dev_data;
> > + struct iommu_group *group;
> > u16 alias;
> > + int ret;
> >
> > if (dev->archdata.iommu)
> > return 0;
> > @@ -279,8 +281,30 @@ static int iommu_init_device(struct device *dev)
> > return -ENOTSUPP;
> > }
> > dev_data->alias_data = alias_data;
> > +
> > + dma_pdev = pci_get_bus_and_slot(alias >> 8, alias & 0xff);
> > + } else
> > + dma_pdev = pdev;
> > +
> > + if (!pdev->is_virtfn && PCI_FUNC(pdev->devfn) && iommu_group_mf &&
> > + pdev->hdr_type == PCI_HEADER_TYPE_NORMAL)
> > + dma_pdev = pci_get_slot(pdev->bus,
> > + PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
> > +
> > + group = iommu_group_get(&dma_pdev->dev);
> > + if (!group) {
> > + group = iommu_group_alloc();
> > + if (IS_ERR(group))
> > + return PTR_ERR(group);
> > }
> >
> > + ret = iommu_group_add_device(group, dev);
> > +
> > + iommu_group_put(group);
> > +
> do you want to do a put if there is a failure in the iommu_group_add_device()?

Yes, this was intentional. iommu_group_alloc() adds a reference to the
group it returns so that it doesn't disappear while we're working on it
(documented in iommu.c). iommu_group_get() also gets a reference. So
this put will free the group if it was new or just drop the reference if
existing. Thanks,

Alex
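
The reference handling described above can be modelled in user space. This is
a minimal sketch, not the iommu.c code: struct toy_group and the toy_* helpers
are hypothetical stand-ins, and it assumes that a successful add takes its own
reference on the group, so the caller's unconditional put only ever drops the
caller's reference.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct iommu_group: a refcount and a device count. */
struct toy_group {
	int refs;
	int ndevs;
};

int toy_groups_freed;	/* lets a caller observe when a group is released */

/* Like iommu_group_alloc(): the returned group carries one caller reference. */
static struct toy_group *toy_group_alloc(void)
{
	struct toy_group *g = calloc(1, sizeof(*g));

	if (g)
		g->refs = 1;
	return g;
}

/* Like iommu_group_get(): returns NULL if no group exists yet. */
static struct toy_group *toy_group_get(struct toy_group *g)
{
	if (g)
		g->refs++;
	return g;
}

static void toy_group_put(struct toy_group *g)
{
	assert(g->refs > 0);
	if (--g->refs == 0) {
		toy_groups_freed++;
		free(g);
	}
}

/* Assumption: on success the device holds its own reference on the group. */
static int toy_group_add_device(struct toy_group *g, int fail)
{
	if (fail)
		return -1;
	g->refs++;
	g->ndevs++;
	return 0;
}

/* *slot models what a lookup via the DMA source device would find. */
int toy_init_device(struct toy_group **slot, int fail_add)
{
	struct toy_group *g = toy_group_get(*slot);
	int ret;

	if (!g) {
		g = toy_group_alloc();
		if (!g)
			return -1;
	}

	ret = toy_group_add_device(g, fail_add);
	if (!ret)
		*slot = g;	/* the device now keeps the group alive */

	/* Unconditional put: frees a new group on failure, otherwise just
	 * drops the reference taken by alloc/get above. */
	toy_group_put(g);
	return ret;
}
```

With an empty slot and a failing add, the put releases the freshly allocated
group. With an existing group, the count returns to where it started.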

> > + if (ret)
> > + return ret;
> > +
> > if (pci_iommuv2_capable(pdev)) {
> > struct amd_iommu *iommu;
> >
> > @@ -309,6 +333,8 @@ static void iommu_ignore_device(struct device *dev)
> >
> > static void iommu_uninit_device(struct device *dev)
> > {
> > + iommu_group_remove_device(dev);
> > +
> > /*
> > * Nothing to do here - we keep dev_data around for unplugged devices
> > * and reuse it when the device is re-plugged - not doing so would
> > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > index d4a0ff7..e63b33b 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -4087,6 +4087,50 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
> > return 0;
> > }
> >
> > +static int intel_iommu_add_device(struct device *dev)
> > +{
> > + struct pci_dev *pdev = to_pci_dev(dev);
> > + struct pci_dev *bridge, *dma_pdev = pdev;
> > + struct iommu_group *group;
> > + int ret;
> > +
> > + if (!device_to_iommu(pci_domain_nr(pdev->bus),
> > + pdev->bus->number, pdev->devfn))
> > + return -ENODEV;
> > +
> > + bridge = pci_find_upstream_pcie_bridge(pdev);
> > + if (bridge) {
> > + if (pci_is_pcie(bridge))
> > + dma_pdev = pci_get_domain_bus_and_slot(
> > + pci_domain_nr(pdev->bus),
> > + bridge->subordinate->number, 0);
> > + else
> > + dma_pdev = bridge;
> > + }
> > +
> > + if (!pdev->is_virtfn && PCI_FUNC(pdev->devfn) && iommu_group_mf &&
> > + pdev->hdr_type == PCI_HEADER_TYPE_NORMAL)
> > + dma_pdev = pci_get_slot(pdev->bus,
> > + PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
> > +
> > + group = iommu_group_get(&dma_pdev->dev);
> > + if (!group) {
> > + group = iommu_group_alloc();
> > + if (IS_ERR(group))
> > + return PTR_ERR(group);
> > + }
> > +
> > + ret = iommu_group_add_device(group, dev);
> > +
> ditto.
> > + iommu_group_put(group);
> > + return ret;
> > +}
> > +
> > +static void intel_iommu_remove_device(struct device *dev)
> > +{
> > + iommu_group_remove_device(dev);
> > +}
> > +
> > static struct iommu_ops intel_iommu_ops = {
> > .domain_init = intel_iommu_domain_init,
> > .domain_destroy = intel_iommu_domain_destroy,
> > @@ -4096,6 +4140,8 @@ static struct iommu_ops intel_iommu_ops = {
> > .unmap = intel_iommu_unmap,
> > .iova_to_phys = intel_iommu_iova_to_phys,
> > .domain_has_cap = intel_iommu_domain_has_cap,
> > + .add_device = intel_iommu_add_device,
> > + .remove_device = intel_iommu_remove_device,
> > .pgsize_bitmap = INTEL_IOMMU_PGSIZES,
> > };
> >
> >
>


2012-05-24 21:49:51

by Donald Dutile

Subject: Re: [PATCH v2 12/13] pci: Misc pci_reg additions

On 05/22/2012 01:05 AM, Alex Williamson wrote:
> Fill in many missing definitions and add sizeof fields for many
> sections allowing for more extensive config parsing.
>
> Signed-off-by: Alex Williamson<[email protected]>
> ---
>
overall, i'm very glad to see defines instead of hardcoded numbers in the code, but....

> include/linux/pci_regs.h | 112 +++++++++++++++++++++++++++++++++++++++++-----
> 1 files changed, 100 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
> index 4b608f5..379be84 100644
> --- a/include/linux/pci_regs.h
> +++ b/include/linux/pci_regs.h
> @@ -26,6 +26,7 @@
> * Under PCI, each device has 256 bytes of configuration address space,
> * of which the first 64 bytes are standardized as follows:
> */
> +#define PCI_STD_HEADER_SIZEOF 64
> #define PCI_VENDOR_ID 0x00 /* 16 bits */
> #define PCI_DEVICE_ID 0x02 /* 16 bits */
> #define PCI_COMMAND 0x04 /* 16 bits */
> @@ -209,9 +210,12 @@
> #define PCI_CAP_ID_SHPC 0x0C /* PCI Standard Hot-Plug Controller */
> #define PCI_CAP_ID_SSVID 0x0D /* Bridge subsystem vendor/device ID */
> #define PCI_CAP_ID_AGP3 0x0E /* AGP Target PCI-PCI bridge */
> +#define PCI_CAP_ID_SECDEV 0x0F /* Secure Device */
> #define PCI_CAP_ID_EXP 0x10 /* PCI Express */
> #define PCI_CAP_ID_MSIX 0x11 /* MSI-X */
> +#define PCI_CAP_ID_SATA 0x12 /* SATA Data/Index Conf. */
> #define PCI_CAP_ID_AF 0x13 /* PCI Advanced Features */
> +#define PCI_CAP_ID_MAX PCI_CAP_ID_AF
> #define PCI_CAP_LIST_NEXT 1 /* Next capability in the list */
> #define PCI_CAP_FLAGS 2 /* Capability defined flags (16 bits) */
> #define PCI_CAP_SIZEOF 4
> @@ -276,6 +280,7 @@
> #define PCI_VPD_ADDR_MASK 0x7fff /* Address mask */
> #define PCI_VPD_ADDR_F 0x8000 /* Write 0, 1 indicates completion */
> #define PCI_VPD_DATA 4 /* 32-bits of data returned here */
> +#define PCI_CAP_VPD_SIZEOF 8
>
> /* Slot Identification */
>
> @@ -297,8 +302,10 @@
> #define PCI_MSI_ADDRESS_HI 8 /* Upper 32 bits (if PCI_MSI_FLAGS_64BIT set) */
> #define PCI_MSI_DATA_32 8 /* 16 bits of data for 32-bit devices */
> #define PCI_MSI_MASK_32 12 /* Mask bits register for 32-bit devices */
> +#define PCI_MSI_PENDING_32 16 /* Pending intrs for 32-bit devices */
> #define PCI_MSI_DATA_64 12 /* 16 bits of data for 64-bit devices */
> #define PCI_MSI_MASK_64 16 /* Mask bits register for 64-bit devices */
> +#define PCI_MSI_PENDING_64 20 /* Pending intrs for 64-bit devices */
>
> /* MSI-X registers */
> #define PCI_MSIX_FLAGS 2
> @@ -308,6 +315,7 @@
> #define PCI_MSIX_TABLE 4
> #define PCI_MSIX_PBA 8
> #define PCI_MSIX_FLAGS_BIRMASK (7<< 0)
> +#define PCI_CAP_MSIX_SIZEOF 12 /* size of MSIX registers */
>
> /* MSI-X entry's format */
> #define PCI_MSIX_ENTRY_SIZE 16
> @@ -338,6 +346,7 @@
> #define PCI_AF_CTRL_FLR 0x01
> #define PCI_AF_STATUS 5
> #define PCI_AF_STATUS_TP 0x01
> +#define PCI_CAP_AF_SIZEOF 6 /* size of AF registers */
>
> /* PCI-X registers */
>
> @@ -374,6 +383,9 @@
> #define PCI_X_STATUS_SPL_ERR 0x20000000 /* Rcvd Split Completion Error Msg */
> #define PCI_X_STATUS_266MHZ 0x40000000 /* 266 MHz capable */
> #define PCI_X_STATUS_533MHZ 0x80000000 /* 533 MHz capable */
> +#define PCI_X_ECC_CSR 8 /* ECC control and status */
> +#define PCI_CAP_PCIX_SIZEOF_V0 8 /* size of registers for Version 0 */
> +#define PCI_CAP_PCIX_SIZEOF_V12 24 /* size for Version 1 & 2 */
ew!
unlikely that version 12 will ever exist, but why not:
#define PCI_CAP_PCIX_SIZEOF_V1 24
#define PCI_CAP_PCIX_SIZEOF_V2 PCI_CAP_PCIX_SIZEOF_V1


>
> /* PCI Bridge Subsystem ID registers */
>
> @@ -462,6 +474,7 @@
> #define PCI_EXP_LNKSTA_DLLLA 0x2000 /* Data Link Layer Link Active */
> #define PCI_EXP_LNKSTA_LBMS 0x4000 /* Link Bandwidth Management Status */
> #define PCI_EXP_LNKSTA_LABS 0x8000 /* Link Autonomous Bandwidth Status */
> +#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V1 20 /* v1 endpoints end here */
> #define PCI_EXP_SLTCAP 20 /* Slot Capabilities */
> #define PCI_EXP_SLTCAP_ABP 0x00000001 /* Attention Button Present */
> #define PCI_EXP_SLTCAP_PCP 0x00000002 /* Power Controller Present */
> @@ -521,6 +534,7 @@
> #define PCI_EXP_OBFF_MSGA_EN 0x2000 /* OBFF enable with Message type A */
> #define PCI_EXP_OBFF_MSGB_EN 0x4000 /* OBFF enable with Message type B */
> #define PCI_EXP_OBFF_WAKE_EN 0x6000 /* OBFF using WAKE# signaling */
> +#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 44 /* v2 endpoints end here */
> #define PCI_EXP_LNKCTL2 48 /* Link Control 2 */
> #define PCI_EXP_SLTCTL2 56 /* Slot Control 2 */
>
> @@ -529,23 +543,43 @@
> #define PCI_EXT_CAP_VER(header) ((header>> 16)& 0xf)
> #define PCI_EXT_CAP_NEXT(header) ((header>> 20)& 0xffc)
>
> -#define PCI_EXT_CAP_ID_ERR 1
> -#define PCI_EXT_CAP_ID_VC 2
> -#define PCI_EXT_CAP_ID_DSN 3
> -#define PCI_EXT_CAP_ID_PWR 4
> -#define PCI_EXT_CAP_ID_VNDR 11
> -#define PCI_EXT_CAP_ID_ACS 13
> -#define PCI_EXT_CAP_ID_ARI 14
> -#define PCI_EXT_CAP_ID_ATS 15
> -#define PCI_EXT_CAP_ID_SRIOV 16
> -#define PCI_EXT_CAP_ID_PRI 19
> -#define PCI_EXT_CAP_ID_LTR 24
> -#define PCI_EXT_CAP_ID_PASID 27
> +#define PCI_EXT_CAP_ID_ERR 0x01 /* Advanced Error Reporting */
> +#define PCI_EXT_CAP_ID_VC 0x02 /* Virtual Channel Capability */
> +#define PCI_EXT_CAP_ID_DSN 0x03 /* Device Serial Number */
> +#define PCI_EXT_CAP_ID_PWR 0x04 /* Power Budgeting */
> +#define PCI_EXT_CAP_ID_RCLD 0x05 /* Root Complex Link Declaration */
> +#define PCI_EXT_CAP_ID_RCILC 0x06 /* Root Complex Internal Link Control */
> +#define PCI_EXT_CAP_ID_RCEC 0x07 /* Root Complex Event Collector */
> +#define PCI_EXT_CAP_ID_MFVC 0x08 /* Multi-Function VC Capability */
> +#define PCI_EXT_CAP_ID_VC9 0x09 /* same as _VC */
> +#define PCI_EXT_CAP_ID_RCRB 0x0A /* Root Complex RB? */
> +#define PCI_EXT_CAP_ID_VNDR 0x0B /* Vendor Specific */
> +#define PCI_EXT_CAP_ID_CAC 0x0C /* Config Access - obsolete */
> +#define PCI_EXT_CAP_ID_ACS 0x0D /* Access Control Services */
> +#define PCI_EXT_CAP_ID_ARI 0x0E /* Alternate Routing ID */
> +#define PCI_EXT_CAP_ID_ATS 0x0F /* Address Translation Services */
> +#define PCI_EXT_CAP_ID_SRIOV 0x10 /* Single Root I/O Virtualization */
> +#define PCI_EXT_CAP_ID_MRIOV 0x11 /* Multi Root I/O Virtualization */
> +#define PCI_EXT_CAP_ID_MCAST 0x12 /* Multicast */
> +#define PCI_EXT_CAP_ID_PRI 0x13 /* Page Request Interface */
> +#define PCI_EXT_CAP_ID_AMD_XXX 0x14 /* reserved for AMD */
> +#define PCI_EXT_CAP_ID_REBAR 0x15 /* resizable BAR */
> +#define PCI_EXT_CAP_ID_DPA 0x16 /* dynamic power alloc */
> +#define PCI_EXT_CAP_ID_TPH 0x17 /* TPH request */
> +#define PCI_EXT_CAP_ID_LTR 0x18 /* latency tolerance reporting */
> +#define PCI_EXT_CAP_ID_SECPCI 0x19 /* Secondary PCIe */
> +#define PCI_EXT_CAP_ID_PMUX 0x1A /* Protocol Multiplexing */
> +#define PCI_EXT_CAP_ID_PASID 0x1B /* Process Address Space ID */
> +#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_PASID
> +
> +#define PCI_EXT_CAP_DSN_SIZEOF 12
> +#define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
>
> /* Advanced Error Reporting */
> #define PCI_ERR_UNCOR_STATUS 4 /* Uncorrectable Error Status */
> #define PCI_ERR_UNC_TRAIN 0x00000001 /* Training */
> #define PCI_ERR_UNC_DLP 0x00000010 /* Data Link Protocol */
> +#define PCI_ERR_UNC_SURPDN 0x00000020 /* Surprise Down */
> #define PCI_ERR_UNC_POISON_TLP 0x00001000 /* Poisoned TLP */
> #define PCI_ERR_UNC_FCP 0x00002000 /* Flow Control Protocol */
> #define PCI_ERR_UNC_COMP_TIME 0x00004000 /* Completion Timeout */
> @@ -555,6 +589,11 @@
> #define PCI_ERR_UNC_MALF_TLP 0x00040000 /* Malformed TLP */
> #define PCI_ERR_UNC_ECRC 0x00080000 /* ECRC Error Status */
> #define PCI_ERR_UNC_UNSUP 0x00100000 /* Unsupported Request */
> +#define PCI_ERR_UNC_ACSV 0x00200000 /* ACS Violation */
> +#define PCI_ERR_UNC_INTN 0x00400000 /* internal error */
> +#define PCI_ERR_UNC_MCBTLP 0x00800000 /* MC blocked TLP */
> +#define PCI_ERR_UNC_ATOMEG 0x01000000 /* Atomic egress blocked */
> +#define PCI_ERR_UNC_TLPPRE 0x02000000 /* TLP prefix blocked */
> #define PCI_ERR_UNCOR_MASK 8 /* Uncorrectable Error Mask */
> /* Same bits as above */
> #define PCI_ERR_UNCOR_SEVER 12 /* Uncorrectable Error Severity */
> @@ -565,6 +604,9 @@
> #define PCI_ERR_COR_BAD_DLLP 0x00000080 /* Bad DLLP Status */
> #define PCI_ERR_COR_REP_ROLL 0x00000100 /* REPLAY_NUM Rollover */
> #define PCI_ERR_COR_REP_TIMER 0x00001000 /* Replay Timer Timeout */
> +#define PCI_ERR_COR_ADV_NFAT 0x00002000 /* Advisory Non-Fatal */
> +#define PCI_ERR_COR_INTERNAL 0x00004000 /* Corrected Internal */
> +#define PCI_ERR_COR_LOG_OVER 0x00008000 /* Header Log Overflow */
> #define PCI_ERR_COR_MASK 20 /* Correctable Error Mask */
> /* Same bits as above */
> #define PCI_ERR_CAP 24 /* Advanced Error Capabilities */
> @@ -596,12 +638,18 @@
>
> /* Virtual Channel */
> #define PCI_VC_PORT_REG1 4
> +#define PCI_VC_REG1_EVCC 0x7 /* extended vc count */
> #define PCI_VC_PORT_REG2 8
> +#define PCI_VC_REG2_32_PHASE 0x2
> +#define PCI_VC_REG2_64_PHASE 0x4
> +#define PCI_VC_REG2_128_PHASE 0x8
> #define PCI_VC_PORT_CTRL 12
> #define PCI_VC_PORT_STATUS 14
> #define PCI_VC_RES_CAP 16
> #define PCI_VC_RES_CTRL 20
> #define PCI_VC_RES_STATUS 26
> +#define PCI_CAP_VC_BASE_SIZEOF 0x10
> +#define PCI_CAP_VC_PER_VC_SIZEOF 0x0C
>
> /* Power Budgeting */
> #define PCI_PWR_DSR 4 /* Data Select Register */
> @@ -614,6 +662,7 @@
> #define PCI_PWR_DATA_RAIL(x) (((x)>> 18)& 7) /* Power Rail */
> #define PCI_PWR_CAP 12 /* Capability */
> #define PCI_PWR_CAP_BUDGET(x) ((x)& 1) /* Included in system budget */
> +#define PCI_EXT_CAP_PWR_SIZEOF 16
>
> /*
> * Hypertransport sub capability types
> @@ -646,6 +695,8 @@
> #define HT_CAPTYPE_ERROR_RETRY 0xC0 /* Retry on error configuration */
> #define HT_CAPTYPE_GEN3 0xD0 /* Generation 3 hypertransport configuration */
> #define HT_CAPTYPE_PM 0xE0 /* Hypertransport powermanagement configuration */
> +#define HT_CAP_SIZEOF_LONG 28 /* slave& primary */
> +#define HT_CAP_SIZEOF_SHORT 24 /* host& secondary */
>
> /* Alternative Routing-ID Interpretation */
> #define PCI_ARI_CAP 0x04 /* ARI Capability Register */
> @@ -656,6 +707,7 @@
> #define PCI_ARI_CTRL_MFVC 0x0001 /* MFVC Function Groups Enable */
> #define PCI_ARI_CTRL_ACS 0x0002 /* ACS Function Groups Enable */
> #define PCI_ARI_CTRL_FG(x) (((x)>> 4)& 7) /* Function Group */
> +#define PCI_EXT_CAP_ARI_SIZEOF 8
>
> /* Address Translation Service */
> #define PCI_ATS_CAP 0x04 /* ATS Capability Register */
> @@ -665,6 +717,7 @@
> #define PCI_ATS_CTRL_ENABLE 0x8000 /* ATS Enable */
> #define PCI_ATS_CTRL_STU(x) ((x)& 0x1f) /* Smallest Translation Unit */
> #define PCI_ATS_MIN_STU 12 /* shift of minimum STU block */
> +#define PCI_EXT_CAP_ATS_SIZEOF 8
>
> /* Page Request Interface */
> #define PCI_PRI_CTRL 0x04 /* PRI control register */
> @@ -676,6 +729,7 @@
> #define PCI_PRI_STATUS_STOPPED 0x100 /* PRI Stopped */
> #define PCI_PRI_MAX_REQ 0x08 /* PRI max reqs supported */
> #define PCI_PRI_ALLOC_REQ 0x0c /* PRI max reqs allowed */
> +#define PCI_EXT_CAP_PRI_SIZEOF 16
>
> /* PASID capability */
> #define PCI_PASID_CAP 0x04 /* PASID feature register */
> @@ -685,6 +739,7 @@
> #define PCI_PASID_CTRL_ENABLE 0x01 /* Enable bit */
> #define PCI_PASID_CTRL_EXEC 0x02 /* Exec permissions Enable */
> #define PCI_PASID_CTRL_PRIV 0x04 /* Priviledge Mode Enable */
> +#define PCI_EXT_CAP_PASID_SIZEOF 8
>
> /* Single Root I/O Virtualization */
> #define PCI_SRIOV_CAP 0x04 /* SR-IOV Capabilities */
> @@ -716,12 +771,14 @@
> #define PCI_SRIOV_VFM_MI 0x1 /* Dormant.MigrateIn */
> #define PCI_SRIOV_VFM_MO 0x2 /* Active.MigrateOut */
> #define PCI_SRIOV_VFM_AV 0x3 /* Active.Available */
> +#define PCI_EXT_CAP_SRIOV_SIZEOF 64
>
> #define PCI_LTR_MAX_SNOOP_LAT 0x4
> #define PCI_LTR_MAX_NOSNOOP_LAT 0x6
> #define PCI_LTR_VALUE_MASK 0x000003ff
> #define PCI_LTR_SCALE_MASK 0x00001c00
> #define PCI_LTR_SCALE_SHIFT 10
> +#define PCI_EXT_CAP_LTR_SIZEOF 8
>
> /* Access Control Service */
> #define PCI_ACS_CAP 0x04 /* ACS Capability Register */
> @@ -732,7 +789,38 @@
> #define PCI_ACS_UF 0x10 /* Upstream Forwarding */
> #define PCI_ACS_EC 0x20 /* P2P Egress Control */
> #define PCI_ACS_DT 0x40 /* Direct Translated P2P */
> +#define PCI_ACS_EGRESS_BITS 0x05 /* ACS Egress Control Vector Size */
> #define PCI_ACS_CTRL 0x06 /* ACS Control Register */
> #define PCI_ACS_EGRESS_CTL_V 0x08 /* ACS Egress Control Vector */
>
> +#define PCI_VSEC_HDR 4 /* extended cap - vendor specific */
> +#define PCI_VSEC_HDR_LEN_SHIFT 20 /* shift for length field */
> +
> +/* sata capability */
> +#define PCI_SATA_REGS 4 /* SATA REGs specifier */
> +#define PCI_SATA_REGS_MASK 0xF /* location - BAR#/inline */
> +#define PCI_SATA_REGS_INLINE 0xF /* REGS in config space */
> +#define PCI_SATA_SIZEOF_SHORT 8
> +#define PCI_SATA_SIZEOF_LONG 16
> +
> +/* resizable BARs */
> +#define PCI_REBAR_CTRL 8 /* control register */
> +#define PCI_REBAR_CTRL_NBAR_MASK (7<< 5) /* mask for # bars */
> +#define PCI_REBAR_CTRL_NBAR_SHIFT 5 /* shift for # bars */
> +
> +/* dynamic power allocation */
> +#define PCI_DPA_CAP 4 /* capability register */
> +#define PCI_DPA_CAP_SUBSTATE_MASK 0x1F /* # substates - 1 */
> +#define PCI_DPA_BASE_SIZEOF 16 /* size with 0 substates */
> +
> +/* TPH Requester */
> +#define PCI_TPH_CAP 4 /* capability register */
> +#define PCI_TPH_CAP_LOC_MASK 0x600 /* location mask */
> +#define PCI_TPH_LOC_NONE 0x000 /* no location */
> +#define PCI_TPH_LOC_CAP 0x200 /* in capability */
> +#define PCI_TPH_LOC_MSIX 0x400 /* in MSI-X */
> +#define PCI_TPH_CAP_ST_MASK 0x07FF0000 /* st table mask */
> +#define PCI_TPH_CAP_ST_SHIFT 16 /* st table shift */
> +#define PCI_TPH_BASE_SIZEOF 12 /* size with no st table */
> +
> #endif /* LINUX_PCI_REGS_H */
>

2012-05-24 21:56:37

by Donald Dutile

Subject: Re: [PATCH v2 00/13] IOMMU Groups + VFIO

On 05/22/2012 01:04 AM, Alex Williamson wrote:
> Version 2 incorporating acks and feedback from v1. The PCI DMA quirk
> and ACS check are reworked, sysfs iommu groups ABI Documentation
> added as well as numerous other fixes, including patches from Alexey
> Kardashevskiy towards supporting POWER usage of VFIO and IOMMU groups.
>
> This series can be found here on top of 3.4:
>
> git://github.com/awilliam/linux-vfio.git iommu-group-vfio-20120521
>
> The Qemu tree has also been updated to Qemu 1.1 and can be found here:
>
> git://github.com/awilliam/qemu-vfio.git iommu-group-vfio
>
> I'd really like to make a push to get this in for 3.5, so let's talk
> about how to do that across iommu, pci, and new driver. Joerg, are
> you sufficiently happy with the IOMMU group concept and code? We'll
> also need David Woodhouse buyin on the intel-iommu changes in patches
> 3 & 6. Who needs to approve VFIO as a new driver, GregKH? Bjorn,
> I'd be happy to send the PCI changes as a series for you, but I
> wonder if it makes sense to collect acks for them if you approve and
> bundle them in with the associated code that needs them so you're
> not left with unused code. Let me know which you prefer. If there
> are better ways to do it, please let me know. Thanks,
>
> Alex
>
> ---
Ack to 1, 2, 4, 6, 8, 10 & 11.
Provided some minor feedback on 3, 9 & 12.
I still have to do a final review of the big stuff, 7 & 13.
>
> Alex Williamson (13):
> vfio: Add PCI device driver
> pci: Misc pci_reg additions
> pci: Create common pcibios_err_to_errno
> pci: export pci_user functions for use by other drivers
> vfio: x86 IOMMU implementation
> vfio: Add documentation
> vfio: VFIO core
> iommu: Make use of DMA quirking and ACS enabled check for groups
> pci: Add ACS validation utility
> pci: Add PCI DMA source ID quirk
> iommu: IOMMU groups for VT-d and AMD-Vi
> iommu: IOMMU Groups
> driver core: Add iommu_group tracking to struct device
>
>
> .../ABI/testing/sysfs-kernel-iommu_groups | 14
> Documentation/ioctl/ioctl-number.txt | 1
> Documentation/vfio.txt | 315 ++++
> MAINTAINERS | 8
> drivers/Kconfig | 2
> drivers/Makefile | 1
> drivers/iommu/amd_iommu.c | 67 +
> drivers/iommu/intel-iommu.c | 87 +
> drivers/iommu/iommu.c | 578 +++++++-
> drivers/pci/access.c | 6
> drivers/pci/pci.c | 76 +
> drivers/pci/pci.h | 7
> drivers/pci/quirks.c | 69 +
> drivers/vfio/Kconfig | 16
> drivers/vfio/Makefile | 3
> drivers/vfio/pci/Kconfig | 8
> drivers/vfio/pci/Makefile | 4
> drivers/vfio/pci/vfio_pci.c | 557 +++++++
> drivers/vfio/pci/vfio_pci_config.c | 1522 ++++++++++++++++++++
> drivers/vfio/pci/vfio_pci_intrs.c | 724 ++++++++++
> drivers/vfio/pci/vfio_pci_private.h | 91 +
> drivers/vfio/pci/vfio_pci_rdwr.c | 269 ++++
> drivers/vfio/vfio.c | 1413 +++++++++++++++++++
> drivers/vfio/vfio_iommu_x86.c | 743 ++++++++++
> drivers/xen/xen-pciback/conf_space.c | 6
> include/linux/device.h | 2
> include/linux/iommu.h | 104 +
> include/linux/pci.h | 49 +
> include/linux/pci_regs.h | 112 +
> include/linux/vfio.h | 444 ++++++
> 30 files changed, 7182 insertions(+), 116 deletions(-)
> create mode 100644 Documentation/ABI/testing/sysfs-kernel-iommu_groups
> create mode 100644 Documentation/vfio.txt
> create mode 100644 drivers/vfio/Kconfig
> create mode 100644 drivers/vfio/Makefile
> create mode 100644 drivers/vfio/pci/Kconfig
> create mode 100644 drivers/vfio/pci/Makefile
> create mode 100644 drivers/vfio/pci/vfio_pci.c
> create mode 100644 drivers/vfio/pci/vfio_pci_config.c
> create mode 100644 drivers/vfio/pci/vfio_pci_intrs.c
> create mode 100644 drivers/vfio/pci/vfio_pci_private.h
> create mode 100644 drivers/vfio/pci/vfio_pci_rdwr.c
> create mode 100644 drivers/vfio/vfio.c
> create mode 100644 drivers/vfio/vfio_iommu_x86.c
> create mode 100644 include/linux/vfio.h

2012-05-24 22:18:32

by Alex Williamson

Subject: Re: [PATCH v2 12/13] pci: Misc pci_reg additions

On Thu, 2012-05-24 at 17:49 -0400, Don Dutile wrote:
> On 05/22/2012 01:05 AM, Alex Williamson wrote:
> > Fill in many missing definitions and add sizeof fields for many
> > sections allowing for more extensive config parsing.
> >
> > Signed-off-by: Alex Williamson<[email protected]>
> > ---
> >
> overall, i'm very glad to see defines instead of hardcoded numbers in the code, but....
>
> > include/linux/pci_regs.h | 112 +++++++++++++++++++++++++++++++++++++++++-----
> > 1 files changed, 100 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
> > index 4b608f5..379be84 100644
> > --- a/include/linux/pci_regs.h
> > +++ b/include/linux/pci_regs.h
> > @@ -26,6 +26,7 @@
> > * Under PCI, each device has 256 bytes of configuration address space,
> > * of which the first 64 bytes are standardized as follows:
> > */
> > +#define PCI_STD_HEADER_SIZEOF 64
> > #define PCI_VENDOR_ID 0x00 /* 16 bits */
> > #define PCI_DEVICE_ID 0x02 /* 16 bits */
> > #define PCI_COMMAND 0x04 /* 16 bits */
> > @@ -209,9 +210,12 @@
> > #define PCI_CAP_ID_SHPC 0x0C /* PCI Standard Hot-Plug Controller */
> > #define PCI_CAP_ID_SSVID 0x0D /* Bridge subsystem vendor/device ID */
> > #define PCI_CAP_ID_AGP3 0x0E /* AGP Target PCI-PCI bridge */
> > +#define PCI_CAP_ID_SECDEV 0x0F /* Secure Device */
> > #define PCI_CAP_ID_EXP 0x10 /* PCI Express */
> > #define PCI_CAP_ID_MSIX 0x11 /* MSI-X */
> > +#define PCI_CAP_ID_SATA 0x12 /* SATA Data/Index Conf. */
> > #define PCI_CAP_ID_AF 0x13 /* PCI Advanced Features */
> > +#define PCI_CAP_ID_MAX PCI_CAP_ID_AF
> > #define PCI_CAP_LIST_NEXT 1 /* Next capability in the list */
> > #define PCI_CAP_FLAGS 2 /* Capability defined flags (16 bits) */
> > #define PCI_CAP_SIZEOF 4
> > @@ -276,6 +280,7 @@
> > #define PCI_VPD_ADDR_MASK 0x7fff /* Address mask */
> > #define PCI_VPD_ADDR_F 0x8000 /* Write 0, 1 indicates completion */
> > #define PCI_VPD_DATA 4 /* 32-bits of data returned here */
> > +#define PCI_CAP_VPD_SIZEOF 8
> >
> > /* Slot Identification */
> >
> > @@ -297,8 +302,10 @@
> > #define PCI_MSI_ADDRESS_HI 8 /* Upper 32 bits (if PCI_MSI_FLAGS_64BIT set) */
> > #define PCI_MSI_DATA_32 8 /* 16 bits of data for 32-bit devices */
> > #define PCI_MSI_MASK_32 12 /* Mask bits register for 32-bit devices */
> > +#define PCI_MSI_PENDING_32 16 /* Pending intrs for 32-bit devices */
> > #define PCI_MSI_DATA_64 12 /* 16 bits of data for 64-bit devices */
> > #define PCI_MSI_MASK_64 16 /* Mask bits register for 64-bit devices */
> > +#define PCI_MSI_PENDING_64 20 /* Pending intrs for 64-bit devices */
> >
> > /* MSI-X registers */
> > #define PCI_MSIX_FLAGS 2
> > @@ -308,6 +315,7 @@
> > #define PCI_MSIX_TABLE 4
> > #define PCI_MSIX_PBA 8
> > #define PCI_MSIX_FLAGS_BIRMASK (7<< 0)
> > +#define PCI_CAP_MSIX_SIZEOF 12 /* size of MSIX registers */
> >
> > /* MSI-X entry's format */
> > #define PCI_MSIX_ENTRY_SIZE 16
> > @@ -338,6 +346,7 @@
> > #define PCI_AF_CTRL_FLR 0x01
> > #define PCI_AF_STATUS 5
> > #define PCI_AF_STATUS_TP 0x01
> > +#define PCI_CAP_AF_SIZEOF 6 /* size of AF registers */
> >
> > /* PCI-X registers */
> >
> > @@ -374,6 +383,9 @@
> > #define PCI_X_STATUS_SPL_ERR 0x20000000 /* Rcvd Split Completion Error Msg */
> > #define PCI_X_STATUS_266MHZ 0x40000000 /* 266 MHz capable */
> > #define PCI_X_STATUS_533MHZ 0x80000000 /* 533 MHz capable */
> > +#define PCI_X_ECC_CSR 8 /* ECC control and status */
> > +#define PCI_CAP_PCIX_SIZEOF_V0 8 /* size of registers for Version 0 */
> > +#define PCI_CAP_PCIX_SIZEOF_V12 24 /* size for Version 1 & 2 */
> ew!
> unlikely that version 12 will ever exist, but why not:
> #define PCI_CAP_PCIX_SIZEOF_V1 24
> #define PCI_CAP_PCIX_SIZEOF_V2 PCI_CAP_PCIX_SIZEOF_V1

Works for me, will fix. Thanks,

Alex
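
As a sketch of how a config-space parser might consume the split defines: the
helper below picks the capability length from the PCI-X command register.
toy_pcix_cap_len() and TOY_PCIX_CMD_VERSION() are hypothetical names, and the
position of the version field (bits 13:12 of the command register) is an
assumption here, so check it against pci_regs.h before relying on it.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_CAP_PCIX_SIZEOF_V0	8	/* size of registers for Version 0 */
#define PCI_CAP_PCIX_SIZEOF_V1	24	/* size of registers for Version 1 */
#define PCI_CAP_PCIX_SIZEOF_V2	PCI_CAP_PCIX_SIZEOF_V1	/* same as Version 1 */

/* Assumption: the capability version sits in bits 13:12 of PCI_X_CMD. */
#define TOY_PCIX_CMD_VERSION(cmd)	(((cmd) >> 12) & 3)

/* Returns the capability length in bytes, or -1 for an unknown version. */
int toy_pcix_cap_len(uint16_t cmd)
{
	switch (TOY_PCIX_CMD_VERSION(cmd)) {
	case 0:
		return PCI_CAP_PCIX_SIZEOF_V0;
	case 1:
		return PCI_CAP_PCIX_SIZEOF_V1;
	case 2:
		return PCI_CAP_PCIX_SIZEOF_V2;
	default:
		return -1;
	}
}
```

Having one define per version keeps call sites readable even though versions
1 and 2 currently share a size.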

2012-05-24 22:36:13

by Alex Williamson

Subject: Re: [PATCH v2 05/13] pci: Add ACS validation utility

On Thu, 2012-05-24 at 17:30 -0400, Don Dutile wrote:
> On 05/22/2012 01:05 AM, Alex Williamson wrote:
> > In a PCI environment, transactions aren't always required to reach
> > the root bus before being re-routed. Intermediate switches between
> > an endpoint and the root bus can redirect DMA back downstream before
> > things like IOMMUs have a chance to intervene. Legacy PCI is always
> > susceptible to this as it operates on a shared bus. PCIe added a
> > new capability to describe and control this behavior, Access Control
> > Services, or ACS. The utility function pci_acs_enabled() allows us
> > to test the ACS capabilities of an individual device against a set
> > of flags while pci_acs_path_enabled() tests a complete path from
> > a given downstream device up to the specified upstream device. We
> > also include the ability to add device specific tests as it's
> > likely we'll see devices that do not implement ACS, but want to
> > indicate support for various capabilities in this space.
> >
> > Signed-off-by: Alex Williamson<[email protected]>
> > ---
> >
> > drivers/pci/pci.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > drivers/pci/quirks.c | 29 +++++++++++++++++++
> > include/linux/pci.h | 10 ++++++-
> > 3 files changed, 114 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 111569c..ab6c2a6 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -2359,6 +2359,82 @@ void pci_enable_acs(struct pci_dev *dev)
> > }
> >
> > /**
> > + * pci_acs_enable - test ACS against required flags for a given device
> typo: ^^^ missing 'd'

d'oh. fixed

> > + * @pdev: device to test
> > + * @acs_flags: required PCI ACS flags
> > + *
> > + * Return true if the device supports the provided flags. Automatically
> > + * filters out flags that are not implemented on multifunction devices.
> > + */
> > +bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags)
> > +{
> > + int pos;
> > + u16 ctrl;
> > +
> > + if (pci_dev_specific_acs_enabled(pdev, acs_flags))
> > + return true;
> > +
> > + if (!pci_is_pcie(pdev))
> > + return false;
> > +
> > + if (pdev->pcie_type == PCI_EXP_TYPE_DOWNSTREAM ||
> > + pdev->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
> > + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> > + if (!pos)
> > + return false;
> > +
> > + pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> > + if ((ctrl & acs_flags) != acs_flags)
> > + return false;
> > + } else if (pdev->multifunction) {
> > + /* Filter out flags not applicable to multifunction */
> > + acs_flags &= (PCI_ACS_RR | PCI_ACS_CR |
> > + PCI_ACS_EC | PCI_ACS_DT);
> > +
> > + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> > + if (!pos)
> > + return false;
> > +
> > + pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> > + if ((ctrl & acs_flags) != acs_flags)
> > + return false;
> > + }
> > +
> > + return true;
> or, to reduce duplicated code (which compiler may do?):
>
> /* Filter out flags not applicable to multifunction */
> if (pdev->multifunction)
> acs_flags &= (PCI_ACS_RR | PCI_ACS_CR |
> PCI_ACS_EC | PCI_ACS_DT);
>
> if (pdev->pcie_type == PCI_EXP_TYPE_DOWNSTREAM ||
> pdev->pcie_type == PCI_EXP_TYPE_ROOT_PORT ||
> pdev->multifunction) {
> pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> if (!pos)
> return false;
> pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> if ((ctrl & acs_flags) != acs_flags)
> return false;
> }
>
> return true;

Good suggestion
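
For reference, a minimal userspace sketch of the merged check. It is not the kernel code: struct mock_dev, mock_acs_enabled() and the ACS_* names are stand-ins invented here, and the quirk hook and the pci_is_pcie() test are left out. One thing that seems worth checking in the merged form quoted above: the multifunction mask is applied before the port test, so it would also trim flags for a downstream or root port that happens to be multifunction, which the original if/else did not do. The sketch assumes the original precedence is the intended one and only shares the capability read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ACS control bits, named after the PCI_ACS_* flags used in the patch. */
#define ACS_SV 0x01 /* Source Validation */
#define ACS_RR 0x04 /* P2P Request Redirect */
#define ACS_CR 0x08 /* P2P Completion Redirect */
#define ACS_UF 0x10 /* Upstream Forwarding */
#define ACS_EC 0x20 /* P2P Egress Control */
#define ACS_DT 0x40 /* Direct Translated P2P */

enum mock_type { MOCK_ENDPOINT, MOCK_DOWNSTREAM, MOCK_ROOT_PORT };

/* Hypothetical stand-in for struct pci_dev. */
struct mock_dev {
	enum mock_type type;
	bool multifunction;
	bool has_acs_cap;  /* models pci_find_ext_capability() != 0 */
	uint16_t acs_ctrl; /* models the ACS control register value */
};

static bool mock_acs_enabled(const struct mock_dev *d, uint16_t acs_flags)
{
	bool is_port = d->type == MOCK_DOWNSTREAM ||
		       d->type == MOCK_ROOT_PORT;

	/* Single-function endpoints have nothing to test. */
	if (!is_port && !d->multifunction)
		return true;

	/* Filter out flags not applicable to multifunction endpoints. */
	if (!is_port)
		acs_flags &= (ACS_RR | ACS_CR | ACS_EC | ACS_DT);

	if (!d->has_acs_cap)
		return false;

	/* Every requested flag must be set in the control register. */
	return (d->acs_ctrl & acs_flags) == acs_flags;
}

static void acs_sketch_selfcheck(void)
{
	uint16_t want = ACS_SV | ACS_RR | ACS_CR | ACS_UF;
	struct mock_dev port = { MOCK_DOWNSTREAM, false, true, want };
	struct mock_dev func = { MOCK_ENDPOINT, true, true, ACS_RR | ACS_CR };

	assert(mock_acs_enabled(&port, want));
	/* SV and UF are masked off for a multifunction endpoint. */
	assert(mock_acs_enabled(&func, want));
	func.acs_ctrl = ACS_RR;
	assert(!mock_acs_enabled(&func, want));
}
```

Calling acs_sketch_selfcheck() from a test main exercises the three cases: a port with all flags set, a multifunction endpoint that only needs the filtered subset, and the same endpoint with Completion Redirect cleared.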

> > +}
>
> But the above doesn't handle the case where the RC does not do
> peer-to-peer btwn root ports. Per ACS spec, such a RC's root ports
> don't need to provide an ACS cap, since peer-to-peer port xfers aren't
> allowed/enabled/supported, so by design, the root port is ACS compliant.
> ATM, an IOMMU-capable system is a pre-req for VFIO,
> and all such systems have an ACS cap, but that may not always be true.

How do we know the behavior of the RC? This is why I allowed the NULL
bailout below so we don't have to check the RC and can assume the iommu
driver knows something about it.

> > +EXPORT_SYMBOL_GPL(pci_acs_enabled);
> > +
> > +/**
> > + * pci_acs_path_enabled - test ACS flags from start to end in a hierarchy
> > + * @start: starting downstream device
> > + * @end: ending upstream device or NULL to search to the root bus
> > + * @acs_flags: required flags
> > + *
> > + * Walk up a device tree from start to end testing PCI ACS support. If
> > + * any step along the way does not support the required flags, return false.
> > + */
> > +bool pci_acs_path_enabled(struct pci_dev *start,
> > + struct pci_dev *end, u16 acs_flags)
> > +{
> > + struct pci_dev *pdev, *parent = start;
> > +
> > + do {
> > + pdev = parent;
> > +
> > + if (!pci_acs_enabled(pdev, acs_flags))
> > + return false;
> > +
> > + if (pci_is_root_bus(pdev->bus))
> > + return (end == NULL);
> doesn't this mean that a caller can't pass the pdev of the root port?
> I would think that is a valid call, albeit not the common one.

I think there's nothing to step up to after this point, IIRC
pdev->bus->self segfaults from here. So if we reach the end and you
didn't ask for the end and haven't found your end device, we're done
either way. Is that not true?

> Also worried that the above code may be true on Intel machines, but not on AMD
> machines (the latter reps its IOMMU as a pdev of root bus, doesn't it?)

I hope it would be the usage in the respective iommu drivers that would
need to account for this. I've designed this code to not care. Patch
06/13 does a search to NULL for both AMD and Intel, which seems to work.
AMD does expose the IOMMU as a PCI device, but it's just a peer device
of everything else on the root bus, not a parent, so we can't search
with it as the end device.
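
To make the termination rules concrete, here is a small userspace model of the walk as posted. struct mock_node and mock_acs_path_enabled() are hypothetical names made up for this sketch; a NULL parent stands in for pci_is_root_bus(pdev->bus) and the acs_ok flag stands in for pci_acs_enabled(). It is only meant to illustrate the loop, not to replace it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pci_dev in the hierarchy. */
struct mock_node {
	struct mock_node *parent; /* upstream bridge, NULL on the root bus */
	bool acs_ok;              /* models pci_acs_enabled(pdev, flags) */
};

static bool mock_acs_path_enabled(struct mock_node *start,
				  struct mock_node *end)
{
	struct mock_node *pdev, *parent = start;

	do {
		pdev = parent;

		if (!pdev->acs_ok)
			return false;

		/* Nothing further upstream: pass only if no end was given. */
		if (pdev->parent == NULL)
			return end == NULL;

		parent = pdev->parent;
	} while (pdev != end);

	return true;
}

static void path_sketch_selfcheck(void)
{
	struct mock_node root_port = { NULL, true };
	struct mock_node sw_down = { &root_port, true };
	struct mock_node endpoint = { &sw_down, true };

	/* Search to the root bus, as patch 06/13 does. */
	assert(mock_acs_path_enabled(&endpoint, NULL));
	/* Stop at an intermediate bridge. */
	assert(mock_acs_path_enabled(&endpoint, &sw_down));
	/* A device on the root bus passed as end is reported as not found. */
	assert(!mock_acs_path_enabled(&endpoint, &root_port));
	/* Any failing hop fails the whole path. */
	sw_down.acs_ok = false;
	assert(!mock_acs_path_enabled(&endpoint, NULL));
}
```

The third assertion is the case Don asks about: because the root-bus test runs before the loop condition, a root port given as the end device returns false in this model even though every hop passed.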

> > +
> > + parent = pdev->bus->self;
> > + } while (pdev != end);
> > +
> > + return true;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_acs_path_enabled);
> > +
> > +/**
> > * pci_swizzle_interrupt_pin - swizzle INTx for device behind bridge
> > * @dev: the PCI device
> > * @pin: the INTx pin (1=INTA, 2=INTB, 3=INTD, 4=INTD)
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index a2dd77f..4ed6aa6 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3149,3 +3149,32 @@ struct pci_dev *pci_dma_source(struct pci_dev *dev)
> >
> > return dev;
> > }
> > +
> > +static const struct pci_dev_acs_enabled {
> > + u16 vendor;
> > + u16 device;
> > + bool (*acs_enabled)(struct pci_dev *dev, u16 acs_flags);
> > +} pci_dev_acs_enabled[] = {
> > + { 0 }
> > +};
> > +
> > +bool pci_dev_specific_acs_enabled(struct pci_dev *dev, u16 acs_flags)
> > +{
> > + const struct pci_dev_acs_enabled *i;
> > +
> > + /*
> > + * Allow devices that do not expose standard PCI ACS capabilities
> > + * or control to indicate their support here. Multi-function devices
> > + * which do not allow internal peer-to-peer between functions, but
> > + * do not implement PCI ACS may wish to return true here.
> > + */
> > + for (i = pci_dev_acs_enabled; i->acs_enabled; i++) {
> > + if ((i->vendor == dev->vendor ||
> > + i->vendor == (u16)PCI_ANY_ID) &&
> > + (i->device == dev->device ||
> > + i->device == (u16)PCI_ANY_ID))
> > + return i->acs_enabled(dev, acs_flags);
> > + }
> > +
> > + return false;
> > +}
> I can't wait until these quirks are filled in! :)

The list could get long...
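
As an illustration of what a filled-in entry might look like, here is a userspace sketch of the table lookup. Everything in it is hypothetical: the mock_* names, the 0x1234 vendor ID (a placeholder, not a claim about any real device) and the callback that accepts every flag. It only shows the wildcard match and the NULL-callback terminator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_ANY_ID 0xffff /* plays the role of (u16)PCI_ANY_ID */

/* Hypothetical stand-in for struct pci_dev. */
struct mock_pci_dev {
	uint16_t vendor;
	uint16_t device;
};

struct mock_acs_quirk {
	uint16_t vendor;
	uint16_t device;
	bool (*acs_enabled)(const struct mock_pci_dev *dev,
			    uint16_t acs_flags);
};

/* Placeholder callback: claims isolation for any requested flags. */
static bool mock_always_isolated(const struct mock_pci_dev *dev,
				 uint16_t acs_flags)
{
	(void)dev;
	(void)acs_flags;
	return true;
}

static const struct mock_acs_quirk mock_quirks[] = {
	{ 0x1234, MOCK_ANY_ID, mock_always_isolated }, /* placeholder ID */
	{ 0, 0, NULL } /* terminator: the loop stops on a NULL callback */
};

static bool mock_dev_specific_acs_enabled(const struct mock_pci_dev *dev,
					  uint16_t acs_flags)
{
	const struct mock_acs_quirk *q;

	for (q = mock_quirks; q->acs_enabled; q++) {
		if ((q->vendor == dev->vendor || q->vendor == MOCK_ANY_ID) &&
		    (q->device == dev->device || q->device == MOCK_ANY_ID))
			return q->acs_enabled(dev, acs_flags);
	}

	return false;
}

static void quirk_sketch_selfcheck(void)
{
	struct mock_pci_dev listed = { 0x1234, 0x0001 };
	struct mock_pci_dev other = { 0x8086, 0x0001 };

	assert(mock_dev_specific_acs_enabled(&listed, 0x04));
	assert(!mock_dev_specific_acs_enabled(&other, 0x04));
}
```

Note that the first matching entry decides the result, so a callback returning false ends the search rather than falling through to later entries.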

Thanks,
Alex

2012-05-24 22:47:21

by Alex Williamson

Subject: Re: [PATCH v2 09/13] vfio: x86 IOMMU implementation

On Thu, 2012-05-24 at 17:38 -0400, Don Dutile wrote:
> On 05/22/2012 01:05 AM, Alex Williamson wrote:
> > x86 is probably the wrong name for this VFIO IOMMU driver, but x86
> > is the primary target for it. This driver supports a very simple
> > usage model using the existing IOMMU API. The IOMMU is expected to
> > support the full host address space with no special IOVA windows,
> > number of mappings restrictions, or unique processor target options.
> >
> > Signed-off-by: Alex Williamson <[email protected]>
> > ---
> >
> > Documentation/ioctl/ioctl-number.txt | 2
> > drivers/vfio/Kconfig | 6
> > drivers/vfio/Makefile | 2
> > drivers/vfio/vfio.c | 7
> > drivers/vfio/vfio_iommu_x86.c | 743 ++++++++++++++++++++++++++++++++++
> > include/linux/vfio.h | 52 ++
> > 6 files changed, 811 insertions(+), 1 deletions(-)
> > create mode 100644 drivers/vfio/vfio_iommu_x86.c
> >
> > diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
> > index 111e30a..9d1694e 100644
> > --- a/Documentation/ioctl/ioctl-number.txt
> > +++ b/Documentation/ioctl/ioctl-number.txt
> > @@ -88,7 +88,7 @@ Code Seq#(hex) Include File Comments
> > and kernel/power/user.c
> > '8' all SNP8023 advanced NIC card
> > <mailto:[email protected]>
> > -';' 64-6F linux/vfio.h
> > +';' 64-72 linux/vfio.h
> > '@' 00-0F linux/radeonfb.h conflict!
> > '@' 00-0F drivers/video/aty/aty128fb.c conflict!
> > 'A' 00-1F linux/apm_bios.h conflict!
> > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> > index 9acb1e7..bd88a30 100644
> > --- a/drivers/vfio/Kconfig
> > +++ b/drivers/vfio/Kconfig
> > @@ -1,6 +1,12 @@
> > +config VFIO_IOMMU_X86
> > + tristate
> > + depends on VFIO && X86
> > + default n
> > +
> > menuconfig VFIO
> > tristate "VFIO Non-Privileged userspace driver framework"
> > depends on IOMMU_API
> > + select VFIO_IOMMU_X86 if X86
> > help
> > VFIO provides a framework for secure userspace device drivers.
> > See Documentation/vfio.txt for more details.
>
> So a future refactoring that uses some chunk of this support
> on a non-x86 machine could be a lot of useless renaming.
>
> Why not rename vfio_iommu_x86 to something like vfio_iommu_no_iova
> and just make it conditionally compiled on X86 (as you've done above in Kconfig's)?
> Then if another arch can use it, or refactors the file to use
> some of it, and split x86 vs <other-arch> into separate per-arch files,
> or per-iova schemes, it's more descriptive and less disruptive?

Yep, the problem is how to concisely describe what we expect to support
here. This file supports IOMMU API based usage of an IOMMU with
effectively no DMA window or mapping constraints, optimized for static
mapping of an address space. What's a good name for that? Maybe I
should follow the example of others and just call it a Type 1 IOMMU
implementation so the marketing material looks better! ;-P That may
honestly be better than calling it x86. Thoughts? Thanks,

Alex

2012-05-25 15:22:35

by Donald Dutile

Subject: Re: [PATCH v2 09/13] vfio: x86 IOMMU implementation

On 05/24/2012 06:46 PM, Alex Williamson wrote:
> On Thu, 2012-05-24 at 17:38 -0400, Don Dutile wrote:
>> On 05/22/2012 01:05 AM, Alex Williamson wrote:
>>> x86 is probably the wrong name for this VFIO IOMMU driver, but x86
>>> is the primary target for it. This driver supports a very simple
>>> usage model using the existing IOMMU API. The IOMMU is expected to
>>> support the full host address space with no special IOVA windows,
>>> number of mappings restrictions, or unique processor target options.
>>>
>>> Signed-off-by: Alex Williamson <[email protected]>
>>> ---
>>>
>>> Documentation/ioctl/ioctl-number.txt | 2
>>> drivers/vfio/Kconfig | 6
>>> drivers/vfio/Makefile | 2
>>> drivers/vfio/vfio.c | 7
>>> drivers/vfio/vfio_iommu_x86.c | 743 ++++++++++++++++++++++++++++++++++
>>> include/linux/vfio.h | 52 ++
>>> 6 files changed, 811 insertions(+), 1 deletions(-)
>>> create mode 100644 drivers/vfio/vfio_iommu_x86.c
>>>
>>> diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
>>> index 111e30a..9d1694e 100644
>>> --- a/Documentation/ioctl/ioctl-number.txt
>>> +++ b/Documentation/ioctl/ioctl-number.txt
>>> @@ -88,7 +88,7 @@ Code Seq#(hex) Include File Comments
>>> and kernel/power/user.c
>>> '8' all SNP8023 advanced NIC card
>>> <mailto:[email protected]>
>>> -';' 64-6F linux/vfio.h
>>> +';' 64-72 linux/vfio.h
>>> '@' 00-0F linux/radeonfb.h conflict!
>>> '@' 00-0F drivers/video/aty/aty128fb.c conflict!
>>> 'A' 00-1F linux/apm_bios.h conflict!
>>> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
>>> index 9acb1e7..bd88a30 100644
>>> --- a/drivers/vfio/Kconfig
>>> +++ b/drivers/vfio/Kconfig
>>> @@ -1,6 +1,12 @@
>>> +config VFIO_IOMMU_X86
>>> + tristate
>>> + depends on VFIO && X86
>>> + default n
>>> +
>>> menuconfig VFIO
>>> tristate "VFIO Non-Privileged userspace driver framework"
>>> depends on IOMMU_API
>>> + select VFIO_IOMMU_X86 if X86
>>> help
>>> VFIO provides a framework for secure userspace device drivers.
>>> See Documentation/vfio.txt for more details.
>>
>> So a future refactoring that uses some chunk of this support
>> on a non-x86 machine could be a lot of useless renaming.
>>
>> Why not rename vfio_iommu_x86 to something like vfio_iommu_no_iova
>> and just make it conditionally compiled on X86 (as you've done above in Kconfig's)?
>> Then if another arch can use it, or refactors the file to use
>> some of it, and split x86 vs <other-arch> into separate per-arch files,
>> or per-iova schemes, it's more descriptive and less disruptive?
>
> Yep, the problem is how to concisely describe what we expect to support
> here. This file supports IOMMU API based usage of an IOMMU with
> effectively no DMA window or mapping constraints, optimized for static
> mapping of an address space. What's a good name for that? Maybe I
> should follow the example of others and just call it a Type 1 IOMMU
> implementation so the marketing material looks better! ;-P That may
> honestly be better than calling it x86. Thoughts? Thanks,
>
> Alex
>
I'll vote for 'type1' over 'x86'.
Add a comment in the file describing what a 'type1 IOMMU' is.
Then others can dupe the format for typeX.
