2010-07-16 21:58:53

by Tom Lyon

[permalink] [raw]
Subject: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

The VFIO "driver" is used to allow privileged AND non-privileged processes to
implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
devices.
Signed-off-by: Tom Lyon <[email protected]>
---
In this version:

There are lots of bug fixes and cleanups in this version, but the main
change is to check to make sure that the IOMMU has interrupt remapping
enabled, which is necessary to prevent user level code from triggering
spurious interrupts for other devices. Since most platforms today
do not have the necessary hardware and/or software for this, a module
option can override this check, thus making vfio useful (but not safe)
on many more platforms.

In the next version I plan to add kernel to user messaging using the
generic netlink mechanism to allow the user driver to react to hot add
and remove, and power management requests.

Blurb from version 2:

This version now requires an IOMMU domain to be set before any access to
device registers is granted (except that config space may be read). In
addition, the VFIO_DMA_MAP_ANYWHERE is dropped - it used the dma_map_sg API
which does not have sufficient controls around IOMMU usage. The IOMMU domain
is obtained from the 'uiommu' driver which is included in this patch.

Various locking, security, and documentation issues have also been fixed.

Please commit - it or me!
But seriously, who gets to commit this? Avi for KVM? or GregKH for drivers?

Blurb from version 1:

This patch is the evolution of code which was first proposed as a patch to
uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
out of the uio framework, and things seem much cleaner. Of course, there is
a lot of functional overlap with uio, but the previous version just seemed
like a giant mode switch in the uio code that did not lead to clarity for
either the new or old code.

[a pony for avi...]
The major new functionality in this version is the ability to deal with
PCI config space accesses (through read & write calls) - but includes table
driven code to determine whats safe to write and what is not. Also, some
virtualization of the config space to allow drivers to think they're writing
some registers when they're not. Also, IO space accesses are also allowed.
Drivers for devices which use MSI-X are now prevented from directly writing
the MSI-X vector area.

All interrupts are now handled using eventfds, which makes things very simple.

The name VFIO refers to the Virtual Function capabilities of SR-IOV devices
but the driver does support many more types of devices. I was none too sure
what driver directory this should live in, so for now I made up my own under
drivers/vfio. As a new driver/new directory, who makes the commit decision?

I currently have user level drivers working for 3 different network adapters
- the Cisco "Palo" enic, the Intel 82599 VF, and the Intel 82576 VF (but the
whole user level framework is a long ways from release). This driver could
also clearly replace a number of other drivers written just to give user
access to certain devices - but that will take time.

Documentation/ioctl/ioctl-number.txt | 1 +
Documentation/vfio.txt | 183 ++++++++++
MAINTAINERS | 8 +
drivers/Kconfig | 2 +
drivers/Makefile | 1 +
drivers/vfio/Kconfig | 18 +
drivers/vfio/Makefile | 6 +
drivers/vfio/uiommu.c | 126 +++++++
drivers/vfio/vfio_dma.c | 342 ++++++++++++++++++
drivers/vfio/vfio_intrs.c | 191 ++++++++++
drivers/vfio/vfio_main.c | 642 ++++++++++++++++++++++++++++++++++
drivers/vfio/vfio_pci_config.c | 605 ++++++++++++++++++++++++++++++++
drivers/vfio/vfio_rdwr.c | 152 ++++++++
drivers/vfio/vfio_sysfs.c | 153 ++++++++
include/linux/Kbuild | 1 +
include/linux/uiommu.h | 76 ++++
include/linux/vfio.h | 202 +++++++++++
17 files changed, 2709 insertions(+), 0 deletions(-)
diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index dd5806f..fe85fbb 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -87,6 +87,7 @@ Code Seq#(hex) Include File Comments
and kernel/power/user.c
'8' all SNP8023 advanced NIC card
<mailto:[email protected]>
+';' 64-6F linux/vfio.h
'@' 00-0F linux/radeonfb.h conflict!
'@' 00-0F drivers/video/aty/aty128fb.c conflict!
'A' 00-1F linux/apm_bios.h conflict!
diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index e69de29..4f4740e 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -0,0 +1,183 @@
+-------------------------------------------------------------------------------
+The VFIO "driver" is used to allow privileged AND non-privileged processes to
+implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
+devices.
+
+Why is this interesting? Some applications, especially in the high performance
+computing field, need access to hardware functions with as little overhead as
+possible. Examples are in network adapters (typically non TCP/IP based) and
+in compute accelerators - i.e., array processors, FPGA processors, etc.
+Previous to the VFIO drivers these apps would need either a kernel-level
+driver (with corresponding overheads), or else root permissions to directly
+access the hardware. The VFIO driver allows generic access to the hardware
+from non-privileged apps IF the hardware is "well-behaved" enough for this
+to be safe.
+
+While there have long been ways to implement user-level drivers using specific
+corresponding drivers in the kernel, it was not until the introduction of the
+UIO driver framework, and the uio_pci_generic driver that one could have a
+generic kernel component supporting many types of user level drivers. However,
+even with the uio_pci_generic driver, processes implementing the user level
+drivers had to be trusted - they could do dangerous manipulation of DMA
+addreses and were required to be root to write PCI configuration space
+registers.
+
+Recent hardware technologies - I/O MMUs and PCI I/O Virtualization - provide
+new hardware capabilities which the VFIO solution exploits to allow non-root
+user level drivers. The main role of the IOMMU is to ensure that DMA accesses
+from devices go only to the appropriate memory locations; this allows VFIO to
+ensure that user level drivers do not corrupt inappropriate memory. PCI I/O
+virtualization (SR-IOV) was defined to allow "pass-through" of virtual devices
+to guest virtual machines. VFIO in essence implements pass-through of devices
+to user processes, not virtual machines. SR-IOV devices implement a
+traditional PCI device (the physical function) and a dynamic number of special
+PCI devices (virtual functions) whose feature set is somewhat restricted - in
+order to allow the operating system or virtual machine monitor to ensure the
+safe operation of the system.
+
+Any SR-IOV virtual function meets the VFIO definition of "well-behaved", but
+there are many other non-IOV PCI devices which also meet the defintion.
+Elements of this definition are:
+- The size of any memory BARs to be mmap'ed into the user process space must be
+ a multiple of the system page size.
+- If MSI-X interrupts are used, the device driver must not attempt to mmap or
+ write the MSI-X vector area.
+- If the device is a PCI device (not PCI-X or PCIe), it must conform to PCI
+ revision 2.3 to allow its interrupts to be masked in a generic way.
+- The device must not use the PCI configuration space in any non-standard way,
+ i.e., the user level driver will be permitted only to read and write standard
+ fields of the PCI config space, and only if those fields cannot cause harm to
+ the system. In addition, some fields are "virtualized", so that the user
+ driver can read/write them like a kernel driver, but they do not affect the
+ real device.
+- For now, there is no support for user access to the PCIe and PCI-X extended
+ capabilities configuration space.
+
+Only a very few platforms today (Intel X7500 is one) fully support both DMA
+remapping and interrupt remapping in the IOMMU. Everyone has DMA remapping
+but interrupt remapping is missing in some Intel hardware and software, and
+it is missing in the AMD IOMMU software. Interrupt remapping is needed to
+protect a user level driver from triggering interrupts for other devices in
+the system. Until interrupt remapping is in more platforms we allow the
+admin to load the module with allow_unsafe_intrs=1 which will make this driver useful (but not safe) on those platforms.
+
+When the vfio module is loaded, it will have access to no devices until the
+desired PCI devices are "bound" to the driver. First, make sure the devices
+are not bound to another kernel driver. You can unload that driver if you wish
+to unbind all its devices, or else enter the driver's sysfs directory, and
+unbind a specific device:
+ cd /sys/bus/pci/drivers/<drivername>
+ echo 0000:06:02.00 > unbind
+(The 0000:06:02.00 is a fully qualified PCI device name - different for each
+device). Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and
+write the PCI device type of the target device to the new_id file:
+ echo 8086 10ca > new_id
+(8086 10ca are the vendor and device type for the Intel 82576 virtual function
+devices). A /dev/vfio<N> entry will be created for each device bound. The final
+step is to grant users permission by changing the mode and/or owner of the /dev
+entry - "chmod 666 /dev/vfio0".
+
+Reads & Writes:
+
+The user driver will typically use mmap to access the memory BAR(s) of a
+device; the I/O BARs and the PCI config space may be accessed through normal
+read and write system calls. Only 1 file descriptor is needed for all driver
+functions -- the desired BAR for I/O, memory, or config space is indicated via
+high-order bits of the file offset. For instance, the following implements a
+write to the PCI config space:
+
+ #include <linux/vfio.h>
+ void pci_write_config_word(int pci_fd, u16 off, u16 wd)
+ {
+ off_t cfg_off = VFIO_PCI_CONFIG_OFF + off;
+
+ if (pwrite(pci_fd, &wd, 2, cfg_off) != 2)
+ perror("pwrite config_dword");
+ }
+
+The routines vfio_pci_space_to_offset and vfio_offset_to_pci_space are provided
+in vfio.h to convert BAR numbers to file offsets and vice-versa.
+
+Interrupts:
+
+Device interrupts are translated by the vfio driver into input events on event
+notification file descriptors created by the eventfd system call. The user
+program must create one or more event descriptors and pass them to the vfio
+driver via ioctls to arrange for the interrupt mapping:
+1.
+ efd = eventfd(0, 0);
+ ioctl(vfio_fd, VFIO_EVENTFD_IRQ, &efd);
+ This provides an eventfd for traditional IRQ interrupts.
+ IRQs will be disabled after each interrupt until the driver
+ re-enables them via the PCI COMMAND register.
+2.
+ efd = eventfd(0, 0);
+ ioctl(vfio_fd, VFIO_EVENTFD_MSI, &efd);
+ This connects MSI interrupts to an eventfd.
+3.
+ int arg[N+1];
+ arg[0] = N;
+ arg[1..N] = eventfd(0, 0);
+ ioctl(vfio_fd, VFIO_EVENTFDS_MSIX, arg);
+ This connects N MSI-X interrupts with N eventfds.
+
+Waiting and checking for interrupts is done by the user program by reads,
+polls, or selects on the related event file descriptors.
+
+DMA:
+
+The VFIO driver uses ioctls to allow the user level driver to get DMA
+addresses which correspond to virtual addresses. In systems with IOMMUs,
+each PCI device will have its own address space for DMA operations, so when
+the user level driver programs the device registers, only addresses known to
+the IOMMU will be valid, any others will be rejected. The IOMMU creates the
+illusion (to the device) that multi-page buffers are physically contiguous,
+so a single DMA operation can safely span multiple user pages.
+
+If the user process desires many DMA buffers, it may be wise to do a mapping
+of a single large buffer, and then allocate the smaller buffers from the
+large one.
+
+The DMA buffers are locked into physical memory for the duration of their
+existence - until VFIO_DMA_UNMAP is called, until the user pages are
+unmapped from the user process, or until the vfio file descriptor is closed.
+The user process must have permission to lock the pages given by the ulimit(-l)
+command, which in turn relies on settings in the /etc/security/limits.conf
+file.
+
+The vfio_dma_map structure is used as an argument to the ioctls which
+do the DMA mapping. Its vaddr, dmaaddr, and size fields must always be a
+multiple of a page. Its rdwr field is zero for read-only (outbound), and
+non-zero for read/write buffers.
+
+ struct vfio_dma_map {
+ __u64 vaddr; /* process virtual addr */
+ __u64 dmaaddr; /* desired and/or returned dma address */
+ __u64 size; /* size in bytes */
+ int rdwr; /* bool: 0 for r/o; 1 for r/w */
+ };
+
+The VFIO_DMA_MAP_IOVA is called with a vfio_dma_map structure with the
+dmaaddr field already assigned. The system will attempt to map the DMA
+buffer into the IO space at the given dmaaddr. This is expected to be
+useful if KVM or other virtualization facilities use this driver.
+Use of VFIO_DMA_MAP_IOVA requires an explicit assignment of the device
+to an IOMMU domain. A file descriptor for an empty IOMMU domain is
+acquired by opening /dev/uiommu. The device is then attached to the
+domain by issuing a VFIO_DOMAIN_SET ioctl with the domain fd address as
+the argument. The device may be detached from the domain with the
+VFIO_DOMAIN_UNSET ioctl (no argument). It is expected that hypervisors
+may wish to attach many devices to the same domain.
+
+The VFIO_DMA_UNMAP takes a fully filled vfio_dma_map structure and unmaps
+the buffer and releases the corresponding system resources.
+
+The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA address
+(device dependent). It takes a single unsigned 64 bit integer as an argument.
+This call also has the side effect of enabling PCI bus mastership.
+
+Miscellaneous:
+
+The VFIO_BAR_LEN ioctl provides an easy way to determine the size of a PCI
+device's base address region. It is passed a single integer specifying which
+BAR (0-5 or 6 for ROM bar), and passes back the length in the same field.
diff --git a/MAINTAINERS b/MAINTAINERS
index d329b05..33eab03 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5968,6 +5968,14 @@ S: Maintained
F: Documentation/fb/uvesafb.txt
F: drivers/video/uvesafb.*

+VFIO DRIVER
+M: Tom Lyon <[email protected]>
+L: [email protected]
+S: Supported
+F: Documentation/vfio.txt
+F: drivers/vfio/
+F: include/linux/vfio.h
+
VFAT/FAT/MSDOS FILESYSTEM
M: OGAWA Hirofumi <[email protected]>
S: Maintained
diff --git a/drivers/Kconfig b/drivers/Kconfig
index a2b902f..711c1cb 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -111,4 +111,6 @@ source "drivers/xen/Kconfig"
source "drivers/staging/Kconfig"

source "drivers/platform/Kconfig"
+
+source "drivers/vfio/Kconfig"
endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index f42a030..3d595e8 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_FUSION) += message/
obj-$(CONFIG_FIREWIRE) += firewire/
obj-y += ieee1394/
obj-$(CONFIG_UIO) += uio/
+obj-$(CONFIG_VFIO) += vfio/
obj-y += cdrom/
obj-y += auxdisplay/
obj-$(CONFIG_PCCARD) += pcmcia/
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index e69de29..2bbc1da 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -0,0 +1,18 @@
+menuconfig VFIO
+ tristate "Non-Privileged User Space PCI drivers"
+ depends on UIOMMU && PCI && IOMMU_API
+ help
+ Driver to allow advanced user space drivers for PCI, PCI-X,
+ and PCIe devices. Requires IOMMU to allow non-privileged
+ processes to directly control the PCI devices.
+
+ If you don't know what to do here, say N.
+
+menuconfig UIOMMU
+ tristate "User level manipulation of IOMMU"
+ help
+ Device driver to allow user level programs to
+ manipulate IOMMU domains.
+
+ If you don't know what to do here, say N.
+
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index e69de29..4d2d8b7 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_VFIO) := vfio.o
+obj-$(CONFIG_UIOMMU) += uiommu.o
+
+vfio-y := vfio_main.o vfio_dma.o vfio_intrs.o \
+ vfio_pci_config.o vfio_rdwr.o vfio_sysfs.o
+
diff --git a/drivers/vfio/uiommu.c b/drivers/vfio/uiommu.c
index e69de29..eec1759 100644
--- a/drivers/vfio/uiommu.c
+++ b/drivers/vfio/uiommu.c
@@ -0,0 +1,126 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+*/
+
+/*
+ * uiommu driver - issue fd handles for IOMMU domains
+ * so they may be passed to vfio (and others?)
+ */
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/iommu.h>
+#include <linux/uiommu.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Tom Lyon <[email protected]>");
+MODULE_DESCRIPTION("User IOMMU driver");
+
+static struct uiommu_domain *uiommu_domain_alloc(void)
+{
+ struct iommu_domain *domain;
+ struct uiommu_domain *udomain;
+
+ domain = iommu_domain_alloc();
+ if (domain == NULL)
+ return NULL;
+ udomain = kzalloc(sizeof *udomain, GFP_KERNEL);
+ if (udomain == NULL) {
+ iommu_domain_free(domain);
+ return NULL;
+ }
+ udomain->domain = domain;
+ atomic_inc(&udomain->refcnt);
+ return udomain;
+}
+
+static int uiommu_open(struct inode *inode, struct file *file)
+{
+ struct uiommu_domain *udomain;
+
+ udomain = uiommu_domain_alloc();
+ if (udomain == NULL)
+ return -ENOMEM;
+ file->private_data = udomain;
+ return 0;
+}
+
+static int uiommu_release(struct inode *inode, struct file *file)
+{
+ struct uiommu_domain *udomain;
+
+ udomain = file->private_data;
+ uiommu_put(udomain);
+ return 0;
+}
+
+static const struct file_operations uiommu_fops = {
+ .owner = THIS_MODULE,
+ .open = uiommu_open,
+ .release = uiommu_release,
+};
+
+static struct miscdevice uiommu_dev = {
+ .name = "uiommu",
+ .minor = MISC_DYNAMIC_MINOR,
+ .fops = &uiommu_fops,
+};
+
+struct uiommu_domain *uiommu_fdget(int fd)
+{
+ struct file *file;
+ struct uiommu_domain *udomain;
+
+ file = fget(fd);
+ if (!file)
+ return ERR_PTR(-EBADF);
+ if (file->f_op != &uiommu_fops) {
+ fput(file);
+ return ERR_PTR(-EINVAL);
+ }
+ udomain = file->private_data;
+ atomic_inc(&udomain->refcnt);
+ return udomain;
+}
+EXPORT_SYMBOL_GPL(uiommu_fdget);
+
+void uiommu_put(struct uiommu_domain *udomain)
+{
+ if (atomic_dec_and_test(&udomain->refcnt)) {
+ iommu_domain_free(udomain->domain);
+ kfree(udomain);
+ }
+}
+EXPORT_SYMBOL_GPL(uiommu_put);
+
+static int __init uiommu_init(void)
+{
+ return misc_register(&uiommu_dev);
+}
+
+static void __exit uiommu_exit(void)
+{
+ misc_deregister(&uiommu_dev);
+}
+
+module_init(uiommu_init);
+module_exit(uiommu_exit);
diff --git a/drivers/vfio/vfio_dma.c b/drivers/vfio/vfio_dma.c
index e69de29..ef8d007 100644
--- a/drivers/vfio/vfio_dma.c
+++ b/drivers/vfio/vfio_dma.c
@@ -0,0 +1,342 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <[email protected]>
+ * Copyright(C) 2005, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2006, Hans J. Koch <[email protected]>
+ * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <[email protected]>
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/pci.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/iommu.h>
+#include <linux/uiommu.h>
+#include <linux/sched.h>
+#include <linux/vfio.h>
+
+/* Unmap DMA region */
+/* dgate must be held */
+static void vfio_dma_unmap(struct vfio_listener *listener,
+ struct dma_map_page *mlp)
+{
+ int i;
+ struct vfio_dev *vdev = listener->vdev;
+
+ list_del(&mlp->list);
+ for (i = 0; i < mlp->npage; i++)
+ (void) uiommu_unmap_range(vdev->udomain,
+ mlp->daddr + i*PAGE_SIZE, PAGE_SIZE);
+ for (i = 0; i < mlp->npage; i++) {
+ if (mlp->rdwr)
+ SetPageDirty(mlp->pages[i]);
+ put_page(mlp->pages[i]);
+ }
+ vdev->mapcount--;
+ listener->mm->locked_vm -= mlp->npage;
+ vdev->locked_pages -= mlp->npage;
+ vfree(mlp->pages);
+ kfree(mlp);
+}
+
+/* Unmap ALL DMA regions */
+void vfio_dma_unmapall(struct vfio_listener *listener)
+{
+ struct list_head *pos, *pos2;
+ struct dma_map_page *mlp;
+
+ mutex_lock(&listener->vdev->dgate);
+ list_for_each_safe(pos, pos2, &listener->dm_list) {
+ mlp = list_entry(pos, struct dma_map_page, list);
+ vfio_dma_unmap(listener, mlp);
+ }
+ mutex_unlock(&listener->vdev->dgate);
+}
+
+int vfio_dma_unmap_dm(struct vfio_listener *listener, struct vfio_dma_map *dmp)
+{
+ unsigned long start, npage;
+ struct dma_map_page *mlp;
+ struct list_head *pos, *pos2;
+ int ret;
+
+ start = dmp->vaddr & ~PAGE_SIZE;
+ npage = dmp->size >> PAGE_SHIFT;
+
+ ret = -ENXIO;
+ mutex_lock(&listener->vdev->dgate);
+ list_for_each_safe(pos, pos2, &listener->dm_list) {
+ mlp = list_entry(pos, struct dma_map_page, list);
+ if (dmp->vaddr != mlp->vaddr || mlp->npage != npage)
+ continue;
+ ret = 0;
+ vfio_dma_unmap(listener, mlp);
+ break;
+ }
+ mutex_unlock(&listener->vdev->dgate);
+ return ret;
+}
+
+#ifdef CONFIG_MMU_NOTIFIER
+/* Handle MMU notifications - user process freed or realloced memory
+ * which may be in use in a DMA region. Clean up region if so.
+ */
+static void vfio_dma_handle_mmu_notify(struct mmu_notifier *mn,
+ unsigned long start, unsigned long end)
+{
+ struct vfio_listener *listener;
+ unsigned long myend;
+ struct list_head *pos, *pos2;
+ struct dma_map_page *mlp;
+
+ listener = container_of(mn, struct vfio_listener, mmu_notifier);
+ mutex_lock(&listener->vdev->dgate);
+ list_for_each_safe(pos, pos2, &listener->dm_list) {
+ mlp = list_entry(pos, struct dma_map_page, list);
+ if (mlp->vaddr >= end)
+ continue;
+ /*
+ * Ranges overlap if they're not disjoint; and they're
+ * disjoint if the end of one is before the start of
+ * the other one.
+ */
+ myend = mlp->vaddr + (mlp->npage << PAGE_SHIFT) - 1;
+ if (!(myend <= start || end <= mlp->vaddr)) {
+ printk(KERN_WARNING
+ "%s: demap start %lx end %lx va %lx pa %lx\n",
+ __func__, start, end,
+ mlp->vaddr, (long)mlp->daddr);
+ vfio_dma_unmap(listener, mlp);
+ }
+ }
+ mutex_unlock(&listener->vdev->dgate);
+}
+
+static void vfio_dma_inval_page(struct mmu_notifier *mn,
+ struct mm_struct *mm, unsigned long addr)
+{
+ vfio_dma_handle_mmu_notify(mn, addr, addr + PAGE_SIZE);
+}
+
+static void vfio_dma_inval_range_start(struct mmu_notifier *mn,
+ struct mm_struct *mm, unsigned long start, unsigned long end)
+{
+ vfio_dma_handle_mmu_notify(mn, start, end);
+}
+
+static const struct mmu_notifier_ops vfio_dma_mmu_notifier_ops = {
+ .invalidate_page = vfio_dma_inval_page,
+ .invalidate_range_start = vfio_dma_inval_range_start,
+};
+#endif /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Map usr buffer at specific IO virtual address
+ */
+static struct dma_map_page *vfio_dma_map_iova(
+ struct vfio_listener *listener,
+ unsigned long start_iova,
+ struct page **pages,
+ int npage,
+ int rdwr)
+{
+ struct vfio_dev *vdev = listener->vdev;
+ int ret;
+ int i;
+ phys_addr_t hpa;
+ struct dma_map_page *mlp;
+ unsigned long iova = start_iova;
+
+ if (vdev->udomain == NULL)
+ return ERR_PTR(-EINVAL);
+
+ for (i = 0; i < npage; i++) {
+ if (uiommu_iova_to_phys(vdev->udomain, iova + i*PAGE_SIZE))
+ return ERR_PTR(-EBUSY);
+ }
+
+ mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
+ if (mlp == NULL)
+ return ERR_PTR(-ENOMEM);
+ rdwr = rdwr ? IOMMU_READ|IOMMU_WRITE : IOMMU_READ;
+ if (vdev->cachec)
+ rdwr |= IOMMU_CACHE;
+ for (i = 0; i < npage; i++) {
+ hpa = page_to_phys(pages[i]);
+ ret = uiommu_map_range(vdev->udomain, iova,
+ hpa, PAGE_SIZE, rdwr);
+ if (ret) {
+ while (--i > 0) {
+ iova -= PAGE_SIZE;
+ (void) uiommu_unmap_range(vdev->udomain,
+ iova, PAGE_SIZE);
+ }
+ kfree(mlp);
+ return ERR_PTR(ret);
+ }
+ iova += PAGE_SIZE;
+ }
+ vdev->mapcount++;
+
+ mlp->pages = pages;
+ mlp->daddr = start_iova;
+ mlp->npage = npage;
+ return mlp;
+}
+
+int vfio_dma_map_common(struct vfio_listener *listener,
+ unsigned int cmd, struct vfio_dma_map *dmp)
+{
+ int locked, lock_limit;
+ struct page **pages;
+ int npage;
+ struct dma_map_page *mlp;
+ int rdwr = (dmp->flags & VFIO_FLAG_WRITE) ? 1 : 0;
+ int ret = 0;
+
+ if (dmp->vaddr & (PAGE_SIZE-1))
+ return -EINVAL;
+ if (dmp->size & (PAGE_SIZE-1))
+ return -EINVAL;
+ if (dmp->size <= 0)
+ return -EINVAL;
+ npage = dmp->size >> PAGE_SHIFT;
+ if (npage <= 0)
+ return -EINVAL;
+
+ mutex_lock(&listener->vdev->dgate);
+
+ /* account for locked pages */
+ locked = npage + current->mm->locked_vm;
+ lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
+ >> PAGE_SHIFT;
+ if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+ printk(KERN_WARNING "%s: RLIMIT_MEMLOCK exceeded\n",
+ __func__);
+ ret = -ENOMEM;
+ goto out_lock;
+ }
+ /* only 1 address space per fd */
+ if (current->mm != listener->mm) {
+ if (listener->mm != NULL) {
+ ret = -EINVAL;
+ goto out_lock;
+ }
+ listener->mm = current->mm;
+#ifdef CONFIG_MMU_NOTIFIER
+ listener->mmu_notifier.ops = &vfio_dma_mmu_notifier_ops;
+ ret = mmu_notifier_register(&listener->mmu_notifier,
+ listener->mm);
+ if (ret)
+ printk(KERN_ERR "%s: mmu_notifier_register failed %d\n",
+ __func__, ret);
+ ret = 0;
+#endif
+ }
+
+ pages = vmalloc(npage * sizeof(struct page *));
+ if (pages == NULL) {
+ ret = ENOMEM;
+ goto out_lock;
+ }
+ ret = get_user_pages_fast(dmp->vaddr, npage, rdwr, pages);
+ if (ret != npage) {
+ printk(KERN_ERR "%s: get_user_pages_fast returns %d, not %d\n",
+ __func__, ret, npage);
+ kfree(pages);
+ ret = -EFAULT;
+ goto out_lock;
+ }
+ ret = 0;
+
+ mlp = vfio_dma_map_iova(listener, dmp->dmaaddr,
+ pages, npage, rdwr);
+ if (IS_ERR(mlp)) {
+ ret = PTR_ERR(mlp);
+ vfree(pages);
+ goto out_lock;
+ }
+ mlp->vaddr = dmp->vaddr;
+ mlp->rdwr = rdwr;
+ dmp->dmaaddr = mlp->daddr;
+ list_add(&mlp->list, &listener->dm_list);
+
+ current->mm->locked_vm += npage;
+ listener->vdev->locked_pages += npage;
+out_lock:
+ mutex_unlock(&listener->vdev->dgate);
+ return ret;
+}
+
+int vfio_domain_unset(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+
+ if (vdev->udomain == NULL)
+ return 0;
+ if (vdev->mapcount)
+ return -EBUSY;
+ uiommu_detach_device(vdev->udomain, &pdev->dev);
+ uiommu_put(vdev->udomain);
+ vdev->udomain = NULL;
+ return 0;
+}
+
+int vfio_domain_set(struct vfio_dev *vdev, int fd, int unsafe_ok)
+{
+ struct uiommu_domain *udomain;
+ struct pci_dev *pdev = vdev->pdev;
+ int ret;
+ int safe;
+
+ if (vdev->udomain)
+ return -EBUSY;
+ udomain = uiommu_fdget(fd);
+ if (IS_ERR(udomain))
+ return PTR_ERR(udomain);
+
+ safe = 0;
+#ifdef IOMMU_CAP_INTR_REMAP
+ /* iommu domain must also isolate dev interrupts */
+ if (uiommu_domain_has_cap(udomain, IOMMU_CAP_INTR_REMAP))
+ safe = 1;
+#endif
+ if (!safe && !unsafe_ok) {
+ printk(KERN_WARNING "%s: no interrupt remapping!\n", __func__);
+ return -EINVAL;
+ }
+
+ vfio_domain_unset(vdev);
+ ret = uiommu_attach_device(udomain, &pdev->dev);
+ if (ret) {
+ printk(KERN_ERR "%s: attach_device failed %d\n",
+ __func__, ret);
+ uiommu_put(udomain);
+ return ret;
+ }
+ vdev->cachec = iommu_domain_has_cap(udomain->domain,
+ IOMMU_CAP_CACHE_COHERENCY);
+ vdev->udomain = udomain;
+ return 0;
+}
diff --git a/drivers/vfio/vfio_intrs.c b/drivers/vfio/vfio_intrs.c
index e69de29..74383e7 100644
--- a/drivers/vfio/vfio_intrs.c
+++ b/drivers/vfio/vfio_intrs.c
@@ -0,0 +1,191 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <[email protected]>
+ * Copyright(C) 2005, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2006, Hans J. Koch <[email protected]>
+ * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <[email protected]>
+ */
+
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/eventfd.h>
+#include <linux/pci.h>
+#include <linux/mmu_notifier.h>
+
+#include <linux/vfio.h>
+
+
+/*
+ * vfio_interrupt - IRQ hardware interrupt handler
+ */
+irqreturn_t vfio_interrupt(int irq, void *dev_id)
+{
+ struct vfio_dev *vdev = (struct vfio_dev *)dev_id;
+ struct pci_dev *pdev = vdev->pdev;
+ irqreturn_t ret = IRQ_NONE;
+ u32 cmd_status_dword;
+ u16 origcmd, newcmd, status;
+
+ spin_lock_irq(&vdev->irqlock);
+
+ /* Read both command and status registers in a single 32-bit operation.
+ * Note: we could cache the value for command and move the status read
+ * out of the lock if there was a way to get notified of user changes
+ * to command register through sysfs. Should be good for shared irqs. */
+ pci_read_config_dword(pdev, PCI_COMMAND, &cmd_status_dword);
+ origcmd = cmd_status_dword;
+ status = cmd_status_dword >> 16;
+
+ /* Check interrupt status register to see whether our device
+ * triggered the interrupt. */
+ if (!(status & PCI_STATUS_INTERRUPT))
+ goto done;
+
+ /* We triggered the interrupt, disable it. */
+ newcmd = origcmd | PCI_COMMAND_INTX_DISABLE;
+ if (newcmd != origcmd)
+ pci_write_config_word(pdev, PCI_COMMAND, newcmd);
+
+ ret = IRQ_HANDLED;
+done:
+ spin_unlock_irq(&vdev->irqlock);
+ if (ret != IRQ_HANDLED)
+ return ret;
+ if (vdev->ev_irq)
+ eventfd_signal(vdev->ev_irq, 1);
+ return ret;
+}
+
+/*
+ * MSI and MSI-X Interrupt handler.
+ * Just signal an event
+ */
+static irqreturn_t msihandler(int irq, void *arg)
+{
+ struct eventfd_ctx *ctx = arg;
+
+ eventfd_signal(ctx, 1);
+ return IRQ_HANDLED;
+}
+
+void vfio_disable_msi(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+
+ if (vdev->ev_msi) {
+ eventfd_ctx_put(vdev->ev_msi);
+ free_irq(pdev->irq, vdev->ev_msi);
+ vdev->ev_msi = NULL;
+ }
+ pci_disable_msi(pdev);
+}
+
+int vfio_enable_msi(struct vfio_dev *vdev, int fd)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ struct eventfd_ctx *ctx;
+ int ret;
+
+ ctx = eventfd_ctx_fdget(fd);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+ vdev->ev_msi = ctx;
+ pci_enable_msi(pdev);
+ ret = request_irq(pdev->irq, msihandler, 0,
+ vdev->name, ctx);
+ if (ret) {
+ eventfd_ctx_put(ctx);
+ pci_disable_msi(pdev);
+ vdev->ev_msi = NULL;
+ }
+ return ret;
+}
+
+void vfio_disable_msix(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int i;
+
+ if (vdev->ev_msix && vdev->msix) {
+ for (i = 0; i < vdev->nvec; i++) {
+ free_irq(vdev->msix[i].vector, vdev->ev_msix[i]);
+ if (vdev->ev_msix[i])
+ eventfd_ctx_put(vdev->ev_msix[i]);
+ }
+ }
+ kfree(vdev->ev_msix);
+ vdev->ev_msix = NULL;
+ kfree(vdev->msix);
+ vdev->msix = NULL;
+ vdev->nvec = 0;
+ pci_disable_msix(pdev);
+}
+
+int vfio_enable_msix(struct vfio_dev *vdev, int nvec, void __user *uarg)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ struct eventfd_ctx *ctx;
+ int ret = 0;
+ int i;
+ int fd;
+
+ if (nvec < 1 || nvec > 1024)
+ return -EINVAL;
+ vdev->msix = kzalloc(nvec * sizeof(struct msix_entry),
+ GFP_KERNEL);
+ if (vdev->msix == NULL)
+ return -ENOMEM;
+ vdev->ev_msix = kzalloc(nvec * sizeof(struct eventfd_ctx *),
+ GFP_KERNEL);
+ if (vdev->ev_msix == NULL) {
+ kfree(vdev->msix);
+ return -ENOMEM;
+ }
+ for (i = 0; i < nvec; i++) {
+ if (copy_from_user(&fd, uarg, sizeof fd)) {
+ ret = -EFAULT;
+ break;
+ }
+ uarg += sizeof fd;
+ ctx = eventfd_ctx_fdget(fd);
+ if (IS_ERR(ctx)) {
+ ret = PTR_ERR(ctx);
+ break;
+ }
+ vdev->msix[i].entry = i;
+ vdev->ev_msix[i] = ctx;
+ }
+ if (!ret)
+ ret = pci_enable_msix(pdev, vdev->msix, nvec);
+ vdev->nvec = 0;
+ for (i = 0; i < nvec && !ret; i++) {
+ ret = request_irq(vdev->msix[i].vector, msihandler, 0,
+ vdev->name, vdev->ev_msix[i]);
+ if (ret)
+ break;
+ vdev->nvec = i+1;
+ }
+ if (ret)
+ vfio_disable_msix(vdev);
+ return ret;
+}
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index e69de29..716286e 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -0,0 +1,642 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <[email protected]>
+ * Copyright(C) 2005, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2006, Hans J. Koch <[email protected]>
+ * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <[email protected]>
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/mm.h>
+#include <linux/idr.h>
+#include <linux/string.h>
+#include <linux/interrupt.h>
+#include <linux/fs.h>
+#include <linux/eventfd.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
+#include <linux/mmu_notifier.h>
+#include <linux/uaccess.h>
+
+#include <linux/vfio.h>
+
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR "Tom Lyon <[email protected]>"
+#define DRIVER_DESC "VFIO - User Level PCI meta-driver"
+
+/*
+ * Only a very few platforms today (Intel X7500) fully support
+ * both DMA remapping and interrupt remapping in the IOMMU.
+ * Everyone has DMA remapping but interrupt remapping is missing
+ * in some Intel hardware and software, and its missing in the AMD
+ * IOMMU software. Interrupt remapping is needed to really protect the
+ * system from user level driver mischief. Until it is in more platforms
+ * we allow the admin to load the module with allow_unsafe_intrs=1
+ * which will make this driver useful (but not safe)
+ * on those platforms.
+ */
+static int allow_unsafe_intrs;
+module_param(allow_unsafe_intrs, int, 0);
+
+static int vfio_major = -1;
+static DEFINE_IDR(vfio_idr);
+/* Protect idr accesses */
+static DEFINE_MUTEX(vfio_minor_lock);
+
+/*
+ * Does [a1,b1) overlap [a2,b2) ?
+ */
+static inline int overlap(int a1, int b1, int a2, int b2)
+{
+ /*
+ * Ranges overlap if they're not disjoint; and they're
+ * disjoint if the end of one is before the start of
+ * the other one.
+ */
+ return !(b2 <= a1 || b1 <= a2);
+}
+
+static int vfio_open(struct inode *inode, struct file *filep)
+{
+ struct vfio_dev *vdev;
+ struct vfio_listener *listener;
+ int ret = 0;
+
+ mutex_lock(&vfio_minor_lock);
+ vdev = idr_find(&vfio_idr, iminor(inode));
+ mutex_unlock(&vfio_minor_lock);
+ if (!vdev) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ listener = kzalloc(sizeof(*listener), GFP_KERNEL);
+ if (!listener) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ mutex_lock(&vdev->lgate);
+ listener->vdev = vdev;
+ INIT_LIST_HEAD(&listener->dm_list);
+ filep->private_data = listener;
+ vdev->listeners++;
+ mutex_unlock(&vdev->lgate);
+ return 0;
+
+out:
+ return ret;
+}
+
+static int vfio_release(struct inode *inode, struct file *filep)
+{
+ int ret = 0;
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+
+ vfio_dma_unmapall(listener);
+ if (listener->mm) {
+#ifdef CONFIG_MMU_NOTIFIER
+ mmu_notifier_unregister(&listener->mmu_notifier, listener->mm);
+#endif
+ listener->mm = NULL;
+ }
+
+ mutex_lock(&vdev->lgate);
+ if (--vdev->listeners <= 0) {
+ /* we don't need to hold igate here since there are
+ * no more listeners doing ioctls
+ */
+ if (vdev->ev_msix)
+ vfio_disable_msix(vdev);
+ if (vdev->ev_msi)
+ vfio_disable_msi(vdev);
+ if (vdev->ev_irq) {
+ eventfd_ctx_put(vdev->ev_irq);
+ vdev->ev_irq = NULL;
+ }
+ kfree(vdev->pci_config_map);
+ vdev->pci_config_map = NULL;
+ pci_clear_master(vdev->pdev);
+ vfio_domain_unset(vdev);
+ wake_up(&vdev->dev_idle_q);
+ }
+ mutex_unlock(&vdev->lgate);
+
+ kfree(listener);
+ return ret;
+}
+
+static ssize_t vfio_read(struct file *filep, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+ struct pci_dev *pdev = vdev->pdev;
+ int pci_space;
+
+ pci_space = vfio_offset_to_pci_space(*ppos);
+
+ /* config reads are OK before iommu domain set */
+ if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
+ return vfio_config_readwrite(0, vdev, buf, count, ppos);
+
+ /* no other reads until IOMMU domain set */
+ if (vdev->udomain == NULL)
+ return -EINVAL;
+ if (pci_space > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
+ return vfio_io_readwrite(0, vdev, buf, count, ppos);
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM)
+ return vfio_mem_readwrite(0, vdev, buf, count, ppos);
+ if (pci_space == PCI_ROM_RESOURCE)
+ return vfio_mem_readwrite(0, vdev, buf, count, ppos);
+ return -EINVAL;
+}
+
+static int vfio_msix_check(struct vfio_dev *vdev, u64 start, u32 len)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u16 pos;
+ u32 table_offset;
+ u16 table_size;
+ u8 bir;
+ u32 lo, hi, startp, endp;
+
+ pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
+ if (!pos)
+ return 0;
+
+ pci_read_config_word(pdev, pos + PCI_MSIX_FLAGS, &table_size);
+ table_size = (table_size & PCI_MSIX_FLAGS_QSIZE) + 1;
+ pci_read_config_dword(pdev, pos + 4, &table_offset);
+ bir = table_offset & PCI_MSIX_FLAGS_BIRMASK;
+ lo = table_offset >> PAGE_SHIFT;
+ hi = (table_offset + PCI_MSIX_ENTRY_SIZE * table_size + PAGE_SIZE - 1)
+ >> PAGE_SHIFT;
+ startp = start >> PAGE_SHIFT;
+ endp = (start + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (bir == vfio_offset_to_pci_space(start) &&
+ overlap(lo, hi, startp, endp)) {
+ printk(KERN_WARNING "%s: cannot write msi-x vectors\n",
+ __func__);
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static ssize_t vfio_write(struct file *filep, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+ struct pci_dev *pdev = vdev->pdev;
+ int pci_space;
+ int ret;
+
+ /* no writes until IOMMU domain set */
+ if (vdev->udomain == NULL)
+ return -EINVAL;
+ pci_space = vfio_offset_to_pci_space(*ppos);
+ if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
+ return vfio_config_readwrite(1, vdev,
+ (char __user *)buf, count, ppos);
+ if (pci_space > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
+ return vfio_io_readwrite(1, vdev,
+ (char __user *)buf, count, ppos);
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) {
+ if (allow_unsafe_intrs) {
+ /* don't allow writes to msi-x vectors */
+ ret = vfio_msix_check(vdev, *ppos, count);
+ if (ret)
+ return ret;
+ }
+ return vfio_mem_readwrite(1, vdev,
+ (char __user *)buf, count, ppos);
+ }
+ return -EINVAL;
+}
+
+static int vfio_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+ struct pci_dev *pdev = vdev->pdev;
+ unsigned long requested, actual;
+ int pci_space;
+ u64 start;
+ u32 len;
+ unsigned long phys;
+ int ret;
+
+ /* no reads or writes until IOMMU domain set */
+ if (vdev->udomain == NULL)
+ return -EINVAL;
+
+ if (vma->vm_end < vma->vm_start)
+ return -EINVAL;
+ if ((vma->vm_flags & VM_SHARED) == 0)
+ return -EINVAL;
+
+
+ pci_space = vfio_offset_to_pci_space((u64)vma->vm_pgoff << PAGE_SHIFT);
+ if (pci_space > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ switch (pci_space) {
+ case PCI_ROM_RESOURCE:
+ if (vma->vm_flags & VM_WRITE)
+ return -EINVAL;
+ if (pci_resource_flags(pdev, PCI_ROM_RESOURCE) == 0)
+ return -EINVAL;
+ actual = pci_resource_len(pdev, PCI_ROM_RESOURCE) >> PAGE_SHIFT;
+ break;
+ default:
+ if ((pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) == 0)
+ return -EINVAL;
+ actual = pci_resource_len(pdev, pci_space) >> PAGE_SHIFT;
+ break;
+ }
+
+ requested = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ if (requested > actual || actual == 0)
+ return -EINVAL;
+
+ start = vma->vm_pgoff << PAGE_SHIFT;
+ len = vma->vm_end - vma->vm_start;
+ if (allow_unsafe_intrs && (vma->vm_flags & VM_WRITE)) {
+ /*
+ * Deter users from screwing up MSI-X intrs
+ */
+ ret = vfio_msix_check(vdev, start, len);
+ if (ret)
+ return ret;
+ }
+
+ vma->vm_private_data = vdev;
+ vma->vm_flags |= VM_IO | VM_RESERVED;
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ phys = pci_resource_start(pdev, pci_space) >> PAGE_SHIFT;
+
+ return remap_pfn_range(vma, vma->vm_start, phys,
+ vma->vm_end - vma->vm_start,
+ vma->vm_page_prot);
+}
+
+static long vfio_unl_ioctl(struct file *filep,
+ unsigned int cmd,
+ unsigned long arg)
+{
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+ void __user *uarg = (void __user *)arg;
+ struct pci_dev *pdev = vdev->pdev;
+ struct vfio_dma_map *dm;
+ int ret = 0;
+ int fd, nfd;
+ int bar;
+
+ if (vdev == NULL)
+ return -EINVAL;
+
+ switch (cmd) {
+
+ case VFIO_DMA_MAP_IOVA:
+ dm = kmalloc(sizeof *dm, GFP_KERNEL);
+ if (dm == NULL)
+ return -ENOMEM;
+ if (copy_from_user(dm, uarg, sizeof *dm)) {
+ kfree(dm);
+ return -EFAULT;
+ }
+ ret = vfio_dma_map_common(listener, cmd, dm);
+ if (!ret && copy_to_user(uarg, dm, sizeof *dm))
+ ret = -EFAULT;
+ kfree(dm);
+ break;
+
+ case VFIO_DMA_UNMAP:
+ dm = kmalloc(sizeof *dm, GFP_KERNEL);
+ if (dm == NULL)
+ return -ENOMEM;
+ if (copy_from_user(dm, uarg, sizeof *dm)) {
+ kfree(dm);
+ return -EFAULT;
+ }
+ ret = vfio_dma_unmap_dm(listener, dm);
+ kfree(dm);
+ break;
+
+ case VFIO_EVENTFD_IRQ:
+ if (copy_from_user(&fd, uarg, sizeof fd))
+ return -EFAULT;
+ mutex_lock(&vdev->igate);
+ if (vdev->ev_irq)
+ eventfd_ctx_put(vdev->ev_irq);
+ if (fd >= 0) {
+ vdev->ev_irq = eventfd_ctx_fdget(fd);
+ if (vdev->ev_irq == NULL)
+ ret = -EINVAL;
+ }
+ mutex_unlock(&vdev->igate);
+ break;
+
+ case VFIO_EVENTFD_MSI:
+ if (copy_from_user(&fd, uarg, sizeof fd))
+ return -EFAULT;
+ mutex_lock(&vdev->igate);
+ if (fd >= 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
+ ret = vfio_enable_msi(vdev, fd);
+ else if (fd < 0 && vdev->ev_msi)
+ vfio_disable_msi(vdev);
+ else
+ ret = -EINVAL;
+ mutex_unlock(&vdev->igate);
+ break;
+
+ case VFIO_EVENTFDS_MSIX:
+ if (copy_from_user(&nfd, uarg, sizeof nfd))
+ return -EFAULT;
+ uarg += sizeof nfd;
+ mutex_lock(&vdev->igate);
+ if (nfd > 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
+ ret = vfio_enable_msix(vdev, nfd, uarg);
+ else if (nfd == 0 && vdev->ev_msix)
+ vfio_disable_msix(vdev);
+ else
+ ret = -EINVAL;
+ mutex_unlock(&vdev->igate);
+ break;
+
+ case VFIO_BAR_LEN:
+ if (copy_from_user(&bar, uarg, sizeof bar))
+ return -EFAULT;
+ if (bar < 0 || bar > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ if (pci_resource_start(pdev, bar))
+ bar = pci_resource_len(pdev, bar);
+ else
+ bar = 0;
+ if (copy_to_user(uarg, &bar, sizeof bar))
+ return -EFAULT;
+ break;
+
+ case VFIO_DOMAIN_SET:
+ if (copy_from_user(&fd, uarg, sizeof fd))
+ return -EFAULT;
+ ret = vfio_domain_set(vdev, fd, allow_unsafe_intrs);
+ break;
+
+ case VFIO_DOMAIN_UNSET:
+ ret = vfio_domain_unset(vdev);
+ break;
+
+ default:
+ return -EINVAL;
+ }
+ return ret;
+}
+
+static const struct file_operations vfio_fops = {
+ .owner = THIS_MODULE,
+ .open = vfio_open,
+ .release = vfio_release,
+ .read = vfio_read,
+ .write = vfio_write,
+ .unlocked_ioctl = vfio_unl_ioctl,
+ .mmap = vfio_mmap,
+};
+
+static int vfio_get_devnum(struct vfio_dev *vdev)
+{
+ int retval = -ENOMEM;
+ int id;
+
+ mutex_lock(&vfio_minor_lock);
+ if (idr_pre_get(&vfio_idr, GFP_KERNEL) == 0)
+ goto exit;
+
+ retval = idr_get_new(&vfio_idr, vdev, &id);
+ if (retval < 0) {
+ if (retval == -EAGAIN)
+ retval = -ENOMEM;
+ goto exit;
+ }
+ if (id > MINORMASK) {
+ idr_remove(&vfio_idr, id);
+ retval = -ENOMEM;
+ }
+ if (vfio_major < 0) {
+ retval = register_chrdev(0, "vfio", &vfio_fops);
+ if (retval < 0)
+ goto exit;
+ vfio_major = retval;
+ }
+
+ retval = MKDEV(vfio_major, id);
+exit:
+ mutex_unlock(&vfio_minor_lock);
+ return retval;
+}
+
+static void vfio_free_minor(struct vfio_dev *vdev)
+{
+ mutex_lock(&vfio_minor_lock);
+ idr_remove(&vfio_idr, MINOR(vdev->devnum));
+ mutex_unlock(&vfio_minor_lock);
+}
+
+/*
+ * Verify that the device supports Interrupt Disable bit in command register,
+ * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
+ * in PCI 2.2. (from uio_pci_generic)
+ */
+static int verify_pci_2_3(struct pci_dev *pdev)
+{
+ u16 orig, new;
+ u8 pin;
+
+ pci_read_config_byte(pdev, PCI_INTERRUPT_PIN, &pin);
+ if (pin == 0) /* irqs not needed */
+ return 0;
+
+ pci_read_config_word(pdev, PCI_COMMAND, &orig);
+ pci_write_config_word(pdev, PCI_COMMAND,
+ orig ^ PCI_COMMAND_INTX_DISABLE);
+ pci_read_config_word(pdev, PCI_COMMAND, &new);
+ /* There's no way to protect against
+ * hardware bugs or detect them reliably, but as long as we know
+ * what the value should be, let's go ahead and check it. */
+ if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
+ dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
+ "driver or HW bug?\n", orig, new);
+ return -EBUSY;
+ }
+ if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
+ dev_warn(&pdev->dev, "Device does not support "
+ "disabling interrupts: unable to bind.\n");
+ return -ENODEV;
+ }
+ /* Now restore the original value. */
+ pci_write_config_word(pdev, PCI_COMMAND, orig);
+ return 0;
+}
+
+static int vfio_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+ struct vfio_dev *vdev;
+ int err;
+
+ if (!iommu_found())
+ return -EINVAL;
+
+ err = pci_enable_device(pdev);
+ if (err) {
+ dev_err(&pdev->dev, "%s: pci_enable_device failed: %d\n",
+ __func__, err);
+ return err;
+ }
+
+ err = verify_pci_2_3(pdev);
+ if (err)
+ goto err_verify;
+
+ vdev = kzalloc(sizeof(struct vfio_dev), GFP_KERNEL);
+ if (!vdev) {
+ err = -ENOMEM;
+ goto err_alloc;
+ }
+ vdev->pdev = pdev;
+
+ err = vfio_class_init();
+ if (err)
+ goto err_class;
+
+ mutex_init(&vdev->lgate);
+ mutex_init(&vdev->dgate);
+ mutex_init(&vdev->igate);
+ init_waitqueue_head(&vdev->dev_idle_q);
+
+ err = vfio_get_devnum(vdev);
+ if (err < 0)
+ goto err_get_devnum;
+ vdev->devnum = err;
+ err = 0;
+
+ sprintf(vdev->name, "vfio%d", MINOR(vdev->devnum));
+ pci_set_drvdata(pdev, vdev);
+ vdev->dev = device_create(vfio_class->class, &pdev->dev,
+ vdev->devnum, vdev, vdev->name);
+ if (IS_ERR(vdev->dev)) {
+ printk(KERN_ERR "VFIO: device register failed\n");
+ err = PTR_ERR(vdev->dev);
+ goto err_device_create;
+ }
+
+ err = vfio_dev_add_attributes(vdev);
+ if (err)
+ goto err_vfio_dev_add_attributes;
+
+
+ if (pdev->irq > 0) {
+ err = request_irq(pdev->irq, vfio_interrupt,
+ IRQF_SHARED, "vfio", vdev);
+ if (err)
+ goto err_request_irq;
+ }
+ vdev->vinfo.bardirty = 1;
+
+ return 0;
+
+err_request_irq:
+#ifdef notdef
+ vfio_dev_del_attributes(vdev);
+#endif
+err_vfio_dev_add_attributes:
+ device_destroy(vfio_class->class, vdev->devnum);
+err_device_create:
+ vfio_free_minor(vdev);
+err_get_devnum:
+err_class:
+ kfree(vdev);
+err_alloc:
+err_verify:
+ pci_disable_device(pdev);
+ return err;
+}
+
+static void vfio_remove(struct pci_dev *pdev)
+{
+ struct vfio_dev *vdev = pci_get_drvdata(pdev);
+
+ /* prevent further opens */
+ vfio_free_minor(vdev);
+
+ /* wait for all closed */
+ wait_event(vdev->dev_idle_q, vdev->listeners == 0);
+
+ if (pdev->irq > 0)
+ free_irq(pdev->irq, vdev);
+
+#ifdef notdef
+ vfio_dev_del_attributes(vdev);
+#endif
+
+ pci_set_drvdata(pdev, NULL);
+ device_destroy(vfio_class->class, vdev->devnum);
+ kfree(vdev);
+ vfio_class_destroy();
+ pci_disable_device(pdev);
+}
+
+static struct pci_driver driver = {
+ .name = "vfio",
+ .id_table = NULL, /* only dynamic id's */
+ .probe = vfio_probe,
+ .remove = vfio_remove,
+};
+
+static int __init init(void)
+{
+ pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
+ return pci_register_driver(&driver);
+}
+
+static void __exit cleanup(void)
+{
+ if (vfio_major >= 0)
+ unregister_chrdev(vfio_major, "vfio");
+ pci_unregister_driver(&driver);
+}
+
+module_init(init);
+module_exit(cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/vfio_pci_config.c b/drivers/vfio/vfio_pci_config.c
index e69de29..8bd5c00 100644
--- a/drivers/vfio/vfio_pci_config.c
+++ b/drivers/vfio/vfio_pci_config.c
@@ -0,0 +1,605 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <[email protected]>
+ * Copyright(C) 2005, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2006, Hans J. Koch <[email protected]>
+ * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <[email protected]>
+ */
+
+#include <linux/fs.h>
+#include <linux/pci.h>
+#include <linux/mmu_notifier.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+#define PCI_CAP_ID_BASIC 0
+#ifndef PCI_CAP_ID_MAX
+#define PCI_CAP_ID_MAX PCI_CAP_ID_AF
+#endif
+
+/*
+ * Lengths of PCI Config Capabilities
+ * 0 means unknown (but at least 4)
+ * FF means special/variable
+ */
+static u8 pci_capability_length[] = {
+ [PCI_CAP_ID_BASIC] = 64, /* pci config header */
+ [PCI_CAP_ID_PM] = PCI_PM_SIZEOF,
+ [PCI_CAP_ID_AGP] = PCI_AGP_SIZEOF,
+ [PCI_CAP_ID_VPD] = 8,
+ [PCI_CAP_ID_SLOTID] = 4,
+ [PCI_CAP_ID_MSI] = 0xFF, /* 10, 14, 20, or 24 */
+ [PCI_CAP_ID_CHSWP] = 4,
+ [PCI_CAP_ID_PCIX] = 0xFF, /* 8 or 24 */
+ [PCI_CAP_ID_HT] = 28,
+ [PCI_CAP_ID_VNDR] = 0xFF,
+ [PCI_CAP_ID_DBG] = 0,
+ [PCI_CAP_ID_CCRC] = 0,
+ [PCI_CAP_ID_SHPC] = 0,
+ [PCI_CAP_ID_SSVID] = 0, /* bridge only - not supp */
+ [PCI_CAP_ID_AGP3] = 0,
+ [PCI_CAP_ID_EXP] = 36,
+ [PCI_CAP_ID_MSIX] = 12,
+ [PCI_CAP_ID_AF] = 6,
+};
+
+/*
+ * Read/Write Permission Bits - one bit for each bit in capability
+ * Any field can be read if it exists,
+ * but what is read depends on whether the field
+ * is 'virtualized', or just pass thru to the hardware.
+ * Any virtualized field is also virtualized for writes.
+ * Writes are only permitted if they have a 1 bit here.
+ */
+struct perm_bits {
+ u32 rvirt; /* read bits which must be virtualized */
+ u32 write; /* writeable bits - virt if read virt */
+};
+
+static struct perm_bits pci_cap_basic_perm[] = {
+ { 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
+ { 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
+ { 0, 0, }, /* 0x08 class code & revision id */
+ { 0, 0xFF00FFFF, }, /* 0x0c bist, htype, lat, cache */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x24 bar */
+ { 0, 0, }, /* 0x28 cardbus - not yet */
+ { 0, 0, }, /* 0x2c subsys vendor & dev */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x30 rom bar */
+ { 0, 0, }, /* 0x34 capability ptr & resv */
+ { 0, 0, }, /* 0x38 resv */
+ { 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
+};
+
+static struct perm_bits pci_cap_pm_perm[] = {
+ { 0, 0, }, /* 0x00 PM capabilities */
+ { 0, 0xFFFFFFFF, }, /* 0x04 PM control/status */
+};
+
+static struct perm_bits pci_cap_vpd_perm[] = {
+ { 0, 0xFFFF0000, }, /* 0x00 address */
+ { 0, 0xFFFFFFFF, }, /* 0x04 data */
+};
+
+static struct perm_bits pci_cap_slotid_perm[] = {
+ { 0, 0, }, /* 0x00 all read only */
+};
+
+/* 4 different possible layouts of MSI capability */
+static struct perm_bits pci_cap_msi_10_perm[] = {
+ { 0, 0, }, /* 0x00 MSI message control */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
+ { 0x0000FFFF, 0x0000FFFF, }, /* 0x08 MSI message data */
+};
+static struct perm_bits pci_cap_msi_14_perm[] = {
+ { 0, 0, }, /* 0x00 MSI message control */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message upper addr */
+ { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */
+};
+static struct perm_bits pci_cap_msi_20_perm[] = {
+ { 0, 0, }, /* 0x00 MSI message control */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
+ { 0x0000FFFF, 0x0000FFFF, }, /* 0x08 MSI message data */
+ { 0, 0xFFFFFFFF, }, /* 0x0c MSI mask bits */
+ { 0, 0xFFFFFFFF, }, /* 0x10 MSI pending bits */
+};
+static struct perm_bits pci_cap_msi_24_perm[] = {
+ { 0, 0, }, /* 0x00 MSI message control */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message upper addr */
+ { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */
+ { 0, 0xFFFFFFFF, }, /* 0x10 MSI mask bits */
+ { 0, 0xFFFFFFFF, }, /* 0x14 MSI pending bits */
+};
+
+static struct perm_bits pci_cap_pcix_perm[] = {
+ { 0, 0xFFFF0000, }, /* 0x00 PCI_X_CMD */
+ { 0, 0, }, /* 0x04 PCI_X_STATUS */
+ { 0, 0xFFFFFFFF, }, /* 0x08 ECC ctlr & status */
+ { 0, 0, }, /* 0x0c ECC first addr */
+ { 0, 0, }, /* 0x10 ECC second addr */
+ { 0, 0, }, /* 0x14 ECC attr */
+};
+
+/* pci express capabilities */
+static struct perm_bits pci_cap_exp_perm[] = {
+ { 0, 0, }, /* 0x00 PCIe capabilities */
+ { 0, 0, }, /* 0x04 PCIe device capabilities */
+ { 0, 0xFFFFFFFF, }, /* 0x08 PCIe device control & status */
+ { 0, 0, }, /* 0x0c PCIe link capabilities */
+ { 0, 0x000000FF, }, /* 0x10 PCIe link ctl/stat - SAFE? */
+ { 0, 0, }, /* 0x14 PCIe slot capabilities */
+ { 0, 0x00FFFFFF, }, /* 0x18 PCIe link ctl/stat - SAFE? */
+ { 0, 0, }, /* 0x1c PCIe root port stuff */
+ { 0, 0, }, /* 0x20 PCIe root port stuff */
+};
+
+static struct perm_bits pci_cap_msix_perm[] = {
+ { 0, 0, }, /* 0x00 MSI-X Enable */
+ { 0, 0, }, /* 0x04 table offset & bir */
+ { 0, 0, }, /* 0x08 pba offset & bir */
+};
+
+static struct perm_bits pci_cap_af_perm[] = {
+ { 0, 0, }, /* 0x00 af capability */
+ { 0, 0x0001, }, /* 0x04 af flr bit */
+};
+
+static struct perm_bits *pci_cap_perms[] = {
+ [PCI_CAP_ID_BASIC] = pci_cap_basic_perm,
+ [PCI_CAP_ID_PM] = pci_cap_pm_perm,
+ [PCI_CAP_ID_VPD] = pci_cap_vpd_perm,
+ [PCI_CAP_ID_SLOTID] = pci_cap_slotid_perm,
+ [PCI_CAP_ID_MSI] = NULL, /* special */
+ [PCI_CAP_ID_PCIX] = pci_cap_pcix_perm,
+ [PCI_CAP_ID_EXP] = pci_cap_exp_perm,
+ [PCI_CAP_ID_MSIX] = pci_cap_msix_perm,
+ [PCI_CAP_ID_AF] = pci_cap_af_perm,
+};
+
+static int pci_msi_cap_len(struct pci_dev *pdev, u8 pos)
+{
+ int len;
+ int ret;
+ u16 flags;
+
+ ret = pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &flags);
+ if (ret < 0)
+ return ret;
+ if (flags & PCI_MSI_FLAGS_64BIT)
+ len = 14;
+ else
+ len = 10;
+ if (flags & PCI_MSI_FLAGS_MASKBIT)
+ len += 10;
+ return len;
+}
+
+/*
+ * We build a map of the config space that tells us where
+ * and what capabilities exist, so that we can map reads and
+ * writes back to capabilities, and thus figure out what to
+ * allow, deny, or virtualize
+ */
+int vfio_build_config_map(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u8 *map;
+ int i, len;
+ u8 pos, cap, tmp;
+ u16 flags;
+ int ret;
+#ifndef PCI_FIND_CAP_TTL
+#define PCI_FIND_CAP_TTL 48
+#endif
+ int loops = PCI_FIND_CAP_TTL;
+
+ map = kmalloc(pdev->cfg_size, GFP_KERNEL);
+ if (map == NULL)
+ return -ENOMEM;
+ for (i = 0; i < pdev->cfg_size; i++)
+ map[i] = 0xFF;
+ vdev->pci_config_map = map;
+
+ /* default config space */
+ for (i = 0; i < pci_capability_length[0]; i++)
+ map[i] = 0;
+
+ /* any capabilities? */
+ ret = pci_read_config_word(pdev, PCI_STATUS, &flags);
+ if (ret < 0)
+ return ret;
+ if ((flags & PCI_STATUS_CAP_LIST) == 0)
+ return 0;
+
+ ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos);
+ if (ret < 0)
+ return ret;
+ while (pos && --loops > 0) {
+ ret = pci_read_config_byte(pdev, pos, &cap);
+ if (ret < 0)
+ return ret;
+ if (cap == 0) {
+ printk(KERN_WARNING "%s: cap 0\n", __func__);
+ break;
+ }
+ if (cap > PCI_CAP_ID_MAX) {
+ printk(KERN_WARNING "%s: unknown pci capability id %x\n",
+ __func__, cap);
+ len = 0;
+ } else
+ len = pci_capability_length[cap];
+ if (len == 0) {
+ printk(KERN_WARNING "%s: unknown length for pci cap %x\n",
+ __func__, cap);
+ len = 4;
+ }
+ if (len == 0xFF) {
+ switch (cap) {
+ case PCI_CAP_ID_MSI:
+ len = pci_msi_cap_len(pdev, pos);
+ if (len < 0)
+ return len;
+ break;
+ case PCI_CAP_ID_PCIX:
+ ret = pci_read_config_word(pdev, pos + 2,
+ &flags);
+ if (ret < 0)
+ return ret;
+ if (flags & 0x3000)
+ len = 24;
+ else
+ len = 8;
+ break;
+ case PCI_CAP_ID_VNDR:
+ /* length follows next field */
+ ret = pci_read_config_byte(pdev, pos + 2, &tmp);
+ if (ret < 0)
+ return ret;
+ len = tmp;
+ break;
+ default:
+ len = 0;
+ break;
+ }
+ }
+
+ for (i = 0; i < len; i++) {
+ if (map[pos+i] != 0xFF)
+ printk(KERN_WARNING
+ "%s: pci config conflict at %x, "
+ "caps %x %x\n",
+ __func__, i, map[pos+i], cap);
+ map[pos+i] = cap;
+ }
+ ret = pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, &pos);
+ if (ret < 0)
+ return ret;
+ }
+ if (loops <= 0)
+ printk(KERN_ERR "%s: config space loop!\n", __func__);
+ return 0;
+}
+
+static void vfio_virt_init(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int bar;
+ u32 *lp;
+ u32 val;
+ u8 pos;
+ int i, len;
+
+ for (bar = 0; bar <= 5; bar++) {
+ lp = (u32 *)&vdev->vinfo.bar[bar * 4];
+ pci_read_config_dword(pdev, PCI_BASE_ADDRESS_0 + 4*bar, &val);
+ *lp++ = val;
+ }
+ lp = (u32 *)vdev->vinfo.rombar;
+ pci_read_config_dword(pdev, PCI_ROM_ADDRESS, &val);
+ *lp = val;
+
+ vdev->vinfo.intr = pdev->irq;
+
+ pos = pci_find_capability(pdev, PCI_CAP_ID_MSI);
+ if (pos > 0) {
+ len = pci_msi_cap_len(pdev, pos);
+ if (len < 0)
+ return;
+ for (i = 0; i < len; i++)
+ (void) pci_read_config_byte(pdev, pos + i,
+ &vdev->vinfo.msi[i]);
+ }
+}
+
+static void vfio_bar_fixup(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int bar;
+ u32 *lp;
+ u64 mask;
+
+ for (bar = 0; bar <= 5; bar++) {
+ if (pci_resource_start(pdev, bar))
+ mask = ~(pci_resource_len(pdev, bar) - 1);
+ else
+ mask = 0;
+ lp = (u32 *)&vdev->vinfo.bar[bar * 4];
+ *lp &= (u32)mask;
+
+ if (pci_resource_flags(pdev, bar) & IORESOURCE_IO)
+ *lp |= PCI_BASE_ADDRESS_SPACE_IO;
+ else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
+ *lp |= PCI_BASE_ADDRESS_SPACE_MEMORY;
+ if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
+ *lp |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+ if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM_64) {
+ *lp |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+ lp++;
+ *lp &= (u32)(mask >> 32);
+ bar++;
+ }
+ }
+ }
+
+ lp = (u32 *)vdev->vinfo.rombar;
+ mask = ~(pci_resource_len(pdev, PCI_ROM_RESOURCE) - 1);
+ *lp &= (u32)mask | PCI_ROM_ADDRESS_ENABLE;
+
+ vdev->vinfo.bardirty = 0;
+}
+
+static int vfio_config_rwbyte(int write,
+ struct vfio_dev *vdev,
+ int pos,
+ char __user *buf)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u8 *map = vdev->pci_config_map;
+ u8 cap, val, newval;
+ u16 start, off;
+ int p;
+ struct perm_bits *perm;
+ u8 wr, virt;
+ int ret;
+ int len;
+
+ cap = map[pos];
+ if (cap == 0xFF) { /* unknown region */
+ if (write)
+ return 0; /* silent no-op */
+ val = 0;
+ if (pos <= pci_capability_length[0]) /* ok to read */
+ (void) pci_read_config_byte(pdev, pos, &val);
+ if (copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ return 0;
+ }
+
+ /* scan back to start of cap region */
+ for (p = pos; p >= 0; p--) {
+ if (map[p] != cap)
+ break;
+ start = p;
+ }
+ off = pos - start; /* offset within capability */
+
+ perm = pci_cap_perms[cap];
+ if (cap == PCI_CAP_ID_MSI) {
+ len = pci_msi_cap_len(pdev, start);
+ switch (len) {
+ case 10:
+ perm = pci_cap_msi_10_perm;
+ break;
+ case 14:
+ perm = pci_cap_msi_14_perm;
+ break;
+ case 20:
+ perm = pci_cap_msi_20_perm;
+ break;
+ case 24:
+ perm = pci_cap_msi_24_perm;
+ break;
+ default:
+ perm = NULL;
+ break;
+ }
+ }
+ if (perm == NULL) {
+ wr = 0;
+ virt = 0;
+ } else {
+ perm += (off >> 2);
+ wr = perm->write >> ((off & 3) * 8);
+ virt = perm->rvirt >> ((off & 3) * 8);
+ }
+ if (write && !wr) /* no writeable bits */
+ return 0;
+ if (!virt) {
+ if (write) {
+ if (copy_from_user(&val, buf, 1))
+ return -EFAULT;
+ val &= wr;
+ if (wr != 0xFF) {
+ u8 existing;
+
+ ret = pci_read_config_byte(pdev, pos,
+ &existing);
+ if (ret < 0)
+ return ret;
+ val |= (existing & ~wr);
+ }
+ pci_write_config_byte(pdev, pos, val);
+ } else {
+ ret = pci_read_config_byte(pdev, pos, &val);
+ if (ret < 0)
+ return ret;
+ if (copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ }
+ return 0;
+ }
+
+ if (write) {
+ if (copy_from_user(&newval, buf, 1))
+ return -EFAULT;
+ }
+ /*
+ * We get here if there are some virt bits
+ * handle remaining real bits, if any
+ */
+ if (~virt) {
+ u8 rbits = (~virt) & wr;
+
+ ret = pci_read_config_byte(pdev, pos, &val);
+ if (ret < 0)
+ return ret;
+ if (write && rbits) {
+ val &= ~rbits;
+ newval &= rbits;
+ val |= newval;
+ pci_write_config_byte(pdev, pos, val);
+ }
+ }
+ /*
+ * Now handle entirely virtual fields
+ */
+ switch (cap) {
+ case PCI_CAP_ID_BASIC: /* virtualize BARs */
+ switch (off) {
+ /*
+ * vendor and device are virt because they don't
+ * show up otherwise for sr-iov vfs
+ */
+ case PCI_VENDOR_ID:
+ val = pdev->vendor;
+ break;
+ case PCI_VENDOR_ID + 1:
+ val = pdev->vendor >> 8;
+ break;
+ case PCI_DEVICE_ID:
+ val = pdev->device;
+ break;
+ case PCI_DEVICE_ID + 1:
+ val = pdev->device >> 8;
+ break;
+ case PCI_INTERRUPT_LINE:
+ if (write)
+ vdev->vinfo.intr = newval;
+ else
+ val = vdev->vinfo.intr;
+ break;
+ case PCI_ROM_ADDRESS:
+ case PCI_ROM_ADDRESS+1:
+ case PCI_ROM_ADDRESS+2:
+ case PCI_ROM_ADDRESS+3:
+ if (write) {
+ vdev->vinfo.rombar[off & 3] = newval;
+ vdev->vinfo.bardirty = 1;
+ } else {
+ if (vdev->vinfo.bardirty)
+ vfio_bar_fixup(vdev);
+ val = vdev->vinfo.rombar[off & 3];
+ }
+ break;
+ default:
+ if (off >= PCI_BASE_ADDRESS_0 &&
+ off <= PCI_BASE_ADDRESS_5 + 3) {
+ int boff = off - PCI_BASE_ADDRESS_0;
+
+ if (write) {
+ vdev->vinfo.bar[boff] = newval;
+ vdev->vinfo.bardirty = 1;
+ } else {
+ if (vdev->vinfo.bardirty)
+ vfio_bar_fixup(vdev);
+ val = vdev->vinfo.bar[boff];
+ }
+ }
+ break;
+ }
+ break;
+ case PCI_CAP_ID_MSI: /* virtualize (parts of) MSI */
+ if (write)
+ vdev->vinfo.msi[off] = newval;
+ else
+ val = vdev->vinfo.msi[off];
+ break;
+ }
+ if (!write && copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ return 0;
+}
+
+ssize_t vfio_config_readwrite(int write,
+ struct vfio_dev *vdev,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int done = 0;
+ int ret;
+ u16 pos;
+
+
+ if (vdev->pci_config_map == NULL) {
+ ret = vfio_build_config_map(vdev);
+ if (ret < 0)
+ goto out;
+ vfio_virt_init(vdev);
+ }
+
+ while (count > 0) {
+ pos = *ppos;
+ if (pos == pdev->cfg_size)
+ break;
+ if (pos > pdev->cfg_size) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * we grab the irqlock here to prevent confusing
+ * the read/modify/write sequence in vfio_interrupt
+ */
+ spin_lock_irq(&vdev->irqlock);
+ ret = vfio_config_rwbyte(write, vdev, pos, buf);
+ spin_unlock_irq(&vdev->irqlock);
+
+ if (ret < 0)
+ goto out;
+ buf++;
+ done++;
+ count--;
+ (*ppos)++;
+ }
+ ret = done;
+out:
+ return ret;
+}
diff --git a/drivers/vfio/vfio_rdwr.c b/drivers/vfio/vfio_rdwr.c
index e69de29..f4bd2b7 100644
--- a/drivers/vfio/vfio_rdwr.c
+++ b/drivers/vfio/vfio_rdwr.c
@@ -0,0 +1,152 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <[email protected]>
+ * Copyright(C) 2005, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2006, Hans J. Koch <[email protected]>
+ * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <[email protected]>
+ */
+
+#include <linux/fs.h>
+#include <linux/mmu_notifier.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/io.h>
+
+#include <linux/vfio.h>
+
+ssize_t vfio_io_readwrite(
+ int write,
+ struct vfio_dev *vdev,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ size_t done = 0;
+ resource_size_t end;
+ void __iomem *io;
+ loff_t pos;
+ int pci_space;
+ int unit;
+
+ pci_space = vfio_offset_to_pci_space(*ppos);
+ pos = vfio_offset_to_pci_offset(*ppos);
+
+ if (!pci_resource_start(pdev, pci_space))
+ return -EINVAL;
+ end = pci_resource_len(pdev, pci_space);
+ if (pos + count > end)
+ return -EINVAL;
+ if (vdev->bar[pci_space] == NULL)
+ vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
+ io = vdev->bar[pci_space];
+
+ while (count > 0) {
+ if ((pos % 4) == 0 && count >= 4) {
+ u32 val;
+
+ if (write) {
+ if (copy_from_user(&val, buf, 4))
+ return -EFAULT;
+ iowrite32(val, io + pos);
+ } else {
+ val = ioread32(io + pos);
+ if (copy_to_user(buf, &val, 4))
+ return -EFAULT;
+ }
+ unit = 4;
+ } else if ((pos % 2) == 0 && count >= 2) {
+ u16 val;
+
+ if (write) {
+ if (copy_from_user(&val, buf, 2))
+ return -EFAULT;
+ iowrite16(val, io + pos);
+ } else {
+ val = ioread16(io + pos);
+ if (copy_to_user(buf, &val, 2))
+ return -EFAULT;
+ }
+ unit = 2;
+ } else {
+ u8 val;
+
+ if (write) {
+ if (copy_from_user(&val, buf, 1))
+ return -EFAULT;
+ iowrite8(val, io + pos);
+ } else {
+ val = ioread8(io + pos);
+ if (copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ }
+ unit = 1;
+ }
+ pos += unit;
+ buf += unit;
+ count -= unit;
+ done += unit;
+ }
+ *ppos += done;
+ return done;
+}
+
+ssize_t vfio_mem_readwrite(
+ int write,
+ struct vfio_dev *vdev,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ resource_size_t end;
+ void __iomem *io;
+ loff_t pos;
+ int pci_space;
+
+ pci_space = vfio_offset_to_pci_space(*ppos);
+ pos = vfio_offset_to_pci_offset(*ppos);
+
+ if (!pci_resource_start(pdev, pci_space))
+ return -EINVAL;
+ end = pci_resource_len(pdev, pci_space);
+ if (vdev->bar[pci_space] == NULL)
+ vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
+ io = vdev->bar[pci_space];
+
+ if (pos > end)
+ return -EINVAL;
+ if (pos == end)
+ return 0;
+ if (pos + count > end)
+ count = end - pos;
+ if (write) {
+ if (copy_from_user(io + pos, buf, count))
+ return -EFAULT;
+ } else {
+ if (copy_to_user(buf, io + pos, count))
+ return -EFAULT;
+ }
+ *ppos += count;
+ return count;
+}
diff --git a/drivers/vfio/vfio_sysfs.c b/drivers/vfio/vfio_sysfs.c
index e69de29..6275809 100644
--- a/drivers/vfio/vfio_sysfs.c
+++ b/drivers/vfio/vfio_sysfs.c
@@ -0,0 +1,153 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <[email protected]>
+ * Copyright(C) 2005, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2006, Hans J. Koch <[email protected]>
+ * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <[email protected]>
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/pci.h>
+#include <linux/mmu_notifier.h>
+
+#include <linux/vfio.h>
+
+struct vfio_class *vfio_class;
+
+int vfio_class_init(void)
+{
+ int ret = 0;
+
+ if (vfio_class != NULL) {
+ kref_get(&vfio_class->kref);
+ goto exit;
+ }
+
+ vfio_class = kzalloc(sizeof(*vfio_class), GFP_KERNEL);
+ if (!vfio_class) {
+ ret = -ENOMEM;
+ goto err_kzalloc;
+ }
+
+ kref_init(&vfio_class->kref);
+ vfio_class->class = class_create(THIS_MODULE, "vfio");
+ if (IS_ERR(vfio_class->class)) {
+ ret = IS_ERR(vfio_class->class);
+ printk(KERN_ERR "class_create failed for vfio\n");
+ goto err_class_create;
+ }
+ return 0;
+
+err_class_create:
+ kfree(vfio_class);
+ vfio_class = NULL;
+err_kzalloc:
+exit:
+ return ret;
+}
+
+static void vfio_class_release(struct kref *kref)
+{
+ /* Ok, we cheat as we know we only have one vfio_class */
+ class_destroy(vfio_class->class);
+ kfree(vfio_class);
+ vfio_class = NULL;
+}
+
+void vfio_class_destroy(void)
+{
+ if (vfio_class)
+ kref_put(&vfio_class->kref, vfio_class_release);
+}
+
+static ssize_t config_map_read(struct kobject *kobj,
+ struct bin_attribute *bin_attr,
+ char *buf, loff_t off, size_t count)
+{
+ struct vfio_dev *vdev = bin_attr->private;
+ int ret;
+
+ if (off >= 256)
+ return 0;
+ if (off + count > 256)
+ count = 256 - off;
+ if (vdev->pci_config_map == NULL) {
+ ret = vfio_build_config_map(vdev);
+ if (ret < 0)
+ return ret;
+ }
+ memcpy(buf, vdev->pci_config_map + off, count);
+ return count;
+}
+
+static ssize_t show_locked_pages(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct vfio_dev *vdev = dev_get_drvdata(dev);
+
+ if (vdev == NULL)
+ return -ENODEV;
+ return sprintf(buf, "%u\n", vdev->locked_pages);
+}
+
+static DEVICE_ATTR(locked_pages, S_IRUGO, show_locked_pages, NULL);
+
+static struct attribute *vfio_attrs[] = {
+ &dev_attr_locked_pages.attr,
+ NULL,
+};
+
+static struct attribute_group vfio_attr_grp = {
+ .attrs = vfio_attrs,
+};
+
+static struct bin_attribute config_map_bin_attribute = {
+ .attr = {
+ .name = "config_map",
+ .mode = S_IRUGO,
+ },
+ .size = 256,
+ .read = config_map_read,
+};
+
+int vfio_dev_add_attributes(struct vfio_dev *vdev)
+{
+ struct bin_attribute *bi;
+ int ret;
+
+ ret = sysfs_create_group(&vdev->dev->kobj, &vfio_attr_grp);
+ if (ret)
+ return ret;
+ bi = kmalloc(sizeof(*bi), GFP_KERNEL);
+ if (bi == NULL)
+ return -ENOMEM;
+ *bi = config_map_bin_attribute;
+ bi->private = vdev;
+ return sysfs_create_bin_file(&vdev->dev->kobj, bi);
+}
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index e2ea0b2..ed37cf4 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -166,6 +166,7 @@ header-y += ultrasound.h
header-y += un.h
header-y += utime.h
header-y += veth.h
+header-y += vfio.h
header-y += videotext.h
header-y += x25.h

diff --git a/include/linux/uiommu.h b/include/linux/uiommu.h
index e69de29..a7b7eac 100644
--- a/include/linux/uiommu.h
+++ b/include/linux/uiommu.h
@@ -0,0 +1,76 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/*
+ * uiommu driver - manipulation of iommu domains from user progs
+ */
+struct uiommu_domain {
+ struct iommu_domain *domain;
+ atomic_t refcnt;
+};
+
+/*
+ * Kernel routines invoked by fellow driver (vfio)
+ * after uiommu domain fd is passed in.
+ */
+struct uiommu_domain *uiommu_fdget(int fd);
+void uiommu_put(struct uiommu_domain *);
+
+/*
+ * These inlines are placeholders for future routines
+ * which may keep statistics, show info in sysfs, etc.
+ */
+static inline int uiommu_attach_device(struct uiommu_domain *udomain,
+ struct device *dev)
+{
+ return iommu_attach_device(udomain->domain, dev);
+}
+
+static inline void uiommu_detach_device(struct uiommu_domain *udomain,
+ struct device *dev)
+{
+ iommu_detach_device(udomain->domain, dev);
+}
+
+static inline int uiommu_map_range(struct uiommu_domain *udomain,
+ unsigned long iova,
+ phys_addr_t paddr,
+ size_t size,
+ int prot)
+{
+ return iommu_map_range(udomain->domain, iova, paddr, size, prot);
+}
+
+static inline void uiommu_unmap_range(struct uiommu_domain *udomain,
+ unsigned long iova,
+ size_t size)
+{
+ iommu_unmap_range(udomain->domain, iova, size);
+}
+
+static inline phys_addr_t uiommu_iova_to_phys(struct uiommu_domain *udomain,
+ unsigned long iova)
+{
+ return iommu_iova_to_phys(udomain->domain, iova);
+}
+
+static inline int uiommu_domain_has_cap(struct uiommu_domain *udomain,
+ unsigned long cap)
+{
+ return iommu_domain_has_cap(udomain->domain, cap);
+}
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index e69de29..52aa0dd 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -0,0 +1,202 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <[email protected]>
+ * Copyright(C) 2005, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2006, Hans J. Koch <[email protected]>
+ * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <[email protected]>
+ */
+#include <linux/types.h>
+
+/*
+ * VFIO driver - allow mapping and use of certain PCI devices
+ * in unprivileged user processes. (If IOMMU is present)
+ * Especially useful for Virtual Function parts of SR-IOV devices
+ */
+
+#ifdef __KERNEL__
+
+struct vfio_dev {
+ struct device *dev;
+ struct pci_dev *pdev;
+ u8 *pci_config_map;
+ int pci_config_size;
+ char name[8];
+ int devnum;
+ void __iomem *bar[PCI_ROM_RESOURCE+1];
+ spinlock_t irqlock; /* guards command register accesses */
+ int listeners;
+ u32 locked_pages;
+ struct mutex lgate; /* listener gate */
+ struct mutex dgate; /* dma op gate */
+ struct mutex igate; /* intr op gate */
+ wait_queue_head_t dev_idle_q;
+ int mapcount;
+ struct uiommu_domain *udomain;
+ struct msix_entry *msix;
+ int nvec;
+ int cachec;
+ struct eventfd_ctx *ev_irq;
+ struct eventfd_ctx *ev_msi;
+ struct eventfd_ctx **ev_msix;
+ struct {
+ u8 intr;
+ u8 bardirty;
+ u8 rombar[4];
+ u8 bar[6*4];
+ u8 msi[24];
+ } vinfo;
+};
+
+struct vfio_listener {
+ struct vfio_dev *vdev;
+ struct list_head dm_list;
+ struct mm_struct *mm;
+ struct mmu_notifier mmu_notifier;
+};
+
+/*
+ * Structure for keeping track of memory nailed down by the
+ * user for DMA
+ */
+struct dma_map_page {
+ struct list_head list;
+ struct page **pages;
+ dma_addr_t daddr;
+ unsigned long vaddr;
+ int npage;
+ int rdwr;
+};
+
+/* VFIO class infrastructure */
+struct vfio_class {
+ struct kref kref;
+ struct class *class;
+};
+extern struct vfio_class *vfio_class;
+
+ssize_t vfio_io_readwrite(int, struct vfio_dev *,
+ char __user *, size_t, loff_t *);
+ssize_t vfio_mem_readwrite(int, struct vfio_dev *,
+ char __user *, size_t, loff_t *);
+ssize_t vfio_config_readwrite(int, struct vfio_dev *,
+ char __user *, size_t, loff_t *);
+
+void vfio_disable_msi(struct vfio_dev *);
+void vfio_disable_msix(struct vfio_dev *);
+int vfio_enable_msi(struct vfio_dev *, int);
+int vfio_enable_msix(struct vfio_dev *, int, void __user *);
+
+#ifndef PCI_MSIX_ENTRY_SIZE
+#define PCI_MSIX_ENTRY_SIZE 16
+#endif
+#ifndef PCI_STATUS_INTERRUPT
+#define PCI_STATUS_INTERRUPT 0x08
+#endif
+
+struct vfio_dma_map;
+void vfio_dma_unmapall(struct vfio_listener *);
+int vfio_dma_unmap_dm(struct vfio_listener *, struct vfio_dma_map *);
+int vfio_dma_map_common(struct vfio_listener *, unsigned int,
+ struct vfio_dma_map *);
+int vfio_domain_set(struct vfio_dev *, int, int);
+int vfio_domain_unset(struct vfio_dev *);
+
+int vfio_class_init(void);
+void vfio_class_destroy(void);
+int vfio_dev_add_attributes(struct vfio_dev *);
+int vfio_build_config_map(struct vfio_dev *);
+
+irqreturn_t vfio_interrupt(int, void *);
+
+#endif /* __KERNEL__ */
+
+/* Kernel & User level defines for ioctls */
+
+/*
+ * Structure for DMA mapping of user buffers
+ * vaddr, dmaaddr, and size must all be page aligned
+ * buffer may only be larger than 1 page if (a) there is
+ * an iommu in the system, or (b) buffer is part of a huge page
+ */
+struct vfio_dma_map {
+ __u64 vaddr; /* process virtual addr */
+ __u64 dmaaddr; /* desired and/or returned dma address */
+ __u64 size; /* size in bytes */
+ __u64 flags; /* bool: 0 for r/o; 1 for r/w */
+#define VFIO_FLAG_WRITE 0x1 /* req writeable DMA mem */
+};
+
+/* map user pages at specific dma address */
+/* requires previous VFIO_DOMAIN_SET */
+#define VFIO_DMA_MAP_IOVA _IOWR(';', 101, struct vfio_dma_map)
+
+/* unmap user pages */
+#define VFIO_DMA_UNMAP _IOW(';', 102, struct vfio_dma_map)
+
+/* request IRQ interrupts; use given eventfd */
+#define VFIO_EVENTFD_IRQ _IOW(';', 103, int)
+
+/* request MSI interrupts; use given eventfd */
+#define VFIO_EVENTFD_MSI _IOW(';', 104, int)
+
+/* Request MSI-X interrupts: arg[0] is #, arg[1-n] are eventfds */
+#define VFIO_EVENTFDS_MSIX _IOW(';', 105, int)
+
+/* Get length of a BAR */
+#define VFIO_BAR_LEN _IOWR(';', 167, __u32)
+
+/* Set the IOMMU domain - arg is fd from uiommu driver */
+#define VFIO_DOMAIN_SET _IOW(';', 107, int)
+
+/* Unset the IOMMU domain */
+#define VFIO_DOMAIN_UNSET _IO(';', 108)
+
+/*
+ * Reads, writes, and mmaps determine which PCI BAR (or config space)
+ * from the high level bits of the file offset
+ */
+#define VFIO_PCI_BAR0_RESOURCE 0x0
+#define VFIO_PCI_BAR1_RESOURCE 0x1
+#define VFIO_PCI_BAR2_RESOURCE 0x2
+#define VFIO_PCI_BAR3_RESOURCE 0x3
+#define VFIO_PCI_BAR4_RESOURCE 0x4
+#define VFIO_PCI_BAR5_RESOURCE 0x5
+#define VFIO_PCI_ROM_RESOURCE 0x6
+#define VFIO_PCI_CONFIG_RESOURCE 0xF
+#define VFIO_PCI_SPACE_SHIFT 32
+#define VFIO_PCI_CONFIG_OFF vfio_pci_space_to_offset(VFIO_PCI_CONFIG_RESOURCE)
+
+static inline int vfio_offset_to_pci_space(__u64 off)
+{
+ return (off >> VFIO_PCI_SPACE_SHIFT) & 0xF;
+}
+
+static inline u32 vfio_offset_to_pci_offset(__u64 off)
+{
+ return off & (u32)0xFFFFFFFF;
+}
+
+static inline __u64 vfio_pci_space_to_offset(int sp)
+{
+ return (__u64)(sp) << VFIO_PCI_SPACE_SHIFT;
+}


2010-07-17 08:45:46

by Piotr Jaroszyński

[permalink] [raw]
Subject: Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

On 16 July 2010 23:58, Tom Lyon <[email protected]> wrote:
> The VFIO "driver" is used to allow privileged AND non-privileged processes to
> implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> devices.

Thanks for working on that! I wonder whether it's possible to say what
are the chances of it being merged to mainline and which version we
might be talking about?

--
Best Regards
Piotr Jaroszyński

2010-07-18 09:45:36

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

Hi Tom,
> ---
> In this version:
>
> There are lots of bug fixes and cleanups in this version, but the main
> change is to check to make sure that the IOMMU has interrupt remapping
> enabled, which is necessary to prevent user level code from triggering
> spurious interrupts for other devices. Since most platforms today
> do not have the necessary hardware and/or software for this, a module
> option can override this check, thus making vfio useful (but not safe)
> on many more platforms.
>
> In the next version I plan to add kernel to user messaging using the
> generic netlink mechanism to allow the user driver to react to hot add
> and remove, and power management requests.
>
> Blurb from version 2:
>
> This version now requires an IOMMU domain to be set before any access to
> device registers is granted (except that config space may be read). In
> addition, the VFIO_DMA_MAP_ANYWHERE is dropped - it used the dma_map_sg API
> which does not have sufficient controls around IOMMU usage. The IOMMU domain
> is obtained from the 'uiommu' driver which is included in this patch.
>
> Various locking, security, and documentation issues have also been fixed.
>

I think this is making nice progress, especially good to see
some effort to address the interrupt remapping issue.
I just realized we also have an issue with determining
the MSI capability and allocating entries. Some ideas
on addressing this posted below.
I also think it might make sense to involve the pci crowd here.

It might be nice to see a bit of documentation for the interface
presented to userspace. It is not always obvious.
For example, passing CONFIG as offset to
write or read will not always trigger access to config space.

More comments below.


On Fri, Jul 16, 2010 at 02:58:48PM -0700, Tom Lyon wrote:
> +static int vfio_msix_check(struct vfio_dev *vdev, u64 start, u32 len)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + u16 pos;
> + u32 table_offset;
> + u16 table_size;
> + u8 bir;
> + u32 lo, hi, startp, endp;
> +
> + pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
> + if (!pos)
> + return 0;
> +
> + pci_read_config_word(pdev, pos + PCI_MSIX_FLAGS, &table_size);
> + table_size = (table_size & PCI_MSIX_FLAGS_QSIZE) + 1;
> + pci_read_config_dword(pdev, pos + 4, &table_offset);
> + bir = table_offset & PCI_MSIX_FLAGS_BIRMASK;
> + lo = table_offset >> PAGE_SHIFT;
> + hi = (table_offset + PCI_MSIX_ENTRY_SIZE * table_size + PAGE_SIZE - 1)
> + >> PAGE_SHIFT;
> + startp = start >> PAGE_SHIFT;
> + endp = (start + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + if (bir == vfio_offset_to_pci_space(start) &&
> + overlap(lo, hi, startp, endp)) {
> + printk(KERN_WARNING "%s: cannot write msi-x vectors\n",
> + __func__);
> + return -EINVAL;
> + }
> + return 0;
> +}


You can't just check MSI capability to figure out whether MSI
will work: this also depends on the host system in many ways.
The only way to really know is to request vectors in host. So
what we really need is an API that will reserve MSI vectors,
but not assign them in the device until guest asks us to
do it.

> +
> + if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) {
> + if (allow_unsafe_intrs) {
> + /* don't allow writes to msi-x vectors */
> + ret = vfio_msix_check(vdev, *ppos, count);

I thought we have an agreement here: it's useless to protect
msix vector space and at the same time allow DMA from device,
since device can do DMA write to trigger MSI.
So why do you keep this code around?


> + case VFIO_EVENTFD_MSI:
> + if (copy_from_user(&fd, uarg, sizeof fd))
> + return -EFAULT;
> + mutex_lock(&vdev->igate);
> + if (fd >= 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
> + ret = vfio_enable_msi(vdev, fd);
> + else if (fd < 0 && vdev->ev_msi)
> + vfio_disable_msi(vdev);
> + else
> + ret = -EINVAL;
> + mutex_unlock(&vdev->igate);
> + break;

I think that for virt we'll need multivector support for MSI.


> diff --git a/drivers/vfio/vfio_pci_config.c b/drivers/vfio/vfio_pci_config.c
> index e69de29..8bd5c00 100644
> --- a/drivers/vfio/vfio_pci_config.c
> +++ b/drivers/vfio/vfio_pci_config.c
> @@ -0,0 +1,605 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, [email protected]

Any hints on what this file does?

> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <[email protected]>
> + */

Tom so this file is some 600 lines of pretty tricky code,
that I still don't completely understand the purpose of.
For example, it prevents usespace from writing into readonly hardware
registers like class/revision id. But writing them is harmless.
As above, all code that does tricks with msi is also kind of useless.
Maybe we need protection against writing BAR registers, but
what about all the rest of it?
Leavng it pasted below so you can explain it to me.

Also, I am still unsure what purpose does the trick of "virtualization" serve.
For example, you return a fake device id for an sr/iov device.
Sounds a bit like forcing policy.
Wouldn't it be cleaner to simply add an ioctl to get the
fake id, thus making both the real value and the fake value
available.

On the other hand, the virtualization trick will prevent userspace
from doing things like restoring registers after a device specific reset.
See drivers/infiniband/hw/mthca/mthca_reset.c as one example.
For BARs, we can solve this by checking for BAR change and
restoring it.
Again, doing this might be better done through a special ioctl?


> +
> +#include <linux/fs.h>
> +#include <linux/pci.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +
> +#define PCI_CAP_ID_BASIC 0
> +#ifndef PCI_CAP_ID_MAX
> +#define PCI_CAP_ID_MAX PCI_CAP_ID_AF
> +#endif
> +
> +/*
> + * Lengths of PCI Config Capabilities
> + * 0 means unknown (but at least 4)
> + * FF means special/variable
> + */
> +static u8 pci_capability_length[] = {
> + [PCI_CAP_ID_BASIC] = 64, /* pci config header */
> + [PCI_CAP_ID_PM] = PCI_PM_SIZEOF,
> + [PCI_CAP_ID_AGP] = PCI_AGP_SIZEOF,
> + [PCI_CAP_ID_VPD] = 8,
> + [PCI_CAP_ID_SLOTID] = 4,
> + [PCI_CAP_ID_MSI] = 0xFF, /* 10, 14, 20, or 24 */
> + [PCI_CAP_ID_CHSWP] = 4,
> + [PCI_CAP_ID_PCIX] = 0xFF, /* 8 or 24 */
> + [PCI_CAP_ID_HT] = 28,
> + [PCI_CAP_ID_VNDR] = 0xFF,
> + [PCI_CAP_ID_DBG] = 0,
> + [PCI_CAP_ID_CCRC] = 0,
> + [PCI_CAP_ID_SHPC] = 0,
> + [PCI_CAP_ID_SSVID] = 0, /* bridge only - not supp */
> + [PCI_CAP_ID_AGP3] = 0,
> + [PCI_CAP_ID_EXP] = 36,
> + [PCI_CAP_ID_MSIX] = 12,
> + [PCI_CAP_ID_AF] = 6,
> +};

Maybe all these constants should go into pci_regs.h?
Please consider Cc Jesse/linux-pci.

> +
> +/*
> + * Read/Write Permission Bits - one bit for each bit in capability
> + * Any field can be read if it exists,
> + * but what is read depends on whether the field
> + * is 'virtualized', or just pass thru to the hardware.
> + * Any virtualized field is also virtualized for writes.
> + * Writes are only permitted if they have a 1 bit here.
> + */
> +struct perm_bits {
> + u32 rvirt; /* read bits which must be virtualized */
> + u32 write; /* writeable bits - virt if read virt */
> +};
> +
> +static struct perm_bits pci_cap_basic_perm[] = {
> + { 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
> + { 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */

don't want to virtualize mem/io?

> + { 0, 0, }, /* 0x08 class code & revision id */
> + { 0, 0xFF00FFFF, }, /* 0x0c bist, htype, lat, cache */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x24 bar */
> + { 0, 0, }, /* 0x28 cardbus - not yet */
> + { 0, 0, }, /* 0x2c subsys vendor & dev */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x30 rom bar */
> + { 0, 0, }, /* 0x34 capability ptr & resv */
> + { 0, 0, }, /* 0x38 resv */
> + { 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
> +};

Use .[] initializers above instead of sticking register name in comment?

> +
> +static struct perm_bits pci_cap_pm_perm[] = {
> + { 0, 0, }, /* 0x00 PM capabilities */
> + { 0, 0xFFFFFFFF, }, /* 0x04 PM control/status */
> +};
> +
> +static struct perm_bits pci_cap_vpd_perm[] = {
> + { 0, 0xFFFF0000, }, /* 0x00 address */
> + { 0, 0xFFFFFFFF, }, /* 0x04 data */
> +};
> +
> +static struct perm_bits pci_cap_slotid_perm[] = {
> + { 0, 0, }, /* 0x00 all read only */
> +};
> +
> +/* 4 different possible layouts of MSI capability */
> +static struct perm_bits pci_cap_msi_10_perm[] = {
> + { 0, 0, }, /* 0x00 MSI message control */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> + { 0x0000FFFF, 0x0000FFFF, }, /* 0x08 MSI message data */
> +};
> +static struct perm_bits pci_cap_msi_14_perm[] = {
> + { 0, 0, }, /* 0x00 MSI message control */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message upper addr */
> + { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */
> +};
> +static struct perm_bits pci_cap_msi_20_perm[] = {
> + { 0, 0, }, /* 0x00 MSI message control */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> + { 0x0000FFFF, 0x0000FFFF, }, /* 0x08 MSI message data */
> + { 0, 0xFFFFFFFF, }, /* 0x0c MSI mask bits */
> + { 0, 0xFFFFFFFF, }, /* 0x10 MSI pending bits */
> +};
> +static struct perm_bits pci_cap_msi_24_perm[] = {
> + { 0, 0, }, /* 0x00 MSI message control */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message upper addr */
> + { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */
> + { 0, 0xFFFFFFFF, }, /* 0x10 MSI mask bits */
> + { 0, 0xFFFFFFFF, }, /* 0x14 MSI pending bits */
> +};
> +
> +static struct perm_bits pci_cap_pcix_perm[] = {
> + { 0, 0xFFFF0000, }, /* 0x00 PCI_X_CMD */
> + { 0, 0, }, /* 0x04 PCI_X_STATUS */
> + { 0, 0xFFFFFFFF, }, /* 0x08 ECC ctlr & status */
> + { 0, 0, }, /* 0x0c ECC first addr */
> + { 0, 0, }, /* 0x10 ECC second addr */
> + { 0, 0, }, /* 0x14 ECC attr */
> +};
> +
> +/* pci express capabilities */
> +static struct perm_bits pci_cap_exp_perm[] = {
> + { 0, 0, }, /* 0x00 PCIe capabilities */
> + { 0, 0, }, /* 0x04 PCIe device capabilities */
> + { 0, 0xFFFFFFFF, }, /* 0x08 PCIe device control & status */
> + { 0, 0, }, /* 0x0c PCIe link capabilities */
> + { 0, 0x000000FF, }, /* 0x10 PCIe link ctl/stat - SAFE? */
> + { 0, 0, }, /* 0x14 PCIe slot capabilities */
> + { 0, 0x00FFFFFF, }, /* 0x18 PCIe link ctl/stat - SAFE? */
> + { 0, 0, }, /* 0x1c PCIe root port stuff */
> + { 0, 0, }, /* 0x20 PCIe root port stuff */
> +};
> +
> +static struct perm_bits pci_cap_msix_perm[] = {
> + { 0, 0, }, /* 0x00 MSI-X Enable */
> + { 0, 0, }, /* 0x04 table offset & bir */
> + { 0, 0, }, /* 0x08 pba offset & bir */
> +};
> +
> +static struct perm_bits pci_cap_af_perm[] = {
> + { 0, 0, }, /* 0x00 af capability */
> + { 0, 0x0001, }, /* 0x04 af flr bit */

So you let application reset the function?
If this happens, application will need to restore config space,
so virtualizing will create problems.

> +};
> +
> +static struct perm_bits *pci_cap_perms[] = {
> + [PCI_CAP_ID_BASIC] = pci_cap_basic_perm,
> + [PCI_CAP_ID_PM] = pci_cap_pm_perm,
> + [PCI_CAP_ID_VPD] = pci_cap_vpd_perm,
> + [PCI_CAP_ID_SLOTID] = pci_cap_slotid_perm,
> + [PCI_CAP_ID_MSI] = NULL, /* special */
> + [PCI_CAP_ID_PCIX] = pci_cap_pcix_perm,
> + [PCI_CAP_ID_EXP] = pci_cap_exp_perm,
> + [PCI_CAP_ID_MSIX] = pci_cap_msix_perm,
> + [PCI_CAP_ID_AF] = pci_cap_af_perm,
> +};
> +
> +static int pci_msi_cap_len(struct pci_dev *pdev, u8 pos)
> +{
> + int len;
> + int ret;
> + u16 flags;
> +
> + ret = pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &flags);
> + if (ret < 0)
> + return ret;
> + if (flags & PCI_MSI_FLAGS_64BIT)
> + len = 14;
> + else
> + len = 10;
> + if (flags & PCI_MSI_FLAGS_MASKBIT)
> + len += 10;
> + return len;
> +}
> +
> +/*
> + * We build a map of the config space that tells us where
> + * and what capabilities exist, so that we can map reads and
> + * writes back to capabilities, and thus figure out what to
> + * allow, deny, or virtualize
> + */
> +int vfio_build_config_map(struct vfio_dev *vdev)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + u8 *map;
> + int i, len;
> + u8 pos, cap, tmp;
> + u16 flags;
> + int ret;
> +#ifndef PCI_FIND_CAP_TTL
> +#define PCI_FIND_CAP_TTL 48
> +#endif
> + int loops = PCI_FIND_CAP_TTL;
> +
> + map = kmalloc(pdev->cfg_size, GFP_KERNEL);
> + if (map == NULL)
> + return -ENOMEM;
> + for (i = 0; i < pdev->cfg_size; i++)
> + map[i] = 0xFF;
> + vdev->pci_config_map = map;
> +
> + /* default config space */
> + for (i = 0; i < pci_capability_length[0]; i++)
> + map[i] = 0;
> +
> + /* any capabilities? */
> + ret = pci_read_config_word(pdev, PCI_STATUS, &flags);
> + if (ret < 0)
> + return ret;
> + if ((flags & PCI_STATUS_CAP_LIST) == 0)
> + return 0;
> +
> + ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos);
> + if (ret < 0)
> + return ret;
> + while (pos && --loops > 0) {
> + ret = pci_read_config_byte(pdev, pos, &cap);
> + if (ret < 0)
> + return ret;
> + if (cap == 0) {
> + printk(KERN_WARNING "%s: cap 0\n", __func__);
> + break;
> + }
> + if (cap > PCI_CAP_ID_MAX) {
> + printk(KERN_WARNING "%s: unknown pci capability id %x\n",
> + __func__, cap);
> + len = 0;

Why is this a problem? Devices will implement more capabilities.
How is access to an unknown one worse that access outside
any capability, or vendor specific, that you do allow?

> + } else
> + len = pci_capability_length[cap];
> + if (len == 0) {
> + printk(KERN_WARNING "%s: unknown length for pci cap %x\n",
> + __func__, cap);
> + len = 4;
> + }
> + if (len == 0xFF) {
> + switch (cap) {
> + case PCI_CAP_ID_MSI:
> + len = pci_msi_cap_len(pdev, pos);
> + if (len < 0)
> + return len;
> + break;
> + case PCI_CAP_ID_PCIX:
> + ret = pci_read_config_word(pdev, pos + 2,
> + &flags);
> + if (ret < 0)
> + return ret;
> + if (flags & 0x3000)
> + len = 24;
> + else
> + len = 8;
> + break;
> + case PCI_CAP_ID_VNDR:
> + /* length follows next field */
> + ret = pci_read_config_byte(pdev, pos + 2, &tmp);
> + if (ret < 0)
> + return ret;
> + len = tmp;
> + break;
> + default:
> + len = 0;
> + break;
> + }
> + }
> +
> + for (i = 0; i < len; i++) {
> + if (map[pos+i] != 0xFF)
> + printk(KERN_WARNING
> + "%s: pci config conflict at %x, "
> + "caps %x %x\n",
> + __func__, i, map[pos+i], cap);
> + map[pos+i] = cap;
> + }
> + ret = pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, &pos);
> + if (ret < 0)
> + return ret;
> + }
> + if (loops <= 0)
> + printk(KERN_ERR "%s: config space loop!\n", __func__);
> + return 0;
> +}
> +
> +static void vfio_virt_init(struct vfio_dev *vdev)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + int bar;
> + u32 *lp;
> + u32 val;
> + u8 pos;
> + int i, len;
> +
> + for (bar = 0; bar <= 5; bar++) {
> + lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> + pci_read_config_dword(pdev, PCI_BASE_ADDRESS_0 + 4*bar, &val);
> + *lp++ = val;
> + }
> + lp = (u32 *)vdev->vinfo.rombar;
> + pci_read_config_dword(pdev, PCI_ROM_ADDRESS, &val);
> + *lp = val;
> +
> + vdev->vinfo.intr = pdev->irq;
> +
> + pos = pci_find_capability(pdev, PCI_CAP_ID_MSI);
> + if (pos > 0) {
> + len = pci_msi_cap_len(pdev, pos);
> + if (len < 0)
> + return;
> + for (i = 0; i < len; i++)
> + (void) pci_read_config_byte(pdev, pos + i,
> + &vdev->vinfo.msi[i]);
> + }
> +}
> +
> +static void vfio_bar_fixup(struct vfio_dev *vdev)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + int bar;
> + u32 *lp;
> + u64 mask;
> +
> + for (bar = 0; bar <= 5; bar++) {
> + if (pci_resource_start(pdev, bar))
> + mask = ~(pci_resource_len(pdev, bar) - 1);
> + else
> + mask = 0;
> + lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> + *lp &= (u32)mask;
> +
> + if (pci_resource_flags(pdev, bar) & IORESOURCE_IO)
> + *lp |= PCI_BASE_ADDRESS_SPACE_IO;
> + else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
> + *lp |= PCI_BASE_ADDRESS_SPACE_MEMORY;
> + if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
> + *lp |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> + if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM_64) {
> + *lp |= PCI_BASE_ADDRESS_MEM_TYPE_64;
> + lp++;
> + *lp &= (u32)(mask >> 32);
> + bar++;
> + }
> + }
> + }
> +
> + lp = (u32 *)vdev->vinfo.rombar;
> + mask = ~(pci_resource_len(pdev, PCI_ROM_RESOURCE) - 1);
> + *lp &= (u32)mask | PCI_ROM_ADDRESS_ENABLE;
> +
> + vdev->vinfo.bardirty = 0;
> +}
> +
> +static int vfio_config_rwbyte(int write,
> + struct vfio_dev *vdev,
> + int pos,
> + char __user *buf)

Consider exporting from pci-sysfs.c instead of duplicating code.

> +{
> + struct pci_dev *pdev = vdev->pdev;
> + u8 *map = vdev->pci_config_map;
> + u8 cap, val, newval;
> + u16 start, off;
> + int p;
> + struct perm_bits *perm;
> + u8 wr, virt;
> + int ret;
> + int len;
> +
> + cap = map[pos];
> + if (cap == 0xFF) { /* unknown region */
> + if (write)
> + return 0; /* silent no-op */
> + val = 0;
> + if (pos <= pci_capability_length[0]) /* ok to read */
> + (void) pci_read_config_byte(pdev, pos, &val);
> + if (copy_to_user(buf, &val, 1))
> + return -EFAULT;
> + return 0;
> + }
> +
> + /* scan back to start of cap region */
> + for (p = pos; p >= 0; p--) {
> + if (map[p] != cap)
> + break;
> + start = p;
> + }
> + off = pos - start; /* offset within capability */
> +
> + perm = pci_cap_perms[cap];
> + if (cap == PCI_CAP_ID_MSI) {
> + len = pci_msi_cap_len(pdev, start);
> + switch (len) {
> + case 10:
> + perm = pci_cap_msi_10_perm;
> + break;
> + case 14:
> + perm = pci_cap_msi_14_perm;
> + break;
> + case 20:
> + perm = pci_cap_msi_20_perm;
> + break;
> + case 24:
> + perm = pci_cap_msi_24_perm;
> + break;
> + default:
> + perm = NULL;
> + break;
> + }
> + }
> + if (perm == NULL) {
> + wr = 0;
> + virt = 0;
> + } else {
> + perm += (off >> 2);
> + wr = perm->write >> ((off & 3) * 8);
> + virt = perm->rvirt >> ((off & 3) * 8);
> + }
> + if (write && !wr) /* no writeable bits */
> + return 0;
> + if (!virt) {
> + if (write) {
> + if (copy_from_user(&val, buf, 1))
> + return -EFAULT;
> + val &= wr;
> + if (wr != 0xFF) {
> + u8 existing;
> +
> + ret = pci_read_config_byte(pdev, pos,
> + &existing);
> + if (ret < 0)
> + return ret;
> + val |= (existing & ~wr);
> + }
> + pci_write_config_byte(pdev, pos, val);

I think this should be pci_user_write_config_dword and same for read.


> + } else {
> + ret = pci_read_config_byte(pdev, pos, &val);
> + if (ret < 0)
> + return ret;
> + if (copy_to_user(buf, &val, 1))
> + return -EFAULT;
> + }
> + return 0;
> + }
> +
> + if (write) {
> + if (copy_from_user(&newval, buf, 1))
> + return -EFAULT;
> + }
> + /*
> + * We get here if there are some virt bits
> + * handle remaining real bits, if any
> + */
> + if (~virt) {
> + u8 rbits = (~virt) & wr;
> +
> + ret = pci_read_config_byte(pdev, pos, &val);
> + if (ret < 0)
> + return ret;
> + if (write && rbits) {
> + val &= ~rbits;
> + newval &= rbits;
> + val |= newval;
> + pci_write_config_byte(pdev, pos, val);
> + }
> + }
> + /*
> + * Now handle entirely virtual fields
> + */
> + switch (cap) {
> + case PCI_CAP_ID_BASIC: /* virtualize BARs */
> + switch (off) {
> + /*
> + * vendor and device are virt because they don't
> + * show up otherwise for sr-iov vfs
> + */
> + case PCI_VENDOR_ID:
> + val = pdev->vendor;
> + break;
> + case PCI_VENDOR_ID + 1:
> + val = pdev->vendor >> 8;
> + break;
> + case PCI_DEVICE_ID:
> + val = pdev->device;
> + break;
> + case PCI_DEVICE_ID + 1:
> + val = pdev->device >> 8;
> + break;
> + case PCI_INTERRUPT_LINE:
> + if (write)
> + vdev->vinfo.intr = newval;
> + else
> + val = vdev->vinfo.intr;
> + break;
> + case PCI_ROM_ADDRESS:
> + case PCI_ROM_ADDRESS+1:
> + case PCI_ROM_ADDRESS+2:
> + case PCI_ROM_ADDRESS+3:
> + if (write) {
> + vdev->vinfo.rombar[off & 3] = newval;
> + vdev->vinfo.bardirty = 1;
> + } else {
> + if (vdev->vinfo.bardirty)
> + vfio_bar_fixup(vdev);
> + val = vdev->vinfo.rombar[off & 3];
> + }
> + break;
> + default:
> + if (off >= PCI_BASE_ADDRESS_0 &&
> + off <= PCI_BASE_ADDRESS_5 + 3) {
> + int boff = off - PCI_BASE_ADDRESS_0;
> +
> + if (write) {
> + vdev->vinfo.bar[boff] = newval;
> + vdev->vinfo.bardirty = 1;
> + } else {
> + if (vdev->vinfo.bardirty)
> + vfio_bar_fixup(vdev);
> + val = vdev->vinfo.bar[boff];
> + }
> + }
> + break;
> + }
> + break;
> + case PCI_CAP_ID_MSI: /* virtualize (parts of) MSI */
> + if (write)
> + vdev->vinfo.msi[off] = newval;
> + else
> + val = vdev->vinfo.msi[off];
> + break;
> + }
> + if (!write && copy_to_user(buf, &val, 1))
> + return -EFAULT;
> + return 0;
> +}
> +
> +ssize_t vfio_config_readwrite(int write,
> + struct vfio_dev *vdev,
> + char __user *buf,
> + size_t count,
> + loff_t *ppos)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + int done = 0;
> + int ret;
> + u16 pos;
> +
> +
> + if (vdev->pci_config_map == NULL) {
> + ret = vfio_build_config_map(vdev);
> + if (ret < 0)
> + goto out;
> + vfio_virt_init(vdev);
> + }
> +
> + while (count > 0) {
> + pos = *ppos;
> + if (pos == pdev->cfg_size)
> + break;
> + if (pos > pdev->cfg_size) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /*
> + * we grab the irqlock here to prevent confusing
> + * the read/modify/write sequence in vfio_interrupt
> + */
> + spin_lock_irq(&vdev->irqlock);
> + ret = vfio_config_rwbyte(write, vdev, pos, buf);
> + spin_unlock_irq(&vdev->irqlock);
> +
> + if (ret < 0)
> + goto out;
> + buf++;
> + done++;
> + count--;
> + (*ppos)++;
> + }
> + ret = done;
> +out:
> + return ret;
> +}
> diff --git a/drivers/vfio/vfio_rdwr.c b/drivers/vfio/vfio_rdwr.c
> index e69de29..f4bd2b7 100644
> --- a/drivers/vfio/vfio_rdwr.c
> +++ b/drivers/vfio/vfio_rdwr.c
> @@ -0,0 +1,152 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, [email protected]

Any hints on what this file does?

> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <[email protected]>
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/pci.h>
> +#include <linux/uaccess.h>
> +#include <linux/io.h>
> +
> +#include <linux/vfio.h>
> +
> +ssize_t vfio_io_readwrite(
> + int write,
> + struct vfio_dev *vdev,
> + char __user *buf,
> + size_t count,
> + loff_t *ppos)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + size_t done = 0;
> + resource_size_t end;
> + void __iomem *io;
> + loff_t pos;
> + int pci_space;
> + int unit;
> +
> + pci_space = vfio_offset_to_pci_space(*ppos);
> + pos = vfio_offset_to_pci_offset(*ppos);
> +
> + if (!pci_resource_start(pdev, pci_space))
> + return -EINVAL;
> + end = pci_resource_len(pdev, pci_space);
> + if (pos + count > end)
> + return -EINVAL;
> + if (vdev->bar[pci_space] == NULL)
> + vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
> + io = vdev->bar[pci_space];
> +
> + while (count > 0) {
> + if ((pos % 4) == 0 && count >= 4) {
> + u32 val;
> +
> + if (write) {
> + if (copy_from_user(&val, buf, 4))
> + return -EFAULT;
> + iowrite32(val, io + pos);
> + } else {
> + val = ioread32(io + pos);
> + if (copy_to_user(buf, &val, 4))
> + return -EFAULT;
> + }
> + unit = 4;
> + } else if ((pos % 2) == 0 && count >= 2) {
> + u16 val;
> +
> + if (write) {
> + if (copy_from_user(&val, buf, 2))
> + return -EFAULT;
> + iowrite16(val, io + pos);
> + } else {
> + val = ioread16(io + pos);
> + if (copy_to_user(buf, &val, 2))
> + return -EFAULT;
> + }
> + unit = 2;
> + } else {
> + u8 val;
> +
> + if (write) {
> + if (copy_from_user(&val, buf, 1))
> + return -EFAULT;
> + iowrite8(val, io + pos);
> + } else {
> + val = ioread8(io + pos);
> + if (copy_to_user(buf, &val, 1))
> + return -EFAULT;
> + }
> + unit = 1;
> + }
> + pos += unit;
> + buf += unit;
> + count -= unit;
> + done += unit;

Again, export from pci-sysfs.c?

> + }
> + *ppos += done;
> + return done;
> +}
> +
> +ssize_t vfio_mem_readwrite(
> + int write,
> + struct vfio_dev *vdev,
> + char __user *buf,
> + size_t count,
> + loff_t *ppos)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + resource_size_t end;
> + void __iomem *io;
> + loff_t pos;
> + int pci_space;
> +
> + pci_space = vfio_offset_to_pci_space(*ppos);
> + pos = vfio_offset_to_pci_offset(*ppos);
> +
> + if (!pci_resource_start(pdev, pci_space))
> + return -EINVAL;
> + end = pci_resource_len(pdev, pci_space);
> + if (vdev->bar[pci_space] == NULL)
> + vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
> + io = vdev->bar[pci_space];
> +
> + if (pos > end)
> + return -EINVAL;
> + if (pos == end)
> + return 0;
> + if (pos + count > end)
> + count = end - pos;
> + if (write) {
> + if (copy_from_user(io + pos, buf, count))
> + return -EFAULT;
> + } else {
> + if (copy_to_user(buf, io + pos, count))
> + return -EFAULT;
> + }
> + *ppos += count;
> + return count;
> +}
> diff --git a/drivers/vfio/vfio_sysfs.c b/drivers/vfio/vfio_sysfs.c
> index e69de29..6275809 100644
> --- a/drivers/vfio/vfio_sysfs.c
> +++ b/drivers/vfio/vfio_sysfs.c
> @@ -0,0 +1,153 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, [email protected]
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <[email protected]>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kobject.h>
> +#include <linux/sysfs.h>
> +#include <linux/mm.h>
> +#include <linux/fs.h>
> +#include <linux/pci.h>
> +#include <linux/mmu_notifier.h>
> +
> +#include <linux/vfio.h>
> +
> +struct vfio_class *vfio_class;
> +
> +int vfio_class_init(void)
> +{
> + int ret = 0;
> +
> + if (vfio_class != NULL) {
> + kref_get(&vfio_class->kref);
> + goto exit;
> + }
> +
> + vfio_class = kzalloc(sizeof(*vfio_class), GFP_KERNEL);
> + if (!vfio_class) {
> + ret = -ENOMEM;
> + goto err_kzalloc;
> + }
> +
> + kref_init(&vfio_class->kref);
> + vfio_class->class = class_create(THIS_MODULE, "vfio");
> + if (IS_ERR(vfio_class->class)) {
> + ret = IS_ERR(vfio_class->class);
> + printk(KERN_ERR "class_create failed for vfio\n");
> + goto err_class_create;
> + }
> + return 0;
> +
> +err_class_create:
> + kfree(vfio_class);
> + vfio_class = NULL;
> +err_kzalloc:
> +exit:
> + return ret;
> +}
> +
> +static void vfio_class_release(struct kref *kref)
> +{
> + /* Ok, we cheat as we know we only have one vfio_class */
> + class_destroy(vfio_class->class);
> + kfree(vfio_class);
> + vfio_class = NULL;
> +}
> +
> +void vfio_class_destroy(void)
> +{
> + if (vfio_class)
> + kref_put(&vfio_class->kref, vfio_class_release);
> +}
> +
> +static ssize_t config_map_read(struct kobject *kobj,
> + struct bin_attribute *bin_attr,
> + char *buf, loff_t off, size_t count)
> +{
> + struct vfio_dev *vdev = bin_attr->private;
> + int ret;
> +
> + if (off >= 256)
> + return 0;
> + if (off + count > 256)
> + count = 256 - off;
> + if (vdev->pci_config_map == NULL) {
> + ret = vfio_build_config_map(vdev);
> + if (ret < 0)
> + return ret;
> + }
> + memcpy(buf, vdev->pci_config_map + off, count);
> + return count;
> +}
> +
> +static ssize_t show_locked_pages(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct vfio_dev *vdev = dev_get_drvdata(dev);
> +
> + if (vdev == NULL)
> + return -ENODEV;
> + return sprintf(buf, "%u\n", vdev->locked_pages);
> +}
> +
> +static DEVICE_ATTR(locked_pages, S_IRUGO, show_locked_pages, NULL);
> +
> +static struct attribute *vfio_attrs[] = {
> + &dev_attr_locked_pages.attr,
> + NULL,
> +};
> +
> +static struct attribute_group vfio_attr_grp = {
> + .attrs = vfio_attrs,
> +};
> +
> +static struct bin_attribute config_map_bin_attribute = {
> + .attr = {
> + .name = "config_map",
> + .mode = S_IRUGO,
> + },
> + .size = 256,
> + .read = config_map_read,
> +};

The config map looks like an internal implementation detail
of the driver. Right? If so exposing it will tie us to
this forever, so not a good idea.

> +
> +int vfio_dev_add_attributes(struct vfio_dev *vdev)
> +{
> + struct bin_attribute *bi;
> + int ret;
> +
> + ret = sysfs_create_group(&vdev->dev->kobj, &vfio_attr_grp);
> + if (ret)
> + return ret;
> + bi = kmalloc(sizeof(*bi), GFP_KERNEL);
> + if (bi == NULL)
> + return -ENOMEM;
> + *bi = config_map_bin_attribute;
> + bi->private = vdev;
> + return sysfs_create_bin_file(&vdev->dev->kobj, bi);
> +}
> diff --git a/include/linux/Kbuild b/include/linux/Kbuild
> index e2ea0b2..ed37cf4 100644
> --- a/include/linux/Kbuild
> +++ b/include/linux/Kbuild
> @@ -166,6 +166,7 @@ header-y += ultrasound.h
> header-y += un.h
> header-y += utime.h
> header-y += veth.h
> +header-y += vfio.h
> header-y += videotext.h
> header-y += x25.h
>
> diff --git a/include/linux/uiommu.h b/include/linux/uiommu.h
> index e69de29..a7b7eac 100644
> --- a/include/linux/uiommu.h
> +++ b/include/linux/uiommu.h
> @@ -0,0 +1,76 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, [email protected]
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +/*
> + * uiommu driver - manipulation of iommu domains from user progs
> + */
> +struct uiommu_domain {
> + struct iommu_domain *domain;
> + atomic_t refcnt;
> +};
> +
> +/*
> + * Kernel routines invoked by fellow driver (vfio)
> + * after uiommu domain fd is passed in.
> + */
> +struct uiommu_domain *uiommu_fdget(int fd);
> +void uiommu_put(struct uiommu_domain *);
> +
> +/*
> + * These inlines are placeholders for future routines
> + * which may keep statistics, show info in sysfs, etc.
> + */
> +static inline int uiommu_attach_device(struct uiommu_domain *udomain,
> + struct device *dev)
> +{
> + return iommu_attach_device(udomain->domain, dev);
> +}
> +
> +static inline void uiommu_detach_device(struct uiommu_domain *udomain,
> + struct device *dev)
> +{
> + iommu_detach_device(udomain->domain, dev);
> +}
> +
> +static inline int uiommu_map_range(struct uiommu_domain *udomain,
> + unsigned long iova,
> + phys_addr_t paddr,
> + size_t size,
> + int prot)
> +{
> + return iommu_map_range(udomain->domain, iova, paddr, size, prot);
> +}
> +
> +static inline void uiommu_unmap_range(struct uiommu_domain *udomain,
> + unsigned long iova,
> + size_t size)
> +{
> + iommu_unmap_range(udomain->domain, iova, size);
> +}
> +
> +static inline phys_addr_t uiommu_iova_to_phys(struct uiommu_domain *udomain,
> + unsigned long iova)
> +{
> + return iommu_iova_to_phys(udomain->domain, iova);
> +}
> +
> +static inline int uiommu_domain_has_cap(struct uiommu_domain *udomain,
> + unsigned long cap)
> +{
> + return iommu_domain_has_cap(udomain->domain, cap);
> +}
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index e69de29..52aa0dd 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -0,0 +1,202 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, [email protected]
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <[email protected]>
> + */
> +#include <linux/types.h>
> +
> +/*
> + * VFIO driver - allow mapping and use of certain PCI devices
> + * in unprivileged user processes. (If IOMMU is present)
> + * Especially useful for Virtual Function parts of SR-IOV devices
> + */
> +
> +#ifdef __KERNEL__
> +
> +struct vfio_dev {
> + struct device *dev;
> + struct pci_dev *pdev;
> + u8 *pci_config_map;
> + int pci_config_size;
> + char name[8];
> + int devnum;
> + void __iomem *bar[PCI_ROM_RESOURCE+1];
> + spinlock_t irqlock; /* guards command register accesses */
> + int listeners;
> + u32 locked_pages;
> + struct mutex lgate; /* listener gate */
> + struct mutex dgate; /* dma op gate */
> + struct mutex igate; /* intr op gate */
> + wait_queue_head_t dev_idle_q;
> + int mapcount;
> + struct uiommu_domain *udomain;
> + struct msix_entry *msix;
> + int nvec;
> + int cachec;
> + struct eventfd_ctx *ev_irq;
> + struct eventfd_ctx *ev_msi;
> + struct eventfd_ctx **ev_msix;
> + struct {
> + u8 intr;
> + u8 bardirty;
> + u8 rombar[4];
> + u8 bar[6*4];
> + u8 msi[24];
> + } vinfo;
> +};
> +
> +struct vfio_listener {
> + struct vfio_dev *vdev;
> + struct list_head dm_list;
> + struct mm_struct *mm;
> + struct mmu_notifier mmu_notifier;
> +};
> +
> +/*
> + * Structure for keeping track of memory nailed down by the
> + * user for DMA
> + */
> +struct dma_map_page {
> + struct list_head list;
> + struct page **pages;
> + dma_addr_t daddr;
> + unsigned long vaddr;
> + int npage;
> + int rdwr;
> +};
> +
> +/* VFIO class infrastructure */
> +struct vfio_class {
> + struct kref kref;
> + struct class *class;
> +};
> +extern struct vfio_class *vfio_class;
> +
> +ssize_t vfio_io_readwrite(int, struct vfio_dev *,
> + char __user *, size_t, loff_t *);
> +ssize_t vfio_mem_readwrite(int, struct vfio_dev *,
> + char __user *, size_t, loff_t *);
> +ssize_t vfio_config_readwrite(int, struct vfio_dev *,
> + char __user *, size_t, loff_t *);
> +
> +void vfio_disable_msi(struct vfio_dev *);
> +void vfio_disable_msix(struct vfio_dev *);
> +int vfio_enable_msi(struct vfio_dev *, int);
> +int vfio_enable_msix(struct vfio_dev *, int, void __user *);
> +
> +#ifndef PCI_MSIX_ENTRY_SIZE
> +#define PCI_MSIX_ENTRY_SIZE 16
> +#endif
> +#ifndef PCI_STATUS_INTERRUPT
> +#define PCI_STATUS_INTERRUPT 0x08
> +#endif
> +
> +struct vfio_dma_map;
> +void vfio_dma_unmapall(struct vfio_listener *);
> +int vfio_dma_unmap_dm(struct vfio_listener *, struct vfio_dma_map *);
> +int vfio_dma_map_common(struct vfio_listener *, unsigned int,
> + struct vfio_dma_map *);
> +int vfio_domain_set(struct vfio_dev *, int, int);
> +int vfio_domain_unset(struct vfio_dev *);
> +
> +int vfio_class_init(void);
> +void vfio_class_destroy(void);
> +int vfio_dev_add_attributes(struct vfio_dev *);
> +int vfio_build_config_map(struct vfio_dev *);
> +
> +irqreturn_t vfio_interrupt(int, void *);
> +
> +#endif /* __KERNEL__ */
> +
> +/* Kernel & User level defines for ioctls */
> +
> +/*
> + * Structure for DMA mapping of user buffers
> + * vaddr, dmaaddr, and size must all be page aligned
> + * buffer may only be larger than 1 page if (a) there is
> + * an iommu in the system, or (b) buffer is part of a huge page
> + */
> +struct vfio_dma_map {
> + __u64 vaddr; /* process virtual addr */
> + __u64 dmaaddr; /* desired and/or returned dma address */
> + __u64 size; /* size in bytes */
> + __u64 flags; /* bool: 0 for r/o; 1 for r/w */
> +#define VFIO_FLAG_WRITE 0x1 /* req writeable DMA mem */
> +};
> +
> +/* map user pages at specific dma address */
> +/* requires previous VFIO_DOMAIN_SET */
> +#define VFIO_DMA_MAP_IOVA _IOWR(';', 101, struct vfio_dma_map)
> +
> +/* unmap user pages */
> +#define VFIO_DMA_UNMAP _IOW(';', 102, struct vfio_dma_map)
> +
> +/* request IRQ interrupts; use given eventfd */
> +#define VFIO_EVENTFD_IRQ _IOW(';', 103, int)
> +
> +/* request MSI interrupts; use given eventfd */
> +#define VFIO_EVENTFD_MSI _IOW(';', 104, int)
> +
> +/* Request MSI-X interrupts: arg[0] is #, arg[1-n] are eventfds */
> +#define VFIO_EVENTFDS_MSIX _IOW(';', 105, int)
> +
> +/* Get length of a BAR */
> +#define VFIO_BAR_LEN _IOWR(';', 167, __u32)
> +
> +/* Set the IOMMU domain - arg is fd from uiommu driver */
> +#define VFIO_DOMAIN_SET _IOW(';', 107, int)
> +
> +/* Unset the IOMMU domain */
> +#define VFIO_DOMAIN_UNSET _IO(';', 108)
> +
> +/*
> + * Reads, writes, and mmaps determine which PCI BAR (or config space)
> + * from the high level bits of the file offset
> + */
> +#define VFIO_PCI_BAR0_RESOURCE 0x0
> +#define VFIO_PCI_BAR1_RESOURCE 0x1
> +#define VFIO_PCI_BAR2_RESOURCE 0x2
> +#define VFIO_PCI_BAR3_RESOURCE 0x3
> +#define VFIO_PCI_BAR4_RESOURCE 0x4
> +#define VFIO_PCI_BAR5_RESOURCE 0x5

This looks wrong.
In the code I snipped, we had:
+ pci_space = vfio_offset_to_pci_space(*ppos);

So this is actually used as resource number, not BAR number.
One or the other will need to get fixed, otherwise
with 64 bit BAR0, you will get the wrong resource.

> +#define VFIO_PCI_ROM_RESOURCE 0x6
> +#define VFIO_PCI_CONFIG_RESOURCE 0xF
> +#define VFIO_PCI_SPACE_SHIFT 32
> +#define VFIO_PCI_CONFIG_OFF vfio_pci_space_to_offset(VFIO_PCI_CONFIG_RESOURCE)
> +
> +static inline int vfio_offset_to_pci_space(__u64 off)
> +{
> + return (off >> VFIO_PCI_SPACE_SHIFT) & 0xF;
> +}
> +
> +static inline u32 vfio_offset_to_pci_offset(__u64 off)
> +{
> + return off & (u32)0xFFFFFFFF;
> +}
> +
> +static inline __u64 vfio_pci_space_to_offset(int sp)
> +{
> + return (__u64)(sp) << VFIO_PCI_SPACE_SHIFT;
> +}

2010-07-19 04:57:27

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

Hi Tom, Michael,

Comments for both of you below. Tom, what does this build against? Are
we still on 2.6.34?

On Sun, 2010-07-18 at 12:39 +0300, Michael S. Tsirkin wrote:
> Hi Tom,
> > ---
> > In this version:
> >
> > There are lots of bug fixes and cleanups in this version, but the main
> > change is to check to make sure that the IOMMU has interrupt remapping
> > enabled, which is necessary to prevent user level code from triggering
> > spurious interrupts for other devices. Since most platforms today
> > do not have the necessary hardware and/or software for this, a module
> > option can override this check, thus making vfio useful (but not safe)
> > on many more platforms.
> >
> > In the next version I plan to add kernel to user messaging using the
> > generic netlink mechanism to allow the user driver to react to hot add
> > and remove, and power management requests.
> >
> > Blurb from version 2:
> >
> > This version now requires an IOMMU domain to be set before any access to
> > device registers is granted (except that config space may be read). In
> > addition, the VFIO_DMA_MAP_ANYWHERE is dropped - it used the dma_map_sg API
> > which does not have sufficient controls around IOMMU usage. The IOMMU domain
> > is obtained from the 'uiommu' driver which is included in this patch.
> >
> > Various locking, security, and documentation issues have also been fixed.
> >
>
> I think this is making nice progress, especially good to see
> some effort to address the interrupt remapping issue.
> I just realized we also have an issue with determining
> the MSI capability and allocating entries. Some ideas
> on addressing this posted below.
> I also think it might make sense to involve the pci crowd here.
>
> It might be nice to see a bit of documentation for the interface
> presented to userspace. It is not always obvious.
> For example, passing CONFIG as offset to
> write or read will not always trigger access to config space.
>
> More comments below.
>
>
> On Fri, Jul 16, 2010 at 02:58:48PM -0700, Tom Lyon wrote:
> > +static int vfio_msix_check(struct vfio_dev *vdev, u64 start, u32 len)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + u16 pos;
> > + u32 table_offset;
> > + u16 table_size;
> > + u8 bir;
> > + u32 lo, hi, startp, endp;
> > +
> > + pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
> > + if (!pos)
> > + return 0;
> > +
> > + pci_read_config_word(pdev, pos + PCI_MSIX_FLAGS, &table_size);
> > + table_size = (table_size & PCI_MSIX_FLAGS_QSIZE) + 1;
> > + pci_read_config_dword(pdev, pos + 4, &table_offset);
> > + bir = table_offset & PCI_MSIX_FLAGS_BIRMASK;
> > + lo = table_offset >> PAGE_SHIFT;
> > + hi = (table_offset + PCI_MSIX_ENTRY_SIZE * table_size + PAGE_SIZE - 1)
> > + >> PAGE_SHIFT;
> > + startp = start >> PAGE_SHIFT;
> > + endp = (start + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > + if (bir == vfio_offset_to_pci_space(start) &&
> > + overlap(lo, hi, startp, endp)) {
> > + printk(KERN_WARNING "%s: cannot write msi-x vectors\n",
> > + __func__);
> > + return -EINVAL;
> > + }
> > + return 0;
> > +}
>
>
> You can't just check MSI capability to figure out whether MSI
> will work: this also depends on the host system in many ways.
> The only way to really know is to request vectors in host. So
> what we really need is an API that will reserve MSI vectors,
> but not assign them in the device until guest asks us to
> do it.
>
> > +
> > + if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) {
> > + if (allow_unsafe_intrs) {
> > + /* don't allow writes to msi-x vectors */
> > + ret = vfio_msix_check(vdev, *ppos, count);
>
> I thought we have an agreement here: it's useless to protect
> msix vector space and at the same time allow DMA from device,
> since device can do DMA write to trigger MSI.
> So why do you keep this code around?
>
>
> > + case VFIO_EVENTFD_MSI:
> > + if (copy_from_user(&fd, uarg, sizeof fd))
> > + return -EFAULT;
> > + mutex_lock(&vdev->igate);
> > + if (fd >= 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
> > + ret = vfio_enable_msi(vdev, fd);
> > + else if (fd < 0 && vdev->ev_msi)
> > + vfio_disable_msi(vdev);
> > + else
> > + ret = -EINVAL;
> > + mutex_unlock(&vdev->igate);
> > + break;
>
> I think that for virt we'll need multivector support for MSI.
>
>
> > diff --git a/drivers/vfio/vfio_pci_config.c b/drivers/vfio/vfio_pci_config.c
> > index e69de29..8bd5c00 100644
> > --- a/drivers/vfio/vfio_pci_config.c
> > +++ b/drivers/vfio/vfio_pci_config.c
> > @@ -0,0 +1,605 @@
> > +/*
> > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > + * Author: Tom Lyon, [email protected]
>
> Any hints on what this file does?
>
> > + *
> > + * This program is free software; you may redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; version 2 of the License.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > + * SOFTWARE.
> > + *
> > + * Portions derived from drivers/uio/uio.c:
> > + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> > + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> > + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> > + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> > + *
> > + * Portions derived from drivers/uio/uio_pci_generic.c:
> > + * Copyright (C) 2009 Red Hat, Inc.
> > + * Author: Michael S. Tsirkin <[email protected]>
> > + */
>
> Tom so this file is some 600 lines of pretty tricky code,
> that I still don't completely understand the purpose of.
> For example, it prevents usespace from writing into readonly hardware
> registers like class/revision id. But writing them is harmless.
> As above, all code that does tricks with msi is also kind of useless.
> Maybe we need protection against writing BAR registers, but
> what about all the rest of it?
> Leavng it pasted below so you can explain it to me.
>
> Also, I am still unsure what purpose does the trick of "virtualization" serve.
> For example, you return a fake device id for an sr/iov device.
> Sounds a bit like forcing policy.
> Wouldn't it be cleaner to simply add an ioctl to get the
> fake id, thus making both the real value and the fake value
> available.

Is there a better default for this example? Linux has already figured
out what the vendor and device IDs are for the VF, why burden every
userspace caller to go figure out what's in pcisysfs and do the same
trick? If userspace wants to expose something different, they're free
to trap these config offsets and return whatever they please. If
userspace wants direct access, they can use pcisysfs.

> On the other hand, the virtualization trick will prevent userspace
> from doing things like restoring registers after a device specific reset.
> See drivers/infiniband/hw/mthca/mthca_reset.c as one example.
> For BARs, we can solve this by checking for BAR change and
> restoring it.
> Again, doing this might be better done through a special ioctl?

A reset ioctl may make sense, but what would we lose if we simply
trapped the flr write in vfio and issued a device reset?

> > +
> > +#include <linux/fs.h>
> > +#include <linux/pci.h>
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/vfio.h>
> > +
> > +#define PCI_CAP_ID_BASIC 0
> > +#ifndef PCI_CAP_ID_MAX
> > +#define PCI_CAP_ID_MAX PCI_CAP_ID_AF
> > +#endif
> > +
> > +/*
> > + * Lengths of PCI Config Capabilities
> > + * 0 means unknown (but at least 4)
> > + * FF means special/variable
> > + */
> > +static u8 pci_capability_length[] = {
> > + [PCI_CAP_ID_BASIC] = 64, /* pci config header */
> > + [PCI_CAP_ID_PM] = PCI_PM_SIZEOF,
> > + [PCI_CAP_ID_AGP] = PCI_AGP_SIZEOF,
> > + [PCI_CAP_ID_VPD] = 8,
> > + [PCI_CAP_ID_SLOTID] = 4,
> > + [PCI_CAP_ID_MSI] = 0xFF, /* 10, 14, 20, or 24 */
> > + [PCI_CAP_ID_CHSWP] = 4,
> > + [PCI_CAP_ID_PCIX] = 0xFF, /* 8 or 24 */
> > + [PCI_CAP_ID_HT] = 28,
> > + [PCI_CAP_ID_VNDR] = 0xFF,
> > + [PCI_CAP_ID_DBG] = 0,
> > + [PCI_CAP_ID_CCRC] = 0,
> > + [PCI_CAP_ID_SHPC] = 0,
> > + [PCI_CAP_ID_SSVID] = 0, /* bridge only - not supp */
> > + [PCI_CAP_ID_AGP3] = 0,
> > + [PCI_CAP_ID_EXP] = 36,
> > + [PCI_CAP_ID_MSIX] = 12,
> > + [PCI_CAP_ID_AF] = 6,
> > +};
>
> Maybe all these constants should go into pci_regs.h?
> Please consider Cc Jesse/linux-pci.
>
> > +
> > +/*
> > + * Read/Write Permission Bits - one bit for each bit in capability
> > + * Any field can be read if it exists,
> > + * but what is read depends on whether the field
> > + * is 'virtualized', or just pass thru to the hardware.
> > + * Any virtualized field is also virtualized for writes.
> > + * Writes are only permitted if they have a 1 bit here.
> > + */
> > +struct perm_bits {
> > + u32 rvirt; /* read bits which must be virtualized */
> > + u32 write; /* writeable bits - virt if read virt */
> > +};
> > +
> > +static struct perm_bits pci_cap_basic_perm[] = {
> > + { 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
> > + { 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
>
> don't want to virtualize mem/io?

This seems like a hole to me too. I haven't looked in the spec for this
issue, but an 82576 VF doesn't have working mem/io bits, so I end up
falling back to qemu for the command register.

> > + { 0, 0, }, /* 0x08 class code & revision id */
> > + { 0, 0xFF00FFFF, }, /* 0x0c bist, htype, lat, cache */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x24 bar */
> > + { 0, 0, }, /* 0x28 cardbus - not yet */
> > + { 0, 0, }, /* 0x2c subsys vendor & dev */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x30 rom bar */
> > + { 0, 0, }, /* 0x34 capability ptr & resv */
> > + { 0, 0, }, /* 0x38 resv */
> > + { 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
> > +};
>
> Use .[] initializers above instead of sticking register name in comment?
>
> > +
> > +static struct perm_bits pci_cap_pm_perm[] = {
> > + { 0, 0, }, /* 0x00 PM capabilities */
> > + { 0, 0xFFFFFFFF, }, /* 0x04 PM control/status */
> > +};
> > +
> > +static struct perm_bits pci_cap_vpd_perm[] = {
> > + { 0, 0xFFFF0000, }, /* 0x00 address */
> > + { 0, 0xFFFFFFFF, }, /* 0x04 data */
> > +};
> > +
> > +static struct perm_bits pci_cap_slotid_perm[] = {
> > + { 0, 0, }, /* 0x00 all read only */
> > +};
> > +
> > +/* 4 different possible layouts of MSI capability */
> > +static struct perm_bits pci_cap_msi_10_perm[] = {
> > + { 0, 0, }, /* 0x00 MSI message control */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> > + { 0x0000FFFF, 0x0000FFFF, }, /* 0x08 MSI message data */
> > +};
> > +static struct perm_bits pci_cap_msi_14_perm[] = {
> > + { 0, 0, }, /* 0x00 MSI message control */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message upper addr */
> > + { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */
> > +};
> > +static struct perm_bits pci_cap_msi_20_perm[] = {
> > + { 0, 0, }, /* 0x00 MSI message control */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> > + { 0x0000FFFF, 0x0000FFFF, }, /* 0x08 MSI message data */
> > + { 0, 0xFFFFFFFF, }, /* 0x0c MSI mask bits */
> > + { 0, 0xFFFFFFFF, }, /* 0x10 MSI pending bits */
> > +};
> > +static struct perm_bits pci_cap_msi_24_perm[] = {
> > + { 0, 0, }, /* 0x00 MSI message control */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message upper addr */
> > + { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */
> > + { 0, 0xFFFFFFFF, }, /* 0x10 MSI mask bits */
> > + { 0, 0xFFFFFFFF, }, /* 0x14 MSI pending bits */
> > +};
> > +
> > +static struct perm_bits pci_cap_pcix_perm[] = {
> > + { 0, 0xFFFF0000, }, /* 0x00 PCI_X_CMD */
> > + { 0, 0, }, /* 0x04 PCI_X_STATUS */
> > + { 0, 0xFFFFFFFF, }, /* 0x08 ECC ctlr & status */
> > + { 0, 0, }, /* 0x0c ECC first addr */
> > + { 0, 0, }, /* 0x10 ECC second addr */
> > + { 0, 0, }, /* 0x14 ECC attr */
> > +};
> > +
> > +/* pci express capabilities */
> > +static struct perm_bits pci_cap_exp_perm[] = {
> > + { 0, 0, }, /* 0x00 PCIe capabilities */
> > + { 0, 0, }, /* 0x04 PCIe device capabilities */
> > + { 0, 0xFFFFFFFF, }, /* 0x08 PCIe device control & status */
> > + { 0, 0, }, /* 0x0c PCIe link capabilities */
> > + { 0, 0x000000FF, }, /* 0x10 PCIe link ctl/stat - SAFE? */
> > + { 0, 0, }, /* 0x14 PCIe slot capabilities */
> > + { 0, 0x00FFFFFF, }, /* 0x18 PCIe link ctl/stat - SAFE? */
> > + { 0, 0, }, /* 0x1c PCIe root port stuff */
> > + { 0, 0, }, /* 0x20 PCIe root port stuff */
> > +};
> > +
> > +static struct perm_bits pci_cap_msix_perm[] = {
> > + { 0, 0, }, /* 0x00 MSI-X Enable */
> > + { 0, 0, }, /* 0x04 table offset & bir */
> > + { 0, 0, }, /* 0x08 pba offset & bir */
> > +};
> > +
> > +static struct perm_bits pci_cap_af_perm[] = {
> > + { 0, 0, }, /* 0x00 af capability */
> > + { 0, 0x0001, }, /* 0x04 af flr bit */
>
> So you let application reset the function?
> If this happens, application will need to restore config space,
> so virtualizing will create problems.
>
> > +};
> > +
> > +static struct perm_bits *pci_cap_perms[] = {
> > + [PCI_CAP_ID_BASIC] = pci_cap_basic_perm,
> > + [PCI_CAP_ID_PM] = pci_cap_pm_perm,
> > + [PCI_CAP_ID_VPD] = pci_cap_vpd_perm,
> > + [PCI_CAP_ID_SLOTID] = pci_cap_slotid_perm,
> > + [PCI_CAP_ID_MSI] = NULL, /* special */
> > + [PCI_CAP_ID_PCIX] = pci_cap_pcix_perm,
> > + [PCI_CAP_ID_EXP] = pci_cap_exp_perm,
> > + [PCI_CAP_ID_MSIX] = pci_cap_msix_perm,
> > + [PCI_CAP_ID_AF] = pci_cap_af_perm,
> > +};
> > +
> > +static int pci_msi_cap_len(struct pci_dev *pdev, u8 pos)
> > +{
> > + int len;
> > + int ret;
> > + u16 flags;
> > +
> > + ret = pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &flags);
> > + if (ret < 0)
> > + return ret;
> > + if (flags & PCI_MSI_FLAGS_64BIT)
> > + len = 14;
> > + else
> > + len = 10;
> > + if (flags & PCI_MSI_FLAGS_MASKBIT)
> > + len += 10;
> > + return len;
> > +}
> > +
> > +/*
> > + * We build a map of the config space that tells us where
> > + * and what capabilities exist, so that we can map reads and
> > + * writes back to capabilities, and thus figure out what to
> > + * allow, deny, or virtualize
> > + */
> > +int vfio_build_config_map(struct vfio_dev *vdev)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + u8 *map;
> > + int i, len;
> > + u8 pos, cap, tmp;
> > + u16 flags;
> > + int ret;
> > +#ifndef PCI_FIND_CAP_TTL
> > +#define PCI_FIND_CAP_TTL 48
> > +#endif
> > + int loops = PCI_FIND_CAP_TTL;
> > +
> > + map = kmalloc(pdev->cfg_size, GFP_KERNEL);
> > + if (map == NULL)
> > + return -ENOMEM;
> > + for (i = 0; i < pdev->cfg_size; i++)
> > + map[i] = 0xFF;
> > + vdev->pci_config_map = map;
> > +
> > + /* default config space */
> > + for (i = 0; i < pci_capability_length[0]; i++)
> > + map[i] = 0;
> > +
> > + /* any capabilities? */
> > + ret = pci_read_config_word(pdev, PCI_STATUS, &flags);
> > + if (ret < 0)
> > + return ret;
> > + if ((flags & PCI_STATUS_CAP_LIST) == 0)
> > + return 0;
> > +
> > + ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos);
> > + if (ret < 0)
> > + return ret;
> > + while (pos && --loops > 0) {
> > + ret = pci_read_config_byte(pdev, pos, &cap);
> > + if (ret < 0)
> > + return ret;
> > + if (cap == 0) {
> > + printk(KERN_WARNING "%s: cap 0\n", __func__);
> > + break;
> > + }
> > + if (cap > PCI_CAP_ID_MAX) {
> > + printk(KERN_WARNING "%s: unknown pci capability id %x\n",
> > + __func__, cap);
> > + len = 0;
>
> Why is this a problem? Devices will implement more capabilities.
> How is access to an unknown one worse that access outside
> any capability, or vendor specific, that you do allow?

I believe the code doesn't allow access outside of known config space or
capabilities. It does seem like we could simply skip unknown caps
though, but that would require virtualizing the capability next pointer,
which isn't very amenable to how virtualized reads are currently done.
We'd need something more like how qemu virtualizes config space for
that.

> > + } else
> > + len = pci_capability_length[cap];
> > + if (len == 0) {
> > + printk(KERN_WARNING "%s: unknown length for pci cap %x\n",
> > + __func__, cap);
> > + len = 4;
> > + }
> > + if (len == 0xFF) {
> > + switch (cap) {
> > + case PCI_CAP_ID_MSI:
> > + len = pci_msi_cap_len(pdev, pos);
> > + if (len < 0)
> > + return len;
> > + break;
> > + case PCI_CAP_ID_PCIX:
> > + ret = pci_read_config_word(pdev, pos + 2,
> > + &flags);
> > + if (ret < 0)
> > + return ret;
> > + if (flags & 0x3000)
> > + len = 24;
> > + else
> > + len = 8;
> > + break;
> > + case PCI_CAP_ID_VNDR:
> > + /* length follows next field */
> > + ret = pci_read_config_byte(pdev, pos + 2, &tmp);
> > + if (ret < 0)
> > + return ret;
> > + len = tmp;
> > + break;
> > + default:
> > + len = 0;
> > + break;
> > + }
> > + }
> > +
> > + for (i = 0; i < len; i++) {
> > + if (map[pos+i] != 0xFF)
> > + printk(KERN_WARNING
> > + "%s: pci config conflict at %x, "
> > + "caps %x %x\n",
> > + __func__, i, map[pos+i], cap);
> > + map[pos+i] = cap;
> > + }
> > + ret = pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, &pos);
> > + if (ret < 0)
> > + return ret;
> > + }
> > + if (loops <= 0)
> > + printk(KERN_ERR "%s: config space loop!\n", __func__);
> > + return 0;
> > +}
> > +
> > +static void vfio_virt_init(struct vfio_dev *vdev)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + int bar;
> > + u32 *lp;
> > + u32 val;
> > + u8 pos;
> > + int i, len;
> > +
> > + for (bar = 0; bar <= 5; bar++) {
> > + lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> > + pci_read_config_dword(pdev, PCI_BASE_ADDRESS_0 + 4*bar, &val);
> > + *lp++ = val;
> > + }
> > + lp = (u32 *)vdev->vinfo.rombar;
> > + pci_read_config_dword(pdev, PCI_ROM_ADDRESS, &val);
> > + *lp = val;
> > +
> > + vdev->vinfo.intr = pdev->irq;
> > +
> > + pos = pci_find_capability(pdev, PCI_CAP_ID_MSI);
> > + if (pos > 0) {
> > + len = pci_msi_cap_len(pdev, pos);
> > + if (len < 0)
> > + return;
> > + for (i = 0; i < len; i++)
> > + (void) pci_read_config_byte(pdev, pos + i,
> > + &vdev->vinfo.msi[i]);
> > + }
> > +}
> > +
> > +static void vfio_bar_fixup(struct vfio_dev *vdev)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + int bar;
> > + u32 *lp;
> > + u64 mask;
> > +
> > + for (bar = 0; bar <= 5; bar++) {
> > + if (pci_resource_start(pdev, bar))
> > + mask = ~(pci_resource_len(pdev, bar) - 1);
> > + else
> > + mask = 0;
> > + lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> > + *lp &= (u32)mask;
> > +
> > + if (pci_resource_flags(pdev, bar) & IORESOURCE_IO)
> > + *lp |= PCI_BASE_ADDRESS_SPACE_IO;
> > + else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
> > + *lp |= PCI_BASE_ADDRESS_SPACE_MEMORY;
> > + if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
> > + *lp |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> > + if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM_64) {
> > + *lp |= PCI_BASE_ADDRESS_MEM_TYPE_64;
> > + lp++;
> > + *lp &= (u32)(mask >> 32);
> > + bar++;
> > + }
> > + }
> > + }
> > +
> > + lp = (u32 *)vdev->vinfo.rombar;
> > + mask = ~(pci_resource_len(pdev, PCI_ROM_RESOURCE) - 1);
> > + *lp &= (u32)mask | PCI_ROM_ADDRESS_ENABLE;
> > +
> > + vdev->vinfo.bardirty = 0;
> > +}
> > +
> > +static int vfio_config_rwbyte(int write,
> > + struct vfio_dev *vdev,
> > + int pos,
> > + char __user *buf)
>
> Consider exporting from pci-sysfs.c instead of duplicating code.

Seems like we'd need to reach some agreement that this should look a lot
more like pci-sysfs config access first, no?

> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + u8 *map = vdev->pci_config_map;
> > + u8 cap, val, newval;
> > + u16 start, off;
> > + int p;
> > + struct perm_bits *perm;
> > + u8 wr, virt;
> > + int ret;
> > + int len;
> > +
> > + cap = map[pos];
> > + if (cap == 0xFF) { /* unknown region */
> > + if (write)
> > + return 0; /* silent no-op */
> > + val = 0;
> > + if (pos <= pci_capability_length[0]) /* ok to read */
> > + (void) pci_read_config_byte(pdev, pos, &val);

Tom, there's a kernel memory leak here. val always needs to be
initialized if we're going to stuff it back in the user buffer.

> > + if (copy_to_user(buf, &val, 1))
> > + return -EFAULT;
> > + return 0;
> > + }
> > +
> > + /* scan back to start of cap region */
> > + for (p = pos; p >= 0; p--) {
> > + if (map[p] != cap)
> > + break;
> > + start = p;
> > + }
> > + off = pos - start; /* offset within capability */
> > +
> > + perm = pci_cap_perms[cap];
> > + if (cap == PCI_CAP_ID_MSI) {
> > + len = pci_msi_cap_len(pdev, start);
> > + switch (len) {
> > + case 10:
> > + perm = pci_cap_msi_10_perm;
> > + break;
> > + case 14:
> > + perm = pci_cap_msi_14_perm;
> > + break;
> > + case 20:
> > + perm = pci_cap_msi_20_perm;
> > + break;
> > + case 24:
> > + perm = pci_cap_msi_24_perm;
> > + break;
> > + default:
> > + perm = NULL;
> > + break;
> > + }
> > + }
> > + if (perm == NULL) {
> > + wr = 0;
> > + virt = 0;
> > + } else {
> > + perm += (off >> 2);
> > + wr = perm->write >> ((off & 3) * 8);
> > + virt = perm->rvirt >> ((off & 3) * 8);
> > + }
> > + if (write && !wr) /* no writeable bits */
> > + return 0;
> > + if (!virt) {
> > + if (write) {
> > + if (copy_from_user(&val, buf, 1))
> > + return -EFAULT;
> > + val &= wr;
> > + if (wr != 0xFF) {
> > + u8 existing;
> > +
> > + ret = pci_read_config_byte(pdev, pos,
> > + &existing);
> > + if (ret < 0)
> > + return ret;
> > + val |= (existing & ~wr);
> > + }
> > + pci_write_config_byte(pdev, pos, val);
>
> I think this should be pci_user_write_config_dword and same for read.
>
>
> > + } else {
> > + ret = pci_read_config_byte(pdev, pos, &val);
> > + if (ret < 0)
> > + return ret;
> > + if (copy_to_user(buf, &val, 1))
> > + return -EFAULT;
> > + }
> > + return 0;
> > + }
> > +
> > + if (write) {
> > + if (copy_from_user(&newval, buf, 1))
> > + return -EFAULT;
> > + }
> > + /*
> > + * We get here if there are some virt bits
> > + * handle remaining real bits, if any
> > + */
> > + if (~virt) {
> > + u8 rbits = (~virt) & wr;
> > +
> > + ret = pci_read_config_byte(pdev, pos, &val);
> > + if (ret < 0)
> > + return ret;
> > + if (write && rbits) {
> > + val &= ~rbits;
> > + newval &= rbits;
> > + val |= newval;
> > + pci_write_config_byte(pdev, pos, val);
> > + }
> > + }
> > + /*
> > + * Now handle entirely virtual fields
> > + */
> > + switch (cap) {
> > + case PCI_CAP_ID_BASIC: /* virtualize BARs */
> > + switch (off) {
> > + /*
> > + * vendor and device are virt because they don't
> > + * show up otherwise for sr-iov vfs
> > + */
> > + case PCI_VENDOR_ID:
> > + val = pdev->vendor;
> > + break;
> > + case PCI_VENDOR_ID + 1:
> > + val = pdev->vendor >> 8;
> > + break;
> > + case PCI_DEVICE_ID:
> > + val = pdev->device;
> > + break;
> > + case PCI_DEVICE_ID + 1:
> > + val = pdev->device >> 8;
> > + break;
> > + case PCI_INTERRUPT_LINE:
> > + if (write)
> > + vdev->vinfo.intr = newval;
> > + else
> > + val = vdev->vinfo.intr;
> > + break;
> > + case PCI_ROM_ADDRESS:
> > + case PCI_ROM_ADDRESS+1:
> > + case PCI_ROM_ADDRESS+2:
> > + case PCI_ROM_ADDRESS+3:
> > + if (write) {
> > + vdev->vinfo.rombar[off & 3] = newval;
> > + vdev->vinfo.bardirty = 1;
> > + } else {
> > + if (vdev->vinfo.bardirty)
> > + vfio_bar_fixup(vdev);
> > + val = vdev->vinfo.rombar[off & 3];
> > + }
> > + break;
> > + default:
> > + if (off >= PCI_BASE_ADDRESS_0 &&
> > + off <= PCI_BASE_ADDRESS_5 + 3) {
> > + int boff = off - PCI_BASE_ADDRESS_0;
> > +
> > + if (write) {
> > + vdev->vinfo.bar[boff] = newval;
> > + vdev->vinfo.bardirty = 1;
> > + } else {
> > + if (vdev->vinfo.bardirty)
> > + vfio_bar_fixup(vdev);
> > + val = vdev->vinfo.bar[boff];
> > + }
> > + }
> > + break;
> > + }
> > + break;
> > + case PCI_CAP_ID_MSI: /* virtualize (parts of) MSI */
> > + if (write)
> > + vdev->vinfo.msi[off] = newval;
> > + else
> > + val = vdev->vinfo.msi[off];
> > + break;
> > + }
> > + if (!write && copy_to_user(buf, &val, 1))
> > + return -EFAULT;
> > + return 0;
> > +}
> > +
> > +ssize_t vfio_config_readwrite(int write,
> > + struct vfio_dev *vdev,
> > + char __user *buf,
> > + size_t count,
> > + loff_t *ppos)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + int done = 0;
> > + int ret;
> > + u16 pos;
> > +
> > +
> > + if (vdev->pci_config_map == NULL) {
> > + ret = vfio_build_config_map(vdev);
> > + if (ret < 0)
> > + goto out;
> > + vfio_virt_init(vdev);
> > + }
> > +
> > + while (count > 0) {
> > + pos = *ppos;
> > + if (pos == pdev->cfg_size)
> > + break;
> > + if (pos > pdev->cfg_size) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + /*
> > + * we grab the irqlock here to prevent confusing
> > + * the read/modify/write sequence in vfio_interrupt
> > + */
> > + spin_lock_irq(&vdev->irqlock);
> > + ret = vfio_config_rwbyte(write, vdev, pos, buf);
> > + spin_unlock_irq(&vdev->irqlock);
> > +
> > + if (ret < 0)
> > + goto out;
> > + buf++;
> > + done++;
> > + count--;
> > + (*ppos)++;
> > + }
> > + ret = done;
> > +out:
> > + return ret;
> > +}
> > diff --git a/drivers/vfio/vfio_rdwr.c b/drivers/vfio/vfio_rdwr.c
> > index e69de29..f4bd2b7 100644
> > --- a/drivers/vfio/vfio_rdwr.c
> > +++ b/drivers/vfio/vfio_rdwr.c
> > @@ -0,0 +1,152 @@
> > +/*
> > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > + * Author: Tom Lyon, [email protected]
>
> Any hints on what this file does?
>
> > + *
> > + * This program is free software; you may redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; version 2 of the License.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > + * SOFTWARE.
> > + *
> > + * Portions derived from drivers/uio/uio.c:
> > + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> > + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> > + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> > + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> > + *
> > + * Portions derived from drivers/uio/uio_pci_generic.c:
> > + * Copyright (C) 2009 Red Hat, Inc.
> > + * Author: Michael S. Tsirkin <[email protected]>
> > + */
> > +
> > +#include <linux/fs.h>
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/pci.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/io.h>
> > +
> > +#include <linux/vfio.h>
> > +
> > +ssize_t vfio_io_readwrite(
> > + int write,
> > + struct vfio_dev *vdev,
> > + char __user *buf,
> > + size_t count,
> > + loff_t *ppos)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + size_t done = 0;
> > + resource_size_t end;
> > + void __iomem *io;
> > + loff_t pos;
> > + int pci_space;
> > + int unit;
> > +
> > + pci_space = vfio_offset_to_pci_space(*ppos);
> > + pos = vfio_offset_to_pci_offset(*ppos);
> > +
> > + if (!pci_resource_start(pdev, pci_space))
> > + return -EINVAL;
> > + end = pci_resource_len(pdev, pci_space);
> > + if (pos + count > end)
> > + return -EINVAL;
> > + if (vdev->bar[pci_space] == NULL)
> > + vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
> > + io = vdev->bar[pci_space];
> > +
> > + while (count > 0) {
> > + if ((pos % 4) == 0 && count >= 4) {
> > + u32 val;
> > +
> > + if (write) {
> > + if (copy_from_user(&val, buf, 4))
> > + return -EFAULT;
> > + iowrite32(val, io + pos);
> > + } else {
> > + val = ioread32(io + pos);
> > + if (copy_to_user(buf, &val, 4))
> > + return -EFAULT;
> > + }
> > + unit = 4;
> > + } else if ((pos % 2) == 0 && count >= 2) {
> > + u16 val;
> > +
> > + if (write) {
> > + if (copy_from_user(&val, buf, 2))
> > + return -EFAULT;
> > + iowrite16(val, io + pos);
> > + } else {
> > + val = ioread16(io + pos);
> > + if (copy_to_user(buf, &val, 2))
> > + return -EFAULT;
> > + }
> > + unit = 2;
> > + } else {
> > + u8 val;
> > +
> > + if (write) {
> > + if (copy_from_user(&val, buf, 1))
> > + return -EFAULT;
> > + iowrite8(val, io + pos);
> > + } else {
> > + val = ioread8(io + pos);
> > + if (copy_to_user(buf, &val, 1))
> > + return -EFAULT;
> > + }
> > + unit = 1;
> > + }
> > + pos += unit;
> > + buf += unit;
> > + count -= unit;
> > + done += unit;
>
> Again, export from pci-sysfs.c?

I just submitted a patch to be able to do read/write to ioport
resources, but I still think these serve a different purpose. PCI sysfs
config and resource files don't have the same tie in to DMA mapping as
vfio does, so they'll happily let you setup the device without the added
security of an iommu. We have to trust that the user is going to be
safe and use the iommu. I believe vfio is trying to create a safe
interface that provides iommu enforcement and simple config space
virtualization, so does need to reimplement a few things that we may be
able to get unsafely through pci-sysfs.

> > + }
> > + *ppos += done;
> > + return done;
> > +}
> > +
> > +ssize_t vfio_mem_readwrite(
> > + int write,
> > + struct vfio_dev *vdev,
> > + char __user *buf,
> > + size_t count,
> > + loff_t *ppos)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + resource_size_t end;
> > + void __iomem *io;
> > + loff_t pos;
> > + int pci_space;
> > +
> > + pci_space = vfio_offset_to_pci_space(*ppos);
> > + pos = vfio_offset_to_pci_offset(*ppos);
> > +
> > + if (!pci_resource_start(pdev, pci_space))
> > + return -EINVAL;
> > + end = pci_resource_len(pdev, pci_space);
> > + if (vdev->bar[pci_space] == NULL)
> > + vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
> > + io = vdev->bar[pci_space];
> > +
> > + if (pos > end)
> > + return -EINVAL;
> > + if (pos == end)
> > + return 0;
> > + if (pos + count > end)
> > + count = end - pos;
> > + if (write) {
> > + if (copy_from_user(io + pos, buf, count))
> > + return -EFAULT;
> > + } else {
> > + if (copy_to_user(buf, io + pos, count))
> > + return -EFAULT;
> > + }
> > + *ppos += count;
> > + return count;
> > +}
> > diff --git a/drivers/vfio/vfio_sysfs.c b/drivers/vfio/vfio_sysfs.c
> > index e69de29..6275809 100644
> > --- a/drivers/vfio/vfio_sysfs.c
> > +++ b/drivers/vfio/vfio_sysfs.c
> > @@ -0,0 +1,153 @@
> > +/*
> > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > + * Author: Tom Lyon, [email protected]
> > + *
> > + * This program is free software; you may redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; version 2 of the License.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > + * SOFTWARE.
> > + *
> > + * Portions derived from drivers/uio/uio.c:
> > + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> > + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> > + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> > + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> > + *
> > + * Portions derived from drivers/uio/uio_pci_generic.c:
> > + * Copyright (C) 2009 Red Hat, Inc.
> > + * Author: Michael S. Tsirkin <[email protected]>
> > + */
> > +
> > +#include <linux/module.h>
> > +#include <linux/device.h>
> > +#include <linux/kobject.h>
> > +#include <linux/sysfs.h>
> > +#include <linux/mm.h>
> > +#include <linux/fs.h>
> > +#include <linux/pci.h>
> > +#include <linux/mmu_notifier.h>
> > +
> > +#include <linux/vfio.h>
> > +
> > +struct vfio_class *vfio_class;
> > +
> > +int vfio_class_init(void)
> > +{
> > + int ret = 0;
> > +
> > + if (vfio_class != NULL) {
> > + kref_get(&vfio_class->kref);
> > + goto exit;
> > + }
> > +
> > + vfio_class = kzalloc(sizeof(*vfio_class), GFP_KERNEL);
> > + if (!vfio_class) {
> > + ret = -ENOMEM;
> > + goto err_kzalloc;
> > + }
> > +
> > + kref_init(&vfio_class->kref);
> > + vfio_class->class = class_create(THIS_MODULE, "vfio");
> > + if (IS_ERR(vfio_class->class)) {
> > + ret = IS_ERR(vfio_class->class);
> > + printk(KERN_ERR "class_create failed for vfio\n");
> > + goto err_class_create;
> > + }
> > + return 0;
> > +
> > +err_class_create:
> > + kfree(vfio_class);
> > + vfio_class = NULL;
> > +err_kzalloc:
> > +exit:
> > + return ret;
> > +}
> > +
> > +static void vfio_class_release(struct kref *kref)
> > +{
> > + /* Ok, we cheat as we know we only have one vfio_class */
> > + class_destroy(vfio_class->class);
> > + kfree(vfio_class);
> > + vfio_class = NULL;
> > +}
> > +
> > +void vfio_class_destroy(void)
> > +{
> > + if (vfio_class)
> > + kref_put(&vfio_class->kref, vfio_class_release);
> > +}
> > +
> > +static ssize_t config_map_read(struct kobject *kobj,
> > + struct bin_attribute *bin_attr,
> > + char *buf, loff_t off, size_t count)
> > +{
> > + struct vfio_dev *vdev = bin_attr->private;
> > + int ret;
> > +
> > + if (off >= 256)
> > + return 0;
> > + if (off + count > 256)
> > + count = 256 - off;
> > + if (vdev->pci_config_map == NULL) {
> > + ret = vfio_build_config_map(vdev);
> > + if (ret < 0)
> > + return ret;
> > + }
> > + memcpy(buf, vdev->pci_config_map + off, count);
> > + return count;
> > +}
> > +
> > +static ssize_t show_locked_pages(struct device *dev,
> > + struct device_attribute *attr,
> > + char *buf)
> > +{
> > + struct vfio_dev *vdev = dev_get_drvdata(dev);
> > +
> > + if (vdev == NULL)
> > + return -ENODEV;
> > + return sprintf(buf, "%u\n", vdev->locked_pages);
> > +}
> > +
> > +static DEVICE_ATTR(locked_pages, S_IRUGO, show_locked_pages, NULL);
> > +
> > +static struct attribute *vfio_attrs[] = {
> > + &dev_attr_locked_pages.attr,
> > + NULL,
> > +};
> > +
> > +static struct attribute_group vfio_attr_grp = {
> > + .attrs = vfio_attrs,
> > +};
> > +
> > +static struct bin_attribute config_map_bin_attribute = {
> > + .attr = {
> > + .name = "config_map",
> > + .mode = S_IRUGO,
> > + },
> > + .size = 256,
> > + .read = config_map_read,
> > +};
>
> The config map looks like an internal implementation detail
> of the driver. Right? If so exposing it will tie us to
> this forever, so not a good idea.
>
> > +
> > +int vfio_dev_add_attributes(struct vfio_dev *vdev)
> > +{
> > + struct bin_attribute *bi;
> > + int ret;
> > +
> > + ret = sysfs_create_group(&vdev->dev->kobj, &vfio_attr_grp);
> > + if (ret)
> > + return ret;
> > + bi = kmalloc(sizeof(*bi), GFP_KERNEL);
> > + if (bi == NULL)
> > + return -ENOMEM;
> > + *bi = config_map_bin_attribute;
> > + bi->private = vdev;
> > + return sysfs_create_bin_file(&vdev->dev->kobj, bi);
> > +}
> > diff --git a/include/linux/Kbuild b/include/linux/Kbuild
> > index e2ea0b2..ed37cf4 100644
> > --- a/include/linux/Kbuild
> > +++ b/include/linux/Kbuild
> > @@ -166,6 +166,7 @@ header-y += ultrasound.h
> > header-y += un.h
> > header-y += utime.h
> > header-y += veth.h
> > +header-y += vfio.h
> > header-y += videotext.h
> > header-y += x25.h
> >
> > diff --git a/include/linux/uiommu.h b/include/linux/uiommu.h
> > index e69de29..a7b7eac 100644
> > --- a/include/linux/uiommu.h
> > +++ b/include/linux/uiommu.h
> > @@ -0,0 +1,76 @@
> > +/*
> > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > + * Author: Tom Lyon, [email protected]
> > + *
> > + * This program is free software; you may redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; version 2 of the License.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > + * SOFTWARE.
> > + */
> > +
> > +/*
> > + * uiommu driver - manipulation of iommu domains from user progs
> > + */
> > +struct uiommu_domain {
> > + struct iommu_domain *domain;
> > + atomic_t refcnt;
> > +};
> > +
> > +/*
> > + * Kernel routines invoked by fellow driver (vfio)
> > + * after uiommu domain fd is passed in.
> > + */
> > +struct uiommu_domain *uiommu_fdget(int fd);
> > +void uiommu_put(struct uiommu_domain *);
> > +
> > +/*
> > + * These inlines are placeholders for future routines
> > + * which may keep statistics, show info in sysfs, etc.
> > + */
> > +static inline int uiommu_attach_device(struct uiommu_domain *udomain,
> > + struct device *dev)
> > +{
> > + return iommu_attach_device(udomain->domain, dev);
> > +}
> > +
> > +static inline void uiommu_detach_device(struct uiommu_domain *udomain,
> > + struct device *dev)
> > +{
> > + iommu_detach_device(udomain->domain, dev);
> > +}
> > +
> > +static inline int uiommu_map_range(struct uiommu_domain *udomain,
> > + unsigned long iova,
> > + phys_addr_t paddr,
> > + size_t size,
> > + int prot)
> > +{
> > + return iommu_map_range(udomain->domain, iova, paddr, size, prot);
> > +}
> > +
> > +static inline void uiommu_unmap_range(struct uiommu_domain *udomain,
> > + unsigned long iova,
> > + size_t size)
> > +{
> > + iommu_unmap_range(udomain->domain, iova, size);
> > +}
> > +
> > +static inline phys_addr_t uiommu_iova_to_phys(struct uiommu_domain *udomain,
> > + unsigned long iova)
> > +{
> > + return iommu_iova_to_phys(udomain->domain, iova);
> > +}
> > +
> > +static inline int uiommu_domain_has_cap(struct uiommu_domain *udomain,
> > + unsigned long cap)
> > +{
> > + return iommu_domain_has_cap(udomain->domain, cap);
> > +}
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index e69de29..52aa0dd 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -0,0 +1,202 @@
> > +/*
> > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > + * Author: Tom Lyon, [email protected]
> > + *
> > + * This program is free software; you may redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; version 2 of the License.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > + * SOFTWARE.
> > + *
> > + * Portions derived from drivers/uio/uio.c:
> > + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> > + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> > + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> > + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> > + *
> > + * Portions derived from drivers/uio/uio_pci_generic.c:
> > + * Copyright (C) 2009 Red Hat, Inc.
> > + * Author: Michael S. Tsirkin <[email protected]>
> > + */
> > +#include <linux/types.h>
> > +
> > +/*
> > + * VFIO driver - allow mapping and use of certain PCI devices
> > + * in unprivileged user processes. (If IOMMU is present)
> > + * Especially useful for Virtual Function parts of SR-IOV devices
> > + */
> > +
> > +#ifdef __KERNEL__
> > +
> > +struct vfio_dev {
> > + struct device *dev;
> > + struct pci_dev *pdev;
> > + u8 *pci_config_map;
> > + int pci_config_size;
> > + char name[8];
> > + int devnum;
> > + void __iomem *bar[PCI_ROM_RESOURCE+1];
> > + spinlock_t irqlock; /* guards command register accesses */
> > + int listeners;
> > + u32 locked_pages;
> > + struct mutex lgate; /* listener gate */
> > + struct mutex dgate; /* dma op gate */
> > + struct mutex igate; /* intr op gate */
> > + wait_queue_head_t dev_idle_q;
> > + int mapcount;
> > + struct uiommu_domain *udomain;
> > + struct msix_entry *msix;
> > + int nvec;
> > + int cachec;
> > + struct eventfd_ctx *ev_irq;
> > + struct eventfd_ctx *ev_msi;
> > + struct eventfd_ctx **ev_msix;
> > + struct {
> > + u8 intr;
> > + u8 bardirty;
> > + u8 rombar[4];
> > + u8 bar[6*4];
> > + u8 msi[24];
> > + } vinfo;
> > +};
> > +
> > +struct vfio_listener {
> > + struct vfio_dev *vdev;
> > + struct list_head dm_list;
> > + struct mm_struct *mm;
> > + struct mmu_notifier mmu_notifier;
> > +};
> > +
> > +/*
> > + * Structure for keeping track of memory nailed down by the
> > + * user for DMA
> > + */
> > +struct dma_map_page {
> > + struct list_head list;
> > + struct page **pages;
> > + dma_addr_t daddr;
> > + unsigned long vaddr;
> > + int npage;
> > + int rdwr;
> > +};
> > +
> > +/* VFIO class infrastructure */
> > +struct vfio_class {
> > + struct kref kref;
> > + struct class *class;
> > +};
> > +extern struct vfio_class *vfio_class;
> > +
> > +ssize_t vfio_io_readwrite(int, struct vfio_dev *,
> > + char __user *, size_t, loff_t *);
> > +ssize_t vfio_mem_readwrite(int, struct vfio_dev *,
> > + char __user *, size_t, loff_t *);
> > +ssize_t vfio_config_readwrite(int, struct vfio_dev *,
> > + char __user *, size_t, loff_t *);
> > +
> > +void vfio_disable_msi(struct vfio_dev *);
> > +void vfio_disable_msix(struct vfio_dev *);
> > +int vfio_enable_msi(struct vfio_dev *, int);
> > +int vfio_enable_msix(struct vfio_dev *, int, void __user *);
> > +
> > +#ifndef PCI_MSIX_ENTRY_SIZE
> > +#define PCI_MSIX_ENTRY_SIZE 16
> > +#endif
> > +#ifndef PCI_STATUS_INTERRUPT
> > +#define PCI_STATUS_INTERRUPT 0x08
> > +#endif
> > +
> > +struct vfio_dma_map;
> > +void vfio_dma_unmapall(struct vfio_listener *);
> > +int vfio_dma_unmap_dm(struct vfio_listener *, struct vfio_dma_map *);
> > +int vfio_dma_map_common(struct vfio_listener *, unsigned int,
> > + struct vfio_dma_map *);
> > +int vfio_domain_set(struct vfio_dev *, int, int);
> > +int vfio_domain_unset(struct vfio_dev *);
> > +
> > +int vfio_class_init(void);
> > +void vfio_class_destroy(void);
> > +int vfio_dev_add_attributes(struct vfio_dev *);
> > +int vfio_build_config_map(struct vfio_dev *);
> > +
> > +irqreturn_t vfio_interrupt(int, void *);
> > +
> > +#endif /* __KERNEL__ */
> > +
> > +/* Kernel & User level defines for ioctls */
> > +
> > +/*
> > + * Structure for DMA mapping of user buffers
> > + * vaddr, dmaaddr, and size must all be page aligned
> > + * buffer may only be larger than 1 page if (a) there is
> > + * an iommu in the system, or (b) buffer is part of a huge page
> > + */
> > +struct vfio_dma_map {
> > + __u64 vaddr; /* process virtual addr */
> > + __u64 dmaaddr; /* desired and/or returned dma address */
> > + __u64 size; /* size in bytes */
> > + __u64 flags; /* bool: 0 for r/o; 1 for r/w */
> > +#define VFIO_FLAG_WRITE 0x1 /* req writeable DMA mem */
> > +};
> > +
> > +/* map user pages at specific dma address */
> > +/* requires previous VFIO_DOMAIN_SET */
> > +#define VFIO_DMA_MAP_IOVA _IOWR(';', 101, struct vfio_dma_map)
> > +
> > +/* unmap user pages */
> > +#define VFIO_DMA_UNMAP _IOW(';', 102, struct vfio_dma_map)
> > +
> > +/* request IRQ interrupts; use given eventfd */
> > +#define VFIO_EVENTFD_IRQ _IOW(';', 103, int)
> > +
> > +/* request MSI interrupts; use given eventfd */
> > +#define VFIO_EVENTFD_MSI _IOW(';', 104, int)
> > +
> > +/* Request MSI-X interrupts: arg[0] is #, arg[1-n] are eventfds */
> > +#define VFIO_EVENTFDS_MSIX _IOW(';', 105, int)
> > +
> > +/* Get length of a BAR */
> > +#define VFIO_BAR_LEN _IOWR(';', 167, __u32)
> > +
> > +/* Set the IOMMU domain - arg is fd from uiommu driver */
> > +#define VFIO_DOMAIN_SET _IOW(';', 107, int)
> > +
> > +/* Unset the IOMMU domain */
> > +#define VFIO_DOMAIN_UNSET _IO(';', 108)
> > +
> > +/*
> > + * Reads, writes, and mmaps determine which PCI BAR (or config space)
> > + * from the high level bits of the file offset
> > + */
> > +#define VFIO_PCI_BAR0_RESOURCE 0x0
> > +#define VFIO_PCI_BAR1_RESOURCE 0x1
> > +#define VFIO_PCI_BAR2_RESOURCE 0x2
> > +#define VFIO_PCI_BAR3_RESOURCE 0x3
> > +#define VFIO_PCI_BAR4_RESOURCE 0x4
> > +#define VFIO_PCI_BAR5_RESOURCE 0x5
>
> This looks wrong.
> In the code I snipped, we had:
> + pci_space = vfio_offset_to_pci_space(*ppos);
>
> So this is actually used as resource number, not BAR number.
> One or the other will need to get fixed, otherwise
> with 64 bit BAR0, you will get the wrong resource.

Aren't BAR number and resource number synonymous? I would have assumed
that if BAR0 is 64bit, pci_resource_len(, BAR1) is zero, which should
work just fine for the way it's coded up here.

> > +#define VFIO_PCI_ROM_RESOURCE 0x6
> > +#define VFIO_PCI_CONFIG_RESOURCE 0xF
> > +#define VFIO_PCI_SPACE_SHIFT 32
> > +#define VFIO_PCI_CONFIG_OFF vfio_pci_space_to_offset(VFIO_PCI_CONFIG_RESOURCE)
> > +
> > +static inline int vfio_offset_to_pci_space(__u64 off)
> > +{
> > + return (off >> VFIO_PCI_SPACE_SHIFT) & 0xF;
> > +}
> > +
> > +static inline u32 vfio_offset_to_pci_offset(__u64 off)
> > +{
> > + return off & (u32)0xFFFFFFFF;
> > +}
> > +
> > +static inline __u64 vfio_pci_space_to_offset(int sp)
> > +{
> > + return (__u64)(sp) << VFIO_PCI_SPACE_SHIFT;
> > +}


2010-07-19 10:12:16

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

On Sun, Jul 18, 2010 at 10:56:47PM -0600, Alex Williamson wrote:
> Hi Tom, Michael,
>
> Comments for both of you below. Tom, what does this build against? Are
> we still on 2.6.34?
>
> On Sun, 2010-07-18 at 12:39 +0300, Michael S. Tsirkin wrote:
> > Hi Tom,
> > > ---
> > > In this version:
> > >
> > > There are lots of bug fixes and cleanups in this version, but the main
> > > change is to check to make sure that the IOMMU has interrupt remapping
> > > enabled, which is necessary to prevent user level code from triggering
> > > spurious interrupts for other devices. Since most platforms today
> > > do not have the necessary hardware and/or software for this, a module
> > > option can override this check, thus making vfio useful (but not safe)
> > > on many more platforms.
> > >
> > > In the next version I plan to add kernel to user messaging using the
> > > generic netlink mechanism to allow the user driver to react to hot add
> > > and remove, and power management requests.
> > >
> > > Blurb from version 2:
> > >
> > > This version now requires an IOMMU domain to be set before any access to
> > > device registers is granted (except that config space may be read). In
> > > addition, the VFIO_DMA_MAP_ANYWHERE is dropped - it used the dma_map_sg API
> > > which does not have sufficient controls around IOMMU usage. The IOMMU domain
> > > is obtained from the 'uiommu' driver which is included in this patch.
> > >
> > > Various locking, security, and documentation issues have also been fixed.
> > >
> >
> > I think this is making nice progress, especially good to see
> > some effort to address the interrupt remapping issue.
> > I just realized we also have an issue with determining
> > the MSI capability and allocating entries. Some ideas
> > on addressing this posted below.
> > I also think it might make sense to involve the pci crowd here.
> >
> > It might be nice to see a bit of documentation for the interface
> > presented to userspace. It is not always obvious.
> > For example, passing CONFIG as offset to
> > write or read will not always trigger access to config space.
> >
> > More comments below.
> >
> >
> > On Fri, Jul 16, 2010 at 02:58:48PM -0700, Tom Lyon wrote:
> > > +static int vfio_msix_check(struct vfio_dev *vdev, u64 start, u32 len)
> > > +{
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + u16 pos;
> > > + u32 table_offset;
> > > + u16 table_size;
> > > + u8 bir;
> > > + u32 lo, hi, startp, endp;
> > > +
> > > + pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
> > > + if (!pos)
> > > + return 0;
> > > +
> > > + pci_read_config_word(pdev, pos + PCI_MSIX_FLAGS, &table_size);
> > > + table_size = (table_size & PCI_MSIX_FLAGS_QSIZE) + 1;
> > > + pci_read_config_dword(pdev, pos + 4, &table_offset);
> > > + bir = table_offset & PCI_MSIX_FLAGS_BIRMASK;
> > > + lo = table_offset >> PAGE_SHIFT;
> > > + hi = (table_offset + PCI_MSIX_ENTRY_SIZE * table_size + PAGE_SIZE - 1)
> > > + >> PAGE_SHIFT;
> > > + startp = start >> PAGE_SHIFT;
> > > + endp = (start + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > + if (bir == vfio_offset_to_pci_space(start) &&
> > > + overlap(lo, hi, startp, endp)) {
> > > + printk(KERN_WARNING "%s: cannot write msi-x vectors\n",
> > > + __func__);
> > > + return -EINVAL;
> > > + }
> > > + return 0;
> > > +}
> >
> >
> > You can't just check MSI capability to figure out whether MSI
> > will work: this also depends on the host system in many ways.
> > The only way to really know is to request vectors in host. So
> > what we really need is an API that will reserve MSI vectors,
> > but not assign them in the device until guest asks us to
> > do it.
> >
> > > +
> > > + if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) {
> > > + if (allow_unsafe_intrs) {
> > > + /* don't allow writes to msi-x vectors */
> > > + ret = vfio_msix_check(vdev, *ppos, count);
> >
> > I thought we have an agreement here: it's useless to protect
> > msix vector space and at the same time allow DMA from device,
> > since device can do DMA write to trigger MSI.
> > So why do you keep this code around?
> >
> >
> > > + case VFIO_EVENTFD_MSI:
> > > + if (copy_from_user(&fd, uarg, sizeof fd))
> > > + return -EFAULT;
> > > + mutex_lock(&vdev->igate);
> > > + if (fd >= 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
> > > + ret = vfio_enable_msi(vdev, fd);
> > > + else if (fd < 0 && vdev->ev_msi)
> > > + vfio_disable_msi(vdev);
> > > + else
> > > + ret = -EINVAL;
> > > + mutex_unlock(&vdev->igate);
> > > + break;
> >
> > I think that for virt we'll need multivector support for MSI.
> >
> >
> > > diff --git a/drivers/vfio/vfio_pci_config.c b/drivers/vfio/vfio_pci_config.c
> > > index e69de29..8bd5c00 100644
> > > --- a/drivers/vfio/vfio_pci_config.c
> > > +++ b/drivers/vfio/vfio_pci_config.c
> > > @@ -0,0 +1,605 @@
> > > +/*
> > > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > > + * Author: Tom Lyon, [email protected]
> >
> > Any hints on what this file does?
> >
> > > + *
> > > + * This program is free software; you may redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License as published by
> > > + * the Free Software Foundation; version 2 of the License.
> > > + *
> > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > > + * SOFTWARE.
> > > + *
> > > + * Portions derived from drivers/uio/uio.c:
> > > + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> > > + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> > > + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> > > + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> > > + *
> > > + * Portions derived from drivers/uio/uio_pci_generic.c:
> > > + * Copyright (C) 2009 Red Hat, Inc.
> > > + * Author: Michael S. Tsirkin <[email protected]>
> > > + */
> >
> > Tom so this file is some 600 lines of pretty tricky code,
> > that I still don't completely understand the purpose of.
> > For example, it prevents usespace from writing into readonly hardware
> > registers like class/revision id. But writing them is harmless.
> > As above, all code that does tricks with msi is also kind of useless.
> > Maybe we need protection against writing BAR registers, but
> > what about all the rest of it?
> > Leavng it pasted below so you can explain it to me.
> >
> > Also, I am still unsure what purpose does the trick of "virtualization" serve.
> > For example, you return a fake device id for an sr/iov device.
> > Sounds a bit like forcing policy.
> > Wouldn't it be cleaner to simply add an ioctl to get the
> > fake id, thus making both the real value and the fake value
> > available.
>
> Is there a better default for this example? Linux has already figured
> out what the vendor and device IDs are for the VF, why burden every
> userspace caller to go figure out what's in pcisysfs and do the same
> trick?

The issue here is the interface IMO. What happens if you do
a write or read at offset X in the device? It might write the device,
it might cache it in kernel, you might get real data, you might not ...
It's basically undefined at the moment, you read the code to figure it out.
What I advocate is "read gets data from device, write puts data
in the device or fails with -EPERM". For extra functionality,
such as restoring registers after reset, you would use write
at special offsets (or ioctl calls if you prefer).

This way makes the non-pageable, hard-to-update kernel code
as straightforward as possible. It is also possible to create a library
and stick all kind of common code in there.

> If userspace wants to expose something different, they're free
> to trap these config offsets and return whatever they please.

But same applies to registers this driver did decide to virtualize.
It can all be done in userspace.


> If
> userspace wants direct access, they can use pcisysfs.

vfio already has some ioctls (resource size) that can be replaced
with pcisysfs. I am not sure why does it have them, but I don't mind.

> > On the other hand, the virtualization trick will prevent userspace
> > from doing things like restoring registers after a device specific reset.
> > See drivers/infiniband/hw/mthca/mthca_reset.c as one example.
> > For BARs, we can solve this by checking for BAR change and
> > restoring it.
> > Again, doing this might be better done through a special ioctl?
>
> A reset ioctl may make sense, but what would we lose if we simply
> trapped the flr write in vfio and issued a device reset?

Yes. but that's not enough: FLR is very recent.
A lot of devices have vendor-specific reset (see example above), some
might not be in config space so you can not trap.
Now, after reset driver will try to restore the config space.
Just caching config writes and not touching the device will not DTRT.

> > > +
> > > +#include <linux/fs.h>
> > > +#include <linux/pci.h>
> > > +#include <linux/mmu_notifier.h>
> > > +#include <linux/uaccess.h>
> > > +#include <linux/vfio.h>
> > > +
> > > +#define PCI_CAP_ID_BASIC 0
> > > +#ifndef PCI_CAP_ID_MAX
> > > +#define PCI_CAP_ID_MAX PCI_CAP_ID_AF
> > > +#endif
> > > +
> > > +/*
> > > + * Lengths of PCI Config Capabilities
> > > + * 0 means unknown (but at least 4)
> > > + * FF means special/variable
> > > + */
> > > +static u8 pci_capability_length[] = {
> > > + [PCI_CAP_ID_BASIC] = 64, /* pci config header */
> > > + [PCI_CAP_ID_PM] = PCI_PM_SIZEOF,
> > > + [PCI_CAP_ID_AGP] = PCI_AGP_SIZEOF,
> > > + [PCI_CAP_ID_VPD] = 8,
> > > + [PCI_CAP_ID_SLOTID] = 4,
> > > + [PCI_CAP_ID_MSI] = 0xFF, /* 10, 14, 20, or 24 */
> > > + [PCI_CAP_ID_CHSWP] = 4,
> > > + [PCI_CAP_ID_PCIX] = 0xFF, /* 8 or 24 */
> > > + [PCI_CAP_ID_HT] = 28,
> > > + [PCI_CAP_ID_VNDR] = 0xFF,
> > > + [PCI_CAP_ID_DBG] = 0,
> > > + [PCI_CAP_ID_CCRC] = 0,
> > > + [PCI_CAP_ID_SHPC] = 0,
> > > + [PCI_CAP_ID_SSVID] = 0, /* bridge only - not supp */
> > > + [PCI_CAP_ID_AGP3] = 0,
> > > + [PCI_CAP_ID_EXP] = 36,
> > > + [PCI_CAP_ID_MSIX] = 12,
> > > + [PCI_CAP_ID_AF] = 6,
> > > +};
> >
> > Maybe all these constants should go into pci_regs.h?
> > Please consider Cc Jesse/linux-pci.
> >
> > > +
> > > +/*
> > > + * Read/Write Permission Bits - one bit for each bit in capability
> > > + * Any field can be read if it exists,
> > > + * but what is read depends on whether the field
> > > + * is 'virtualized', or just pass thru to the hardware.
> > > + * Any virtualized field is also virtualized for writes.
> > > + * Writes are only permitted if they have a 1 bit here.
> > > + */
> > > +struct perm_bits {
> > > + u32 rvirt; /* read bits which must be virtualized */
> > > + u32 write; /* writeable bits - virt if read virt */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_basic_perm[] = {
> > > + { 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
> > > + { 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
> >
> > don't want to virtualize mem/io?
>
> This seems like a hole to me too. I haven't looked in the spec for this
> issue, but an 82576 VF doesn't have working mem/io bits, so I end up
> falling back to qemu for the command register.

Again, if driver wants to disable mem/io on a device, what damage can
this do? I think we really only should intervene in kernel if userspace
can cause DOS through the device. Otherwise, IMO we should just give
userspace rope.

> > > + { 0, 0, }, /* 0x08 class code & revision id */
> > > + { 0, 0xFF00FFFF, }, /* 0x0c bist, htype, lat, cache */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x24 bar */
> > > + { 0, 0, }, /* 0x28 cardbus - not yet */
> > > + { 0, 0, }, /* 0x2c subsys vendor & dev */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x30 rom bar */
> > > + { 0, 0, }, /* 0x34 capability ptr & resv */
> > > + { 0, 0, }, /* 0x38 resv */
> > > + { 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
> > > +};
> >
> > Use .[] initializers above instead of sticking register name in comment?
> >
> > > +
> > > +static struct perm_bits pci_cap_pm_perm[] = {
> > > + { 0, 0, }, /* 0x00 PM capabilities */
> > > + { 0, 0xFFFFFFFF, }, /* 0x04 PM control/status */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_vpd_perm[] = {
> > > + { 0, 0xFFFF0000, }, /* 0x00 address */
> > > + { 0, 0xFFFFFFFF, }, /* 0x04 data */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_slotid_perm[] = {
> > > + { 0, 0, }, /* 0x00 all read only */
> > > +};
> > > +
> > > +/* 4 different possible layouts of MSI capability */
> > > +static struct perm_bits pci_cap_msi_10_perm[] = {
> > > + { 0, 0, }, /* 0x00 MSI message control */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> > > + { 0x0000FFFF, 0x0000FFFF, }, /* 0x08 MSI message data */
> > > +};
> > > +static struct perm_bits pci_cap_msi_14_perm[] = {
> > > + { 0, 0, }, /* 0x00 MSI message control */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message upper addr */
> > > + { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */
> > > +};
> > > +static struct perm_bits pci_cap_msi_20_perm[] = {
> > > + { 0, 0, }, /* 0x00 MSI message control */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> > > + { 0x0000FFFF, 0x0000FFFF, }, /* 0x08 MSI message data */
> > > + { 0, 0xFFFFFFFF, }, /* 0x0c MSI mask bits */
> > > + { 0, 0xFFFFFFFF, }, /* 0x10 MSI pending bits */
> > > +};
> > > +static struct perm_bits pci_cap_msi_24_perm[] = {
> > > + { 0, 0, }, /* 0x00 MSI message control */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message upper addr */
> > > + { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */
> > > + { 0, 0xFFFFFFFF, }, /* 0x10 MSI mask bits */
> > > + { 0, 0xFFFFFFFF, }, /* 0x14 MSI pending bits */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_pcix_perm[] = {
> > > + { 0, 0xFFFF0000, }, /* 0x00 PCI_X_CMD */
> > > + { 0, 0, }, /* 0x04 PCI_X_STATUS */
> > > + { 0, 0xFFFFFFFF, }, /* 0x08 ECC ctlr & status */
> > > + { 0, 0, }, /* 0x0c ECC first addr */
> > > + { 0, 0, }, /* 0x10 ECC second addr */
> > > + { 0, 0, }, /* 0x14 ECC attr */
> > > +};
> > > +
> > > +/* pci express capabilities */
> > > +static struct perm_bits pci_cap_exp_perm[] = {
> > > + { 0, 0, }, /* 0x00 PCIe capabilities */
> > > + { 0, 0, }, /* 0x04 PCIe device capabilities */
> > > + { 0, 0xFFFFFFFF, }, /* 0x08 PCIe device control & status */
> > > + { 0, 0, }, /* 0x0c PCIe link capabilities */
> > > + { 0, 0x000000FF, }, /* 0x10 PCIe link ctl/stat - SAFE? */
> > > + { 0, 0, }, /* 0x14 PCIe slot capabilities */
> > > + { 0, 0x00FFFFFF, }, /* 0x18 PCIe link ctl/stat - SAFE? */
> > > + { 0, 0, }, /* 0x1c PCIe root port stuff */
> > > + { 0, 0, }, /* 0x20 PCIe root port stuff */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_msix_perm[] = {
> > > + { 0, 0, }, /* 0x00 MSI-X Enable */
> > > + { 0, 0, }, /* 0x04 table offset & bir */
> > > + { 0, 0, }, /* 0x08 pba offset & bir */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_af_perm[] = {
> > > + { 0, 0, }, /* 0x00 af capability */
> > > + { 0, 0x0001, }, /* 0x04 af flr bit */
> >
> > So you let application reset the function?
> > If this happens, application will need to restore config space,
> > so virtualizing will create problems.
> >
> > > +};
> > > +
> > > +static struct perm_bits *pci_cap_perms[] = {
> > > + [PCI_CAP_ID_BASIC] = pci_cap_basic_perm,
> > > + [PCI_CAP_ID_PM] = pci_cap_pm_perm,
> > > + [PCI_CAP_ID_VPD] = pci_cap_vpd_perm,
> > > + [PCI_CAP_ID_SLOTID] = pci_cap_slotid_perm,
> > > + [PCI_CAP_ID_MSI] = NULL, /* special */
> > > + [PCI_CAP_ID_PCIX] = pci_cap_pcix_perm,
> > > + [PCI_CAP_ID_EXP] = pci_cap_exp_perm,
> > > + [PCI_CAP_ID_MSIX] = pci_cap_msix_perm,
> > > + [PCI_CAP_ID_AF] = pci_cap_af_perm,
> > > +};
> > > +
> > > +static int pci_msi_cap_len(struct pci_dev *pdev, u8 pos)
> > > +{
> > > + int len;
> > > + int ret;
> > > + u16 flags;
> > > +
> > > + ret = pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &flags);
> > > + if (ret < 0)
> > > + return ret;
> > > + if (flags & PCI_MSI_FLAGS_64BIT)
> > > + len = 14;
> > > + else
> > > + len = 10;
> > > + if (flags & PCI_MSI_FLAGS_MASKBIT)
> > > + len += 10;
> > > + return len;
> > > +}
> > > +
> > > +/*
> > > + * We build a map of the config space that tells us where
> > > + * and what capabilities exist, so that we can map reads and
> > > + * writes back to capabilities, and thus figure out what to
> > > + * allow, deny, or virtualize
> > > + */
> > > +int vfio_build_config_map(struct vfio_dev *vdev)
> > > +{
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + u8 *map;
> > > + int i, len;
> > > + u8 pos, cap, tmp;
> > > + u16 flags;
> > > + int ret;
> > > +#ifndef PCI_FIND_CAP_TTL
> > > +#define PCI_FIND_CAP_TTL 48
> > > +#endif
> > > + int loops = PCI_FIND_CAP_TTL;
> > > +
> > > + map = kmalloc(pdev->cfg_size, GFP_KERNEL);
> > > + if (map == NULL)
> > > + return -ENOMEM;
> > > + for (i = 0; i < pdev->cfg_size; i++)
> > > + map[i] = 0xFF;
> > > + vdev->pci_config_map = map;
> > > +
> > > + /* default config space */
> > > + for (i = 0; i < pci_capability_length[0]; i++)
> > > + map[i] = 0;
> > > +
> > > + /* any capabilities? */
> > > + ret = pci_read_config_word(pdev, PCI_STATUS, &flags);
> > > + if (ret < 0)
> > > + return ret;
> > > + if ((flags & PCI_STATUS_CAP_LIST) == 0)
> > > + return 0;
> > > +
> > > + ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos);
> > > + if (ret < 0)
> > > + return ret;
> > > + while (pos && --loops > 0) {
> > > + ret = pci_read_config_byte(pdev, pos, &cap);
> > > + if (ret < 0)
> > > + return ret;
> > > + if (cap == 0) {
> > > + printk(KERN_WARNING "%s: cap 0\n", __func__);
> > > + break;
> > > + }
> > > + if (cap > PCI_CAP_ID_MAX) {
> > > + printk(KERN_WARNING "%s: unknown pci capability id %x\n",
> > > + __func__, cap);
> > > + len = 0;
> >
> > Why is this a problem? Devices will implement more capabilities.
> > How is access to an unknown one worse that access outside
> > any capability, or vendor specific, that you do allow?
>
> I believe the code doesn't allow access outside of known config space or
> capabilities.

vendor specific capability can still do absolutely anything at all.
So I do not see what protection this buys us.

> It does seem like we could simply skip unknown caps
> though, but that would require virtualizing the capability next pointer,
> which isn't very amenable to how virtualized reads are currently done.
> We'd need something more like how qemu virtualizes config space for
> that.

Which still does not answer the question of why do this in kernel at all :).
qemu has all this code in userspace already (does not use vfio, but we
can cut and paste), userspace drivers that Tom wants will not need these
tricks at all as they are written to a specific interface anyway.

> > > + } else
> > > + len = pci_capability_length[cap];
> > > + if (len == 0) {
> > > + printk(KERN_WARNING "%s: unknown length for pci cap %x\n",
> > > + __func__, cap);
> > > + len = 4;
> > > + }
> > > + if (len == 0xFF) {
> > > + switch (cap) {
> > > + case PCI_CAP_ID_MSI:
> > > + len = pci_msi_cap_len(pdev, pos);
> > > + if (len < 0)
> > > + return len;
> > > + break;
> > > + case PCI_CAP_ID_PCIX:
> > > + ret = pci_read_config_word(pdev, pos + 2,
> > > + &flags);
> > > + if (ret < 0)
> > > + return ret;
> > > + if (flags & 0x3000)
> > > + len = 24;
> > > + else
> > > + len = 8;
> > > + break;
> > > + case PCI_CAP_ID_VNDR:
> > > + /* length follows next field */
> > > + ret = pci_read_config_byte(pdev, pos + 2, &tmp);
> > > + if (ret < 0)
> > > + return ret;
> > > + len = tmp;
> > > + break;
> > > + default:
> > > + len = 0;
> > > + break;
> > > + }
> > > + }
> > > +
> > > + for (i = 0; i < len; i++) {
> > > + if (map[pos+i] != 0xFF)
> > > + printk(KERN_WARNING
> > > + "%s: pci config conflict at %x, "
> > > + "caps %x %x\n",
> > > + __func__, i, map[pos+i], cap);
> > > + map[pos+i] = cap;
> > > + }
> > > + ret = pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, &pos);
> > > + if (ret < 0)
> > > + return ret;
> > > + }
> > > + if (loops <= 0)
> > > + printk(KERN_ERR "%s: config space loop!\n", __func__);
> > > + return 0;
> > > +}
> > > +
> > > +static void vfio_virt_init(struct vfio_dev *vdev)
> > > +{
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + int bar;
> > > + u32 *lp;
> > > + u32 val;
> > > + u8 pos;
> > > + int i, len;
> > > +
> > > + for (bar = 0; bar <= 5; bar++) {
> > > + lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> > > + pci_read_config_dword(pdev, PCI_BASE_ADDRESS_0 + 4*bar, &val);
> > > + *lp++ = val;
> > > + }
> > > + lp = (u32 *)vdev->vinfo.rombar;
> > > + pci_read_config_dword(pdev, PCI_ROM_ADDRESS, &val);
> > > + *lp = val;
> > > +
> > > + vdev->vinfo.intr = pdev->irq;
> > > +
> > > + pos = pci_find_capability(pdev, PCI_CAP_ID_MSI);
> > > + if (pos > 0) {
> > > + len = pci_msi_cap_len(pdev, pos);
> > > + if (len < 0)
> > > + return;
> > > + for (i = 0; i < len; i++)
> > > + (void) pci_read_config_byte(pdev, pos + i,
> > > + &vdev->vinfo.msi[i]);
> > > + }
> > > +}
> > > +
> > > +static void vfio_bar_fixup(struct vfio_dev *vdev)
> > > +{
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + int bar;
> > > + u32 *lp;
> > > + u64 mask;
> > > +
> > > + for (bar = 0; bar <= 5; bar++) {
> > > + if (pci_resource_start(pdev, bar))
> > > + mask = ~(pci_resource_len(pdev, bar) - 1);
> > > + else
> > > + mask = 0;
> > > + lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> > > + *lp &= (u32)mask;
> > > +
> > > + if (pci_resource_flags(pdev, bar) & IORESOURCE_IO)
> > > + *lp |= PCI_BASE_ADDRESS_SPACE_IO;
> > > + else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
> > > + *lp |= PCI_BASE_ADDRESS_SPACE_MEMORY;
> > > + if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
> > > + *lp |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> > > + if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM_64) {
> > > + *lp |= PCI_BASE_ADDRESS_MEM_TYPE_64;
> > > + lp++;
> > > + *lp &= (u32)(mask >> 32);
> > > + bar++;
> > > + }
> > > + }
> > > + }
> > > +
> > > + lp = (u32 *)vdev->vinfo.rombar;
> > > + mask = ~(pci_resource_len(pdev, PCI_ROM_RESOURCE) - 1);
> > > + *lp &= (u32)mask | PCI_ROM_ADDRESS_ENABLE;
> > > +
> > > + vdev->vinfo.bardirty = 0;
> > > +}
> > > +
> > > +static int vfio_config_rwbyte(int write,
> > > + struct vfio_dev *vdev,
> > > + int pos,
> > > + char __user *buf)
> >
> > Consider exporting from pci-sysfs.c instead of duplicating code.
>
> Seems like we'd need to reach some agreement that this should look a lot
> more like pci-sysfs config access first, no?

It seems to do the same, except it lets userspace do pci_write instead of
pci_user_write, which seems like a bug?

> > > +{
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + u8 *map = vdev->pci_config_map;
> > > + u8 cap, val, newval;
> > > + u16 start, off;
> > > + int p;
> > > + struct perm_bits *perm;
> > > + u8 wr, virt;
> > > + int ret;
> > > + int len;
> > > +
> > > + cap = map[pos];
> > > + if (cap == 0xFF) { /* unknown region */
> > > + if (write)
> > > + return 0; /* silent no-op */
> > > + val = 0;
> > > + if (pos <= pci_capability_length[0]) /* ok to read */
> > > + (void) pci_read_config_byte(pdev, pos, &val);
>
> Tom, there's a kernel memory leak here. val always needs to be
> initialized if we're going to stuff it back in the user buffer.
>
> > > + if (copy_to_user(buf, &val, 1))
> > > + return -EFAULT;
> > > + return 0;
> > > + }
> > > +
> > > + /* scan back to start of cap region */
> > > + for (p = pos; p >= 0; p--) {
> > > + if (map[p] != cap)
> > > + break;
> > > + start = p;
> > > + }
> > > + off = pos - start; /* offset within capability */
> > > +
> > > + perm = pci_cap_perms[cap];
> > > + if (cap == PCI_CAP_ID_MSI) {
> > > + len = pci_msi_cap_len(pdev, start);
> > > + switch (len) {
> > > + case 10:
> > > + perm = pci_cap_msi_10_perm;
> > > + break;
> > > + case 14:
> > > + perm = pci_cap_msi_14_perm;
> > > + break;
> > > + case 20:
> > > + perm = pci_cap_msi_20_perm;
> > > + break;
> > > + case 24:
> > > + perm = pci_cap_msi_24_perm;
> > > + break;
> > > + default:
> > > + perm = NULL;
> > > + break;
> > > + }
> > > + }
> > > + if (perm == NULL) {
> > > + wr = 0;
> > > + virt = 0;
> > > + } else {
> > > + perm += (off >> 2);
> > > + wr = perm->write >> ((off & 3) * 8);
> > > + virt = perm->rvirt >> ((off & 3) * 8);
> > > + }
> > > + if (write && !wr) /* no writeable bits */
> > > + return 0;
> > > + if (!virt) {
> > > + if (write) {
> > > + if (copy_from_user(&val, buf, 1))
> > > + return -EFAULT;
> > > + val &= wr;
> > > + if (wr != 0xFF) {
> > > + u8 existing;
> > > +
> > > + ret = pci_read_config_byte(pdev, pos,
> > > + &existing);
> > > + if (ret < 0)
> > > + return ret;
> > > + val |= (existing & ~wr);
> > > + }
> > > + pci_write_config_byte(pdev, pos, val);
> >
> > I think this should be pci_user_write_config_dword and same for read.
> >
> >
> > > + } else {
> > > + ret = pci_read_config_byte(pdev, pos, &val);
> > > + if (ret < 0)
> > > + return ret;
> > > + if (copy_to_user(buf, &val, 1))
> > > + return -EFAULT;
> > > + }
> > > + return 0;
> > > + }
> > > +
> > > + if (write) {
> > > + if (copy_from_user(&newval, buf, 1))
> > > + return -EFAULT;
> > > + }
> > > + /*
> > > + * We get here if there are some virt bits
> > > + * handle remaining real bits, if any
> > > + */
> > > + if (~virt) {
> > > + u8 rbits = (~virt) & wr;
> > > +
> > > + ret = pci_read_config_byte(pdev, pos, &val);
> > > + if (ret < 0)
> > > + return ret;
> > > + if (write && rbits) {
> > > + val &= ~rbits;
> > > + newval &= rbits;
> > > + val |= newval;
> > > + pci_write_config_byte(pdev, pos, val);
> > > + }
> > > + }
> > > + /*
> > > + * Now handle entirely virtual fields
> > > + */
> > > + switch (cap) {
> > > + case PCI_CAP_ID_BASIC: /* virtualize BARs */
> > > + switch (off) {
> > > + /*
> > > + * vendor and device are virt because they don't
> > > + * show up otherwise for sr-iov vfs
> > > + */
> > > + case PCI_VENDOR_ID:
> > > + val = pdev->vendor;
> > > + break;
> > > + case PCI_VENDOR_ID + 1:
> > > + val = pdev->vendor >> 8;
> > > + break;
> > > + case PCI_DEVICE_ID:
> > > + val = pdev->device;
> > > + break;
> > > + case PCI_DEVICE_ID + 1:
> > > + val = pdev->device >> 8;
> > > + break;
> > > + case PCI_INTERRUPT_LINE:
> > > + if (write)
> > > + vdev->vinfo.intr = newval;
> > > + else
> > > + val = vdev->vinfo.intr;
> > > + break;
> > > + case PCI_ROM_ADDRESS:
> > > + case PCI_ROM_ADDRESS+1:
> > > + case PCI_ROM_ADDRESS+2:
> > > + case PCI_ROM_ADDRESS+3:
> > > + if (write) {
> > > + vdev->vinfo.rombar[off & 3] = newval;
> > > + vdev->vinfo.bardirty = 1;
> > > + } else {
> > > + if (vdev->vinfo.bardirty)
> > > + vfio_bar_fixup(vdev);
> > > + val = vdev->vinfo.rombar[off & 3];
> > > + }
> > > + break;
> > > + default:
> > > + if (off >= PCI_BASE_ADDRESS_0 &&
> > > + off <= PCI_BASE_ADDRESS_5 + 3) {
> > > + int boff = off - PCI_BASE_ADDRESS_0;
> > > +
> > > + if (write) {
> > > + vdev->vinfo.bar[boff] = newval;
> > > + vdev->vinfo.bardirty = 1;
> > > + } else {
> > > + if (vdev->vinfo.bardirty)
> > > + vfio_bar_fixup(vdev);
> > > + val = vdev->vinfo.bar[boff];
> > > + }
> > > + }
> > > + break;
> > > + }
> > > + break;
> > > + case PCI_CAP_ID_MSI: /* virtualize (parts of) MSI */
> > > + if (write)
> > > + vdev->vinfo.msi[off] = newval;
> > > + else
> > > + val = vdev->vinfo.msi[off];
> > > + break;
> > > + }
> > > + if (!write && copy_to_user(buf, &val, 1))
> > > + return -EFAULT;
> > > + return 0;
> > > +}
> > > +
> > > +ssize_t vfio_config_readwrite(int write,
> > > + struct vfio_dev *vdev,
> > > + char __user *buf,
> > > + size_t count,
> > > + loff_t *ppos)
> > > +{
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + int done = 0;
> > > + int ret;
> > > + u16 pos;
> > > +
> > > +
> > > + if (vdev->pci_config_map == NULL) {
> > > + ret = vfio_build_config_map(vdev);
> > > + if (ret < 0)
> > > + goto out;
> > > + vfio_virt_init(vdev);
> > > + }
> > > +
> > > + while (count > 0) {
> > > + pos = *ppos;
> > > + if (pos == pdev->cfg_size)
> > > + break;
> > > + if (pos > pdev->cfg_size) {
> > > + ret = -EINVAL;
> > > + goto out;
> > > + }
> > > +
> > > + /*
> > > + * we grab the irqlock here to prevent confusing
> > > + * the read/modify/write sequence in vfio_interrupt
> > > + */
> > > + spin_lock_irq(&vdev->irqlock);
> > > + ret = vfio_config_rwbyte(write, vdev, pos, buf);
> > > + spin_unlock_irq(&vdev->irqlock);
> > > +
> > > + if (ret < 0)
> > > + goto out;
> > > + buf++;
> > > + done++;
> > > + count--;
> > > + (*ppos)++;
> > > + }
> > > + ret = done;
> > > +out:
> > > + return ret;
> > > +}
> > > diff --git a/drivers/vfio/vfio_rdwr.c b/drivers/vfio/vfio_rdwr.c
> > > index e69de29..f4bd2b7 100644
> > > --- a/drivers/vfio/vfio_rdwr.c
> > > +++ b/drivers/vfio/vfio_rdwr.c
> > > @@ -0,0 +1,152 @@
> > > +/*
> > > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > > + * Author: Tom Lyon, [email protected]
> >
> > Any hints on what this file does?
> >
> > > + *
> > > + * This program is free software; you may redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License as published by
> > > + * the Free Software Foundation; version 2 of the License.
> > > + *
> > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > > + * SOFTWARE.
> > > + *
> > > + * Portions derived from drivers/uio/uio.c:
> > > + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> > > + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> > > + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> > > + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> > > + *
> > > + * Portions derived from drivers/uio/uio_pci_generic.c:
> > > + * Copyright (C) 2009 Red Hat, Inc.
> > > + * Author: Michael S. Tsirkin <[email protected]>
> > > + */
> > > +
> > > +#include <linux/fs.h>
> > > +#include <linux/mmu_notifier.h>
> > > +#include <linux/pci.h>
> > > +#include <linux/uaccess.h>
> > > +#include <linux/io.h>
> > > +
> > > +#include <linux/vfio.h>
> > > +
> > > +ssize_t vfio_io_readwrite(
> > > + int write,
> > > + struct vfio_dev *vdev,
> > > + char __user *buf,
> > > + size_t count,
> > > + loff_t *ppos)
> > > +{
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + size_t done = 0;
> > > + resource_size_t end;
> > > + void __iomem *io;
> > > + loff_t pos;
> > > + int pci_space;
> > > + int unit;
> > > +
> > > + pci_space = vfio_offset_to_pci_space(*ppos);
> > > + pos = vfio_offset_to_pci_offset(*ppos);
> > > +
> > > + if (!pci_resource_start(pdev, pci_space))
> > > + return -EINVAL;
> > > + end = pci_resource_len(pdev, pci_space);
> > > + if (pos + count > end)
> > > + return -EINVAL;
> > > + if (vdev->bar[pci_space] == NULL)
> > > + vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
> > > + io = vdev->bar[pci_space];
> > > +
> > > + while (count > 0) {
> > > + if ((pos % 4) == 0 && count >= 4) {
> > > + u32 val;
> > > +
> > > + if (write) {
> > > + if (copy_from_user(&val, buf, 4))
> > > + return -EFAULT;
> > > + iowrite32(val, io + pos);
> > > + } else {
> > > + val = ioread32(io + pos);
> > > + if (copy_to_user(buf, &val, 4))
> > > + return -EFAULT;
> > > + }
> > > + unit = 4;
> > > + } else if ((pos % 2) == 0 && count >= 2) {
> > > + u16 val;
> > > +
> > > + if (write) {
> > > + if (copy_from_user(&val, buf, 2))
> > > + return -EFAULT;
> > > + iowrite16(val, io + pos);
> > > + } else {
> > > + val = ioread16(io + pos);
> > > + if (copy_to_user(buf, &val, 2))
> > > + return -EFAULT;
> > > + }
> > > + unit = 2;
> > > + } else {
> > > + u8 val;
> > > +
> > > + if (write) {
> > > + if (copy_from_user(&val, buf, 1))
> > > + return -EFAULT;
> > > + iowrite8(val, io + pos);
> > > + } else {
> > > + val = ioread8(io + pos);
> > > + if (copy_to_user(buf, &val, 1))
> > > + return -EFAULT;
> > > + }
> > > + unit = 1;
> > > + }
> > > + pos += unit;
> > > + buf += unit;
> > > + count -= unit;
> > > + done += unit;
> >
> > Again, export from pci-sysfs.c?
>
> I just submitted a patch to be able to do read/write to ioport
> resources, but I still think these serve a different purpose. PCI sysfs
> config and resource files don't have the same tie in to DMA mapping as
> vfio does, so they'll happily let you setup the device without the added
> security of an iommu. We have to trust that the user is going to be
> safe and use the iommu. I believe vfio is trying to create a safe
> interface that provides iommu enforcement and simple config space
> virtualization, so does need to reimplement a few things that we may be
> able to get unsafely through pci-sysfs.

I do realize you want to do everything through a single fd.
Sure, but add EXPORT_ and call the working code in pci-sysfs.c,
do not just duplicate code.

> > > + }
> > > + *ppos += done;
> > > + return done;
> > > +}
> > > +
> > > +ssize_t vfio_mem_readwrite(
> > > + int write,
> > > + struct vfio_dev *vdev,
> > > + char __user *buf,
> > > + size_t count,
> > > + loff_t *ppos)
> > > +{
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + resource_size_t end;
> > > + void __iomem *io;
> > > + loff_t pos;
> > > + int pci_space;
> > > +
> > > + pci_space = vfio_offset_to_pci_space(*ppos);
> > > + pos = vfio_offset_to_pci_offset(*ppos);
> > > +
> > > + if (!pci_resource_start(pdev, pci_space))
> > > + return -EINVAL;
> > > + end = pci_resource_len(pdev, pci_space);
> > > + if (vdev->bar[pci_space] == NULL)
> > > + vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
> > > + io = vdev->bar[pci_space];
> > > +
> > > + if (pos > end)
> > > + return -EINVAL;
> > > + if (pos == end)
> > > + return 0;
> > > + if (pos + count > end)
> > > + count = end - pos;
> > > + if (write) {
> > > + if (copy_from_user(io + pos, buf, count))
> > > + return -EFAULT;
> > > + } else {
> > > + if (copy_to_user(buf, io + pos, count))
> > > + return -EFAULT;
> > > + }
> > > + *ppos += count;
> > > + return count;
> > > +}
> > > diff --git a/drivers/vfio/vfio_sysfs.c b/drivers/vfio/vfio_sysfs.c
> > > index e69de29..6275809 100644
> > > --- a/drivers/vfio/vfio_sysfs.c
> > > +++ b/drivers/vfio/vfio_sysfs.c
> > > @@ -0,0 +1,153 @@
> > > +/*
> > > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > > + * Author: Tom Lyon, [email protected]
> > > + *
> > > + * This program is free software; you may redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License as published by
> > > + * the Free Software Foundation; version 2 of the License.
> > > + *
> > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > > + * SOFTWARE.
> > > + *
> > > + * Portions derived from drivers/uio/uio.c:
> > > + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> > > + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> > > + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> > > + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> > > + *
> > > + * Portions derived from drivers/uio/uio_pci_generic.c:
> > > + * Copyright (C) 2009 Red Hat, Inc.
> > > + * Author: Michael S. Tsirkin <[email protected]>
> > > + */
> > > +
> > > +#include <linux/module.h>
> > > +#include <linux/device.h>
> > > +#include <linux/kobject.h>
> > > +#include <linux/sysfs.h>
> > > +#include <linux/mm.h>
> > > +#include <linux/fs.h>
> > > +#include <linux/pci.h>
> > > +#include <linux/mmu_notifier.h>
> > > +
> > > +#include <linux/vfio.h>
> > > +
> > > +struct vfio_class *vfio_class;
> > > +
> > > +int vfio_class_init(void)
> > > +{
> > > + int ret = 0;
> > > +
> > > + if (vfio_class != NULL) {
> > > + kref_get(&vfio_class->kref);
> > > + goto exit;
> > > + }
> > > +
> > > + vfio_class = kzalloc(sizeof(*vfio_class), GFP_KERNEL);
> > > + if (!vfio_class) {
> > > + ret = -ENOMEM;
> > > + goto err_kzalloc;
> > > + }
> > > +
> > > + kref_init(&vfio_class->kref);
> > > + vfio_class->class = class_create(THIS_MODULE, "vfio");
> > > + if (IS_ERR(vfio_class->class)) {
> > > + ret = IS_ERR(vfio_class->class);
> > > + printk(KERN_ERR "class_create failed for vfio\n");
> > > + goto err_class_create;
> > > + }
> > > + return 0;
> > > +
> > > +err_class_create:
> > > + kfree(vfio_class);
> > > + vfio_class = NULL;
> > > +err_kzalloc:
> > > +exit:
> > > + return ret;
> > > +}
> > > +
> > > +static void vfio_class_release(struct kref *kref)
> > > +{
> > > + /* Ok, we cheat as we know we only have one vfio_class */
> > > + class_destroy(vfio_class->class);
> > > + kfree(vfio_class);
> > > + vfio_class = NULL;
> > > +}
> > > +
> > > +void vfio_class_destroy(void)
> > > +{
> > > + if (vfio_class)
> > > + kref_put(&vfio_class->kref, vfio_class_release);
> > > +}
> > > +
> > > +static ssize_t config_map_read(struct kobject *kobj,
> > > + struct bin_attribute *bin_attr,
> > > + char *buf, loff_t off, size_t count)
> > > +{
> > > + struct vfio_dev *vdev = bin_attr->private;
> > > + int ret;
> > > +
> > > + if (off >= 256)
> > > + return 0;
> > > + if (off + count > 256)
> > > + count = 256 - off;
> > > + if (vdev->pci_config_map == NULL) {
> > > + ret = vfio_build_config_map(vdev);
> > > + if (ret < 0)
> > > + return ret;
> > > + }
> > > + memcpy(buf, vdev->pci_config_map + off, count);
> > > + return count;
> > > +}
> > > +
> > > +static ssize_t show_locked_pages(struct device *dev,
> > > + struct device_attribute *attr,
> > > + char *buf)
> > > +{
> > > + struct vfio_dev *vdev = dev_get_drvdata(dev);
> > > +
> > > + if (vdev == NULL)
> > > + return -ENODEV;
> > > + return sprintf(buf, "%u\n", vdev->locked_pages);
> > > +}
> > > +
> > > +static DEVICE_ATTR(locked_pages, S_IRUGO, show_locked_pages, NULL);
> > > +
> > > +static struct attribute *vfio_attrs[] = {
> > > + &dev_attr_locked_pages.attr,
> > > + NULL,
> > > +};
> > > +
> > > +static struct attribute_group vfio_attr_grp = {
> > > + .attrs = vfio_attrs,
> > > +};
> > > +
> > > +static struct bin_attribute config_map_bin_attribute = {
> > > + .attr = {
> > > + .name = "config_map",
> > > + .mode = S_IRUGO,
> > > + },
> > > + .size = 256,
> > > + .read = config_map_read,
> > > +};
> >
> > The config map looks like an internal implementation detail
> > of the driver. Right? If so exposing it will tie us to
> > this forever, so not a good idea.
> >
> > > +
> > > +int vfio_dev_add_attributes(struct vfio_dev *vdev)
> > > +{
> > > + struct bin_attribute *bi;
> > > + int ret;
> > > +
> > > + ret = sysfs_create_group(&vdev->dev->kobj, &vfio_attr_grp);
> > > + if (ret)
> > > + return ret;
> > > + bi = kmalloc(sizeof(*bi), GFP_KERNEL);
> > > + if (bi == NULL)
> > > + return -ENOMEM;
> > > + *bi = config_map_bin_attribute;
> > > + bi->private = vdev;
> > > + return sysfs_create_bin_file(&vdev->dev->kobj, bi);
> > > +}
> > > diff --git a/include/linux/Kbuild b/include/linux/Kbuild
> > > index e2ea0b2..ed37cf4 100644
> > > --- a/include/linux/Kbuild
> > > +++ b/include/linux/Kbuild
> > > @@ -166,6 +166,7 @@ header-y += ultrasound.h
> > > header-y += un.h
> > > header-y += utime.h
> > > header-y += veth.h
> > > +header-y += vfio.h
> > > header-y += videotext.h
> > > header-y += x25.h
> > >
> > > diff --git a/include/linux/uiommu.h b/include/linux/uiommu.h
> > > index e69de29..a7b7eac 100644
> > > --- a/include/linux/uiommu.h
> > > +++ b/include/linux/uiommu.h
> > > @@ -0,0 +1,76 @@
> > > +/*
> > > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > > + * Author: Tom Lyon, [email protected]
> > > + *
> > > + * This program is free software; you may redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License as published by
> > > + * the Free Software Foundation; version 2 of the License.
> > > + *
> > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > > + * SOFTWARE.
> > > + */
> > > +
> > > +/*
> > > + * uiommu driver - manipulation of iommu domains from user progs
> > > + */
> > > +struct uiommu_domain {
> > > + struct iommu_domain *domain;
> > > + atomic_t refcnt;
> > > +};
> > > +
> > > +/*
> > > + * Kernel routines invoked by fellow driver (vfio)
> > > + * after uiommu domain fd is passed in.
> > > + */
> > > +struct uiommu_domain *uiommu_fdget(int fd);
> > > +void uiommu_put(struct uiommu_domain *);
> > > +
> > > +/*
> > > + * These inlines are placeholders for future routines
> > > + * which may keep statistics, show info in sysfs, etc.
> > > + */
> > > +static inline int uiommu_attach_device(struct uiommu_domain *udomain,
> > > + struct device *dev)
> > > +{
> > > + return iommu_attach_device(udomain->domain, dev);
> > > +}
> > > +
> > > +static inline void uiommu_detach_device(struct uiommu_domain *udomain,
> > > + struct device *dev)
> > > +{
> > > + iommu_detach_device(udomain->domain, dev);
> > > +}
> > > +
> > > +static inline int uiommu_map_range(struct uiommu_domain *udomain,
> > > + unsigned long iova,
> > > + phys_addr_t paddr,
> > > + size_t size,
> > > + int prot)
> > > +{
> > > + return iommu_map_range(udomain->domain, iova, paddr, size, prot);
> > > +}
> > > +
> > > +static inline void uiommu_unmap_range(struct uiommu_domain *udomain,
> > > + unsigned long iova,
> > > + size_t size)
> > > +{
> > > + iommu_unmap_range(udomain->domain, iova, size);
> > > +}
> > > +
> > > +static inline phys_addr_t uiommu_iova_to_phys(struct uiommu_domain *udomain,
> > > + unsigned long iova)
> > > +{
> > > + return iommu_iova_to_phys(udomain->domain, iova);
> > > +}
> > > +
> > > +static inline int uiommu_domain_has_cap(struct uiommu_domain *udomain,
> > > + unsigned long cap)
> > > +{
> > > + return iommu_domain_has_cap(udomain->domain, cap);
> > > +}
> > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > index e69de29..52aa0dd 100644
> > > --- a/include/linux/vfio.h
> > > +++ b/include/linux/vfio.h
> > > @@ -0,0 +1,202 @@
> > > +/*
> > > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > > + * Author: Tom Lyon, [email protected]
> > > + *
> > > + * This program is free software; you may redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License as published by
> > > + * the Free Software Foundation; version 2 of the License.
> > > + *
> > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > > + * SOFTWARE.
> > > + *
> > > + * Portions derived from drivers/uio/uio.c:
> > > + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> > > + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> > > + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> > > + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> > > + *
> > > + * Portions derived from drivers/uio/uio_pci_generic.c:
> > > + * Copyright (C) 2009 Red Hat, Inc.
> > > + * Author: Michael S. Tsirkin <[email protected]>
> > > + */
> > > +#include <linux/types.h>
> > > +
> > > +/*
> > > + * VFIO driver - allow mapping and use of certain PCI devices
> > > + * in unprivileged user processes. (If IOMMU is present)
> > > + * Especially useful for Virtual Function parts of SR-IOV devices
> > > + */
> > > +
> > > +#ifdef __KERNEL__
> > > +
> > > +struct vfio_dev {
> > > + struct device *dev;
> > > + struct pci_dev *pdev;
> > > + u8 *pci_config_map;
> > > + int pci_config_size;
> > > + char name[8];
> > > + int devnum;
> > > + void __iomem *bar[PCI_ROM_RESOURCE+1];
> > > + spinlock_t irqlock; /* guards command register accesses */
> > > + int listeners;
> > > + u32 locked_pages;
> > > + struct mutex lgate; /* listener gate */
> > > + struct mutex dgate; /* dma op gate */
> > > + struct mutex igate; /* intr op gate */
> > > + wait_queue_head_t dev_idle_q;
> > > + int mapcount;
> > > + struct uiommu_domain *udomain;
> > > + struct msix_entry *msix;
> > > + int nvec;
> > > + int cachec;
> > > + struct eventfd_ctx *ev_irq;
> > > + struct eventfd_ctx *ev_msi;
> > > + struct eventfd_ctx **ev_msix;
> > > + struct {
> > > + u8 intr;
> > > + u8 bardirty;
> > > + u8 rombar[4];
> > > + u8 bar[6*4];
> > > + u8 msi[24];
> > > + } vinfo;
> > > +};
> > > +
> > > +struct vfio_listener {
> > > + struct vfio_dev *vdev;
> > > + struct list_head dm_list;
> > > + struct mm_struct *mm;
> > > + struct mmu_notifier mmu_notifier;
> > > +};
> > > +
> > > +/*
> > > + * Structure for keeping track of memory nailed down by the
> > > + * user for DMA
> > > + */
> > > +struct dma_map_page {
> > > + struct list_head list;
> > > + struct page **pages;
> > > + dma_addr_t daddr;
> > > + unsigned long vaddr;
> > > + int npage;
> > > + int rdwr;
> > > +};
> > > +
> > > +/* VFIO class infrastructure */
> > > +struct vfio_class {
> > > + struct kref kref;
> > > + struct class *class;
> > > +};
> > > +extern struct vfio_class *vfio_class;
> > > +
> > > +ssize_t vfio_io_readwrite(int, struct vfio_dev *,
> > > + char __user *, size_t, loff_t *);
> > > +ssize_t vfio_mem_readwrite(int, struct vfio_dev *,
> > > + char __user *, size_t, loff_t *);
> > > +ssize_t vfio_config_readwrite(int, struct vfio_dev *,
> > > + char __user *, size_t, loff_t *);
> > > +
> > > +void vfio_disable_msi(struct vfio_dev *);
> > > +void vfio_disable_msix(struct vfio_dev *);
> > > +int vfio_enable_msi(struct vfio_dev *, int);
> > > +int vfio_enable_msix(struct vfio_dev *, int, void __user *);
> > > +
> > > +#ifndef PCI_MSIX_ENTRY_SIZE
> > > +#define PCI_MSIX_ENTRY_SIZE 16
> > > +#endif
> > > +#ifndef PCI_STATUS_INTERRUPT
> > > +#define PCI_STATUS_INTERRUPT 0x08
> > > +#endif
> > > +
> > > +struct vfio_dma_map;
> > > +void vfio_dma_unmapall(struct vfio_listener *);
> > > +int vfio_dma_unmap_dm(struct vfio_listener *, struct vfio_dma_map *);
> > > +int vfio_dma_map_common(struct vfio_listener *, unsigned int,
> > > + struct vfio_dma_map *);
> > > +int vfio_domain_set(struct vfio_dev *, int, int);
> > > +int vfio_domain_unset(struct vfio_dev *);
> > > +
> > > +int vfio_class_init(void);
> > > +void vfio_class_destroy(void);
> > > +int vfio_dev_add_attributes(struct vfio_dev *);
> > > +int vfio_build_config_map(struct vfio_dev *);
> > > +
> > > +irqreturn_t vfio_interrupt(int, void *);
> > > +
> > > +#endif /* __KERNEL__ */
> > > +
> > > +/* Kernel & User level defines for ioctls */
> > > +
> > > +/*
> > > + * Structure for DMA mapping of user buffers
> > > + * vaddr, dmaaddr, and size must all be page aligned
> > > + * buffer may only be larger than 1 page if (a) there is
> > > + * an iommu in the system, or (b) buffer is part of a huge page
> > > + */
> > > +struct vfio_dma_map {
> > > + __u64 vaddr; /* process virtual addr */
> > > + __u64 dmaaddr; /* desired and/or returned dma address */
> > > + __u64 size; /* size in bytes */
> > > + __u64 flags; /* bool: 0 for r/o; 1 for r/w */
> > > +#define VFIO_FLAG_WRITE 0x1 /* req writeable DMA mem */
> > > +};
> > > +
> > > +/* map user pages at specific dma address */
> > > +/* requires previous VFIO_DOMAIN_SET */
> > > +#define VFIO_DMA_MAP_IOVA _IOWR(';', 101, struct vfio_dma_map)
> > > +
> > > +/* unmap user pages */
> > > +#define VFIO_DMA_UNMAP _IOW(';', 102, struct vfio_dma_map)
> > > +
> > > +/* request IRQ interrupts; use given eventfd */
> > > +#define VFIO_EVENTFD_IRQ _IOW(';', 103, int)
> > > +
> > > +/* request MSI interrupts; use given eventfd */
> > > +#define VFIO_EVENTFD_MSI _IOW(';', 104, int)
> > > +
> > > +/* Request MSI-X interrupts: arg[0] is #, arg[1-n] are eventfds */
> > > +#define VFIO_EVENTFDS_MSIX _IOW(';', 105, int)
> > > +
> > > +/* Get length of a BAR */
> > > +#define VFIO_BAR_LEN _IOWR(';', 167, __u32)
> > > +
> > > +/* Set the IOMMU domain - arg is fd from uiommu driver */
> > > +#define VFIO_DOMAIN_SET _IOW(';', 107, int)
> > > +
> > > +/* Unset the IOMMU domain */
> > > +#define VFIO_DOMAIN_UNSET _IO(';', 108)
> > > +
> > > +/*
> > > + * Reads, writes, and mmaps determine which PCI BAR (or config space)
> > > + * from the high level bits of the file offset
> > > + */
> > > +#define VFIO_PCI_BAR0_RESOURCE 0x0
> > > +#define VFIO_PCI_BAR1_RESOURCE 0x1
> > > +#define VFIO_PCI_BAR2_RESOURCE 0x2
> > > +#define VFIO_PCI_BAR3_RESOURCE 0x3
> > > +#define VFIO_PCI_BAR4_RESOURCE 0x4
> > > +#define VFIO_PCI_BAR5_RESOURCE 0x5
> >
> > This looks wrong.
> > In the code I snipped, we had:
> > + pci_space = vfio_offset_to_pci_space(*ppos);
> >
> > So this is actually used as resource number, not BAR number.
> > One or the other will need to get fixed, otherwise
> > with 64 bit BAR0, you will get the wrong resource.
>
> Aren't BAR number and resource number synonymous? I would have assumed
> that if BAR0 is 64bit, pci_resource_len(, BAR1) is zero, which should
> work just fine for the way it's coded up here.

You are right. However, what confused me is the way resource number is
passed to pci functions. If we do this, we should make this explicit in
a conversion function:

switch (resource) {
case VFIO_PCI_BAR0_RESOURCE..VFIO_PCI_BAR5_RESOURCE:
return PCI_STD_RESOURCES + resource - VFIO_PCI_BAR0_RESOURCE;
case VFIO_PCI_ROM_RESOURCE:
return PCI_ROM_RESOURCE;
default:
BUG_ON(...);
}

or reuse macros from linux/pci.h, instead of relying on the numbers to match.


> > > +#define VFIO_PCI_ROM_RESOURCE 0x6
> > > +#define VFIO_PCI_CONFIG_RESOURCE 0xF
> > > +#define VFIO_PCI_SPACE_SHIFT 32
> > > +#define VFIO_PCI_CONFIG_OFF vfio_pci_space_to_offset(VFIO_PCI_CONFIG_RESOURCE)
> > > +
> > > +static inline int vfio_offset_to_pci_space(__u64 off)
> > > +{
> > > + return (off >> VFIO_PCI_SPACE_SHIFT) & 0xF;
> > > +}
> > > +
> > > +static inline u32 vfio_offset_to_pci_offset(__u64 off)
> > > +{
> > > + return off & (u32)0xFFFFFFFF;
> > > +}
> > > +
> > > +static inline __u64 vfio_pci_space_to_offset(int sp)
> > > +{
> > > + return (__u64)(sp) << VFIO_PCI_SPACE_SHIFT;
> > > +}
>
>

2010-07-20 20:24:39

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

On Sat, Jul 17, 2010 at 10:45:23AM +0200, Piotr Jaroszy??ski wrote:
> On 16 July 2010 23:58, Tom Lyon <[email protected]> wrote:
> > The VFIO "driver" is used to allow privileged AND non-privileged processes to
> > implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> > devices.
>
> Thanks for working on that! I wonder whether it's possible to say what
> are the chances of it being merged to mainline and which version we
> might be talking about?

We still have a long way to go before you need to worry about what
kernel version it's going to show up in...

thanks,

greg k-h

2010-07-27 22:13:45

by Tom Lyon

[permalink] [raw]
Subject: Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

[ Sorry for the long hiatus, I've been wrapped up in other issues.]

I think the fundamental issue to resolve is to decide on the model which the
VFIO driver presents to its users.

Fundamentally, VFIO as part of the OS must protect the system from its users
and also protect the users from each other. No disagreement here.

But another fundamental purpose of an OS to to present an abstract model of
the underlying hardware to its users, so that the users don't have to deal
with the full complexity of the hardware.

So I think VFIO should present a 'virtual', abstracted PCI device to its users
whereas Michael has argued for a simpler model of presenting the real PCI
device config registers while preventing writes only to the registers which
would clearly disrupt the system.

Now, the virtual model *could* look little like the real hardware, and use
bunches of ioctls for everything it needs, or it could look a lot like PCI and
use reads and writes of the virtual PCI config registers to trigger its
actions. The latter makes things more amenable to those porting drivers from
other environments.

I realize that to date the VFIO driver has been a bit of a mish-mash between
the ioctl and config based techniques; I intend to clean that up. And, yes,
the abstract model presented by VFIO will need plenty of documentation.

Since KVM/qemu already has its own notion of a virtual PCI device which it
presents to the guest OS, we either need to reconcile VFIO and qemu, or
provide a bypass of the VFIO virtual model. This could be direct access
through sysfs, or else an ioctl to VFIO. Since I have no internals knowledge
of qemu, I look to others to choose.

Other little things:
1. Yes, I can share some code with sysfs if I can get the right EXPORTs there.
2. I'll add multiple MSI support, but I wish to point out that even though the
PCI MSI API supports it, none of the architectures do.
3. FLR needs work. I was foolish enough to assume that FLR wouldn't reset
BARs; now I know better.
4. I'll get rid of the vfio config_map in sysfs; it was there for debugging.
5. I'm still looking to support hotplug/unplug and power management stuff via
generic netlink notifications.

2010-07-27 23:59:09

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

On Tue, Jul 27, 2010 at 03:13:14PM -0700, Tom Lyon wrote:
> [ Sorry for the long hiatus, I've been wrapped up in other issues.]
>
> I think the fundamental issue to resolve is to decide on the model which the
> VFIO driver presents to its users.
>
> Fundamentally, VFIO as part of the OS must protect the system from its users
> and also protect the users from each other. No disagreement here.
>
> But another fundamental purpose of an OS to to present an abstract model of
> the underlying hardware to its users, so that the users don't have to deal
> with the full complexity of the hardware.
>
> So I think VFIO should present a 'virtual', abstracted PCI device to its users
> whereas Michael has argued for a simpler model of presenting the real PCI
> device config registers while preventing writes only to the registers which
> would clearly disrupt the system.

In fact, there is no contradiction. I am all for an abstracted
API *and* I think the virtualization concept is a bad way
to build this API.

The 'virtual' interface you present is very complex and hardware specific:
you do not hide literally *anything*. Deciding which functionality userspace
needs, and exposing it to userspace as a set of APIs would be abstract.
Instead you ask people to go read the PCI spec, the device spec, and bang
on PCI registers, little-endian-ness and all, then try to interpret
what do the virtual values mean.

Example:

How do I find # of MSI-X vectors? Sure, scan the capability list,
find the capability, read the value, convert from little endian
at each step.
A page or two of code, and let's hope I have a ppc to test on.
And note no driver has this code - they all use OS routines.

So why wouldn't
ioctl(dev, VFIO_GET_MSIX_VECTORS, &n);
better serve the declared goal of presenting an abstracted PCI device to
users?


> Now, the virtual model *could* look little like the real hardware, and use
> bunches of ioctls for everything it needs,

Or reads/writes at special offsets, or sysfs attributes.

> or it could look a lot like PCI and
> use reads and writes of the virtual PCI config registers to trigger its
> actions. The latter makes things more amenable to those porting drivers from
> other environments.

I really doubt this helps at all. Drivers typically use OS-specific
APIs. It is very uncommon for them to touch standard registers,
which is 100% of what your patch seem to be dealing with.

And again, how about a small userspace library that would wrap vfio and
add the abstractions for drivers that do need them?

> I realize that to date the VFIO driver has been a bit of a mish-mash between
> the ioctl and config based techniques; I intend to clean that up. And, yes,
> the abstract model presented by VFIO will need plenty of documentation.

And, it will need to be maintained forever, bugs and all.
For example, if you change some register you emulated
to fix a bug, to the driver this looks like a hardware change,
and it will crash.

The PCI spec has some weak versioning support, but it
is mostly not a problem in that space: a specific driver needs to
only deal with a specific device. We have a generic driver so PCI
configuration space is a bad interface to use.


> Since KVM/qemu already has its own notion of a virtual PCI device which it
> presents to the guest OS, we either need to reconcile VFIO and qemu, or
> provide a bypass of the VFIO virtual model. This could be direct access
> through sysfs, or else an ioctl to VFIO. Since I have no internals knowledge
> of qemu, I look to others to choose.

Ah, so there will be 2 APIs, one for qemu, one for userspace drivers?

> Other little things:
> 1. Yes, I can share some code with sysfs if I can get the right EXPORTs there.
> 2. I'll add multiple MSI support, but I wish to point out that even though the
> PCI MSI API supports it, none of the architectures do.
> 3. FLR needs work. I was foolish enough to assume that FLR wouldn't reset
> BARs; now I know better.

And as I said separately, drivers might reset BARs without FLR as well.
As long as io/memory is disabled, we really should allow userspace
write anything in BARs. And once we let it do it, most of the problem goes
away.

> 4. I'll get rid of the vfio config_map in sysfs; it was there for debugging.
> 5. I'm still looking to support hotplug/unplug and power management stuff via
> generic netlink notifications.

2010-07-28 21:14:59

by Tom Lyon

[permalink] [raw]
Subject: Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

On Tuesday, July 27, 2010 04:53:22 pm Michael S. Tsirkin wrote:
> On Tue, Jul 27, 2010 at 03:13:14PM -0700, Tom Lyon wrote:
> > [ Sorry for the long hiatus, I've been wrapped up in other issues.]
> >
> > I think the fundamental issue to resolve is to decide on the model which
> > the VFIO driver presents to its users.
> >
> > Fundamentally, VFIO as part of the OS must protect the system from its
> > users and also protect the users from each other. No disagreement here.
> >
> > But another fundamental purpose of an OS to to present an abstract model
> > of the underlying hardware to its users, so that the users don't have to
> > deal with the full complexity of the hardware.
> >
> > So I think VFIO should present a 'virtual', abstracted PCI device to its
> > users whereas Michael has argued for a simpler model of presenting the
> > real PCI device config registers while preventing writes only to the
> > registers which would clearly disrupt the system.
>
> In fact, there is no contradiction. I am all for an abstracted
> API *and* I think the virtualization concept is a bad way
> to build this API.
>
> The 'virtual' interface you present is very complex and hardware specific:
> you do not hide literally *anything*. Deciding which functionality
> userspace needs, and exposing it to userspace as a set of APIs would be
> abstract. Instead you ask people to go read the PCI spec, the device spec,
> and bang on PCI registers, little-endian-ness and all, then try to
> interpret what do the virtual values mean.

Exactly! The PCI bus is far better *specified*, *documented*, and widely
implemented than a Linux driver could ever hope to be. And there are lots of
current Linux drivers which bang around in pci config space simply because the
authors were not aware of some api call buried deep in linux which would do
the work for them - or - got tired of using OS-specific APIs when porting a
driver and decided to just ask the hardware.

> Example:
>
> How do I find # of MSI-X vectors? Sure, scan the capability list,
> find the capability, read the value, convert from little endian
> at each step.
> A page or two of code, and let's hope I have a ppc to test on.
> And note no driver has this code - they all use OS routines.
>
> So why wouldn't
> ioctl(dev, VFIO_GET_MSIX_VECTORS, &n);
> better serve the declared goal of presenting an abstracted PCI device to
> users?

By and large, the user drivers just know how many because the hardware is
constant.

And inventing 20 or 30 ioctls to do a bunch of random stuff is gross when you
can instead use normal read and write calls to a well defined structure.
>
> > Now, the virtual model *could* look little like the real hardware, and
> > use bunches of ioctls for everything it needs,
>
> Or reads/writes at special offsets, or sysfs attributes.
>
> > or it could look a lot like PCI and
> > use reads and writes of the virtual PCI config registers to trigger its
> > actions. The latter makes things more amenable to those porting drivers
> > from other environments.
>
> I really doubt this helps at all. Drivers typically use OS-specific
> APIs. It is very uncommon for them to touch standard registers,
> which is 100% of what your patch seem to be dealing with.
>
> And again, how about a small userspace library that would wrap vfio and
> add the abstractions for drivers that do need them?

Yes, there will be userspace libraries - I already have a vfio backend for
libpci.
>
> > I realize that to date the VFIO driver has been a bit of a mish-mash
> > between the ioctl and config based techniques; I intend to clean that
> > up. And, yes, the abstract model presented by VFIO will need plenty of
> > documentation.
>
> And, it will need to be maintained forever, bugs and all.
> For example, if you change some register you emulated
> to fix a bug, to the driver this looks like a hardware change,
> and it will crash.

The changes will come only to allow for a more-perfect emulation, so I doubt
that will cause driver problems. No different than discovering and fixing
bugs in the ioctls needed in you scenario.

>
> The PCI spec has some weak versioning support, but it
> is mostly not a problem in that space: a specific driver needs to
> only deal with a specific device. We have a generic driver so PCI
> configuration space is a bad interface to use.

PCI has great versioning. Damn near every change made in 16+ years has been
upwards compatible. BIOS and OS writers don't have trouble with generic PCI,
why should vfio?

>
> > Since KVM/qemu already has its own notion of a virtual PCI device which
> > it presents to the guest OS, we either need to reconcile VFIO and qemu,
> > or provide a bypass of the VFIO virtual model. This could be direct
> > access through sysfs, or else an ioctl to VFIO. Since I have no
> > internals knowledge of qemu, I look to others to choose.
>
> Ah, so there will be 2 APIs, one for qemu, one for userspace drivers?

I hope not, but I also hope not to become the qemu expert to find out. Alex
W. seemed to be making progress in this area.

>
> > Other little things:
> > 1. Yes, I can share some code with sysfs if I can get the right EXPORTs
> > there. 2. I'll add multiple MSI support, but I wish to point out that
> > even though the PCI MSI API supports it, none of the architectures do.
> > 3. FLR needs work. I was foolish enough to assume that FLR wouldn't
> > reset BARs; now I know better.
>
> And as I said separately, drivers might reset BARs without FLR as well.
> As long as io/memory is disabled, we really should allow userspace
> write anything in BARs. And once we let it do it, most of the problem goes
> away.
>
> > 4. I'll get rid of the vfio config_map in sysfs; it was there for
> > debugging. 5. I'm still looking to support hotplug/unplug and power
> > management stuff via generic netlink notifications.

2010-07-28 21:52:45

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jul 28, 2010 at 02:14:21PM -0700, Tom Lyon wrote:
> On Tuesday, July 27, 2010 04:53:22 pm Michael S. Tsirkin wrote:
> > On Tue, Jul 27, 2010 at 03:13:14PM -0700, Tom Lyon wrote:
> > > [ Sorry for the long hiatus, I've been wrapped up in other issues.]
> > >
> > > I think the fundamental issue to resolve is to decide on the model which
> > > the VFIO driver presents to its users.
> > >
> > > Fundamentally, VFIO as part of the OS must protect the system from its
> > > users and also protect the users from each other. No disagreement here.
> > >
> > > But another fundamental purpose of an OS to to present an abstract model
> > > of the underlying hardware to its users, so that the users don't have to
> > > deal with the full complexity of the hardware.
> > >
> > > So I think VFIO should present a 'virtual', abstracted PCI device to its
> > > users whereas Michael has argued for a simpler model of presenting the
> > > real PCI device config registers while preventing writes only to the
> > > registers which would clearly disrupt the system.
> >
> > In fact, there is no contradiction. I am all for an abstracted
> > API *and* I think the virtualization concept is a bad way
> > to build this API.
> >
> > The 'virtual' interface you present is very complex and hardware specific:
> > you do not hide literally *anything*. Deciding which functionality
> > userspace needs, and exposing it to userspace as a set of APIs would be
> > abstract. Instead you ask people to go read the PCI spec, the device spec,
> > and bang on PCI registers, little-endian-ness and all, then try to
> > interpret what do the virtual values mean.
>
> Exactly! The PCI bus is far better *specified*, *documented*, and widely
> implemented than a Linux driver could ever hope to be.

Yes but it does not map all that well to what you need to do.
We need a sane backward compatibility plan, cross-platform support,
error reporting, atomicity ... PCI config has support for none of this.
So you implement a "kind of" PCI config, where accesses might fail
or not go through to device, where there are some atomicity guarantees
but not others ...
And there won't even be a header file to look at to say "aha,
this driver has this functionality".
How does an application know whether you support capability X?
Reading the driver source seems to be shaping up the only way.

> And there are lots of
> current Linux drivers which bang around in pci config space simply because the
> authors were not aware of some api call buried deep in linux which would do
> the work for them - or - got tired of using OS-specific APIs when porting a
> driver and decided to just ask the hardware.

Really? Example? drivers either use proper APIs or are broken in some way.
You can not even size the BARs without using the OS API.
So what's safe to do directly? Maybe reading out device/vendor/revision ID ...
looks like small change to me.

>
> > Example:
> >
> > How do I find # of MSI-X vectors? Sure, scan the capability list,
> > find the capability, read the value, convert from little endian
> > at each step.
> > A page or two of code, and let's hope I have a ppc to test on.
> > And note no driver has this code - they all use OS routines.
> >
> > So why wouldn't
> > ioctl(dev, VFIO_GET_MSIX_VECTORS, &n);
> > better serve the declared goal of presenting an abstracted PCI device to
> > users?
>
> By and large, the user drivers just know how many because the hardware is
> constant.

But you might not have CPU resources to allocate all vectors.
And, same will apply to any register you spend code virtualizing.

> And inventing 20 or 30 ioctls to do a bunch of random stuff is gross


If you dislike ioctls, use read/write at a defined offset,
or sysfs. Just don't pretend you can say "look at PCI spec"
and avoid the need to document your interface this way.

> when you
> can instead use normal read and write calls to a well defined structure.

It is not all that well defined.
What if hardware supports MSIX but host controller does not?
Do you return error from write enabling MSIX?
Virtualize it and pretend there is no capability?
PCI has no provision for this, and deciding what to do
here is policy which kernel should not dictate.


> >
> > > Now, the virtual model *could* look little like the real hardware, and
> > > use bunches of ioctls for everything it needs,
> >
> > Or reads/writes at special offsets, or sysfs attributes.
> >
> > > or it could look a lot like PCI and
> > > use reads and writes of the virtual PCI config registers to trigger its
> > > actions. The latter makes things more amenable to those porting drivers
> > > from other environments.
> >
> > I really doubt this helps at all. Drivers typically use OS-specific
> > APIs. It is very uncommon for them to touch standard registers,
> > which is 100% of what your patch seem to be dealing with.
> >
> > And again, how about a small userspace library that would wrap vfio and
> > add the abstractions for drivers that do need them?
>
> Yes, there will be userspace libraries - I already have a vfio backend for
> libpci.

So move the virtualization stuff there, and out of kernel.

> > > I realize that to date the VFIO driver has been a bit of a mish-mash
> > > between the ioctl and config based techniques; I intend to clean that
> > > up. And, yes, the abstract model presented by VFIO will need plenty of
> > > documentation.
> >
> > And, it will need to be maintained forever, bugs and all.
> > For example, if you change some register you emulated
> > to fix a bug, to the driver this looks like a hardware change,
> > and it will crash.
>
> The changes will come only to allow for a more-perfect emulation,
> so I doubt
> that will cause driver problems.

You plan changing the API to accomodate new hardware
and doubt this will create problems?
'more perfect emulation' for one app is a crasher bug for another one.


> No different than discovering and fixing
> bugs in the ioctls needed in you scenario.

Very different. With a sane interface you can just add
another register to encode new information, keeping
the old one around to avoid breaking userspace.
PCI is not designed to allow this, so it does not.

> >
> > The PCI spec has some weak versioning support, but it
> > is mostly not a problem in that space: a specific driver needs to
> > only deal with a specific device. We have a generic driver so PCI
> > configuration space is a bad interface to use.
>
> PCI has great versioning. Damn near every change made in 16+ years has been
> upwards compatible.

You plan to push interface extensions for your driver through PCI SIG?

> BIOS and OS writers don't have trouble with generic PCI,
> why should vfio?

They do with it what it was defined to do. You want to use it
as a system call interface which it was never intended for.

> >
> > > Since KVM/qemu already has its own notion of a virtual PCI device which
> > > it presents to the guest OS, we either need to reconcile VFIO and qemu,
> > > or provide a bypass of the VFIO virtual model. This could be direct
> > > access through sysfs, or else an ioctl to VFIO. Since I have no
> > > internals knowledge of qemu, I look to others to choose.
> >
> > Ah, so there will be 2 APIs, one for qemu, one for userspace drivers?
>
> I hope not, but I also hope not to become the qemu expert to find out. Alex
> W. seemed to be making progress in this area.
>
> >
> > > Other little things:
> > > 1. Yes, I can share some code with sysfs if I can get the right EXPORTs
> > > there. 2. I'll add multiple MSI support, but I wish to point out that
> > > even though the PCI MSI API supports it, none of the architectures do.
> > > 3. FLR needs work. I was foolish enough to assume that FLR wouldn't
> > > reset BARs; now I know better.
> >
> > And as I said separately, drivers might reset BARs without FLR as well.
> > As long as io/memory is disabled, we really should allow userspace
> > write anything in BARs. And once we let it do it, most of the problem goes
> > away.
> >
> > > 4. I'll get rid of the vfio config_map in sysfs; it was there for
> > > debugging. 5. I'm still looking to support hotplug/unplug and power
> > > management stuff via generic netlink notifications.

2010-07-28 21:57:23

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

On Wed, 2010-07-28 at 14:14 -0700, Tom Lyon wrote:
> On Tuesday, July 27, 2010 04:53:22 pm Michael S. Tsirkin wrote:
> > On Tue, Jul 27, 2010 at 03:13:14PM -0700, Tom Lyon wrote:
> > > [ Sorry for the long hiatus, I've been wrapped up in other issues.]
> > >
> > > I think the fundamental issue to resolve is to decide on the model which
> > > the VFIO driver presents to its users.
> > >
> > > Fundamentally, VFIO as part of the OS must protect the system from its
> > > users and also protect the users from each other. No disagreement here.
> > >
> > > But another fundamental purpose of an OS to to present an abstract model
> > > of the underlying hardware to its users, so that the users don't have to
> > > deal with the full complexity of the hardware.
> > >
> > > So I think VFIO should present a 'virtual', abstracted PCI device to its
> > > users whereas Michael has argued for a simpler model of presenting the
> > > real PCI device config registers while preventing writes only to the
> > > registers which would clearly disrupt the system.
> >
> > In fact, there is no contradiction. I am all for an abstracted
> > API *and* I think the virtualization concept is a bad way
> > to build this API.
> >
> > The 'virtual' interface you present is very complex and hardware specific:
> > you do not hide literally *anything*. Deciding which functionality
> > userspace needs, and exposing it to userspace as a set of APIs would be
> > abstract. Instead you ask people to go read the PCI spec, the device spec,
> > and bang on PCI registers, little-endian-ness and all, then try to
> > interpret what do the virtual values mean.
>
> Exactly! The PCI bus is far better *specified*, *documented*, and widely
> implemented than a Linux driver could ever hope to be. And there are lots of
> current Linux drivers which bang around in pci config space simply because the
> authors were not aware of some api call buried deep in linux which would do
> the work for them - or - got tired of using OS-specific APIs when porting a
> driver and decided to just ask the hardware.
>
> > Example:
> >
> > How do I find # of MSI-X vectors? Sure, scan the capability list,
> > find the capability, read the value, convert from little endian
> > at each step.
> > A page or two of code, and let's hope I have a ppc to test on.
> > And note no driver has this code - they all use OS routines.
> >
> > So why wouldn't
> > ioctl(dev, VFIO_GET_MSIX_VECTORS, &n);
> > better serve the declared goal of presenting an abstracted PCI device to
> > users?
>
> By and large, the user drivers just know how many because the hardware is
> constant.

Something like GET_MSIX_VECTORS seems like a user library routine to me.
The PCI config space is well specified and if we try to do more than
shortcut trivial operations (like getting the BAR length), we risk
losing functionality. And for my purposes, translating to and from a
made up API to PCI for the guest seems like a pain.

> And inventing 20 or 30 ioctls to do a bunch of random stuff is gross when you
> can instead use normal read and write calls to a well defined structure.

Yep, this sounds like a job for libvfio.

> > > Now, the virtual model *could* look little like the real hardware, and
> > > use bunches of ioctls for everything it needs,
> >
> > Or reads/writes at special offsets, or sysfs attributes.
> >
> > > or it could look a lot like PCI and
> > > use reads and writes of the virtual PCI config registers to trigger its
> > > actions. The latter makes things more amenable to those porting drivers
> > > from other environments.
> >
> > I really doubt this helps at all. Drivers typically use OS-specific
> > APIs. It is very uncommon for them to touch standard registers,
> > which is 100% of what your patch seem to be dealing with.
> >
> > And again, how about a small userspace library that would wrap vfio and
> > add the abstractions for drivers that do need them?
>
> Yes, there will be userspace libraries - I already have a vfio backend for
> libpci.
> >
> > > I realize that to date the VFIO driver has been a bit of a mish-mash
> > > between the ioctl and config based techniques; I intend to clean that
> > > up. And, yes, the abstract model presented by VFIO will need plenty of
> > > documentation.
> >
> > And, it will need to be maintained forever, bugs and all.
> > For example, if you change some register you emulated
> > to fix a bug, to the driver this looks like a hardware change,
> > and it will crash.
>
> The changes will come only to allow for a more-perfect emulation, so I doubt
> that will cause driver problems. No different than discovering and fixing
> bugs in the ioctls needed in you scenario.
>
> >
> > The PCI spec has some weak versioning support, but it
> > is mostly not a problem in that space: a specific driver needs to
> > only deal with a specific device. We have a generic driver so PCI
> > configuration space is a bad interface to use.
>
> PCI has great versioning. Damn near every change made in 16+ years has been
> upwards compatible. BIOS and OS writers don't have trouble with generic PCI,
> why should vfio?
>
> >
> > > Since KVM/qemu already has its own notion of a virtual PCI device which
> > > it presents to the guest OS, we either need to reconcile VFIO and qemu,
> > > or provide a bypass of the VFIO virtual model. This could be direct
> > > access through sysfs, or else an ioctl to VFIO. Since I have no
> > > internals knowledge of qemu, I look to others to choose.
> >
> > Ah, so there will be 2 APIs, one for qemu, one for userspace drivers?
>
> I hope not, but I also hope not to become the qemu expert to find out. Alex
> W. seemed to be making progress in this area.

I hope not too, the qemu driver I wrote does perfectly fine with the
virtualized config space (aside from things like guest initiated FLR or
resent via oem specific mechanism, which we need to figure out how to
handle anway). I end up calling into the qemu PCI emulation for writes
only to keep the qemu infrastructure working when BARs get mapped.
Reads come from VFIO except for the command register due to the mem/io
bits not necessarily tracking what the driver expects for VFs.

The old device assignment driver uses pci sysfs, so I think we can
easily adapt the qemu vfio driver in either direction, virtualized
config space or non. I do prefer the virtualized config space because
it makes my life easier in the qemu vfio driver. I have far, far fewer
traps for reads and writes to specific addresses than the old device
assignment driver. Thanks,

Alex

> > > Other little things:
> > > 1. Yes, I can share some code with sysfs if I can get the right EXPORTs
> > > there. 2. I'll add multiple MSI support, but I wish to point out that
> > > even though the PCI MSI API supports it, none of the architectures do.
> > > 3. FLR needs work. I was foolish enough to assume that FLR wouldn't
> > > reset BARs; now I know better.
> >
> > And as I said separately, drivers might reset BARs without FLR as well.
> > As long as io/memory is disabled, we really should allow userspace
> > write anything in BARs. And once we let it do it, most of the problem goes
> > away.
> >
> > > 4. I'll get rid of the vfio config_map in sysfs; it was there for
> > > debugging. 5. I'm still looking to support hotplug/unplug and power
> > > management stuff via generic netlink notifications.
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2010-07-28 22:03:41

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jul 28, 2010 at 03:57:02PM -0600, Alex Williamson wrote:
>
> Something like GET_MSIX_VECTORS seems like a user library routine to me.
> The PCI config space is well specified and if we try to do more than
> shortcut trivial operations (like getting the BAR length), we risk
> losing functionality. And for my purposes, translating to and from a
> made up API to PCI for the guest seems like a pain.

Won't a userspace library do just as well for you?

2010-07-28 22:13:49

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

On Thu, 2010-07-29 at 00:57 +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 28, 2010 at 03:57:02PM -0600, Alex Williamson wrote:
> >
> > Something like GET_MSIX_VECTORS seems like a user library routine to me.
> > The PCI config space is well specified and if we try to do more than
> > shortcut trivial operations (like getting the BAR length), we risk
> > losing functionality. And for my purposes, translating to and from a
> > made up API to PCI for the guest seems like a pain.
>
> Won't a userspace library do just as well for you?

You mean aside from qemu's reluctance to add dependencies for more
libraries? My only concern is that I want enough virtualized/raw config
space that I'm not always translating PCI config accesses from the guest
into some userspace API. If it makes sense to do this for things like
MSI, since I need someone to figure out what resources can actually be
allocated on the host, then maybe the library makes sense for that.
Then again, if every user needs to do this, let the vfio kernel driver
check what it can get and virtualize the available MSIs in exposed
config space, and my driver would be even happier.

Alex

2010-07-29 14:33:15

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

On Wed, Jul 28, 2010 at 04:13:19PM -0600, Alex Williamson wrote:
> On Thu, 2010-07-29 at 00:57 +0300, Michael S. Tsirkin wrote:
> > On Wed, Jul 28, 2010 at 03:57:02PM -0600, Alex Williamson wrote:
> > >
> > > Something like GET_MSIX_VECTORS seems like a user library routine to me.
> > > The PCI config space is well specified and if we try to do more than
> > > shortcut trivial operations (like getting the BAR length), we risk
> > > losing functionality. And for my purposes, translating to and from a
> > > made up API to PCI for the guest seems like a pain.
> >
> > Won't a userspace library do just as well for you?
>
> You mean aside from qemu's reluctance to add dependencies for more
> libraries?

Main reason is portability. So as long as it's kvm-only stuff, they
likely won't care.

> My only concern is that I want enough virtualized/raw config
> space that I'm not always translating PCI config accesses from the guest
> into some userspace API. If it makes sense to do this for things like
> MSI, since I need someone to figure out what resources can actually be
> allocated on the host, then maybe the library makes sense for that.
> Then again, if every user needs to do this, let the vfio kernel driver
> check what it can get and virtualize the available MSIs in exposed
> config space, and my driver would be even happier.
>
> Alex

It would? guest driver might or might not work if you reduce the number
of vectors for device. So I think you need an API to find out whether
all vectors can be allocated.

And these are examples of why virtualizing is wrong:
1. hides real hardware
2. no way to report errors

--
MST

2010-07-30 20:43:00

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

On Thu, 2010-07-29 at 17:27 +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 28, 2010 at 04:13:19PM -0600, Alex Williamson wrote:
> > On Thu, 2010-07-29 at 00:57 +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jul 28, 2010 at 03:57:02PM -0600, Alex Williamson wrote:
> > > >
> > > > Something like GET_MSIX_VECTORS seems like a user library routine to me.
> > > > The PCI config space is well specified and if we try to do more than
> > > > shortcut trivial operations (like getting the BAR length), we risk
> > > > losing functionality. And for my purposes, translating to and from a
> > > > made up API to PCI for the guest seems like a pain.
> > >
> > > Won't a userspace library do just as well for you?
> >
> > You mean aside from qemu's reluctance to add dependencies for more
> > libraries?
>
> Main reason is portability. So as long as it's kvm-only stuff, they
> likely won't care.

I'd like the qemu vfio driver to be available for both qemu and kvm. It
may have limitation in qemu mode from non-atomic memory writes via tcg,
but I really don't want to have it only live in qemu-kvm like the
current device assignment code.

> > My only concern is that I want enough virtualized/raw config
> > space that I'm not always translating PCI config accesses from the guest
> > into some userspace API. If it makes sense to do this for things like
> > MSI, since I need someone to figure out what resources can actually be
> > allocated on the host, then maybe the library makes sense for that.
> > Then again, if every user needs to do this, let the vfio kernel driver
> > check what it can get and virtualize the available MSIs in exposed
> > config space, and my driver would be even happier.
> >
> > Alex
>
> It would? guest driver might or might not work if you reduce the number
> of vectors for device. So I think you need an API to find out whether
> all vectors can be allocated.
>
> And these are examples of why virtualizing is wrong:
> 1. hides real hardware
> 2. no way to report errors

So I think you're proposing something like:

ioctl(GET_MSIX_VECTORS)
ioctl(VFIO_EVENTFDS_MSIX)

which I don't see any value over:

Get #MSIX supported via expose PCI config space
ioctl(VFIO_EVENTFDS_MSIX)

The GET_MSIX_VECTORS can't do any better job of predicting how many
interrupts can be allocated than looking at what the hardware supports.
Someone else could have exhausted the interrupt vectors by the time you
make the second call. You'll note here that enabling the MSIX
interrupts actually requires an ioctl to associate eventfds with each
interrupt, so yes, we do have a way to report errors if the system can't
support the number of interrupts requested. Thanks,

Alex