The VFIO "driver" is used to allow privileged AND non-privileged processes to
implement user-level device drivers for any well-behaved PCI, PCI-X, or PCIe
device.
Signed-off-by: Tom Lyon <[email protected]>
---
This version now requires an IOMMU domain to be set before any access to
device registers is granted (except that config space may be read). In
addition, the VFIO_DMA_MAP_ANYWHERE is dropped - it used the dma_map_sg API
which does not have sufficient controls around IOMMU usage. The IOMMU domain
is obtained from the 'uiommu' driver which is included in this patch.
Various locking, security, and documentation issues have also been fixed.
Please commit - it or me!
But seriously, who gets to commit this? Avi for KVM? or GregKH for drivers?
Blurb from previous patch version:
This patch is the evolution of code which was first proposed as a patch to
uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
out of the uio framework, and things seem much cleaner. Of course, there is
a lot of functional overlap with uio, but the previous version just seemed
like a giant mode switch in the uio code that did not lead to clarity for
either the new or old code.
[a pony for avi...]
The major new functionality in this version is the ability to deal with
PCI config space accesses (through read & write calls), including table-
driven code to determine what's safe to write and what is not. There is also
some virtualization of the config space to allow drivers to think they're
writing some registers when they're not. I/O space accesses are allowed as well.
Drivers for devices which use MSI-X are now prevented from directly writing
the MSI-X vector area.
All interrupts are now handled using eventfds, which makes things very simple.
The name VFIO refers to the Virtual Function capabilities of SR-IOV devices
but the driver does support many more types of devices. I was none too sure
what driver directory this should live in, so for now I made up my own under
drivers/vfio. As a new driver/new directory, who makes the commit decision?
I currently have user level drivers working for 3 different network adapters
- the Cisco "Palo" enic, the Intel 82599 VF, and the Intel 82576 VF (but the
whole user level framework is a long way from release). This driver could
also clearly replace a number of other drivers written just to give user
access to certain devices - but that will take time.
diff -uprN linux-2.6.34/Documentation/ioctl/ioctl-number.txt vfio-linux-2.6.34/Documentation/ioctl/ioctl-number.txt
--- linux-2.6.34/Documentation/ioctl/ioctl-number.txt 2010-05-16 14:17:36.000000000 -0700
+++ vfio-linux-2.6.34/Documentation/ioctl/ioctl-number.txt 2010-05-28 16:57:58.000000000 -0700
@@ -87,6 +87,7 @@ Code Seq#(hex) Include File Comments
and kernel/power/user.c
'8' all SNP8023 advanced NIC card
<mailto:[email protected]>
+';' 64-6F linux/vfio.h
'@' 00-0F linux/radeonfb.h conflict!
'@' 00-0F drivers/video/aty/aty128fb.c conflict!
'A' 00-1F linux/apm_bios.h conflict!
diff -uprN linux-2.6.34/Documentation/vfio.txt vfio-linux-2.6.34/Documentation/vfio.txt
--- linux-2.6.34/Documentation/vfio.txt 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/Documentation/vfio.txt 2010-06-07 15:05:42.000000000 -0700
@@ -0,0 +1,177 @@
+-------------------------------------------------------------------------------
+The VFIO "driver" is used to allow privileged AND non-privileged processes to
+implement user-level device drivers for any well-behaved PCI, PCI-X, or PCIe
+device.
+
+Why is this interesting? Some applications, especially in the high performance
+computing field, need access to hardware functions with as little overhead as
+possible. Examples are in network adapters (typically non TCP/IP based) and
+in compute accelerators - e.g., array processors, FPGA processors, etc.
+Prior to the VFIO driver, these apps needed either a kernel-level
+driver (with corresponding overheads), or else root permissions to directly
+access the hardware. The VFIO driver allows generic access to the hardware
+from non-privileged apps IF the hardware is "well-behaved" enough for this
+to be safe.
+
+While there have long been ways to implement user-level drivers using specific
+corresponding drivers in the kernel, it was not until the introduction of the
+UIO driver framework, and the uio_pci_generic driver that one could have a
+generic kernel component supporting many types of user level drivers. However,
+even with the uio_pci_generic driver, processes implementing the user level
+drivers had to be trusted - they could do dangerous manipulation of DMA
+addresses and were required to be root to write PCI configuration space
+registers.
+
+Recent hardware technologies - I/O MMUs and PCI I/O Virtualization - provide
+new hardware capabilities which the VFIO solution exploits to allow non-root
+user level drivers. The main role of the IOMMU is to ensure that DMA accesses
+from devices go only to the appropriate memory locations; this allows VFIO to
+ensure that user level drivers do not corrupt inappropriate memory. PCI I/O
+virtualization (SR-IOV) was defined to allow "pass-through" of virtual devices
+to guest virtual machines. VFIO in essence implements pass-through of devices
+to user processes, not virtual machines. SR-IOV devices implement a
+traditional PCI device (the physical function) and a dynamic number of special
+PCI devices (virtual functions) whose feature set is somewhat restricted - in
+order to allow the operating system or virtual machine monitor to ensure the
+safe operation of the system.
+
+Any SR-IOV virtual function meets the VFIO definition of "well-behaved", but
+there are many other non-IOV PCI devices which also meet the definition.
+Elements of this definition are:
+- The size of any memory BARs to be mmap'ed into the user process space must be
+ a multiple of the system page size.
+- If MSI-X interrupts are used, the device driver must not attempt to mmap or
+ write the MSI-X vector area.
+- If the device is a PCI device (not PCI-X or PCIe), it must conform to PCI
+ revision 2.3 to allow its interrupts to be masked in a generic way.
+- The device must not use the PCI configuration space in any non-standard way,
+ i.e., the user level driver will be permitted only to read and write standard
+ fields of the PCI config space, and only if those fields cannot cause harm to
+ the system. In addition, some fields are "virtualized", so that the user
+ driver can read/write them like a kernel driver, but they do not affect the
+ real device.
+- For now, there is no support for user access to the PCIe and PCI-X extended
+ capabilities configuration space.
+
+Even with these restrictions, there are bound to be devices which are unsafe
+for user level use - it is still up to the system admin to decide whether to
+grant access to the device. When the vfio module is loaded, it will have
+access to no devices until the desired PCI devices are "bound" to the driver.
+First, make sure the devices are not bound to another kernel driver. You can
+unload that driver if you wish to unbind all its devices, or else enter the
+driver's sysfs directory, and unbind a specific device:
+ cd /sys/bus/pci/drivers/<drivername>
+	echo 0000:06:02.0 > unbind
+(The 0000:06:02.0 is a fully qualified PCI device name - different for each
+device). Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and
+write the PCI device type of the target device to the new_id file:
+ echo 8086 10ca > new_id
+(8086 10ca are the vendor and device type for the Intel 82576 virtual function
+devices). A /dev/vfio<N> entry will be created for each device bound. The final
+step is to grant users permission by changing the mode and/or owner of the /dev
+entry - "chmod 666 /dev/vfio0".
+
+Reads & Writes:
+
+The user driver will typically use mmap to access the memory BAR(s) of a
+device; the I/O BARs and the PCI config space may be accessed through normal
+read and write system calls. Only one file descriptor is needed for all driver
+functions -- the desired BAR for I/O, memory, or config space is indicated via
+high-order bits of the file offset. For instance, the following implements a
+write to the PCI config space:
+
+ #include <linux/vfio.h>
+	void pci_write_config_word(int pci_fd, __u16 off, __u16 wd)
+ {
+ off_t cfg_off = VFIO_PCI_CONFIG_OFF + off;
+
+ if (pwrite(pci_fd, &wd, 2, cfg_off) != 2)
+			perror("pwrite config_word");
+ }
+
+The routines vfio_pci_space_to_offset and vfio_offset_to_pci_space are provided
+in vfio.h to convert BAR numbers to file offsets and vice-versa.
+
+Interrupts:
+
+Device interrupts are translated by the vfio driver into input events on event
+notification file descriptors created by the eventfd system call. The user
+program must create one or more event descriptors and pass them to the vfio
+driver via ioctls to arrange for the interrupt mapping:
+1.
+ efd = eventfd(0, 0);
+ ioctl(vfio_fd, VFIO_EVENTFD_IRQ, &efd);
+ This provides an eventfd for traditional IRQ interrupts.
+	IRQs will be disabled after each interrupt until the driver
+ re-enables them via the PCI COMMAND register.
+2.
+ efd = eventfd(0, 0);
+ ioctl(vfio_fd, VFIO_EVENTFD_MSI, &efd);
+ This connects MSI interrupts to an eventfd.
+3.
+ int arg[N+1];
+ arg[0] = N;
+ arg[1..N] = eventfd(0, 0);
+ ioctl(vfio_fd, VFIO_EVENTFDS_MSIX, arg);
+ This connects N MSI-X interrupts with N eventfds.
+
+Waiting and checking for interrupts is done by the user program by reads,
+polls, or selects on the related event file descriptors.
+
+DMA:
+
+The VFIO driver uses ioctls to allow the user level driver to get DMA
+addresses which correspond to virtual addresses. In systems with IOMMUs,
+each PCI device will have its own address space for DMA operations, so when
+the user level driver programs the device registers, only addresses known to
+the IOMMU will be valid; any others will be rejected. The IOMMU creates the
+illusion (to the device) that multi-page buffers are physically contiguous,
+so a single DMA operation can safely span multiple user pages.
+
+If the user process desires many DMA buffers, it may be wise to do a mapping
+of a single large buffer, and then allocate the smaller buffers from the
+large one.
+
+The DMA buffers are locked into physical memory for the duration of their
+existence - until VFIO_DMA_UNMAP is called, until the user pages are
+unmapped from the user process, or until the vfio file descriptor is closed.
+The user process must have permission to lock enough pages, as shown by
+"ulimit -l", which in turn relies on settings in the /etc/security/limits.conf
+file.
+
+The vfio_dma_map structure is used as an argument to the ioctls which
+do the DMA mapping. Its vaddr, dmaaddr, and size fields must each be a
+multiple of the page size. Its rdwr field is zero for read-only (outbound), and
+non-zero for read/write buffers.
+
+ struct vfio_dma_map {
+ __u64 vaddr; /* process virtual addr */
+ __u64 dmaaddr; /* desired and/or returned dma address */
+ __u64 size; /* size in bytes */
+ int rdwr; /* bool: 0 for r/o; 1 for r/w */
+ };
+
+The VFIO_DMA_MAP_IOVA ioctl is called with a vfio_dma_map structure whose
+dmaaddr field is already assigned. The system will attempt to map the DMA
+buffer into the IO space at the given dmaaddr. This is expected to be
+useful if KVM or other virtualization facilities use this driver.
+Use of VFIO_DMA_MAP_IOVA requires an explicit assignment of the device
+to an IOMMU domain. A file descriptor for an empty IOMMU domain is
+acquired by opening /dev/uiommu. The device is then attached to the
+domain by issuing a VFIO_DOMAIN_SET ioctl with the domain fd address as
+the argument. The device may be detached from the domain with the
+VFIO_DOMAIN_UNSET ioctl (no argument). It is expected that hypervisors
+may wish to attach many devices to the same domain.
+
+The VFIO_DMA_UNMAP ioctl takes a fully filled vfio_dma_map structure and unmaps
+the buffer and releases the corresponding system resources.
+
+The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA address
+(device dependent). It takes a single unsigned 64 bit integer as an argument.
+This call also has the side effect of enabling PCI bus mastership.
+
+Miscellaneous:
+
+The VFIO_BAR_LEN ioctl provides an easy way to determine the size of a PCI
+device's base address region. It is passed a single integer specifying which
+BAR (0-5, or 6 for the ROM BAR), and returns the length in the same field.
diff -uprN linux-2.6.34/drivers/Kconfig vfio-linux-2.6.34/drivers/Kconfig
--- linux-2.6.34/drivers/Kconfig 2010-05-16 14:17:36.000000000 -0700
+++ vfio-linux-2.6.34/drivers/Kconfig 2010-05-27 17:01:02.000000000 -0700
@@ -111,4 +111,6 @@ source "drivers/xen/Kconfig"
source "drivers/staging/Kconfig"
source "drivers/platform/Kconfig"
+
+source "drivers/vfio/Kconfig"
endmenu
diff -uprN linux-2.6.34/drivers/Makefile vfio-linux-2.6.34/drivers/Makefile
--- linux-2.6.34/drivers/Makefile 2010-05-16 14:17:36.000000000 -0700
+++ vfio-linux-2.6.34/drivers/Makefile 2010-05-27 17:25:33.000000000 -0700
@@ -52,6 +52,7 @@ obj-$(CONFIG_FUSION) += message/
obj-$(CONFIG_FIREWIRE) += firewire/
obj-y += ieee1394/
obj-$(CONFIG_UIO) += uio/
+obj-$(CONFIG_VFIO) += vfio/
obj-y += cdrom/
obj-y += auxdisplay/
obj-$(CONFIG_PCCARD) += pcmcia/
diff -uprN linux-2.6.34/drivers/vfio/Kconfig vfio-linux-2.6.34/drivers/vfio/Kconfig
--- linux-2.6.34/drivers/vfio/Kconfig 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/Kconfig 2010-06-07 15:28:14.000000000 -0700
@@ -0,0 +1,18 @@
+menuconfig VFIO
+ tristate "Non-Priv User Space PCI drivers"
+ depends on UIOMMU && PCI && IOMMU_API
+ help
+ Driver to allow advanced user space drivers for PCI, PCI-X,
+	  and PCIe devices. Requires an IOMMU to allow non-privileged
+ processes to directly control the PCI devices.
+
+ If you don't know what to do here, say N.
+
+menuconfig UIOMMU
+ tristate "User level manipulation of IOMMU"
+ help
+ Device driver to allow user level programs to
+ manipulate IOMMU domains
+
+ If you don't know what to do here, say N.
+
diff -uprN linux-2.6.34/drivers/vfio/Makefile vfio-linux-2.6.34/drivers/vfio/Makefile
--- linux-2.6.34/drivers/vfio/Makefile 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/Makefile 2010-06-03 16:32:32.000000000 -0700
@@ -0,0 +1,6 @@
+obj-$(CONFIG_VFIO) := vfio.o
+obj-$(CONFIG_UIOMMU) += uiommu.o
+
+vfio-y := vfio_main.o vfio_dma.o vfio_intrs.o \
+ vfio_pci_config.o vfio_rdwr.o vfio_sysfs.o
+
diff -uprN linux-2.6.34/drivers/vfio/uiommu.c vfio-linux-2.6.34/drivers/vfio/uiommu.c
--- linux-2.6.34/drivers/vfio/uiommu.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/uiommu.c 2010-06-07 15:28:24.000000000 -0700
@@ -0,0 +1,126 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+*/
+
+/*
+ * uiommu driver - issue fd handles for IOMMU domains
+ * so they may be passed to vfio (and others?)
+ */
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/iommu.h>
+#include <linux/uiommu.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Tom Lyon <[email protected]>");
+MODULE_DESCRIPTION("User IOMMU driver");
+
+static struct uiommu_domain *uiommu_domain_alloc(void)
+{
+ struct iommu_domain *domain;
+ struct uiommu_domain *udomain;
+
+ domain = iommu_domain_alloc();
+ if (domain == NULL)
+ return NULL;
+ udomain = kzalloc(sizeof *udomain, GFP_KERNEL);
+ if (udomain == NULL) {
+ iommu_domain_free(domain);
+ return NULL;
+ }
+ udomain->domain = domain;
+ atomic_inc(&udomain->refcnt);
+ return udomain;
+}
+
+static int uiommu_open(struct inode *inode, struct file *file)
+{
+ struct uiommu_domain *udomain;
+
+ udomain = uiommu_domain_alloc();
+ if (udomain == NULL)
+ return -ENOMEM;
+ file->private_data = udomain;
+ return 0;
+}
+
+static int uiommu_release(struct inode *inode, struct file *file)
+{
+ struct uiommu_domain *udomain;
+
+ udomain = file->private_data;
+ uiommu_put(udomain);
+ return 0;
+}
+
+static const struct file_operations uiommu_fops = {
+ .owner = THIS_MODULE,
+ .open = uiommu_open,
+ .release = uiommu_release,
+};
+
+static struct miscdevice uiommu_dev = {
+ .name = "uiommu",
+ .minor = MISC_DYNAMIC_MINOR,
+ .fops = &uiommu_fops,
+};
+
+struct uiommu_domain *uiommu_fdget(int fd)
+{
+ struct file *file;
+ struct uiommu_domain *udomain;
+
+ file = fget(fd);
+ if (!file)
+ return ERR_PTR(-EBADF);
+ if (file->f_op != &uiommu_fops) {
+ fput(file);
+ return ERR_PTR(-EINVAL);
+ }
+ udomain = file->private_data;
+ atomic_inc(&udomain->refcnt);
+ return udomain;
+}
+EXPORT_SYMBOL(uiommu_fdget);
+
+void uiommu_put(struct uiommu_domain *udomain)
+{
+ if (atomic_dec_and_test(&udomain->refcnt)) {
+ iommu_domain_free(udomain->domain);
+ kfree(udomain);
+ }
+}
+EXPORT_SYMBOL(uiommu_put);
+
+static int __init uiommu_init(void)
+{
+ return misc_register(&uiommu_dev);
+}
+
+static void __exit uiommu_exit(void)
+{
+ misc_deregister(&uiommu_dev);
+}
+
+module_init(uiommu_init);
+module_exit(uiommu_exit);
diff -uprN linux-2.6.34/drivers/vfio/vfio_dma.c vfio-linux-2.6.34/drivers/vfio/vfio_dma.c
--- linux-2.6.34/drivers/vfio/vfio_dma.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_dma.c 2010-06-07 15:28:30.000000000 -0700
@@ -0,0 +1,324 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <[email protected]>
+ * Copyright(C) 2005, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2006, Hans J. Koch <[email protected]>
+ * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <[email protected]>
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/pci.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/iommu.h>
+#include <linux/uiommu.h>
+#include <linux/sched.h>
+#include <linux/vfio.h>
+
+/* Unmap DMA region */
+/* dgate must be held */
+static void vfio_dma_unmap(struct vfio_listener *listener,
+ struct dma_map_page *mlp)
+{
+ int i;
+ struct vfio_dev *vdev = listener->vdev;
+
+ list_del(&mlp->list);
+ for (i = 0; i < mlp->npage; i++)
+ (void) uiommu_unmap_range(vdev->udomain,
+ mlp->daddr + i*PAGE_SIZE, PAGE_SIZE);
+ for (i = 0; i < mlp->npage; i++) {
+ if (mlp->rdwr)
+ SetPageDirty(mlp->pages[i]);
+ put_page(mlp->pages[i]);
+ }
+ listener->mm->locked_vm -= mlp->npage;
+ vdev->locked_pages -= mlp->npage;
+ kfree(mlp->pages);
+ kfree(mlp);
+}
+
+/* Unmap ALL DMA regions */
+void vfio_dma_unmapall(struct vfio_listener *listener)
+{
+ struct list_head *pos, *pos2;
+ struct dma_map_page *mlp;
+
+ mutex_lock(&listener->vdev->dgate);
+ list_for_each_safe(pos, pos2, &listener->dm_list) {
+ mlp = list_entry(pos, struct dma_map_page, list);
+ vfio_dma_unmap(listener, mlp);
+ }
+ mutex_unlock(&listener->vdev->dgate);
+}
+
+int vfio_dma_unmap_dm(struct vfio_listener *listener, struct vfio_dma_map *dmp)
+{
+ unsigned long start, npage;
+ struct dma_map_page *mlp;
+ struct list_head *pos, *pos2;
+ int ret;
+
+	start = dmp->vaddr & PAGE_MASK;
+ npage = dmp->size >> PAGE_SHIFT;
+
+ ret = -ENXIO;
+ mutex_lock(&listener->vdev->dgate);
+ list_for_each_safe(pos, pos2, &listener->dm_list) {
+ mlp = list_entry(pos, struct dma_map_page, list);
+ if (dmp->vaddr != mlp->vaddr || mlp->npage != npage)
+ continue;
+ ret = 0;
+ vfio_dma_unmap(listener, mlp);
+ break;
+ }
+ mutex_unlock(&listener->vdev->dgate);
+ return ret;
+}
+
+#ifdef CONFIG_MMU_NOTIFIER
+/* Handle MMU notifications - user process freed or realloced memory
+ * which may be in use in a DMA region. Clean up region if so.
+ */
+static void vfio_dma_handle_mmu_notify(struct mmu_notifier *mn,
+ unsigned long start, unsigned long end)
+{
+ struct vfio_listener *listener;
+ unsigned long myend;
+ struct list_head *pos, *pos2;
+ struct dma_map_page *mlp;
+
+ listener = container_of(mn, struct vfio_listener, mmu_notifier);
+ mutex_lock(&listener->vdev->dgate);
+ list_for_each_safe(pos, pos2, &listener->dm_list) {
+ mlp = list_entry(pos, struct dma_map_page, list);
+ if (mlp->vaddr >= end)
+ continue;
+ /*
+ * Ranges overlap if they're not disjoint; and they're
+ * disjoint if the end of one is before the start of
+ * the other one.
+ */
+ myend = mlp->vaddr + (mlp->npage << PAGE_SHIFT) - 1;
+ if (!(myend <= start || end <= mlp->vaddr)) {
+ printk(KERN_WARNING
+ "%s: demap start %lx end %lx va %lx pa %lx\n",
+ __func__, start, end,
+ mlp->vaddr, (long)mlp->daddr);
+ vfio_dma_unmap(listener, mlp);
+ }
+ }
+ mutex_unlock(&listener->vdev->dgate);
+}
+
+static void vfio_dma_inval_page(struct mmu_notifier *mn,
+ struct mm_struct *mm, unsigned long addr)
+{
+ vfio_dma_handle_mmu_notify(mn, addr, addr + PAGE_SIZE);
+}
+
+static void vfio_dma_inval_range_start(struct mmu_notifier *mn,
+ struct mm_struct *mm, unsigned long start, unsigned long end)
+{
+ vfio_dma_handle_mmu_notify(mn, start, end);
+}
+
+static const struct mmu_notifier_ops vfio_dma_mmu_notifier_ops = {
+ .invalidate_page = vfio_dma_inval_page,
+ .invalidate_range_start = vfio_dma_inval_range_start,
+};
+#endif /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Map usr buffer at specific IO virtual address
+ */
+static struct dma_map_page *vfio_dma_map_iova(
+ struct vfio_listener *listener,
+ unsigned long start_iova,
+ struct page **pages,
+ int npage,
+ int rdwr)
+{
+ struct vfio_dev *vdev = listener->vdev;
+ int ret;
+ int i;
+ phys_addr_t hpa;
+ struct dma_map_page *mlp;
+ unsigned long iova = start_iova;
+
+ if (vdev->udomain == NULL)
+ return ERR_PTR(-EINVAL);
+
+ for (i = 0; i < npage; i++) {
+ if (uiommu_iova_to_phys(vdev->udomain, iova + i*PAGE_SIZE))
+ return ERR_PTR(-EBUSY);
+ }
+
+ mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
+ if (mlp == NULL)
+ return ERR_PTR(-ENOMEM);
+ rdwr = rdwr ? IOMMU_READ|IOMMU_WRITE : IOMMU_READ;
+ if (vdev->cachec)
+ rdwr |= IOMMU_CACHE;
+ for (i = 0; i < npage; i++) {
+ hpa = page_to_phys(pages[i]);
+ ret = uiommu_map_range(vdev->udomain, iova,
+ hpa, PAGE_SIZE, rdwr);
+ if (ret) {
+			while (--i >= 0) {
+ iova -= PAGE_SIZE;
+ (void) uiommu_unmap_range(vdev->udomain,
+ iova, PAGE_SIZE);
+ }
+ kfree(mlp);
+ return ERR_PTR(ret);
+ }
+ iova += PAGE_SIZE;
+ }
+
+ mlp->pages = pages;
+ mlp->daddr = start_iova;
+ mlp->npage = npage;
+ return mlp;
+}
+
+int vfio_dma_map_common(struct vfio_listener *listener,
+ unsigned int cmd, struct vfio_dma_map *dmp)
+{
+ int locked, lock_limit;
+ struct page **pages;
+ int npage;
+ struct dma_map_page *mlp;
+ int rdwr = (dmp->flags & VFIO_FLAG_WRITE) ? 1 : 0;
+ int ret = 0;
+
+ if (dmp->vaddr & (PAGE_SIZE-1))
+ return -EINVAL;
+ if (dmp->size & (PAGE_SIZE-1))
+ return -EINVAL;
+ if (dmp->size <= 0)
+ return -EINVAL;
+ npage = dmp->size >> PAGE_SHIFT;
+ if (npage <= 0)
+ return -EINVAL;
+
+ mutex_lock(&listener->vdev->dgate);
+
+ /* account for locked pages */
+ locked = npage + current->mm->locked_vm;
+ lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
+ >> PAGE_SHIFT;
+ if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+ printk(KERN_WARNING "%s: RLIMIT_MEMLOCK exceeded\n",
+ __func__);
+ ret = -ENOMEM;
+ goto out_lock;
+ }
+ /* only 1 address space per fd */
+ if (current->mm != listener->mm) {
+ if (listener->mm != NULL) {
+ ret = -EINVAL;
+ goto out_lock;
+ }
+ listener->mm = current->mm;
+#ifdef CONFIG_MMU_NOTIFIER
+ listener->mmu_notifier.ops = &vfio_dma_mmu_notifier_ops;
+ ret = mmu_notifier_register(&listener->mmu_notifier,
+ listener->mm);
+ if (ret)
+ printk(KERN_ERR "%s: mmu_notifier_register failed %d\n",
+ __func__, ret);
+ ret = 0;
+#endif
+ }
+
+ pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
+ if (pages == NULL) {
+		ret = -ENOMEM;
+ goto out_lock;
+ }
+ ret = get_user_pages_fast(dmp->vaddr, npage, rdwr, pages);
+ if (ret != npage) {
+ printk(KERN_ERR "%s: get_user_pages_fast returns %d, not %d\n",
+ __func__, ret, npage);
+ kfree(pages);
+ ret = -EFAULT;
+ goto out_lock;
+ }
+ ret = 0;
+
+ mlp = vfio_dma_map_iova(listener, dmp->dmaaddr,
+ pages, npage, rdwr);
+ if (IS_ERR(mlp)) {
+ ret = PTR_ERR(mlp);
+ kfree(pages);
+ goto out_lock;
+ }
+ mlp->vaddr = dmp->vaddr;
+ mlp->rdwr = rdwr;
+ dmp->dmaaddr = mlp->daddr;
+ list_add(&mlp->list, &listener->dm_list);
+
+ current->mm->locked_vm += npage;
+ listener->vdev->locked_pages += npage;
+out_lock:
+ mutex_unlock(&listener->vdev->dgate);
+ return ret;
+}
+
+void vfio_domain_unset(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+
+ if (vdev->udomain == NULL)
+ return;
+ uiommu_detach_device(vdev->udomain, &pdev->dev);
+ uiommu_put(vdev->udomain);
+ vdev->udomain = NULL;
+}
+
+int vfio_domain_set(struct vfio_dev *vdev, int fd)
+{
+ struct uiommu_domain *udomain;
+ struct pci_dev *pdev = vdev->pdev;
+ int ret;
+
+ if (vdev->udomain)
+ return -EBUSY;
+ udomain = uiommu_fdget(fd);
+ if (IS_ERR(udomain))
+ return PTR_ERR(udomain);
+ vfio_domain_unset(vdev);
+ ret = uiommu_attach_device(udomain, &pdev->dev);
+ if (ret) {
+ printk(KERN_ERR "%s: attach_device failed %d\n",
+ __func__, ret);
+ uiommu_put(udomain);
+ return ret;
+ }
+ vdev->cachec = iommu_domain_has_cap(udomain->domain,
+ IOMMU_CAP_CACHE_COHERENCY);
+ vdev->udomain = udomain;
+ return 0;
+}
diff -uprN linux-2.6.34/drivers/vfio/vfio_intrs.c vfio-linux-2.6.34/drivers/vfio/vfio_intrs.c
--- linux-2.6.34/drivers/vfio/vfio_intrs.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_intrs.c 2010-06-01 15:08:18.000000000 -0700
@@ -0,0 +1,191 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <[email protected]>
+ * Copyright(C) 2005, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2006, Hans J. Koch <[email protected]>
+ * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <[email protected]>
+ */
+
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/eventfd.h>
+#include <linux/pci.h>
+#include <linux/mmu_notifier.h>
+
+#include <linux/vfio.h>
+
+
+/*
+ * vfio_interrupt - IRQ hardware interrupt handler
+ */
+irqreturn_t vfio_interrupt(int irq, void *dev_id)
+{
+ struct vfio_dev *vdev = (struct vfio_dev *)dev_id;
+ struct pci_dev *pdev = vdev->pdev;
+ irqreturn_t ret = IRQ_NONE;
+ u32 cmd_status_dword;
+ u16 origcmd, newcmd, status;
+
+ spin_lock_irq(&vdev->irqlock);
+ pci_block_user_cfg_access(pdev);
+
+ /* Read both command and status registers in a single 32-bit operation.
+ * Note: we could cache the value for command and move the status read
+ * out of the lock if there was a way to get notified of user changes
+ * to command register through sysfs. Should be good for shared irqs. */
+ pci_read_config_dword(pdev, PCI_COMMAND, &cmd_status_dword);
+ origcmd = cmd_status_dword;
+ status = cmd_status_dword >> 16;
+
+ /* Check interrupt status register to see whether our device
+ * triggered the interrupt. */
+ if (!(status & PCI_STATUS_INTERRUPT))
+ goto done;
+
+ /* We triggered the interrupt, disable it. */
+ newcmd = origcmd | PCI_COMMAND_INTX_DISABLE;
+ if (newcmd != origcmd)
+ pci_write_config_word(pdev, PCI_COMMAND, newcmd);
+
+ ret = IRQ_HANDLED;
+done:
+ pci_unblock_user_cfg_access(pdev);
+ spin_unlock_irq(&vdev->irqlock);
+ if (ret != IRQ_HANDLED)
+ return ret;
+ if (vdev->ev_irq)
+ eventfd_signal(vdev->ev_irq, 1);
+ return ret;
+}
+
+/*
+ * MSI and MSI-X Interrupt handler.
+ * Just signal an event
+ */
+static irqreturn_t msihandler(int irq, void *arg)
+{
+ struct eventfd_ctx *ctx = arg;
+
+ eventfd_signal(ctx, 1);
+ return IRQ_HANDLED;
+}
+
+void vfio_disable_msi(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+
+	if (vdev->ev_msi) {
+		free_irq(pdev->irq, vdev->ev_msi);
+		eventfd_ctx_put(vdev->ev_msi);
+		vdev->ev_msi = NULL;
+	}
+ pci_disable_msi(pdev);
+}
+
+int vfio_enable_msi(struct vfio_dev *vdev, int fd)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ struct eventfd_ctx *ctx;
+ int ret;
+
+ ctx = eventfd_ctx_fdget(fd);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+	vdev->ev_msi = ctx;
+	ret = pci_enable_msi(pdev);
+	if (!ret)
+		ret = request_irq(pdev->irq, msihandler, 0, vdev->name, ctx);
+	if (ret) {
+		eventfd_ctx_put(ctx);
+		pci_disable_msi(pdev);
+		vdev->ev_msi = NULL;
+	}
+ return ret;
+}
+
+void vfio_disable_msix(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int i;
+
+ if (vdev->ev_msix && vdev->msix) {
+ for (i = 0; i < vdev->nvec; i++) {
+ free_irq(vdev->msix[i].vector, vdev->ev_msix[i]);
+ if (vdev->ev_msix[i])
+ eventfd_ctx_put(vdev->ev_msix[i]);
+ }
+ }
+ kfree(vdev->ev_msix);
+ vdev->ev_msix = NULL;
+ kfree(vdev->msix);
+ vdev->msix = NULL;
+ vdev->nvec = 0;
+ pci_disable_msix(pdev);
+}
+
+int vfio_enable_msix(struct vfio_dev *vdev, int nvec, void __user *uarg)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct eventfd_ctx *ctx;
+	int ret = 0;
+	int i;
+	int fd;
+
+	vdev->msix = kcalloc(nvec, sizeof(struct msix_entry), GFP_KERNEL);
+	if (vdev->msix == NULL)
+		return -ENOMEM;
+	vdev->ev_msix = kcalloc(nvec, sizeof(struct eventfd_ctx *),
+				GFP_KERNEL);
+	if (vdev->ev_msix == NULL) {
+		kfree(vdev->msix);
+		vdev->msix = NULL;
+		return -ENOMEM;
+	}
+	for (i = 0; i < nvec; i++) {
+		if (copy_from_user(&fd, uarg, sizeof fd)) {
+			ret = -EFAULT;
+			break;
+		}
+		uarg += sizeof fd;
+		ctx = eventfd_ctx_fdget(fd);
+		if (IS_ERR(ctx)) {
+			ret = PTR_ERR(ctx);
+			break;
+		}
+		vdev->msix[i].entry = i;
+		vdev->ev_msix[i] = ctx;
+	}
+	if (!ret)
+		ret = pci_enable_msix(pdev, vdev->msix, nvec);
+	if (ret) {
+		/* drop any eventfd references taken before the failure */
+		while (--i >= 0)
+			eventfd_ctx_put(vdev->ev_msix[i]);
+		kfree(vdev->ev_msix);
+		vdev->ev_msix = NULL;
+		kfree(vdev->msix);
+		vdev->msix = NULL;
+		/* pci_enable_msix returns > 0 if fewer vectors available */
+		return ret > 0 ? -EINVAL : ret;
+	}
+	vdev->nvec = 0;
+	for (i = 0; i < nvec; i++) {
+		ret = request_irq(vdev->msix[i].vector, msihandler, 0,
+			vdev->name, vdev->ev_msix[i]);
+		if (ret) {
+			int j;
+
+			/* put refs for vectors we never got to request */
+			for (j = i; j < nvec; j++)
+				eventfd_ctx_put(vdev->ev_msix[j]);
+			vfio_disable_msix(vdev);
+			return ret;
+		}
+		vdev->nvec = i+1;
+	}
+	return 0;
+}
diff -uprN linux-2.6.34/drivers/vfio/vfio_main.c vfio-linux-2.6.34/drivers/vfio/vfio_main.c
--- linux-2.6.34/drivers/vfio/vfio_main.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_main.c 2010-06-07 12:39:17.000000000 -0700
@@ -0,0 +1,624 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <[email protected]>
+ * Copyright(C) 2005, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2006, Hans J. Koch <[email protected]>
+ * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <[email protected]>
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/mm.h>
+#include <linux/idr.h>
+#include <linux/string.h>
+#include <linux/interrupt.h>
+#include <linux/fs.h>
+#include <linux/eventfd.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
+#include <linux/mmu_notifier.h>
+#include <linux/uaccess.h>
+
+#include <linux/vfio.h>
+
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR "Tom Lyon <[email protected]>"
+#define DRIVER_DESC "VFIO - User Level PCI meta-driver"
+
+static int vfio_major = -1;
+DEFINE_IDR(vfio_idr);
+/* Protect idr accesses */
+DEFINE_MUTEX(vfio_minor_lock);
+
+/*
+ * Does [a1,b1) overlap [a2,b2) ?
+ */
+static inline int overlap(int a1, int b1, int a2, int b2)
+{
+ /*
+ * Ranges overlap if they're not disjoint; and they're
+ * disjoint if the end of one is before the start of
+ * the other one.
+ */
+ return !(b2 <= a1 || b1 <= a2);
+}
+
+static int vfio_open(struct inode *inode, struct file *filep)
+{
+ struct vfio_dev *vdev;
+ struct vfio_listener *listener;
+ int ret = 0;
+
+ mutex_lock(&vfio_minor_lock);
+ vdev = idr_find(&vfio_idr, iminor(inode));
+ mutex_unlock(&vfio_minor_lock);
+ if (!vdev) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ listener = kzalloc(sizeof(*listener), GFP_KERNEL);
+ if (!listener) {
+ ret = -ENOMEM;
+ goto err_alloc_listener;
+ }
+
+ listener->vdev = vdev;
+ INIT_LIST_HEAD(&listener->dm_list);
+ filep->private_data = listener;
+
+ mutex_lock(&vdev->lgate);
+ if (vdev->listeners == 0) { /* first open */
+ /* reset to known state if we can */
+ (void) pci_reset_function(vdev->pdev);
+ }
+ vdev->listeners++;
+ mutex_unlock(&vdev->lgate);
+ return 0;
+
+err_alloc_listener:
+out:
+ return ret;
+}
+
+static int vfio_release(struct inode *inode, struct file *filep)
+{
+ int ret = 0;
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+
+ vfio_dma_unmapall(listener);
+ if (listener->mm) {
+#ifdef CONFIG_MMU_NOTIFIER
+ mmu_notifier_unregister(&listener->mmu_notifier, listener->mm);
+#endif
+ listener->mm = NULL;
+ }
+
+ mutex_lock(&vdev->lgate);
+ if (--vdev->listeners <= 0) {
+ /* we don't need to hold igate here since there are
+ * no more listeners doing ioctls
+ */
+ if (vdev->ev_msix)
+ vfio_disable_msix(vdev);
+ if (vdev->ev_msi)
+ vfio_disable_msi(vdev);
+		if (vdev->ev_irq) {
+			eventfd_ctx_put(vdev->ev_irq);
+			vdev->ev_irq = NULL;
+		}
+ vfio_domain_unset(vdev);
+ /* reset to known state if we can */
+ (void) pci_reset_function(vdev->pdev);
+ }
+ mutex_unlock(&vdev->lgate);
+
+ kfree(listener);
+ return ret;
+}
+
+static ssize_t vfio_read(struct file *filep, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+ struct pci_dev *pdev = vdev->pdev;
+ int pci_space;
+
+ /* no reads or writes until IOMMU domain set */
+ if (vdev->udomain == NULL)
+ return -EINVAL;
+ pci_space = vfio_offset_to_pci_space(*ppos);
+ if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
+ return vfio_config_readwrite(0, vdev, buf, count, ppos);
+ if (pci_space > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
+ return vfio_io_readwrite(0, vdev, buf, count, ppos);
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM)
+ return vfio_mem_readwrite(0, vdev, buf, count, ppos);
+ if (pci_space == PCI_ROM_RESOURCE)
+ return vfio_mem_readwrite(0, vdev, buf, count, ppos);
+ return -EINVAL;
+}
+
+static int vfio_msix_check(struct vfio_dev *vdev, u64 start, u32 len)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u16 pos;
+ u32 table_offset;
+ u16 table_size;
+ u8 bir;
+ u32 lo, hi, startp, endp;
+
+ pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
+ if (!pos)
+ return 0;
+
+ pci_read_config_word(pdev, pos + PCI_MSIX_FLAGS, &table_size);
+ table_size = (table_size & PCI_MSIX_FLAGS_QSIZE) + 1;
+ pci_read_config_dword(pdev, pos + 4, &table_offset);
+ bir = table_offset & PCI_MSIX_FLAGS_BIRMASK;
+ lo = table_offset >> PAGE_SHIFT;
+ hi = (table_offset + PCI_MSIX_ENTRY_SIZE * table_size + PAGE_SIZE - 1)
+ >> PAGE_SHIFT;
+ startp = start >> PAGE_SHIFT;
+ endp = (start + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (bir == vfio_offset_to_pci_space(start) &&
+ overlap(lo, hi, startp, endp)) {
+ printk(KERN_WARNING "%s: cannot write msi-x vectors\n",
+ __func__);
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static ssize_t vfio_write(struct file *filep, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+ struct pci_dev *pdev = vdev->pdev;
+ int pci_space;
+ int ret;
+
+ /* no reads or writes until IOMMU domain set */
+ if (vdev->udomain == NULL)
+ return -EINVAL;
+ pci_space = vfio_offset_to_pci_space(*ppos);
+ if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
+ return vfio_config_readwrite(1, vdev,
+ (char __user *)buf, count, ppos);
+ if (pci_space > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
+ return vfio_io_readwrite(1, vdev,
+ (char __user *)buf, count, ppos);
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) {
+ /* don't allow writes to msi-x vectors */
+ ret = vfio_msix_check(vdev, *ppos, count);
+ if (ret)
+ return ret;
+ return vfio_mem_readwrite(1, vdev,
+ (char __user *)buf, count, ppos);
+ }
+ return -EINVAL;
+}
+
+static int vfio_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+ struct pci_dev *pdev = vdev->pdev;
+ unsigned long requested, actual;
+ int pci_space;
+ u64 start;
+ u32 len;
+ unsigned long phys;
+ int ret;
+
+ /* no reads or writes until IOMMU domain set */
+ if (vdev->udomain == NULL)
+ return -EINVAL;
+
+ if (vma->vm_end < vma->vm_start)
+ return -EINVAL;
+ if ((vma->vm_flags & VM_SHARED) == 0)
+ return -EINVAL;
+
+
+ pci_space = vfio_offset_to_pci_space((u64)vma->vm_pgoff << PAGE_SHIFT);
+ if (pci_space > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ switch (pci_space) {
+ case PCI_ROM_RESOURCE:
+ if (vma->vm_flags & VM_WRITE)
+ return -EINVAL;
+ if (pci_resource_flags(pdev, PCI_ROM_RESOURCE) == 0)
+ return -EINVAL;
+ actual = pci_resource_len(pdev, PCI_ROM_RESOURCE) >> PAGE_SHIFT;
+ break;
+ default:
+ if ((pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) == 0)
+ return -EINVAL;
+ actual = pci_resource_len(pdev, pci_space) >> PAGE_SHIFT;
+ break;
+ }
+
+ requested = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ if (requested > actual || actual == 0)
+ return -EINVAL;
+
+	/*
+	 * Can't allow non-privileged users to mmap the MSI-X vector area;
+	 * otherwise they could write anywhere in physical memory.
+	 */
+	start = (u64)vma->vm_pgoff << PAGE_SHIFT;
+	len = vma->vm_end - vma->vm_start;
+ if (vma->vm_flags & VM_WRITE) {
+ ret = vfio_msix_check(vdev, start, len);
+ if (ret)
+ return ret;
+ }
+
+ vma->vm_private_data = vdev;
+ vma->vm_flags |= VM_IO | VM_RESERVED;
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ phys = pci_resource_start(pdev, pci_space) >> PAGE_SHIFT;
+
+ return remap_pfn_range(vma, vma->vm_start, phys,
+ vma->vm_end - vma->vm_start,
+ vma->vm_page_prot);
+}
+
+static long vfio_unl_ioctl(struct file *filep,
+ unsigned int cmd,
+ unsigned long arg)
+{
+	struct vfio_listener *listener = filep->private_data;
+	struct vfio_dev *vdev = listener->vdev;
+	void __user *uarg = (void __user *)arg;
+	struct pci_dev *pdev;
+	struct vfio_dma_map dm;
+	int ret = 0;
+	u64 mask;
+	int fd, nfd;
+	int bar;
+
+	if (vdev == NULL)
+		return -EINVAL;
+	pdev = vdev->pdev;
+
+ switch (cmd) {
+
+ case VFIO_DMA_MAP_IOVA:
+ if (copy_from_user(&dm, uarg, sizeof dm))
+ return -EFAULT;
+ ret = vfio_dma_map_common(listener, cmd, &dm);
+ if (!ret && copy_to_user(uarg, &dm, sizeof dm))
+ ret = -EFAULT;
+ break;
+
+ case VFIO_DMA_UNMAP:
+ if (copy_from_user(&dm, uarg, sizeof dm))
+ return -EFAULT;
+ ret = vfio_dma_unmap_dm(listener, &dm);
+ break;
+
+ case VFIO_DMA_MASK: /* set master mode and DMA mask */
+ if (copy_from_user(&mask, uarg, sizeof mask))
+ return -EFAULT;
+ pci_set_master(pdev);
+ ret = pci_set_dma_mask(pdev, mask);
+ break;
+
+ case VFIO_EVENTFD_IRQ:
+ if (copy_from_user(&fd, uarg, sizeof fd))
+ return -EFAULT;
+ mutex_lock(&vdev->igate);
+		if (vdev->ev_irq) {
+			eventfd_ctx_put(vdev->ev_irq);
+			vdev->ev_irq = NULL;
+		}
+		if (fd >= 0) {
+			vdev->ev_irq = eventfd_ctx_fdget(fd);
+			if (IS_ERR(vdev->ev_irq)) {
+				ret = PTR_ERR(vdev->ev_irq);
+				vdev->ev_irq = NULL;
+			}
+		}
+ mutex_unlock(&vdev->igate);
+ break;
+
+ case VFIO_EVENTFD_MSI:
+ if (copy_from_user(&fd, uarg, sizeof fd))
+ return -EFAULT;
+ mutex_lock(&vdev->igate);
+ if (fd >= 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
+ ret = vfio_enable_msi(vdev, fd);
+ else if (fd < 0 && vdev->ev_msi)
+ vfio_disable_msi(vdev);
+ else
+ ret = -EINVAL;
+ mutex_unlock(&vdev->igate);
+ break;
+
+ case VFIO_EVENTFDS_MSIX:
+ if (copy_from_user(&nfd, uarg, sizeof nfd))
+ return -EFAULT;
+ uarg += sizeof nfd;
+ mutex_lock(&vdev->igate);
+ if (nfd > 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
+ ret = vfio_enable_msix(vdev, nfd, uarg);
+ else if (nfd == 0 && vdev->ev_msix)
+ vfio_disable_msix(vdev);
+ else
+ ret = -EINVAL;
+ mutex_unlock(&vdev->igate);
+ break;
+
+ case VFIO_BAR_LEN:
+ if (copy_from_user(&bar, uarg, sizeof bar))
+ return -EFAULT;
+ if (bar < 0 || bar > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ bar = pci_resource_len(pdev, bar);
+ if (copy_to_user(uarg, &bar, sizeof bar))
+ return -EFAULT;
+ break;
+
+ case VFIO_DOMAIN_SET:
+ if (copy_from_user(&fd, uarg, sizeof fd))
+ return -EFAULT;
+ ret = vfio_domain_set(vdev, fd);
+ break;
+
+ case VFIO_DOMAIN_UNSET:
+ vfio_domain_unset(vdev);
+ ret = 0;
+ break;
+
+ default:
+ return -EINVAL;
+ }
+ return ret;
+}
+
+static const struct file_operations vfio_fops = {
+ .owner = THIS_MODULE,
+ .open = vfio_open,
+ .release = vfio_release,
+ .read = vfio_read,
+ .write = vfio_write,
+ .unlocked_ioctl = vfio_unl_ioctl,
+ .mmap = vfio_mmap,
+};
+
+static int vfio_get_devnum(struct vfio_dev *vdev)
+{
+ int retval = -ENOMEM;
+ int id;
+
+ mutex_lock(&vfio_minor_lock);
+ if (idr_pre_get(&vfio_idr, GFP_KERNEL) == 0)
+ goto exit;
+
+ retval = idr_get_new(&vfio_idr, vdev, &id);
+ if (retval < 0) {
+ if (retval == -EAGAIN)
+ retval = -ENOMEM;
+ goto exit;
+ }
+	if (id > MINORMASK) {
+		idr_remove(&vfio_idr, id);
+		retval = -ENOMEM;
+		goto exit;
+	}
+ if (vfio_major < 0) {
+ retval = register_chrdev(0, "vfio", &vfio_fops);
+ if (retval < 0)
+ goto exit;
+ vfio_major = retval;
+ }
+
+ retval = MKDEV(vfio_major, id);
+exit:
+ mutex_unlock(&vfio_minor_lock);
+ return retval;
+}
+
+static void vfio_free_minor(struct vfio_dev *vdev)
+{
+ mutex_lock(&vfio_minor_lock);
+ idr_remove(&vfio_idr, MINOR(vdev->devnum));
+ mutex_unlock(&vfio_minor_lock);
+}
+
+/*
+ * Verify that the device supports Interrupt Disable bit in command register,
+ * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
+ * in PCI 2.2. (from uio_pci_generic)
+ */
+static int verify_pci_2_3(struct pci_dev *pdev)
+{
+ u16 orig, new;
+ int err = 0;
+ u8 pin;
+
+ pci_block_user_cfg_access(pdev);
+
+ pci_read_config_byte(pdev, PCI_INTERRUPT_PIN, &pin);
+ if (pin == 0) /* irqs not needed */
+ goto out;
+
+ pci_read_config_word(pdev, PCI_COMMAND, &orig);
+ pci_write_config_word(pdev, PCI_COMMAND,
+ orig ^ PCI_COMMAND_INTX_DISABLE);
+ pci_read_config_word(pdev, PCI_COMMAND, &new);
+ /* There's no way to protect against
+ * hardware bugs or detect them reliably, but as long as we know
+ * what the value should be, let's go ahead and check it. */
+ if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
+ err = -EBUSY;
+ dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
+ "driver or HW bug?\n", orig, new);
+ goto out;
+ }
+ if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
+ dev_warn(&pdev->dev, "Device does not support "
+ "disabling interrupts: unable to bind.\n");
+ err = -ENODEV;
+ goto out;
+ }
+ /* Now restore the original value. */
+ pci_write_config_word(pdev, PCI_COMMAND, orig);
+out:
+ pci_unblock_user_cfg_access(pdev);
+ return err;
+}
+
+static int vfio_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+ struct vfio_dev *vdev;
+ int err;
+
+ if (!iommu_found())
+ return -EINVAL;
+
+ err = pci_enable_device(pdev);
+ if (err) {
+ dev_err(&pdev->dev, "%s: pci_enable_device failed: %d\n",
+ __func__, err);
+ return err;
+ }
+
+ err = verify_pci_2_3(pdev);
+ if (err)
+ goto err_verify;
+
+ vdev = kzalloc(sizeof(struct vfio_dev), GFP_KERNEL);
+ if (!vdev) {
+ err = -ENOMEM;
+ goto err_alloc;
+ }
+ vdev->pdev = pdev;
+
+ err = vfio_class_init();
+ if (err)
+ goto err_class;
+
+ mutex_init(&vdev->lgate);
+ mutex_init(&vdev->dgate);
+ mutex_init(&vdev->igate);
+
+ err = vfio_get_devnum(vdev);
+ if (err < 0)
+ goto err_get_devnum;
+ vdev->devnum = err;
+ err = 0;
+
+ sprintf(vdev->name, "vfio%d", MINOR(vdev->devnum));
+ pci_set_drvdata(pdev, vdev);
+ vdev->dev = device_create(vfio_class->class, &pdev->dev,
+ vdev->devnum, vdev, vdev->name);
+ if (IS_ERR(vdev->dev)) {
+ printk(KERN_ERR "VFIO: device register failed\n");
+ err = PTR_ERR(vdev->dev);
+ goto err_device_create;
+ }
+
+ err = vfio_dev_add_attributes(vdev);
+ if (err)
+ goto err_vfio_dev_add_attributes;
+
+
+ if (pdev->irq > 0) {
+ err = request_irq(pdev->irq, vfio_interrupt,
+ IRQF_SHARED, "vfio", vdev);
+ if (err)
+ goto err_request_irq;
+ }
+ vdev->vinfo.bardirty = 1;
+
+ return 0;
+
+err_request_irq:
+#ifdef notdef
+ vfio_dev_del_attributes(vdev);
+#endif
+err_vfio_dev_add_attributes:
+ device_destroy(vfio_class->class, vdev->devnum);
+err_device_create:
+ vfio_free_minor(vdev);
+err_get_devnum:
+err_class:
+ kfree(vdev);
+err_alloc:
+err_verify:
+ pci_disable_device(pdev);
+ return err;
+}
+
+static void vfio_remove(struct pci_dev *pdev)
+{
+ struct vfio_dev *vdev = pci_get_drvdata(pdev);
+
+ vfio_free_minor(vdev);
+
+ if (pdev->irq > 0)
+ free_irq(pdev->irq, vdev);
+
+#ifdef notdef
+ vfio_dev_del_attributes(vdev);
+#endif
+
+ pci_set_drvdata(pdev, NULL);
+ device_destroy(vfio_class->class, vdev->devnum);
+ kfree(vdev);
+ vfio_class_destroy();
+ pci_disable_device(pdev);
+}
+
+static struct pci_driver driver = {
+ .name = "vfio",
+ .id_table = NULL, /* only dynamic id's */
+ .probe = vfio_probe,
+ .remove = vfio_remove,
+};
+
+static int __init init(void)
+{
+ pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
+ return pci_register_driver(&driver);
+}
+
+static void __exit cleanup(void)
+{
+ if (vfio_major >= 0)
+ unregister_chrdev(vfio_major, "vfio");
+ pci_unregister_driver(&driver);
+}
+
+module_init(init);
+module_exit(cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff -uprN linux-2.6.34/drivers/vfio/vfio_pci_config.c vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c
--- linux-2.6.34/drivers/vfio/vfio_pci_config.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c 2010-05-28 14:26:47.000000000 -0700
@@ -0,0 +1,554 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <[email protected]>
+ * Copyright(C) 2005, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2006, Hans J. Koch <[email protected]>
+ * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <[email protected]>
+ */
+
+#include <linux/fs.h>
+#include <linux/pci.h>
+#include <linux/mmu_notifier.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+#define PCI_CAP_ID_BASIC 0
+#ifndef PCI_CAP_ID_MAX
+#define PCI_CAP_ID_MAX PCI_CAP_ID_AF
+#endif
+
+/*
+ * Lengths of PCI Config Capabilities
+ * 0 means unknown (but at least 4)
+ * FF means special/variable
+ */
+static u8 pci_capability_length[] = {
+ [PCI_CAP_ID_BASIC] = 64, /* pci config header */
+ [PCI_CAP_ID_PM] = PCI_PM_SIZEOF,
+ [PCI_CAP_ID_AGP] = PCI_AGP_SIZEOF,
+ [PCI_CAP_ID_VPD] = 8,
+ [PCI_CAP_ID_SLOTID] = 4,
+ [PCI_CAP_ID_MSI] = 0xFF, /* 10, 14, or 24 */
+ [PCI_CAP_ID_CHSWP] = 4,
+ [PCI_CAP_ID_PCIX] = 0xFF, /* 8 or 24 */
+ [PCI_CAP_ID_HT] = 28,
+ [PCI_CAP_ID_VNDR] = 0xFF,
+ [PCI_CAP_ID_DBG] = 0,
+ [PCI_CAP_ID_CCRC] = 0,
+ [PCI_CAP_ID_SHPC] = 0,
+ [PCI_CAP_ID_SSVID] = 0, /* bridge only - not supp */
+ [PCI_CAP_ID_AGP3] = 0,
+ [PCI_CAP_ID_EXP] = 36,
+ [PCI_CAP_ID_MSIX] = 12,
+ [PCI_CAP_ID_AF] = 6,
+};
+
+/*
+ * Read/Write Permission Bits - one bit for each bit in capability
+ * Any field can be read if it exists,
+ * but what is read depends on whether the field
+ * is 'virtualized', or just pass thru to the hardware.
+ * Any virtualized field is also virtualized for writes.
+ * Writes are only permitted if they have a 1 bit here.
+ */
+struct perm_bits {
+ u32 rvirt; /* read bits which must be virtualized */
+ u32 write; /* writeable bits - virt if read virt */
+};
+
+static struct perm_bits pci_cap_basic_perm[] = {
+ { 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
+ { 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
+ { 0, 0xFF00FFFF, }, /* 0x08 bist, htype, lat, cache */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x0c bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
+ { 0, 0, }, /* 0x24 cardbus - not yet */
+ { 0, 0, }, /* 0x28 subsys vendor & dev */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x2c rom bar */
+ { 0, 0, }, /* 0x30 capability ptr & resv */
+ { 0, 0, }, /* 0x34 resv */
+ { 0, 0, }, /* 0x38 resv */
+ { 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
+};
+
+static struct perm_bits pci_cap_pm_perm[] = {
+ { 0, 0, }, /* 0x00 PM capabilities */
+ { 0, 0xFFFFFFFF, }, /* 0x04 PM control/status */
+};
+
+static struct perm_bits pci_cap_vpd_perm[] = {
+ { 0, 0xFFFF0000, }, /* 0x00 address */
+ { 0, 0xFFFFFFFF, }, /* 0x04 data */
+};
+
+static struct perm_bits pci_cap_slotid_perm[] = {
+ { 0, 0, }, /* 0x00 all read only */
+};
+
+static struct perm_bits pci_cap_msi_perm[] = {
+ { 0, 0, }, /* 0x00 MSI message control */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message addr/data */
+ { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */
+ { 0, 0xFFFFFFFF, }, /* 0x10 MSI mask bits */
+ { 0, 0xFFFFFFFF, }, /* 0x14 MSI pending bits */
+};
+
+static struct perm_bits pci_cap_pcix_perm[] = {
+ { 0, 0xFFFF0000, }, /* 0x00 PCI_X_CMD */
+ { 0, 0, }, /* 0x04 PCI_X_STATUS */
+ { 0, 0xFFFFFFFF, }, /* 0x08 ECC ctlr & status */
+ { 0, 0, }, /* 0x0c ECC first addr */
+ { 0, 0, }, /* 0x10 ECC second addr */
+ { 0, 0, }, /* 0x14 ECC attr */
+};
+
+/* pci express capabilities */
+static struct perm_bits pci_cap_exp_perm[] = {
+ { 0, 0, }, /* 0x00 PCIe capabilities */
+ { 0, 0, }, /* 0x04 PCIe device capabilities */
+ { 0, 0xFFFFFFFF, }, /* 0x08 PCIe device control & status */
+ { 0, 0, }, /* 0x0c PCIe link capabilities */
+ { 0, 0x000000FF, }, /* 0x10 PCIe link ctl/stat - SAFE? */
+ { 0, 0, }, /* 0x14 PCIe slot capabilities */
+ { 0, 0x00FFFFFF, }, /* 0x18 PCIe link ctl/stat - SAFE? */
+ { 0, 0, }, /* 0x1c PCIe root port stuff */
+ { 0, 0, }, /* 0x20 PCIe root port stuff */
+};
+
+static struct perm_bits pci_cap_msix_perm[] = {
+ { 0, 0, }, /* 0x00 MSI-X Enable */
+ { 0, 0, }, /* 0x04 table offset & bir */
+ { 0, 0, }, /* 0x08 pba offset & bir */
+};
+
+static struct perm_bits pci_cap_af_perm[] = {
+ { 0, 0, }, /* 0x00 af capability */
+ { 0, 0x0001, }, /* 0x04 af flr bit */
+};
+
+static struct perm_bits *pci_cap_perms[] = {
+ [PCI_CAP_ID_BASIC] = pci_cap_basic_perm,
+ [PCI_CAP_ID_PM] = pci_cap_pm_perm,
+ [PCI_CAP_ID_VPD] = pci_cap_vpd_perm,
+ [PCI_CAP_ID_SLOTID] = pci_cap_slotid_perm,
+ [PCI_CAP_ID_MSI] = pci_cap_msi_perm,
+ [PCI_CAP_ID_PCIX] = pci_cap_pcix_perm,
+ [PCI_CAP_ID_EXP] = pci_cap_exp_perm,
+ [PCI_CAP_ID_MSIX] = pci_cap_msix_perm,
+ [PCI_CAP_ID_AF] = pci_cap_af_perm,
+};
+
+/*
+ * We build a map of the config space that tells us where
+ * and what capabilities exist, so that we can map reads and
+ * writes back to capabilities, and thus figure out what to
+ * allow, deny, or virtualize
+ */
+int vfio_build_config_map(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u8 *map;
+ int i, len;
+ u8 pos, cap, tmp;
+ u16 flags;
+ int ret;
+ int loops = 100;
+
+	map = kmalloc(pdev->cfg_size, GFP_KERNEL);
+	if (map == NULL)
+		return -ENOMEM;
+	memset(map, 0xFF, pdev->cfg_size);
+	vdev->pci_config_map = map;
+
+	/* default config space */
+	memset(map, 0, pci_capability_length[PCI_CAP_ID_BASIC]);
+
+ /* any capabilities? */
+ ret = pci_read_config_word(pdev, PCI_STATUS, &flags);
+ if (ret < 0)
+ return ret;
+ if ((flags & PCI_STATUS_CAP_LIST) == 0)
+ return 0;
+
+ ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos);
+ if (ret < 0)
+ return ret;
+ while (pos && --loops > 0) {
+ ret = pci_read_config_byte(pdev, pos, &cap);
+ if (ret < 0)
+ return ret;
+ if (cap == 0) {
+ printk(KERN_WARNING "%s: cap 0\n", __func__);
+ break;
+ }
+ if (cap > PCI_CAP_ID_MAX) {
+ printk(KERN_WARNING "%s: unknown pci capability id %x\n",
+ __func__, cap);
+ len = 0;
+ } else
+ len = pci_capability_length[cap];
+ if (len == 0) {
+ printk(KERN_WARNING "%s: unknown length for pci cap %x\n",
+ __func__, cap);
+ len = 4;
+ }
+ if (len == 0xFF) {
+ switch (cap) {
+ case PCI_CAP_ID_MSI:
+ ret = pci_read_config_word(pdev,
+ pos + PCI_MSI_FLAGS, &flags);
+ if (ret < 0)
+ return ret;
+ if (flags & PCI_MSI_FLAGS_MASKBIT)
+ /* per vec masking */
+ len = 24;
+ else if (flags & PCI_MSI_FLAGS_64BIT)
+ /* 64 bit */
+ len = 14;
+ else
+ len = 10;
+ break;
+ case PCI_CAP_ID_PCIX:
+ ret = pci_read_config_word(pdev, pos + 2,
+ &flags);
+ if (ret < 0)
+ return ret;
+ if (flags & 0x3000)
+ len = 24;
+ else
+ len = 8;
+ break;
+ case PCI_CAP_ID_VNDR:
+ /* length follows next field */
+ ret = pci_read_config_byte(pdev, pos + 2, &tmp);
+ if (ret < 0)
+ return ret;
+ len = tmp;
+ break;
+ default:
+ len = 0;
+ break;
+ }
+ }
+
+		for (i = 0; i < len; i++) {
+			if (map[pos+i] != 0xFF)
+				printk(KERN_WARNING
+					"%s: pci config conflict at %x, "
+					"caps %x %x\n",
+					__func__, pos + i, map[pos+i], cap);
+			map[pos+i] = cap;
+		}
+ ret = pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, &pos);
+ if (ret < 0)
+ return ret;
+ }
+ if (loops <= 0)
+ printk(KERN_ERR "%s: config space loop!\n", __func__);
+ return 0;
+}
+
+static void vfio_virt_init(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int bar;
+ u32 *lp;
+ u32 val;
+	u8 pos;
+ u16 flags;
+ int i, len;
+ int ret;
+
+	for (bar = 0; bar <= 5; bar++) {
+		lp = (u32 *)&vdev->vinfo.bar[bar * 4];
+		pci_read_config_dword(pdev, PCI_BASE_ADDRESS_0 + 4*bar, &val);
+		*lp = val;
+	}
+ lp = (u32 *)vdev->vinfo.rombar;
+ pci_read_config_dword(pdev, PCI_ROM_ADDRESS, &val);
+ *lp = val;
+
+ vdev->vinfo.intr = pdev->irq;
+
+	pos = pci_find_capability(pdev, PCI_CAP_ID_MSI);
+	if (pos > 0) {
+ ret = pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &flags);
+ if (ret < 0)
+ return;
+ if (flags & PCI_MSI_FLAGS_MASKBIT) /* per vec masking */
+ len = 24;
+ else if (flags & PCI_MSI_FLAGS_64BIT) /* 64 bit */
+ len = 14;
+ else
+ len = 10;
+ for (i = 0; i < len; i++)
+ (void) pci_read_config_byte(pdev, pos + i,
+ &vdev->vinfo.msi[i]);
+ }
+}
+
+static void vfio_bar_fixup(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int bar;
+ u32 *lp;
+ u32 len;
+
+ for (bar = 0; bar <= 5; bar++) {
+ len = pci_resource_len(pdev, bar);
+ lp = (u32 *)&vdev->vinfo.bar[bar * 4];
+ if (len == 0) {
+ *lp = 0;
+ } else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
+ *lp &= ~0x1;
+ *lp = (*lp & ~(len-1)) |
+ (*lp & ~PCI_BASE_ADDRESS_MEM_MASK);
+ if (*lp & PCI_BASE_ADDRESS_MEM_TYPE_64)
+ bar++;
+ } else if (pci_resource_flags(pdev, bar) & IORESOURCE_IO) {
+ *lp |= PCI_BASE_ADDRESS_SPACE_IO;
+ *lp = (*lp & ~(len-1)) |
+ (*lp & ~PCI_BASE_ADDRESS_IO_MASK);
+ }
+ }
+ lp = (u32 *)vdev->vinfo.rombar;
+ len = pci_resource_len(pdev, PCI_ROM_RESOURCE);
+ *lp = *lp & PCI_ROM_ADDRESS_MASK & ~(len-1);
+ vdev->vinfo.bardirty = 0;
+}
+
+static int vfio_config_rwbyte(int write,
+ struct vfio_dev *vdev,
+ int pos,
+ char __user *buf)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u8 *map = vdev->pci_config_map;
+ u8 cap, val, newval;
+ u16 start, off;
+ int p;
+ struct perm_bits *perm;
+ u8 wr, virt;
+ int ret;
+
+ cap = map[pos];
+ if (cap == 0xFF) { /* unknown region */
+ if (write)
+ return 0; /* silent no-op */
+ val = 0;
+ if (pos <= pci_capability_length[0]) /* ok to read */
+ (void) pci_read_config_byte(pdev, pos, &val);
+ if (copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ return 0;
+ }
+
+ /* scan back to start of cap region */
+ for (p = pos; p >= 0; p--) {
+ if (map[p] != cap)
+ break;
+ start = p;
+ }
+ off = pos - start; /* offset within capability */
+
+ perm = pci_cap_perms[cap];
+ if (perm == NULL) {
+ wr = 0;
+ virt = 0;
+ } else {
+ perm += (off >> 2);
+ wr = perm->write >> ((off & 3) * 8);
+ virt = perm->rvirt >> ((off & 3) * 8);
+ }
+ if (write && !wr) /* no writeable bits */
+ return 0;
+ if (!virt) {
+ if (write) {
+ if (copy_from_user(&val, buf, 1))
+ return -EFAULT;
+ val &= wr;
+ if (wr != 0xFF) {
+ u8 existing;
+
+ ret = pci_read_config_byte(pdev, pos,
+ &existing);
+ if (ret < 0)
+ return ret;
+ val |= (existing & ~wr);
+ }
+ pci_write_config_byte(pdev, pos, val);
+ } else {
+ ret = pci_read_config_byte(pdev, pos, &val);
+ if (ret < 0)
+ return ret;
+ if (copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ }
+ return 0;
+ }
+
+ if (write) {
+ if (copy_from_user(&newval, buf, 1))
+ return -EFAULT;
+ }
+ /*
+ * We get here if there are some virt bits
+ * handle remaining real bits, if any
+ */
+ if (~virt) {
+ u8 rbits = (~virt) & wr;
+
+ ret = pci_read_config_byte(pdev, pos, &val);
+ if (ret < 0)
+ return ret;
+ if (write && rbits) {
+ val &= ~rbits;
+ newval &= rbits;
+ val |= newval;
+ pci_write_config_byte(pdev, pos, val);
+ }
+ }
+ /*
+ * Now handle entirely virtual fields
+ */
+ switch (cap) {
+ case PCI_CAP_ID_BASIC: /* virtualize BARs */
+ switch (off) {
+ /*
+ * vendor and device are virt because they don't
+ * show up otherwise for sr-iov vfs
+ */
+ case PCI_VENDOR_ID:
+ val = pdev->vendor;
+ break;
+ case PCI_VENDOR_ID + 1:
+ val = pdev->vendor >> 8;
+ break;
+ case PCI_DEVICE_ID:
+ val = pdev->device;
+ break;
+ case PCI_DEVICE_ID + 1:
+ val = pdev->device >> 8;
+ break;
+ case PCI_INTERRUPT_LINE:
+ if (write)
+ vdev->vinfo.intr = newval;
+ else
+ val = vdev->vinfo.intr;
+ break;
+ case PCI_ROM_ADDRESS:
+ case PCI_ROM_ADDRESS+1:
+ case PCI_ROM_ADDRESS+2:
+ case PCI_ROM_ADDRESS+3:
+ if (write) {
+ vdev->vinfo.rombar[off & 3] = newval;
+ vdev->vinfo.bardirty = 1;
+ } else {
+ if (vdev->vinfo.bardirty)
+ vfio_bar_fixup(vdev);
+ val = vdev->vinfo.rombar[off & 3];
+ }
+ break;
+ default:
+ if (off >= PCI_BASE_ADDRESS_0 &&
+ off <= PCI_BASE_ADDRESS_5 + 3) {
+ int boff = off - PCI_BASE_ADDRESS_0;
+
+ if (write) {
+ vdev->vinfo.bar[boff] = newval;
+ vdev->vinfo.bardirty = 1;
+ } else {
+ if (vdev->vinfo.bardirty)
+ vfio_bar_fixup(vdev);
+ val = vdev->vinfo.bar[boff];
+ }
+ }
+ break;
+ }
+ break;
+ case PCI_CAP_ID_MSI: /* virtualize MSI */
+ if (off >= PCI_MSI_ADDRESS_LO && off <= (PCI_MSI_DATA_64 + 2)) {
+ int moff = off - PCI_MSI_ADDRESS_LO;
+
+ if (write)
+ vdev->vinfo.msi[moff] = newval;
+ else
+ val = vdev->vinfo.msi[moff];
+ break;
+ }
+ break;
+ }
+ if (!write && copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ return 0;
+}
+
+ssize_t vfio_config_readwrite(int write,
+ struct vfio_dev *vdev,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int done = 0;
+ int ret;
+ int pos;
+
+ pci_block_user_cfg_access(pdev);
+
+ if (vdev->pci_config_map == NULL) {
+ ret = vfio_build_config_map(vdev);
+ if (ret < 0)
+ goto out;
+ vfio_virt_init(vdev);
+ }
+
+ while (count > 0) {
+ pos = *ppos;
+ if (pos == pdev->cfg_size)
+ break;
+ if (pos > pdev->cfg_size) {
+ ret = -EINVAL;
+ goto out;
+ }
+ ret = vfio_config_rwbyte(write, vdev, pos, buf);
+ if (ret < 0)
+ goto out;
+ buf++;
+ done++;
+ count--;
+ (*ppos)++;
+ }
+ ret = done;
+out:
+ pci_unblock_user_cfg_access(pdev);
+ return ret;
+}
diff -uprN linux-2.6.34/drivers/vfio/vfio_rdwr.c vfio-linux-2.6.34/drivers/vfio/vfio_rdwr.c
--- linux-2.6.34/drivers/vfio/vfio_rdwr.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_rdwr.c 2010-05-28 14:27:40.000000000 -0700
@@ -0,0 +1,147 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <[email protected]>
+ * Copyright(C) 2005, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2006, Hans J. Koch <[email protected]>
+ * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <[email protected]>
+ */
+
+#include <linux/fs.h>
+#include <linux/mmu_notifier.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/io.h>
+
+#include <linux/vfio.h>
+
+ssize_t vfio_io_readwrite(
+ int write,
+ struct vfio_dev *vdev,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ size_t done = 0;
+ resource_size_t end;
+ void __iomem *io;
+ loff_t pos;
+ int pci_space;
+ int unit;
+
+ pci_space = vfio_offset_to_pci_space(*ppos);
+ pos = (*ppos & 0xFFFFFFFF);
+
+ if (vdev->bar[pci_space] == NULL)
+ vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
+ io = vdev->bar[pci_space];
+ if (io == NULL)
+ return -EINVAL;
+ end = pci_resource_len(pdev, pci_space);
+ if (pos + count > end)
+ return -EINVAL;
+
+ while (count > 0) {
+ if ((pos % 4) == 0 && count >= 4) {
+ u32 val;
+
+ if (write) {
+ if (copy_from_user(&val, buf, 4))
+ return -EFAULT;
+ iowrite32(val, io + pos);
+ } else {
+ val = ioread32(io + pos);
+ if (copy_to_user(buf, &val, 4))
+ return -EFAULT;
+ }
+ unit = 4;
+ } else if ((pos % 2) == 0 && count >= 2) {
+ u16 val;
+
+ if (write) {
+ if (copy_from_user(&val, buf, 2))
+ return -EFAULT;
+ iowrite16(val, io + pos);
+ } else {
+ val = ioread16(io + pos);
+ if (copy_to_user(buf, &val, 2))
+ return -EFAULT;
+ }
+ unit = 2;
+ } else {
+ u8 val;
+
+ if (write) {
+ if (copy_from_user(&val, buf, 1))
+ return -EFAULT;
+ iowrite8(val, io + pos);
+ } else {
+ val = ioread8(io + pos);
+ if (copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ }
+ unit = 1;
+ }
+ pos += unit;
+ buf += unit;
+ count -= unit;
+ done += unit;
+ }
+ *ppos += done;
+ return done;
+}
+
+ssize_t vfio_mem_readwrite(
+ int write,
+ struct vfio_dev *vdev,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ resource_size_t end;
+ void __iomem *io;
+ loff_t pos;
+ int pci_space;
+
+ pci_space = vfio_offset_to_pci_space(*ppos);
+ pos = (*ppos & 0xFFFFFFFF);
+
+ if (vdev->bar[pci_space] == NULL)
+ vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
+ io = vdev->bar[pci_space];
+ if (io == NULL)
+ return -EINVAL;
+ end = pci_resource_len(pdev, pci_space);
+ if (pos > end)
+ return -EINVAL;
+ if (pos == end)
+ return 0;
+ if (pos + count > end)
+ count = end - pos;
+ if (write) {
+ if (copy_from_user(io + pos, buf, count))
+ return -EFAULT;
+ } else {
+ if (copy_to_user(buf, io + pos, count))
+ return -EFAULT;
+ }
+ *ppos += count;
+ return count;
+}
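[Editor's sketch] vfio_io_readwrite above splits each transfer into naturally aligned 4-, 2-, or 1-byte port accesses. The decision can be isolated into a pure helper (the name `vfio_access_unit` is mine, for illustration only):

```c
#include <assert.h>
#include <stddef.h>

/* Mirror of the access-size choice in vfio_io_readwrite: pick the
 * largest naturally aligned unit (4, 2, or 1 bytes) for the current
 * position and remaining byte count. */
int vfio_access_unit(size_t pos, size_t count)
{
	if ((pos % 4) == 0 && count >= 4)
		return 4;
	if ((pos % 2) == 0 && count >= 2)
		return 2;
	return 1;
}
```

So an 8-byte transfer starting at offset 1 is issued as 1 + 2 + 4 + 1 byte accesses, keeping every ioread/iowrite naturally aligned as many devices require.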
diff -uprN linux-2.6.34/drivers/vfio/vfio_sysfs.c vfio-linux-2.6.34/drivers/vfio/vfio_sysfs.c
--- linux-2.6.34/drivers/vfio/vfio_sysfs.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_sysfs.c 2010-05-28 14:04:34.000000000 -0700
@@ -0,0 +1,153 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <[email protected]>
+ * Copyright(C) 2005, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2006, Hans J. Koch <[email protected]>
+ * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <[email protected]>
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/idr.h>
+#include <linux/pci.h>
+#include <linux/mmu_notifier.h>
+
+#include <linux/vfio.h>
+
+struct vfio_class *vfio_class;
+
+int vfio_class_init(void)
+{
+ int ret = 0;
+
+ if (vfio_class != NULL) {
+ kref_get(&vfio_class->kref);
+ goto exit;
+ }
+
+ vfio_class = kzalloc(sizeof(*vfio_class), GFP_KERNEL);
+ if (!vfio_class) {
+ ret = -ENOMEM;
+ goto err_kzalloc;
+ }
+
+ kref_init(&vfio_class->kref);
+ vfio_class->class = class_create(THIS_MODULE, "vfio");
+ if (IS_ERR(vfio_class->class)) {
+ ret = PTR_ERR(vfio_class->class);
+ printk(KERN_ERR "class_create failed for vfio\n");
+ goto err_class_create;
+ }
+ return 0;
+
+err_class_create:
+ kfree(vfio_class);
+ vfio_class = NULL;
+err_kzalloc:
+exit:
+ return ret;
+}
+
+static void vfio_class_release(struct kref *kref)
+{
+ /* Ok, we cheat as we know we only have one vfio_class */
+ class_destroy(vfio_class->class);
+ kfree(vfio_class);
+ vfio_class = NULL;
+}
+
+void vfio_class_destroy(void)
+{
+ if (vfio_class)
+ kref_put(&vfio_class->kref, vfio_class_release);
+}
+
+ssize_t config_map_read(struct kobject *kobj, struct bin_attribute *bin_attr,
+ char *buf, loff_t off, size_t count)
+{
+ struct vfio_dev *vdev = bin_attr->private;
+ int ret;
+
+ if (off >= 256)
+ return 0;
+ if (off + count > 256)
+ count = 256 - off;
+ if (vdev->pci_config_map == NULL) {
+ ret = vfio_build_config_map(vdev);
+ if (ret < 0)
+ return ret;
+ }
+ memcpy(buf, vdev->pci_config_map + off, count);
+ return count;
+}
+
+static ssize_t show_locked_pages(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct vfio_dev *vdev = dev_get_drvdata(dev);
+
+ if (vdev == NULL)
+ return -ENODEV;
+ return sprintf(buf, "%u\n", vdev->locked_pages);
+}
+
+static DEVICE_ATTR(locked_pages, S_IRUGO, show_locked_pages, NULL);
+
+static struct attribute *vfio_attrs[] = {
+ &dev_attr_locked_pages.attr,
+ NULL,
+};
+
+static struct attribute_group vfio_attr_grp = {
+ .attrs = vfio_attrs,
+};
+
+struct bin_attribute config_map_bin_attribute = {
+ .attr = {
+ .name = "config_map",
+ .mode = S_IRUGO,
+ },
+ .size = 256,
+ .read = config_map_read,
+};
+
+int vfio_dev_add_attributes(struct vfio_dev *vdev)
+{
+ struct bin_attribute *bi;
+ int ret;
+
+ ret = sysfs_create_group(&vdev->dev->kobj, &vfio_attr_grp);
+ if (ret)
+ return ret;
+ bi = kmalloc(sizeof(*bi), GFP_KERNEL);
+ if (bi == NULL)
+ return -ENOMEM;
+ *bi = config_map_bin_attribute;
+ bi->private = vdev;
+ ret = sysfs_create_bin_file(&vdev->dev->kobj, bi);
+ if (ret)
+ kfree(bi);
+ return ret;
+}
diff -uprN linux-2.6.34/include/linux/uiommu.h vfio-linux-2.6.34/include/linux/uiommu.h
--- linux-2.6.34/include/linux/uiommu.h 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/include/linux/uiommu.h 2010-06-03 20:35:35.000000000 -0700
@@ -0,0 +1,62 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/*
+ * uiommu driver - manipulation of iommu domains from user progs
+ */
+struct uiommu_domain {
+ struct iommu_domain *domain;
+ atomic_t refcnt;
+};
+
+struct uiommu_domain *uiommu_fdget(int fd);
+void uiommu_put(struct uiommu_domain *);
+
+static inline int uiommu_attach_device(struct uiommu_domain *udomain,
+ struct device *dev)
+{
+ return iommu_attach_device(udomain->domain, dev);
+}
+
+static inline void uiommu_detach_device(struct uiommu_domain *udomain,
+ struct device *dev)
+{
+ iommu_detach_device(udomain->domain, dev);
+}
+
+static inline int uiommu_map_range(struct uiommu_domain *udomain,
+ unsigned long iova,
+ phys_addr_t paddr,
+ size_t size,
+ int prot)
+{
+ return iommu_map_range(udomain->domain, iova, paddr, size, prot);
+}
+
+static inline void uiommu_unmap_range(struct uiommu_domain *udomain,
+ unsigned long iova,
+ size_t size)
+{
+ iommu_unmap_range(udomain->domain, iova, size);
+}
+
+static inline phys_addr_t uiommu_iova_to_phys(struct uiommu_domain *udomain,
+ unsigned long iova)
+{
+ return iommu_iova_to_phys(udomain->domain, iova);
+}
diff -uprN linux-2.6.34/include/linux/vfio.h vfio-linux-2.6.34/include/linux/vfio.h
--- linux-2.6.34/include/linux/vfio.h 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/include/linux/vfio.h 2010-06-07 12:20:06.000000000 -0700
@@ -0,0 +1,200 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <[email protected]>
+ * Copyright(C) 2005, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2006, Hans J. Koch <[email protected]>
+ * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <[email protected]>
+ */
+
+/*
+ * VFIO driver - allow mapping and use of certain PCI devices
+ * in unprivileged user processes. (If IOMMU is present)
+ * Especially useful for Virtual Function parts of SR-IOV devices
+ */
+
+#ifdef __KERNEL__
+
+struct vfio_dev {
+ struct device *dev;
+ struct pci_dev *pdev;
+ u8 *pci_config_map;
+ int pci_config_size;
+ char name[8];
+ int devnum;
+ int pmaster;
+ void __iomem *bar[PCI_ROM_RESOURCE+1];
+ spinlock_t irqlock; /* guards command register accesses */
+ int listeners;
+ u32 locked_pages;
+ struct mutex lgate; /* listener gate */
+ struct mutex dgate; /* dma op gate */
+ struct mutex igate; /* intr op gate */
+ struct msix_entry *msix;
+ int nvec;
+ struct uiommu_domain *udomain;
+ int cachec;
+ struct eventfd_ctx *ev_irq;
+ struct eventfd_ctx *ev_msi;
+ struct eventfd_ctx **ev_msix;
+ struct {
+ u8 intr;
+ u8 bardirty;
+ u8 rombar[4];
+ u8 bar[6*4];
+ u8 msi[24];
+ } vinfo;
+};
+
+struct vfio_listener {
+ struct vfio_dev *vdev;
+ struct list_head dm_list;
+ struct mm_struct *mm;
+ struct mmu_notifier mmu_notifier;
+};
+
+/*
+ * Structure for keeping track of memory nailed down by the
+ * user for DMA
+ */
+struct dma_map_page {
+ struct list_head list;
+ struct page **pages;
+ dma_addr_t daddr;
+ unsigned long vaddr;
+ int npage;
+ int rdwr;
+};
+
+/* VFIO class infrastructure */
+struct vfio_class {
+ struct kref kref;
+ struct class *class;
+};
+extern struct vfio_class *vfio_class;
+
+ssize_t vfio_io_readwrite(int, struct vfio_dev *,
+ char __user *, size_t, loff_t *);
+ssize_t vfio_mem_readwrite(int, struct vfio_dev *,
+ char __user *, size_t, loff_t *);
+ssize_t vfio_config_readwrite(int, struct vfio_dev *,
+ char __user *, size_t, loff_t *);
+
+void vfio_disable_msi(struct vfio_dev *);
+void vfio_disable_msix(struct vfio_dev *);
+int vfio_enable_msi(struct vfio_dev *, int);
+int vfio_enable_msix(struct vfio_dev *, int, void __user *);
+
+#ifndef PCI_MSIX_ENTRY_SIZE
+#define PCI_MSIX_ENTRY_SIZE 16
+#endif
+#ifndef PCI_STATUS_INTERRUPT
+#define PCI_STATUS_INTERRUPT 0x08
+#endif
+
+struct vfio_dma_map;
+void vfio_dma_unmapall(struct vfio_listener *);
+int vfio_dma_unmap_dm(struct vfio_listener *, struct vfio_dma_map *);
+int vfio_dma_map_common(struct vfio_listener *, unsigned int,
+ struct vfio_dma_map *);
+int vfio_domain_set(struct vfio_dev *, int);
+void vfio_domain_unset(struct vfio_dev *);
+
+int vfio_class_init(void);
+void vfio_class_destroy(void);
+int vfio_dev_add_attributes(struct vfio_dev *);
+extern struct idr vfio_idr;
+extern struct mutex vfio_minor_lock;
+int vfio_build_config_map(struct vfio_dev *);
+
+irqreturn_t vfio_interrupt(int, void *);
+
+#endif /* __KERNEL__ */
+
+/* Kernel & User level defines for ioctls */
+
+/*
+ * Structure for DMA mapping of user buffers
+ * vaddr, dmaaddr, and size must all be page aligned
+ * buffer may only be larger than 1 page if (a) there is
+ * an iommu in the system, or (b) buffer is part of a huge page
+ */
+struct vfio_dma_map {
+ __u64 vaddr; /* process virtual addr */
+ __u64 dmaaddr; /* desired and/or returned dma address */
+ __u64 size; /* size in bytes */
+ __u64 flags; /* bool: 0 for r/o; 1 for r/w */
+#define VFIO_FLAG_WRITE 0x1 /* req writeable DMA mem */
+};
+
+/* map user pages at specific dma address */
+/* requires previous VFIO_DOMAIN_SET */
+#define VFIO_DMA_MAP_IOVA _IOWR(';', 101, struct vfio_dma_map)
+
+/* unmap user pages */
+#define VFIO_DMA_UNMAP _IOW(';', 102, struct vfio_dma_map)
+
+/* set device DMA mask & master status */
+#define VFIO_DMA_MASK _IOW(';', 103, __u64)
+
+/* request IRQ interrupts; use given eventfd */
+#define VFIO_EVENTFD_IRQ _IOW(';', 104, int)
+
+/* request MSI interrupts; use given eventfd */
+#define VFIO_EVENTFD_MSI _IOW(';', 105, int)
+
+/* Request MSI-X interrupts: arg[0] is #, arg[1-n] are eventfds */
+#define VFIO_EVENTFDS_MSIX _IOW(';', 106, int)
+
+/* Get length of a BAR */
+#define VFIO_BAR_LEN _IOWR(';', 107, __u32)
+
+/* Set the IOMMU domain - arg is fd from uiommu driver */
+#define VFIO_DOMAIN_SET _IOW(';', 108, int)
+
+/* Unset the IOMMU domain */
+#define VFIO_DOMAIN_UNSET _IO(';', 109)
+
+/*
+ * Reads, writes, and mmaps determine which PCI BAR (or config space)
+ * from the high level bits of the file offset
+ */
+#define VFIO_PCI_BAR0_RESOURCE 0x0
+#define VFIO_PCI_BAR1_RESOURCE 0x1
+#define VFIO_PCI_BAR2_RESOURCE 0x2
+#define VFIO_PCI_BAR3_RESOURCE 0x3
+#define VFIO_PCI_BAR4_RESOURCE 0x4
+#define VFIO_PCI_BAR5_RESOURCE 0x5
+#define VFIO_PCI_ROM_RESOURCE 0x6
+#define VFIO_PCI_CONFIG_RESOURCE 0xF
+#define VFIO_PCI_SPACE_SHIFT 32
+#define VFIO_PCI_CONFIG_OFF vfio_pci_space_to_offset(VFIO_PCI_CONFIG_RESOURCE)
+
+static inline int vfio_offset_to_pci_space(__u64 off)
+{
+ return (off >> VFIO_PCI_SPACE_SHIFT) & 0xF;
+}
+
+static inline __u64 vfio_pci_space_to_offset(int sp)
+{
+ return (__u64)(sp) << VFIO_PCI_SPACE_SHIFT;
+}
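[Editor's sketch] A userspace driver mirrors the two inline helpers at the end of vfio.h to build file offsets for read/write/mmap: the BAR (or config-space) selector lives in bits 32-35 of the offset, the low 32 bits address within that space. For example, `pread(vfio_fd, buf, 4, u_pci_space_to_offset(2) + 0x10)` would read 4 bytes at offset 0x10 of BAR2 (the `u_` names are mine, to avoid clashing with the kernel inlines):

```c
#include <assert.h>
#include <stdint.h>

#define VFIO_PCI_SPACE_SHIFT     32
#define VFIO_PCI_CONFIG_RESOURCE 0xF

/* Userspace copies of the vfio.h offset helpers. */
static inline int u_offset_to_pci_space(uint64_t off)
{
	return (off >> VFIO_PCI_SPACE_SHIFT) & 0xF;
}

static inline uint64_t u_pci_space_to_offset(int sp)
{
	return (uint64_t)sp << VFIO_PCI_SPACE_SHIFT;
}
```

The encoding round-trips, and config space is just selector 0xF, so `lseek(fd, u_pci_space_to_offset(0xF), SEEK_SET)` positions at the start of config space.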
diff -uprN linux-2.6.34/MAINTAINERS vfio-linux-2.6.34/MAINTAINERS
--- linux-2.6.34/MAINTAINERS 2010-05-16 14:17:36.000000000 -0700
+++ vfio-linux-2.6.34/MAINTAINERS 2010-05-28 12:30:21.000000000 -0700
@@ -5968,6 +5968,13 @@ S: Maintained
F: Documentation/fb/uvesafb.txt
F: drivers/video/uvesafb.*
+VFIO DRIVER
+M: Tom Lyon <[email protected]>
+S: Supported
+F: Documentation/vfio.txt
+F: drivers/vfio/
+F: include/linux/vfio.h
+
VFAT/FAT/MSDOS FILESYSTEM
M: OGAWA Hirofumi <[email protected]>
S: Maintained
On Tue, 08 Jun 2010 14:21:52 -0700 Tom Lyon wrote:
one missing piece (again):
Documentation/ioctl/ioctl-number.txt | 1
Documentation/vfio.txt | 177 +++++++
MAINTAINERS | 7
drivers/Kconfig | 2
drivers/Makefile | 1
drivers/vfio/Kconfig | 18
drivers/vfio/Makefile | 6
drivers/vfio/uiommu.c | 126 +++++
drivers/vfio/vfio_dma.c | 324 ++++++++++++
drivers/vfio/vfio_intrs.c | 191 +++++++
drivers/vfio/vfio_main.c | 624 +++++++++++++++++++++++++
drivers/vfio/vfio_pci_config.c | 554 ++++++++++++++++++++++
drivers/vfio/vfio_rdwr.c | 147 +++++
drivers/vfio/vfio_sysfs.c | 153 ++++++
include/linux/uiommu.h | 62 ++
include/linux/vfio.h | 200 ++++++++
16 files changed, 2593 insertions(+)
> diff -uprN linux-2.6.34/drivers/vfio/Kconfig vfio-linux-2.6.34/drivers/vfio/Kconfig
> --- linux-2.6.34/drivers/vfio/Kconfig 1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/drivers/vfio/Kconfig 2010-06-07 15:28:14.000000000 -0700
> @@ -0,0 +1,18 @@
> +menuconfig VFIO
> + tristate "Non-Priv User Space PCI drivers"
Non-privileged
(again)
> + depends on UIOMMU && PCI && IOMMU_API
> + help
> + Driver to allow advanced user space drivers for PCI, PCI-X,
> + and PCIe devices. Requires IOMMU to allow non-privilged
non-privileged
(again) :(
> + processes to directly control the PCI devices.
> +
> + If you don't know what to do here, say N.
> +
> +menuconfig UIOMMU
> + tristate "User level manipulation of IOMMU"
> + help
> + Device driver to allow user level programs to
> + manipulate IOMMU domains
domains.
> +
> + If you don't know what to do here, say N.
> +
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
On Tue, 08 Jun 2010 14:21:52 -0700 Tom Lyon wrote:
> diff -uprN linux-2.6.34/Documentation/vfio.txt vfio-linux-2.6.34/Documentation/vfio.txt
> --- linux-2.6.34/Documentation/vfio.txt 1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/Documentation/vfio.txt 2010-06-07 15:05:42.000000000 -0700
> @@ -0,0 +1,177 @@
...
> +Interrupts:
> +
> +Device interrupts are translated by the vfio driver into input events on event
> +notification file descriptors created by the eventfd system call. The user
> +program must create one or more event descriptors and pass them to the vfio
> +driver via ioctls to arrange for the interrupt mapping:
> +1.
> + efd = eventfd(0, 0);
> + ioctl(vfio_fd, VFIO_EVENTFD_IRQ, &efd);
> + This provides an eventfd for traditional IRQ interrupts.
> + IRQs will be disable after each interrupt until the driver
disabled
> + re-enables them via the PCI COMMAND register.
> +2.
> + efd = eventfd(0, 0);
> + ioctl(vfio_fd, VFIO_EVENTFD_MSI, &efd);
> + This connects MSI interrupts to an eventfd.
> +3.
> + int arg[N+1];
> + arg[0] = N;
> + arg[1..N] = eventfd(0, 0);
> + ioctl(vfio_fd, VFIO_EVENTFDS_MSIX, arg);
> + This connects N MSI-X interrupts with N eventfds.
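[Editor's sketch] The interrupt scheme documented above leans on eventfd's counter semantics: the kernel side signals each interrupt by adding to the eventfd counter, and the user program's read(2) drains and returns the accumulated count. That behavior can be demonstrated standalone, with no vfio device at all (the two write(2) calls stand in for two interrupts the driver would deliver):

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* Signal an eventfd twice (as the driver would per interrupt) and
 * return the count a single read drains. */
uint64_t eventfd_roundtrip(void)
{
	uint64_t one = 1, val = 0;
	int efd = eventfd(0, 0);

	if (efd < 0)
		return 0;
	write(efd, &one, sizeof(one));	/* "interrupt" #1 */
	write(efd, &one, sizeof(one));	/* "interrupt" #2 */
	read(efd, &val, sizeof(val));	/* drains counter: val == 2 */
	close(efd);
	return val;
}
```

Because reads return the summed count, a slow user program coalesces interrupts rather than losing them, which is what makes the eventfd approach "very simple" for this driver.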
---
~Randy
On Tue, Jun 08, 2010 at 04:54:43PM -0700, Tom Lyon wrote:
> On Tuesday 08 June 2010 03:38:44 pm Michael S. Tsirkin wrote:
> > On Tue, Jun 08, 2010 at 02:21:52PM -0700, Tom Lyon wrote:
> > > The VFIO "driver" is used to allow privileged AND non-privileged processes to
> > > implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> > > devices.
> > > Signed-off-by: Tom Lyon <[email protected]>
> >
> > Some general comments:
> > - Please pass this through ./scripts/checkpatch.pl to fix some formatting.
> I did. What did you find?
Heh, it passed, sorry.
> > - Lots of hard-coded constants. Please try using pci_regs.h much more,
> > where not possible please add named enums.
> This is mostly for lengths not specified in pci_regs, but given in standards docs.
Add them to pci_regs.h, or to named define in your code.
The largest problem is the tables though.
> > - There are places where you get parameters from userspace and pass them
> > on to kmalloc etc. Everything you get from userspace needs to be
> > validated.
> I thought I had. That's what more eyeballs are for.
That's what code comments are for :)
Go over malloc calls - each time you malloc in response
to ioctl, and keep memory around, ask yourself why isn't this a DOS
attack. If it's not, document why.
> > - You play non-standard tricks with minor numbers.
> > Won't it be easier to just make udev create a node
> > for the device in the way everyone does it? The name could be
> > descriptive including e.g. bus/dev/fn so userspace can find
> > your device.
> I just copied what uio was doing. What is "the way everyone does it?"
Everyone is an over-statement :), but look at virtio-serial for example.
Your open routine could be as simple as:
struct cdev *cdev = inode->i_cdev;
struct vfio_dev *dev = container_of(cdev, struct vfio_dev, vfio_cdev);
filp->private_data = dev;
> >
> > - I note that if we exclude the iommu mapping, the rest conceptually could belong
> > in pci_generic driver in uio. So if we move these ioctls to the iommu driver,
> > as Avi suggested, then vfio can be part of the uio framework.
> But the interrupt handling is different in uio;
eventfd? This can be supported IMO.
> uio doesn't support read or write calls
> to read and write registers or memory,
We don't use write in pci generic uio so this can be fixed:
all we need is pass offset and size to control call, right?
> and it doesn't support ioctls at all for other
> misc stuff.
> If we could blow off backwards compatibility with uio, then, sure we
> could have a nice unified solution.
What I said above, *if* you move the mapping ioctl to iommu
device - no important ioctls are left after this, really.
....
>
> > > + /* reset to known state if we can */
> > > + (void) pci_reset_function(vdev->pdev);
> >
> > We are opening the device - how can it not be in a known state?
> If an alternate driver left it in a weird state.
Don't we care if it fails then? I think we do ...
> >
> > > + }
> > > + vdev->listeners++;
> > > + mutex_unlock(&vdev->lgate);
> > > + return 0;
> > > +
> > > +err_alloc_listener:
> > > +out:
> > > + return ret;
> > > +}
> > > +
> > > +static int vfio_release(struct inode *inode, struct file *filep)
> > > +{
> > > + int ret = 0;
> > > + struct vfio_listener *listener = filep->private_data;
> > > + struct vfio_dev *vdev = listener->vdev;
> > > +
> > > + vfio_dma_unmapall(listener);
> > > + if (listener->mm) {
> > > +#ifdef CONFIG_MMU_NOTIFIER
> > > + mmu_notifier_unregister(&listener->mmu_notifier, listener->mm);
> > > +#endif
> > > + listener->mm = NULL;
> > > + }
> > > +
> > > + mutex_lock(&vdev->lgate);
> > > + if (--vdev->listeners <= 0) {
> > > + /* we don't need to hold igate here since there are
> > > + * no more listeners doing ioctls
> > > + */
> > > + if (vdev->ev_msix)
> > > + vfio_disable_msix(vdev);
> > > + if (vdev->ev_msi)
> > > + vfio_disable_msi(vdev);
> > > + if (vdev->ev_irq) {
> > > + eventfd_ctx_put(vdev->ev_irq);
> > > + vdev->ev_irq = NULL;
> > > + }
> > > + vfio_domain_unset(vdev);
> > > + /* reset to known state if we can */
> > > + (void) pci_reset_function(vdev->pdev);
> >
> > This is too late - device could be doing DMA here and we moved it from under the domain!
> OK - how about a pci_clear_master before the domain_unset?
Not all devices let you do it at any random time.
How about pci_reset_function before domain_unset?
> >
> > And we should make sure (at open time) we *can* reset on close, fail binding if we can't.
> How do you propose?
Fail open if reset fails on open?
> > > + if (bir == vfio_offset_to_pci_space(start) &&
> > > + overlap(lo, hi, startp, endp)) {
> > > + printk(KERN_WARNING "%s: cannot write msi-x vectors\n",
> > > + __func__);
> > > + return -EINVAL;
> > > + }
> >
> > Tricky, slow, and - is it really necessary?
> > And it won't work if PAGE_SIZE is > 4K, because MSIX page is only 4K in size.
> It can be sped up with some caching.
> BTW, MSI-X can be up to 2048 entries of 16 bytes..
Right. Point is, it can be < PAGE_SIZE, so page will have other stuff in it,
so prohibiting it will create problems.
> >
> >
> > > + return 0;
> > > +}
> > > +
> > > +static ssize_t vfio_write(struct file *filep, const char __user *buf,
> > > + size_t count, loff_t *ppos)
> > > +{
> > > + struct vfio_listener *listener = filep->private_data;
> > > + struct vfio_dev *vdev = listener->vdev;
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + int pci_space;
> > > + int ret;
> > > +
> > > + /* no reads or writes until IOMMU domain set */
> > > + if (vdev->udomain == NULL)
> > > + return -EINVAL;
> > > + pci_space = vfio_offset_to_pci_space(*ppos);
> > > + if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
> > > + return vfio_config_readwrite(1, vdev,
> > > + (char __user *)buf, count, ppos);
> > > + if (pci_space > PCI_ROM_RESOURCE)
> > > + return -EINVAL;
> > > + if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
> > > + return vfio_io_readwrite(1, vdev,
> > > + (char __user *)buf, count, ppos);
> > > + if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) {
> > > + /* don't allow writes to msi-x vectors */
> >
> > What happens if we don't do all these checks?
> These are just sorting out config, io, and memory read/writes.
Sorry, I mean the one below.
> >
> > > + ret = vfio_msix_check(vdev, *ppos, count);
> > > + if (ret)
> > > + return ret;
> > > + return vfio_mem_readwrite(1, vdev,
> > > + (char __user *)buf, count, ppos);
> > > + }
> > > + return -EINVAL;
> > > +}
> > > +
> > > +static int vfio_mmap(struct file *filep, struct vm_area_struct *vma)
> > > +{
> > > + struct vfio_listener *listener = filep->private_data;
> > > + struct vfio_dev *vdev = listener->vdev;
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + unsigned long requested, actual;
> > > + int pci_space;
> > > + u64 start;
> > > + u32 len;
> > > + unsigned long phys;
> > > + int ret;
> > > +
> > > + /* no reads or writes until IOMMU domain set */
> > > + if (vdev->udomain == NULL)
> > > + return -EINVAL;
> > > +
> > > + if (vma->vm_end < vma->vm_start)
> > > + return -EINVAL;
> > > + if ((vma->vm_flags & VM_SHARED) == 0)
> > > + return -EINVAL;
> > > +
> > > +
> > > + pci_space = vfio_offset_to_pci_space((u64)vma->vm_pgoff << PAGE_SHIFT);
> > > + if (pci_space > PCI_ROM_RESOURCE)
> > > + return -EINVAL;
> > > + switch (pci_space) {
> > > + case PCI_ROM_RESOURCE:
> > > + if (vma->vm_flags & VM_WRITE)
> > > + return -EINVAL;
> > > + if (pci_resource_flags(pdev, PCI_ROM_RESOURCE) == 0)
> > > + return -EINVAL;
> > > + actual = pci_resource_len(pdev, PCI_ROM_RESOURCE) >> PAGE_SHIFT;
> > > + break;
> > > + default:
> > > + if ((pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) == 0)
> > > + return -EINVAL;
> > > + actual = pci_resource_len(pdev, pci_space) >> PAGE_SHIFT;
> > > + break;
> > > + }
> > > +
> >
> > I don't think we really care about non-memory mmaps. They can all go
> > through read.
> Its conceivable that a virtual machine may want to jump to ROM code.
It can always shadow ROM if it wants to do that. And I didn't think
you can jump to code mmaped in this way - can you?
> >
> > > + requested = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> > > + if (requested > actual || actual == 0)
> > > + return -EINVAL;
> > > +
> > > + /*
> > > + * Can't allow non-priv users to mmap MSI-X vectors
> > > + * else they can write anywhere in phys memory
> > > + */
> >
> > not if there's an iommu.
> I'm not totally convinced that the IOMMU code, as implemented, forces the
> devices to use only their own vectors. But the iommu code is deep.
MSI is just a memory write. So if it didn't, then device could
trigger interrupt for another device just by doing a memory write.
IOW, if you let userspace control the device, it's useless from a security POV
to prevent control to the MSI/MSIX table.
> >
> > > + start = vma->vm_pgoff << PAGE_SHIFT;
> > > + len = vma->vm_end - vma->vm_start;
> > > + if (vma->vm_flags & VM_WRITE) {
> > > + ret = vfio_msix_check(vdev, start, len);
> > > + if (ret)
> > > + return ret;
> > > + }
> > > +
> > > + vma->vm_private_data = vdev;
> > > + vma->vm_flags |= VM_IO | VM_RESERVED;
> > > + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > > + phys = pci_resource_start(pdev, pci_space) >> PAGE_SHIFT;
> > > +
> > > + return remap_pfn_range(vma, vma->vm_start, phys,
> > > + vma->vm_end - vma->vm_start,
> > > + vma->vm_page_prot);
> >
> > I think there's a security problem here:
> > userspace can do mmap, then close the file and unbind
> > device from iommu, now that device is not bound
> > (or bound to anothe rprocess)
> > access device through mmap and crash the system.
> >
> > We must make sure device stays open until no one
> > maps the memory range.
> The memory system holds a file reference when pages are mapped;
> the driver release routine won't be called until the region is unmapped or
> killed.
Right. sorry about the noise.
> >
> >
> > > +}
> > > +
> > > +static long vfio_unl_ioctl(struct file *filep,
> > > + unsigned int cmd,
> > > + unsigned long arg)
> > > +{
> > > + struct vfio_listener *listener = filep->private_data;
> > > + struct vfio_dev *vdev = listener->vdev;
> > > + void __user *uarg = (void __user *)arg;
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + struct vfio_dma_map dm;
> > > + int ret = 0;
> > > + u64 mask;
> > > + int fd, nfd;
> > > + int bar;
> > > +
> > > + if (vdev == NULL)
> > > + return -EINVAL;
> > > +
> > > + switch (cmd) {
> > > +
> > > + case VFIO_DMA_MAP_IOVA:
> > > + if (copy_from_user(&dm, uarg, sizeof dm))
> > > + return -EFAULT;
> > > + ret = vfio_dma_map_common(listener, cmd, &dm);
> > > + if (!ret && copy_to_user(uarg, &dm, sizeof dm))
> > > + ret = -EFAULT;
> > > + break;
> > > +
> > > + case VFIO_DMA_UNMAP:
> > > + if (copy_from_user(&dm, uarg, sizeof dm))
> > > + return -EFAULT;
> > > + ret = vfio_dma_unmap_dm(listener, &dm);
> > > + break;
> > > +
> > > + case VFIO_DMA_MASK: /* set master mode and DMA mask */
> > > + if (copy_from_user(&mask, uarg, sizeof mask))
> > > + return -EFAULT;
> > > + pci_set_master(pdev);
> >
> > This better be done by userspace when it sees fit.
> > Otherwise device might corrupt userspace memory.
Hope the above is clear: userspace driver might need
to do some initializations before it lets
device write into its memory.
> > > + ret = pci_set_dma_mask(pdev, mask);
> > > + break;
> >
> > Is the above needed?
> >
....
> > So what happens if someone has a device file open and device
> > is being hot-unplugged? At a minimum, we want userspace to
> > have a way to get and handle these notifications.
> > But also remember we can not trust userspace to be well-behaved.
> For now,
And when it's fixed, how will the sysadmin know it's now safe?
It's the driver's responsibility to prevent crash and burn.
> hotplug and suspend/resume not supported -
> sys admin must not enable vfio for these devices.
For *which* devices? suspend is a system wide feature.
If device doesn't support it, make it depend on !SUSPEND && !HOTPLUG
or something?
> I think they are doable,
Issue is, if userspace interface is designed not to support these
events, applications will be written ignoring it, and we will
be stuck. Had this problem with infiniband userspace, no
hotplug/suspend support to this day.
> but lots of testing
> work - and not important for my use case.
That's what users are for :)
> >
> >
> > > + struct vfio_dev *vdev = pci_get_drvdata(pdev);
> > > +
> > > + vfio_free_minor(vdev);
> > > +
> > > + if (pdev->irq > 0)
> > > + free_irq(pdev->irq, vdev);
> > > +
> > > +#ifdef notdef
> > > + vfio_dev_del_attributes(vdev);
> > > +#endif
> > > +
> > > + pci_set_drvdata(pdev, NULL);
> > > + device_destroy(vfio_class->class, vdev->devnum);
> > > + kfree(vdev);
> > > + vfio_class_destroy();
> > > + pci_disable_device(pdev);
> > > +}
> > > +
> > > +static struct pci_driver driver = {
> > > + .name = "vfio",
> > > + .id_table = NULL, /* only dynamic id's */
> > > + .probe = vfio_probe,
> > > + .remove = vfio_remove,
> >
> > Also - I think we need to handle e.g. suspend in some way.
> > Again, this likely involves notifying userspace so it can
> > save state to memory.
> >
> > > +};
> > > +
> > > +static int __init init(void)
> > > +{
> > > + pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
> > > + return pci_register_driver(&driver);
> > > +}
> > > +
> > > +static void __exit cleanup(void)
> > > +{
> > > + if (vfio_major >= 0)
> > > + unregister_chrdev(vfio_major, "vfio");
> > > + pci_unregister_driver(&driver);
> > > +}
> > > +
> > > +module_init(init);
> > > +module_exit(cleanup);
> > > +
> > > +MODULE_VERSION(DRIVER_VERSION);
> > > +MODULE_LICENSE("GPL v2");
> > > +MODULE_AUTHOR(DRIVER_AUTHOR);
> > > +MODULE_DESCRIPTION(DRIVER_DESC);
> > > diff -uprN linux-2.6.34/drivers/vfio/vfio_pci_config.c vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c
> > > --- linux-2.6.34/drivers/vfio/vfio_pci_config.c 1969-12-31 16:00:00.000000000 -0800
> > > +++ vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c 2010-05-28 14:26:47.000000000 -0700
> > > @@ -0,0 +1,554 @@
> > > +/*
> > > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > > + * Author: Tom Lyon, [email protected]
> > > + *
> > > + * This program is free software; you may redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License as published by
> > > + * the Free Software Foundation; version 2 of the License.
> > > + *
> > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > > + * SOFTWARE.
> > > + *
> > > + * Portions derived from drivers/uio/uio.c:
> > > + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> > > + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> > > + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> > > + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> > > + *
> > > + * Portions derived from drivers/uio/uio_pci_generic.c:
> > > + * Copyright (C) 2009 Red Hat, Inc.
> > > + * Author: Michael S. Tsirkin <[email protected]>
> > > + */
> > > +
> > > +#include <linux/fs.h>
> > > +#include <linux/pci.h>
> > > +#include <linux/mmu_notifier.h>
> > > +#include <linux/uaccess.h>
> > > +#include <linux/vfio.h>
> > > +
> > > +#define PCI_CAP_ID_BASIC 0
> > > +#ifndef PCI_CAP_ID_MAX
> > > +#define PCI_CAP_ID_MAX PCI_CAP_ID_AF
> > > +#endif
> > > +
> > > +/*
> > > + * Lengths of PCI Config Capabilities
> > > + * 0 means unknown (but at least 4)
> > > + * FF means special/variable
> > > + */
> > > +static u8 pci_capability_length[] = {
> > > + [PCI_CAP_ID_BASIC] = 64, /* pci config header */
> > > + [PCI_CAP_ID_PM] = PCI_PM_SIZEOF,
> > > + [PCI_CAP_ID_AGP] = PCI_AGP_SIZEOF,
> > > + [PCI_CAP_ID_VPD] = 8,
> > > + [PCI_CAP_ID_SLOTID] = 4,
> > > + [PCI_CAP_ID_MSI] = 0xFF, /* 10, 14, or 24 */
> > > + [PCI_CAP_ID_CHSWP] = 4,
> > > + [PCI_CAP_ID_PCIX] = 0xFF, /* 8 or 24 */
> > > + [PCI_CAP_ID_HT] = 28,
> > > + [PCI_CAP_ID_VNDR] = 0xFF,
> > > + [PCI_CAP_ID_DBG] = 0,
> > > + [PCI_CAP_ID_CCRC] = 0,
> > > + [PCI_CAP_ID_SHPC] = 0,
> > > + [PCI_CAP_ID_SSVID] = 0, /* bridge only - not supp */
> > > + [PCI_CAP_ID_AGP3] = 0,
> > > + [PCI_CAP_ID_EXP] = 36,
> > > + [PCI_CAP_ID_MSIX] = 12,
> > > + [PCI_CAP_ID_AF] = 6,
> > > +};
> > > +
> > > +/*
> > > + * Read/Write Permission Bits - one bit for each bit in capability
> > > + * Any field can be read if it exists,
> > > + * but what is read depends on whether the field
> > > + * is 'virtualized',
Using quotes around 'virtualized' implies it is not really virtualized?
What exactly do you mean?
> or just pass thru to the hardware.
> > > + * Any virtualized field is also virtualized for writes.
Virtualized for writes? What does that mean?
> > > + * Writes are only permitted if they have a 1 bit here.
You mean writes are ignored if the register's mask is all 0, right?
> > > + */
> > > +struct perm_bits {
> > > + u32 rvirt; /* read bits which must be virtualized */
> > > + u32 write; /* writeable bits - virt if read virt */
> > > +};
By the time I got half a page down, I had forgotten the order.
So please use .rvirt/.write initializers.
> > > +
When this table changes - as it invariably will - how
will userspace know?
We could add an ioctl to query this table - but more
ioctls mean more bugs in kernel and userspace.
It would be easier if we had just one or two registers,
and just did protection, failing illegal reads/writes.
Then userspace would have a simple, standard way to
find out, right where it happens, and it could deal with the virtualization.
> > > +static struct perm_bits pci_cap_basic_perm[] = {
What endianness is all this in, btw?
> > > + { 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
You virtualize the vendor and device ID?
> > > + { 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
> > > + { 0, 0xFF00FFFF, }, /* 0x08 bist, htype, lat, cache */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x0c bar */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
Virtualizing BARs is not that simple. PCI is full of
this stuff, and qemu has to do a ton of tricks in userspace.
If you are really sure we want this in the kernel - and I still don't
see why - it's still better to pass through as much as possible.
> > > + { 0, 0, }, /* 0x24 cardbus - not yet */
> > > + { 0, 0, }, /* 0x28 subsys vendor & dev */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x2c rom bar */
> > > + { 0, 0, }, /* 0x30 capability ptr & resv */
> > > + { 0, 0, }, /* 0x34 resv */
> > > + { 0, 0, }, /* 0x38 resv */
> > > + { 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
Use [] initializers here as well.
More importantly, please document *why* you are virtualizing
or blocking access, not just what you do.
IRQ is safe to write, I think - it's for the OS; the device does not use it.
> > > +};
The above is broken for non-type-0 devices. Is there a check in the code
that this is what we get? There should be ...
> > > +
> > > +static struct perm_bits pci_cap_pm_perm[] = {
> > > + { 0, 0, }, /* 0x00 PM capabilities */
> > > + { 0, 0xFFFFFFFF, }, /* 0x04 PM control/status */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_vpd_perm[] = {
> > > + { 0, 0xFFFF0000, }, /* 0x00 address */
You don't let userspace write address?
> > > + { 0, 0xFFFFFFFF, }, /* 0x04 data */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_slotid_perm[] = {
> > > + { 0, 0, }, /* 0x00 all read only */
Why do we bother protecting read-only registers?
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_msi_perm[] = {
> > > + { 0, 0, }, /* 0x00 MSI message control */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message addr/data */
> > > + { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */
> > > + { 0, 0xFFFFFFFF, }, /* 0x10 MSI mask bits */
> > > + { 0, 0xFFFFFFFF, }, /* 0x14 MSI pending bits */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_pcix_perm[] = {
> > > + { 0, 0xFFFF0000, }, /* 0x00 PCI_X_CMD */
> > > + { 0, 0, }, /* 0x04 PCI_X_STATUS */
> > > + { 0, 0xFFFFFFFF, }, /* 0x08 ECC ctlr & status */
> > > + { 0, 0, }, /* 0x0c ECC first addr */
> > > + { 0, 0, }, /* 0x10 ECC second addr */
> > > + { 0, 0, }, /* 0x14 ECC attr */
> > > +};
> > > +
> > > +/* pci express capabilities */
> > > +static struct perm_bits pci_cap_exp_perm[] = {
> > > + { 0, 0, }, /* 0x00 PCIe capabilities */
> > > + { 0, 0, }, /* 0x04 PCIe device capabilities */
> > > + { 0, 0xFFFFFFFF, }, /* 0x08 PCIe device control & status */
> > > + { 0, 0, }, /* 0x0c PCIe link capabilities */
> > > + { 0, 0x000000FF, }, /* 0x10 PCIe link ctl/stat - SAFE? */
> > > + { 0, 0, }, /* 0x14 PCIe slot capabilities */
> > > + { 0, 0x00FFFFFF, }, /* 0x18 PCIe link ctl/stat - SAFE? */
> > > + { 0, 0, }, /* 0x1c PCIe root port stuff */
> > > + { 0, 0, }, /* 0x20 PCIe root port stuff */
Better check that you don't get a root port, then.
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_msix_perm[] = {
> > > + { 0, 0, }, /* 0x00 MSI-X Enable */
> > > + { 0, 0, }, /* 0x04 table offset & bir */
> > > + { 0, 0, }, /* 0x08 pba offset & bir */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_af_perm[] = {
> > > + { 0, 0, }, /* 0x00 af capability */
> > > + { 0, 0x0001, }, /* 0x04 af flr bit */
> > > +};
> > > +
> > > +static struct perm_bits *pci_cap_perms[] = {
> > > + [PCI_CAP_ID_BASIC] = pci_cap_basic_perm,
> > > + [PCI_CAP_ID_PM] = pci_cap_pm_perm,
> > > + [PCI_CAP_ID_VPD] = pci_cap_vpd_perm,
> > > + [PCI_CAP_ID_SLOTID] = pci_cap_slotid_perm,
> > > + [PCI_CAP_ID_MSI] = pci_cap_msi_perm,
> > > + [PCI_CAP_ID_PCIX] = pci_cap_pcix_perm,
> > > + [PCI_CAP_ID_EXP] = pci_cap_exp_perm,
> > > + [PCI_CAP_ID_MSIX] = pci_cap_msix_perm,
> > > + [PCI_CAP_ID_AF] = pci_cap_af_perm,
> > > +};
Disallowing access, or 'virtualizing' - by which I think
you mean replacing a write with read-modify-write? That should
really be a very special case.
Can't simple open-coded C work?
if (addr == PCI_COMMAND)
return -EPERM;
....
> > > +
> > > +/*
> > > + * We build a map of the config space that tells us where
> > > + * and what capabilities exist, so that we can map reads and
> > > + * writes back to capabilities, and thus figure out what to
> > > + * allow, deny, or virtualize
> > > + */
> >
> > So the above looks like it is very unlikely to be exhaustive and
> > correct. Maybe there aren't bugs in this table to be found just by
> > looking hard at the spec, but likely will surface when someone tries
> > to actually run driver with e.g. a working pm on top.
> > Let's ask another question:
> >
> > since we have the iommu protecting us, can't all or most of this be
> > done in userspace? What can userspace do that will harm the host?
> > I think each place where we block access to a register, there should
> > be a very specific documentation for why we do this.
> I think, in an ideal world, you would be correct. I don't trust either
> the hardware or the iommu software
Not sure adding emulation bugs on top will help, though - and
we will break userspace. Note how there's no way for userspace
to query the tables to figure out what works and what doesn't.
> to feel good about this though.
About documenting the tables? Me too, but for different reasons:
if there are bugs, we should be finding them and documenting
what they are, not plastering over them.
> >
> >
> > > +int vfio_build_config_map(struct vfio_dev *vdev)
> > > +{
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + u8 *map;
> > > + int i, len;
> > > + u8 pos, cap, tmp;
> > > + u16 flags;
> > > + int ret;
> > > + int loops = 100;
> >
> > 100?
> Why not?
The answer's 42. Document the constants please.
> >
> >
> > .....
> >
> > > +int vfio_dma_map_common(struct vfio_listener *listener,
> > > + unsigned int cmd, struct vfio_dma_map *dmp)
> > > +{
> > > + int locked, lock_limit;
> > > + struct page **pages;
> > > + int npage;
> > > + struct dma_map_page *mlp;
> > > + int rdwr = (dmp->flags & VFIO_FLAG_WRITE) ? 1 : 0;
> > > + int ret = 0;
> > > +
> > > + if (dmp->vaddr & (PAGE_SIZE-1))
> > > + return -EINVAL;
> > > + if (dmp->size & (PAGE_SIZE-1))
> > > + return -EINVAL;
> > > + if (dmp->size <= 0)
> > > + return -EINVAL;
> > > + npage = dmp->size >> PAGE_SHIFT;
> > > + if (npage <= 0)
> > > + return -EINVAL;
> > > +
> > > + mutex_lock(&listener->vdev->dgate);
> > > +
> > > + /* account for locked pages */
> > > + locked = npage + current->mm->locked_vm;
> > > + lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
> > > + >> PAGE_SHIFT;
> > > + if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
> > > + printk(KERN_WARNING "%s: RLIMIT_MEMLOCK exceeded\n",
> > > + __func__);
> > > + ret = -ENOMEM;
> > > + goto out_lock;
> > > + }
> > > + /* only 1 address space per fd */
> > > + if (current->mm != listener->mm) {
> >
> > Why is that?
> Its OK to have multiple fds per device, but multiple address spaces per fd
> means somebody forked - and I don't know how to keep track of mmu
> notifications once that happens.
Maybe we need a notifier for that :) I thought we lock the memory
though - using the mlock rlimit - so why are we using notifiers
at all?
Just noticed this btw:
+static void vfio_dma_handle_mmu_notify(struct mmu_notifier *mn,
+ unsigned long start, unsigned long end)
+{
+ struct vfio_listener *listener;
+ unsigned long myend;
+ struct list_head *pos, *pos2;
+ struct dma_map_page *mlp;
+
+ listener = container_of(mn, struct vfio_listener, mmu_notifier);
+ mutex_lock(&listener->vdev->dgate);
+ list_for_each_safe(pos, pos2, &listener->dm_list) {
+ mlp = list_entry(pos, struct dma_map_page, list);
+ if (mlp->vaddr >= end)
I think you can't take a mutex here: mmu notifiers are called
under an RCU critical section. Also, userspace can make the dm_list
very long, making the RCU critical section very long.
Set some limit on the number of entries, and replace the list
with an array? Or use a tree?
> >
> > > + if (listener->mm != NULL) {
> > > + ret = -EINVAL;
> > > + goto out_lock;
> > > + }
> > > + listener->mm = current->mm;
> > > +#ifdef CONFIG_MMU_NOTIFIER
> > > + listener->mmu_notifier.ops = &vfio_dma_mmu_notifier_ops;
> > > + ret = mmu_notifier_register(&listener->mmu_notifier,
> > > + listener->mm);
> > > + if (ret)
> > > + printk(KERN_ERR "%s: mmu_notifier_register failed %d\n",
> > > + __func__, ret);
> > > + ret = 0;
> > > +#endif
> > > + }
> > > +
> > > + pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
> >
> > If you map a 4G region, this will try to allocate 8Mbytes?
> Yes. C'est la vie.
First of all, this will fail - and the request is quite real with
decent-sized guests. Second, with appropriately sized allocs before failing,
it will stress the system pretty hard. Split it into chunks of 4K or something.
> >
> > > + if (pages == NULL) {
> > > + ret = ENOMEM;
> > > + goto out_lock;
> > > + }
> > > + ret = get_user_pages_fast(dmp->vaddr, npage, rdwr, pages);
> > > + if (ret != npage) {
> > > + printk(KERN_ERR "%s: get_user_pages_fast returns %d, not %d\n",
> > > + __func__, ret, npage);
> > > + kfree(pages);
> > > + ret = -EFAULT;
> > > + goto out_lock;
> > > + }
> > > + ret = 0;
> > > +
> > > + mlp = vfio_dma_map_iova(listener, dmp->dmaaddr,
> > > + pages, npage, rdwr);
> > > + if (IS_ERR(mlp)) {
> > > + ret = PTR_ERR(mlp);
> > > + kfree(pages);
> > > + goto out_lock;
> > > + }
> > > + mlp->vaddr = dmp->vaddr;
> > > + mlp->rdwr = rdwr;
> > > + dmp->dmaaddr = mlp->daddr;
> > > + list_add(&mlp->list, &listener->dm_list);
> > > +
> > > + current->mm->locked_vm += npage;
> > > + listener->vdev->locked_pages += npage;
> > > +out_lock:
> > > + mutex_unlock(&listener->vdev->dgate);
> > > + return ret;
> > > +}
> > > +
> >
>
On Tuesday 08 June 2010 03:38:44 pm Michael S. Tsirkin wrote:
> On Tue, Jun 08, 2010 at 02:21:52PM -0700, Tom Lyon wrote:
> > The VFIO "driver" is used to allow privileged AND non-privileged processes to
> > implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> > devices.
> > Signed-off-by: Tom Lyon <[email protected]>
>
> Some general comments:
> - Please pass this through ./scripts/checkpatch.pl to fix some formatting.
I did. What did you find?
> - Lots of hard-coded constants. Please try using pci_regs.h much more,
> where not possible please add named enums.
This is mostly for lengths not specified in pci_regs, but given in standards docs.
> - There are places where you get parameters from userspace and pass them
> on to kmalloc etc. Everything you get from userspace needs to be
> validated.
I thought I had. That's what more eyeballs are for.
> - You play non-standard tricks with minor numbers.
> Won't it be easier to just make udev create a node
> for the device in the way everyone does it? The name could be
> descriptive including e.g. bus/dev/fn so userspace can find
> your device.
I just copied what uio was doing. What is "the way everyone does it"?
>
> - I note that if we exclude the iommu mapping, the rest conceptually could belong
> in pci_generic driver in uio. So if we move these ioctls to the iommu driver,
> as Avi suggested, then vfio can be part of the uio framework.
But the interrupt handling is different in uio; uio doesn't support read or write calls
to read and write registers or memory, and it doesn't support ioctls at all for other
misc stuff. If we could blow off backwards compatibility with uio, then sure, we
could have a nice unified solution.
> > ---
> > This version now requires an IOMMU domain to be set before any access to
> > device registers is granted (except that config space may be read). In
> > addition, the VFIO_DMA_MAP_ANYWHERE is dropped - it used the dma_map_sg API
> > which does not have sufficient controls around IOMMU usage. The IOMMU domain
> > is obtained from the 'uiommu' driver which is included in this patch.
> >
> > Various locking, security, and documentation issues have also been fixed.
> >
> > Please commit - it or me!
> > But seriously, who gets to commit this? Avi for KVM? or GregKH for drivers?
> >
> > Blurb from previous patch version:
> >
> > This patch is the evolution of code which was first proposed as a patch to
> > uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
> > out of the uio framework, and things seem much cleaner. Of course, there is
> > a lot of functional overlap with uio, but the previous version just seemed
> > like a giant mode switch in the uio code that did not lead to clarity for
> > either the new or old code.
> >
> > [a pony for avi...]
> > The major new functionality in this version is the ability to deal with
> > PCI config space accesses (through read & write calls) - but includes table
> > driven code to determine whats safe to write and what is not. Also, some
> > virtualization of the config space to allow drivers to think they're writing
> > some registers when they're not. Also, IO space accesses are also allowed.
> > Drivers for devices which use MSI-X are now prevented from directly writing
> > the MSI-X vector area.
>
> This table adds a lot of complexity to the code,
> and I don't really understand why we need this code in
> kernel: isn't the point of iommu that it protects us
> from buggy devices? If yes, we should be able to
> just ask userspace to be careful and avoid doing silly things
> like overwriting MSI-X vectors, and if it's not careful,
> no one gets hurt :)
>
> If some registers absolutely must be protected,
> I think we should document this carefully in code.
>
> > +/*
> > + * MSI and MSI-X Interrupt handler.
> > + * Just signal an event
> > + */
> > +static irqreturn_t msihandler(int irq, void *arg)
> > +{
> > + struct eventfd_ctx *ctx = arg;
> > +
> > + eventfd_signal(ctx, 1);
> > + return IRQ_HANDLED;
> > +}
> > +
> > +void vfio_disable_msi(struct vfio_dev *vdev)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > +
> > + if (vdev->ev_msi) {
> > + eventfd_ctx_put(vdev->ev_msi);
> > + free_irq(pdev->irq, vdev->ev_msi);
> > + vdev->ev_msi = NULL;
> > + }
> > + pci_disable_msi(pdev);
> > +}
> > +
> > +int vfio_enable_msi(struct vfio_dev *vdev, int fd)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + struct eventfd_ctx *ctx;
> > + int ret;
> > +
> > + ctx = eventfd_ctx_fdget(fd);
> > + if (IS_ERR(ctx))
> > + return PTR_ERR(ctx);
> > + vdev->ev_msi = ctx;
> > + pci_enable_msi(pdev);
> > + ret = request_irq(pdev->irq, msihandler, 0,
> > + vdev->name, ctx);
> > + if (ret) {
> > + eventfd_ctx_put(ctx);
> > + pci_disable_msi(pdev);
> > + vdev->ev_msi = NULL;
> > + }
> > + return ret;
> > +}
> > +
> > +void vfio_disable_msix(struct vfio_dev *vdev)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + int i;
> > +
> > + if (vdev->ev_msix && vdev->msix) {
> > + for (i = 0; i < vdev->nvec; i++) {
> > + free_irq(vdev->msix[i].vector, vdev->ev_msix[i]);
> > + if (vdev->ev_msix[i])
> > + eventfd_ctx_put(vdev->ev_msix[i]);
> > + }
> > + }
> > + kfree(vdev->ev_msix);
> > + vdev->ev_msix = NULL;
> > + kfree(vdev->msix);
> > + vdev->msix = NULL;
> > + vdev->nvec = 0;
> > + pci_disable_msix(pdev);
> > +}
> > +
> > +int vfio_enable_msix(struct vfio_dev *vdev, int nvec, void __user *uarg)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + struct eventfd_ctx *ctx;
> > + int ret = 0;
> > + int i;
> > + int fd;
> > +
> > + vdev->msix = kzalloc(nvec * sizeof(struct msix_entry),
> > + GFP_KERNEL);
>
> kzalloc with size coming from userspace? And it's signed. Ugh.
> I think you should just enable all vectors and map them,
> at startup.
Thanks for the catch.
>
> > + if (vdev->msix == NULL)
> > + return -ENOMEM;
> > + vdev->ev_msix = kzalloc(nvec * sizeof(struct eventfd_ctx *),
> > + GFP_KERNEL);
> > + if (vdev->ev_msix == NULL) {
> > + kfree(vdev->msix);
> > + return -ENOMEM;
> > + }
> > + for (i = 0; i < nvec; i++) {
> > + if (copy_from_user(&fd, uarg, sizeof fd)) {
> > + ret = -EFAULT;
> > + break;
> > + }
> > + uarg += sizeof fd;
> > + ctx = eventfd_ctx_fdget(fd);
> > + if (IS_ERR(ctx)) {
> > + ret = PTR_ERR(ctx);
> > + break;
> > + }
> > + vdev->msix[i].entry = i;
> > + vdev->ev_msix[i] = ctx;
> > + }
> > + if (!ret)
> > + ret = pci_enable_msix(pdev, vdev->msix, nvec);
> > + vdev->nvec = 0;
> > + for (i = 0; i < nvec && !ret; i++) {
> > + ret = request_irq(vdev->msix[i].vector, msihandler, 0,
> > + vdev->name, vdev->ev_msix[i]);
> > + if (ret)
> > + break;
> > + vdev->nvec = i+1;
> > + }
> > + if (ret)
> > + vfio_disable_msix(vdev);
> > + return ret;
> > +}
> > diff -uprN linux-2.6.34/drivers/vfio/vfio_main.c vfio-linux-2.6.34/drivers/vfio/vfio_main.c
> > --- linux-2.6.34/drivers/vfio/vfio_main.c 1969-12-31 16:00:00.000000000 -0800
> > +++ vfio-linux-2.6.34/drivers/vfio/vfio_main.c 2010-06-07 12:39:17.000000000 -0700
> > @@ -0,0 +1,624 @@
> > +/*
> > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > + * Author: Tom Lyon, [email protected]
> > + *
> > + * This program is free software; you may redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; version 2 of the License.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > + * SOFTWARE.
> > + *
> > + * Portions derived from drivers/uio/uio.c:
> > + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> > + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> > + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> > + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> > + *
> > + * Portions derived from drivers/uio/uio_pci_generic.c:
> > + * Copyright (C) 2009 Red Hat, Inc.
> > + * Author: Michael S. Tsirkin <[email protected]>
> > + */
> > +
> > +#include <linux/module.h>
> > +#include <linux/device.h>
> > +#include <linux/mm.h>
> > +#include <linux/idr.h>
> > +#include <linux/string.h>
> > +#include <linux/interrupt.h>
> > +#include <linux/fs.h>
> > +#include <linux/eventfd.h>
> > +#include <linux/pci.h>
> > +#include <linux/iommu.h>
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/uaccess.h>
> > +
> > +#include <linux/vfio.h>
> > +
> > +
> > +#define DRIVER_VERSION "0.1"
> > +#define DRIVER_AUTHOR "Tom Lyon <[email protected]>"
> > +#define DRIVER_DESC "VFIO - User Level PCI meta-driver"
> > +
> > +static int vfio_major = -1;
> > +DEFINE_IDR(vfio_idr);
> > +/* Protect idr accesses */
> > +DEFINE_MUTEX(vfio_minor_lock);
> > +
> > +/*
> > + * Does [a1,b1) overlap [a2,b2) ?
> > + */
> > +static inline int overlap(int a1, int b1, int a2, int b2)
> > +{
> > + /*
> > + * Ranges overlap if they're not disjoint; and they're
> > + * disjoint if the end of one is before the start of
> > + * the other one.
> > + */
> > + return !(b2 <= a1 || b1 <= a2);
> > +}
> > +
> > +static int vfio_open(struct inode *inode, struct file *filep)
> > +{
> > + struct vfio_dev *vdev;
> > + struct vfio_listener *listener;
> > + int ret = 0;
> > +
> > + mutex_lock(&vfio_minor_lock);
> > + vdev = idr_find(&vfio_idr, iminor(inode));
> > + mutex_unlock(&vfio_minor_lock);
>
> Instead of all this complexity, can't we just stick a pointer to your device
> in 'struct cdev *i_cdev' on the inode?
Again, just copied this from uio.
>
> > + if (!vdev) {
> > + ret = -ENODEV;
> > + goto out;
> > + }
> > +
> > + listener = kzalloc(sizeof(*listener), GFP_KERNEL);
> > + if (!listener) {
> > + ret = -ENOMEM;
> > + goto err_alloc_listener;
> > + }
> > +
> > + listener->vdev = vdev;
> > + INIT_LIST_HEAD(&listener->dm_list);
> > + filep->private_data = listener;
> > +
> > + mutex_lock(&vdev->lgate);
> > + if (vdev->listeners == 0) { /* first open */
> > + /* reset to known state if we can */
> > + (void) pci_reset_function(vdev->pdev);
>
> We are opening the device - how can it not be in a known state?
If an alternate driver left it in a weird state.
>
> > + }
> > + vdev->listeners++;
> > + mutex_unlock(&vdev->lgate);
> > + return 0;
> > +
> > +err_alloc_listener:
> > +out:
> > + return ret;
> > +}
> > +
> > +static int vfio_release(struct inode *inode, struct file *filep)
> > +{
> > + int ret = 0;
> > + struct vfio_listener *listener = filep->private_data;
> > + struct vfio_dev *vdev = listener->vdev;
> > +
> > + vfio_dma_unmapall(listener);
> > + if (listener->mm) {
> > +#ifdef CONFIG_MMU_NOTIFIER
> > + mmu_notifier_unregister(&listener->mmu_notifier, listener->mm);
> > +#endif
> > + listener->mm = NULL;
> > + }
> > +
> > + mutex_lock(&vdev->lgate);
> > + if (--vdev->listeners <= 0) {
> > + /* we don't need to hold igate here since there are
> > + * no more listeners doing ioctls
> > + */
> > + if (vdev->ev_msix)
> > + vfio_disable_msix(vdev);
> > + if (vdev->ev_msi)
> > + vfio_disable_msi(vdev);
> > + if (vdev->ev_irq) {
> > + eventfd_ctx_put(vdev->ev_msi);
> > + vdev->ev_irq = NULL;
> > + }
> > + vfio_domain_unset(vdev);
> > + /* reset to known state if we can */
> > + (void) pci_reset_function(vdev->pdev);
>
> This is too late - device could be doing DMA here and we moved it from under the domain!
OK - how about a pci_clear_master before the domain_unset?
>
> And we should make sure (at open time) we *can* reset on close, fail binding if we can't.
How do you propose?
>
> > + }
> > + mutex_unlock(&vdev->lgate);
> > +
> > + kfree(listener);
> > + return ret;
> > +}
> > +
> > +static ssize_t vfio_read(struct file *filep, char __user *buf,
> > + size_t count, loff_t *ppos)
> > +{
> > + struct vfio_listener *listener = filep->private_data;
> > + struct vfio_dev *vdev = listener->vdev;
> > + struct pci_dev *pdev = vdev->pdev;
> > + int pci_space;
> > +
> > + /* no reads or writes until IOMMU domain set */
> > + if (vdev->udomain == NULL)
> > + return -EINVAL;
> > + pci_space = vfio_offset_to_pci_space(*ppos);
> > + if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
> > + return vfio_config_readwrite(0, vdev, buf, count, ppos);
> > + if (pci_space > PCI_ROM_RESOURCE)
> > + return -EINVAL;
> > + if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
> > + return vfio_io_readwrite(0, vdev, buf, count, ppos);
> > + if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM)
> > + return vfio_mem_readwrite(0, vdev, buf, count, ppos);
> > + if (pci_space == PCI_ROM_RESOURCE)
> > + return vfio_mem_readwrite(0, vdev, buf, count, ppos);
> > + return -EINVAL;
> > +}
> > +
> > +static int vfio_msix_check(struct vfio_dev *vdev, u64 start, u32 len)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + u16 pos;
> > + u32 table_offset;
> > + u16 table_size;
> > + u8 bir;
> > + u32 lo, hi, startp, endp;
> > +
> > + pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
> > + if (!pos)
> > + return 0;
> > +
> > + pci_read_config_word(pdev, pos + PCI_MSIX_FLAGS, &table_size);
> > + table_size = (table_size & PCI_MSIX_FLAGS_QSIZE) + 1;
> > + pci_read_config_dword(pdev, pos + 4, &table_offset);
> > + bir = table_offset & PCI_MSIX_FLAGS_BIRMASK;
> > + lo = table_offset >> PAGE_SHIFT;
> > + hi = (table_offset + PCI_MSIX_ENTRY_SIZE * table_size + PAGE_SIZE - 1)
> > + >> PAGE_SHIFT;
> > + startp = start >> PAGE_SHIFT;
> > + endp = (start + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
>
> PAGE_ALIGN here and elsewhere?
>
> > + if (bir == vfio_offset_to_pci_space(start) &&
> > + overlap(lo, hi, startp, endp)) {
> > + printk(KERN_WARNING "%s: cannot write msi-x vectors\n",
> > + __func__);
> > + return -EINVAL;
> > + }
>
> Tricky, slow, and - is it really necessary?
> And it won't work if PAGE_SIZE is > 4K, because MSIX page is only 4K in size.
It can be sped up with some caching.
BTW, MSI-X can be up to 2048 entries of 16 bytes each.
>
>
> > + return 0;
> > +}
> > +
> > +static ssize_t vfio_write(struct file *filep, const char __user *buf,
> > + size_t count, loff_t *ppos)
> > +{
> > + struct vfio_listener *listener = filep->private_data;
> > + struct vfio_dev *vdev = listener->vdev;
> > + struct pci_dev *pdev = vdev->pdev;
> > + int pci_space;
> > + int ret;
> > +
> > + /* no reads or writes until IOMMU domain set */
> > + if (vdev->udomain == NULL)
> > + return -EINVAL;
> > + pci_space = vfio_offset_to_pci_space(*ppos);
> > + if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
> > + return vfio_config_readwrite(1, vdev,
> > + (char __user *)buf, count, ppos);
> > + if (pci_space > PCI_ROM_RESOURCE)
> > + return -EINVAL;
> > + if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
> > + return vfio_io_readwrite(1, vdev,
> > + (char __user *)buf, count, ppos);
> > + if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) {
> > + /* don't allow writes to msi-x vectors */
>
> What happens if we don't do all these checks?
These are just sorting out config, I/O, and memory read/writes.
>
> > + ret = vfio_msix_check(vdev, *ppos, count);
> > + if (ret)
> > + return ret;
> > + return vfio_mem_readwrite(1, vdev,
> > + (char __user *)buf, count, ppos);
> > + }
> > + return -EINVAL;
> > +}
> > +
> > +static int vfio_mmap(struct file *filep, struct vm_area_struct *vma)
> > +{
> > + struct vfio_listener *listener = filep->private_data;
> > + struct vfio_dev *vdev = listener->vdev;
> > + struct pci_dev *pdev = vdev->pdev;
> > + unsigned long requested, actual;
> > + int pci_space;
> > + u64 start;
> > + u32 len;
> > + unsigned long phys;
> > + int ret;
> > +
> > + /* no reads or writes until IOMMU domain set */
> > + if (vdev->udomain == NULL)
> > + return -EINVAL;
> > +
> > + if (vma->vm_end < vma->vm_start)
> > + return -EINVAL;
> > + if ((vma->vm_flags & VM_SHARED) == 0)
> > + return -EINVAL;
> > +
> > +
> > + pci_space = vfio_offset_to_pci_space((u64)vma->vm_pgoff << PAGE_SHIFT);
> > + if (pci_space > PCI_ROM_RESOURCE)
> > + return -EINVAL;
> > + switch (pci_space) {
> > + case PCI_ROM_RESOURCE:
> > + if (vma->vm_flags & VM_WRITE)
> > + return -EINVAL;
> > + if (pci_resource_flags(pdev, PCI_ROM_RESOURCE) == 0)
> > + return -EINVAL;
> > + actual = pci_resource_len(pdev, PCI_ROM_RESOURCE) >> PAGE_SHIFT;
> > + break;
> > + default:
> > + if ((pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) == 0)
> > + return -EINVAL;
> > + actual = pci_resource_len(pdev, pci_space) >> PAGE_SHIFT;
> > + break;
> > + }
> > +
>
> I don't think we really care about non-memory mmaps. They can all go
> through read.
It's conceivable that a virtual machine may want to jump to ROM code.
>
> > + requested = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> > + if (requested > actual || actual == 0)
> > + return -EINVAL;
> > +
> > + /*
> > + * Can't allow non-priv users to mmap MSI-X vectors
> > + * else they can write anywhere in phys memory
> > + */
>
> not if there's an iommu.
I'm not totally convinced that the IOMMU code, as implemented, forces devices
to use only their own vectors. But the IOMMU code is deep.
>
> > + start = vma->vm_pgoff << PAGE_SHIFT;
> > + len = vma->vm_end - vma->vm_start;
> > + if (vma->vm_flags & VM_WRITE) {
> > + ret = vfio_msix_check(vdev, start, len);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > + vma->vm_private_data = vdev;
> > + vma->vm_flags |= VM_IO | VM_RESERVED;
> > + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > + phys = pci_resource_start(pdev, pci_space) >> PAGE_SHIFT;
> > +
> > + return remap_pfn_range(vma, vma->vm_start, phys,
> > + vma->vm_end - vma->vm_start,
> > + vma->vm_page_prot);
>
> I think there's a security problem here:
> userspace can do mmap, then close the file and unbind
> device from iommu, now that device is not bound
> (or bound to another process)
> access device through mmap and crash the system.
>
> We must make sure device stays open until no one
> maps the memory range.
The memory system holds a file reference while pages are mapped;
the driver's release routine won't be called until the region is unmapped
or the process is killed.
>
>
> > +}
> > +
> > +static long vfio_unl_ioctl(struct file *filep,
> > + unsigned int cmd,
> > + unsigned long arg)
> > +{
> > + struct vfio_listener *listener = filep->private_data;
> > + struct vfio_dev *vdev = listener->vdev;
> > + void __user *uarg = (void __user *)arg;
> > + struct pci_dev *pdev = vdev->pdev;
> > + struct vfio_dma_map dm;
> > + int ret = 0;
> > + u64 mask;
> > + int fd, nfd;
> > + int bar;
> > +
> > + if (vdev == NULL)
> > + return -EINVAL;
> > +
> > + switch (cmd) {
> > +
> > + case VFIO_DMA_MAP_IOVA:
> > + if (copy_from_user(&dm, uarg, sizeof dm))
> > + return -EFAULT;
> > + ret = vfio_dma_map_common(listener, cmd, &dm);
> > + if (!ret && copy_to_user(uarg, &dm, sizeof dm))
> > + ret = -EFAULT;
> > + break;
> > +
> > + case VFIO_DMA_UNMAP:
> > + if (copy_from_user(&dm, uarg, sizeof dm))
> > + return -EFAULT;
> > + ret = vfio_dma_unmap_dm(listener, &dm);
> > + break;
> > +
> > + case VFIO_DMA_MASK: /* set master mode and DMA mask */
> > + if (copy_from_user(&mask, uarg, sizeof mask))
> > + return -EFAULT;
> > + pci_set_master(pdev);
>
> This better be done by userspace when it sees fit.
> Otherwise device might corrupt userspace memory.
>
> > + ret = pci_set_dma_mask(pdev, mask);
> > + break;
>
> Is the above needed?
>
> > +
> > + case VFIO_EVENTFD_IRQ:
> > + if (copy_from_user(&fd, uarg, sizeof fd))
> > + return -EFAULT;
> > + mutex_lock(&vdev->igate);
> > + if (vdev->ev_irq)
> > + eventfd_ctx_put(vdev->ev_irq);
> > + if (fd >= 0) {
> > + vdev->ev_irq = eventfd_ctx_fdget(fd);
> > + if (vdev->ev_irq == NULL)
> > + ret = -EINVAL;
> > + }
> > + mutex_unlock(&vdev->igate);
> > + break;
> > +
> > + case VFIO_EVENTFD_MSI:
> > + if (copy_from_user(&fd, uarg, sizeof fd))
> > + return -EFAULT;
> > + mutex_lock(&vdev->igate);
> > + if (fd >= 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
> > + ret = vfio_enable_msi(vdev, fd);
> > + else if (fd < 0 && vdev->ev_msi)
> > + vfio_disable_msi(vdev);
> > + else
> > + ret = -EINVAL;
> > + mutex_unlock(&vdev->igate);
> > + break;
> > +
> > + case VFIO_EVENTFDS_MSIX:
> > + if (copy_from_user(&nfd, uarg, sizeof nfd))
> > + return -EFAULT;
> > + uarg += sizeof nfd;
> > + mutex_lock(&vdev->igate);
> > + if (nfd > 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
> > + ret = vfio_enable_msix(vdev, nfd, uarg);
> > + else if (nfd == 0 && vdev->ev_msix)
> > + vfio_disable_msix(vdev);
> > + else
> > + ret = -EINVAL;
> > + mutex_unlock(&vdev->igate);
> > + break;
> > +
> > + case VFIO_BAR_LEN:
> > + if (copy_from_user(&bar, uarg, sizeof bar))
> > + return -EFAULT;
> > + if (bar < 0 || bar > PCI_ROM_RESOURCE)
> > + return -EINVAL;
> > + bar = pci_resource_len(pdev, bar);
> > + if (copy_to_user(uarg, &bar, sizeof bar))
> > + return -EFAULT;
> > + break;
> > +
> > + case VFIO_DOMAIN_SET:
> > + if (copy_from_user(&fd, uarg, sizeof fd))
> > + return -EFAULT;
> > + ret = vfio_domain_set(vdev, fd);
> > + break;
> > +
> > + case VFIO_DOMAIN_UNSET:
> > + vfio_domain_unset(vdev);
> > + ret = 0;
> > + break;
> > +
> > + default:
> > + return -EINVAL;
> > + }
> > + return ret;
> > +}
> > +
> > +static const struct file_operations vfio_fops = {
> > + .owner = THIS_MODULE,
> > + .open = vfio_open,
> > + .release = vfio_release,
> > + .read = vfio_read,
> > + .write = vfio_write,
> > + .unlocked_ioctl = vfio_unl_ioctl,
> > + .mmap = vfio_mmap,
> > +};
> > +
> > +static int vfio_get_devnum(struct vfio_dev *vdev)
> > +{
> > + int retval = -ENOMEM;
> > + int id;
> > +
> > + mutex_lock(&vfio_minor_lock);
> > + if (idr_pre_get(&vfio_idr, GFP_KERNEL) == 0)
> > + goto exit;
> > +
> > + retval = idr_get_new(&vfio_idr, vdev, &id);
> > + if (retval < 0) {
> > + if (retval == -EAGAIN)
> > + retval = -ENOMEM;
> > + goto exit;
> > + }
> > + if (id > MINORMASK) {
> > + idr_remove(&vfio_idr, id);
> > + retval = -ENOMEM;
> > + }
> > + if (vfio_major < 0) {
> > + retval = register_chrdev(0, "vfio", &vfio_fops);
> > + if (retval < 0)
> > + goto exit;
> > + vfio_major = retval;
> > + }
> > +
> > + retval = MKDEV(vfio_major, id);
> > +exit:
> > + mutex_unlock(&vfio_minor_lock);
> > + return retval;
> > +}
> > +
> > +static void vfio_free_minor(struct vfio_dev *vdev)
> > +{
> > + mutex_lock(&vfio_minor_lock);
> > + idr_remove(&vfio_idr, MINOR(vdev->devnum));
> > + mutex_unlock(&vfio_minor_lock);
> > +}
> > +
> > +/*
> > + * Verify that the device supports Interrupt Disable bit in command register,
> > + * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
> > + * in PCI 2.2. (from uio_pci_generic)
> > + */
> > +static int verify_pci_2_3(struct pci_dev *pdev)
> > +{
> > + u16 orig, new;
> > + int err = 0;
> > + u8 pin;
> > +
> > + pci_block_user_cfg_access(pdev);
> > +
> > + pci_read_config_byte(pdev, PCI_INTERRUPT_PIN, &pin);
> > + if (pin == 0) /* irqs not needed */
> > + goto out;
> > +
> > + pci_read_config_word(pdev, PCI_COMMAND, &orig);
> > + pci_write_config_word(pdev, PCI_COMMAND,
> > + orig ^ PCI_COMMAND_INTX_DISABLE);
> > + pci_read_config_word(pdev, PCI_COMMAND, &new);
> > + /* There's no way to protect against
> > + * hardware bugs or detect them reliably, but as long as we know
> > + * what the value should be, let's go ahead and check it. */
> > + if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
> > + err = -EBUSY;
> > + dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
> > + "driver or HW bug?\n", orig, new);
> > + goto out;
> > + }
> > + if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
> > + dev_warn(&pdev->dev, "Device does not support "
> > + "disabling interrupts: unable to bind.\n");
> > + err = -ENODEV;
> > + goto out;
> > + }
> > + /* Now restore the original value. */
> > + pci_write_config_word(pdev, PCI_COMMAND, orig);
> > +out:
> > + pci_unblock_user_cfg_access(pdev);
> > + return err;
> > +}
> > +
> > +static int vfio_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > +{
> > + struct vfio_dev *vdev;
> > + int err;
> > +
> > + if (!iommu_found())
> > + return -EINVAL;
> > +
> > + err = pci_enable_device(pdev);
> > + if (err) {
> > + dev_err(&pdev->dev, "%s: pci_enable_device failed: %d\n",
> > + __func__, err);
> > + return err;
> > + }
> > +
> > + err = verify_pci_2_3(pdev);
> > + if (err)
> > + goto err_verify;
> > +
> > + vdev = kzalloc(sizeof(struct vfio_dev), GFP_KERNEL);
> > + if (!vdev) {
> > + err = -ENOMEM;
> > + goto err_alloc;
> > + }
> > + vdev->pdev = pdev;
> > +
> > + err = vfio_class_init();
> > + if (err)
> > + goto err_class;
> > +
> > + mutex_init(&vdev->lgate);
> > + mutex_init(&vdev->dgate);
> > + mutex_init(&vdev->igate);
> > +
> > + err = vfio_get_devnum(vdev);
> > + if (err < 0)
> > + goto err_get_devnum;
> > + vdev->devnum = err;
> > + err = 0;
> > +
> > + sprintf(vdev->name, "vfio%d", MINOR(vdev->devnum));
> > + pci_set_drvdata(pdev, vdev);
> > + vdev->dev = device_create(vfio_class->class, &pdev->dev,
> > + vdev->devnum, vdev, vdev->name);
> > + if (IS_ERR(vdev->dev)) {
> > + printk(KERN_ERR "VFIO: device register failed\n");
> > + err = PTR_ERR(vdev->dev);
> > + goto err_device_create;
> > + }
> > +
> > + err = vfio_dev_add_attributes(vdev);
> > + if (err)
> > + goto err_vfio_dev_add_attributes;
> > +
> > +
> > + if (pdev->irq > 0) {
> > + err = request_irq(pdev->irq, vfio_interrupt,
> > + IRQF_SHARED, "vfio", vdev);
> > + if (err)
> > + goto err_request_irq;
> > + }
> > + vdev->vinfo.bardirty = 1;
> > +
> > + return 0;
> > +
> > +err_request_irq:
> > +#ifdef notdef
> > + vfio_dev_del_attributes(vdev);
> > +#endif
> > +err_vfio_dev_add_attributes:
> > + device_destroy(vfio_class->class, vdev->devnum);
> > +err_device_create:
> > + vfio_free_minor(vdev);
> > +err_get_devnum:
> > +err_class:
> > + kfree(vdev);
> > +err_alloc:
> > +err_verify:
> > + pci_disable_device(pdev);
> > + return err;
> > +}
> > +
> > +static void vfio_remove(struct pci_dev *pdev)
> > +{
>
> So what happens if someone has a device file open and device
> is being hot-unplugged? At a minimum, we want userspace to
> have a way to get and handle these notifications.
> But also remember we can not trust userspace to be well-behaved.
For now, hotplug and suspend/resume are not supported - the sysadmin must not
enable vfio for such devices. I think they are doable, but would take a lot of
testing work - and they're not important for my use case.
>
>
> > + struct vfio_dev *vdev = pci_get_drvdata(pdev);
> > +
> > + vfio_free_minor(vdev);
> > +
> > + if (pdev->irq > 0)
> > + free_irq(pdev->irq, vdev);
> > +
> > +#ifdef notdef
> > + vfio_dev_del_attributes(vdev);
> > +#endif
> > +
> > + pci_set_drvdata(pdev, NULL);
> > + device_destroy(vfio_class->class, vdev->devnum);
> > + kfree(vdev);
> > + vfio_class_destroy();
> > + pci_disable_device(pdev);
> > +}
> > +
> > +static struct pci_driver driver = {
> > + .name = "vfio",
> > + .id_table = NULL, /* only dynamic id's */
> > + .probe = vfio_probe,
> > + .remove = vfio_remove,
>
> Also - I think we need to handle e.g. suspend in some way.
> Again, this likely involves notifying userspace so it can
> save state to memory.
>
> > +};
> > +
> > +static int __init init(void)
> > +{
> > + pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
> > + return pci_register_driver(&driver);
> > +}
> > +
> > +static void __exit cleanup(void)
> > +{
> > + if (vfio_major >= 0)
> > + unregister_chrdev(vfio_major, "vfio");
> > + pci_unregister_driver(&driver);
> > +}
> > +
> > +module_init(init);
> > +module_exit(cleanup);
> > +
> > +MODULE_VERSION(DRIVER_VERSION);
> > +MODULE_LICENSE("GPL v2");
> > +MODULE_AUTHOR(DRIVER_AUTHOR);
> > +MODULE_DESCRIPTION(DRIVER_DESC);
> > diff -uprN linux-2.6.34/drivers/vfio/vfio_pci_config.c vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c
> > --- linux-2.6.34/drivers/vfio/vfio_pci_config.c 1969-12-31 16:00:00.000000000 -0800
> > +++ vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c 2010-05-28 14:26:47.000000000 -0700
> > @@ -0,0 +1,554 @@
> > +/*
> > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > + * Author: Tom Lyon, [email protected]
> > + *
> > + * This program is free software; you may redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; version 2 of the License.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > + * SOFTWARE.
> > + *
> > + * Portions derived from drivers/uio/uio.c:
> > + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> > + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> > + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> > + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> > + *
> > + * Portions derived from drivers/uio/uio_pci_generic.c:
> > + * Copyright (C) 2009 Red Hat, Inc.
> > + * Author: Michael S. Tsirkin <[email protected]>
> > + */
> > +
> > +#include <linux/fs.h>
> > +#include <linux/pci.h>
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/vfio.h>
> > +
> > +#define PCI_CAP_ID_BASIC 0
> > +#ifndef PCI_CAP_ID_MAX
> > +#define PCI_CAP_ID_MAX PCI_CAP_ID_AF
> > +#endif
> > +
> > +/*
> > + * Lengths of PCI Config Capabilities
> > + * 0 means unknown (but at least 4)
> > + * FF means special/variable
> > + */
> > +static u8 pci_capability_length[] = {
> > + [PCI_CAP_ID_BASIC] = 64, /* pci config header */
> > + [PCI_CAP_ID_PM] = PCI_PM_SIZEOF,
> > + [PCI_CAP_ID_AGP] = PCI_AGP_SIZEOF,
> > + [PCI_CAP_ID_VPD] = 8,
> > + [PCI_CAP_ID_SLOTID] = 4,
> > + [PCI_CAP_ID_MSI] = 0xFF, /* 10, 14, or 24 */
> > + [PCI_CAP_ID_CHSWP] = 4,
> > + [PCI_CAP_ID_PCIX] = 0xFF, /* 8 or 24 */
> > + [PCI_CAP_ID_HT] = 28,
> > + [PCI_CAP_ID_VNDR] = 0xFF,
> > + [PCI_CAP_ID_DBG] = 0,
> > + [PCI_CAP_ID_CCRC] = 0,
> > + [PCI_CAP_ID_SHPC] = 0,
> > + [PCI_CAP_ID_SSVID] = 0, /* bridge only - not supp */
> > + [PCI_CAP_ID_AGP3] = 0,
> > + [PCI_CAP_ID_EXP] = 36,
> > + [PCI_CAP_ID_MSIX] = 12,
> > + [PCI_CAP_ID_AF] = 6,
> > +};
> > +
> > +/*
> > + * Read/Write Permission Bits - one bit for each bit in capability
> > + * Any field can be read if it exists,
> > + * but what is read depends on whether the field
> > + * is 'virtualized', or just pass thru to the hardware.
> > + * Any virtualized field is also virtualized for writes.
> > + * Writes are only permitted if they have a 1 bit here.
> > + */
> > +struct perm_bits {
> > + u32 rvirt; /* read bits which must be virtualized */
> > + u32 write; /* writeable bits - virt if read virt */
> > +};
> > +
> > +static struct perm_bits pci_cap_basic_perm[] = {
> > + { 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
> > + { 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
> > + { 0, 0xFF00FFFF, }, /* 0x08 bist, htype, lat, cache */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x0c bar */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
> > + { 0, 0, }, /* 0x24 cardbus - not yet */
> > + { 0, 0, }, /* 0x28 subsys vendor & dev */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x2c rom bar */
> > + { 0, 0, }, /* 0x30 capability ptr & resv */
> > + { 0, 0, }, /* 0x34 resv */
> > + { 0, 0, }, /* 0x38 resv */
> > + { 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
> > +};
> > +
> > +static struct perm_bits pci_cap_pm_perm[] = {
> > + { 0, 0, }, /* 0x00 PM capabilities */
> > + { 0, 0xFFFFFFFF, }, /* 0x04 PM control/status */
> > +};
> > +
> > +static struct perm_bits pci_cap_vpd_perm[] = {
> > + { 0, 0xFFFF0000, }, /* 0x00 address */
> > + { 0, 0xFFFFFFFF, }, /* 0x04 data */
> > +};
> > +
> > +static struct perm_bits pci_cap_slotid_perm[] = {
> > + { 0, 0, }, /* 0x00 all read only */
> > +};
> > +
> > +static struct perm_bits pci_cap_msi_perm[] = {
> > + { 0, 0, }, /* 0x00 MSI message control */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message addr/data */
> > + { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */
> > + { 0, 0xFFFFFFFF, }, /* 0x10 MSI mask bits */
> > + { 0, 0xFFFFFFFF, }, /* 0x14 MSI pending bits */
> > +};
> > +
> > +static struct perm_bits pci_cap_pcix_perm[] = {
> > + { 0, 0xFFFF0000, }, /* 0x00 PCI_X_CMD */
> > + { 0, 0, }, /* 0x04 PCI_X_STATUS */
> > + { 0, 0xFFFFFFFF, }, /* 0x08 ECC ctlr & status */
> > + { 0, 0, }, /* 0x0c ECC first addr */
> > + { 0, 0, }, /* 0x10 ECC second addr */
> > + { 0, 0, }, /* 0x14 ECC attr */
> > +};
> > +
> > +/* pci express capabilities */
> > +static struct perm_bits pci_cap_exp_perm[] = {
> > + { 0, 0, }, /* 0x00 PCIe capabilities */
> > + { 0, 0, }, /* 0x04 PCIe device capabilities */
> > + { 0, 0xFFFFFFFF, }, /* 0x08 PCIe device control & status */
> > + { 0, 0, }, /* 0x0c PCIe link capabilities */
> > + { 0, 0x000000FF, }, /* 0x10 PCIe link ctl/stat - SAFE? */
> > + { 0, 0, }, /* 0x14 PCIe slot capabilities */
> > + { 0, 0x00FFFFFF, }, /* 0x18 PCIe link ctl/stat - SAFE? */
> > + { 0, 0, }, /* 0x1c PCIe root port stuff */
> > + { 0, 0, }, /* 0x20 PCIe root port stuff */
> > +};
> > +
> > +static struct perm_bits pci_cap_msix_perm[] = {
> > + { 0, 0, }, /* 0x00 MSI-X Enable */
> > + { 0, 0, }, /* 0x04 table offset & bir */
> > + { 0, 0, }, /* 0x08 pba offset & bir */
> > +};
> > +
> > +static struct perm_bits pci_cap_af_perm[] = {
> > + { 0, 0, }, /* 0x00 af capability */
> > + { 0, 0x0001, }, /* 0x04 af flr bit */
> > +};
> > +
> > +static struct perm_bits *pci_cap_perms[] = {
> > + [PCI_CAP_ID_BASIC] = pci_cap_basic_perm,
> > + [PCI_CAP_ID_PM] = pci_cap_pm_perm,
> > + [PCI_CAP_ID_VPD] = pci_cap_vpd_perm,
> > + [PCI_CAP_ID_SLOTID] = pci_cap_slotid_perm,
> > + [PCI_CAP_ID_MSI] = pci_cap_msi_perm,
> > + [PCI_CAP_ID_PCIX] = pci_cap_pcix_perm,
> > + [PCI_CAP_ID_EXP] = pci_cap_exp_perm,
> > + [PCI_CAP_ID_MSIX] = pci_cap_msix_perm,
> > + [PCI_CAP_ID_AF] = pci_cap_af_perm,
> > +};
> > +
> > +/*
> > + * We build a map of the config space that tells us where
> > + * and what capabilities exist, so that we can map reads and
> > + * writes back to capabilities, and thus figure out what to
> > + * allow, deny, or virtualize
> > + */
>
> So the above looks like it is very unlikely to be exhaustive and
> correct. Maybe there aren't bugs in this table to be found just by
> looking hard at the spec, but likely will surface when someone tries
> to actually run driver with e.g. a working pm on top.
> Let's ask another question:
>
> since we have the iommu protecting us, can't all or most of this be
> done in userspace? What can userspace do that will harm the host?
> I think each place where we block access to a register, there should
> be a very specific documentation for why we do this.
I think, in an ideal world, you would be correct. I don't trust either
the hardware or the IOMMU software enough to feel good about this, though.
>
>
> > +int vfio_build_config_map(struct vfio_dev *vdev)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + u8 *map;
> > + int i, len;
> > + u8 pos, cap, tmp;
> > + u16 flags;
> > + int ret;
> > + int loops = 100;
>
> 100?
Why not?
>
>
> .....
>
> > +int vfio_dma_map_common(struct vfio_listener *listener,
> > + unsigned int cmd, struct vfio_dma_map *dmp)
> > +{
> > + int locked, lock_limit;
> > + struct page **pages;
> > + int npage;
> > + struct dma_map_page *mlp;
> > + int rdwr = (dmp->flags & VFIO_FLAG_WRITE) ? 1 : 0;
> > + int ret = 0;
> > +
> > + if (dmp->vaddr & (PAGE_SIZE-1))
> > + return -EINVAL;
> > + if (dmp->size & (PAGE_SIZE-1))
> > + return -EINVAL;
> > + if (dmp->size <= 0)
> > + return -EINVAL;
> > + npage = dmp->size >> PAGE_SHIFT;
> > + if (npage <= 0)
> > + return -EINVAL;
> > +
> > + mutex_lock(&listener->vdev->dgate);
> > +
> > + /* account for locked pages */
> > + locked = npage + current->mm->locked_vm;
> > + lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
> > + >> PAGE_SHIFT;
> > + if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
> > + printk(KERN_WARNING "%s: RLIMIT_MEMLOCK exceeded\n",
> > + __func__);
> > + ret = -ENOMEM;
> > + goto out_lock;
> > + }
> > + /* only 1 address space per fd */
> > + if (current->mm != listener->mm) {
>
> Why is that?
It's OK to have multiple fds per device, but multiple address spaces per fd
means somebody forked - and I don't know how to keep track of MMU
notifications once that happens.
>
> > + if (listener->mm != NULL) {
> > + ret = -EINVAL;
> > + goto out_lock;
> > + }
> > + listener->mm = current->mm;
> > +#ifdef CONFIG_MMU_NOTIFIER
> > + listener->mmu_notifier.ops = &vfio_dma_mmu_notifier_ops;
> > + ret = mmu_notifier_register(&listener->mmu_notifier,
> > + listener->mm);
> > + if (ret)
> > + printk(KERN_ERR "%s: mmu_notifier_register failed %d\n",
> > + __func__, ret);
> > + ret = 0;
> > +#endif
> > + }
> > +
> > + pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
>
> If you map a 4G region, this will try to allocate 8Mbytes?
Yes. C'est la vie.
>
> > + if (pages == NULL) {
> > + ret = -ENOMEM;
> > + goto out_lock;
> > + }
> > + ret = get_user_pages_fast(dmp->vaddr, npage, rdwr, pages);
> > + if (ret != npage) {
> > + printk(KERN_ERR "%s: get_user_pages_fast returns %d, not %d\n",
> > + __func__, ret, npage);
> > + kfree(pages);
> > + ret = -EFAULT;
> > + goto out_lock;
> > + }
> > + ret = 0;
> > +
> > + mlp = vfio_dma_map_iova(listener, dmp->dmaaddr,
> > + pages, npage, rdwr);
> > + if (IS_ERR(mlp)) {
> > + ret = PTR_ERR(mlp);
> > + kfree(pages);
> > + goto out_lock;
> > + }
> > + mlp->vaddr = dmp->vaddr;
> > + mlp->rdwr = rdwr;
> > + dmp->dmaaddr = mlp->daddr;
> > + list_add(&mlp->list, &listener->dm_list);
> > +
> > + current->mm->locked_vm += npage;
> > + listener->vdev->locked_pages += npage;
> > +out_lock:
> > + mutex_unlock(&listener->vdev->dgate);
> > + return ret;
> > +}
> > +
>
On Tue, Jun 08, 2010 at 02:21:52PM -0700, Tom Lyon wrote:
> The VFIO "driver" is used to allow privileged AND non-privileged processes to
> implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> devices.
> Signed-off-by: Tom Lyon <[email protected]>
Some general comments:
- Please pass this through ./scripts/checkpatch.pl to fix some formatting.
- Lots of hard-coded constants. Please try using pci_regs.h much more,
where not possible please add named enums.
- There are places where you get parameters from userspace and pass them
on to kmalloc etc. Everything you get from userspace needs to be
validated.
- You play non-standard tricks with minor numbers.
Won't it be easier to just make udev create a node
for the device in the way everyone does it? The name could be
descriptive including e.g. bus/dev/fn so userspace can find
your device.
- I note that if we exclude the iommu mapping, the rest conceptually could belong
in pci_generic driver in uio. So if we move these ioctls to the iommu driver,
as Avi suggested, then vfio can be part of the uio framework.
> ---
> This version now requires an IOMMU domain to be set before any access to
> device registers is granted (except that config space may be read). In
> addition, the VFIO_DMA_MAP_ANYWHERE is dropped - it used the dma_map_sg API
> which does not have sufficient controls around IOMMU usage. The IOMMU domain
> is obtained from the 'uiommu' driver which is included in this patch.
>
> Various locking, security, and documentation issues have also been fixed.
>
> Please commit - it or me!
> But seriously, who gets to commit this? Avi for KVM? or GregKH for drivers?
>
> Blurb from previous patch version:
>
> This patch is the evolution of code which was first proposed as a patch to
> uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
> out of the uio framework, and things seem much cleaner. Of course, there is
> a lot of functional overlap with uio, but the previous version just seemed
> like a giant mode switch in the uio code that did not lead to clarity for
> either the new or old code.
>
> [a pony for avi...]
> The major new functionality in this version is the ability to deal with
> PCI config space accesses (through read & write calls) - but includes table
> driven code to determine what's safe to write and what is not. Also, some
> virtualization of the config space to allow drivers to think they're writing
> some registers when they're not. I/O space accesses are also allowed.
> Drivers for devices which use MSI-X are now prevented from directly writing
> the MSI-X vector area.
This table adds a lot of complexity to the code,
and I don't really understand why we need this code in
kernel: isn't the point of iommu that it protects us
from buggy devices? If yes, we should be able to
just ask userspace to be careful and avoid doing silly things
like overwriting MSI-X vectors, and if it's not careful,
no one gets hurt :)
If some registers absolutely must be protected,
I think we should document this carefully in code.
> +/*
> + * MSI and MSI-X Interrupt handler.
> + * Just signal an event
> + */
> +static irqreturn_t msihandler(int irq, void *arg)
> +{
> + struct eventfd_ctx *ctx = arg;
> +
> + eventfd_signal(ctx, 1);
> + return IRQ_HANDLED;
> +}
> +
> +void vfio_disable_msi(struct vfio_dev *vdev)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> +
> + if (vdev->ev_msi) {
> + eventfd_ctx_put(vdev->ev_msi);
> + free_irq(pdev->irq, vdev->ev_msi);
> + vdev->ev_msi = NULL;
> + }
> + pci_disable_msi(pdev);
> +}
> +
> +int vfio_enable_msi(struct vfio_dev *vdev, int fd)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + struct eventfd_ctx *ctx;
> + int ret;
> +
> + ctx = eventfd_ctx_fdget(fd);
> + if (IS_ERR(ctx))
> + return PTR_ERR(ctx);
> + vdev->ev_msi = ctx;
> + pci_enable_msi(pdev);
> + ret = request_irq(pdev->irq, msihandler, 0,
> + vdev->name, ctx);
> + if (ret) {
> + eventfd_ctx_put(ctx);
> + pci_disable_msi(pdev);
> + vdev->ev_msi = NULL;
> + }
> + return ret;
> +}
> +
> +void vfio_disable_msix(struct vfio_dev *vdev)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + int i;
> +
> + if (vdev->ev_msix && vdev->msix) {
> + for (i = 0; i < vdev->nvec; i++) {
> + free_irq(vdev->msix[i].vector, vdev->ev_msix[i]);
> + if (vdev->ev_msix[i])
> + eventfd_ctx_put(vdev->ev_msix[i]);
> + }
> + }
> + kfree(vdev->ev_msix);
> + vdev->ev_msix = NULL;
> + kfree(vdev->msix);
> + vdev->msix = NULL;
> + vdev->nvec = 0;
> + pci_disable_msix(pdev);
> +}
> +
> +int vfio_enable_msix(struct vfio_dev *vdev, int nvec, void __user *uarg)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + struct eventfd_ctx *ctx;
> + int ret = 0;
> + int i;
> + int fd;
> +
> + vdev->msix = kzalloc(nvec * sizeof(struct msix_entry),
> + GFP_KERNEL);
kzalloc with size coming from userspace? And it's signed. Ugh.
I think you should just enable all vectors and map them,
at startup.
> + if (vdev->msix == NULL)
> + return -ENOMEM;
> + vdev->ev_msix = kzalloc(nvec * sizeof(struct eventfd_ctx *),
> + GFP_KERNEL);
> + if (vdev->ev_msix == NULL) {
> + kfree(vdev->msix);
> + return -ENOMEM;
> + }
> + for (i = 0; i < nvec; i++) {
> + if (copy_from_user(&fd, uarg, sizeof fd)) {
> + ret = -EFAULT;
> + break;
> + }
> + uarg += sizeof fd;
> + ctx = eventfd_ctx_fdget(fd);
> + if (IS_ERR(ctx)) {
> + ret = PTR_ERR(ctx);
> + break;
> + }
> + vdev->msix[i].entry = i;
> + vdev->ev_msix[i] = ctx;
> + }
> + if (!ret)
> + ret = pci_enable_msix(pdev, vdev->msix, nvec);
> + vdev->nvec = 0;
> + for (i = 0; i < nvec && !ret; i++) {
> + ret = request_irq(vdev->msix[i].vector, msihandler, 0,
> + vdev->name, vdev->ev_msix[i]);
> + if (ret)
> + break;
> + vdev->nvec = i+1;
> + }
> + if (ret)
> + vfio_disable_msix(vdev);
> + return ret;
> +}
> diff -uprN linux-2.6.34/drivers/vfio/vfio_main.c vfio-linux-2.6.34/drivers/vfio/vfio_main.c
> --- linux-2.6.34/drivers/vfio/vfio_main.c 1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/drivers/vfio/vfio_main.c 2010-06-07 12:39:17.000000000 -0700
> @@ -0,0 +1,624 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, [email protected]
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <[email protected]>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/mm.h>
> +#include <linux/idr.h>
> +#include <linux/string.h>
> +#include <linux/interrupt.h>
> +#include <linux/fs.h>
> +#include <linux/eventfd.h>
> +#include <linux/pci.h>
> +#include <linux/iommu.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/uaccess.h>
> +
> +#include <linux/vfio.h>
> +
> +
> +#define DRIVER_VERSION "0.1"
> +#define DRIVER_AUTHOR "Tom Lyon <[email protected]>"
> +#define DRIVER_DESC "VFIO - User Level PCI meta-driver"
> +
> +static int vfio_major = -1;
> +DEFINE_IDR(vfio_idr);
> +/* Protect idr accesses */
> +DEFINE_MUTEX(vfio_minor_lock);
> +
> +/*
> + * Does [a1,b1) overlap [a2,b2) ?
> + */
> +static inline int overlap(int a1, int b1, int a2, int b2)
> +{
> + /*
> + * Ranges overlap if they're not disjoint; and they're
> + * disjoint if the end of one is before the start of
> + * the other one.
> + */
> + return !(b2 <= a1 || b1 <= a2);
> +}
> +
> +static int vfio_open(struct inode *inode, struct file *filep)
> +{
> + struct vfio_dev *vdev;
> + struct vfio_listener *listener;
> + int ret = 0;
> +
> + mutex_lock(&vfio_minor_lock);
> + vdev = idr_find(&vfio_idr, iminor(inode));
> + mutex_unlock(&vfio_minor_lock);
Instead of all this complexity, can't we just stick a pointer to your device
in 'struct cdev *i_cdev' on the inode?
> + if (!vdev) {
> + ret = -ENODEV;
> + goto out;
> + }
> +
> + listener = kzalloc(sizeof(*listener), GFP_KERNEL);
> + if (!listener) {
> + ret = -ENOMEM;
> + goto err_alloc_listener;
> + }
> +
> + listener->vdev = vdev;
> + INIT_LIST_HEAD(&listener->dm_list);
> + filep->private_data = listener;
> +
> + mutex_lock(&vdev->lgate);
> + if (vdev->listeners == 0) { /* first open */
> + /* reset to known state if we can */
> + (void) pci_reset_function(vdev->pdev);
We are opening the device - how can it not be in a known state?
> + }
> + vdev->listeners++;
> + mutex_unlock(&vdev->lgate);
> + return 0;
> +
> +err_alloc_listener:
> +out:
> + return ret;
> +}
> +
> +static int vfio_release(struct inode *inode, struct file *filep)
> +{
> + int ret = 0;
> + struct vfio_listener *listener = filep->private_data;
> + struct vfio_dev *vdev = listener->vdev;
> +
> + vfio_dma_unmapall(listener);
> + if (listener->mm) {
> +#ifdef CONFIG_MMU_NOTIFIER
> + mmu_notifier_unregister(&listener->mmu_notifier, listener->mm);
> +#endif
> + listener->mm = NULL;
> + }
> +
> + mutex_lock(&vdev->lgate);
> + if (--vdev->listeners <= 0) {
> + /* we don't need to hold igate here since there are
> + * no more listeners doing ioctls
> + */
> + if (vdev->ev_msix)
> + vfio_disable_msix(vdev);
> + if (vdev->ev_msi)
> + vfio_disable_msi(vdev);
> + if (vdev->ev_irq) {
> + eventfd_ctx_put(vdev->ev_irq);
> + vdev->ev_irq = NULL;
> + }
> + vfio_domain_unset(vdev);
> + /* reset to known state if we can */
> + (void) pci_reset_function(vdev->pdev);
This is too late - the device could be doing DMA here and we moved it out from
under the domain! And we should make sure (at open time) that we *can* reset on
close, and fail binding if we can't.
> + }
> + mutex_unlock(&vdev->lgate);
> +
> + kfree(listener);
> + return ret;
> +}
> +
> +static ssize_t vfio_read(struct file *filep, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct vfio_listener *listener = filep->private_data;
> + struct vfio_dev *vdev = listener->vdev;
> + struct pci_dev *pdev = vdev->pdev;
> + int pci_space;
> +
> + /* no reads or writes until IOMMU domain set */
> + if (vdev->udomain == NULL)
> + return -EINVAL;
> + pci_space = vfio_offset_to_pci_space(*ppos);
> + if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
> + return vfio_config_readwrite(0, vdev, buf, count, ppos);
> + if (pci_space > PCI_ROM_RESOURCE)
> + return -EINVAL;
> + if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
> + return vfio_io_readwrite(0, vdev, buf, count, ppos);
> + if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM)
> + return vfio_mem_readwrite(0, vdev, buf, count, ppos);
> + if (pci_space == PCI_ROM_RESOURCE)
> + return vfio_mem_readwrite(0, vdev, buf, count, ppos);
> + return -EINVAL;
> +}
> +
> +static int vfio_msix_check(struct vfio_dev *vdev, u64 start, u32 len)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + u16 pos;
> + u32 table_offset;
> + u16 table_size;
> + u8 bir;
> + u32 lo, hi, startp, endp;
> +
> + pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
> + if (!pos)
> + return 0;
> +
> + pci_read_config_word(pdev, pos + PCI_MSIX_FLAGS, &table_size);
> + table_size = (table_size & PCI_MSIX_FLAGS_QSIZE) + 1;
> + pci_read_config_dword(pdev, pos + 4, &table_offset);
> + bir = table_offset & PCI_MSIX_FLAGS_BIRMASK;
> + lo = table_offset >> PAGE_SHIFT;
> + hi = (table_offset + PCI_MSIX_ENTRY_SIZE * table_size + PAGE_SIZE - 1)
> + >> PAGE_SHIFT;
> + startp = start >> PAGE_SHIFT;
> + endp = (start + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
PAGE_ALIGN here and elsewhere?
> + if (bir == vfio_offset_to_pci_space(start) &&
> + overlap(lo, hi, startp, endp)) {
> + printk(KERN_WARNING "%s: cannot write msi-x vectors\n",
> + __func__);
> + return -EINVAL;
> + }
Tricky, slow, and - is it really necessary?
And it won't work if PAGE_SIZE is > 4K, because MSIX page is only 4K in size.
> + return 0;
> +}
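For reference, the page-overlap arithmetic above can be modeled in plain userspace C (function names and the PAGE_SHIFT value are my assumptions, not from the patch). Both ranges are rounded out to whole pages, so any write that merely straddles a page containing the MSI-X table is rejected:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PCI_MSIX_ENTRY_SIZE 16

/* Half-open page-number ranges [lo, hi) and [startp, endp) intersect
 * iff each one starts before the other ends. */
static int overlap(uint32_t lo, uint32_t hi, uint32_t startp, uint32_t endp)
{
	return lo < endp && startp < hi;
}

/* Model of vfio_msix_check(): is a write of `len` bytes at `start`
 * clear of the pages holding the MSI-X table? */
static int msix_write_ok(uint32_t table_offset, uint16_t table_size,
			 uint64_t start, uint32_t len)
{
	uint32_t lo = table_offset >> PAGE_SHIFT;
	uint32_t hi = (uint32_t)((table_offset
				  + PCI_MSIX_ENTRY_SIZE * table_size
				  + PAGE_SIZE - 1) >> PAGE_SHIFT);
	uint32_t startp = (uint32_t)(start >> PAGE_SHIFT);
	uint32_t endp = (uint32_t)((start + len + PAGE_SIZE - 1) >> PAGE_SHIFT);

	return !overlap(lo, hi, startp, endp);
}
```

With a 16-entry table at offset 0 (256 bytes, so page 0 only), a write to page 1 passes, while a write inside page 0 - or one that merely crosses into it - is refused.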
> +
> +static ssize_t vfio_write(struct file *filep, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct vfio_listener *listener = filep->private_data;
> + struct vfio_dev *vdev = listener->vdev;
> + struct pci_dev *pdev = vdev->pdev;
> + int pci_space;
> + int ret;
> +
> + /* no reads or writes until IOMMU domain set */
> + if (vdev->udomain == NULL)
> + return -EINVAL;
> + pci_space = vfio_offset_to_pci_space(*ppos);
> + if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
> + return vfio_config_readwrite(1, vdev,
> + (char __user *)buf, count, ppos);
> + if (pci_space > PCI_ROM_RESOURCE)
> + return -EINVAL;
> + if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
> + return vfio_io_readwrite(1, vdev,
> + (char __user *)buf, count, ppos);
> + if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) {
> + /* don't allow writes to msi-x vectors */
What happens if we don't do all these checks?
> + ret = vfio_msix_check(vdev, *ppos, count);
> + if (ret)
> + return ret;
> + return vfio_mem_readwrite(1, vdev,
> + (char __user *)buf, count, ppos);
> + }
> + return -EINVAL;
> +}
> +
> +static int vfio_mmap(struct file *filep, struct vm_area_struct *vma)
> +{
> + struct vfio_listener *listener = filep->private_data;
> + struct vfio_dev *vdev = listener->vdev;
> + struct pci_dev *pdev = vdev->pdev;
> + unsigned long requested, actual;
> + int pci_space;
> + u64 start;
> + u32 len;
> + unsigned long phys;
> + int ret;
> +
> + /* no reads or writes until IOMMU domain set */
> + if (vdev->udomain == NULL)
> + return -EINVAL;
> +
> + if (vma->vm_end < vma->vm_start)
> + return -EINVAL;
> + if ((vma->vm_flags & VM_SHARED) == 0)
> + return -EINVAL;
> +
> +
> + pci_space = vfio_offset_to_pci_space((u64)vma->vm_pgoff << PAGE_SHIFT);
> + if (pci_space > PCI_ROM_RESOURCE)
> + return -EINVAL;
> + switch (pci_space) {
> + case PCI_ROM_RESOURCE:
> + if (vma->vm_flags & VM_WRITE)
> + return -EINVAL;
> + if (pci_resource_flags(pdev, PCI_ROM_RESOURCE) == 0)
> + return -EINVAL;
> + actual = pci_resource_len(pdev, PCI_ROM_RESOURCE) >> PAGE_SHIFT;
> + break;
> + default:
> + if ((pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) == 0)
> + return -EINVAL;
> + actual = pci_resource_len(pdev, pci_space) >> PAGE_SHIFT;
> + break;
> + }
> +
I don't think we really care about non-memory mmaps. They can all go
through read.
> + requested = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> + if (requested > actual || actual == 0)
> + return -EINVAL;
> +
> + /*
> + * Can't allow non-priv users to mmap MSI-X vectors
> + * else they can write anywhere in phys memory
> + */
not if there's an iommu.
> + start = vma->vm_pgoff << PAGE_SHIFT;
> + len = vma->vm_end - vma->vm_start;
> + if (vma->vm_flags & VM_WRITE) {
> + ret = vfio_msix_check(vdev, start, len);
> + if (ret)
> + return ret;
> + }
> +
> + vma->vm_private_data = vdev;
> + vma->vm_flags |= VM_IO | VM_RESERVED;
> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> + phys = pci_resource_start(pdev, pci_space) >> PAGE_SHIFT;
> +
> + return remap_pfn_range(vma, vma->vm_start, phys,
> + vma->vm_end - vma->vm_start,
> + vma->vm_page_prot);
I think there's a security problem here:
userspace can do mmap, then close the file and unbind the
device from the iommu; now that the device is not bound
(or is bound to another process), userspace can still
access the device through the mmap and crash the system.
We must make sure the device stays open until no one
maps the memory range.
> +}
> +
> +static long vfio_unl_ioctl(struct file *filep,
> + unsigned int cmd,
> + unsigned long arg)
> +{
> + struct vfio_listener *listener = filep->private_data;
> + struct vfio_dev *vdev = listener->vdev;
> + void __user *uarg = (void __user *)arg;
> + struct pci_dev *pdev = vdev->pdev;
> + struct vfio_dma_map dm;
> + int ret = 0;
> + u64 mask;
> + int fd, nfd;
> + int bar;
> +
> + if (vdev == NULL)
> + return -EINVAL;
> +
> + switch (cmd) {
> +
> + case VFIO_DMA_MAP_IOVA:
> + if (copy_from_user(&dm, uarg, sizeof dm))
> + return -EFAULT;
> + ret = vfio_dma_map_common(listener, cmd, &dm);
> + if (!ret && copy_to_user(uarg, &dm, sizeof dm))
> + ret = -EFAULT;
> + break;
> +
> + case VFIO_DMA_UNMAP:
> + if (copy_from_user(&dm, uarg, sizeof dm))
> + return -EFAULT;
> + ret = vfio_dma_unmap_dm(listener, &dm);
> + break;
> +
> + case VFIO_DMA_MASK: /* set master mode and DMA mask */
> + if (copy_from_user(&mask, uarg, sizeof mask))
> + return -EFAULT;
> + pci_set_master(pdev);
This better be done by userspace when it sees fit.
Otherwise device might corrupt userspace memory.
> + ret = pci_set_dma_mask(pdev, mask);
> + break;
Is the above needed?
> +
> + case VFIO_EVENTFD_IRQ:
> + if (copy_from_user(&fd, uarg, sizeof fd))
> + return -EFAULT;
> + mutex_lock(&vdev->igate);
> + if (vdev->ev_irq)
> + eventfd_ctx_put(vdev->ev_irq);
> + if (fd >= 0) {
> + vdev->ev_irq = eventfd_ctx_fdget(fd);
> + if (vdev->ev_irq == NULL)
> + ret = -EINVAL;
> + }
> + mutex_unlock(&vdev->igate);
> + break;
> +
> + case VFIO_EVENTFD_MSI:
> + if (copy_from_user(&fd, uarg, sizeof fd))
> + return -EFAULT;
> + mutex_lock(&vdev->igate);
> + if (fd >= 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
> + ret = vfio_enable_msi(vdev, fd);
> + else if (fd < 0 && vdev->ev_msi)
> + vfio_disable_msi(vdev);
> + else
> + ret = -EINVAL;
> + mutex_unlock(&vdev->igate);
> + break;
> +
> + case VFIO_EVENTFDS_MSIX:
> + if (copy_from_user(&nfd, uarg, sizeof nfd))
> + return -EFAULT;
> + uarg += sizeof nfd;
> + mutex_lock(&vdev->igate);
> + if (nfd > 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
> + ret = vfio_enable_msix(vdev, nfd, uarg);
> + else if (nfd == 0 && vdev->ev_msix)
> + vfio_disable_msix(vdev);
> + else
> + ret = -EINVAL;
> + mutex_unlock(&vdev->igate);
> + break;
> +
> + case VFIO_BAR_LEN:
> + if (copy_from_user(&bar, uarg, sizeof bar))
> + return -EFAULT;
> + if (bar < 0 || bar > PCI_ROM_RESOURCE)
> + return -EINVAL;
> + bar = pci_resource_len(pdev, bar);
> + if (copy_to_user(uarg, &bar, sizeof bar))
> + return -EFAULT;
> + break;
> +
> + case VFIO_DOMAIN_SET:
> + if (copy_from_user(&fd, uarg, sizeof fd))
> + return -EFAULT;
> + ret = vfio_domain_set(vdev, fd);
> + break;
> +
> + case VFIO_DOMAIN_UNSET:
> + vfio_domain_unset(vdev);
> + ret = 0;
> + break;
> +
> + default:
> + return -EINVAL;
> + }
> + return ret;
> +}
> +
> +static const struct file_operations vfio_fops = {
> + .owner = THIS_MODULE,
> + .open = vfio_open,
> + .release = vfio_release,
> + .read = vfio_read,
> + .write = vfio_write,
> + .unlocked_ioctl = vfio_unl_ioctl,
> + .mmap = vfio_mmap,
> +};
> +
> +static int vfio_get_devnum(struct vfio_dev *vdev)
> +{
> + int retval = -ENOMEM;
> + int id;
> +
> + mutex_lock(&vfio_minor_lock);
> + if (idr_pre_get(&vfio_idr, GFP_KERNEL) == 0)
> + goto exit;
> +
> + retval = idr_get_new(&vfio_idr, vdev, &id);
> + if (retval < 0) {
> + if (retval == -EAGAIN)
> + retval = -ENOMEM;
> + goto exit;
> + }
> + if (id > MINORMASK) {
> + idr_remove(&vfio_idr, id);
> + retval = -ENOMEM;
> + }
> + if (vfio_major < 0) {
> + retval = register_chrdev(0, "vfio", &vfio_fops);
> + if (retval < 0)
> + goto exit;
> + vfio_major = retval;
> + }
> +
> + retval = MKDEV(vfio_major, id);
> +exit:
> + mutex_unlock(&vfio_minor_lock);
> + return retval;
> +}
> +
> +static void vfio_free_minor(struct vfio_dev *vdev)
> +{
> + mutex_lock(&vfio_minor_lock);
> + idr_remove(&vfio_idr, MINOR(vdev->devnum));
> + mutex_unlock(&vfio_minor_lock);
> +}
> +
> +/*
> + * Verify that the device supports Interrupt Disable bit in command register,
> + * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
> + * in PCI 2.2. (from uio_pci_generic)
> + */
> +static int verify_pci_2_3(struct pci_dev *pdev)
> +{
> + u16 orig, new;
> + int err = 0;
> + u8 pin;
> +
> + pci_block_user_cfg_access(pdev);
> +
> + pci_read_config_byte(pdev, PCI_INTERRUPT_PIN, &pin);
> + if (pin == 0) /* irqs not needed */
> + goto out;
> +
> + pci_read_config_word(pdev, PCI_COMMAND, &orig);
> + pci_write_config_word(pdev, PCI_COMMAND,
> + orig ^ PCI_COMMAND_INTX_DISABLE);
> + pci_read_config_word(pdev, PCI_COMMAND, &new);
> + /* There's no way to protect against
> + * hardware bugs or detect them reliably, but as long as we know
> + * what the value should be, let's go ahead and check it. */
> + if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
> + err = -EBUSY;
> + dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
> + "driver or HW bug?\n", orig, new);
> + goto out;
> + }
> + if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
> + dev_warn(&pdev->dev, "Device does not support "
> + "disabling interrupts: unable to bind.\n");
> + err = -ENODEV;
> + goto out;
> + }
> + /* Now restore the original value. */
> + pci_write_config_word(pdev, PCI_COMMAND, orig);
> +out:
> + pci_unblock_user_cfg_access(pdev);
> + return err;
> +}
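The flip-and-read-back test above reduces to two XOR comparisons. A userspace model (the helper name is mine) makes the three outcomes explicit - this is only a sketch of the decision logic, not of the config-space access itself:

```c
#include <assert.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x400

/* Model of the verify_pci_2_3() decision, given the command register
 * value before the flip (orig) and the value read back (new):
 *   0   -> INTx disable toggled cleanly (PCI 2.3 OK)
 *  -16  -> unrelated command bits changed: driver or HW bug (-EBUSY)
 *  -19  -> the bit did not toggle: pre-2.3 device (-ENODEV)   */
static int check_intx_toggle(uint16_t orig, uint16_t new)
{
	if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE)
		return -16;	/* -EBUSY */
	if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE))
		return -19;	/* -ENODEV */
	return 0;
}
```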
> +
> +static int vfio_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> +{
> + struct vfio_dev *vdev;
> + int err;
> +
> + if (!iommu_found())
> + return -EINVAL;
> +
> + err = pci_enable_device(pdev);
> + if (err) {
> + dev_err(&pdev->dev, "%s: pci_enable_device failed: %d\n",
> + __func__, err);
> + return err;
> + }
> +
> + err = verify_pci_2_3(pdev);
> + if (err)
> + goto err_verify;
> +
> + vdev = kzalloc(sizeof(struct vfio_dev), GFP_KERNEL);
> + if (!vdev) {
> + err = -ENOMEM;
> + goto err_alloc;
> + }
> + vdev->pdev = pdev;
> +
> + err = vfio_class_init();
> + if (err)
> + goto err_class;
> +
> + mutex_init(&vdev->lgate);
> + mutex_init(&vdev->dgate);
> + mutex_init(&vdev->igate);
> +
> + err = vfio_get_devnum(vdev);
> + if (err < 0)
> + goto err_get_devnum;
> + vdev->devnum = err;
> + err = 0;
> +
> + sprintf(vdev->name, "vfio%d", MINOR(vdev->devnum));
> + pci_set_drvdata(pdev, vdev);
> + vdev->dev = device_create(vfio_class->class, &pdev->dev,
> + vdev->devnum, vdev, vdev->name);
> + if (IS_ERR(vdev->dev)) {
> + printk(KERN_ERR "VFIO: device register failed\n");
> + err = PTR_ERR(vdev->dev);
> + goto err_device_create;
> + }
> +
> + err = vfio_dev_add_attributes(vdev);
> + if (err)
> + goto err_vfio_dev_add_attributes;
> +
> +
> + if (pdev->irq > 0) {
> + err = request_irq(pdev->irq, vfio_interrupt,
> + IRQF_SHARED, "vfio", vdev);
> + if (err)
> + goto err_request_irq;
> + }
> + vdev->vinfo.bardirty = 1;
> +
> + return 0;
> +
> +err_request_irq:
> +#ifdef notdef
> + vfio_dev_del_attributes(vdev);
> +#endif
> +err_vfio_dev_add_attributes:
> + device_destroy(vfio_class->class, vdev->devnum);
> +err_device_create:
> + vfio_free_minor(vdev);
> +err_get_devnum:
> +err_class:
> + kfree(vdev);
> +err_alloc:
> +err_verify:
> + pci_disable_device(pdev);
> + return err;
> +}
> +
> +static void vfio_remove(struct pci_dev *pdev)
> +{
So what happens if someone has a device file open and the device
is hot-unplugged? At a minimum, we want userspace to
have a way to get and handle these notifications.
But also remember we cannot trust userspace to be well-behaved.
> + struct vfio_dev *vdev = pci_get_drvdata(pdev);
> +
> + vfio_free_minor(vdev);
> +
> + if (pdev->irq > 0)
> + free_irq(pdev->irq, vdev);
> +
> +#ifdef notdef
> + vfio_dev_del_attributes(vdev);
> +#endif
> +
> + pci_set_drvdata(pdev, NULL);
> + device_destroy(vfio_class->class, vdev->devnum);
> + kfree(vdev);
> + vfio_class_destroy();
> + pci_disable_device(pdev);
> +}
> +
> +static struct pci_driver driver = {
> + .name = "vfio",
> + .id_table = NULL, /* only dynamic id's */
> + .probe = vfio_probe,
> + .remove = vfio_remove,
Also - I think we need to handle e.g. suspend in some way.
Again, this likely involves notifying userspace so it can
save state to memory.
> +};
> +
> +static int __init init(void)
> +{
> + pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
> + return pci_register_driver(&driver);
> +}
> +
> +static void __exit cleanup(void)
> +{
> + if (vfio_major >= 0)
> + unregister_chrdev(vfio_major, "vfio");
> + pci_unregister_driver(&driver);
> +}
> +
> +module_init(init);
> +module_exit(cleanup);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff -uprN linux-2.6.34/drivers/vfio/vfio_pci_config.c vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c
> --- linux-2.6.34/drivers/vfio/vfio_pci_config.c 1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c 2010-05-28 14:26:47.000000000 -0700
> @@ -0,0 +1,554 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, [email protected]
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <[email protected]>
> + * Copyright(C) 2005, Thomas Gleixner <[email protected]>
> + * Copyright(C) 2006, Hans J. Koch <[email protected]>
> + * Copyright(C) 2006, Greg Kroah-Hartman <[email protected]>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <[email protected]>
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/pci.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +
> +#define PCI_CAP_ID_BASIC 0
> +#ifndef PCI_CAP_ID_MAX
> +#define PCI_CAP_ID_MAX PCI_CAP_ID_AF
> +#endif
> +
> +/*
> + * Lengths of PCI Config Capabilities
> + * 0 means unknown (but at least 4)
> + * FF means special/variable
> + */
> +static u8 pci_capability_length[] = {
> + [PCI_CAP_ID_BASIC] = 64, /* pci config header */
> + [PCI_CAP_ID_PM] = PCI_PM_SIZEOF,
> + [PCI_CAP_ID_AGP] = PCI_AGP_SIZEOF,
> + [PCI_CAP_ID_VPD] = 8,
> + [PCI_CAP_ID_SLOTID] = 4,
> + [PCI_CAP_ID_MSI] = 0xFF, /* 10, 14, or 24 */
> + [PCI_CAP_ID_CHSWP] = 4,
> + [PCI_CAP_ID_PCIX] = 0xFF, /* 8 or 24 */
> + [PCI_CAP_ID_HT] = 28,
> + [PCI_CAP_ID_VNDR] = 0xFF,
> + [PCI_CAP_ID_DBG] = 0,
> + [PCI_CAP_ID_CCRC] = 0,
> + [PCI_CAP_ID_SHPC] = 0,
> + [PCI_CAP_ID_SSVID] = 0, /* bridge only - not supp */
> + [PCI_CAP_ID_AGP3] = 0,
> + [PCI_CAP_ID_EXP] = 36,
> + [PCI_CAP_ID_MSIX] = 12,
> + [PCI_CAP_ID_AF] = 6,
> +};
> +
> +/*
> + * Read/Write Permission Bits - one bit for each bit in capability
> + * Any field can be read if it exists,
> + * but what is read depends on whether the field
> + * is 'virtualized', or just pass thru to the hardware.
> + * Any virtualized field is also virtualized for writes.
> + * Writes are only permitted if they have a 1 bit here.
> + */
> +struct perm_bits {
> + u32 rvirt; /* read bits which must be virtualized */
> + u32 write; /* writeable bits - virt if read virt */
> +};
> +
> +static struct perm_bits pci_cap_basic_perm[] = {
> + { 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
> + { 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
> + { 0, 0xFF00FFFF, }, /* 0x08 bist, htype, lat, cache */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x0c bar */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
> + { 0, 0, }, /* 0x24 cardbus - not yet */
> + { 0, 0, }, /* 0x28 subsys vendor & dev */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x2c rom bar */
> + { 0, 0, }, /* 0x30 capability ptr & resv */
> + { 0, 0, }, /* 0x34 resv */
> + { 0, 0, }, /* 0x38 resv */
> + { 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
> +};
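Each entry in these tables masks one 32-bit dword of config space. A hedged userspace sketch of how a write mask would be applied (the helper name is my invention; the patch's real write path additionally handles the rvirt/virtualized side):

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the perm_bits layout in the patch. */
struct perm_bits {
	uint32_t rvirt;	/* read bits which must be virtualized */
	uint32_t write;	/* writeable bits - virt if read virt */
};

/* Merge a user write into the current dword: only bits set in the
 * write mask take the user's value; all other bits are preserved. */
static uint32_t apply_write(uint32_t cur, uint32_t val,
			    const struct perm_bits *p)
{
	return (cur & ~p->write) | (val & p->write);
}
```

So with a mask of 0x0000FFFF only the low half of the dword can be changed by the user, no matter what value is written.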
> +
> +static struct perm_bits pci_cap_pm_perm[] = {
> + { 0, 0, }, /* 0x00 PM capabilities */
> + { 0, 0xFFFFFFFF, }, /* 0x04 PM control/status */
> +};
> +
> +static struct perm_bits pci_cap_vpd_perm[] = {
> + { 0, 0xFFFF0000, }, /* 0x00 address */
> + { 0, 0xFFFFFFFF, }, /* 0x04 data */
> +};
> +
> +static struct perm_bits pci_cap_slotid_perm[] = {
> + { 0, 0, }, /* 0x00 all read only */
> +};
> +
> +static struct perm_bits pci_cap_msi_perm[] = {
> + { 0, 0, }, /* 0x00 MSI message control */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message addr/data */
> + { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */
> + { 0, 0xFFFFFFFF, }, /* 0x10 MSI mask bits */
> + { 0, 0xFFFFFFFF, }, /* 0x14 MSI pending bits */
> +};
> +
> +static struct perm_bits pci_cap_pcix_perm[] = {
> + { 0, 0xFFFF0000, }, /* 0x00 PCI_X_CMD */
> + { 0, 0, }, /* 0x04 PCI_X_STATUS */
> + { 0, 0xFFFFFFFF, }, /* 0x08 ECC ctlr & status */
> + { 0, 0, }, /* 0x0c ECC first addr */
> + { 0, 0, }, /* 0x10 ECC second addr */
> + { 0, 0, }, /* 0x14 ECC attr */
> +};
> +
> +/* pci express capabilities */
> +static struct perm_bits pci_cap_exp_perm[] = {
> + { 0, 0, }, /* 0x00 PCIe capabilities */
> + { 0, 0, }, /* 0x04 PCIe device capabilities */
> + { 0, 0xFFFFFFFF, }, /* 0x08 PCIe device control & status */
> + { 0, 0, }, /* 0x0c PCIe link capabilities */
> + { 0, 0x000000FF, }, /* 0x10 PCIe link ctl/stat - SAFE? */
> + { 0, 0, }, /* 0x14 PCIe slot capabilities */
> + { 0, 0x00FFFFFF, }, /* 0x18 PCIe link ctl/stat - SAFE? */
> + { 0, 0, }, /* 0x1c PCIe root port stuff */
> + { 0, 0, }, /* 0x20 PCIe root port stuff */
> +};
> +
> +static struct perm_bits pci_cap_msix_perm[] = {
> + { 0, 0, }, /* 0x00 MSI-X Enable */
> + { 0, 0, }, /* 0x04 table offset & bir */
> + { 0, 0, }, /* 0x08 pba offset & bir */
> +};
> +
> +static struct perm_bits pci_cap_af_perm[] = {
> + { 0, 0, }, /* 0x00 af capability */
> + { 0, 0x0001, }, /* 0x04 af flr bit */
> +};
> +
> +static struct perm_bits *pci_cap_perms[] = {
> + [PCI_CAP_ID_BASIC] = pci_cap_basic_perm,
> + [PCI_CAP_ID_PM] = pci_cap_pm_perm,
> + [PCI_CAP_ID_VPD] = pci_cap_vpd_perm,
> + [PCI_CAP_ID_SLOTID] = pci_cap_slotid_perm,
> + [PCI_CAP_ID_MSI] = pci_cap_msi_perm,
> + [PCI_CAP_ID_PCIX] = pci_cap_pcix_perm,
> + [PCI_CAP_ID_EXP] = pci_cap_exp_perm,
> + [PCI_CAP_ID_MSIX] = pci_cap_msix_perm,
> + [PCI_CAP_ID_AF] = pci_cap_af_perm,
> +};
> +
> +/*
> + * We build a map of the config space that tells us where
> + * and what capabilities exist, so that we can map reads and
> + * writes back to capabilities, and thus figure out what to
> + * allow, deny, or virtualize
> + */
So the above looks very unlikely to be exhaustive and
correct. Maybe there aren't bugs in this table to be found just by
looking hard at the spec, but they will likely surface when someone tries
to actually run a driver with e.g. a working pm on top.
Let's ask another question:
since we have the iommu protecting us, can't all or most of this be
done in userspace? What can userspace do that will harm the host?
I think each place where we block access to a register, there should
be very specific documentation for why we do this.
> +int vfio_build_config_map(struct vfio_dev *vdev)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + u8 *map;
> + int i, len;
> + u8 pos, cap, tmp;
> + u16 flags;
> + int ret;
> + int loops = 100;
100?
.....
> +int vfio_dma_map_common(struct vfio_listener *listener,
> + unsigned int cmd, struct vfio_dma_map *dmp)
> +{
> + int locked, lock_limit;
> + struct page **pages;
> + int npage;
> + struct dma_map_page *mlp;
> + int rdwr = (dmp->flags & VFIO_FLAG_WRITE) ? 1 : 0;
> + int ret = 0;
> +
> + if (dmp->vaddr & (PAGE_SIZE-1))
> + return -EINVAL;
> + if (dmp->size & (PAGE_SIZE-1))
> + return -EINVAL;
> + if (dmp->size <= 0)
> + return -EINVAL;
> + npage = dmp->size >> PAGE_SHIFT;
> + if (npage <= 0)
> + return -EINVAL;
> +
> + mutex_lock(&listener->vdev->dgate);
> +
> + /* account for locked pages */
> + locked = npage + current->mm->locked_vm;
> + lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
> + >> PAGE_SHIFT;
> + if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
> + printk(KERN_WARNING "%s: RLIMIT_MEMLOCK exceeded\n",
> + __func__);
> + ret = -ENOMEM;
> + goto out_lock;
> + }
> + /* only 1 address space per fd */
> + if (current->mm != listener->mm) {
Why is that?
> + if (listener->mm != NULL) {
> + ret = -EINVAL;
> + goto out_lock;
> + }
> + listener->mm = current->mm;
> +#ifdef CONFIG_MMU_NOTIFIER
> + listener->mmu_notifier.ops = &vfio_dma_mmu_notifier_ops;
> + ret = mmu_notifier_register(&listener->mmu_notifier,
> + listener->mm);
> + if (ret)
> + printk(KERN_ERR "%s: mmu_notifier_register failed %d\n",
> + __func__, ret);
> + ret = 0;
> +#endif
> + }
> +
> + pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
If you map a 4G region, this will try to allocate 8Mbytes?
> + if (pages == NULL) {
> + ret = -ENOMEM;
> + goto out_lock;
> + }
> + ret = get_user_pages_fast(dmp->vaddr, npage, rdwr, pages);
> + if (ret != npage) {
> + printk(KERN_ERR "%s: get_user_pages_fast returns %d, not %d\n",
> + __func__, ret, npage);
> + kfree(pages);
> + ret = -EFAULT;
> + goto out_lock;
> + }
> + ret = 0;
> +
> + mlp = vfio_dma_map_iova(listener, dmp->dmaaddr,
> + pages, npage, rdwr);
> + if (IS_ERR(mlp)) {
> + ret = PTR_ERR(mlp);
> + kfree(pages);
> + goto out_lock;
> + }
> + mlp->vaddr = dmp->vaddr;
> + mlp->rdwr = rdwr;
> + dmp->dmaaddr = mlp->daddr;
> + list_add(&mlp->list, &listener->dm_list);
> +
> + current->mm->locked_vm += npage;
> + listener->vdev->locked_pages += npage;
> +out_lock:
> + mutex_unlock(&listener->vdev->dgate);
> + return ret;
> +}
> +
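The argument validation at the top of vfio_dma_map_common() is simple enough to model in userspace C (hypothetical names; PAGE_SHIFT of 12 assumed): the mapping must be page aligned in both address and size, and must cover at least one whole page:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Model of the checks at the top of vfio_dma_map_common():
 * reject misaligned or empty requests before any pinning happens. */
static int dma_map_args_ok(uint64_t vaddr, int64_t size)
{
	if (vaddr & (PAGE_SIZE - 1))
		return 0;	/* user address not page aligned */
	if (size & (PAGE_SIZE - 1))
		return 0;	/* size not a whole number of pages */
	if (size <= 0)
		return 0;	/* must map at least one page */
	return 1;
}
```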
On 06/09/2010 12:21 AM, Tom Lyon wrote:
> The VFIO "driver" is used to allow privileged AND non-privileged processes to
> implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> devices.
> Signed-off-by: Tom Lyon<[email protected]>
> ---
> This version now requires an IOMMU domain to be set before any access to
> device registers is granted (except that config space may be read). In
> addition, the VFIO_DMA_MAP_ANYWHERE is dropped - it used the dma_map_sg API
> which does not have sufficient controls around IOMMU usage. The IOMMU domain
> is obtained from the 'uiommu' driver which is included in this patch.
>
> Various locking, security, and documentation issues have also been fixed.
>
> Please commit - it or me!
> But seriously, who gets to commit this? Avi for KVM?
Definitely not me.
> or GregKH for drivers?
>
I guess.
--
error compiling committee.c: too many arguments to function
On Tuesday 08 June 2010, Randy Dunlap wrote:
> Documentation/ioctl/ioctl-number.txt | 1
> Documentation/vfio.txt | 177 +++++++
> MAINTAINERS | 7
> drivers/Kconfig | 2
> drivers/Makefile | 1
> drivers/vfio/Kconfig | 18
> drivers/vfio/Makefile | 6
> drivers/vfio/uiommu.c | 126 +++++
> drivers/vfio/vfio_dma.c | 324 ++++++++++++
> drivers/vfio/vfio_intrs.c | 191 +++++++
> drivers/vfio/vfio_main.c | 624 +++++++++++++++++++++++++
> drivers/vfio/vfio_pci_config.c | 554 ++++++++++++++++++++++
> drivers/vfio/vfio_rdwr.c | 147 +++++
> drivers/vfio/vfio_sysfs.c | 153 ++++++
> include/linux/uiommu.h | 62 ++
> include/linux/vfio.h | 200 ++++++++
> 16 files changed, 2593 insertions(+)
This seems to be missing a change to include/linux/Kbuild that
adds vfio.h to the exported files. Without the export, you cannot
use the definitions from user space programs unless they come with
their own copy of the header.
Arnd
On Wed, Jun 09, 2010 at 02:04:53PM +0300, Avi Kivity wrote:
> On 06/09/2010 12:21 AM, Tom Lyon wrote:
> >The VFIO "driver" is used to allow privileged AND non-privileged processes to
> >implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> >devices.
> > Signed-off-by: Tom Lyon<[email protected]>
> >---
> >This version now requires an IOMMU domain to be set before any access to
> >device registers is granted (except that config space may be read). In
> >addition, the VFIO_DMA_MAP_ANYWHERE is dropped - it used the dma_map_sg API
> >which does not have sufficient controls around IOMMU usage. The IOMMU domain
> >is obtained from the 'uiommu' driver which is included in this patch.
> >
> >Various locking, security, and documentation issues have also been fixed.
> >
> >Please commit - it or me!
> >But seriously, who gets to commit this? Avi for KVM?
>
> Definitely not me.
>
> >or GregKH for drivers?
>
> I guess.
If this ever gets that far, I'll be glad to take it.
thanks,
greg k-h
On Tue, Jun 08, 2010 at 02:21:52PM -0700, Tom Lyon wrote:
> + case VFIO_DOMAIN_UNSET:
> + vfio_domain_unset(vdev);
> + ret = 0;
> + break;
> +
What if I do:
SET
mmap
UNSET
Now I have access to device which is not behind an iommu.
Simplest solution is to remove the UNSET ioctl:
it's not terribly useful anyway.
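A toy model of the hole being described (all names invented): if UNSET is kept at all, it would have to fail while any mapping of the device exists, e.g. with -EBUSY, so the mmap can never outlive the iommu protection:

```c
#include <assert.h>

/* Toy model of the SET -> mmap -> UNSET sequence: unbinding the
 * domain must be refused while any mmap of the device BARs is live. */
struct vdev_state {
	int domain_set;		/* iommu domain currently bound? */
	int mmap_count;		/* live mmaps of device BARs */
};

static int do_domain_unset(struct vdev_state *v)
{
	if (v->mmap_count > 0)
		return -16;	/* -EBUSY: mappings would outlive the domain */
	v->domain_set = 0;
	return 0;
}
```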
--
MST
> +EXPORT_SYMBOL(uiommu_fdget);
EXPORT_SYMBOL_GPL
.. snip
> +EXPORT_SYMBOL(uiommu_put);
ditto.
On Thursday 10 June 2010 10:27:36 am Konrad Rzeszutek Wilk wrote:
> > +EXPORT_SYMBOL(uiommu_fdget);
>
> EXPORT_SYMBOL_GPL
> .. snip
> > +EXPORT_SYMBOL(uiommu_put);
>
> ditto.
>
Is there a definitive explanation somewhere of when to use each?
On Thu, Jun 10, 2010 at 06:58:37PM -0700, Tom Lyon wrote:
> On Thursday 10 June 2010 10:27:36 am Konrad Rzeszutek Wilk wrote:
> > > +EXPORT_SYMBOL(uiommu_fdget);
> >
> > EXPORT_SYMBOL_GPL
> > .. snip
> > > +EXPORT_SYMBOL(uiommu_put);
> >
> > ditto.
> >
>
> Is there a definitive explanation somewhere of when to use each?
For a driver like this, that is very tied to the way that the kernel
works, I would recommend the _GPL marking, like the UIO interface has.
But that's just me. :)
thanks,
greg k-h
On 06/11/2010 04:58 AM, Tom Lyon wrote:
> On Thursday 10 June 2010 10:27:36 am Konrad Rzeszutek Wilk wrote:
>
>>> +EXPORT_SYMBOL(uiommu_fdget);
>>>
>> EXPORT_SYMBOL_GPL
>> .. snip
>>
>>> +EXPORT_SYMBOL(uiommu_put);
>>>
>> ditto.
>>
>>
> Is there a definitive explanation somewhere of when to use each?
>
Always use _GPL unless you have a defensible reason why you shouldn't.
The kernel's license if GPL, exporting a symbol without GPL can be seen
as adding an exception to the license.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
The inline comments are getting pretty hard to wade through, so I'm deleting some
of the lesser stuff - but I am incorporating it into the code.
On Tuesday 08 June 2010 10:45:57 pm Michael S. Tsirkin wrote:
> On Tue, Jun 08, 2010 at 04:54:43PM -0700, Tom Lyon wrote:
> > On Tuesday 08 June 2010 03:38:44 pm Michael S. Tsirkin wrote:
> > > On Tue, Jun 08, 2010 at 02:21:52PM -0700, Tom Lyon wrote:
...
> > > > + /* reset to known state if we can */
> > > > + (void) pci_reset_function(vdev->pdev);
> > >
> > > We are opening the device - how can it not be in a known state?
> > If an alternate driver left it in a weird state.
>
> Don't we care if it fails then? I think we do ...
> > > And we should make sure (at open time) we *can* reset on close, fail binding if we can't.
> > How do you propose?
>
> Fail open if reset fails on open?
OK, it'll now fail open if reset fails.
[ bunch of stuff about MSI-X checking and IOMMUs and config registers...]
OK, here's the thing. The IOMMU API today does not do squat about dealing with
interrupts. Interrupts are special because the APIC addresses are not each in
their own page. Yes, the IOMMU hardware supports it (at least Intel), and there's
some Intel intr remapping code (not AMD), but it doesn't look like it is enough.
Therefore, we must not allow the user level driver to diddle the MSI or MSI-X
areas - either in config space or in the device memory space. If the device
doesn't have its MSI-X registers in nice page aligned areas, then it is not
"well-behaved" and it is S.O.L. The SR-IOV spec recommends that devices be
designed the well-behaved way.
When the code in vfio_pci_config speaks of "virtualization" it means that there
are fake registers which the user driver can read or write, but do not affect the
real registers. BARs are one case, MSI regs another. The PCI vendor and device
ID are virtual because SR-IOV doesn't supply them but I wanted the user driver
to find them in the same old place.
> > > > + case VFIO_DMA_MASK: /* set master mode and DMA mask */
> > > > + if (copy_from_user(&mask, uarg, sizeof mask))
> > > > + return -EFAULT;
> > > > + pci_set_master(pdev);
> > >
> > > This better be done by userspace when it sees fit.
> > > Otherwise device might corrupt userspace memory.
This ioctl is gone now - it was vestigial from the dma_sg_map interface.
[ Re: Hotplug and Suspend/Resume]
There are *plenty* of real drivers - brand new ones - which don't bother
with these today. Yeah, I can see adding them to the framework someday -
but if there's no urgent need then it is way down the priority list. Meanwhile,
the other uses beckon. And I never heard the Infiniband users complaining
about not having these things.
> > > > + pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
> > >
> > > If you map a 4G region, this will try to allocate 8Mbytes?
> > Yes. C'est la vie.
>
> First of all, this will fail - and the request is quite real with
> decent sized guests. Second, with appropriately sized allocs before failing
> it will stress the system pretty hard. Split it in chunks of 4K or something.
Changed to use vmalloc/vfree - don't need physical contiguity.
On Fri, Jun 11, 2010 at 03:15:53PM -0700, Tom Lyon wrote:
> [ bunch of stuff about MSI-X checking and IOMMUs and config registers...]
>
> OK, here's the thing. The IOMMU API today does not do squat about
> dealing with interrupts. Interrupts are special because the APIC
> addresses are not each in their own page. Yes, the IOMMU hardware
> supports it (at least Intel), and there's some Intel intr remapping
> code (not AMD), but it doesn't look like it is enough.
The iommu book from AMD seems to say that the interrupt remapping table
address is taken from the device table entry. So hardware support seems
to be there, and to me it looks like it should be enough.
Need to look at the iommu/msi code some more to figure out
whether what linux does is handling this correctly -
if it doesn't we need to fix that.
> Therefore, we must not allow the user level driver to diddle the MSI
> or MSI-X areas - either in config space or in the device memory space.
It won't help.
Consider that you want to let a userspace driver control
a device with DMA capabilities.
So if there is a range of addresses the device
can write into that can break the host, these writes
can be triggered by userspace. Limiting
userspace access to the MSI registers won't help:
you need a way to protect the host from the device.
> If the device doesn't have its MSI-X registers in nice page aligned
> areas, then it is not "well-behaved" and it is S.O.L. The SR-IOV spec
> recommends that devices be designed the well-behaved way.
>
> When the code in vfio_pci_config speaks of "virtualization" it means
> that there are fake registers which the user driver can read or write,
> but do not affect the real registers. BARs are one case, MSI regs
> another. The PCI vendor and device ID are virtual because SR-IOV
> doesn't supply them but I wanted the user driver to find them in the
> same old place.
Sorry, I still don't understand why we bother. All this is already
implemented in userspace. Why can't we just use this existing userspace
implementation? It seems that all the kernel needs to do is prevent
userspace from writing BARs.
Why can't we replace all this complexity with basically:
if (addr <= PCI_BASE_ADDRESS_5 && addr + len >= PCI_BASE_ADDRESS_0)
return -EPERM;
And maybe another register or two. Most registers should be fine.
> [ Re: Hotplug and Suspend/Resume]
> There are *plenty* of real drivers - brand new ones - which don't
> bother with these today. Yeah, I can see adding them to the framework
> someday - but if there's no urgent need then it is way down the
> priority list.
Well, for kernel drivers everything mostly works out of the box; it is
handled by the PCI subsystem. So some kind of framework will need to be
added for userspace drivers as well. And I suspect this issue won't be
fixable later without breaking applications.
> Meanwhile, the other uses beckon.
Which other uses? I thought the whole point was fixing
what's broken with the current kvm implementation.
So it seems to me we should not rush it, ignoring existing issues such
as hotplug.
> And I never heard
> the Infiniband users complaining about not having these things.
I did.
--
MST
On Sunday 13 June 2010 03:23:39 am Michael S. Tsirkin wrote:
> On Fri, Jun 11, 2010 at 03:15:53PM -0700, Tom Lyon wrote:
> > [ bunch of stuff about MSI-X checking and IOMMUs and config registers...]
> >
> > OK, here's the thing. The IOMMU API today does not do squat about
> > dealing with interrupts. Interrupts are special because the APIC
> > addresses are not each in their own page. Yes, the IOMMU hardware
> > supports it (at least Intel), and there's some Intel intr remapping
> > code (not AMD), but it doesn't look like it is enough.
>
> The iommu book from AMD seems to say that interrupt remapping table
> address is taken from the device table entry. So hardware support seems
> to be there, and to me it looks like it should be enough.
> Need to look at the iommu/msi code some more to figure out
> whether what linux does is handling this correctly -
> if it doesn't we need to fix that.
>
> > Therefore, we must not allow the user level driver to diddle the MSI
> > or MSI-X areas - either in config space or in the device memory space.
>
> It won't help.
> Consider that you want to let a userspace driver control
> the device with DMA capabilities.
>
> So if there is a range of addresses that device
> can write into that can break host, these writes
> can be triggered by userspace. Limiting
> userspace access to MSI registers won't help:
> you need a way to protect host from the device.
OK, after more investigation, I realize you are right.
We definitely need the IOMMU protection for interrupts, and
if we have it, a lot of the code for config space protection is pointless.
It does seem that the Intel intr_remapping code does what we want
(accidentally) but that the AMD iommu code does not yet do any
interrupt remapping. Joerg - can you comment? On the roadmap?
I should have an AMD system with an IOMMU in a couple of days, so I
can check this out.
>
> > If the device doesn't have its MSI-X registers in nice page aligned
> > areas, then it is not "well-behaved" and it is S.O.L. The SR-IOV spec
> > recommends that devices be designed the well-behaved way.
> >
> > When the code in vfio_pci_config speaks of "virtualization" it means
> > that there are fake registers which the user driver can read or write,
> > but do not affect the real registers. BARs are one case, MSI regs
> > another. The PCI vendor and device ID are virtual because SR-IOV
> > doesn't supply them but I wanted the user driver to find them in the
> > same old place.
>
> Sorry, I still don't understand why we bother. All this is already
> implemented in userspace. Why can't we just use this existing userspace
> implementation? It seems that all the kernel needs to do is prevent
> userspace from writing BARs.
I assume the userspace of which you speak is qemu? This is not what I'm
doing with vfio - I'm interested in the HPC networking model of direct
user space access to the network.
> Why can't we replace all this complexity with basically:
>
> if (addr <= PCI_BASE_ADDRESS_5 && addr + len >= PCI_BASE_ADDRESS_0)
> return -EPERM;
>
> And maybe another register or two. Most registers should be fine.
>
> > [ Re: Hotplug and Suspend/Resume]
> > There are *plenty* of real drivers - brand new ones - which don't
> > bother with these today. Yeah, I can see adding them to the framework
> > someday - but if there's no urgent need then it is way down the
> > priority list.
>
> Well, for kernel drivers everything mostly works out of the box, it is
> handled by the PCI subsystem. So some kind of framework will need to be
> added for userspace drivers as well. And I suspect this issue won't be
> fixable later without breaking applications.
Whatever works out of the box for the kernel drivers which don't implement
suspend/resume will work for the user level drivers which don't.
>
> > Meanwhile, the other uses beckon.
>
> Which other uses? I thought the whole point was fixing
> what's broken with the current kvm implementation.
> So it seems to me we should not rush it, ignoring existing issues such
> as hotplug.
Non-kvm cases. That don't care about suspend/resume.
On Thu, Jun 17, 2010 at 02:14:00PM -0700, Tom Lyon wrote:
> On Sunday 13 June 2010 03:23:39 am Michael S. Tsirkin wrote:
> > On Fri, Jun 11, 2010 at 03:15:53PM -0700, Tom Lyon wrote:
> > > [ bunch of stuff about MSI-X checking and IOMMUs and config registers...]
> > >
> > > OK, here's the thing. The IOMMU API today does not do squat about
> > > dealing with interrupts. Interrupts are special because the APIC
> > > addresses are not each in their own page. Yes, the IOMMU hardware
> > > supports it (at least Intel), and there's some Intel intr remapping
> > > code (not AMD), but it doesn't look like it is enough.
> >
> > The iommu book from AMD seems to say that interrupt remapping table
> > address is taken from the device table entry. So hardware support seems
> > to be there, and to me it looks like it should be enough.
> > Need to look at the iommu/msi code some more to figure out
> > whether what linux does is handling this correctly -
> > if it doesn't we need to fix that.
> >
> > > Therefore, we must not allow the user level driver to diddle the MSI
> > > or MSI-X areas - either in config space or in the device memory space.
> >
> > It won't help.
> > Consider that you want to let a userspace driver control
> > the device with DMA capabilities.
> >
> > So if there is a range of addresses that device
> > can write into that can break host, these writes
> > can be triggered by userspace. Limiting
> > userspace access to MSI registers won't help:
> > you need a way to protect host from the device.
>
> OK, after more investigation, I realize you are right.
> We definitely need the IOMMU protection for interrupts, and
> if we have it, a lot of the code for config space protection is pointless.
> It does seem that the Intel intr_remapping code does what we want
> (accidentally) but that the AMD iommu code does not yet do any
> interrupt remapping. Joerg - can you comment? On the roadmap?
>
> I should have an AMD system w IOMMU in a couple of days, so I
> can check this out.
>
> >
> > > If the device doesn't have its MSI-X registers in nice page aligned
> > > areas, then it is not "well-behaved" and it is S.O.L. The SR-IOV spec
> > > recommends that devices be designed the well-behaved way.
> > >
> > > When the code in vfio_pci_config speaks of "virtualization" it means
> > > that there are fake registers which the user driver can read or write,
> > > but do not affect the real registers. BARs are one case, MSI regs
> > > another. The PCI vendor and device ID are virtual because SR-IOV
> > > doesn't supply them but I wanted the user driver to find them in the
> > > same old place.
> >
> > Sorry, I still don't understand why we bother. All this is already
> > implemented in userspace. Why can't we just use this existing userspace
> > implementation? It seems that all the kernel needs to do is prevent
> > userspace from writing BARs.
>
> I assume the userspace of which you speak is qemu? This is not what I'm
> doing with vfio - I'm interested in the HPC networking model of direct
> user space access to the network.
In that case, I don't see the point of fake config space registers
at all. For virtualization we need them to run unmodified guest drivers,
but we have an implementation in qemu. If you write your own custom drivers
for HPC, just get -EPERM from the kernel and handle it.
> > Why can't we replace all this complexity with basically:
> >
> > if (addr <= PCI_BASE_ADDRESS_5 && addr + len >= PCI_BASE_ADDRESS_0)
> > return -EPERM;
> >
> > And maybe another register or two. Most registers should be fine.
> >
> > > [ Re: Hotplug and Suspend/Resume]
> > > There are *plenty* of real drivers - brand new ones - which don't
> > > bother with these today. Yeah, I can see adding them to the framework
> > > someday - but if there's no urgent need then it is way down the
> > > priority list.
> >
> > Well, for kernel drivers everything mostly works out of the box, it is
> > handled by the PCI subsystem. So some kind of framework will need to be
> > added for userspace drivers as well. And I suspect this issue won't be
> > fixable later without breaking applications.
>
> Whatever works out of the box for the kernel drivers which don't implement
> suspend/resume will work for the user level drivers which don't.
How will hotplug/driver unload work then?
> >
> > > Meanwhile, the other uses beckon.
> >
> > Which other uses? I thought the whole point was fixing
> > what's broken with the current kvm implementation.
> > So it seems to me we should not rush it, ignoring existing issues such
> > as hotplug.
> Non-kvm cases.
Try describing these use-cases in more detail then.
And given that you want to use this driver for networking,
please copy netdev, not just linux-kernel.
> That don't care about suspend/resume.
>
>
>
Let's start with hotplug please. This is a really clear-cut issue:
all pci drivers get an event on hotplug and hot-unplug; you need
such events for userspace, and you need them on day 0. Otherwise
userspace which ignores hotplug will get written, and you will
have to support it forever.
--
MST
On Thu, Jun 17, 2010 at 02:14:00PM -0700, Tom Lyon wrote:
> On Sunday 13 June 2010 03:23:39 am Michael S. Tsirkin wrote:
> OK, after more investigation, I realize you are right.
> We definitely need the IOMMU protection for interrupts, and
> if we have it, a lot of the code for config space protection is pointless.
> It does seem that the Intel intr_remapping code does what we want
> (accidentally) but that the AMD iommu code does not yet do any
> interrupt remapping. Joerg - can you comment? On the roadmap?
Work on this is planned, but not at a high priority right now. I can
re-prioritize this item if needed.
Joerg
On Thu, Jun 24, 2010 at 02:22:19PM +0200, Joerg Roedel wrote:
> On Thu, Jun 17, 2010 at 02:14:00PM -0700, Tom Lyon wrote:
> > On Sunday 13 June 2010 03:23:39 am Michael S. Tsirkin wrote:
>
> > OK, after more investigation, I realize you are right.
> > We definitely need the IOMMU protection for interrupts, and
> > if we have it, a lot of the code for config space protection is pointless.
> > It does seem that the Intel intr_remapping code does what we want
> > (accidentally) but that the AMD iommu code does not yet do any
> > interrupt remapping. Joerg - can you comment? On the roadmap?
>
> Work on this is planned, but not at a high priority by now. I can
> re-prioritize this item if needed.
>
> Joerg
As a stop-gap measure, we could get by with a
portable API that lets us figure out whether a given IOMMU supports
interrupt remapping.
--
MST
On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> The VFIO "driver" is used to allow privileged AND non-privileged processes to
> implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> devices.
Hi Tom,
I found a few bugs. Patch below. The first chunk clears the
pci_config_map on close; otherwise we end up passing virtualized state
from one user to the next. The second is an off-by-one in the basic
perms. Finally, vfio_bar_fixup() needs an overhaul: it wasn't setting
the lower bits right and was allowing virtual writes of bits that aren't
aligned to the size. This section probably needs another pass or two of
refinement. Thanks,
Alex
Signed-off-by: Alex Williamson <[email protected]>
---
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 96639e5..a0e8227 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -129,6 +129,10 @@ static int vfio_release(struct inode *inode, struct file *filep)
eventfd_ctx_put(vdev->ev_msi);
vdev->ev_irq = NULL;
}
+ if (vdev->pci_config_map) {
+ kfree(vdev->pci_config_map);
+ vdev->pci_config_map = NULL;
+ }
vfio_domain_unset(vdev);
/* reset to known state if we can */
(void) pci_reset_function(vdev->pdev);
diff --git a/drivers/vfio/vfio_pci_config.c b/drivers/vfio/vfio_pci_config.c
index c821b5d..f6e26b1 100644
--- a/drivers/vfio/vfio_pci_config.c
+++ b/drivers/vfio/vfio_pci_config.c
@@ -79,18 +79,18 @@ struct perm_bits {
static struct perm_bits pci_cap_basic_perm[] = {
{ 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
{ 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
- { 0, 0xFF00FFFF, }, /* 0x08 bist, htype, lat, cache */
- { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x0c bar */
+ { 0, 0, }, /* 0x08 class code & revision id */
+ { 0, 0xFF00FFFF, }, /* 0x0c bist, htype, lat, cache */
{ 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
{ 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
{ 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
{ 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
{ 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
- { 0, 0, }, /* 0x24 cardbus - not yet */
- { 0, 0, }, /* 0x28 subsys vendor & dev */
- { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x2c rom bar */
- { 0, 0, }, /* 0x30 capability ptr & resv */
- { 0, 0, }, /* 0x34 resv */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x24 bar */
+ { 0, 0, }, /* 0x28 cardbus - not yet */
+ { 0, 0, }, /* 0x2c subsys vendor & dev */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x30 rom bar */
+ { 0, 0, }, /* 0x34 capability ptr & resv */
{ 0, 0, }, /* 0x38 resv */
{ 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
};
@@ -318,30 +318,55 @@ static void vfio_virt_init(struct vfio_dev *vdev)
static void vfio_bar_fixup(struct vfio_dev *vdev)
{
struct pci_dev *pdev = vdev->pdev;
- int bar;
- u32 *lp;
- u32 len;
+ int bar, mem64 = 0;
+ u32 *lp = NULL;
+ u64 len = 0;
for (bar = 0; bar <= 5; bar++) {
- len = pci_resource_len(pdev, bar);
- lp = (u32 *)&vdev->vinfo.bar[bar * 4];
- if (len == 0) {
- *lp = 0;
- } else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
- *lp &= ~0x1;
- *lp = (*lp & ~(len-1)) |
- (*lp & ~PCI_BASE_ADDRESS_MEM_MASK);
- if (*lp & PCI_BASE_ADDRESS_MEM_TYPE_64)
- bar++;
- } else if (pci_resource_flags(pdev, bar) & IORESOURCE_IO) {
+ if (!mem64) {
+ len = pci_resource_len(pdev, bar);
+ lp = (u32 *)&vdev->vinfo.bar[bar * 4];
+ if (len == 0) {
+ *lp = 0;
+ continue;
+ }
+
+ len = ~(len - 1);
+ } else
+ len >>= 32;
+
+ if (*lp == ~0U)
+ *lp = (u32)len;
+ else
+ *lp &= (u32)len;
+
+ if (mem64) {
+ mem64 = 0;
+ continue;
+ }
+
+ if (pci_resource_flags(pdev, bar) & IORESOURCE_IO)
*lp |= PCI_BASE_ADDRESS_SPACE_IO;
- *lp = (*lp & ~(len-1)) |
- (*lp & ~PCI_BASE_ADDRESS_IO_MASK);
+ else {
+ *lp |= PCI_BASE_ADDRESS_SPACE_MEMORY;
+ if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
+ *lp |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+ if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM_64) {
+ *lp |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+ mem64 = 1;
+ }
}
}
+
lp = (u32 *)vdev->vinfo.rombar;
len = pci_resource_len(pdev, PCI_ROM_RESOURCE);
- *lp = *lp & PCI_ROM_ADDRESS_MASK & ~(len-1);
+ len = ~(len - 1);
+
+ if (*lp == ~PCI_ROM_ADDRESS_ENABLE)
+ *lp = (u32)len;
+ else
+ *lp = *lp & ((u32)len | PCI_ROM_ADDRESS_ENABLE);
+
vdev->vinfo.bardirty = 0;
}
On Wed, Jun 30, 2010 at 12:14:12AM -0600, Alex Williamson wrote:
> On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> > The VFIO "driver" is used to allow privileged AND non-privileged processes to
> > implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> > devices.
>
> Hi Tom,
>
> I found a few bugs. Patch below. The first chunk clears the
> pci_config_map on close, otherwise we end up passing virtualized state
> from one user to the next. The second is an off by one in the basic
> perms. Finally, vfio_bar_fixup() needs an overhaul. It wasn't setting
> the lower bits right and is allowing virtual writes of bits that aren't
> aligned to the size. This section probably needs another pass or two of
> refinement. Thanks,
>
> Alex
>
> Signed-off-by: Alex Williamson <[email protected]>
I still don't see why we are sticking all this emulation in the kernel. It
is far from the performance hotpath and can easily be emulated in userspace.
qemu does this; you can lift code from there if you like.
Maybe we need to protect the BARs from being manipulated by userspace,
but that should be all. No need for tables.
> ---
>
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 96639e5..a0e8227 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -129,6 +129,10 @@ static int vfio_release(struct inode *inode, struct file *filep)
> eventfd_ctx_put(vdev->ev_msi);
> vdev->ev_irq = NULL;
> }
> + if (vdev->pci_config_map) {
> + kfree(vdev->pci_config_map);
> + vdev->pci_config_map = NULL;
> + }
> vfio_domain_unset(vdev);
> /* reset to known state if we can */
> (void) pci_reset_function(vdev->pdev);
> diff --git a/drivers/vfio/vfio_pci_config.c b/drivers/vfio/vfio_pci_config.c
> index c821b5d..f6e26b1 100644
> --- a/drivers/vfio/vfio_pci_config.c
> +++ b/drivers/vfio/vfio_pci_config.c
> @@ -79,18 +79,18 @@ struct perm_bits {
> static struct perm_bits pci_cap_basic_perm[] = {
> { 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
> { 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
> - { 0, 0xFF00FFFF, }, /* 0x08 bist, htype, lat, cache */
> - { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x0c bar */
> + { 0, 0, }, /* 0x08 class code & revision id */
> + { 0, 0xFF00FFFF, }, /* 0x0c bist, htype, lat, cache */
> { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
> { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
> { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
> { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
> { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
> - { 0, 0, }, /* 0x24 cardbus - not yet */
> - { 0, 0, }, /* 0x28 subsys vendor & dev */
> - { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x2c rom bar */
> - { 0, 0, }, /* 0x30 capability ptr & resv */
> - { 0, 0, }, /* 0x34 resv */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x24 bar */
> + { 0, 0, }, /* 0x28 cardbus - not yet */
> + { 0, 0, }, /* 0x2c subsys vendor & dev */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x30 rom bar */
> + { 0, 0, }, /* 0x34 capability ptr & resv */
> { 0, 0, }, /* 0x38 resv */
> { 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
> };
> @@ -318,30 +318,55 @@ static void vfio_virt_init(struct vfio_dev *vdev)
> static void vfio_bar_fixup(struct vfio_dev *vdev)
> {
> struct pci_dev *pdev = vdev->pdev;
> - int bar;
> - u32 *lp;
> - u32 len;
> + int bar, mem64 = 0;
> + u32 *lp = NULL;
> + u64 len = 0;
>
> for (bar = 0; bar <= 5; bar++) {
> - len = pci_resource_len(pdev, bar);
> - lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> - if (len == 0) {
> - *lp = 0;
> - } else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
> - *lp &= ~0x1;
> - *lp = (*lp & ~(len-1)) |
> - (*lp & ~PCI_BASE_ADDRESS_MEM_MASK);
> - if (*lp & PCI_BASE_ADDRESS_MEM_TYPE_64)
> - bar++;
> - } else if (pci_resource_flags(pdev, bar) & IORESOURCE_IO) {
> + if (!mem64) {
> + len = pci_resource_len(pdev, bar);
> + lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> + if (len == 0) {
> + *lp = 0;
> + continue;
> + }
> +
> + len = ~(len - 1);
> + } else
> + len >>= 32;
> +
> + if (*lp == ~0U)
> + *lp = (u32)len;
> + else
> + *lp &= (u32)len;
> +
> + if (mem64) {
> + mem64 = 0;
> + continue;
> + }
> +
> + if (pci_resource_flags(pdev, bar) & IORESOURCE_IO)
> *lp |= PCI_BASE_ADDRESS_SPACE_IO;
> - *lp = (*lp & ~(len-1)) |
> - (*lp & ~PCI_BASE_ADDRESS_IO_MASK);
> + else {
> + *lp |= PCI_BASE_ADDRESS_SPACE_MEMORY;
> + if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
> + *lp |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> + if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM_64) {
> + *lp |= PCI_BASE_ADDRESS_MEM_TYPE_64;
> + mem64 = 1;
> + }
> }
> }
> +
> lp = (u32 *)vdev->vinfo.rombar;
> len = pci_resource_len(pdev, PCI_ROM_RESOURCE);
> - *lp = *lp & PCI_ROM_ADDRESS_MASK & ~(len-1);
> + len = ~(len - 1);
> +
> + if (*lp == ~PCI_ROM_ADDRESS_ENABLE)
> + *lp = (u32)len;
> + else
> + *lp = *lp & ((u32)len | PCI_ROM_ADDRESS_ENABLE);
> +
> vdev->vinfo.bardirty = 0;
> }
>
>
On Wed, 2010-06-30 at 16:36 +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 30, 2010 at 12:14:12AM -0600, Alex Williamson wrote:
> > On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> > > The VFIO "driver" is used to allow privileged AND non-privileged processes to
> > > implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> > > devices.
> >
> > Hi Tom,
> >
> > I found a few bugs. Patch below. The first chunk clears the
> > pci_config_map on close, otherwise we end up passing virtualized state
> > from one user to the next. The second is an off by one in the basic
> > perms. Finally, vfio_bar_fixup() needs an overhaul. It wasn't setting
> > the lower bits right and is allowing virtual writes of bits that aren't
> > aligned to the size. This section probably needs another pass or two of
> > refinement. Thanks,
> >
> > Alex
> >
>
> I still don't see why we are sticking all this emulation in the kernel. It
> is far from the performance hotpath and can easily be emulated in userspace.
> qemu does this; you can lift code from there if you like.
> Maybe we need to protect the BARs from being manipulated by userspace,
> but that should be all. No need for tables.
The benefit I see so far is that it removes duplicate code. Should
every user of this interface try to extract qemu's PCI config space
emulation and jury-rig it into their code base? Tom is already
providing access to more capability bits than the kvm device assignment
code. If the kernel community will accept it, I think it saves vfio
userspace writers some hassle and provides a better environment by
having the emulation in a single, well tested, hopefully well used place.
Thanks,
Alex
Thanks, Alex!
Am incorporating...
On Tuesday 29 June 2010 11:14:12 pm Alex Williamson wrote:
> On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> > The VFIO "driver" is used to allow privileged AND non-privileged processes to
> > implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> > devices.
>
> Hi Tom,
>
> I found a few bugs. Patch below. The first chunk clears the
> pci_config_map on close, otherwise we end up passing virtualized state
> from one user to the next. The second is an off by one in the basic
> perms. Finally, vfio_bar_fixup() needs an overhaul. It wasn't setting
> the lower bits right and is allowing virtual writes of bits that aren't
> aligned to the size. This section probably needs another pass or two of
> refinement. Thanks,
>
> Alex
>
> Signed-off-by: Alex Williamson <[email protected]>
> ---
>
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 96639e5..a0e8227 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -129,6 +129,10 @@ static int vfio_release(struct inode *inode, struct file *filep)
> eventfd_ctx_put(vdev->ev_msi);
> vdev->ev_irq = NULL;
> }
> + if (vdev->pci_config_map) {
> + kfree(vdev->pci_config_map);
> + vdev->pci_config_map = NULL;
> + }
> vfio_domain_unset(vdev);
> /* reset to known state if we can */
> (void) pci_reset_function(vdev->pdev);
> diff --git a/drivers/vfio/vfio_pci_config.c b/drivers/vfio/vfio_pci_config.c
> index c821b5d..f6e26b1 100644
> --- a/drivers/vfio/vfio_pci_config.c
> +++ b/drivers/vfio/vfio_pci_config.c
> @@ -79,18 +79,18 @@ struct perm_bits {
> static struct perm_bits pci_cap_basic_perm[] = {
> { 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
> { 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
> - { 0, 0xFF00FFFF, }, /* 0x08 bist, htype, lat, cache */
> - { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x0c bar */
> + { 0, 0, }, /* 0x08 class code & revision id */
> + { 0, 0xFF00FFFF, }, /* 0x0c bist, htype, lat, cache */
> { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
> { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
> { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
> { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
> { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
> - { 0, 0, }, /* 0x24 cardbus - not yet */
> - { 0, 0, }, /* 0x28 subsys vendor & dev */
> - { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x2c rom bar */
> - { 0, 0, }, /* 0x30 capability ptr & resv */
> - { 0, 0, }, /* 0x34 resv */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x24 bar */
> + { 0, 0, }, /* 0x28 cardbus - not yet */
> + { 0, 0, }, /* 0x2c subsys vendor & dev */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x30 rom bar */
> + { 0, 0, }, /* 0x34 capability ptr & resv */
> { 0, 0, }, /* 0x38 resv */
> { 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
> };
> @@ -318,30 +318,55 @@ static void vfio_virt_init(struct vfio_dev *vdev)
> static void vfio_bar_fixup(struct vfio_dev *vdev)
> {
> struct pci_dev *pdev = vdev->pdev;
> - int bar;
> - u32 *lp;
> - u32 len;
> + int bar, mem64 = 0;
> + u32 *lp = NULL;
> + u64 len = 0;
>
> for (bar = 0; bar <= 5; bar++) {
> - len = pci_resource_len(pdev, bar);
> - lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> - if (len == 0) {
> - *lp = 0;
> - } else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
> - *lp &= ~0x1;
> - *lp = (*lp & ~(len-1)) |
> - (*lp & ~PCI_BASE_ADDRESS_MEM_MASK);
> - if (*lp & PCI_BASE_ADDRESS_MEM_TYPE_64)
> - bar++;
> - } else if (pci_resource_flags(pdev, bar) & IORESOURCE_IO) {
> + if (!mem64) {
> + len = pci_resource_len(pdev, bar);
> + lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> + if (len == 0) {
> + *lp = 0;
> + continue;
> + }
> +
> + len = ~(len - 1);
> + } else
> + len >>= 32;
> +
> + if (*lp == ~0U)
> + *lp = (u32)len;
> + else
> + *lp &= (u32)len;
> +
> + if (mem64) {
> + mem64 = 0;
> + continue;
> + }
> +
> + if (pci_resource_flags(pdev, bar) & IORESOURCE_IO)
> *lp |= PCI_BASE_ADDRESS_SPACE_IO;
> - *lp = (*lp & ~(len-1)) |
> - (*lp & ~PCI_BASE_ADDRESS_IO_MASK);
> + else {
> + *lp |= PCI_BASE_ADDRESS_SPACE_MEMORY;
> + if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
> + *lp |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> + if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM_64) {
> + *lp |= PCI_BASE_ADDRESS_MEM_TYPE_64;
> + mem64 = 1;
> + }
> }
> }
> +
> lp = (u32 *)vdev->vinfo.rombar;
> len = pci_resource_len(pdev, PCI_ROM_RESOURCE);
> - *lp = *lp & PCI_ROM_ADDRESS_MASK & ~(len-1);
> + len = ~(len - 1);
> +
> + if (*lp == ~PCI_ROM_ADDRESS_ENABLE)
> + *lp = (u32)len;
> + else
> + *lp = *lp & ((u32)len | PCI_ROM_ADDRESS_ENABLE);
> +
> vdev->vinfo.bardirty = 0;
> }
>
>
>
>
On Wed, Jun 30, 2010 at 03:17:55PM -0700, Tom Lyon wrote:
> Thanks, Alex!
> Am incorporating...
I take it there's no chance you'll drop the "virtualization"
from the driver then?
--
MST
On Wednesday 30 June 2010 03:32:56 pm Michael S. Tsirkin wrote:
> On Wed, Jun 30, 2010 at 03:17:55PM -0700, Tom Lyon wrote:
> > Thanks, Alex!
> > Am incorporating...
>
> I take it there's no chance you'll drop the "virtualization"
> from the driver then?
>
I think it'll get a whole lot simpler by depending on IOMMU interrupt
remapping, but I think some of it is still useful.
I am now stuck trying to get access to a system with interrupt remapping;
it turns out my Intel IOMMU doesn't have it.
On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> +int vfio_dma_unmap_dm(struct vfio_listener *listener, struct vfio_dma_map *dmp)
> +{
> + unsigned long start, npage;
> + struct dma_map_page *mlp;
> + struct list_head *pos, *pos2;
> + int ret;
> +
> + start = dmp->vaddr & ~PAGE_SIZE;
> + npage = dmp->size >> PAGE_SHIFT;
> +
> + ret = -ENXIO;
> + mutex_lock(&listener->vdev->dgate);
> + list_for_each_safe(pos, pos2, &listener->dm_list) {
> + mlp = list_entry(pos, struct dma_map_page, list);
> + if (dmp->vaddr != mlp->vaddr || mlp->npage != npage)
> + continue;
> + ret = 0;
> + vfio_dma_unmap(listener, mlp);
> + break;
> + }
Hi Tom,
Shouldn't we be matching the mlp based on daddr instead of vaddr? We
can have multiple dma address pointing at the same virtual address, so
dma address is the unique element. I'm also nervous about this dm_list.
For qemu device assignment, we're potentially statically mapping many GB
of iova space. It seems like this could get incredibly bloated and
slow. Thanks,
Alex
On Wednesday 30 June 2010 09:16:23 pm Alex Williamson wrote:
> On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> > +int vfio_dma_unmap_dm(struct vfio_listener *listener, struct vfio_dma_map *dmp)
> > +{
> > + unsigned long start, npage;
> > + struct dma_map_page *mlp;
> > + struct list_head *pos, *pos2;
> > + int ret;
> > +
> > + start = dmp->vaddr & ~PAGE_SIZE;
> > + npage = dmp->size >> PAGE_SHIFT;
> > +
> > + ret = -ENXIO;
> > + mutex_lock(&listener->vdev->dgate);
> > + list_for_each_safe(pos, pos2, &listener->dm_list) {
> > + mlp = list_entry(pos, struct dma_map_page, list);
> > + if (dmp->vaddr != mlp->vaddr || mlp->npage != npage)
> > + continue;
> > + ret = 0;
> > + vfio_dma_unmap(listener, mlp);
> > + break;
> > + }
>
> Hi Tom,
>
> Shouldn't we be matching the mlp based on daddr instead of vaddr? We
> can have multiple dma address pointing at the same virtual address, so
> dma address is the unique element. I'm also nervous about this dm_list.
> For qemu device assignment, we're potentially statically mapping many GB
> of iova space. It seems like this could get incredibly bloated and
> slow. Thanks,
>
> Alex
In weird circumstances, differing user vaddrs could resolve to the same
physical address, so the uniqueness of any mapping is the <vaddr,len>.
Yes, a linear list is slow, but does qemu need a lot of mappings, or just big ones?
>
>
On Wed, 2010-06-30 at 21:30 -0700, Tom Lyon wrote:
> On Wednesday 30 June 2010 09:16:23 pm Alex Williamson wrote:
> > On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> > > +int vfio_dma_unmap_dm(struct vfio_listener *listener, struct vfio_dma_map *dmp)
> > > +{
> > > + unsigned long start, npage;
> > > + struct dma_map_page *mlp;
> > > + struct list_head *pos, *pos2;
> > > + int ret;
> > > +
> > > + start = dmp->vaddr & ~PAGE_SIZE;
> > > + npage = dmp->size >> PAGE_SHIFT;
> > > +
> > > + ret = -ENXIO;
> > > + mutex_lock(&listener->vdev->dgate);
> > > + list_for_each_safe(pos, pos2, &listener->dm_list) {
> > > + mlp = list_entry(pos, struct dma_map_page, list);
> > > + if (dmp->vaddr != mlp->vaddr || mlp->npage != npage)
> > > + continue;
> > > + ret = 0;
> > > + vfio_dma_unmap(listener, mlp);
> > > + break;
> > > + }
> >
> > Hi Tom,
> >
> > Shouldn't we be matching the mlp based on daddr instead of vaddr? We
> > can have multiple dma addresses pointing at the same virtual address, so
> > dma address is the unique element. I'm also nervous about this dm_list.
> > For qemu device assignment, we're potentially statically mapping many GB
> > of iova space. It seems like this could get incredibly bloated and
> > slow. Thanks,
> >
> > Alex
>
> In weird circumstances, differing user vaddrs could resolve to the same physical address,
> so the unique key for any mapping is the <vaddr,len> pair.
That sounds like another argument for using daddr to me, no? There's
only one address space on the PCI bus, daddr. There's a 1:1 mapping of
daddr to physical page, but an N:1 mapping of vaddr to a physical page.
> Yes, a linear list is slow, but does qemu need a lot of mappings, or just big ones?
I'm not sure yet, the first interface I tried seems to be giving me
handfuls of pages. We sometimes reprogram the guest physical address
(daddr) to a new virtual address (vaddr), but I don't know what the old
virtual address was, so I can't unmap it. That's when I ran into the
issue above. Thanks,
Alex
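A minimal userspace sketch of the daddr-keyed lookup Alex is arguing for; the struct here is a hypothetical, simplified stand-in for the driver's dma_map_page list entry (the real one also tracks the pinned pages), and find_by_daddr is an invented name:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the dm_list entry. */
struct dma_map_page {
	unsigned long daddr;	/* bus/IOVA address - unique per domain */
	unsigned long vaddr;	/* user virtual address - may repeat */
	unsigned long npage;
	struct dma_map_page *next;
};

/*
 * Match on <daddr, npage>: there is only one address space on the PCI
 * bus, so daddr identifies a mapping even when several daddrs alias
 * the same vaddr (the N:1 case Alex describes).
 */
static struct dma_map_page *find_by_daddr(struct dma_map_page *head,
					  unsigned long daddr,
					  unsigned long npage)
{
	struct dma_map_page *mlp;

	for (mlp = head; mlp; mlp = mlp->next)
		if (mlp->daddr == daddr && mlp->npage == npage)
			return mlp;
	return NULL;
}
```

With two entries sharing one vaddr but distinct daddrs, a vaddr-keyed search cannot tell them apart, while the daddr-keyed search above can.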
On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> +The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA address
> +(device dependent). It takes a single unsigned 64 bit integer as an argument.
> +This call also has the side effect of enabling PCI bus mastership.
Hi Tom,
This interface doesn't make sense for the MAP_IOVA user. Especially in
qemu, we have no idea what the DMA mask is for the device we're
assigning. It doesn't really matter though because the guest will use
bounce buffers internally once it loads the device specific drivers and
discovers the DMA mask. This only seems relevant if we're using a
DMA_MAP call that gets to pick the dmaaddr, so I'd propose we only make
this a required call for that interface, and create a separate ioctl for
actually enabling bus master. Thanks,
Alex
On Thu, Jul 01, 2010 at 09:29:04AM -0600, Alex Williamson wrote:
> On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> > +The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA address
> > +(device dependent). It takes a single unsigned 64 bit integer as an argument.
> > +This call also has the side effect of enabling PCI bus mastership.
>
> Hi Tom,
>
> This interface doesn't make sense for the MAP_IOVA user. Especially in
> qemu, we have no idea what the DMA mask is for the device we're
> assigning. It doesn't really matter though because the guest will use
> bounce buffers internally once it loads the device specific drivers and
> discovers the DMA mask. This only seems relevant if we're using a
> DMA_MAP call that gets to pick the dmaaddr, so I'd propose we only make
> this a required call for that interface, and create a separate ioctl for
> actually enabling bus master. Thanks,
>
> Alex
I expect there's no need for a separate ioctl to do this:
you can do this by writing to the control register.
--
MST
On Thu, 2010-07-01 at 18:31 +0300, Michael S. Tsirkin wrote:
> On Thu, Jul 01, 2010 at 09:29:04AM -0600, Alex Williamson wrote:
> > On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> > > +The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA address
> > > +(device dependent). It takes a single unsigned 64 bit integer as an argument.
> > > +This call also has the side effect of enabling PCI bus mastership.
> >
> > Hi Tom,
> >
> > This interface doesn't make sense for the MAP_IOVA user. Especially in
> > qemu, we have no idea what the DMA mask is for the device we're
> > assigning. It doesn't really matter though because the guest will use
> > bounce buffers internally once it loads the device specific drivers and
> > discovers the DMA mask. This only seems relevant if we're using a
> > DMA_MAP call that gets to pick the dmaaddr, so I'd propose we only make
> > this a required call for that interface, and create a separate ioctl for
> > actually enabling bus master. Thanks,
> >
> > Alex
>
> I expect there's no need for a separate ioctl to do this:
> you can do this by writing to the control register.
Nope, vfio only allows direct writes to the memory and io space bits of
the command register, all other bits are virtualized. I wonder if
that's necessary though since we require the device to be attached to an
iommu domain before we allow config space access.
Alex
On Thu, Jul 01, 2010 at 09:48:41AM -0600, Alex Williamson wrote:
> On Thu, 2010-07-01 at 18:31 +0300, Michael S. Tsirkin wrote:
> > On Thu, Jul 01, 2010 at 09:29:04AM -0600, Alex Williamson wrote:
> > > On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> > > > +The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA address
> > > > +(device dependent). It takes a single unsigned 64 bit integer as an argument.
> > > > +This call also has the side effect of enabling PCI bus mastership.
> > >
> > > Hi Tom,
> > >
> > > This interface doesn't make sense for the MAP_IOVA user. Especially in
> > > qemu, we have no idea what the DMA mask is for the device we're
> > > assigning. It doesn't really matter though because the guest will use
> > > bounce buffers internally once it loads the device specific drivers and
> > > discovers the DMA mask. This only seems relevant if we're using a
> > > DMA_MAP call that gets to pick the dmaaddr, so I'd propose we only make
> > > this a required call for that interface, and create a separate ioctl for
> > > actually enabling bus master. Thanks,
> > >
> > > Alex
> >
> > I expect there's no need for a separate ioctl to do this:
> > you can do this by writing to the control register.
>
> Nope, vfio only allows direct writes to the memory and io space bits of
> the command register,
I don't see why there's a need to protect the control register.
As far as I can see, nothing userspace does with it
can damage the host.
> all other bits are virtualized. I wonder if
> that's necessary though since we require the device to be attached to an
> iommu domain before we allow config space access.
>
> Alex
>
I don't think it's necessary. IMHO all the virtualization
tables can just be replaced with

	if (pci header type == PCI_HEADER_TYPE_NORMAL) {
		if (addr < PCI_BASE_ADDRESS_0 + 24 &&
		    addr + len >= PCI_BASE_ADDRESS_0)
			return -EPERM;
	} else {
		/* similar thing for the bridge and cardbus types */
	}

Much simpler and more readable than tables full of 0xffff.
The reason this is enough is that virt drivers like qemu already
have the code to treat the interrupt disable bit, MSI/MSI-X capabilities,
and BAR registers specially. Custom userspace drivers simply
have no reason to touch anything besides the interrupt disable bit.
--
MST
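The overlap test MST sketches can be written as a small predicate; this is only an illustration (the function name is hypothetical), treating a config write as the half-open byte range [addr, addr+len) and blocking it when it intersects the six type-0 BARs at 0x10-0x27:

```c
#include <assert.h>

#define PCI_BASE_ADDRESS_0 0x10	/* first BAR offset, as in <linux/pci_regs.h> */

/*
 * Return nonzero if a config-space write of len bytes at addr touches
 * any of the six type-0 header BARs (24 bytes starting at 0x10).
 * Two half-open ranges [a1,a2) and [b1,b2) overlap iff a1 < b2 && b1 < a2.
 */
static int bar_write_blocked(unsigned int addr, unsigned int len)
{
	return addr < PCI_BASE_ADDRESS_0 + 24 &&
	       addr + len > PCI_BASE_ADDRESS_0;
}
```

A bridge (type 1) header would use its own BAR range; the cardbus case is analogous.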
On Thursday 01 July 2010 08:48:41 am Alex Williamson wrote:
> On Thu, 2010-07-01 at 18:31 +0300, Michael S. Tsirkin wrote:
> > On Thu, Jul 01, 2010 at 09:29:04AM -0600, Alex Williamson wrote:
> > > On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> > > > +The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA address
> > > > +(device dependent). It takes a single unsigned 64 bit integer as an argument.
> > > > +This call also has the side effect of enabling PCI bus mastership.
> > >
> > > Hi Tom,
> > >
> > > This interface doesn't make sense for the MAP_IOVA user. Especially in
> > > qemu, we have no idea what the DMA mask is for the device we're
> > > assigning. It doesn't really matter though because the guest will use
> > > bounce buffers internally once it loads the device specific drivers and
> > > discovers the DMA mask. This only seems relevant if we're using a
> > > DMA_MAP call that gets to pick the dmaaddr, so I'd propose we only make
> > > this a required call for that interface, and create a separate ioctl for
> > > actually enabling bus master. Thanks,
> > >
> > > Alex
> >
> > I expect there's no need for a separate ioctl to do this:
> > you can do this by write to the control register.
>
> Nope, vfio only allows direct writes to the memory and io space bits of
> the command register, all other bits are virtualized. I wonder if
> that's necessary though since we require the device to be attached to an
> iommu domain before we allow config space access.
>
I had already gotten rid of the mask-setting & master-mode ioctl. It was a remnant
of using the dma_map_sg API, which is no longer in there. And I tweaked the
config stuff to allow writing the master bit.
Hi Tom.
A few MSI issues below. Thanks,
Alex
On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> diff -uprN linux-2.6.34/drivers/vfio/vfio_pci_config.c vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c
> --- linux-2.6.34/drivers/vfio/vfio_pci_config.c 1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c 2010-05-28 14:26:47.000000000 -0700
> +/*
> + * Lengths of PCI Config Capabilities
> + * 0 means unknown (but at least 4)
> + * FF means special/variable
> + */
> +static u8 pci_capability_length[] = {
> + [PCI_CAP_ID_BASIC] = 64, /* pci config header */
> + [PCI_CAP_ID_PM] = PCI_PM_SIZEOF,
> + [PCI_CAP_ID_AGP] = PCI_AGP_SIZEOF,
> + [PCI_CAP_ID_VPD] = 8,
> + [PCI_CAP_ID_SLOTID] = 4,
> + [PCI_CAP_ID_MSI] = 0xFF, /* 10, 14, or 24 */
I think this is actually 10, 14, 20, or 24.
> +static struct perm_bits pci_cap_msi_perm[] = {
> + { 0, 0, }, /* 0x00 MSI message control */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
> + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message addr/data */
> + { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */
> + { 0, 0xFFFFFFFF, }, /* 0x10 MSI mask bits */
> + { 0, 0xFFFFFFFF, }, /* 0x14 MSI pending bits */
> +};
Is there any reason for mask bits to have virtualized writes? I don't
think we can support all 4 MSI capability sizes with this one table. We
probably need a 32bit and 64bit version, then we can drop the mask,
pending, and reserved fields via the length.
> + if (len == 0xFF) {
> + switch (cap) {
> + case PCI_CAP_ID_MSI:
> + ret = pci_read_config_word(pdev,
> + pos + PCI_MSI_FLAGS, &flags);
> + if (ret < 0)
> + return ret;
> + if (flags & PCI_MSI_FLAGS_MASKBIT)
> + /* per vec masking */
> + len = 24;
> + else if (flags & PCI_MSI_FLAGS_64BIT)
These aren't mutually exclusive features aiui.
> + /* 64 bit */
> + len = 14;
> + else
> + len = 10;
> + break;
This should probably be something like

	len = 10;
	if (flags & PCI_MSI_FLAGS_MASKBIT)
		len += 10;
	if (flags & PCI_MSI_FLAGS_64BIT) {
		/* set 64bit permission table */
		len += 4;
	} else {
		/* set 32bit permission table */
	}
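As a standalone sketch, the computation Alex suggests yields exactly the four legal MSI capability sizes (10, 14, 20, 24); the flag values below are as defined in <linux/pci_regs.h>, and msi_cap_len is an invented helper name:

```c
#include <assert.h>

#define PCI_MSI_FLAGS_64BIT	0x0080	/* 64-bit addresses allowed */
#define PCI_MSI_FLAGS_MASKBIT	0x0100	/* per-vector masking capable */

/*
 * Length of the MSI capability for a given message-control value:
 * base is 10 bytes; per-vector masking adds the mask and pending
 * registers (10 bytes); 64-bit addressing adds the upper address
 * dword (4 bytes). The features are independent, not exclusive.
 */
static int msi_cap_len(unsigned short flags)
{
	int len = 10;

	if (flags & PCI_MSI_FLAGS_MASKBIT)
		len += 10;
	if (flags & PCI_MSI_FLAGS_64BIT)
		len += 4;
	return len;
}
```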
> diff -uprN linux-2.6.34/include/linux/vfio.h vfio-linux-2.6.34/include/linux/vfio.h
> --- linux-2.6.34/include/linux/vfio.h 1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/include/linux/vfio.h 2010-06-07 12:20:06.000000000 -0700
> +/* request MSI interrupts; use given eventfd */
> +#define VFIO_EVENTFD_MSI _IOW(';', 105, int)
Any intention of supporting MSI multiple message capability? If so,
this might turn into the same interface as MSIX.