LinuxLists.cc - [PATCH 0/4] KVM-VFIO pseudo device for VFIO coherency

2013-10-29 16:13:22

Subject: [PATCH 0/4] KVM-VFIO pseudo device for VFIO coherency

This series allows QEMU to create a VFIO device in KVM for registering
VFIO groups. The initial user of this interface is to toggle whether
KVM emulates coherency instructions like WBINVD. For this particular
use case I've toyed with the idea of disabling NoSnoop at the device,
but we already have an example where this fails as video cards often
have backdoors to PCI config space and may re-enable NoSnoop support.
Regardless of whether the driver should be doing this or not, we don't
really want to rely on well behaved drivers for things as important as
coherency.

In this first step we assume that noncoherent DMA is possible any time
a VFIO group is present. I have follow-on patches which fix a bug in
intel-iommu which that prevents us from always enabling SNP support in
the IOMMU page tables and prevents us from tracking the hardware
capabilities of the IOMMU domain for coherency. Once that is fixed I
can fill in the TODO which would allow us to only mark VFIO as
noncoherent when necessary.

POWER support for VFIO would also like to make use of the VFIO pseudo
device interface, but has some remaining work to architect an
interface into it.

I'd very much like to see this for v3.13. Thanks,

Alex

---

Alex Williamson (4):
kvm: Destroy & free KVM devices on release
kvm: Add VFIO device
kvm/x86: Convert iommu_flags to iommu_noncoherent
kvm: Create non-coherent DMA registeration

Documentation/virtual/kvm/devices/vfio.txt | 22 ++
arch/ia64/include/asm/kvm_host.h | 2
arch/powerpc/kvm/book3s_xics.c | 1
arch/x86/include/asm/kvm_host.h | 4
arch/x86/kvm/Kconfig | 1
arch/x86/kvm/Makefile | 2
arch/x86/kvm/vmx.c | 3
arch/x86/kvm/x86.c | 21 ++
include/linux/kvm_host.h | 23 ++
include/uapi/linux/kvm.h | 4
virt/kvm/Kconfig | 3
virt/kvm/iommu.c | 22 +-
virt/kvm/kvm_main.c | 10 +
virt/kvm/vfio.c | 263 ++++++++++++++++++++++++++++
14 files changed, 362 insertions(+), 19 deletions(-)
create mode 100644 Documentation/virtual/kvm/devices/vfio.txt
create mode 100644 virt/kvm/vfio.c

2013-10-29 16:13:28

by Alex Williamson

[permalink] [raw]

Subject: [PATCH 1/4] kvm: Destroy & free KVM devices on release

The KVM device interface allocates a struct kvm_device and calls
kvm_device_ops.create on it from KVM VM ioctl KVM_CREATE_DEVICE.
This returns a file descriptor to the user for them to set/get/check
further attributes. On closing the file descriptor, one would assume
that kvm_device_ops.destroy is called and all traces of the device
would go away. One would be wrong, it actually does nothing more
than release the struct kvm reference, waiting until the VM is
destroyed before doing more. This leaves devices that only want a
single instance of themselves per VM in a tough spot.

To fix this, do full cleanup on release of the device file descriptor.
It's also non-symmetric that one of the existing devices frees the
struct kvm_device from it's .destroy function, while the other
doesn't. KVM-core allocates the structure and should therefore be
responsible for freeing it. Finally, add a missing kfree for the
device creation error path.

Signed-off-by: Alex Williamson <[email protected]>
---
arch/powerpc/kvm/book3s_xics.c | 1 -
virt/kvm/kvm_main.c | 5 +++++
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index a3a5cb8..9a82426 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -1220,7 +1220,6 @@ static void kvmppc_xics_free(struct kvm_device *dev)
for (i = 0; i <= xics->max_icsid; i++)
kfree(xics->ics[i]);
kfree(xics);
- kfree(dev);
}

static int kvmppc_xics_create(struct kvm_device *dev, u32 type)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a9dd682..fec8320 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -572,6 +572,7 @@ static void kvm_destroy_devices(struct kvm *kvm)

list_del(node);
dev->ops->destroy(dev);
+ kfree(dev);
}
}

@@ -2231,6 +2232,9 @@ static int kvm_device_release(struct inode *inode, struct file *filp)
struct kvm_device *dev = filp->private_data;
struct kvm *kvm = dev->kvm;

+ list_del(&dev->vm_node);
+ dev->ops->destroy(dev);
+ kfree(dev);
kvm_put_kvm(kvm);
return 0;
}
@@ -2294,6 +2298,7 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev, O_RDWR | O_CLOEXEC);
if (ret < 0) {
ops->destroy(dev);
+ kfree(dev);
return ret;
}

2013-10-29 16:13:37

by Alex Williamson

[permalink] [raw]

Subject: [PATCH 2/4] kvm: Add VFIO device

So far we've succeeded at making KVM and VFIO mostly unaware of each
other, but areas are cropping up where a connection beyond eventfds
and irqfds needs to be made. This patch introduces a KVM-VFIO device
that is meant to be a gateway for such interaction. The user creates
the device and can add and remove VFIO groups to it via file
descriptors. When a group is added, KVM verifies the group is valid
and gets a reference to it via the VFIO external user interface.

Signed-off-by: Alex Williamson <[email protected]>
---
Documentation/virtual/kvm/devices/vfio.txt | 22 +++
arch/x86/kvm/Kconfig | 1
arch/x86/kvm/Makefile | 2
include/linux/kvm_host.h | 1
include/uapi/linux/kvm.h | 4 +
virt/kvm/Kconfig | 3
virt/kvm/kvm_main.c | 5 +
virt/kvm/vfio.c | 219 ++++++++++++++++++++++++++++
8 files changed, 256 insertions(+), 1 deletion(-)
create mode 100644 Documentation/virtual/kvm/devices/vfio.txt
create mode 100644 virt/kvm/vfio.c

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
new file mode 100644
index 0000000..ef51740
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -0,0 +1,22 @@
+VFIO virtual device
+===================
+
+Device types supported:
+ KVM_DEV_TYPE_VFIO
+
+Only one VFIO instance may be created per VM. The created device
+tracks VFIO groups in use by the VM and features of those groups
+important to the correctness and acceleration of the VM. As groups
+are enabled and disabled for use by the VM, KVM should be updated
+about their presence. When registered with KVM, a reference to the
+VFIO-group is held by KVM.
+
+Groups:
+ KVM_DEV_VFIO_GROUP
+
+KVM_DEV_VFIO_GROUP attributes:
+ KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+ KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+
+For each, kvm_device_attr.addr points to an int32_t file descriptor
+for the VFIO group.
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index a47a3e5..b89c5db 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -38,6 +38,7 @@ config KVM
select PERF_EVENTS
select HAVE_KVM_MSI
select HAVE_KVM_CPU_RELAX_INTERCEPT
+ select KVM_VFIO
---help---
Support hosting fully virtualized guest machines using hardware
virtualization extensions. You will need a fairly recent
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index bf4fb04..25d22b2 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -9,7 +9,7 @@ KVM := ../../../virt/kvm

kvm-y += $(KVM)/kvm_main.o $(KVM)/ioapic.o \
$(KVM)/coalesced_mmio.o $(KVM)/irq_comm.o \
- $(KVM)/eventfd.o $(KVM)/irqchip.o
+ $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
kvm-$(CONFIG_KVM_DEVICE_ASSIGNMENT) += $(KVM)/assigned-dev.o $(KVM)/iommu.o
kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0fbbc7a..279002b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1066,6 +1066,7 @@ struct kvm_device *kvm_device_from_filp(struct file *filp);

extern struct kvm_device_ops kvm_mpic_ops;
extern struct kvm_device_ops kvm_xics_ops;
+extern struct kvm_device_ops kvm_vfio_ops;

#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 99c2533..7c1a349 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -843,6 +843,10 @@ struct kvm_device_attr {
#define KVM_DEV_TYPE_FSL_MPIC_20 1
#define KVM_DEV_TYPE_FSL_MPIC_42 2
#define KVM_DEV_TYPE_XICS 3
+#define KVM_DEV_TYPE_VFIO 4
+#define KVM_DEV_VFIO_GROUP 1
+#define KVM_DEV_VFIO_GROUP_ADD 1
+#define KVM_DEV_VFIO_GROUP_DEL 2

/*
* ioctls for VM fds
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 779262f..fbe1a48 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -27,3 +27,6 @@ config HAVE_KVM_MSI

config HAVE_KVM_CPU_RELAX_INTERCEPT
bool
+
+config KVM_VFIO
+ bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fec8320..6538502 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2275,6 +2275,11 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
ops = &kvm_xics_ops;
break;
#endif
+#ifdef CONFIG_KVM_VFIO
+ case KVM_DEV_TYPE_VFIO:
+ ops = &kvm_vfio_ops;
+ break;
+#endif
default:
return -ENODEV;
}
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
new file mode 100644
index 0000000..fc2d112
--- /dev/null
+++ b/virt/kvm/vfio.c
@@ -0,0 +1,219 @@
+/*
+ * VFIO-KVM bridge pseudo device
+ *
+ * Copyright (C) 2013 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/errno.h>
+#include <linux/file.h>
+#include <linux/kvm_host.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+struct kvm_vfio_group {
+ struct list_head node;
+ struct vfio_group *vfio_group;
+};
+
+struct kvm_vfio {
+ struct list_head group_list;
+ struct mutex lock;
+};
+
+static struct vfio_group *kvm_vfio_group_get_external_user(struct file *filep)
+{
+ struct vfio_group *vfio_group;
+ struct vfio_group *(*fn)(struct file *);
+
+ fn = symbol_get(vfio_group_get_external_user);
+ if (!fn)
+ return ERR_PTR(-EINVAL);
+
+ vfio_group = fn(filep);
+
+ symbol_put(vfio_group_get_external_user);
+
+ return vfio_group;
+}
+
+static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
+{
+ void (*fn)(struct vfio_group *);
+
+ fn = symbol_get(vfio_group_put_external_user);
+ if (!fn)
+ return;
+
+ fn(vfio_group);
+
+ symbol_put(vfio_group_put_external_user);
+}
+
+static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
+{
+ struct kvm_vfio *kv = dev->private;
+ struct vfio_group *vfio_group;
+ struct kvm_vfio_group *kvg;
+ void __user *argp = (void __user *)arg;
+ struct fd f;
+ int32_t fd;
+ int ret;
+
+ switch (attr) {
+ case KVM_DEV_VFIO_GROUP_ADD:
+ if (get_user(fd, (int32_t __user *)argp))
+ return -EFAULT;
+
+ f = fdget(fd);
+ if (!f.file)
+ return -EBADF;
+
+ vfio_group = kvm_vfio_group_get_external_user(f.file);
+ fdput(f);
+
+ if (IS_ERR(vfio_group))
+ return PTR_ERR(vfio_group);
+
+ mutex_lock(&kv->lock);
+
+ list_for_each_entry(kvg, &kv->group_list, node) {
+ if (kvg->vfio_group == vfio_group) {
+ mutex_unlock(&kv->lock);
+ kvm_vfio_group_put_external_user(vfio_group);
+ return -EEXIST;
+ }
+ }
+
+ kvg = kzalloc(sizeof(*kvg), GFP_KERNEL);
+ if (!kvg) {
+ mutex_unlock(&kv->lock);
+ kvm_vfio_group_put_external_user(vfio_group);
+ return -ENOMEM;
+ }
+
+ list_add_tail(&kvg->node, &kv->group_list);
+ kvg->vfio_group = vfio_group;
+
+ mutex_unlock(&kv->lock);
+
+ return 0;
+
+ case KVM_DEV_VFIO_GROUP_DEL:
+ if (get_user(fd, (int32_t __user *)argp))
+ return -EFAULT;
+
+ f = fdget(fd);
+ if (!f.file)
+ return -EBADF;
+
+ vfio_group = kvm_vfio_group_get_external_user(f.file);
+ fdput(f);
+
+ if (IS_ERR(vfio_group))
+ return PTR_ERR(vfio_group);
+
+ ret = -ENOENT;
+
+ mutex_lock(&kv->lock);
+
+ list_for_each_entry(kvg, &kv->group_list, node) {
+ if (kvg->vfio_group != vfio_group)
+ continue;
+
+ list_del(&kvg->node);
+ kvm_vfio_group_put_external_user(kvg->vfio_group);
+ kfree(kvg);
+ ret = 0;
+ break;
+ }
+
+ mutex_unlock(&kv->lock);
+
+ kvm_vfio_group_put_external_user(vfio_group);
+
+ return ret;
+ }
+
+ return -ENXIO;
+}
+
+static int kvm_vfio_set_attr(struct kvm_device *dev,
+ struct kvm_device_attr *attr)
+{
+ switch (attr->group) {
+ case KVM_DEV_VFIO_GROUP:
+ return kvm_vfio_set_group(dev, attr->attr, attr->addr);
+ }
+
+ return -ENXIO;
+}
+
+static int kvm_vfio_has_attr(struct kvm_device *dev,
+ struct kvm_device_attr *attr)
+{
+ switch (attr->group) {
+ case KVM_DEV_VFIO_GROUP:
+ switch (attr->attr) {
+ case KVM_DEV_VFIO_GROUP_ADD:
+ case KVM_DEV_VFIO_GROUP_DEL:
+ return 0;
+ }
+
+ break;
+ }
+
+ return -ENXIO;
+}
+
+static void kvm_vfio_destroy(struct kvm_device *dev)
+{
+ struct kvm_vfio *kv = dev->private;
+ struct kvm_vfio_group *kvg, *tmp;
+
+ list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
+ kvm_vfio_group_put_external_user(kvg->vfio_group);
+ list_del(&kvg->node);
+ kfree(kvg);
+ }
+
+ kfree(kv);
+}
+
+static int kvm_vfio_create(struct kvm_device *dev, u32 type)
+{
+ struct kvm_device *tmp;
+ struct kvm_vfio *kv;
+
+ /* Only one VFIO "device" per VM */
+ list_for_each_entry(tmp, &dev->kvm->devices, vm_node)
+ if (tmp->ops == &kvm_vfio_ops)
+ return -EBUSY;
+
+ kv = kzalloc(sizeof(*kv), GFP_KERNEL);
+ if (!kv)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&kv->group_list);
+ mutex_init(&kv->lock);
+
+ dev->private = kv;
+
+ return 0;
+}
+
+struct kvm_device_ops kvm_vfio_ops = {
+ .name = "kvm-vfio",
+ .create = kvm_vfio_create,
+ .destroy = kvm_vfio_destroy,
+ .set_attr = kvm_vfio_set_attr,
+ .has_attr = kvm_vfio_has_attr,
+};

2013-10-29 16:13:42

by Alex Williamson

[permalink] [raw]

Subject: [PATCH 3/4] kvm/x86: Convert iommu_flags to iommu_noncoherent

Default to operating in coherent mode. This simplifies the logic when
we switch to a model of registering and unregistering noncoherent I/O
with KVM.

Signed-off-by: Alex Williamson <[email protected]>
---
arch/ia64/include/asm/kvm_host.h | 2 +-
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/vmx.c | 2 +-
arch/x86/kvm/x86.c | 2 +-
include/linux/kvm_host.h | 3 ---
virt/kvm/iommu.c | 16 ++++++++--------
6 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h
index 989dd3f..1933b7a 100644
--- a/arch/ia64/include/asm/kvm_host.h
+++ b/arch/ia64/include/asm/kvm_host.h
@@ -480,7 +480,7 @@ struct kvm_arch {

struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
- int iommu_flags;
+ bool iommu_noncoherent;

unsigned long irq_sources_bitmap;
unsigned long irq_states[KVM_IOAPIC_NUM_PINS];
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c76ff74..1b6b5f9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -557,7 +557,7 @@ struct kvm_arch {

struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
- int iommu_flags;
+ bool iommu_noncoherent;
struct kvm_pic *vpic;
struct kvm_ioapic *vioapic;
struct kvm_pit *vpit;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2b2fce1..5d0e3ab 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7411,7 +7411,7 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
if (is_mmio)
ret = MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
else if (vcpu->kvm->arch.iommu_domain &&
- !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY))
+ vcpu->kvm->arch.iommu_noncoherent)
ret = kvm_get_guest_memory_type(vcpu, gfn) <<
VMX_EPT_MT_EPTE_SHIFT;
else
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e5ca72a..b1231b0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2716,7 +2716,7 @@ static void wbinvd_ipi(void *garbage)
static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
{
return vcpu->kvm->arch.iommu_domain &&
- !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY);
+ vcpu->kvm->arch.iommu_noncoherent;
}

void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 279002b..46ce3d1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -747,9 +747,6 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
int kvm_request_irq_source_id(struct kvm *kvm);
void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);

-/* For vcpu->arch.iommu_flags */
-#define KVM_IOMMU_CACHE_COHERENCY 0x1
-
#ifdef CONFIG_KVM_DEVICE_ASSIGNMENT
int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot);
void kvm_iommu_unmap_pages(struct kvm *kvm, struct kvm_memory_slot *slot);
diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
index 72a130b..9cde444 100644
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -79,7 +79,7 @@ int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
flags = IOMMU_READ;
if (!(slot->flags & KVM_MEM_READONLY))
flags |= IOMMU_WRITE;
- if (kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY)
+ if (!kvm->arch.iommu_noncoherent)
flags |= IOMMU_CACHE;

@@ -158,7 +158,8 @@ int kvm_assign_device(struct kvm *kvm,
{
struct pci_dev *pdev = NULL;
struct iommu_domain *domain = kvm->arch.iommu_domain;
- int r, last_flags;
+ int r;
+ bool noncoherent;

/* check if iommu exists and in use */
if (!domain)
@@ -174,15 +175,13 @@ int kvm_assign_device(struct kvm *kvm,
return r;
}

- last_flags = kvm->arch.iommu_flags;
- if (iommu_domain_has_cap(kvm->arch.iommu_domain,
- IOMMU_CAP_CACHE_COHERENCY))
- kvm->arch.iommu_flags |= KVM_IOMMU_CACHE_COHERENCY;
+ noncoherent = !iommu_domain_has_cap(kvm->arch.iommu_domain,
+ IOMMU_CAP_CACHE_COHERENCY);

/* Check if need to update IOMMU page table for guest memory */
- if ((last_flags ^ kvm->arch.iommu_flags) ==
- KVM_IOMMU_CACHE_COHERENCY) {
+ if (noncoherent != kvm->arch.iommu_noncoherent) {
kvm_iommu_unmap_memslots(kvm);
+ kvm->arch.iommu_noncoherent = noncoherent;
r = kvm_iommu_map_memslots(kvm);
if (r)
goto out_unmap;
@@ -350,6 +349,7 @@ int kvm_iommu_unmap_guest(struct kvm *kvm)
mutex_lock(&kvm->slots_lock);
kvm_iommu_unmap_memslots(kvm);
kvm->arch.iommu_domain = NULL;
+ kvm->arch.iommu_noncoherent = false;
mutex_unlock(&kvm->slots_lock);

iommu_domain_free(domain);

2013-10-29 16:13:47

by Alex Williamson

[permalink] [raw]

Subject: [PATCH 4/4] kvm: Create non-coherent DMA registeration

We currently use some ad-hoc arch variables tied to legacy KVM device
assignment to manage emulation of instructions that depend on whether
non-coherent DMA is present. Create an interface for this, adapting
legacy KVM device assignment and adding VFIO via the KVM-VFIO device.
For now we assume that non-coherent DMA is possible any time we have a
VFIO group. Eventually an interface can be developed as part of the
VFIO external user interface to query the coherency of a group.

Signed-off-by: Alex Williamson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx.c | 3 +--
arch/x86/kvm/x86.c | 21 +++++++++++++++++--
include/linux/kvm_host.h | 19 +++++++++++++++++
virt/kvm/iommu.c | 6 +++++
virt/kvm/vfio.c | 44 +++++++++++++++++++++++++++++++++++++++
6 files changed, 91 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1b6b5f9..50c1e9c1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -558,6 +558,8 @@ struct kvm_arch {
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
bool iommu_noncoherent;
+#define __KVM_HAVE_ARCH_NONCOHERENT_DMA
+ atomic_t noncoherent_dma_count;
struct kvm_pic *vpic;
struct kvm_ioapic *vioapic;
struct kvm_pit *vpit;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5d0e3ab..f7fe08a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7410,8 +7410,7 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
*/
if (is_mmio)
ret = MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
- else if (vcpu->kvm->arch.iommu_domain &&
- vcpu->kvm->arch.iommu_noncoherent)
+ else if (kvm_arch_has_noncoherent_dma(vcpu->kvm))
ret = kvm_get_guest_memory_type(vcpu, gfn) <<
VMX_EPT_MT_EPTE_SHIFT;
else
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b1231b0..feec86d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2715,8 +2715,7 @@ static void wbinvd_ipi(void *garbage)

static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
{
- return vcpu->kvm->arch.iommu_domain &&
- vcpu->kvm->arch.iommu_noncoherent;
+ return kvm_arch_has_noncoherent_dma(vcpu->kvm);
}

void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
@@ -7420,6 +7419,24 @@ bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
kvm_x86_ops->interrupt_allowed(vcpu);
}

+void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
+{
+ atomic_inc(&kvm->arch.noncoherent_dma_count);
+}
+EXPORT_SYMBOL_GPL(kvm_arch_register_noncoherent_dma);
+
+void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm)
+{
+ atomic_dec(&kvm->arch.noncoherent_dma_count);
+}
+EXPORT_SYMBOL_GPL(kvm_arch_unregister_noncoherent_dma);
+
+bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
+{
+ return atomic_read(&kvm->arch.noncoherent_dma_count);
+}
+EXPORT_SYMBOL_GPL(kvm_arch_has_noncoherent_dma);
+
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 46ce3d1..a225349 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -671,6 +671,25 @@ static inline void kvm_arch_free_vm(struct kvm *kvm)
}
#endif

+#ifdef __KVM_HAVE_ARCH_NONCOHERENT_DMA
+void kvm_arch_register_noncoherent_dma(struct kvm *kvm);
+void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm);
+bool kvm_arch_has_noncoherent_dma(struct kvm *kvm);
+#else
+static inline void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
+{
+}
+
+static inline void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm)
+{
+}
+
+static inline bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
+{
+ return false;
+}
+#endif
+
static inline wait_queue_head_t *kvm_arch_vcpu_wq(struct kvm_vcpu *vcpu)
{
#ifdef __KVM_HAVE_ARCH_WQP
diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
index 9cde444..0a54456 100644
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -140,6 +140,9 @@ static int kvm_iommu_map_memslots(struct kvm *kvm)
struct kvm_memslots *slots;
struct kvm_memory_slot *memslot;

+ if (kvm->arch.iommu_noncoherent)
+ kvm_arch_register_noncoherent_dma(kvm);
+
idx = srcu_read_lock(&kvm->srcu);
slots = kvm_memslots(kvm);

@@ -335,6 +338,9 @@ static int kvm_iommu_unmap_memslots(struct kvm *kvm)

srcu_read_unlock(&kvm->srcu, idx);

+ if (kvm->arch.iommu_noncoherent)
+ kvm_arch_unregister_noncoherent_dma(kvm);
+
return 0;
}

diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index fc2d112..2e336a7 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -27,6 +27,7 @@ struct kvm_vfio_group {
struct kvm_vfio {
struct list_head group_list;
struct mutex lock;
+ bool noncoherent;
};

static struct vfio_group *kvm_vfio_group_get_external_user(struct file *filep)
@@ -58,6 +59,43 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
symbol_put(vfio_group_put_external_user);
}

+/*
+ * Groups can use the same or different IOMMU domains. If the same then
+ * adding a new group may change the coherency of groups we've previously
+ * been told about. We don't want to care about any of that so we retest
+ * each group and bail as soon as we find one that's noncoherent. This
+ * means we only ever [un]register_noncoherent_dma once for the whole device.
+ */
+static void kvm_vfio_update_coherency(struct kvm_device *dev)
+{
+ struct kvm_vfio *kv = dev->private;
+ bool noncoherent = false;
+ struct kvm_vfio_group *kvg;
+
+ mutex_lock(&kv->lock);
+
+ list_for_each_entry(kvg, &kv->group_list, node) {
+ /*
+ * TODO: We need an interface to check the coherency of
+ * the IOMMU domain this group is using. For now, assume
+ * it's always noncoherent.
+ */
+ noncoherent = true;
+ break;
+ }
+
+ if (noncoherent != kv->noncoherent) {
+ kv->noncoherent = noncoherent;
+
+ if (kv->noncoherent)
+ kvm_arch_register_noncoherent_dma(dev->kvm);
+ else
+ kvm_arch_unregister_noncoherent_dma(dev->kvm);
+ }
+
+ mutex_unlock(&kv->lock);
+}
+
static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
{
struct kvm_vfio *kv = dev->private;
@@ -105,6 +143,8 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)

mutex_unlock(&kv->lock);

+ kvm_vfio_update_coherency(dev);
+
return 0;

case KVM_DEV_VFIO_GROUP_DEL:
@@ -140,6 +180,8 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)

kvm_vfio_group_put_external_user(vfio_group);

+ kvm_vfio_update_coherency(dev);
+
return ret;
}

@@ -185,6 +227,8 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
kfree(kvg);
}

+ kvm_vfio_update_coherency(dev);
+
kfree(kv);
}

2013-10-30 10:41:12

by Gleb Natapov

[permalink] [raw]

Subject: Re: [PATCH 1/4] kvm: Destroy & free KVM devices on release

On Tue, Oct 29, 2013 at 10:13:22AM -0600, Alex Williamson wrote:
> The KVM device interface allocates a struct kvm_device and calls
> kvm_device_ops.create on it from KVM VM ioctl KVM_CREATE_DEVICE.
> This returns a file descriptor to the user for them to set/get/check
> further attributes. On closing the file descriptor, one would assume
> that kvm_device_ops.destroy is called and all traces of the device
> would go away. One would be wrong, it actually does nothing more
> than release the struct kvm reference, waiting until the VM is
> destroyed before doing more. This leaves devices that only want a
> single instance of themselves per VM in a tough spot.
>
This is by design. Otherwise locking will be needed on each device access
and for interrupt controllers this is unnecessary serialization and
overhead. Device API is not designed for devices that can go away while
machine is running anyway, so after creation device is only destroyed
during VM destruction.

> To fix this, do full cleanup on release of the device file descriptor.
> It's also non-symmetric that one of the existing devices frees the
> struct kvm_device from it's .destroy function, while the other
> doesn't. KVM-core allocates the structure and should therefore be
> responsible for freeing it. Finally, add a missing kfree for the
> device creation error path.
>
> Signed-off-by: Alex Williamson <[email protected]>
> ---
> arch/powerpc/kvm/book3s_xics.c | 1 -
> virt/kvm/kvm_main.c | 5 +++++
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
> index a3a5cb8..9a82426 100644
> --- a/arch/powerpc/kvm/book3s_xics.c
> +++ b/arch/powerpc/kvm/book3s_xics.c
> @@ -1220,7 +1220,6 @@ static void kvmppc_xics_free(struct kvm_device *dev)
> for (i = 0; i <= xics->max_icsid; i++)
> kfree(xics->ics[i]);
> kfree(xics);
> - kfree(dev);
> }
>
> static int kvmppc_xics_create(struct kvm_device *dev, u32 type)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index a9dd682..fec8320 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -572,6 +572,7 @@ static void kvm_destroy_devices(struct kvm *kvm)
>
> list_del(node);
> dev->ops->destroy(dev);
> + kfree(dev);
> }
> }
>
> @@ -2231,6 +2232,9 @@ static int kvm_device_release(struct inode *inode, struct file *filp)
> struct kvm_device *dev = filp->private_data;
> struct kvm *kvm = dev->kvm;
>
> + list_del(&dev->vm_node);
> + dev->ops->destroy(dev);
> + kfree(dev);
> kvm_put_kvm(kvm);
> return 0;
> }
> @@ -2294,6 +2298,7 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
> ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev, O_RDWR | O_CLOEXEC);
> if (ret < 0) {
> ops->destroy(dev);
> + kfree(dev);
> return ret;
> }
>

--
Gleb.

2013-10-30 14:39:04

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH 1/4] kvm: Destroy & free KVM devices on release

On Wed, 2013-10-30 at 12:40 +0200, Gleb Natapov wrote:
> On Tue, Oct 29, 2013 at 10:13:22AM -0600, Alex Williamson wrote:
> > The KVM device interface allocates a struct kvm_device and calls
> > kvm_device_ops.create on it from KVM VM ioctl KVM_CREATE_DEVICE.
> > This returns a file descriptor to the user for them to set/get/check
> > further attributes. On closing the file descriptor, one would assume
> > that kvm_device_ops.destroy is called and all traces of the device
> > would go away. One would be wrong, it actually does nothing more
> > than release the struct kvm reference, waiting until the VM is
> > destroyed before doing more. This leaves devices that only want a
> > single instance of themselves per VM in a tough spot.
> >
> This is by design. Otherwise locking will be needed on each device access
> and for interrupt controllers this is unnecessary serialization and
> overhead. Device API is not designed for devices that can go away while
> machine is running anyway, so after creation device is only destroyed
> during VM destruction.

Hmm, ok. In that case I can drop this patch and I think the rest just
boils down to userspace use of the device. I had been close()'ing the
kvm device fd when all QEMU vfio devices are detached, but I can just as
easily leave it open in case a new device is added later. I'll send out
a new series after doing some more review and testing. Do you have any
comments on the rest of the series? Thanks,

Alex

> > To fix this, do full cleanup on release of the device file descriptor.
> > It's also non-symmetric that one of the existing devices frees the
> > struct kvm_device from it's .destroy function, while the other
> > doesn't. KVM-core allocates the structure and should therefore be
> > responsible for freeing it. Finally, add a missing kfree for the
> > device creation error path.
> >
> > Signed-off-by: Alex Williamson <[email protected]>
> > ---
> > arch/powerpc/kvm/book3s_xics.c | 1 -
> > virt/kvm/kvm_main.c | 5 +++++
> > 2 files changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
> > index a3a5cb8..9a82426 100644
> > --- a/arch/powerpc/kvm/book3s_xics.c
> > +++ b/arch/powerpc/kvm/book3s_xics.c
> > @@ -1220,7 +1220,6 @@ static void kvmppc_xics_free(struct kvm_device *dev)
> > for (i = 0; i <= xics->max_icsid; i++)
> > kfree(xics->ics[i]);
> > kfree(xics);
> > - kfree(dev);
> > }
> >
> > static int kvmppc_xics_create(struct kvm_device *dev, u32 type)
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index a9dd682..fec8320 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -572,6 +572,7 @@ static void kvm_destroy_devices(struct kvm *kvm)
> >
> > list_del(node);
> > dev->ops->destroy(dev);
> > + kfree(dev);
> > }
> > }
> >
> > @@ -2231,6 +2232,9 @@ static int kvm_device_release(struct inode *inode, struct file *filp)
> > struct kvm_device *dev = filp->private_data;
> > struct kvm *kvm = dev->kvm;
> >
> > + list_del(&dev->vm_node);
> > + dev->ops->destroy(dev);
> > + kfree(dev);
> > kvm_put_kvm(kvm);
> > return 0;
> > }
> > @@ -2294,6 +2298,7 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
> > ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev, O_RDWR | O_CLOEXEC);
> > if (ret < 0) {
> > ops->destroy(dev);
> > + kfree(dev);
> > return ret;
> > }
> >
>
> --
> Gleb.

2013-10-30 15:34:14

by Gleb Natapov

[permalink] [raw]

Subject: Re: [PATCH 1/4] kvm: Destroy & free KVM devices on release

On Wed, Oct 30, 2013 at 08:30:22AM -0600, Alex Williamson wrote:
> On Wed, 2013-10-30 at 12:40 +0200, Gleb Natapov wrote:
> > On Tue, Oct 29, 2013 at 10:13:22AM -0600, Alex Williamson wrote:
> > > The KVM device interface allocates a struct kvm_device and calls
> > > kvm_device_ops.create on it from KVM VM ioctl KVM_CREATE_DEVICE.
> > > This returns a file descriptor to the user for them to set/get/check
> > > further attributes. On closing the file descriptor, one would assume
> > > that kvm_device_ops.destroy is called and all traces of the device
> > > would go away. One would be wrong, it actually does nothing more
> > > than release the struct kvm reference, waiting until the VM is
> > > destroyed before doing more. This leaves devices that only want a
> > > single instance of themselves per VM in a tough spot.
> > >
> > This is by design. Otherwise locking will be needed on each device access
> > and for interrupt controllers this is unnecessary serialization and
> > overhead. Device API is not designed for devices that can go away while
> > machine is running anyway, so after creation device is only destroyed
> > during VM destruction.
>
> Hmm, ok. In that case I can drop this patch and I think the rest just
> boils down to userspace use of the device. I had been close()'ing the
> kvm device fd when all QEMU vfio devices are detached, but I can just as
> easily leave it open in case a new device is added later. I'll send out
> a new series after doing some more review and testing. Do you have any
> comments on the rest of the series? Thanks,
>
If I understand 4/4 correctly if there is VFIO device connected we
assume non coherent domain. How hard it would be to do proper checking
in this path series?

--
Gleb.

2013-10-30 15:42:24

by Paolo Bonzini

[permalink] [raw]

Subject: Re: [PATCH 1/4] kvm: Destroy & free KVM devices on release

Il 30/10/2013 16:33, Gleb Natapov ha scritto:
> > Hmm, ok. In that case I can drop this patch and I think the rest just
> > boils down to userspace use of the device. I had been close()'ing the
> > kvm device fd when all QEMU vfio devices are detached, but I can just as
> > easily leave it open in case a new device is added later. I'll send out
> > a new series after doing some more review and testing. Do you have any
> > comments on the rest of the series? Thanks,
>
> If I understand 4/4 correctly if there is VFIO device connected we
> assume non coherent domain. How hard it would be to do proper checking
> in this path series?

Yes, that's my understanding as well. Is the performance impact measurable?

Paolo

2013-10-30 15:45:16

by Gleb Natapov

[permalink] [raw]

Subject: Re: [PATCH 1/4] kvm: Destroy & free KVM devices on release

On Wed, Oct 30, 2013 at 04:42:15PM +0100, Paolo Bonzini wrote:
> Il 30/10/2013 16:33, Gleb Natapov ha scritto:
> > > Hmm, ok. In that case I can drop this patch and I think the rest just
> > > boils down to userspace use of the device. I had been close()'ing the
> > > kvm device fd when all QEMU vfio devices are detached, but I can just as
> > > easily leave it open in case a new device is added later. I'll send out
> > > a new series after doing some more review and testing. Do you have any
> > > comments on the rest of the series? Thanks,
> >
> > If I understand 4/4 correctly if there is VFIO device connected we
> > assume non coherent domain. How hard it would be to do proper checking
> > in this path series?
>
> Yes, that's my understanding as well. Is the performance impact measurable?
>
We grant a guest permission to call wbinvd() needlessly.

--
Gleb.

2013-10-30 15:56:39

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH 1/4] kvm: Destroy & free KVM devices on release

On Wed, 2013-10-30 at 16:42 +0100, Paolo Bonzini wrote:
> Il 30/10/2013 16:33, Gleb Natapov ha scritto:
> > > Hmm, ok. In that case I can drop this patch and I think the rest just
> > > boils down to userspace use of the device. I had been close()'ing the
> > > kvm device fd when all QEMU vfio devices are detached, but I can just as
> > > easily leave it open in case a new device is added later. I'll send out
> > > a new series after doing some more review and testing. Do you have any
> > > comments on the rest of the series? Thanks,
> >
> > If I understand 4/4 correctly if there is VFIO device connected we
> > assume non coherent domain. How hard it would be to do proper checking
> > in this path series?
>
> Yes, that's my understanding as well. Is the performance impact measurable?

I have additional patches to do this, the blocker is that intel-iommu
strips IOMMU_CACHE from the flags I provide if the IOMMU domain as a
whole (ie. all of the hardware units involved in the domain) do not all
support Snoop Control (SC). Thus I can't rely on simply tracking the
hardware capabilities of the domain because some IOMMU PTEs will have
SNP set, others will not. It depends on the state of the IOMMU domain
at the time of the mapping. Therefore the only way to switch back to
coherent from non-coherent is to unmap and remap everything. This is
what legacy KVM device assignment does, but I can't see how that's not
racy wrt inflight DMA.

The patch approach I was taking is:

1) Enable KVM to handle the VM as non-coherent (which is trued, VFIO
never sets IOMMU_CACHE currently due to the above issues).

2) Get QEMU to enable the KVM device, fixing the coherency issue.

3) Fix Intel IOMMU to honor IOMMU_CACHE regardless of hw capabilities
(patch posted).

4) Make VFIO always map w/ IOMMU_CACHE

5) Create VFIO external user interface to query capabilities.

6) Update KVM device to use it.

As to performance, for graphics I can't tell a difference whether
NoSnoop is prevented or KVM does WBINVD. I'm hoping though that we can
consider the mode enabled by this patch as a temporary step in the
process and we'll eventually get to a state where we only emulate WBINVD
when necessary. Correctness trumps performance in the current round.
Thanks,

Alex

2013-10-30 16:10:46

by Paolo Bonzini

[permalink] [raw]

Subject: Re: [PATCH 1/4] kvm: Destroy & free KVM devices on release

Il 30/10/2013 16:56, Alex Williamson ha scritto:
> On Wed, 2013-10-30 at 16:42 +0100, Paolo Bonzini wrote:
>> Il 30/10/2013 16:33, Gleb Natapov ha scritto:
>>>> Hmm, ok. In that case I can drop this patch and I think the rest just
>>>> boils down to userspace use of the device. I had been close()'ing the
>>>> kvm device fd when all QEMU vfio devices are detached, but I can just as
>>>> easily leave it open in case a new device is added later. I'll send out
>>>> a new series after doing some more review and testing. Do you have any
>>>> comments on the rest of the series? Thanks,
>>>
>>> If I understand 4/4 correctly if there is VFIO device connected we
>>> assume non coherent domain. How hard it would be to do proper checking
>>> in this path series?
>>
>> Yes, that's my understanding as well. Is the performance impact measurable?
>
> I have additional patches to do this, the blocker is that intel-iommu
> strips IOMMU_CACHE from the flags I provide if the IOMMU domain as a
> whole (ie. all of the hardware units involved in the domain) do not all
> support Snoop Control (SC). Thus I can't rely on simply tracking the
> hardware capabilities of the domain because some IOMMU PTEs will have
> SNP set, others will not. It depends on the state of the IOMMU domain
> at the time of the mapping. Therefore the only way to switch back to
> coherent from non-coherent is to unmap and remap everything. This is
> what legacy KVM device assignment does, but I can't see how that's not
> racy wrt inflight DMA.
>
> The patch approach I was taking is:
>
> 1) Enable KVM to handle the VM as non-coherent (which is trued, VFIO
> never sets IOMMU_CACHE currently due to the above issues).
>
> 2) Get QEMU to enable the KVM device, fixing the coherency issue.
>
> 3) Fix Intel IOMMU to honor IOMMU_CACHE regardless of hw capabilities
> (patch posted).
>
> 4) Make VFIO always map w/ IOMMU_CACHE
>
> 5) Create VFIO external user interface to query capabilities.
>
> 6) Update KVM device to use it.
>
> As to performance, for graphics I can't tell a difference whether
> NoSnoop is prevented or KVM does WBINVD. I'm hoping though that we can
> consider the mode enabled by this patch as a temporary step in the
> process and we'll eventually get to a state where we only emulate WBINVD
> when necessary. Correctness trumps performance in the current round.
> Thanks,

Thanks for the explanation. Gleb, this looks fine to me. WDYT?

Paolo

2013-10-30 16:19:05

by Gleb Natapov

[permalink] [raw]

Subject: Re: [PATCH 1/4] kvm: Destroy & free KVM devices on release

On Wed, Oct 30, 2013 at 05:10:37PM +0100, Paolo Bonzini wrote:
> Il 30/10/2013 16:56, Alex Williamson ha scritto:
> > On Wed, 2013-10-30 at 16:42 +0100, Paolo Bonzini wrote:
> >> Il 30/10/2013 16:33, Gleb Natapov ha scritto:
> >>>> Hmm, ok. In that case I can drop this patch and I think the rest just
> >>>> boils down to userspace use of the device. I had been close()'ing the
> >>>> kvm device fd when all QEMU vfio devices are detached, but I can just as
> >>>> easily leave it open in case a new device is added later. I'll send out
> >>>> a new series after doing some more review and testing. Do you have any
> >>>> comments on the rest of the series? Thanks,
> >>>
> >>> If I understand 4/4 correctly if there is VFIO device connected we
> >>> assume non coherent domain. How hard it would be to do proper checking
> >>> in this path series?
> >>
> >> Yes, that's my understanding as well. Is the performance impact measurable?
> >
> > I have additional patches to do this, the blocker is that intel-iommu
> > strips IOMMU_CACHE from the flags I provide if the IOMMU domain as a
> > whole (ie. all of the hardware units involved in the domain) do not all
> > support Snoop Control (SC). Thus I can't rely on simply tracking the
> > hardware capabilities of the domain because some IOMMU PTEs will have
> > SNP set, others will not. It depends on the state of the IOMMU domain
> > at the time of the mapping. Therefore the only way to switch back to
> > coherent from non-coherent is to unmap and remap everything. This is
> > what legacy KVM device assignment does, but I can't see how that's not
> > racy wrt inflight DMA.
> >
> > The patch approach I was taking is:
> >
> > 1) Enable KVM to handle the VM as non-coherent (which is trued, VFIO
> > never sets IOMMU_CACHE currently due to the above issues).
> >
> > 2) Get QEMU to enable the KVM device, fixing the coherency issue.
> >
> > 3) Fix Intel IOMMU to honor IOMMU_CACHE regardless of hw capabilities
> > (patch posted).
> >
> > 4) Make VFIO always map w/ IOMMU_CACHE
> >
> > 5) Create VFIO external user interface to query capabilities.
> >
> > 6) Update KVM device to use it.
> >
> > As to performance, for graphics I can't tell a difference whether
> > NoSnoop is prevented or KVM does WBINVD. I'm hoping though that we can
> > consider the mode enabled by this patch as a temporary step in the
> > process and we'll eventually get to a state where we only emulate WBINVD
> > when necessary. Correctness trumps performance in the current round.
> > Thanks,
>
> Thanks for the explanation. Gleb, this looks fine to me. WDYT?
>
To me too. Alex please resend with first patch dropped
(kvm_vfio_destroy() will have to free dev).

--
Gleb.