2013-09-12 21:23:04

by Alex Williamson

Subject: [RFC PATCH 0/3] kvm/vfio: Manage KVM IOMMU coherency with virtual VFIO device

See patch 3/3 for a description of exactly why we need this. I know
POWER folks are also interested in making use of VFIO's external user
interface from KVM and Alexey's proposed patches have a similar device
tailored for SPAPR use there. I'm hoping that we can make the base
device common and extend it for each architecture. The problem we're
solving on x86 may or may not be architecture specific. It has to do
with how the IOMMU handles No-Snoop transactions on PCIe in relation
to how KVM handles vCPU cache operations.

Appreciate any feedback and suggestions on how to make this more
palatable for sharing with POWER. I'll also send a QEMU RFC which
makes use of this. Thanks,

Alex

---

Alex Williamson (3):
kvm: Destroy & free KVM devices on release
vfio: Add check extension interface to external user support
kvm: Add VFIO device for handling IOMMU cache coherency


Documentation/virtual/kvm/devices/vfio.txt | 22 +++
arch/powerpc/kvm/book3s_xics.c | 1
arch/x86/include/asm/kvm_host.h | 1
arch/x86/kvm/Makefile | 2
arch/x86/kvm/vmx.c | 5 -
arch/x86/kvm/x86.c | 5 -
drivers/vfio/vfio.c | 8 +
drivers/vfio/vfio_iommu_type1.c | 4
include/linux/kvm_host.h | 1
include/linux/vfio.h | 2
include/uapi/linux/kvm.h | 4
include/uapi/linux/vfio.h | 1
virt/kvm/kvm_main.c | 8 +
virt/kvm/vfio.c | 237 ++++++++++++++++++++++++++++
14 files changed, 295 insertions(+), 6 deletions(-)
create mode 100644 Documentation/virtual/kvm/devices/vfio.txt
create mode 100644 virt/kvm/vfio.c


2013-09-12 21:23:12

by Alex Williamson

Subject: [RFC PATCH 1/3] kvm: Destroy & free KVM devices on release

The KVM device interface allocates a struct kvm_device and calls
kvm_device_ops.create on it from KVM VM ioctl KVM_CREATE_DEVICE.
This returns a file descriptor to the user for them to set/get/check
further attributes. On closing the file descriptor, one would assume
that kvm_device_ops.destroy is called and all traces of the device
would go away. One would be wrong; it actually does nothing more
than release the struct kvm reference, waiting until the VM is
destroyed before doing more. This leaves devices that only want a
single instance of themselves per VM in a tough spot.

To fix this, do full cleanup on release of the device file descriptor.
It's also asymmetric that one of the existing devices frees the
struct kvm_device from its .destroy function, while the other
doesn't. KVM-core allocates the structure and should therefore be
responsible for freeing it. Finally, add a missing kfree for the
device creation error path.
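
For illustration (not part of this patch), a minimal userspace sketch of
the lifecycle in question; error handling is trimmed and the device type
is only an example:

#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch only: create a device on the VM fd, then close its fd.  With
 * this patch the close runs kvm_device_ops.destroy and frees the
 * struct kvm_device instead of waiting for the VM to be destroyed. */
static int device_lifecycle_example(int vm_fd)
{
	struct kvm_create_device cd = {
		.type = KVM_DEV_TYPE_XICS,	/* any supported device type */
	};

	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0)
		return -1;

	/* ... KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR against cd.fd ... */

	return close(cd.fd);	/* device destroyed and freed here */
}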

Signed-off-by: Alex Williamson <[email protected]>
---
arch/powerpc/kvm/book3s_xics.c | 1 -
virt/kvm/kvm_main.c | 5 +++++
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index a3a5cb8..9a82426 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -1220,7 +1220,6 @@ static void kvmppc_xics_free(struct kvm_device *dev)
for (i = 0; i <= xics->max_icsid; i++)
kfree(xics->ics[i]);
kfree(xics);
- kfree(dev);
}

static int kvmppc_xics_create(struct kvm_device *dev, u32 type)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index bf040c4..d9cad4d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -572,6 +572,7 @@ static void kvm_destroy_devices(struct kvm *kvm)

list_del(node);
dev->ops->destroy(dev);
+ kfree(dev);
}
}

@@ -2225,6 +2226,9 @@ static int kvm_device_release(struct inode *inode, struct file *filp)
struct kvm_device *dev = filp->private_data;
struct kvm *kvm = dev->kvm;

+ list_del(&dev->vm_node);
+ dev->ops->destroy(dev);
+ kfree(dev);
kvm_put_kvm(kvm);
return 0;
}
@@ -2288,6 +2292,7 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev, O_RDWR | O_CLOEXEC);
if (ret < 0) {
ops->destroy(dev);
+ kfree(dev);
return ret;
}

2013-09-12 21:23:23

by Alex Williamson

Subject: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

So far we've succeeded at making KVM and VFIO mostly unaware of each
other, but there's an important point where that breaks down. Intel
VT-d hardware may or may not support snoop control. When snoop
control is available, intel-iommu promotes No-Snoop transactions on
PCIe to be cache coherent. That allows KVM to handle things like the
x86 WBINVD opcode as a nop. When the hardware does not support this,
KVM must implement a hardware visible WBINVD for the guest.

We could simply let userspace tell KVM how to handle WBINVD, but it's
privileged for a reason. Allowing an arbitrary user to enable
physical WBINVD gives them more access to the hardware. Previously,
this has only been enabled for guests using legacy PCI device
assignment, where it's necessary for proper guest execution.
We therefore create a new KVM-VFIO virtual device. The user can add
and remove VFIO groups to this device via file descriptors. KVM
makes use of the VFIO external user interface to validate that the
user has access to physical hardware and gets the coherency state of
the IOMMU from VFIO. This provides equivalent functionality to
legacy KVM assignment, while keeping (nearly) all the bits isolated.

The one intrusion is the resulting flag indicating the coherency
state. For this RFC it's placed on the x86 kvm_arch struct, however
I know POWER has interest in using the VFIO external user interface,
and I'm hoping we can share a common KVM-VFIO device. Perhaps they
care about No-Snoop handling as well or the code can be #ifdef'd.
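
For illustration (not part of this patch), a rough userspace sketch of how
a VMM might drive the new device, using the defines and documentation added
below; the helper name is made up and error handling is minimal:

#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch only: create the KVM-VFIO device on the VM and register an
 * already-open VFIO group fd so KVM can track the IOMMU coherency of
 * the groups in use.  Per the documentation below, the group fd is
 * passed directly in kvm_device_attr.addr. */
static int kvm_vfio_add_group(int vm_fd, int vfio_group_fd)
{
	struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VFIO };
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_ADD_GROUP,
		.addr = (__u64)vfio_group_fd,
	};

	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0)
		return -1;

	if (ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr) < 0) {
		close(cd.fd);
		return -1;
	}

	/* A real VMM creates the device once and reuses cd.fd; KVM holds
	 * a reference to the group until KVM_DEV_VFIO_DEL_GROUP. */
	return cd.fd;
}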

Signed-off-by: Alex Williamson <[email protected]>
---
Documentation/virtual/kvm/devices/vfio.txt | 22 +++
arch/x86/include/asm/kvm_host.h | 1
arch/x86/kvm/Makefile | 2
arch/x86/kvm/vmx.c | 5 -
arch/x86/kvm/x86.c | 5 -
include/linux/kvm_host.h | 1
include/uapi/linux/kvm.h | 4
virt/kvm/kvm_main.c | 3
virt/kvm/vfio.c | 237 ++++++++++++++++++++++++++++
9 files changed, 275 insertions(+), 5 deletions(-)
create mode 100644 Documentation/virtual/kvm/devices/vfio.txt
create mode 100644 virt/kvm/vfio.c

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
new file mode 100644
index 0000000..831e6a6
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -0,0 +1,22 @@
+VFIO virtual device
+===================
+
+Device types supported:
+ KVM_DEV_TYPE_VFIO
+
+Only one VFIO instance may be created per VM. The created device
+tracks VFIO groups in use by the VM and features of those groups
+important to the correctness and acceleration of the VM. As groups
+are enabled and disabled for use by the VM, KVM should be updated
+about their presence. When registered with KVM, a reference to the
+VFIO-group is held by KVM.
+
+Groups:
+ KVM_DEV_VFIO_ADD_GROUP
+ KVM_DEV_VFIO_DEL_GROUP
+
+Each takes an int32_t file descriptor for kvm_device_attr.addr and
+does not support any group device kvm_device_attr.attr.
+
+RFC - Should we use Group KVM_DEV_VFIO_GROUP with Attributes
+ KVM_DEV_VFIO_GROUP_ADD & DEL?
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c76ff74..5b9350d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -588,6 +588,7 @@ struct kvm_arch {

spinlock_t pvclock_gtod_sync_lock;
bool use_master_clock;
+ bool vfio_noncoherent;
u64 master_kernel_ns;
cycle_t master_cycle_now;

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index bf4fb04..25d22b2 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -9,7 +9,7 @@ KVM := ../../../virt/kvm

kvm-y += $(KVM)/kvm_main.o $(KVM)/ioapic.o \
$(KVM)/coalesced_mmio.o $(KVM)/irq_comm.o \
- $(KVM)/eventfd.o $(KVM)/irqchip.o
+ $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
kvm-$(CONFIG_KVM_DEVICE_ASSIGNMENT) += $(KVM)/assigned-dev.o $(KVM)/iommu.o
kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1f1da43..94f7786 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7395,8 +7395,9 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
*/
if (is_mmio)
ret = MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
- else if (vcpu->kvm->arch.iommu_domain &&
- !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY))
+ else if (vcpu->kvm->arch.vfio_noncoherent ||
+ (vcpu->kvm->arch.iommu_domain &&
+ !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY)))
ret = kvm_get_guest_memory_type(vcpu, gfn) <<
VMX_EPT_MT_EPTE_SHIFT;
else
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e5ca72a..406ba6f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2715,8 +2715,9 @@ static void wbinvd_ipi(void *garbage)

static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
{
- return vcpu->kvm->arch.iommu_domain &&
- !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY);
+ return vcpu->kvm->arch.vfio_noncoherent ||
+ (vcpu->kvm->arch.iommu_domain &&
+ !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY));
}

void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ca645a0..615f0c3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1065,6 +1065,7 @@ struct kvm_device *kvm_device_from_filp(struct file *filp);

extern struct kvm_device_ops kvm_mpic_ops;
extern struct kvm_device_ops kvm_xics_ops;
+extern struct kvm_device_ops kvm_vfio_ops;

#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 99c2533..8869616 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -843,6 +843,10 @@ struct kvm_device_attr {
#define KVM_DEV_TYPE_FSL_MPIC_20 1
#define KVM_DEV_TYPE_FSL_MPIC_42 2
#define KVM_DEV_TYPE_XICS 3
+#define KVM_DEV_TYPE_VFIO 4
+
+#define KVM_DEV_VFIO_ADD_GROUP 1
+#define KVM_DEV_VFIO_DEL_GROUP 2

/*
* ioctls for VM fds
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d9cad4d..1a20425 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2269,6 +2269,9 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
ops = &kvm_xics_ops;
break;
#endif
+ case KVM_DEV_TYPE_VFIO:
+ ops = &kvm_vfio_ops;
+ break;
default:
return -ENODEV;
}
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
new file mode 100644
index 0000000..9a2faff
--- /dev/null
+++ b/virt/kvm/vfio.c
@@ -0,0 +1,237 @@
+/*
+ * VFIO bridge
+ *
+ * Copyright (C) 2013 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/errno.h>
+#include <linux/file.h>
+#include <linux/kvm_host.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/vfio.h>
+
+struct kvm_vfio_group {
+ struct list_head node;
+ struct vfio_group *vfio_group;
+};
+
+struct kvm_vfio {
+ struct list_head group_list;
+ struct mutex lock;
+};
+
+static struct vfio_group *kvm_vfio_group_get_external_user(struct file *filep)
+{
+ struct vfio_group *vfio_group;
+ struct vfio_group *(*fn)(struct file *);
+
+ fn = symbol_get(vfio_group_get_external_user);
+ if (!fn)
+ return ERR_PTR(-EINVAL);
+
+ vfio_group = fn(filep);
+
+ symbol_put(vfio_group_get_external_user);
+
+ return vfio_group;
+}
+
+static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
+{
+ void (*fn)(struct vfio_group *);
+
+ fn = symbol_get(vfio_group_put_external_user);
+ if (!fn)
+ return;
+
+ fn(vfio_group);
+
+ symbol_put(vfio_group_put_external_user);
+}
+
+static int kvm_vfio_external_user_check_extension(struct vfio_group *vfio_group,
+ unsigned long arg)
+{
+ int (*fn)(struct vfio_group *, unsigned long);
+ int ret;
+
+ fn = symbol_get(vfio_external_user_check_extension);
+ if (!fn)
+ return -EINVAL;
+
+ ret = fn(vfio_group, arg);
+
+ symbol_put(vfio_external_user_check_extension);
+
+ return ret;
+}
+
+static void kvm_vfio_update_iommu_coherency(struct kvm_device *dev)
+{
+ struct kvm_vfio *kv = dev->private;
+ bool coherent = true;
+ struct kvm_vfio_group *kvg;
+
+ mutex_lock(&kv->lock);
+
+ list_for_each_entry(kvg, &kv->group_list, node) {
+ if (!kvm_vfio_external_user_check_extension(kvg->vfio_group,
+ VFIO_IOMMU_CAP_CACHE_COHERENCY)) {
+ coherent = false;
+ break;
+ }
+ }
+
+ mutex_unlock(&kv->lock);
+
+ dev->kvm->arch.vfio_noncoherent = !coherent;
+}
+
+static int kvm_vfio_set_attr(struct kvm_device *dev,
+ struct kvm_device_attr *attr)
+{
+ struct kvm_vfio *kv = dev->private;
+ struct fd f;
+ struct vfio_group *vfio_group;
+ struct kvm_vfio_group *kvg;
+ int ret;
+
+ switch (attr->group) {
+ case KVM_DEV_VFIO_ADD_GROUP:
+ f = fdget(attr->addr);
+ if (!f.file)
+ return -EBADF;
+
+ vfio_group = kvm_vfio_group_get_external_user(f.file);
+ fdput(f);
+
+ if (IS_ERR(vfio_group))
+ return PTR_ERR(vfio_group);
+
+ mutex_lock(&kv->lock);
+
+ list_for_each_entry(kvg, &kv->group_list, node) {
+ if (kvg->vfio_group == vfio_group) {
+ mutex_unlock(&kv->lock);
+ kvm_vfio_group_put_external_user(vfio_group);
+ return -EEXIST;
+ }
+ }
+
+ kvg = kzalloc(sizeof(*kvg), GFP_KERNEL);
+ if (!kvg) {
+ mutex_unlock(&kv->lock);
+ kvm_vfio_group_put_external_user(vfio_group);
+ return -ENOMEM;
+ }
+
+ list_add_tail(&kvg->node, &kv->group_list);
+ kvg->vfio_group = vfio_group;
+
+ mutex_unlock(&kv->lock);
+
+ kvm_vfio_update_iommu_coherency(dev);
+
+ return 0;
+
+ case KVM_DEV_VFIO_DEL_GROUP:
+ f = fdget(attr->addr);
+ if (!f.file)
+ return -EBADF;
+
+ vfio_group = kvm_vfio_group_get_external_user(f.file);
+ fdput(f);
+
+ if (IS_ERR(vfio_group))
+ return PTR_ERR(vfio_group);
+
+ ret = -ENOENT;
+
+ mutex_lock(&kv->lock);
+
+ list_for_each_entry(kvg, &kv->group_list, node) {
+ if (kvg->vfio_group != vfio_group)
+ continue;
+
+ list_del(&kvg->node);
+ kvm_vfio_group_put_external_user(kvg->vfio_group);
+ kfree(kvg);
+ ret = 0;
+ break;
+ }
+
+ mutex_unlock(&kv->lock);
+ kvm_vfio_group_put_external_user(vfio_group);
+
+ kvm_vfio_update_iommu_coherency(dev);
+
+ return ret;
+ }
+
+ return -ENXIO;
+}
+
+static int kvm_vfio_has_attr(struct kvm_device *dev,
+ struct kvm_device_attr *attr)
+{
+ switch (attr->group) {
+ case KVM_DEV_VFIO_ADD_GROUP:
+ case KVM_DEV_VFIO_DEL_GROUP:
+ return 0;
+ }
+
+ return -ENXIO;
+}
+
+static void kvm_vfio_destroy(struct kvm_device *dev)
+{
+ struct kvm_vfio *kv = dev->private;
+ struct kvm_vfio_group *kvg, *tmp;
+
+ list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
+ kvm_vfio_group_put_external_user(kvg->vfio_group);
+ list_del(&kvg->node);
+ kfree(kvg);
+ }
+
+ dev->kvm->arch.vfio_noncoherent = false;
+ kfree(kv);
+}
+
+static int kvm_vfio_create(struct kvm_device *dev, u32 type)
+{
+ struct kvm_device *tmp;
+ struct kvm_vfio *kv;
+
+ /* Only one VFIO "device" per VM */
+ list_for_each_entry(tmp, &dev->kvm->devices, vm_node)
+ if (tmp->ops == &kvm_vfio_ops)
+ return -EBUSY;
+
+ kv = kzalloc(sizeof(*kv), GFP_KERNEL);
+ if (!kv)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&kv->group_list);
+ mutex_init(&kv->lock);
+
+ dev->private = kv;
+
+ return 0;
+}
+
+struct kvm_device_ops kvm_vfio_ops = {
+ .name = "kvm-vfio",
+ .create = kvm_vfio_create,
+ .destroy = kvm_vfio_destroy,
+ .set_attr = kvm_vfio_set_attr,
+ .has_attr = kvm_vfio_has_attr,
+};

2013-09-12 21:23:17

by Alex Williamson

Subject: [RFC PATCH 2/3] vfio: Add check extension interface to external user support

This adds the ability for an external user to check VFIO extensions.
The first one we care about is support for IOMMU cache coherency.
Without this, external users, like KVM, would need to assume the IOMMU
allows No-Snoop transactions, which are not coherent with the processor
cache.
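
For illustration (not part of this patch), a sketch of the intended caller
pattern; patch 3/3 adds the real user in KVM, and the helper name here is
made up:

/* Sketch only: an external module resolves the export at runtime so
 * vfio can remain a loadable module, then asks whether the IOMMU
 * backing this group enforces cache coherency. */
static bool example_group_is_coherent(struct vfio_group *group)
{
	int (*fn)(struct vfio_group *, unsigned long);
	bool coherent;

	fn = symbol_get(vfio_external_user_check_extension);
	if (!fn)
		return false;

	coherent = fn(group, VFIO_IOMMU_CAP_CACHE_COHERENCY) > 0;

	symbol_put(vfio_external_user_check_extension);

	return coherent;
}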

Signed-off-by: Alex Williamson <[email protected]>
---
drivers/vfio/vfio.c | 8 ++++++++
drivers/vfio/vfio_iommu_type1.c | 4 ++++
include/linux/vfio.h | 2 ++
include/uapi/linux/vfio.h | 1 +
4 files changed, 15 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 1eab4ac..b55979e 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1414,6 +1414,14 @@ int vfio_external_user_iommu_id(struct vfio_group *group)
}
EXPORT_SYMBOL_GPL(vfio_external_user_iommu_id);

+int vfio_external_user_check_extension(struct vfio_group *group,
+ unsigned long arg)
+{
+ return vfio_ioctl_check_extension(group->container, arg);
+
+}
+EXPORT_SYMBOL_GPL(vfio_external_user_check_extension);
+
/**
* Module/class support
*/
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index a9807de..b88b9f7 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -840,6 +840,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
switch (arg) {
case VFIO_TYPE1_IOMMU:
return 1;
+ case VFIO_IOMMU_CAP_CACHE_COHERENCY:
+ return !iommu ? 0 :
+ iommu_domain_has_cap(iommu->domain,
+ IOMMU_CAP_CACHE_COHERENCY);
default:
return 0;
}
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 24579a0..fe528e8 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -96,5 +96,7 @@ extern void vfio_unregister_iommu_driver(
extern struct vfio_group *vfio_group_get_external_user(struct file *filep);
extern void vfio_group_put_external_user(struct vfio_group *group);
extern int vfio_external_user_iommu_id(struct vfio_group *group);
+extern int vfio_external_user_check_extension(struct vfio_group *group,
+ unsigned long arg);

#endif /* VFIO_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0fd47f5..1c8bad6 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -23,6 +23,7 @@

#define VFIO_TYPE1_IOMMU 1
#define VFIO_SPAPR_TCE_IOMMU 2
+#define VFIO_IOMMU_CAP_CACHE_COHERENCY 3

/*
* The IOCTL interface is designed for extensibility by embedding the

2013-09-13 08:49:20

by Alexey Kardashevskiy

Subject: Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

On 09/13/2013 07:23 AM, Alex Williamson wrote:
> So far we've succeeded at making KVM and VFIO mostly unaware of each
> other, but there's any important point where that breaks down. Intel
> VT-d hardware may or may not support snoop control. When snoop
> control is available, intel-iommu promotes No-Snoop transactions on
> PCIe to be cache coherent. That allows KVM to handle things like the
> x86 WBINVD opcode as a nop. When the hardware does not support this,
> KVM must implement a hardware visible WBINVD for the guest.
>
> We could simply let userspace tell KVM how to handle WBINVD, but it's
> privileged for a reason. Allowing an arbitrary user to enable
> physical WBINVD gives them a more access to the hardware. Previously,
> this has only been enabled for guests supporting legacy PCI device
> assignment. In such cases it's necessary for proper guest execution.
> We therefore create a new KVM-VFIO virtual device. The user can add
> and remove VFIO groups to this device via file descriptors. KVM
> makes use of the VFIO external user interface to validate that the
> user has access to physical hardware and gets the coherency state of
> the IOMMU from VFIO. This provides equivalent functionality to
> legacy KVM assignment, while keeping (nearly) all the bits isolated.
>
> The one intrusion is the resulting flag indicating the coherency
> state. For this RFC it's placed on the x86 kvm_arch struct, however
> I know POWER has interest in using the VFIO external user interface,
> and I'm hoping we can share a common KVM-VFIO device. Perhaps they
> care about No-Snoop handling as well or the code can be #ifdef'd.


POWER (at least book3s - "server"; not sure about others) does not support
this cache-non-coherent stuff at all.

Regarding reusing this device with the external API for POWER - I posted a
patch which introduces a KVM device to link KVM with the IOMMU, but besides
the list of groups registered in KVM, it also provides a way to find a group
by LIOBN (logical bus number), which is used in the DMA map/unmap hypercalls.
So in my case the kvm_vfio_group struct needs a LIOBN, and it would be nice
to have window_size there too (for a quick boundary check). I am not sure we
want to mix everything here.

It is in "[PATCH v10 12/13] KVM: PPC: Add support for IOMMU in-kernel
handling" if you are interested (kvmppc_spapr_tce_iommu_device).



--
Alexey

2013-09-13 12:37:58

by Michael S. Tsirkin

Subject: Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

On Thu, Sep 12, 2013 at 03:23:15PM -0600, Alex Williamson wrote:
> So far we've succeeded at making KVM and VFIO mostly unaware of each
> other, but there's any important point where that breaks down. Intel
> VT-d hardware may or may not support snoop control. When snoop
> control is available, intel-iommu promotes No-Snoop transactions on
> PCIe to be cache coherent. That allows KVM to handle things like the
> x86 WBINVD opcode as a nop. When the hardware does not support this,
> KVM must implement a hardware visible WBINVD for the guest.
>
> We could simply let userspace tell KVM how to handle WBINVD, but it's
> privileged for a reason. Allowing an arbitrary user to enable
> physical WBINVD gives them a more access to the hardware. Previously,
> this has only been enabled for guests supporting legacy PCI device
> assignment. In such cases it's necessary for proper guest execution.
> We therefore create a new KVM-VFIO virtual device. The user can add
> and remove VFIO groups to this device via file descriptors. KVM
> makes use of the VFIO external user interface to validate that the
> user has access to physical hardware and gets the coherency state of
> the IOMMU from VFIO. This provides equivalent functionality to
> legacy KVM assignment, while keeping (nearly) all the bits isolated.


So how is the isolation handled then?
How is this better than an ioctl to grant WBINVD to the guest?
The kvm char device can be opened by any user,
so any user can grant itself these privileges.
What did I miss?


> The one intrusion is the resulting flag indicating the coherency
> state. For this RFC it's placed on the x86 kvm_arch struct, however
> I know POWER has interest in using the VFIO external user interface,
> and I'm hoping we can share a common KVM-VFIO device. Perhaps they
> care about No-Snoop handling as well or the code can be #ifdef'd.
>
> Signed-off-by: Alex Williamson <[email protected]>
> ---
> Documentation/virtual/kvm/devices/vfio.txt | 22 +++
> arch/x86/include/asm/kvm_host.h | 1
> arch/x86/kvm/Makefile | 2
> arch/x86/kvm/vmx.c | 5 -
> arch/x86/kvm/x86.c | 5 -
> include/linux/kvm_host.h | 1
> include/uapi/linux/kvm.h | 4
> virt/kvm/kvm_main.c | 3
> virt/kvm/vfio.c | 237 ++++++++++++++++++++++++++++
> 9 files changed, 275 insertions(+), 5 deletions(-)
> create mode 100644 Documentation/virtual/kvm/devices/vfio.txt
> create mode 100644 virt/kvm/vfio.c
>
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> new file mode 100644
> index 0000000..831e6a6
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -0,0 +1,22 @@
> +VFIO virtual device
> +===================
> +
> +Device types supported:
> + KVM_DEV_TYPE_VFIO
> +
> +Only one VFIO instance may be created per VM. The created device
> +tracks VFIO groups in use by the VM and features of those groups
> +important to the correctness and acceleration of the VM. As groups
> +are enabled and disabled for use by the VM, KVM should be updated
> +about their presence. When registered with KVM, a reference to the
> +VFIO-group is held by KVM.
> +
> +Groups:
> + KVM_DEV_VFIO_ADD_GROUP
> + KVM_DEV_VFIO_DEL_GROUP
> +
> +Each takes a int32_t file descriptor for kvm_device_attr.addr and
> +does not support any group device kvm_device_attr.attr.
> +
> +RFC - Should we use Group KVM_DEV_VFIO_GROUP with Attributes
> + KVM_DEV_VFIO_GROUP_ADD & DEL?
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c76ff74..5b9350d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -588,6 +588,7 @@ struct kvm_arch {
>
> spinlock_t pvclock_gtod_sync_lock;
> bool use_master_clock;
> + bool vfio_noncoherent;
> u64 master_kernel_ns;
> cycle_t master_cycle_now;
>
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index bf4fb04..25d22b2 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -9,7 +9,7 @@ KVM := ../../../virt/kvm
>
> kvm-y += $(KVM)/kvm_main.o $(KVM)/ioapic.o \
> $(KVM)/coalesced_mmio.o $(KVM)/irq_comm.o \
> - $(KVM)/eventfd.o $(KVM)/irqchip.o
> + $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> kvm-$(CONFIG_KVM_DEVICE_ASSIGNMENT) += $(KVM)/assigned-dev.o $(KVM)/iommu.o
> kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 1f1da43..94f7786 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7395,8 +7395,9 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> */
> if (is_mmio)
> ret = MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> - else if (vcpu->kvm->arch.iommu_domain &&
> - !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY))
> + else if (vcpu->kvm->arch.vfio_noncoherent ||
> + vcpu->kvm->arch.iommu_domain &&
> + !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY))
> ret = kvm_get_guest_memory_type(vcpu, gfn) <<
> VMX_EPT_MT_EPTE_SHIFT;
> else
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e5ca72a..406ba6f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2715,8 +2715,9 @@ static void wbinvd_ipi(void *garbage)
>
> static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
> {
> - return vcpu->kvm->arch.iommu_domain &&
> - !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY);
> + return vcpu->kvm->arch.vfio_noncoherent ||
> + (vcpu->kvm->arch.iommu_domain &&
> + !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY));
> }
>
> void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ca645a0..615f0c3 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1065,6 +1065,7 @@ struct kvm_device *kvm_device_from_filp(struct file *filp);
>
> extern struct kvm_device_ops kvm_mpic_ops;
> extern struct kvm_device_ops kvm_xics_ops;
> +extern struct kvm_device_ops kvm_vfio_ops;
>
> #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 99c2533..8869616 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -843,6 +843,10 @@ struct kvm_device_attr {
> #define KVM_DEV_TYPE_FSL_MPIC_20 1
> #define KVM_DEV_TYPE_FSL_MPIC_42 2
> #define KVM_DEV_TYPE_XICS 3
> +#define KVM_DEV_TYPE_VFIO 4
> +
> +#define KVM_DEV_VFIO_ADD_GROUP 1
> +#define KVM_DEV_VFIO_DEL_GROUP 2
>
> /*
> * ioctls for VM fds
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d9cad4d..1a20425 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2269,6 +2269,9 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
> ops = &kvm_xics_ops;
> break;
> #endif
> + case KVM_DEV_TYPE_VFIO:
> + ops = &kvm_vfio_ops;
> + break;
> default:
> return -ENODEV;
> }
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> new file mode 100644
> index 0000000..9a2faff
> --- /dev/null
> +++ b/virt/kvm/vfio.c
> @@ -0,0 +1,237 @@
> +/*
> + * VFIO bridge
> + *
> + * Copyright (C) 2013 Red Hat, Inc. All rights reserved.
> + * Author: Alex Williamson <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/errno.h>
> +#include <linux/file.h>
> +#include <linux/kvm_host.h>
> +#include <linux/list.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/slab.h>
> +#include <linux/vfio.h>
> +
> +struct kvm_vfio_group {
> + struct list_head node;
> + struct vfio_group *vfio_group;
> +};
> +
> +struct kvm_vfio {
> + struct list_head group_list;
> + struct mutex lock;
> +};
> +
> +static struct vfio_group *kvm_vfio_group_get_external_user(struct file *filep)
> +{
> + struct vfio_group *vfio_group;
> + struct vfio_group *(*fn)(struct file *);
> +
> + fn = symbol_get(vfio_group_get_external_user);
> + if (!fn)
> + return ERR_PTR(-EINVAL);
> +
> + vfio_group = fn(filep);
> +
> + symbol_put(vfio_group_get_external_user);
> +
> + return vfio_group;
> +}
> +
> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> +{
> + void (*fn)(struct vfio_group *);
> +
> + fn = symbol_get(vfio_group_put_external_user);
> + if (!fn)
> + return;
> +
> + fn(vfio_group);
> +
> + symbol_put(vfio_group_put_external_user);
> +}
> +
> +static int kvm_vfio_external_user_check_extension(struct vfio_group *vfio_group,
> + unsigned long arg)
> +{
> + int (*fn)(struct vfio_group *, unsigned long);
> + int ret;
> +
> + fn = symbol_get(vfio_external_user_check_extension);
> + if (!fn)
> + return -EINVAL;
> +
> + ret = fn(vfio_group, arg);
> +
> + symbol_put(vfio_group_put_external_user);
> +
> + return ret;
> +}
> +
> +static void kvm_vfio_update_iommu_coherency(struct kvm_device *dev)
> +{
> + struct kvm_vfio *kv = dev->private;
> + bool coherent = true;
> + struct kvm_vfio_group *kvg;
> +
> + mutex_lock(&kv->lock);
> +
> + list_for_each_entry(kvg, &kv->group_list, node) {
> + if (!kvm_vfio_external_user_check_extension(kvg->vfio_group,
> + VFIO_IOMMU_CAP_CACHE_COHERENCY)) {
> + coherent = false;
> + break;
> + }
> + }
> +
> + mutex_unlock(&kv->lock);
> +
> + dev->kvm->arch.vfio_noncoherent = !coherent;
> +}
> +
> +static int kvm_vfio_set_attr(struct kvm_device *dev,
> + struct kvm_device_attr *attr)
> +{
> + struct kvm_vfio *kv = dev->private;
> + struct fd f;
> + struct vfio_group *vfio_group;
> + struct kvm_vfio_group *kvg;
> + int ret;
> +
> + switch (attr->group) {
> + case KVM_DEV_VFIO_ADD_GROUP:
> + f = fdget(attr->addr);
> + if (!f.file)
> + return -EBADF;
> +
> + vfio_group = kvm_vfio_group_get_external_user(f.file);
> + fdput(f);
> +
> + if (IS_ERR(vfio_group))
> + return PTR_ERR(vfio_group);
> +
> + mutex_lock(&kv->lock);
> +
> + list_for_each_entry(kvg, &kv->group_list, node) {
> + if (kvg->vfio_group == vfio_group) {
> + mutex_unlock(&kv->lock);
> + kvm_vfio_group_put_external_user(vfio_group);
> + return -EEXIST;
> + }
> + }
> +
> + kvg = kzalloc(sizeof(*kvg), GFP_KERNEL);
> + if (!kvg) {
> + mutex_unlock(&kv->lock);
> + kvm_vfio_group_put_external_user(vfio_group);
> + return -ENOMEM;
> + }
> +
> + list_add_tail(&kvg->node, &kv->group_list);
> + kvg->vfio_group = vfio_group;
> +
> + mutex_unlock(&kv->lock);
> +
> + kvm_vfio_update_iommu_coherency(dev);
> +
> + return 0;
> +
> + case KVM_DEV_VFIO_DEL_GROUP:
> + f = fdget(attr->addr);
> + if (!f.file)
> + return -EBADF;
> +
> + vfio_group = kvm_vfio_group_get_external_user(f.file);
> + fdput(f);
> +
> + if (IS_ERR(vfio_group))
> + return PTR_ERR(vfio_group);
> +
> + ret = -ENOENT;
> +
> + mutex_lock(&kv->lock);
> +
> + list_for_each_entry(kvg, &kv->group_list, node) {
> + if (kvg->vfio_group != vfio_group)
> + continue;
> +
> + list_del(&kvg->node);
> + kvm_vfio_group_put_external_user(kvg->vfio_group);
> + kfree(kvg);
> + ret = 0;
> + break;
> + }
> +
> + mutex_unlock(&kv->lock);
> + kvm_vfio_group_put_external_user(vfio_group);
> +
> + kvm_vfio_update_iommu_coherency(dev);
> +
> + return ret;
> + }
> +
> + return -ENXIO;
> +}
> +
> +static int kvm_vfio_has_attr(struct kvm_device *dev,
> + struct kvm_device_attr *attr)
> +{
> + switch (attr->group) {
> + case KVM_DEV_VFIO_ADD_GROUP:
> + case KVM_DEV_VFIO_DEL_GROUP:
> + return 0;
> + }
> +
> + return -ENXIO;
> +}
> +
> +static void kvm_vfio_destroy(struct kvm_device *dev)
> +{
> + struct kvm_vfio *kv = dev->private;
> + struct kvm_vfio_group *kvg, *tmp;
> +
> + list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> + kvm_vfio_group_put_external_user(kvg->vfio_group);
> + list_del(&kvg->node);
> + kfree(kvg);
> + }
> +
> + dev->kvm->arch.vfio_noncoherent = false;
> + kfree(kv);
> +}
> +
> +static int kvm_vfio_create(struct kvm_device *dev, u32 type)
> +{
> + struct kvm_device *tmp;
> + struct kvm_vfio *kv;
> +
> + /* Only one VFIO "device" per VM */
> + list_for_each_entry(tmp, &dev->kvm->devices, vm_node)
> + if (tmp->ops == &kvm_vfio_ops)
> + return -EBUSY;
> +
> + kv = kzalloc(sizeof(*kv), GFP_KERNEL);
> + if (!kv)
> + return -ENOMEM;
> +
> + INIT_LIST_HEAD(&kv->group_list);
> + mutex_init(&kv->lock);
> +
> + dev->private = kv;
> +
> + return 0;
> +}
> +
> +struct kvm_device_ops kvm_vfio_ops = {
> + .name = "kvm-vfio",
> + .create = kvm_vfio_create,
> + .destroy = kvm_vfio_destroy,
> + .set_attr = kvm_vfio_set_attr,
> + .has_attr = kvm_vfio_has_attr,
> +};

2013-09-13 14:13:48

by Alex Williamson

Subject: Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

On Fri, 2013-09-13 at 15:39 +0300, Michael S. Tsirkin wrote:
> On Thu, Sep 12, 2013 at 03:23:15PM -0600, Alex Williamson wrote:
> > So far we've succeeded at making KVM and VFIO mostly unaware of each
> > other, but there's any important point where that breaks down. Intel
> > VT-d hardware may or may not support snoop control. When snoop
> > control is available, intel-iommu promotes No-Snoop transactions on
> > PCIe to be cache coherent. That allows KVM to handle things like the
> > x86 WBINVD opcode as a nop. When the hardware does not support this,
> > KVM must implement a hardware visible WBINVD for the guest.
> >
> > We could simply let userspace tell KVM how to handle WBINVD, but it's
> > privileged for a reason. Allowing an arbitrary user to enable
> > physical WBINVD gives them a more access to the hardware. Previously,
> > this has only been enabled for guests supporting legacy PCI device
> > assignment. In such cases it's necessary for proper guest execution.
> > We therefore create a new KVM-VFIO virtual device. The user can add
> > and remove VFIO groups to this device via file descriptors. KVM
> > makes use of the VFIO external user interface to validate that the
> > user has access to physical hardware and gets the coherency state of
> > the IOMMU from VFIO. This provides equivalent functionality to
> > legacy KVM assignment, while keeping (nearly) all the bits isolated.
>
>
> So how is the isolation handled then?

By isolation above, I'm only talking about the intrusiveness of the
code and "leaking" vfio knowledge into KVM. Nothing to do with device
isolation.

> How is this better than a ioctl to grant WBINVD to guest?
> kvm char device can be opened by any user,
> so any user can grant itself these priveledges.
> What did I miss?

With this interface the caller must have physical access to one or more
devices via vfio groups and each of those groups must be configured into
one or more IOMMU domains. Furthermore, at least one of the IOMMU
domains must not include the IOMMU_CAP_CACHE_COHERENCY capability. So
it's actually a significantly higher hurdle than an ioctl open to
anyone and the number of VMs on a given host capable of doing this is
bound by the number of IOMMU groups. We do not however verify that a
vfio device is actually in use by the VM, but I don't think there's a
way to do that from KVM and I'm not sure that it's important to do so.
I believe having access to physical hardware is already a sufficient
granting of privilege to enable things like WBINVD. Thanks,

Alex

> > The one intrusion is the resulting flag indicating the coherency
> > state. For this RFC it's placed on the x86 kvm_arch struct, however
> > I know POWER has interest in using the VFIO external user interface,
> > and I'm hoping we can share a common KVM-VFIO device. Perhaps they
> > care about No-Snoop handling as well or the code can be #ifdef'd.
> >
> > Signed-off-by: Alex Williamson <[email protected]>
> > ---
> > Documentation/virtual/kvm/devices/vfio.txt | 22 +++
> > arch/x86/include/asm/kvm_host.h | 1
> > arch/x86/kvm/Makefile | 2
> > arch/x86/kvm/vmx.c | 5 -
> > arch/x86/kvm/x86.c | 5 -
> > include/linux/kvm_host.h | 1
> > include/uapi/linux/kvm.h | 4
> > virt/kvm/kvm_main.c | 3
> > virt/kvm/vfio.c | 237 ++++++++++++++++++++++++++++
> > 9 files changed, 275 insertions(+), 5 deletions(-)
> > create mode 100644 Documentation/virtual/kvm/devices/vfio.txt
> > create mode 100644 virt/kvm/vfio.c
> >
> > diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> > new file mode 100644
> > index 0000000..831e6a6
> > --- /dev/null
> > +++ b/Documentation/virtual/kvm/devices/vfio.txt
> > @@ -0,0 +1,22 @@
> > +VFIO virtual device
> > +===================
> > +
> > +Device types supported:
> > + KVM_DEV_TYPE_VFIO
> > +
> > +Only one VFIO instance may be created per VM. The created device
> > +tracks VFIO groups in use by the VM and features of those groups
> > +important to the correctness and acceleration of the VM. As groups
> > +are enabled and disabled for use by the VM, KVM should be updated
> > +about their presence. When registered with KVM, a reference to the
> > +VFIO-group is held by KVM.
> > +
> > +Groups:
> > + KVM_DEV_VFIO_ADD_GROUP
> > + KVM_DEV_VFIO_DEL_GROUP
> > +
> > +Each takes a int32_t file descriptor for kvm_device_attr.addr and
> > +does not support any group device kvm_device_attr.attr.
> > +
> > +RFC - Should we use Group KVM_DEV_VFIO_GROUP with Attributes
> > + KVM_DEV_VFIO_GROUP_ADD & DEL?
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index c76ff74..5b9350d 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -588,6 +588,7 @@ struct kvm_arch {
> >
> > spinlock_t pvclock_gtod_sync_lock;
> > bool use_master_clock;
> > + bool vfio_noncoherent;
> > u64 master_kernel_ns;
> > cycle_t master_cycle_now;
> >
> > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > index bf4fb04..25d22b2 100644
> > --- a/arch/x86/kvm/Makefile
> > +++ b/arch/x86/kvm/Makefile
> > @@ -9,7 +9,7 @@ KVM := ../../../virt/kvm
> >
> > kvm-y += $(KVM)/kvm_main.o $(KVM)/ioapic.o \
> > $(KVM)/coalesced_mmio.o $(KVM)/irq_comm.o \
> > - $(KVM)/eventfd.o $(KVM)/irqchip.o
> > + $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> > kvm-$(CONFIG_KVM_DEVICE_ASSIGNMENT) += $(KVM)/assigned-dev.o $(KVM)/iommu.o
> > kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
> >
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 1f1da43..94f7786 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -7395,8 +7395,9 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > */
> > if (is_mmio)
> > ret = MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> > - else if (vcpu->kvm->arch.iommu_domain &&
> > - !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY))
> > + else if (vcpu->kvm->arch.vfio_noncoherent ||
> > + vcpu->kvm->arch.iommu_domain &&
> > + !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY))
> > ret = kvm_get_guest_memory_type(vcpu, gfn) <<
> > VMX_EPT_MT_EPTE_SHIFT;
> > else
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index e5ca72a..406ba6f 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2715,8 +2715,9 @@ static void wbinvd_ipi(void *garbage)
> >
> > static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
> > {
> > - return vcpu->kvm->arch.iommu_domain &&
> > - !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY);
> > + return vcpu->kvm->arch.vfio_noncoherent ||
> > + (vcpu->kvm->arch.iommu_domain &&
> > + !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY));
> > }
> >
> > void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index ca645a0..615f0c3 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1065,6 +1065,7 @@ struct kvm_device *kvm_device_from_filp(struct file *filp);
> >
> > extern struct kvm_device_ops kvm_mpic_ops;
> > extern struct kvm_device_ops kvm_xics_ops;
> > +extern struct kvm_device_ops kvm_vfio_ops;
> >
> > #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
> >
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 99c2533..8869616 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -843,6 +843,10 @@ struct kvm_device_attr {
> > #define KVM_DEV_TYPE_FSL_MPIC_20 1
> > #define KVM_DEV_TYPE_FSL_MPIC_42 2
> > #define KVM_DEV_TYPE_XICS 3
> > +#define KVM_DEV_TYPE_VFIO 4
> > +
> > +#define KVM_DEV_VFIO_ADD_GROUP 1
> > +#define KVM_DEV_VFIO_DEL_GROUP 2
> >
> > /*
> > * ioctls for VM fds
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index d9cad4d..1a20425 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2269,6 +2269,9 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
> > ops = &kvm_xics_ops;
> > break;
> > #endif
> > + case KVM_DEV_TYPE_VFIO:
> > + ops = &kvm_vfio_ops;
> > + break;
> > default:
> > return -ENODEV;
> > }
> > diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> > new file mode 100644
> > index 0000000..9a2faff
> > --- /dev/null
> > +++ b/virt/kvm/vfio.c
> > @@ -0,0 +1,237 @@
> > +/*
> > + * VFIO bridge
> > + *
> > + * Copyright (C) 2013 Red Hat, Inc. All rights reserved.
> > + * Author: Alex Williamson <[email protected]>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <linux/errno.h>
> > +#include <linux/file.h>
> > +#include <linux/kvm_host.h>
> > +#include <linux/list.h>
> > +#include <linux/module.h>
> > +#include <linux/mutex.h>
> > +#include <linux/slab.h>
> > +#include <linux/vfio.h>
> > +
> > +struct kvm_vfio_group {
> > + struct list_head node;
> > + struct vfio_group *vfio_group;
> > +};
> > +
> > +struct kvm_vfio {
> > + struct list_head group_list;
> > + struct mutex lock;
> > +};
> > +
> > +static struct vfio_group *kvm_vfio_group_get_external_user(struct file *filep)
> > +{
> > + struct vfio_group *vfio_group;
> > + struct vfio_group *(*fn)(struct file *);
> > +
> > + fn = symbol_get(vfio_group_get_external_user);
> > + if (!fn)
> > + return ERR_PTR(-EINVAL);
> > +
> > + vfio_group = fn(filep);
> > +
> > + symbol_put(vfio_group_get_external_user);
> > +
> > + return vfio_group;
> > +}
> > +
> > +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> > +{
> > + void (*fn)(struct vfio_group *);
> > +
> > + fn = symbol_get(vfio_group_put_external_user);
> > + if (!fn)
> > + return;
> > +
> > + fn(vfio_group);
> > +
> > + symbol_put(vfio_group_put_external_user);
> > +}
> > +
> > +static int kvm_vfio_external_user_check_extension(struct vfio_group *vfio_group,
> > + unsigned long arg)
> > +{
> > + int (*fn)(struct vfio_group *, unsigned long);
> > + int ret;
> > +
> > + fn = symbol_get(vfio_external_user_check_extension);
> > + if (!fn)
> > + return -EINVAL;
> > +
> > + ret = fn(vfio_group, arg);
> > +
> > + symbol_put(vfio_group_put_external_user);
> > +
> > + return ret;
> > +}
> > +
> > +static void kvm_vfio_update_iommu_coherency(struct kvm_device *dev)
> > +{
> > + struct kvm_vfio *kv = dev->private;
> > + bool coherent = true;
> > + struct kvm_vfio_group *kvg;
> > +
> > + mutex_lock(&kv->lock);
> > +
> > + list_for_each_entry(kvg, &kv->group_list, node) {
> > + if (!kvm_vfio_external_user_check_extension(kvg->vfio_group,
> > + VFIO_IOMMU_CAP_CACHE_COHERENCY)) {
> > + coherent = false;
> > + break;
> > + }
> > + }
> > +
> > + mutex_unlock(&kv->lock);
> > +
> > + dev->kvm->arch.vfio_noncoherent = !coherent;
> > +}
> > +
> > +static int kvm_vfio_set_attr(struct kvm_device *dev,
> > + struct kvm_device_attr *attr)
> > +{
> > + struct kvm_vfio *kv = dev->private;
> > + struct fd f;
> > + struct vfio_group *vfio_group;
> > + struct kvm_vfio_group *kvg;
> > + int ret;
> > +
> > + switch (attr->group) {
> > + case KVM_DEV_VFIO_ADD_GROUP:
> > + f = fdget(attr->addr);
> > + if (!f.file)
> > + return -EBADF;
> > +
> > + vfio_group = kvm_vfio_group_get_external_user(f.file);
> > + fdput(f);
> > +
> > + if (IS_ERR(vfio_group))
> > + return PTR_ERR(vfio_group);
> > +
> > + mutex_lock(&kv->lock);
> > +
> > + list_for_each_entry(kvg, &kv->group_list, node) {
> > + if (kvg->vfio_group == vfio_group) {
> > + mutex_unlock(&kv->lock);
> > + kvm_vfio_group_put_external_user(vfio_group);
> > + return -EEXIST;
> > + }
> > + }
> > +
> > + kvg = kzalloc(sizeof(*kvg), GFP_KERNEL);
> > + if (!kvg) {
> > + mutex_unlock(&kv->lock);
> > + kvm_vfio_group_put_external_user(vfio_group);
> > + return -ENOMEM;
> > + }
> > +
> > + list_add_tail(&kvg->node, &kv->group_list);
> > + kvg->vfio_group = vfio_group;
> > +
> > + mutex_unlock(&kv->lock);
> > +
> > + kvm_vfio_update_iommu_coherency(dev);
> > +
> > + return 0;
> > +
> > + case KVM_DEV_VFIO_DEL_GROUP:
> > + f = fdget(attr->addr);
> > + if (!f.file)
> > + return -EBADF;
> > +
> > + vfio_group = kvm_vfio_group_get_external_user(f.file);
> > + fdput(f);
> > +
> > + if (IS_ERR(vfio_group))
> > + return PTR_ERR(vfio_group);
> > +
> > + ret = -ENOENT;
> > +
> > + mutex_lock(&kv->lock);
> > +
> > + list_for_each_entry(kvg, &kv->group_list, node) {
> > + if (kvg->vfio_group != vfio_group)
> > + continue;
> > +
> > + list_del(&kvg->node);
> > + kvm_vfio_group_put_external_user(kvg->vfio_group);
> > + kfree(kvg);
> > + ret = 0;
> > + break;
> > + }
> > +
> > + mutex_unlock(&kv->lock);
> > + kvm_vfio_group_put_external_user(vfio_group);
> > +
> > + kvm_vfio_update_iommu_coherency(dev);
> > +
> > + return ret;
> > + }
> > +
> > + return -ENXIO;
> > +}
> > +
> > +static int kvm_vfio_has_attr(struct kvm_device *dev,
> > + struct kvm_device_attr *attr)
> > +{
> > + switch (attr->group) {
> > + case KVM_DEV_VFIO_ADD_GROUP:
> > + case KVM_DEV_VFIO_DEL_GROUP:
> > + return 0;
> > + }
> > +
> > + return -ENXIO;
> > +}
> > +
> > +static void kvm_vfio_destroy(struct kvm_device *dev)
> > +{
> > + struct kvm_vfio *kv = dev->private;
> > + struct kvm_vfio_group *kvg, *tmp;
> > +
> > + list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> > + kvm_vfio_group_put_external_user(kvg->vfio_group);
> > + list_del(&kvg->node);
> > + kfree(kvg);
> > + }
> > +
> > + dev->kvm->arch.vfio_noncoherent = false;
> > + kfree(kv);
> > +}
> > +
> > +static int kvm_vfio_create(struct kvm_device *dev, u32 type)
> > +{
> > + struct kvm_device *tmp;
> > + struct kvm_vfio *kv;
> > +
> > + /* Only one VFIO "device" per VM */
> > + list_for_each_entry(tmp, &dev->kvm->devices, vm_node)
> > + if (tmp->ops == &kvm_vfio_ops)
> > + return -EBUSY;
> > +
> > + kv = kzalloc(sizeof(*kv), GFP_KERNEL);
> > + if (!kv)
> > + return -ENOMEM;
> > +
> > + INIT_LIST_HEAD(&kv->group_list);
> > + mutex_init(&kv->lock);
> > +
> > + dev->private = kv;
> > +
> > + return 0;
> > +}
> > +
> > +struct kvm_device_ops kvm_vfio_ops = {
> > + .name = "kvm-vfio",
> > + .create = kvm_vfio_create,
> > + .destroy = kvm_vfio_destroy,
> > + .set_attr = kvm_vfio_set_attr,
> > + .has_attr = kvm_vfio_has_attr,
> > +};


2013-09-13 14:50:04

by Michael S. Tsirkin

Subject: Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

On Fri, Sep 13, 2013 at 08:13:40AM -0600, Alex Williamson wrote:
> On Fri, 2013-09-13 at 15:39 +0300, Michael S. Tsirkin wrote:
> > On Thu, Sep 12, 2013 at 03:23:15PM -0600, Alex Williamson wrote:
> > > So far we've succeeded at making KVM and VFIO mostly unaware of each
> > > other, but there's any important point where that breaks down. Intel
> > > VT-d hardware may or may not support snoop control. When snoop
> > > control is available, intel-iommu promotes No-Snoop transactions on
> > > PCIe to be cache coherent. That allows KVM to handle things like the
> > > x86 WBINVD opcode as a nop. When the hardware does not support this,
> > > KVM must implement a hardware visible WBINVD for the guest.
> > >
> > > We could simply let userspace tell KVM how to handle WBINVD, but it's
> > > privileged for a reason. Allowing an arbitrary user to enable
> > > physical WBINVD gives them a more access to the hardware. Previously,
> > > this has only been enabled for guests supporting legacy PCI device
> > > assignment. In such cases it's necessary for proper guest execution.
> > > We therefore create a new KVM-VFIO virtual device. The user can add
> > > and remove VFIO groups to this device via file descriptors. KVM
> > > makes use of the VFIO external user interface to validate that the
> > > user has access to physical hardware and gets the coherency state of
> > > the IOMMU from VFIO. This provides equivalent functionality to
> > > legacy KVM assignment, while keeping (nearly) all the bits isolated.
> >
> >
> > So how is the isolation handled then?
>
> By isolation above, I'm only talking about the intrusive-ness of the
> code and "leaking" vfio knowledge into KVM. Nothing to do with device
> isolation.
>
> > How is this better than a ioctl to grant WBINVD to guest?
> > kvm char device can be opened by any user,
> > so any user can grant itself these priveledges.
> > What did I miss?
>
> With this interface the caller must have physical access to one or more
> devices via vfio groups and each of those groups must be configured into
> one or more IOMMU domains. Furthermore, at least one of the IOMMU
> domains must not include the IOMMU_CAP_CACHE_COHERENCY capability. So
> it's actually quite a significantly higher hurdle than an ioctl open to
> anyone and the number of VMs on a given host capable of doing this is
> bound by the number of IOMMU groups. We do not however verify that a
> vfio device is actually in use by the VM, but I don't think there's a
> way to do that from KVM and I'm not sure that it's important to do so.
> I believe having access to physical hardware is already a sufficient
> granting of privilege to enable things like WBINVD. Thanks,
>
> Alex

Fair enough, but how about revoking the privilege?
For example, a device can be removed by hotplug - does
the privilege remain?
Is this important at all?


> > > The one intrusion is the resulting flag indicating the coherency
> > > state. For this RFC it's placed on the x86 kvm_arch struct, however
> > > I know POWER has interest in using the VFIO external user interface,
> > > and I'm hoping we can share a common KVM-VFIO device. Perhaps they
> > > care about No-Snoop handling as well or the code can be #ifdef'd.
> > >
> > > Signed-off-by: Alex Williamson <[email protected]>
> > > ---
> > > Documentation/virtual/kvm/devices/vfio.txt | 22 +++
> > > arch/x86/include/asm/kvm_host.h | 1
> > > arch/x86/kvm/Makefile | 2
> > > arch/x86/kvm/vmx.c | 5 -
> > > arch/x86/kvm/x86.c | 5 -
> > > include/linux/kvm_host.h | 1
> > > include/uapi/linux/kvm.h | 4
> > > virt/kvm/kvm_main.c | 3
> > > virt/kvm/vfio.c | 237 ++++++++++++++++++++++++++++
> > > 9 files changed, 275 insertions(+), 5 deletions(-)
> > > create mode 100644 Documentation/virtual/kvm/devices/vfio.txt
> > > create mode 100644 virt/kvm/vfio.c
> > >
> > > diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> > > new file mode 100644
> > > index 0000000..831e6a6
> > > --- /dev/null
> > > +++ b/Documentation/virtual/kvm/devices/vfio.txt
> > > @@ -0,0 +1,22 @@
> > > +VFIO virtual device
> > > +===================
> > > +
> > > +Device types supported:
> > > + KVM_DEV_TYPE_VFIO
> > > +
> > > +Only one VFIO instance may be created per VM. The created device
> > > +tracks VFIO groups in use by the VM and features of those groups
> > > +important to the correctness and acceleration of the VM. As groups
> > > +are enabled and disabled for use by the VM, KVM should be updated
> > > +about their presence. When registered with KVM, a reference to the
> > > +VFIO-group is held by KVM.
> > > +
> > > +Groups:
> > > + KVM_DEV_VFIO_ADD_GROUP
> > > + KVM_DEV_VFIO_DEL_GROUP
> > > +
> > > +Each takes a int32_t file descriptor for kvm_device_attr.addr and
> > > +does not support any group device kvm_device_attr.attr.
> > > +
> > > +RFC - Should we use Group KVM_DEV_VFIO_GROUP with Attributes
> > > + KVM_DEV_VFIO_GROUP_ADD & DEL?
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index c76ff74..5b9350d 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -588,6 +588,7 @@ struct kvm_arch {
> > >
> > > spinlock_t pvclock_gtod_sync_lock;
> > > bool use_master_clock;
> > > + bool vfio_noncoherent;
> > > u64 master_kernel_ns;
> > > cycle_t master_cycle_now;
> > >
> > > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > > index bf4fb04..25d22b2 100644
> > > --- a/arch/x86/kvm/Makefile
> > > +++ b/arch/x86/kvm/Makefile
> > > @@ -9,7 +9,7 @@ KVM := ../../../virt/kvm
> > >
> > > kvm-y += $(KVM)/kvm_main.o $(KVM)/ioapic.o \
> > > $(KVM)/coalesced_mmio.o $(KVM)/irq_comm.o \
> > > - $(KVM)/eventfd.o $(KVM)/irqchip.o
> > > + $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> > > kvm-$(CONFIG_KVM_DEVICE_ASSIGNMENT) += $(KVM)/assigned-dev.o $(KVM)/iommu.o
> > > kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
> > >
> > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > > index 1f1da43..94f7786 100644
> > > --- a/arch/x86/kvm/vmx.c
> > > +++ b/arch/x86/kvm/vmx.c
> > > @@ -7395,8 +7395,9 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > > */
> > > if (is_mmio)
> > > ret = MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> > > - else if (vcpu->kvm->arch.iommu_domain &&
> > > - !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY))
> > > + else if (vcpu->kvm->arch.vfio_noncoherent ||
> > > + vcpu->kvm->arch.iommu_domain &&
> > > + !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY))
> > > ret = kvm_get_guest_memory_type(vcpu, gfn) <<
> > > VMX_EPT_MT_EPTE_SHIFT;
> > > else
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index e5ca72a..406ba6f 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -2715,8 +2715,9 @@ static void wbinvd_ipi(void *garbage)
> > >
> > > static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
> > > {
> > > - return vcpu->kvm->arch.iommu_domain &&
> > > - !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY);
> > > + return vcpu->kvm->arch.vfio_noncoherent ||
> > > + (vcpu->kvm->arch.iommu_domain &&
> > > + !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY));
> > > }
> > >
> > > void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index ca645a0..615f0c3 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -1065,6 +1065,7 @@ struct kvm_device *kvm_device_from_filp(struct file *filp);
> > >
> > > extern struct kvm_device_ops kvm_mpic_ops;
> > > extern struct kvm_device_ops kvm_xics_ops;
> > > +extern struct kvm_device_ops kvm_vfio_ops;
> > >
> > > #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
> > >
> > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > index 99c2533..8869616 100644
> > > --- a/include/uapi/linux/kvm.h
> > > +++ b/include/uapi/linux/kvm.h
> > > @@ -843,6 +843,10 @@ struct kvm_device_attr {
> > > #define KVM_DEV_TYPE_FSL_MPIC_20 1
> > > #define KVM_DEV_TYPE_FSL_MPIC_42 2
> > > #define KVM_DEV_TYPE_XICS 3
> > > +#define KVM_DEV_TYPE_VFIO 4
> > > +
> > > +#define KVM_DEV_VFIO_ADD_GROUP 1
> > > +#define KVM_DEV_VFIO_DEL_GROUP 2
> > >
> > > /*
> > > * ioctls for VM fds
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index d9cad4d..1a20425 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -2269,6 +2269,9 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
> > > ops = &kvm_xics_ops;
> > > break;
> > > #endif
> > > + case KVM_DEV_TYPE_VFIO:
> > > + ops = &kvm_vfio_ops;
> > > + break;
> > > default:
> > > return -ENODEV;
> > > }
> > > diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> > > new file mode 100644
> > > index 0000000..9a2faff
> > > --- /dev/null
> > > +++ b/virt/kvm/vfio.c
> > > @@ -0,0 +1,237 @@
> > > +/*
> > > + * VFIO bridge
> > > + *
> > > + * Copyright (C) 2013 Red Hat, Inc. All rights reserved.
> > > + * Author: Alex Williamson <[email protected]>
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License version 2 as
> > > + * published by the Free Software Foundation.
> > > + */
> > > +
> > > +#include <linux/errno.h>
> > > +#include <linux/file.h>
> > > +#include <linux/kvm_host.h>
> > > +#include <linux/list.h>
> > > +#include <linux/module.h>
> > > +#include <linux/mutex.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/vfio.h>
> > > +
> > > +struct kvm_vfio_group {
> > > + struct list_head node;
> > > + struct vfio_group *vfio_group;
> > > +};
> > > +
> > > +struct kvm_vfio {
> > > + struct list_head group_list;
> > > + struct mutex lock;
> > > +};
> > > +
> > > +static struct vfio_group *kvm_vfio_group_get_external_user(struct file *filep)
> > > +{
> > > + struct vfio_group *vfio_group;
> > > + struct vfio_group *(*fn)(struct file *);
> > > +
> > > + fn = symbol_get(vfio_group_get_external_user);
> > > + if (!fn)
> > > + return ERR_PTR(-EINVAL);
> > > +
> > > + vfio_group = fn(filep);
> > > +
> > > + symbol_put(vfio_group_get_external_user);
> > > +
> > > + return vfio_group;
> > > +}
> > > +
> > > +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> > > +{
> > > + void (*fn)(struct vfio_group *);
> > > +
> > > + fn = symbol_get(vfio_group_put_external_user);
> > > + if (!fn)
> > > + return;
> > > +
> > > + fn(vfio_group);
> > > +
> > > + symbol_put(vfio_group_put_external_user);
> > > +}
> > > +
> > > +static int kvm_vfio_external_user_check_extension(struct vfio_group *vfio_group,
> > > + unsigned long arg)
> > > +{
> > > + int (*fn)(struct vfio_group *, unsigned long);
> > > + int ret;
> > > +
> > > + fn = symbol_get(vfio_external_user_check_extension);
> > > + if (!fn)
> > > + return -EINVAL;
> > > +
> > > + ret = fn(vfio_group, arg);
> > > +
> > > + symbol_put(vfio_external_user_check_extension);
> > > +
> > > + return ret;
> > > +}
> > > +
> > > +static void kvm_vfio_update_iommu_coherency(struct kvm_device *dev)
> > > +{
> > > + struct kvm_vfio *kv = dev->private;
> > > + bool coherent = true;
> > > + struct kvm_vfio_group *kvg;
> > > +
> > > + mutex_lock(&kv->lock);
> > > +
> > > + list_for_each_entry(kvg, &kv->group_list, node) {
> > > + if (!kvm_vfio_external_user_check_extension(kvg->vfio_group,
> > > + VFIO_IOMMU_CAP_CACHE_COHERENCY)) {
> > > + coherent = false;
> > > + break;
> > > + }
> > > + }
> > > +
> > > + mutex_unlock(&kv->lock);
> > > +
> > > + dev->kvm->arch.vfio_noncoherent = !coherent;
> > > +}
> > > +
> > > +static int kvm_vfio_set_attr(struct kvm_device *dev,
> > > + struct kvm_device_attr *attr)
> > > +{
> > > + struct kvm_vfio *kv = dev->private;
> > > + struct fd f;
> > > + struct vfio_group *vfio_group;
> > > + struct kvm_vfio_group *kvg;
> > > + int ret;
> > > +
> > > + switch (attr->group) {
> > > + case KVM_DEV_VFIO_ADD_GROUP:
> > > + f = fdget(attr->addr);
> > > + if (!f.file)
> > > + return -EBADF;
> > > +
> > > + vfio_group = kvm_vfio_group_get_external_user(f.file);
> > > + fdput(f);
> > > +
> > > + if (IS_ERR(vfio_group))
> > > + return PTR_ERR(vfio_group);
> > > +
> > > + mutex_lock(&kv->lock);
> > > +
> > > + list_for_each_entry(kvg, &kv->group_list, node) {
> > > + if (kvg->vfio_group == vfio_group) {
> > > + mutex_unlock(&kv->lock);
> > > + kvm_vfio_group_put_external_user(vfio_group);
> > > + return -EEXIST;
> > > + }
> > > + }
> > > +
> > > + kvg = kzalloc(sizeof(*kvg), GFP_KERNEL);
> > > + if (!kvg) {
> > > + mutex_unlock(&kv->lock);
> > > + kvm_vfio_group_put_external_user(vfio_group);
> > > + return -ENOMEM;
> > > + }
> > > +
> > > + list_add_tail(&kvg->node, &kv->group_list);
> > > + kvg->vfio_group = vfio_group;
> > > +
> > > + mutex_unlock(&kv->lock);
> > > +
> > > + kvm_vfio_update_iommu_coherency(dev);
> > > +
> > > + return 0;
> > > +
> > > + case KVM_DEV_VFIO_DEL_GROUP:
> > > + f = fdget(attr->addr);
> > > + if (!f.file)
> > > + return -EBADF;
> > > +
> > > + vfio_group = kvm_vfio_group_get_external_user(f.file);
> > > + fdput(f);
> > > +
> > > + if (IS_ERR(vfio_group))
> > > + return PTR_ERR(vfio_group);
> > > +
> > > + ret = -ENOENT;
> > > +
> > > + mutex_lock(&kv->lock);
> > > +
> > > + list_for_each_entry(kvg, &kv->group_list, node) {
> > > + if (kvg->vfio_group != vfio_group)
> > > + continue;
> > > +
> > > + list_del(&kvg->node);
> > > + kvm_vfio_group_put_external_user(kvg->vfio_group);
> > > + kfree(kvg);
> > > + ret = 0;
> > > + break;
> > > + }
> > > +
> > > + mutex_unlock(&kv->lock);
> > > + kvm_vfio_group_put_external_user(vfio_group);
> > > +
> > > + kvm_vfio_update_iommu_coherency(dev);
> > > +
> > > + return ret;
> > > + }
> > > +
> > > + return -ENXIO;
> > > +}
> > > +
> > > +static int kvm_vfio_has_attr(struct kvm_device *dev,
> > > + struct kvm_device_attr *attr)
> > > +{
> > > + switch (attr->group) {
> > > + case KVM_DEV_VFIO_ADD_GROUP:
> > > + case KVM_DEV_VFIO_DEL_GROUP:
> > > + return 0;
> > > + }
> > > +
> > > + return -ENXIO;
> > > +}
> > > +
> > > +static void kvm_vfio_destroy(struct kvm_device *dev)
> > > +{
> > > + struct kvm_vfio *kv = dev->private;
> > > + struct kvm_vfio_group *kvg, *tmp;
> > > +
> > > + list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> > > + kvm_vfio_group_put_external_user(kvg->vfio_group);
> > > + list_del(&kvg->node);
> > > + kfree(kvg);
> > > + }
> > > +
> > > + dev->kvm->arch.vfio_noncoherent = false;
> > > + kfree(kv);
> > > +}
> > > +
> > > +static int kvm_vfio_create(struct kvm_device *dev, u32 type)
> > > +{
> > > + struct kvm_device *tmp;
> > > + struct kvm_vfio *kv;
> > > +
> > > + /* Only one VFIO "device" per VM */
> > > + list_for_each_entry(tmp, &dev->kvm->devices, vm_node)
> > > + if (tmp->ops == &kvm_vfio_ops)
> > > + return -EBUSY;
> > > +
> > > + kv = kzalloc(sizeof(*kv), GFP_KERNEL);
> > > + if (!kv)
> > > + return -ENOMEM;
> > > +
> > > + INIT_LIST_HEAD(&kv->group_list);
> > > + mutex_init(&kv->lock);
> > > +
> > > + dev->private = kv;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +struct kvm_device_ops kvm_vfio_ops = {
> > > + .name = "kvm-vfio",
> > > + .create = kvm_vfio_create,
> > > + .destroy = kvm_vfio_destroy,
> > > + .set_attr = kvm_vfio_set_attr,
> > > + .has_attr = kvm_vfio_has_attr,
> > > +};
>
>

2013-09-13 15:29:32

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

On Fri, 2013-09-13 at 17:52 +0300, Michael S. Tsirkin wrote:
> On Fri, Sep 13, 2013 at 08:13:40AM -0600, Alex Williamson wrote:
> > On Fri, 2013-09-13 at 15:39 +0300, Michael S. Tsirkin wrote:
> > > On Thu, Sep 12, 2013 at 03:23:15PM -0600, Alex Williamson wrote:
> > > > So far we've succeeded at making KVM and VFIO mostly unaware of each
> > > > other, but there's an important point where that breaks down. Intel
> > > > VT-d hardware may or may not support snoop control. When snoop
> > > > control is available, intel-iommu promotes No-Snoop transactions on
> > > > PCIe to be cache coherent. That allows KVM to handle things like the
> > > > x86 WBINVD opcode as a nop. When the hardware does not support this,
> > > > KVM must implement a hardware visible WBINVD for the guest.
> > > >
> > > > We could simply let userspace tell KVM how to handle WBINVD, but it's
> > > > privileged for a reason. Allowing an arbitrary user to enable
> > > > physical WBINVD gives them more access to the hardware. Previously,
> > > > this has only been enabled for guests supporting legacy PCI device
> > > > assignment. In such cases it's necessary for proper guest execution.
> > > > We therefore create a new KVM-VFIO virtual device. The user can add
> > > > and remove VFIO groups to this device via file descriptors. KVM
> > > > makes use of the VFIO external user interface to validate that the
> > > > user has access to physical hardware and gets the coherency state of
> > > > the IOMMU from VFIO. This provides equivalent functionality to
> > > > legacy KVM assignment, while keeping (nearly) all the bits isolated.
> > >
> > >
> > > So how is the isolation handled then?
> >
> > By isolation above, I'm only talking about the intrusiveness of the
> > code and "leaking" vfio knowledge into KVM. Nothing to do with device
> > isolation.
> >
> > > How is this better than an ioctl to grant WBINVD to the guest?
> > > The kvm char device can be opened by any user,
> > > so any user can grant itself these privileges.
> > > What did I miss?
> >
> > With this interface the caller must have physical access to one or more
> > devices via vfio groups and each of those groups must be configured into
> > one or more IOMMU domains. Furthermore, at least one of the IOMMU
> > domains must not include the IOMMU_CAP_CACHE_COHERENCY capability. So
> > it's actually a significantly higher hurdle than an ioctl open to
> > anyone and the number of VMs on a given host capable of doing this is
> > bound by the number of IOMMU groups. We do not however verify that a
> > vfio device is actually in use by the VM, but I don't think there's a
> > way to do that from KVM and I'm not sure that it's important to do so.
> > I believe having access to physical hardware is already a sufficient
> > granting of privilege to enable things like WBINVD. Thanks,
> >
> > Alex
>
> Fair enough, but how about revoking the privilege?
> For example, a device can be removed by hotplug - does
> the privilege remain?
> Is this important at all?

That's where the reference comes in through the vfio external user
interface. KVM holds a reference to the group, which maintains IOMMU
protection for that group. Therefore there's no revocation; KVM releases
the reference either at the direction of the user, on closing of the KVM
device fd, or on teardown of the KVM VM.
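
As a concrete illustration, here is a minimal sketch of how userspace
would hand a group to KVM and establish that reference. It assumes the
ioctl numbers proposed in this RFC, a vm_fd from KVM_CREATE_VM, and a
group_fd from opening /dev/vfio/$GROUP; error handling is trimmed.

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int kvm_vfio_add_group(int vm_fd, int group_fd)
{
	struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VFIO };
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_ADD_GROUP,
		.addr = (__u64)group_fd,	/* the fd value itself, not a pointer */
	};

	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0)
		return -1;

	/* KVM now takes its own reference on the vfio group */
	if (ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr) < 0)
		return -1;

	/*
	 * The reference is dropped by KVM_DEV_VFIO_DEL_GROUP, by closing
	 * cd.fd, or when the VM itself is torn down.
	 */
	return cd.fd;
}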

Unused devices can be hot-removed from the group; used devices will be
blocked by the user's reference to the device. Hot-added devices will
join the group and be added to the IOMMU domain. If a non-vfio driver
is attached to a member of the group, we'll BUG_ON, just like we do for
the normal user case. Eventually this needs to be converted to tracking
and killing user processes, but that will have the effect of killing the
VM and closing all the file descriptors, automatically cleaning up the
KVM-held group reference.

I suppose the corner case you might be looking for is the scenario where
the user enables WBINVD via a group and never uses the device. The
device is then hot-removed. Should the user still have WBINVD?
Probably not. Is it worth worrying about? Probably not. The user
still has privileges to the IOMMU domain, even though it has no devices.
If the device is re-added, it will go back into the same IOMMU domain,
so there's no opportunity for another user to gain privilege. If we
wanted to track such a situation we'd need to introduce some mechanism
for KVM to be notified of group changes to re-evaluate the coherency
flags. That all seems like negotiation between KVM and vfio in the
external user interface that isn't exposed to the user, should we decide
to handle it. Thanks,

Alex

> > > > The one intrusion is the resulting flag indicating the coherency
> > > > state. For this RFC it's placed on the x86 kvm_arch struct, however
> > > > I know POWER has interest in using the VFIO external user interface,
> > > > and I'm hoping we can share a common KVM-VFIO device. Perhaps they
> > > > care about No-Snoop handling as well or the code can be #ifdef'd.
> > > >
> > > > Signed-off-by: Alex Williamson <[email protected]>
> > > > ---
> > > > Documentation/virtual/kvm/devices/vfio.txt | 22 +++
> > > > arch/x86/include/asm/kvm_host.h | 1
> > > > arch/x86/kvm/Makefile | 2
> > > > arch/x86/kvm/vmx.c | 5 -
> > > > arch/x86/kvm/x86.c | 5 -
> > > > include/linux/kvm_host.h | 1
> > > > include/uapi/linux/kvm.h | 4
> > > > virt/kvm/kvm_main.c | 3
> > > > virt/kvm/vfio.c | 237 ++++++++++++++++++++++++++++
> > > > 9 files changed, 275 insertions(+), 5 deletions(-)
> > > > create mode 100644 Documentation/virtual/kvm/devices/vfio.txt
> > > > create mode 100644 virt/kvm/vfio.c
> > > >
> > > > diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> > > > new file mode 100644
> > > > index 0000000..831e6a6
> > > > --- /dev/null
> > > > +++ b/Documentation/virtual/kvm/devices/vfio.txt
> > > > @@ -0,0 +1,22 @@
> > > > +VFIO virtual device
> > > > +===================
> > > > +
> > > > +Device types supported:
> > > > + KVM_DEV_TYPE_VFIO
> > > > +
> > > > +Only one VFIO instance may be created per VM. The created device
> > > > +tracks VFIO groups in use by the VM and features of those groups
> > > > +important to the correctness and acceleration of the VM. As groups
> > > > +are enabled and disabled for use by the VM, KVM should be updated
> > > > +about their presence. When registered with KVM, a reference to the
> > > > +VFIO-group is held by KVM.
> > > > +
> > > > +Groups:
> > > > + KVM_DEV_VFIO_ADD_GROUP
> > > > + KVM_DEV_VFIO_DEL_GROUP
> > > > +
> > > > +Each takes an int32_t file descriptor for kvm_device_attr.addr and
> > > > +does not support any group device kvm_device_attr.attr.
> > > > +
> > > > +RFC - Should we use Group KVM_DEV_VFIO_GROUP with Attributes
> > > > + KVM_DEV_VFIO_GROUP_ADD & DEL?
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index c76ff74..5b9350d 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -588,6 +588,7 @@ struct kvm_arch {
> > > >
> > > > spinlock_t pvclock_gtod_sync_lock;
> > > > bool use_master_clock;
> > > > + bool vfio_noncoherent;
> > > > u64 master_kernel_ns;
> > > > cycle_t master_cycle_now;
> > > >
> > > > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > > > index bf4fb04..25d22b2 100644
> > > > --- a/arch/x86/kvm/Makefile
> > > > +++ b/arch/x86/kvm/Makefile
> > > > @@ -9,7 +9,7 @@ KVM := ../../../virt/kvm
> > > >
> > > > kvm-y += $(KVM)/kvm_main.o $(KVM)/ioapic.o \
> > > > $(KVM)/coalesced_mmio.o $(KVM)/irq_comm.o \
> > > > - $(KVM)/eventfd.o $(KVM)/irqchip.o
> > > > + $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> > > > kvm-$(CONFIG_KVM_DEVICE_ASSIGNMENT) += $(KVM)/assigned-dev.o $(KVM)/iommu.o
> > > > kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
> > > >
> > > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > > > index 1f1da43..94f7786 100644
> > > > --- a/arch/x86/kvm/vmx.c
> > > > +++ b/arch/x86/kvm/vmx.c
> > > > @@ -7395,8 +7395,9 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > > > */
> > > > if (is_mmio)
> > > > ret = MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> > > > - else if (vcpu->kvm->arch.iommu_domain &&
> > > > - !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY))
> > > > + else if (vcpu->kvm->arch.vfio_noncoherent ||
> > > > + vcpu->kvm->arch.iommu_domain &&
> > > > + !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY))
> > > > ret = kvm_get_guest_memory_type(vcpu, gfn) <<
> > > > VMX_EPT_MT_EPTE_SHIFT;
> > > > else
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index e5ca72a..406ba6f 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -2715,8 +2715,9 @@ static void wbinvd_ipi(void *garbage)
> > > >
> > > > static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
> > > > {
> > > > - return vcpu->kvm->arch.iommu_domain &&
> > > > - !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY);
> > > > + return vcpu->kvm->arch.vfio_noncoherent ||
> > > > + (vcpu->kvm->arch.iommu_domain &&
> > > > + !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY));
> > > > }
> > > >
> > > > void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index ca645a0..615f0c3 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -1065,6 +1065,7 @@ struct kvm_device *kvm_device_from_filp(struct file *filp);
> > > >
> > > > extern struct kvm_device_ops kvm_mpic_ops;
> > > > extern struct kvm_device_ops kvm_xics_ops;
> > > > +extern struct kvm_device_ops kvm_vfio_ops;
> > > >
> > > > #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
> > > >
> > > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > > index 99c2533..8869616 100644
> > > > --- a/include/uapi/linux/kvm.h
> > > > +++ b/include/uapi/linux/kvm.h
> > > > @@ -843,6 +843,10 @@ struct kvm_device_attr {
> > > > #define KVM_DEV_TYPE_FSL_MPIC_20 1
> > > > #define KVM_DEV_TYPE_FSL_MPIC_42 2
> > > > #define KVM_DEV_TYPE_XICS 3
> > > > +#define KVM_DEV_TYPE_VFIO 4
> > > > +
> > > > +#define KVM_DEV_VFIO_ADD_GROUP 1
> > > > +#define KVM_DEV_VFIO_DEL_GROUP 2
> > > >
> > > > /*
> > > > * ioctls for VM fds
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index d9cad4d..1a20425 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -2269,6 +2269,9 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
> > > > ops = &kvm_xics_ops;
> > > > break;
> > > > #endif
> > > > + case KVM_DEV_TYPE_VFIO:
> > > > + ops = &kvm_vfio_ops;
> > > > + break;
> > > > default:
> > > > return -ENODEV;
> > > > }
> > > > diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> > > > new file mode 100644
> > > > index 0000000..9a2faff
> > > > --- /dev/null
> > > > +++ b/virt/kvm/vfio.c
> > > > @@ -0,0 +1,237 @@
> > > > +/*
> > > > + * VFIO bridge
> > > > + *
> > > > + * Copyright (C) 2013 Red Hat, Inc. All rights reserved.
> > > > + * Author: Alex Williamson <[email protected]>
> > > > + *
> > > > + * This program is free software; you can redistribute it and/or modify
> > > > + * it under the terms of the GNU General Public License version 2 as
> > > > + * published by the Free Software Foundation.
> > > > + */
> > > > +
> > > > +#include <linux/errno.h>
> > > > +#include <linux/file.h>
> > > > +#include <linux/kvm_host.h>
> > > > +#include <linux/list.h>
> > > > +#include <linux/module.h>
> > > > +#include <linux/mutex.h>
> > > > +#include <linux/slab.h>
> > > > +#include <linux/vfio.h>
> > > > +
> > > > +struct kvm_vfio_group {
> > > > + struct list_head node;
> > > > + struct vfio_group *vfio_group;
> > > > +};
> > > > +
> > > > +struct kvm_vfio {
> > > > + struct list_head group_list;
> > > > + struct mutex lock;
> > > > +};
> > > > +
> > > > +static struct vfio_group *kvm_vfio_group_get_external_user(struct file *filep)
> > > > +{
> > > > + struct vfio_group *vfio_group;
> > > > + struct vfio_group *(*fn)(struct file *);
> > > > +
> > > > + fn = symbol_get(vfio_group_get_external_user);
> > > > + if (!fn)
> > > > + return ERR_PTR(-EINVAL);
> > > > +
> > > > + vfio_group = fn(filep);
> > > > +
> > > > + symbol_put(vfio_group_get_external_user);
> > > > +
> > > > + return vfio_group;
> > > > +}
> > > > +
> > > > +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> > > > +{
> > > > + void (*fn)(struct vfio_group *);
> > > > +
> > > > + fn = symbol_get(vfio_group_put_external_user);
> > > > + if (!fn)
> > > > + return;
> > > > +
> > > > + fn(vfio_group);
> > > > +
> > > > + symbol_put(vfio_group_put_external_user);
> > > > +}
> > > > +
> > > > +static int kvm_vfio_external_user_check_extension(struct vfio_group *vfio_group,
> > > > + unsigned long arg)
> > > > +{
> > > > + int (*fn)(struct vfio_group *, unsigned long);
> > > > + int ret;
> > > > +
> > > > + fn = symbol_get(vfio_external_user_check_extension);
> > > > + if (!fn)
> > > > + return -EINVAL;
> > > > +
> > > > + ret = fn(vfio_group, arg);
> > > > +
> > > > + symbol_put(vfio_external_user_check_extension);
> > > > +
> > > > + return ret;
> > > > +}
> > > > +
> > > > +static void kvm_vfio_update_iommu_coherency(struct kvm_device *dev)
> > > > +{
> > > > + struct kvm_vfio *kv = dev->private;
> > > > + bool coherent = true;
> > > > + struct kvm_vfio_group *kvg;
> > > > +
> > > > + mutex_lock(&kv->lock);
> > > > +
> > > > + list_for_each_entry(kvg, &kv->group_list, node) {
> > > > + if (!kvm_vfio_external_user_check_extension(kvg->vfio_group,
> > > > + VFIO_IOMMU_CAP_CACHE_COHERENCY)) {
> > > > + coherent = false;
> > > > + break;
> > > > + }
> > > > + }
> > > > +
> > > > + mutex_unlock(&kv->lock);
> > > > +
> > > > + dev->kvm->arch.vfio_noncoherent = !coherent;
> > > > +}
> > > > +
> > > > +static int kvm_vfio_set_attr(struct kvm_device *dev,
> > > > + struct kvm_device_attr *attr)
> > > > +{
> > > > + struct kvm_vfio *kv = dev->private;
> > > > + struct fd f;
> > > > + struct vfio_group *vfio_group;
> > > > + struct kvm_vfio_group *kvg;
> > > > + int ret;
> > > > +
> > > > + switch (attr->group) {
> > > > + case KVM_DEV_VFIO_ADD_GROUP:
> > > > + f = fdget(attr->addr);
> > > > + if (!f.file)
> > > > + return -EBADF;
> > > > +
> > > > + vfio_group = kvm_vfio_group_get_external_user(f.file);
> > > > + fdput(f);
> > > > +
> > > > + if (IS_ERR(vfio_group))
> > > > + return PTR_ERR(vfio_group);
> > > > +
> > > > + mutex_lock(&kv->lock);
> > > > +
> > > > + list_for_each_entry(kvg, &kv->group_list, node) {
> > > > + if (kvg->vfio_group == vfio_group) {
> > > > + mutex_unlock(&kv->lock);
> > > > + kvm_vfio_group_put_external_user(vfio_group);
> > > > + return -EEXIST;
> > > > + }
> > > > + }
> > > > +
> > > > + kvg = kzalloc(sizeof(*kvg), GFP_KERNEL);
> > > > + if (!kvg) {
> > > > + mutex_unlock(&kv->lock);
> > > > + kvm_vfio_group_put_external_user(vfio_group);
> > > > + return -ENOMEM;
> > > > + }
> > > > +
> > > > + list_add_tail(&kvg->node, &kv->group_list);
> > > > + kvg->vfio_group = vfio_group;
> > > > +
> > > > + mutex_unlock(&kv->lock);
> > > > +
> > > > + kvm_vfio_update_iommu_coherency(dev);
> > > > +
> > > > + return 0;
> > > > +
> > > > + case KVM_DEV_VFIO_DEL_GROUP:
> > > > + f = fdget(attr->addr);
> > > > + if (!f.file)
> > > > + return -EBADF;
> > > > +
> > > > + vfio_group = kvm_vfio_group_get_external_user(f.file);
> > > > + fdput(f);
> > > > +
> > > > + if (IS_ERR(vfio_group))
> > > > + return PTR_ERR(vfio_group);
> > > > +
> > > > + ret = -ENOENT;
> > > > +
> > > > + mutex_lock(&kv->lock);
> > > > +
> > > > + list_for_each_entry(kvg, &kv->group_list, node) {
> > > > + if (kvg->vfio_group != vfio_group)
> > > > + continue;
> > > > +
> > > > + list_del(&kvg->node);
> > > > + kvm_vfio_group_put_external_user(kvg->vfio_group);
> > > > + kfree(kvg);
> > > > + ret = 0;
> > > > + break;
> > > > + }
> > > > +
> > > > + mutex_unlock(&kv->lock);
> > > > + kvm_vfio_group_put_external_user(vfio_group);
> > > > +
> > > > + kvm_vfio_update_iommu_coherency(dev);
> > > > +
> > > > + return ret;
> > > > + }
> > > > +
> > > > + return -ENXIO;
> > > > +}
> > > > +
> > > > +static int kvm_vfio_has_attr(struct kvm_device *dev,
> > > > + struct kvm_device_attr *attr)
> > > > +{
> > > > + switch (attr->group) {
> > > > + case KVM_DEV_VFIO_ADD_GROUP:
> > > > + case KVM_DEV_VFIO_DEL_GROUP:
> > > > + return 0;
> > > > + }
> > > > +
> > > > + return -ENXIO;
> > > > +}
> > > > +
> > > > +static void kvm_vfio_destroy(struct kvm_device *dev)
> > > > +{
> > > > + struct kvm_vfio *kv = dev->private;
> > > > + struct kvm_vfio_group *kvg, *tmp;
> > > > +
> > > > + list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> > > > + kvm_vfio_group_put_external_user(kvg->vfio_group);
> > > > + list_del(&kvg->node);
> > > > + kfree(kvg);
> > > > + }
> > > > +
> > > > + dev->kvm->arch.vfio_noncoherent = false;
> > > > + kfree(kv);
> > > > +}
> > > > +
> > > > +static int kvm_vfio_create(struct kvm_device *dev, u32 type)
> > > > +{
> > > > + struct kvm_device *tmp;
> > > > + struct kvm_vfio *kv;
> > > > +
> > > > + /* Only one VFIO "device" per VM */
> > > > + list_for_each_entry(tmp, &dev->kvm->devices, vm_node)
> > > > + if (tmp->ops == &kvm_vfio_ops)
> > > > + return -EBUSY;
> > > > +
> > > > + kv = kzalloc(sizeof(*kv), GFP_KERNEL);
> > > > + if (!kv)
> > > > + return -ENOMEM;
> > > > +
> > > > + INIT_LIST_HEAD(&kv->group_list);
> > > > + mutex_init(&kv->lock);
> > > > +
> > > > + dev->private = kv;
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +struct kvm_device_ops kvm_vfio_ops = {
> > > > + .name = "kvm-vfio",
> > > > + .create = kvm_vfio_create,
> > > > + .destroy = kvm_vfio_destroy,
> > > > + .set_attr = kvm_vfio_set_attr,
> > > > + .has_attr = kvm_vfio_has_attr,
> > > > +};
> >
> >


2013-09-13 16:25:51

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

On Fri, 2013-09-13 at 18:49 +1000, Alexey Kardashevskiy wrote:
> On 09/13/2013 07:23 AM, Alex Williamson wrote:
> > So far we've succeeded at making KVM and VFIO mostly unaware of each
> > other, but there's an important point where that breaks down. Intel
> > VT-d hardware may or may not support snoop control. When snoop
> > control is available, intel-iommu promotes No-Snoop transactions on
> > PCIe to be cache coherent. That allows KVM to handle things like the
> > x86 WBINVD opcode as a nop. When the hardware does not support this,
> > KVM must implement a hardware visible WBINVD for the guest.
> >
> > We could simply let userspace tell KVM how to handle WBINVD, but it's
> > privileged for a reason. Allowing an arbitrary user to enable
> > physical WBINVD gives them more access to the hardware. Previously,
> > this has only been enabled for guests supporting legacy PCI device
> > assignment. In such cases it's necessary for proper guest execution.
> > We therefore create a new KVM-VFIO virtual device. The user can add
> > and remove VFIO groups to this device via file descriptors. KVM
> > makes use of the VFIO external user interface to validate that the
> > user has access to physical hardware and gets the coherency state of
> > the IOMMU from VFIO. This provides equivalent functionality to
> > legacy KVM assignment, while keeping (nearly) all the bits isolated.
> >
> > The one intrusion is the resulting flag indicating the coherency
> > state. For this RFC it's placed on the x86 kvm_arch struct, however
> > I know POWER has interest in using the VFIO external user interface,
> > and I'm hoping we can share a common KVM-VFIO device. Perhaps they
> > care about No-Snoop handling as well or the code can be #ifdef'd.
>
>
> POWER does not support (at least book3s - "server", not sure about others)
> this cache-non-coherent stuff at all.

Then it's easy for your IOMMU API interface to return always cache
coherent or never cache coherent or whatever ;)

> Regarding reusing this device with external API for POWER - I posted a
> patch which introduces KVM device to link KVM with IOMMU but besides the
> list of groups registered in KVM, it also provides the way to find a group
> by LIOBN (logical bus number) which is used in DMA map/unmap hypercalls. So
> in my case kvm_vfio_group struct needs LIOBN and it would be nice to have
> there window_size too (for a quick boundary check). I am not sure we want
> to mix everything here.
>
> It is in "[PATCH v10 12/13] KVM: PPC: Add support for IOMMU in-kernel
> handling" if you are interested (kvmppc_spapr_tce_iommu_device).

Yes, I stole the code to get the vfio symbols from your code. The
convergence I was hoping to achieve is that KVM doesn't really want to
know about VFIO and vice versa. We can therefore at least limit the
intrusion by sharing a common device. Obviously for you it will need
some extra interfaces to associate an LIOBN with a group, but we keep both
the kernel and userspace cleaner by avoiding duplication where we can.
Is this really not extensible to your usage? Thanks,

Alex

2013-09-15 12:40:42

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

On 09/14/2013 02:25 AM, Alex Williamson wrote:
> On Fri, 2013-09-13 at 18:49 +1000, Alexey Kardashevskiy wrote:
>> On 09/13/2013 07:23 AM, Alex Williamson wrote:
>>> So far we've succeeded at making KVM and VFIO mostly unaware of each
>>> other, but there's an important point where that breaks down. Intel
>>> VT-d hardware may or may not support snoop control. When snoop
>>> control is available, intel-iommu promotes No-Snoop transactions on
>>> PCIe to be cache coherent. That allows KVM to handle things like the
>>> x86 WBINVD opcode as a nop. When the hardware does not support this,
>>> KVM must implement a hardware visible WBINVD for the guest.
>>>
>>> We could simply let userspace tell KVM how to handle WBINVD, but it's
>>> privileged for a reason. Allowing an arbitrary user to enable
>>> physical WBINVD gives them more access to the hardware. Previously,
>>> this has only been enabled for guests supporting legacy PCI device
>>> assignment. In such cases it's necessary for proper guest execution.
>>> We therefore create a new KVM-VFIO virtual device. The user can add
>>> and remove VFIO groups to this device via file descriptors. KVM
>>> makes use of the VFIO external user interface to validate that the
>>> user has access to physical hardware and gets the coherency state of
>>> the IOMMU from VFIO. This provides equivalent functionality to
>>> legacy KVM assignment, while keeping (nearly) all the bits isolated.
>>>
>>> The one intrusion is the resulting flag indicating the coherency
>>> state. For this RFC it's placed on the x86 kvm_arch struct, however
>>> I know POWER has interest in using the VFIO external user interface,
>>> and I'm hoping we can share a common KVM-VFIO device. Perhaps they
>>> care about No-Snoop handling as well or the code can be #ifdef'd.
>>
>>
>> POWER does not support (at least book3s - "server", not sure about others)
>> this cache-non-coherent stuff at all.
>
> Then it's easy for your IOMMU API interface to return always cache
> coherent or never cache coherent or whatever ;)
>
>> Regarding reusing this device with external API for POWER - I posted a
>> patch which introduces KVM device to link KVM with IOMMU but besides the
>> list of groups registered in KVM, it also provides the way to find a group
>> by LIOBN (logical bus number) which is used in DMA map/unmap hypercalls. So
>> in my case kvm_vfio_group struct needs LIOBN and it would be nice to have
>> there window_size too (for a quick boundary check). I am not sure we want
>> to mix everything here.
>>
>> It is in "[PATCH v10 12/13] KVM: PPC: Add support for IOMMU in-kernel
>> handling" if you are interested (kvmppc_spapr_tce_iommu_device).
>
> Yes, I stole the code to get the vfio symbols from your code. The
> convergence I was hoping to achieve is that KVM doesn't really want to
> know about VFIO and vice versa. We can therefore at least limit the
> intrusion by sharing a common device. Obviously for you it will need
> some extra interfaces to associate an LIOBN with a group, but we keep both
> the kernel and userspace cleaner by avoiding duplication where we can.
> Is this really not extensible to your usage?

Well, I do not have a good taste for this kind of stuff so I cannot tell
for sure. I can reuse this device and hack it to do whatever I need but how?

There are 2 things I need from the KVM device:
1. associate an IOMMU group with a LIOBN
2. get a pointer to an IOMMU group by LIOBN (used to get ppc's IOMMU table
pointer, which is the IOMMU data of an IOMMU group, so we could take a
shortcut here).
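
To make that concrete, here is a purely hypothetical sketch of how the
existing kvm_vfio_group could carry a LIOBN and be looked up by it. The
field and the helper are invented for illustration and are not part of
this RFC; note also that a real-mode caller could not take the mutex as
written.

struct kvm_vfio_group {
	struct list_head node;
	struct vfio_group *vfio_group;
	u64 liobn;		/* set via an additional, to-be-defined attribute */
};

/* kernel-internal lookup, e.g. for the DMA map/unmap hypercall path */
static struct vfio_group *kvm_vfio_find_group_by_liobn(struct kvm_vfio *kv,
							u64 liobn)
{
	struct kvm_vfio_group *kvg;
	struct vfio_group *group = NULL;

	mutex_lock(&kv->lock);
	list_for_each_entry(kvg, &kv->group_list, node) {
		if (kvg->liobn == liobn) {
			group = kvg->vfio_group;
			break;
		}
	}
	mutex_unlock(&kv->lock);

	return group;
}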

There are 3 possible interfaces to use:
A. set/get attributes
B. ioctl
C. additional API

What I posted a week ago uses A for 1 and C for 2. Now we move this to
virt/kvm/vfio.c.
A for 1 is more or less ok but how exactly? Yet another attribute? Platform
specific "bus ID"? In your patch attr->addr is not really an address (to be
accessed with get_user()) but just an fd.

For 2 - there are already A and B interfaces so we do not want C, right?
What kind of a good device has a backdoor? :) But A and B do not have
access control to prevent the user space from receiving an IOMMU group
pointer (which it would not be able to use anyway but still). Do we care
about this (I do not)? And using B (ioctls) within the kernel - I cannot
say I saw many usages of this.

I am pretty sure we will spend some time (not much) arguing about these
things, and when we agree on something, some of the KVM maintainers will
step in and say that there is a 1001st way of doing this and - goto start.
And after all, this still won't be a device as it does not emulate anything
present in the real hardware; this is just yet another interface to KVM, and
some ways of using it won't be natural for somebody. /me sighs.



--
Alexey

2013-09-25 20:19:16

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

On Sun, 2013-09-15 at 22:40 +1000, Alexey Kardashevskiy wrote:
> On 09/14/2013 02:25 AM, Alex Williamson wrote:
> > On Fri, 2013-09-13 at 18:49 +1000, Alexey Kardashevskiy wrote:
> >> On 09/13/2013 07:23 AM, Alex Williamson wrote:
> >>> So far we've succeeded at making KVM and VFIO mostly unaware of each
> >>> other, but there's an important point where that breaks down. Intel
> >>> VT-d hardware may or may not support snoop control. When snoop
> >>> control is available, intel-iommu promotes No-Snoop transactions on
> >>> PCIe to be cache coherent. That allows KVM to handle things like the
> >>> x86 WBINVD opcode as a nop. When the hardware does not support this,
> >>> KVM must implement a hardware visible WBINVD for the guest.
> >>>
> >>> We could simply let userspace tell KVM how to handle WBINVD, but it's
> >>> privileged for a reason. Allowing an arbitrary user to enable
> >>> physical WBINVD gives them more access to the hardware. Previously,
> >>> this has only been enabled for guests supporting legacy PCI device
> >>> assignment. In such cases it's necessary for proper guest execution.
> >>> We therefore create a new KVM-VFIO virtual device. The user can add
> >>> and remove VFIO groups to this device via file descriptors. KVM
> >>> makes use of the VFIO external user interface to validate that the
> >>> user has access to physical hardware and gets the coherency state of
> >>> the IOMMU from VFIO. This provides equivalent functionality to
> >>> legacy KVM assignment, while keeping (nearly) all the bits isolated.
> >>>
> >>> The one intrusion is the resulting flag indicating the coherency
> >>> state. For this RFC it's placed on the x86 kvm_arch struct, however
> >>> I know POWER has interest in using the VFIO external user interface,
> >>> and I'm hoping we can share a common KVM-VFIO device. Perhaps they
> >>> care about No-Snoop handling as well or the code can be #ifdef'd.
> >>
> >>
> >> POWER does not support (at least book3s - "server", not sure about others)
> >> this cache-non-coherent stuff at all.
> >
> > Then it's easy for your IOMMU API interface to return always cache
> > coherent or never cache coherent or whatever ;)
> >
> >> Regarding reusing this device with external API for POWER - I posted a
> >> patch which introduces KVM device to link KVM with IOMMU but besides the
> >> list of groups registered in KVM, it also provides the way to find a group
> >> by LIOBN (logical bus number) which is used in DMA map/unmap hypercalls. So
> >> in my case kvm_vfio_group struct needs LIOBN and it would be nice to have
> >> there window_size too (for a quick boundary check). I am not sure we want
> >> to mix everything here.
> >>
> >> It is in "[PATCH v10 12/13] KVM: PPC: Add support for IOMMU in-kernel
> >> handling" if you are interested (kvmppc_spapr_tce_iommu_device).
> >
> > Yes, I stole the code to get the vfio symbols from your code. The
> > convergence I was hoping to achieve is that KVM doesn't really want to
> > know about VFIO and vice versa. We can therefore at least limit the
> > intrusion by sharing a common device. Obviously for you it will need
> > some extra interfaces to associate an LIOBN with a group, but we keep both
> > the kernel and userspace cleaner by avoiding duplication where we can.
> > Is this really not extensible to your usage?
>
> Well, I do not have a good taste for this kind of stuff so I cannot tell
> for sure. I can reuse this device and hack it to do whatever I need but how?
>
> There are 2 things I need from KVM device:
> 1. associate IOMMU group with LIOBN

Does QEMU know this? We could set another attribute to specify the
LIOBN for a group.
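
Purely as a strawman, such an attribute might be used as sketched below.
Nothing in it exists in this RFC: the attribute number, the struct, and
its fields are all invented for illustration.

#include <sys/ioctl.h>
#include <linux/kvm.h>

#define KVM_DEV_VFIO_GROUP_SET_LIOBN	3	/* invented; not in this RFC */

struct kvm_vfio_spapr_tce_liobn {		/* invented; not in this RFC */
	__s32	groupfd;	/* fd previously passed to KVM_DEV_VFIO_ADD_GROUP */
	__u32	liobn;
	__u64	window_size;	/* for the quick boundary check mentioned above */
};

static int kvm_vfio_set_liobn(int kvm_vfio_dev_fd, int group_fd,
			      __u32 liobn, __u64 window_size)
{
	struct kvm_vfio_spapr_tce_liobn args = {
		.groupfd = group_fd,
		.liobn = liobn,
		.window_size = window_size,
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP_SET_LIOBN,
		.addr = (__u64)(unsigned long)&args,	/* here addr is a real pointer */
	};

	return ioctl(kvm_vfio_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}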

> 2. get a pointer to an IOMMU group by LIOBN (used to get ppc's IOMMU table
> pointer which is an IOMMU data of an IOMMU group so we could take a
> shortcut here).

Once we have a VFIO device with a VFIO group added and the LIOBN
attribute set, isn't this just a matter of some access code?

> There are 3 possible interfaces to use:
> A. set/get attributes
> B. ioctl
> C. additional API

I think we need to differentiate user interfaces from kernel interfaces.
Within the kernel, we make no guarantees about interfaces and we can
change them to meet our needs. That includes the vfio external user
interface. For userspace, we need to be careful not to break things. I
suggest we use the set/get/has attribute interface that's already part
of KVM for the user interface.

> What I posted a week ago uses A for 1 and C for 2. Now we move this to
> virt/kvm/vfio.c.

I don't care where it lives, other than we both have a need for it, so
it seems like the core of it should not live in architecture specific
directories.

> A for 1 is more or less ok but how exactly? Yet another attribute? Platform
> specific "bus ID"? In your patch attr->addr is not really an address (to be
> accessed with get_user()) but just an fd.

Is that a problem?

> For 2 - there are already A and B interfaces so we do not want C, right?
> What kind of a good device has a backdoor? :) But A and B do not have
> access control to prevent the user space from receiving an IOMMU group
> pointer (which it would not be able to use anyway but still). Do we care
> about this (I do not)? And using B (ioctls) within the kernel - I cannot
> say I saw many usages of this.

For kernel internal things we don't want to invent new ioctls or use the
KVM device get_attr interface. Note that I didn't provide a get_attr
interface for anything there now. I think we just want to create a
kernel internal interface, ie. function calls.

> I am pretty sure we will spend some time (not much) arguing about these
> things and when we agree on something, then some of KVM maintainers will
> step in and say that there is 1001st way of doing this and - goto start.
> And after all, this still won't be a device as it does not emulate anything
> present in the real hardware, this is just yet another interface to KVM and
> some ways of using it won't be natural for somebody. /me sighs.

And thus this RFC with somewhat rough code to try to get buyin. Thanks,

Alex

2013-09-26 04:31:35

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

On 09/26/2013 06:19 AM, Alex Williamson wrote:
> On Sun, 2013-09-15 at 22:40 +1000, Alexey Kardashevskiy wrote:
>> On 09/14/2013 02:25 AM, Alex Williamson wrote:
>>> On Fri, 2013-09-13 at 18:49 +1000, Alexey Kardashevskiy wrote:
>>>> On 09/13/2013 07:23 AM, Alex Williamson wrote:
>>>>> So far we've succeeded at making KVM and VFIO mostly unaware of each
>>>>> other, but there's an important point where that breaks down. Intel
>>>>> VT-d hardware may or may not support snoop control. When snoop
>>>>> control is available, intel-iommu promotes No-Snoop transactions on
>>>>> PCIe to be cache coherent. That allows KVM to handle things like the
>>>>> x86 WBINVD opcode as a nop. When the hardware does not support this,
>>>>> KVM must implement a hardware visible WBINVD for the guest.
>>>>>
>>>>> We could simply let userspace tell KVM how to handle WBINVD, but it's
>>>>> privileged for a reason. Allowing an arbitrary user to enable
>>>>> physical WBINVD gives them more access to the hardware. Previously,
>>>>> this has only been enabled for guests supporting legacy PCI device
>>>>> assignment. In such cases it's necessary for proper guest execution.
>>>>> We therefore create a new KVM-VFIO virtual device. The user can add
>>>>> and remove VFIO groups to this device via file descriptors. KVM
>>>>> makes use of the VFIO external user interface to validate that the
>>>>> user has access to physical hardware and gets the coherency state of
>>>>> the IOMMU from VFIO. This provides equivalent functionality to
>>>>> legacy KVM assignment, while keeping (nearly) all the bits isolated.
>>>>>
>>>>> The one intrusion is the resulting flag indicating the coherency
>>>>> state. For this RFC it's placed on the x86 kvm_arch struct, however
>>>>> I know POWER has interest in using the VFIO external user interface,
>>>>> and I'm hoping we can share a common KVM-VFIO device. Perhaps they
>>>>> care about No-Snoop handling as well or the code can be #ifdef'd.
>>>>
>>>>
>>>> POWER does not support (at least book3s - "server", not sure about others)
>>>> this cache-non-coherent stuff at all.
>>>
>>> Then it's easy for your IOMMU API interface to return always cache
>>> coherent or never cache coherent or whatever ;)
>>>
>>>> Regarding reusing this device with external API for POWER - I posted a
>>>> patch which introduces KVM device to link KVM with IOMMU but besides the
>>>> list of groups registered in KVM, it also provides the way to find a group
>>>> by LIOBN (logical bus number) which is used in DMA map/unmap hypercalls. So
>>>> in my case kvm_vfio_group struct needs LIOBN and it would be nice to have
>>>> there window_size too (for a quick boundary check). I am not sure we want
>>>> to mix everything here.
>>>>
>>>> It is in "[PATCH v10 12/13] KVM: PPC: Add support for IOMMU in-kernel
>>>> handling" if you are interested (kvmppc_spapr_tce_iommu_device).
>>>
>>> Yes, I stole the code to get the vfio symbols from your code. The
>>> convergence I was hoping to achieve is that KVM doesn't really want to
>>> know about VFIO and vice versa. We can therefore at least limit the
>>> intrusion by sharing a common device. Obviously for you it will need
>>> some extra interfaces to associate an LIOBN with a group, but we keep both
>>> the kernel and userspace cleaner by avoiding duplication where we can.
>>> Is this really not extensible to your usage?
>>
>> Well, I do not have a good taste for this kind of stuff so I cannot tell
>> for sure. I can reuse this device and hack it to do whatever I need but how?
>>
>> There are 2 things I need from KVM device:
>> 1. associate IOMMU group with LIOBN
>
> Does QEMU know this? We could set another attribute to specify the
> LIOBN for a group.

QEMU knows, as QEMU decides on the LIOBN number. And yes, we could do that.


>> 2. get a pointer to an IOMMU group by LIOBN (used to get ppc's IOMMU table
>> pointer which is an IOMMU data of an IOMMU group so we could take a
>> shortcut here).
>
> Once we have a VFIO device with a VFIO group added and the LIOBN
> attribute set, isn't this just a matter of some access code?


The lookup function will be called from what we call real mode on PPC64,
i.e. when the MMU is off and kvm.ko code is not available. So we will either
have to put this lookup function into a separate virt/kvm/vfio_rm.c or
compile the whole thing into the kernel image, but this is not a big issue
anyway.

You can have a look for example at book3s_64_vio_hv.o vs. book3s_64_vio.o
to get a better picture if you like.


>> There are 3 possible interfaces to use:
>> A. set/get attributes
>> B. ioctl
>> C. additional API
>
> I think we need to differentiate user interfaces from kernel interfaces.
> Within the kernel, we make no guarantees about interfaces and we can
> change them to meet our needs. That includes the vfio external user
> interface. For userspace, we need to be careful not to break things. I
> suggest we use the set/get/has attribute interface that's already part
> of KVM for the user interface.
>
>> What I posted a week ago uses A for 1 and C for 2. Now we move this to
>> virt/kvm/vfio.c.
>
> I don't care where it lives, other than we both have a need for it, so
> it seems like the core of it should not live in architecture specific
> directories.
>
>> A for 1 is more or less ok but how exactly? Yet another attribute? Platform
>> specific "bus ID"? In your patch attr->addr is not really an address (to be
>> accessed with get_user()) but just an fd.
>
> Is that a problem?

Not for me but I have a bad taste :)


>> For 2 - there are already A and B interfaces so we do not want C, right?
>> What kind of a good device has a backdoor? :) But A and B do not have
>> access control to prevent the user space from receiving an IOMMU group
>> pointer (which it would not be able to use anyway but still). Do we care
>> about this (I do not)? And using B (ioctls) within the kernel - I cannot
>> say I saw many usages of this.
>
> For kernel internal things we don't want to invent new ioctls or use the
> KVM device get_attr interface. Note that I didn't provide a get_attr
> interface for anything there now. I think we just want to create a
> kernel internal interface, ie. function calls.
>
>> I am pretty sure we will spend some time (not much) arguing about these
>> things and when we agree on something, then some of KVM maintainers will
>> step in and say that there is 1001st way of doing this and - goto start.
>> And after all, this still won't be a device as it does not emulate anything
>> present in the real hardware, this is just yet another interface to KVM and
>> some ways of using it won't be natural for somebody. /me sighs.
>
> And thus this RFC with somewhat rough code to try to get buyin. Thanks,


Well, I do not see any big issue (including no-mmu real mode) with the VFIO
KVM device as you posted it.


--
Alexey

2013-09-29 13:16:38

by Gleb Natapov

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

On Thu, Sep 12, 2013 at 03:23:15PM -0600, Alex Williamson wrote:
> So far we've succeeded at making KVM and VFIO mostly unaware of each
> other, but there's an important point where that breaks down. Intel
> VT-d hardware may or may not support snoop control. When snoop
> control is available, intel-iommu promotes No-Snoop transactions on
> PCIe to be cache coherent. That allows KVM to handle things like the
> x86 WBINVD opcode as a nop. When the hardware does not support this,
> KVM must implement a hardware visible WBINVD for the guest.
>
> We could simply let userspace tell KVM how to handle WBINVD, but it's
> privileged for a reason. Allowing an arbitrary user to enable
> physical WBINVD gives them more access to the hardware. Previously,
> this has only been enabled for guests supporting legacy PCI device
> assignment. In such cases it's necessary for proper guest execution.
> We therefore create a new KVM-VFIO virtual device. The user can add
> and remove VFIO groups to this device via file descriptors. KVM
> makes use of the VFIO external user interface to validate that the
> user has access to physical hardware and gets the coherency state of
> the IOMMU from VFIO. This provides equivalent functionality to
> legacy KVM assignment, while keeping (nearly) all the bits isolated.
>
Looks good overall to me, one thing though: to use legacy device
assignment one needs root permission, so only the root user can enable
WBINVD emulation. Who does this permission checking here? Is only root
allowed to create a non-coherent group with vfio?

> The one intrusion is the resulting flag indicating the coherency
> state. For this RFC it's placed on the x86 kvm_arch struct, however
> I know POWER has interest in using the VFIO external user interface,
> and I'm hoping we can share a common KVM-VFIO device. Perhaps they
> care about No-Snoop handling as well or the code can be #ifdef'd.
>
> Signed-off-by: Alex Williamson <[email protected]>
> ---
> Documentation/virtual/kvm/devices/vfio.txt | 22 +++
> arch/x86/include/asm/kvm_host.h | 1
> arch/x86/kvm/Makefile | 2
> arch/x86/kvm/vmx.c | 5 -
> arch/x86/kvm/x86.c | 5 -
> include/linux/kvm_host.h | 1
> include/uapi/linux/kvm.h | 4
> virt/kvm/kvm_main.c | 3
> virt/kvm/vfio.c | 237 ++++++++++++++++++++++++++++
> 9 files changed, 275 insertions(+), 5 deletions(-)
> create mode 100644 Documentation/virtual/kvm/devices/vfio.txt
> create mode 100644 virt/kvm/vfio.c
>
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> new file mode 100644
> index 0000000..831e6a6
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -0,0 +1,22 @@
> +VFIO virtual device
> +===================
> +
> +Device types supported:
> + KVM_DEV_TYPE_VFIO
> +
> +Only one VFIO instance may be created per VM. The created device
> +tracks VFIO groups in use by the VM and features of those groups
> +important to the correctness and acceleration of the VM. As groups
> +are enabled and disabled for use by the VM, KVM should be updated
> +about their presence. When registered with KVM, a reference to the
> +VFIO-group is held by KVM.
> +
> +Groups:
> + KVM_DEV_VFIO_ADD_GROUP
> + KVM_DEV_VFIO_DEL_GROUP
> +
> +Each takes an int32_t file descriptor for kvm_device_attr.addr and
> +does not support any group device kvm_device_attr.attr.
> +
> +RFC - Should we use Group KVM_DEV_VFIO_GROUP with Attributes
> + KVM_DEV_VFIO_GROUP_ADD & DEL?
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c76ff74..5b9350d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -588,6 +588,7 @@ struct kvm_arch {
>
> spinlock_t pvclock_gtod_sync_lock;
> bool use_master_clock;
> + bool vfio_noncoherent;
> u64 master_kernel_ns;
> cycle_t master_cycle_now;
>
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index bf4fb04..25d22b2 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -9,7 +9,7 @@ KVM := ../../../virt/kvm
>
> kvm-y += $(KVM)/kvm_main.o $(KVM)/ioapic.o \
> $(KVM)/coalesced_mmio.o $(KVM)/irq_comm.o \
> - $(KVM)/eventfd.o $(KVM)/irqchip.o
> + $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> kvm-$(CONFIG_KVM_DEVICE_ASSIGNMENT) += $(KVM)/assigned-dev.o $(KVM)/iommu.o
> kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 1f1da43..94f7786 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7395,8 +7395,9 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> */
> if (is_mmio)
> ret = MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> - else if (vcpu->kvm->arch.iommu_domain &&
> - !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY))
> + else if (vcpu->kvm->arch.vfio_noncoherent ||
> + vcpu->kvm->arch.iommu_domain &&
> + !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY))
> ret = kvm_get_guest_memory_type(vcpu, gfn) <<
> VMX_EPT_MT_EPTE_SHIFT;
> else
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e5ca72a..406ba6f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2715,8 +2715,9 @@ static void wbinvd_ipi(void *garbage)
>
> static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
> {
> - return vcpu->kvm->arch.iommu_domain &&
> - !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY);
> + return vcpu->kvm->arch.vfio_noncoherent ||
> + (vcpu->kvm->arch.iommu_domain &&
> + !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY));
> }
>
> void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ca645a0..615f0c3 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1065,6 +1065,7 @@ struct kvm_device *kvm_device_from_filp(struct file *filp);
>
> extern struct kvm_device_ops kvm_mpic_ops;
> extern struct kvm_device_ops kvm_xics_ops;
> +extern struct kvm_device_ops kvm_vfio_ops;
>
> #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 99c2533..8869616 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -843,6 +843,10 @@ struct kvm_device_attr {
> #define KVM_DEV_TYPE_FSL_MPIC_20 1
> #define KVM_DEV_TYPE_FSL_MPIC_42 2
> #define KVM_DEV_TYPE_XICS 3
> +#define KVM_DEV_TYPE_VFIO 4
> +
> +#define KVM_DEV_VFIO_ADD_GROUP 1
> +#define KVM_DEV_VFIO_DEL_GROUP 2
>
> /*
> * ioctls for VM fds
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d9cad4d..1a20425 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2269,6 +2269,9 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
> ops = &kvm_xics_ops;
> break;
> #endif
> + case KVM_DEV_TYPE_VFIO:
> + ops = &kvm_vfio_ops;
> + break;
> default:
> return -ENODEV;
> }
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> new file mode 100644
> index 0000000..9a2faff
> --- /dev/null
> +++ b/virt/kvm/vfio.c
> @@ -0,0 +1,237 @@
> +/*
> + * VFIO bridge
> + *
> + * Copyright (C) 2013 Red Hat, Inc. All rights reserved.
> + * Author: Alex Williamson <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/errno.h>
> +#include <linux/file.h>
> +#include <linux/kvm_host.h>
> +#include <linux/list.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/slab.h>
> +#include <linux/vfio.h>
> +
> +struct kvm_vfio_group {
> + struct list_head node;
> + struct vfio_group *vfio_group;
> +};
> +
> +struct kvm_vfio {
> + struct list_head group_list;
> + struct mutex lock;
> +};
> +
> +static struct vfio_group *kvm_vfio_group_get_external_user(struct file *filep)
> +{
> + struct vfio_group *vfio_group;
> + struct vfio_group *(*fn)(struct file *);
> +
> + fn = symbol_get(vfio_group_get_external_user);
> + if (!fn)
> + return ERR_PTR(-EINVAL);
> +
> + vfio_group = fn(filep);
> +
> + symbol_put(vfio_group_get_external_user);
> +
> + return vfio_group;
> +}
> +
> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> +{
> + void (*fn)(struct vfio_group *);
> +
> + fn = symbol_get(vfio_group_put_external_user);
> + if (!fn)
> + return;
> +
> + fn(vfio_group);
> +
> + symbol_put(vfio_group_put_external_user);
> +}
> +
> +static int kvm_vfio_external_user_check_extension(struct vfio_group *vfio_group,
> + unsigned long arg)
> +{
> + int (*fn)(struct vfio_group *, unsigned long);
> + int ret;
> +
> + fn = symbol_get(vfio_external_user_check_extension);
> + if (!fn)
> + return -EINVAL;
> +
> + ret = fn(vfio_group, arg);
> +
> + symbol_put(vfio_external_user_check_extension);
> +
> + return ret;
> +}
> +
> +static void kvm_vfio_update_iommu_coherency(struct kvm_device *dev)
> +{
> + struct kvm_vfio *kv = dev->private;
> + bool coherent = true;
> + struct kvm_vfio_group *kvg;
> +
> + mutex_lock(&kv->lock);
> +
> + list_for_each_entry(kvg, &kv->group_list, node) {
> + if (!kvm_vfio_external_user_check_extension(kvg->vfio_group,
> + VFIO_IOMMU_CAP_CACHE_COHERENCY)) {
> + coherent = false;
> + break;
> + }
> + }
> +
> + mutex_unlock(&kv->lock);
> +
> + dev->kvm->arch.vfio_noncoherent = !coherent;
> +}
> +
> +static int kvm_vfio_set_attr(struct kvm_device *dev,
> + struct kvm_device_attr *attr)
> +{
> + struct kvm_vfio *kv = dev->private;
> + struct fd f;
> + struct vfio_group *vfio_group;
> + struct kvm_vfio_group *kvg;
> + int ret;
> +
> + switch (attr->group) {
> + case KVM_DEV_VFIO_ADD_GROUP:
> + f = fdget(attr->addr);
> + if (!f.file)
> + return -EBADF;
> +
> + vfio_group = kvm_vfio_group_get_external_user(f.file);
> + fdput(f);
> +
> + if (IS_ERR(vfio_group))
> + return PTR_ERR(vfio_group);
> +
> + mutex_lock(&kv->lock);
> +
> + list_for_each_entry(kvg, &kv->group_list, node) {
> + if (kvg->vfio_group == vfio_group) {
> + mutex_unlock(&kv->lock);
> + kvm_vfio_group_put_external_user(vfio_group);
> + return -EEXIST;
> + }
> + }
> +
> + kvg = kzalloc(sizeof(*kvg), GFP_KERNEL);
> + if (!kvg) {
> + mutex_unlock(&kv->lock);
> + kvm_vfio_group_put_external_user(vfio_group);
> + return -ENOMEM;
> + }
> +
> + list_add_tail(&kvg->node, &kv->group_list);
> + kvg->vfio_group = vfio_group;
> +
> + mutex_unlock(&kv->lock);
> +
> + kvm_vfio_update_iommu_coherency(dev);
> +
> + return 0;
> +
> + case KVM_DEV_VFIO_DEL_GROUP:
> + f = fdget(attr->addr);
> + if (!f.file)
> + return -EBADF;
> +
> + vfio_group = kvm_vfio_group_get_external_user(f.file);
> + fdput(f);
> +
> + if (IS_ERR(vfio_group))
> + return PTR_ERR(vfio_group);
> +
> + ret = -ENOENT;
> +
> + mutex_lock(&kv->lock);
> +
> + list_for_each_entry(kvg, &kv->group_list, node) {
> + if (kvg->vfio_group != vfio_group)
> + continue;
> +
> + list_del(&kvg->node);
> + kvm_vfio_group_put_external_user(kvg->vfio_group);
> + kfree(kvg);
> + ret = 0;
> + break;
> + }
> +
> + mutex_unlock(&kv->lock);
> + kvm_vfio_group_put_external_user(vfio_group);
> +
> + kvm_vfio_update_iommu_coherency(dev);
> +
> + return ret;
> + }
> +
> + return -ENXIO;
> +}
> +
> +static int kvm_vfio_has_attr(struct kvm_device *dev,
> + struct kvm_device_attr *attr)
> +{
> + switch (attr->group) {
> + case KVM_DEV_VFIO_ADD_GROUP:
> + case KVM_DEV_VFIO_DEL_GROUP:
> + return 0;
> + }
> +
> + return -ENXIO;
> +}
> +
> +static void kvm_vfio_destroy(struct kvm_device *dev)
> +{
> + struct kvm_vfio *kv = dev->private;
> + struct kvm_vfio_group *kvg, *tmp;
> +
> + list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> + kvm_vfio_group_put_external_user(kvg->vfio_group);
> + list_del(&kvg->node);
> + kfree(kvg);
> + }
> +
> + dev->kvm->arch.vfio_noncoherent = false;
> + kfree(kv);
> +}
> +
> +static int kvm_vfio_create(struct kvm_device *dev, u32 type)
> +{
> + struct kvm_device *tmp;
> + struct kvm_vfio *kv;
> +
> + /* Only one VFIO "device" per VM */
> + list_for_each_entry(tmp, &dev->kvm->devices, vm_node)
> + if (tmp->ops == &kvm_vfio_ops)
> + return -EBUSY;
> +
> + kv = kzalloc(sizeof(*kv), GFP_KERNEL);
> + if (!kv)
> + return -ENOMEM;
> +
> + INIT_LIST_HEAD(&kv->group_list);
> + mutex_init(&kv->lock);
> +
> + dev->private = kv;
> +
> + return 0;
> +}
> +
> +struct kvm_device_ops kvm_vfio_ops = {
> + .name = "kvm-vfio",
> + .create = kvm_vfio_create,
> + .destroy = kvm_vfio_destroy,
> + .set_attr = kvm_vfio_set_attr,
> + .has_attr = kvm_vfio_has_attr,
> +};

--
Gleb.
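
A note on the wrapper functions quoted above: they resolve vfio's exported
symbols at run time with symbol_get()/symbol_put() rather than calling them
directly, so kvm does not take a hard module dependency on vfio and the calls
simply fail gracefully when vfio is not loaded. A minimal sketch of that
pattern, using a hypothetical exported function name, might look like:

/* Sketch only: some_exported_func is a hypothetical symbol exported
 * elsewhere with EXPORT_SYMBOL(); it stands in for the vfio entry points. */
#include <linux/module.h>

extern int some_exported_func(int arg);

static int call_if_available(int arg)
{
	int (*fn)(int);
	int ret = -EOPNOTSUPP;

	fn = symbol_get(some_exported_func);	/* takes a module ref, or NULL if unavailable */
	if (fn) {
		ret = fn(arg);
		symbol_put(some_exported_func);	/* drop the module reference */
	}

	return ret;
}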

2013-09-29 13:52:39

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

On Sun, 2013-09-29 at 16:16 +0300, Gleb Natapov wrote:
> On Thu, Sep 12, 2013 at 03:23:15PM -0600, Alex Williamson wrote:
> > So far we've succeeded at making KVM and VFIO mostly unaware of each
> > other, but there's an important point where that breaks down. Intel
> > VT-d hardware may or may not support snoop control. When snoop
> > control is available, intel-iommu promotes No-Snoop transactions on
> > PCIe to be cache coherent. That allows KVM to handle things like the
> > x86 WBINVD opcode as a nop. When the hardware does not support this,
> > KVM must implement a hardware visible WBINVD for the guest.
> >
> > We could simply let userspace tell KVM how to handle WBINVD, but it's
> > privileged for a reason. Allowing an arbitrary user to enable
> > physical WBINVD gives them more access to the hardware. Previously,
> > this has only been enabled for guests supporting legacy PCI device
> > assignment. In such cases it's necessary for proper guest execution.
> > We therefore create a new KVM-VFIO virtual device. The user can add
> > and remove VFIO groups to this device via file descriptors. KVM
> > makes use of the VFIO external user interface to validate that the
> > user has access to physical hardware and gets the coherency state of
> > the IOMMU from VFIO. This provides equivalent functionality to
> > legacy KVM assignment, while keeping (nearly) all the bits isolated.
> >
> Looks good overall to me, one thing though: to use legacy device
> assignment one needs root permission, so only the root user can enable
> WBINVD emulation.

That's not entirely accurate; legacy device assignment can be used by a
non-root user, and libvirt does this all the time. The part that requires
root access is opening the pci-sysfs config file; the rest can be
managed via file permissions on the remaining sysfs files.

> Who does this permission checking here? Is only root
> allowed to create a non-coherent group with vfio?

With vfio the user is granted permission by giving them access to the
vfio group file (/dev/vfio/$GROUP) and binding all the devices in the
group to vfio. That enables the user to create a container (~iommu
domain) with the group attached to it. Only then will the vfio external
user interface provide a reference to the group and enable this wbinvd
support. So, wbinvd emulation should only be available to a user that
"own" a vfio group and has it configured for use with this interface.
Thanks,

Alex

> > The one intrusion is the resulting flag indicating the coherency
> > state. For this RFC it's placed on the x86 kvm_arch struct, however
> > I know POWER has interest in using the VFIO external user interface,
> > and I'm hoping we can share a common KVM-VFIO device. Perhaps they
> > care about No-Snoop handling as well or the code can be #ifdef'd.
> >
> > Signed-off-by: Alex Williamson <[email protected]>
> > ---
> > Documentation/virtual/kvm/devices/vfio.txt | 22 +++
> > arch/x86/include/asm/kvm_host.h | 1
> > arch/x86/kvm/Makefile | 2
> > arch/x86/kvm/vmx.c | 5 -
> > arch/x86/kvm/x86.c | 5 -
> > include/linux/kvm_host.h | 1
> > include/uapi/linux/kvm.h | 4
> > virt/kvm/kvm_main.c | 3
> > virt/kvm/vfio.c | 237 ++++++++++++++++++++++++++++
> > 9 files changed, 275 insertions(+), 5 deletions(-)
> > create mode 100644 Documentation/virtual/kvm/devices/vfio.txt
> > create mode 100644 virt/kvm/vfio.c
> >
> > diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> > new file mode 100644
> > index 0000000..831e6a6
> > --- /dev/null
> > +++ b/Documentation/virtual/kvm/devices/vfio.txt
> > @@ -0,0 +1,22 @@
> > +VFIO virtual device
> > +===================
> > +
> > +Device types supported:
> > + KVM_DEV_TYPE_VFIO
> > +
> > +Only one VFIO instance may be created per VM. The created device
> > +tracks VFIO groups in use by the VM and features of those groups
> > +important to the correctness and acceleration of the VM. As groups
> > +are enabled and disabled for use by the VM, KVM should be updated
> > +about their presence. When registered with KVM, a reference to the
> > +VFIO-group is held by KVM.
> > +
> > +Groups:
> > + KVM_DEV_VFIO_ADD_GROUP
> > + KVM_DEV_VFIO_DEL_GROUP
> > +
> > +Each takes an int32_t file descriptor for kvm_device_attr.addr and
> > +does not support any group device kvm_device_attr.attr.
> > +
> > +RFC - Should we use Group KVM_DEV_VFIO_GROUP with Attributes
> > + KVM_DEV_VFIO_GROUP_ADD & DEL?
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index c76ff74..5b9350d 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -588,6 +588,7 @@ struct kvm_arch {
> >
> > spinlock_t pvclock_gtod_sync_lock;
> > bool use_master_clock;
> > + bool vfio_noncoherent;
> > u64 master_kernel_ns;
> > cycle_t master_cycle_now;
> >
> > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > index bf4fb04..25d22b2 100644
> > --- a/arch/x86/kvm/Makefile
> > +++ b/arch/x86/kvm/Makefile
> > @@ -9,7 +9,7 @@ KVM := ../../../virt/kvm
> >
> > kvm-y += $(KVM)/kvm_main.o $(KVM)/ioapic.o \
> > $(KVM)/coalesced_mmio.o $(KVM)/irq_comm.o \
> > - $(KVM)/eventfd.o $(KVM)/irqchip.o
> > + $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> > kvm-$(CONFIG_KVM_DEVICE_ASSIGNMENT) += $(KVM)/assigned-dev.o $(KVM)/iommu.o
> > kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
> >
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 1f1da43..94f7786 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -7395,8 +7395,9 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > */
> > if (is_mmio)
> > ret = MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> > - else if (vcpu->kvm->arch.iommu_domain &&
> > - !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY))
> > + else if (vcpu->kvm->arch.vfio_noncoherent ||
> > + (vcpu->kvm->arch.iommu_domain &&
> > + !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY)))
> > ret = kvm_get_guest_memory_type(vcpu, gfn) <<
> > VMX_EPT_MT_EPTE_SHIFT;
> > else
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index e5ca72a..406ba6f 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2715,8 +2715,9 @@ static void wbinvd_ipi(void *garbage)
> >
> > static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
> > {
> > - return vcpu->kvm->arch.iommu_domain &&
> > - !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY);
> > + return vcpu->kvm->arch.vfio_noncoherent ||
> > + (vcpu->kvm->arch.iommu_domain &&
> > + !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY));
> > }
> >
> > void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index ca645a0..615f0c3 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1065,6 +1065,7 @@ struct kvm_device *kvm_device_from_filp(struct file *filp);
> >
> > extern struct kvm_device_ops kvm_mpic_ops;
> > extern struct kvm_device_ops kvm_xics_ops;
> > +extern struct kvm_device_ops kvm_vfio_ops;
> >
> > #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
> >
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 99c2533..8869616 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -843,6 +843,10 @@ struct kvm_device_attr {
> > #define KVM_DEV_TYPE_FSL_MPIC_20 1
> > #define KVM_DEV_TYPE_FSL_MPIC_42 2
> > #define KVM_DEV_TYPE_XICS 3
> > +#define KVM_DEV_TYPE_VFIO 4
> > +
> > +#define KVM_DEV_VFIO_ADD_GROUP 1
> > +#define KVM_DEV_VFIO_DEL_GROUP 2
> >
> > /*
> > * ioctls for VM fds
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index d9cad4d..1a20425 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2269,6 +2269,9 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
> > ops = &kvm_xics_ops;
> > break;
> > #endif
> > + case KVM_DEV_TYPE_VFIO:
> > + ops = &kvm_vfio_ops;
> > + break;
> > default:
> > return -ENODEV;
> > }
> > diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> > new file mode 100644
> > index 0000000..9a2faff
> > --- /dev/null
> > +++ b/virt/kvm/vfio.c
> > @@ -0,0 +1,237 @@
> > +/*
> > + * VFIO bridge
> > + *
> > + * Copyright (C) 2013 Red Hat, Inc. All rights reserved.
> > + * Author: Alex Williamson <[email protected]>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <linux/errno.h>
> > +#include <linux/file.h>
> > +#include <linux/kvm_host.h>
> > +#include <linux/list.h>
> > +#include <linux/module.h>
> > +#include <linux/mutex.h>
> > +#include <linux/slab.h>
> > +#include <linux/vfio.h>
> > +
> > +struct kvm_vfio_group {
> > + struct list_head node;
> > + struct vfio_group *vfio_group;
> > +};
> > +
> > +struct kvm_vfio {
> > + struct list_head group_list;
> > + struct mutex lock;
> > +};
> > +
> > +static struct vfio_group *kvm_vfio_group_get_external_user(struct file *filep)
> > +{
> > + struct vfio_group *vfio_group;
> > + struct vfio_group *(*fn)(struct file *);
> > +
> > + fn = symbol_get(vfio_group_get_external_user);
> > + if (!fn)
> > + return ERR_PTR(-EINVAL);
> > +
> > + vfio_group = fn(filep);
> > +
> > + symbol_put(vfio_group_get_external_user);
> > +
> > + return vfio_group;
> > +}
> > +
> > +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> > +{
> > + void (*fn)(struct vfio_group *);
> > +
> > + fn = symbol_get(vfio_group_put_external_user);
> > + if (!fn)
> > + return;
> > +
> > + fn(vfio_group);
> > +
> > + symbol_put(vfio_group_put_external_user);
> > +}
> > +
> > +static int kvm_vfio_external_user_check_extension(struct vfio_group *vfio_group,
> > + unsigned long arg)
> > +{
> > + int (*fn)(struct vfio_group *, unsigned long);
> > + int ret;
> > +
> > + fn = symbol_get(vfio_external_user_check_extension);
> > + if (!fn)
> > + return -EINVAL;
> > +
> > + ret = fn(vfio_group, arg);
> > +
> > + symbol_put(vfio_external_user_check_extension);
> > +
> > + return ret;
> > +}
> > +
> > +static void kvm_vfio_update_iommu_coherency(struct kvm_device *dev)
> > +{
> > + struct kvm_vfio *kv = dev->private;
> > + bool coherent = true;
> > + struct kvm_vfio_group *kvg;
> > +
> > + mutex_lock(&kv->lock);
> > +
> > + list_for_each_entry(kvg, &kv->group_list, node) {
> > + if (!kvm_vfio_external_user_check_extension(kvg->vfio_group,
> > + VFIO_IOMMU_CAP_CACHE_COHERENCY)) {
> > + coherent = false;
> > + break;
> > + }
> > + }
> > +
> > + mutex_unlock(&kv->lock);
> > +
> > + dev->kvm->arch.vfio_noncoherent = !coherent;
> > +}
> > +
> > +static int kvm_vfio_set_attr(struct kvm_device *dev,
> > + struct kvm_device_attr *attr)
> > +{
> > + struct kvm_vfio *kv = dev->private;
> > + struct fd f;
> > + struct vfio_group *vfio_group;
> > + struct kvm_vfio_group *kvg;
> > + int ret;
> > +
> > + switch (attr->group) {
> > + case KVM_DEV_VFIO_ADD_GROUP:
> > + f = fdget(attr->addr);
> > + if (!f.file)
> > + return -EBADF;
> > +
> > + vfio_group = kvm_vfio_group_get_external_user(f.file);
> > + fdput(f);
> > +
> > + if (IS_ERR(vfio_group))
> > + return PTR_ERR(vfio_group);
> > +
> > + mutex_lock(&kv->lock);
> > +
> > + list_for_each_entry(kvg, &kv->group_list, node) {
> > + if (kvg->vfio_group == vfio_group) {
> > + mutex_unlock(&kv->lock);
> > + kvm_vfio_group_put_external_user(vfio_group);
> > + return -EEXIST;
> > + }
> > + }
> > +
> > + kvg = kzalloc(sizeof(*kvg), GFP_KERNEL);
> > + if (!kvg) {
> > + mutex_unlock(&kv->lock);
> > + kvm_vfio_group_put_external_user(vfio_group);
> > + return -ENOMEM;
> > + }
> > +
> > + list_add_tail(&kvg->node, &kv->group_list);
> > + kvg->vfio_group = vfio_group;
> > +
> > + mutex_unlock(&kv->lock);
> > +
> > + kvm_vfio_update_iommu_coherency(dev);
> > +
> > + return 0;
> > +
> > + case KVM_DEV_VFIO_DEL_GROUP:
> > + f = fdget(attr->addr);
> > + if (!f.file)
> > + return -EBADF;
> > +
> > + vfio_group = kvm_vfio_group_get_external_user(f.file);
> > + fdput(f);
> > +
> > + if (IS_ERR(vfio_group))
> > + return PTR_ERR(vfio_group);
> > +
> > + ret = -ENOENT;
> > +
> > + mutex_lock(&kv->lock);
> > +
> > + list_for_each_entry(kvg, &kv->group_list, node) {
> > + if (kvg->vfio_group != vfio_group)
> > + continue;
> > +
> > + list_del(&kvg->node);
> > + kvm_vfio_group_put_external_user(kvg->vfio_group);
> > + kfree(kvg);
> > + ret = 0;
> > + break;
> > + }
> > +
> > + mutex_unlock(&kv->lock);
> > + kvm_vfio_group_put_external_user(vfio_group);
> > +
> > + kvm_vfio_update_iommu_coherency(dev);
> > +
> > + return ret;
> > + }
> > +
> > + return -ENXIO;
> > +}
> > +
> > +static int kvm_vfio_has_attr(struct kvm_device *dev,
> > + struct kvm_device_attr *attr)
> > +{
> > + switch (attr->group) {
> > + case KVM_DEV_VFIO_ADD_GROUP:
> > + case KVM_DEV_VFIO_DEL_GROUP:
> > + return 0;
> > + }
> > +
> > + return -ENXIO;
> > +}
> > +
> > +static void kvm_vfio_destroy(struct kvm_device *dev)
> > +{
> > + struct kvm_vfio *kv = dev->private;
> > + struct kvm_vfio_group *kvg, *tmp;
> > +
> > + list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> > + kvm_vfio_group_put_external_user(kvg->vfio_group);
> > + list_del(&kvg->node);
> > + kfree(kvg);
> > + }
> > +
> > + dev->kvm->arch.vfio_noncoherent = false;
> > + kfree(kv);
> > +}
> > +
> > +static int kvm_vfio_create(struct kvm_device *dev, u32 type)
> > +{
> > + struct kvm_device *tmp;
> > + struct kvm_vfio *kv;
> > +
> > + /* Only one VFIO "device" per VM */
> > + list_for_each_entry(tmp, &dev->kvm->devices, vm_node)
> > + if (tmp->ops == &kvm_vfio_ops)
> > + return -EBUSY;
> > +
> > + kv = kzalloc(sizeof(*kv), GFP_KERNEL);
> > + if (!kv)
> > + return -ENOMEM;
> > +
> > + INIT_LIST_HEAD(&kv->group_list);
> > + mutex_init(&kv->lock);
> > +
> > + dev->private = kv;
> > +
> > + return 0;
> > +}
> > +
> > +struct kvm_device_ops kvm_vfio_ops = {
> > + .name = "kvm-vfio",
> > + .create = kvm_vfio_create,
> > + .destroy = kvm_vfio_destroy,
> > + .set_attr = kvm_vfio_set_attr,
> > + .has_attr = kvm_vfio_has_attr,
> > +};
>
> --
> Gleb.
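
To make the flow Alex describes concrete, here is a rough userspace sketch of
registering a VFIO group with the proposed KVM-VFIO device. It follows the
constants defined by this RFC (KVM_DEV_TYPE_VFIO, KVM_DEV_VFIO_ADD_GROUP); the
group number and file descriptors are placeholders, and error handling is
omitted:

/* Sketch only: assumes an existing KVM VM fd and a hypothetical group 26. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <linux/vfio.h>

void register_group_with_kvm(int vmfd)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/26", O_RDWR);

	/* Normal VFIO setup: attach the group to a container (~iommu domain). */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	/* Create the single KVM-VFIO device for this VM ... */
	struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VFIO };
	ioctl(vmfd, KVM_CREATE_DEVICE, &cd);

	/* ... and hand KVM the group fd so it can track IOMMU coherency. */
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_ADD_GROUP,
		.addr = group,	/* int32_t file descriptor, per the documentation */
	};
	ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr);
}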


2013-09-29 14:44:21

by Gleb Natapov

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

On Sun, Sep 29, 2013 at 07:52:28AM -0600, Alex Williamson wrote:
> On Sun, 2013-09-29 at 16:16 +0300, Gleb Natapov wrote:
> > On Thu, Sep 12, 2013 at 03:23:15PM -0600, Alex Williamson wrote:
> > > So far we've succeeded at making KVM and VFIO mostly unaware of each
> > > other, but there's an important point where that breaks down. Intel
> > > VT-d hardware may or may not support snoop control. When snoop
> > > control is available, intel-iommu promotes No-Snoop transactions on
> > > PCIe to be cache coherent. That allows KVM to handle things like the
> > > x86 WBINVD opcode as a nop. When the hardware does not support this,
> > > KVM must implement a hardware visible WBINVD for the guest.
> > >
> > > We could simply let userspace tell KVM how to handle WBINVD, but it's
> > > privileged for a reason. Allowing an arbitrary user to enable
> > > physical WBINVD gives them more access to the hardware. Previously,
> > > this has only been enabled for guests supporting legacy PCI device
> > > assignment. In such cases it's necessary for proper guest execution.
> > > We therefore create a new KVM-VFIO virtual device. The user can add
> > > and remove VFIO groups to this device via file descriptors. KVM
> > > makes use of the VFIO external user interface to validate that the
> > > user has access to physical hardware and gets the coherency state of
> > > the IOMMU from VFIO. This provides equivalent functionality to
> > > legacy KVM assignment, while keeping (nearly) all the bits isolated.
> > >
> > Looks good overall to me, one thing though: to use legacy device
> > assignment one needs root permission, so only the root user can enable
> > WBINVD emulation.
>
> > That's not entirely accurate; legacy device assignment can be used by a
> > non-root user, and libvirt does this all the time. The part that requires
> > root access is opening the pci-sysfs config file; the rest can be
> > managed via file permissions on the remaining sysfs files.
>
So how does libvirt manage to do that as a non-root user if the pci-sysfs
config file needs root permission? I didn't mean to say that the legacy code
checks for root explicitly; what I meant is that at some point root
permission is needed.

> > Who does this permission checking here? Is only root
> > allowed to create a non-coherent group with vfio?
>
> With vfio the user is granted permission by giving them access to the
> vfio group file (/dev/vfio/$GROUP) and binding all the devices in the
> group to vfio. That enables the user to create a container (~iommu
> domain) with the group attached to it. Only then will the vfio external
> user interface provide a reference to the group and enable this wbinvd
> support. So, wbinvd emulation should only be available to a user that
> "own" a vfio group and has it configured for use with this interface.
What is the default permission of /dev/vfio/$GROUP?

--
Gleb.

2013-09-29 15:56:08

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

On Sun, 2013-09-29 at 17:44 +0300, Gleb Natapov wrote:
> On Sun, Sep 29, 2013 at 07:52:28AM -0600, Alex Williamson wrote:
> > On Sun, 2013-09-29 at 16:16 +0300, Gleb Natapov wrote:
> > > On Thu, Sep 12, 2013 at 03:23:15PM -0600, Alex Williamson wrote:
> > > > So far we've succeeded at making KVM and VFIO mostly unaware of each
> > > > other, but there's an important point where that breaks down. Intel
> > > > VT-d hardware may or may not support snoop control. When snoop
> > > > control is available, intel-iommu promotes No-Snoop transactions on
> > > > PCIe to be cache coherent. That allows KVM to handle things like the
> > > > x86 WBINVD opcode as a nop. When the hardware does not support this,
> > > > KVM must implement a hardware visible WBINVD for the guest.
> > > >
> > > > We could simply let userspace tell KVM how to handle WBINVD, but it's
> > > > privileged for a reason. Allowing an arbitrary user to enable
> > > > physical WBINVD gives them more access to the hardware. Previously,
> > > > this has only been enabled for guests supporting legacy PCI device
> > > > assignment. In such cases it's necessary for proper guest execution.
> > > > We therefore create a new KVM-VFIO virtual device. The user can add
> > > > and remove VFIO groups to this device via file descriptors. KVM
> > > > makes use of the VFIO external user interface to validate that the
> > > > user has access to physical hardware and gets the coherency state of
> > > > the IOMMU from VFIO. This provides equivalent functionality to
> > > > legacy KVM assignment, while keeping (nearly) all the bits isolated.
> > > >
> > > Looks good overall to me, one thing though: to use legacy device
> > > assignment one needs root permission, so only the root user can enable
> > > WBINVD emulation.
> >
> > > That's not entirely accurate; legacy device assignment can be used by a
> > > non-root user, and libvirt does this all the time. The part that requires
> > > root access is opening the pci-sysfs config file; the rest can be
> > > managed via file permissions on the remaining sysfs files.
> >
> So how does libvirt manage to do that as a non-root user if the pci-sysfs
> config file needs root permission? I didn't mean to say that the legacy code
> checks for root explicitly; what I meant is that at some point root
> permission is needed.

Yes, libvirt needs admin permission for legacy to bind to pci-stub,
change permission on sysfs files and pass an opened pci config sysfs
file descriptor. For vfio libvirt needs admin permission to bind to
vfio-pci and change group file permission. From that perspective the
admin requirement is similar.

> > > Who does this permission checking here? Is only root
> > > allowed to create a non-coherent group with vfio?
> >
> > With vfio the user is granted permission by giving them access to the
> > vfio group file (/dev/vfio/$GROUP) and binding all the devices in the
> > group to vfio. That enables the user to create a container (~iommu
> > domain) with the group attached to it. Only then will the vfio external
> > user interface provide a reference to the group and enable this wbinvd
> > support. So, wbinvd emulation should only be available to a user that
> > "own" a vfio group and has it configured for use with this interface.
> What is the default permission of /dev/vfio/$GROUP?

It's 600. Thanks,

Alex
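
For reference, the administrative steps described above (bind the device to
vfio-pci, then grant the VM user access to the group file, which defaults to
mode 600) boil down to a few sysfs writes plus a chown. A rough sketch, where
the device address, vendor/device ID, group number and uid are all
hypothetical placeholders:

/* Sketch only: 0000:06:00.0, 8086 10d3, group 26 and uid 107 are placeholders. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	write(fd, val, strlen(val));
	close(fd);
}

void grant_vfio_group(void)
{
	/* Unbind the device from its host driver ... */
	write_str("/sys/bus/pci/devices/0000:06:00.0/driver/unbind", "0000:06:00.0");
	/* ... and let vfio-pci claim it by registering its vendor/device ID. */
	write_str("/sys/bus/pci/drivers/vfio-pci/new_id", "8086 10d3");

	/* Give the unprivileged VM user the group file (created root:root 0600). */
	chown("/dev/vfio/26", 107, 107);
}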

2013-09-30 13:23:38

by Gleb Natapov

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

On Sun, Sep 29, 2013 at 09:55:59AM -0600, Alex Williamson wrote:
> On Sun, 2013-09-29 at 17:44 +0300, Gleb Natapov wrote:
> > On Sun, Sep 29, 2013 at 07:52:28AM -0600, Alex Williamson wrote:
> > > On Sun, 2013-09-29 at 16:16 +0300, Gleb Natapov wrote:
> > > > On Thu, Sep 12, 2013 at 03:23:15PM -0600, Alex Williamson wrote:
> > > > > So far we've succeeded at making KVM and VFIO mostly unaware of each
> > > > > other, but there's an important point where that breaks down. Intel
> > > > > VT-d hardware may or may not support snoop control. When snoop
> > > > > control is available, intel-iommu promotes No-Snoop transactions on
> > > > > PCIe to be cache coherent. That allows KVM to handle things like the
> > > > > x86 WBINVD opcode as a nop. When the hardware does not support this,
> > > > > KVM must implement a hardware visible WBINVD for the guest.
> > > > >
> > > > > We could simply let userspace tell KVM how to handle WBINVD, but it's
> > > > > privileged for a reason. Allowing an arbitrary user to enable
> > > > > physical WBINVD gives them more access to the hardware. Previously,
> > > > > this has only been enabled for guests supporting legacy PCI device
> > > > > assignment. In such cases it's necessary for proper guest execution.
> > > > > We therefore create a new KVM-VFIO virtual device. The user can add
> > > > > and remove VFIO groups to this device via file descriptors. KVM
> > > > > makes use of the VFIO external user interface to validate that the
> > > > > user has access to physical hardware and gets the coherency state of
> > > > > the IOMMU from VFIO. This provides equivalent functionality to
> > > > > legacy KVM assignment, while keeping (nearly) all the bits isolated.
> > > > >
> > > > Looks good overall to me, one thing though: to use legacy device
> > > > assignment one needs root permission, so only the root user can enable
> > > > WBINVD emulation.
> > >
> > > > That's not entirely accurate; legacy device assignment can be used by a
> > > > non-root user, and libvirt does this all the time. The part that requires
> > > > root access is opening the pci-sysfs config file; the rest can be
> > > > managed via file permissions on the remaining sysfs files.
> > >
> > So how does libvirt manage to do that as a non-root user if the pci-sysfs
> > config file needs root permission? I didn't mean to say that the legacy code
> > checks for root explicitly; what I meant is that at some point root
> > permission is needed.
>
> Yes, libvirt needs admin permission for legacy to bind to pci-stub,
> change permission on sysfs files and pass an opened pci config sysfs
> file descriptor. For vfio libvirt needs admin permission to bind to
> vfio-pci and change group file permission. From that perspective the
> admin requirement is similar.
>
Yes, certainly appears so. Thanks.

--
Gleb.