2009-10-02 20:19:21

by Gregory Haskins

Subject: [PATCH v2 0/4] KVM: xinterface

(Applies to kvm.git/master:083e9e10)

For details, please read the patch headers.

[ Changelog:

v2:
*) We now re-use the vmfd as the binding token, instead of creating
a new separate namespace
*) Added support for switch_to(mm), which is much faster
*) Added support for memslot-cache for exploiting slot locality
*) Added support for scatter-gather access
*) Added support for xioevent interface

v1:
*) Initial release
]

This series is included in upstream AlacrityVM and is well tested and
known to work properly. Comments?

Kind Regards,
-Greg

---

Gregory Haskins (4):
KVM: add scatterlist support to xinterface
KVM: add io services to xinterface
KVM: introduce "xinterface" API for external interaction with guests
mm: export use_mm() and unuse_mm() to modules


arch/x86/kvm/Makefile | 2
include/linux/kvm_host.h | 3
include/linux/kvm_xinterface.h | 165 +++++++++++
kernel/fork.c | 1
mm/mmu_context.c | 3
virt/kvm/kvm_main.c | 24 ++
virt/kvm/xinterface.c | 587 ++++++++++++++++++++++++++++++++++++++++
7 files changed, 784 insertions(+), 1 deletions(-)
create mode 100644 include/linux/kvm_xinterface.h
create mode 100644 virt/kvm/xinterface.c

--
Signature


2009-10-02 20:19:37

by Gregory Haskins

Subject: [PATCH v2 1/4] mm: export use_mm() and unuse_mm() to modules

We want to use these functions from within KVM, which may be built as
a module.

Signed-off-by: Gregory Haskins <[email protected]>
---

mm/mmu_context.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index ded9081..f31ba20 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -6,6 +6,7 @@
#include <linux/mm.h>
#include <linux/mmu_context.h>
#include <linux/sched.h>
+#include <linux/module.h>

#include <asm/mmu_context.h>

@@ -37,6 +38,7 @@ void use_mm(struct mm_struct *mm)
if (active_mm != mm)
mmdrop(active_mm);
}
+EXPORT_SYMBOL_GPL(use_mm);

/*
* unuse_mm
@@ -56,3 +58,4 @@ void unuse_mm(struct mm_struct *mm)
enter_lazy_tlb(mm, tsk);
task_unlock(tsk);
}
+EXPORT_SYMBOL_GPL(unuse_mm);

2009-10-02 20:19:40

by Gregory Haskins

Subject: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

What: xinterface is a mechanism that allows kernel modules external to
the kvm.ko proper to interface with a running guest. It accomplishes
this by creating an abstracted interface which does not expose any
private details of the guest or its related KVM structures, and provides
a mechanism to find and bind to this interface at run-time.

Why: There are various subsystems that would like to interact with a KVM
guest which are ideally suited to exist outside the domain of the kvm.ko
core logic. For instance, external pci-passthrough, virtual-bus, and
virtio-net modules are currently under development. In order for these
modules to successfully interact with the guest, they need, at the very
least, various interfaces for signaling IO events, pointer translation,
and possibly memory mapping.

The signaling case is covered by the recent introduction of the
irqfd/ioeventfd mechanisms. This patch provides a mechanism to cover the
other cases. Note that today we only expose pointer-translation related
functions, but more could be added at a future date as needs arise.

Example usage: QEMU instantiates a guest, and an external module "foo"
that desires the ability to interface with the guest (say via
open("/dev/foo")). QEMU may then pass the kvmfd to foo via an
ioctl, such as: ioctl(foofd, FOO_SET_VMID, &kvmfd). Upon receipt, the
foo module can issue kvm_xinterface_bind(kvmfd) to acquire
the proper context. Internally, the struct kvm* and associated
struct module* will remain pinned at least until the foo module calls
kvm_xinterface_put().
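
For illustration only, here is a rough sketch of the foo module's side of
this handshake (struct foo_dev, foo_set_vmid(), and foo_shutdown() are
invented names for this example, not part of this series):

    #include <linux/err.h>
    #include <linux/kvm_xinterface.h>

    struct foo_dev {
            struct kvm_xinterface *intf;
    };

    /* ioctl(foofd, FOO_SET_VMID, &kvmfd) would land here with the vmfd */
    static long foo_set_vmid(struct foo_dev *foo, int kvmfd)
    {
            struct kvm_xinterface *intf;

            intf = kvm_xinterface_bind(kvmfd);   /* pins struct kvm and kvm.ko */
            if (IS_ERR(intf))
                    return PTR_ERR(intf);

            foo->intf = intf;
            return 0;
    }

    static void foo_shutdown(struct foo_dev *foo)
    {
            kvm_xinterface_put(foo->intf);       /* drops the reference taken at bind */
    }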

Signed-off-by: Gregory Haskins <[email protected]>
---

arch/x86/kvm/Makefile | 2
include/linux/kvm_host.h | 3
include/linux/kvm_xinterface.h | 114 +++++++++++
kernel/fork.c | 1
virt/kvm/kvm_main.c | 24 ++
virt/kvm/xinterface.c | 409 ++++++++++++++++++++++++++++++++++++++++
6 files changed, 552 insertions(+), 1 deletions(-)
create mode 100644 include/linux/kvm_xinterface.h
create mode 100644 virt/kvm/xinterface.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 31a7035..0449d6e 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I.

kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
coalesced_mmio.o irq_comm.o eventfd.o \
- assigned-dev.o)
+ assigned-dev.o xinterface.o)
kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o)

kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b985a29..7cc1afb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -362,6 +362,9 @@ void kvm_arch_sync_events(struct kvm *kvm);
int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
void kvm_vcpu_kick(struct kvm_vcpu *vcpu);

+struct kvm_xinterface *
+kvm_xinterface_alloc(struct kvm *kvm, struct module *owner);
+
int kvm_is_mmio_pfn(pfn_t pfn);

struct kvm_irq_ack_notifier {
diff --git a/include/linux/kvm_xinterface.h b/include/linux/kvm_xinterface.h
new file mode 100644
index 0000000..01f092b
--- /dev/null
+++ b/include/linux/kvm_xinterface.h
@@ -0,0 +1,114 @@
+#ifndef __KVM_XINTERFACE_H
+#define __KVM_XINTERFACE_H
+
+/*
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/file.h>
+
+struct kvm_xinterface;
+struct kvm_xvmap;
+
+struct kvm_xinterface_ops {
+ unsigned long (*copy_to)(struct kvm_xinterface *intf,
+ unsigned long gpa, const void *src,
+ unsigned long len);
+ unsigned long (*copy_from)(struct kvm_xinterface *intf, void *dst,
+ unsigned long gpa, unsigned long len);
+ struct kvm_xvmap* (*vmap)(struct kvm_xinterface *intf,
+ unsigned long gpa,
+ unsigned long len);
+ void (*release)(struct kvm_xinterface *);
+};
+
+struct kvm_xinterface {
+ struct module *owner;
+ struct kref kref;
+ const struct kvm_xinterface_ops *ops;
+};
+
+static inline void
+kvm_xinterface_get(struct kvm_xinterface *intf)
+{
+ kref_get(&intf->kref);
+}
+
+static inline void
+_kvm_xinterface_release(struct kref *kref)
+{
+ struct kvm_xinterface *intf;
+ struct module *owner;
+
+ intf = container_of(kref, struct kvm_xinterface, kref);
+
+ owner = intf->owner;
+ rmb();
+
+ intf->ops->release(intf);
+ module_put(owner);
+}
+
+static inline void
+kvm_xinterface_put(struct kvm_xinterface *intf)
+{
+ kref_put(&intf->kref, _kvm_xinterface_release);
+}
+
+struct kvm_xvmap_ops {
+ void (*release)(struct kvm_xvmap *vmap);
+};
+
+struct kvm_xvmap {
+ struct kref kref;
+ const struct kvm_xvmap_ops *ops;
+ struct kvm_xinterface *intf;
+ void *addr;
+ size_t len;
+};
+
+static inline void
+kvm_xvmap_init(struct kvm_xvmap *vmap, const struct kvm_xvmap_ops *ops,
+ struct kvm_xinterface *intf)
+{
+ memset(vmap, 0, sizeof(*vmap));
+ kref_init(&vmap->kref);
+ vmap->ops = ops;
+ vmap->intf = intf;
+
+ kvm_xinterface_get(intf);
+}
+
+static inline void
+kvm_xvmap_get(struct kvm_xvmap *vmap)
+{
+ kref_get(&vmap->kref);
+}
+
+static inline void
+_kvm_xvmap_release(struct kref *kref)
+{
+ struct kvm_xvmap *vmap;
+ struct kvm_xinterface *intf;
+
+ vmap = container_of(kref, struct kvm_xvmap, kref);
+
+ intf = vmap->intf;
+ rmb();
+
+ vmap->ops->release(vmap);
+ kvm_xinterface_put(intf);
+}
+
+static inline void
+kvm_xvmap_put(struct kvm_xvmap *vmap)
+{
+ kref_put(&vmap->kref, _kvm_xvmap_release);
+}
+
+struct kvm_xinterface *kvm_xinterface_bind(int fd);
+
+#endif /* __KVM_XINTERFACE_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 1020977..6290e95 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -167,6 +167,7 @@ void __put_task_struct(struct task_struct *tsk)
if (!profile_handoff_task(tsk))
free_task(tsk);
}
+EXPORT_SYMBOL_GPL(__put_task_struct);

/*
* macro override instead of weak attribute alias, to workaround
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9e776d9..0fae69c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -43,6 +43,7 @@
#include <linux/swap.h>
#include <linux/bitops.h>
#include <linux/spinlock.h>
+#include <linux/kvm_xinterface.h>

#include <asm/processor.h>
#include <asm/io.h>
@@ -2098,3 +2099,26 @@ void kvm_exit(void)
__free_page(bad_page);
}
EXPORT_SYMBOL_GPL(kvm_exit);
+
+struct kvm_xinterface *
+kvm_xinterface_bind(int fd)
+{
+ struct kvm_xinterface *intf;
+ struct file *file;
+
+ file = fget(fd);
+ if (!file)
+ return ERR_PTR(-EBADF);
+
+ if (file->f_op != &kvm_vm_fops) {
+ fput(file);
+ return ERR_PTR(-EINVAL);
+ }
+
+ intf = kvm_xinterface_alloc(file->private_data, file->f_op->owner);
+
+ fput(file);
+
+ return intf;
+}
+EXPORT_SYMBOL_GPL(kvm_xinterface_bind);
diff --git a/virt/kvm/xinterface.c b/virt/kvm/xinterface.c
new file mode 100644
index 0000000..3b586c5
--- /dev/null
+++ b/virt/kvm/xinterface.c
@@ -0,0 +1,409 @@
+/*
+ * KVM module interface - Allows external modules to interface with a guest
+ *
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/highmem.h>
+#include <linux/module.h>
+#include <linux/mmu_context.h>
+#include <linux/kvm_host.h>
+#include <linux/kvm_xinterface.h>
+
+struct _xinterface {
+ struct kvm *kvm;
+ struct task_struct *task;
+ struct mm_struct *mm;
+ struct kvm_xinterface intf;
+ struct kvm_memory_slot *slotcache[NR_CPUS];
+};
+
+struct _xvmap {
+ struct kvm_memory_slot *memslot;
+ unsigned long npages;
+ struct kvm_xvmap vmap;
+};
+
+static struct _xinterface *
+to_intf(struct kvm_xinterface *intf)
+{
+ return container_of(intf, struct _xinterface, intf);
+}
+
+#define _gfn_to_hva(gfn, memslot) \
+ (memslot->userspace_addr + (gfn - memslot->base_gfn) * PAGE_SIZE)
+
+/*
+ * gpa_to_hva() - translate a guest-physical to host-virtual using
+ * a per-cpu cache of the memslot.
+ *
+ * The gfn_to_memslot() call is relatively expensive, and the gpa access
+ * patterns exhibit a high degree of locality. Therefore, lets cache
+ * the last slot used on a per-cpu basis to optimize the lookup
+ *
+ * assumes slots_lock held for read
+ */
+static unsigned long
+gpa_to_hva(struct _xinterface *_intf, unsigned long gpa)
+{
+ int cpu = get_cpu();
+ unsigned long gfn = gpa >> PAGE_SHIFT;
+ struct kvm_memory_slot *memslot = _intf->slotcache[cpu];
+ unsigned long addr = 0;
+
+ if (!memslot
+ || gfn < memslot->base_gfn
+ || gfn >= memslot->base_gfn + memslot->npages) {
+
+ memslot = gfn_to_memslot(_intf->kvm, gfn);
+ if (!memslot)
+ goto out;
+
+ _intf->slotcache[cpu] = memslot;
+ }
+
+ addr = _gfn_to_hva(gfn, memslot) + offset_in_page(gpa);
+
+out:
+ put_cpu();
+
+ return addr;
+}
+
+/*------------------------------------------------------------------------*/
+
+static void *
+_vmap(struct _xinterface *_intf, unsigned long addr, unsigned long offset,
+ unsigned long npages)
+{
+ struct task_struct *p = _intf->task;
+ struct mm_struct *mm = _intf->mm;
+ struct page **page_list;
+ void *ptr = NULL;
+ int ret;
+
+ if (npages > (PAGE_SIZE / sizeof(struct page *)))
+ return NULL;
+
+ page_list = (struct page **) __get_free_page(GFP_KERNEL);
+ if (!page_list)
+ return NULL;
+
+ down_write(&mm->mmap_sem);
+
+ ret = get_user_pages(p, mm, addr, npages, 1, 0, page_list, NULL);
+ if (ret < 0)
+ goto out;
+
+ ptr = vmap(page_list, npages, VM_MAP, PAGE_KERNEL);
+ if (ptr) {
+ mm->locked_vm += npages;
+ ptr = ptr + offset;
+ }
+
+out:
+ up_write(&mm->mmap_sem);
+
+ free_page((unsigned long)page_list);
+
+ return ptr;
+}
+
+static void
+_vunmap(struct _xinterface *_intf, void *addr, size_t npages)
+{
+ down_write(&_intf->mm->mmap_sem);
+
+ vunmap((void *)((unsigned long)addr & PAGE_MASK));
+ _intf->mm->locked_vm -= npages;
+
+ up_write(&_intf->mm->mmap_sem);
+}
+
+static void
+xvmap_release(struct kvm_xvmap *vmap)
+{
+ struct _xvmap *_xvmap = container_of(vmap, struct _xvmap, vmap);
+ struct _xinterface *_intf = to_intf(_xvmap->vmap.intf);
+
+ _vunmap(_intf, _xvmap->vmap.addr, _xvmap->npages);
+ kfree(_xvmap);
+}
+
+static const struct kvm_xvmap_ops _xvmap_ops = {
+ .release = xvmap_release,
+};
+
+/*------------------------------------------------------------------------*/
+
+/*
+ * This function is invoked in the cases where a process context other
+ * than _intf->mm tries to copy data. Otherwise, we use copy_to_user()
+ */
+static unsigned long
+_slow_copy_to_user(struct _xinterface *_intf, unsigned long dst,
+ const void *src, unsigned long n)
+{
+ struct task_struct *p = _intf->task;
+ struct mm_struct *mm = _intf->mm;
+
+ while (n) {
+ unsigned long offset = offset_in_page(dst);
+ unsigned long len = PAGE_SIZE - offset;
+ int ret;
+ struct page *pg;
+ void *maddr;
+
+ if (len > n)
+ len = n;
+
+ down_read(&mm->mmap_sem);
+ ret = get_user_pages(p, mm, dst, 1, 1, 0, &pg, NULL);
+
+ if (ret != 1) {
+ up_read(&mm->mmap_sem);
+ break;
+ }
+
+ maddr = kmap_atomic(pg, KM_USER0);
+ memcpy(maddr + offset, src, len);
+ kunmap_atomic(maddr, KM_USER0);
+ set_page_dirty_lock(pg);
+ put_page(pg);
+ up_read(&mm->mmap_sem);
+
+ src += len;
+ dst += len;
+ n -= len;
+ }
+
+ return n;
+}
+
+static unsigned long
+xinterface_copy_to(struct kvm_xinterface *intf, unsigned long gpa,
+ const void *src, unsigned long n)
+{
+ struct _xinterface *_intf = to_intf(intf);
+ unsigned long dst;
+ bool kthread = !current->mm;
+
+ down_read(&_intf->kvm->slots_lock);
+
+ dst = gpa_to_hva(_intf, gpa);
+ if (!dst)
+ goto out;
+
+ if (kthread)
+ use_mm(_intf->mm);
+
+ if (kthread || _intf->mm == current->mm)
+ n = copy_to_user((void *)dst, src, n);
+ else
+ n = _slow_copy_to_user(_intf, dst, src, n);
+
+ if (kthread)
+ unuse_mm(_intf->mm);
+
+out:
+ up_read(&_intf->kvm->slots_lock);
+
+ return n;
+}
+
+/*
+ * This function is invoked in the cases where a process context other
+ * than _intf->mm tries to copy data. Otherwise, we use copy_from_user()
+ */
+static unsigned long
+_slow_copy_from_user(struct _xinterface *_intf, void *dst,
+ unsigned long src, unsigned long n)
+{
+ struct task_struct *p = _intf->task;
+ struct mm_struct *mm = _intf->mm;
+
+ while (n) {
+ unsigned long offset = offset_in_page(src);
+ unsigned long len = PAGE_SIZE - offset;
+ int ret;
+ struct page *pg;
+ void *maddr;
+
+ if (len > n)
+ len = n;
+
+ down_read(&mm->mmap_sem);
+ ret = get_user_pages(p, mm, src, 1, 1, 0, &pg, NULL);
+
+ if (ret != 1) {
+ up_read(&mm->mmap_sem);
+ break;
+ }
+
+ maddr = kmap_atomic(pg, KM_USER0);
+ memcpy(dst, maddr + offset, len);
+ kunmap_atomic(maddr, KM_USER0);
+ put_page(pg);
+ up_read(&mm->mmap_sem);
+
+ src += len;
+ dst += len;
+ n -= len;
+ }
+
+ return n;
+}
+
+static unsigned long
+xinterface_copy_from(struct kvm_xinterface *intf, void *dst,
+ unsigned long gpa, unsigned long n)
+{
+ struct _xinterface *_intf = to_intf(intf);
+ unsigned long src;
+ bool kthread = !current->mm;
+
+ down_read(&_intf->kvm->slots_lock);
+
+ src = gpa_to_hva(_intf, gpa);
+ if (!src)
+ goto out;
+
+ if (kthread)
+ use_mm(_intf->mm);
+
+ if (kthread || _intf->mm == current->mm)
+ n = copy_from_user(dst, (void *)src, n);
+ else
+ n = _slow_copy_from_user(_intf, dst, src, n);
+
+ if (kthread)
+ unuse_mm(_intf->mm);
+
+out:
+ up_read(&_intf->kvm->slots_lock);
+
+ return n;
+}
+
+static struct kvm_xvmap *
+xinterface_vmap(struct kvm_xinterface *intf,
+ unsigned long gpa,
+ unsigned long len)
+{
+ struct _xinterface *_intf = to_intf(intf);
+ struct _xvmap *_xvmap;
+ struct kvm_memory_slot *memslot;
+ struct kvm *kvm = _intf->kvm;
+ int ret = -EINVAL;
+ void *addr = NULL;
+ off_t offset = offset_in_page(gpa);
+ unsigned long gfn = gpa >> PAGE_SHIFT;
+ unsigned long npages;
+
+ down_read(&kvm->slots_lock);
+
+ memslot = gfn_to_memslot(kvm, gfn);
+ if (!memslot)
+ goto fail;
+
+ /* Check if the request walks off the end of the slot */
+ if ((offset + len) > (memslot->npages << PAGE_SHIFT))
+ goto fail;
+
+ npages = PAGE_ALIGN(len + offset) >> PAGE_SHIFT;
+
+ addr = _vmap(_intf, _gfn_to_hva(gfn, memslot), offset, npages);
+ if (!addr) {
+ ret = -EFAULT;
+ goto fail;
+ }
+
+ _xvmap = kzalloc(sizeof(*_xvmap), GFP_KERNEL);
+ if (!_xvmap) {
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ _xvmap->memslot = memslot;
+ _xvmap->npages = npages;
+
+ kvm_xvmap_init(&_xvmap->vmap, &_xvmap_ops, intf);
+ _xvmap->vmap.addr = addr;
+ _xvmap->vmap.len = len;
+
+ up_read(&kvm->slots_lock);
+
+ return &_xvmap->vmap;
+
+fail:
+ if (addr)
+ _vunmap(_intf, addr, len);
+
+ up_read(&kvm->slots_lock);
+
+ return ERR_PTR(ret);
+}
+
+static void
+xinterface_release(struct kvm_xinterface *intf)
+{
+ struct _xinterface *_intf = to_intf(intf);
+
+ mmput(_intf->mm);
+ put_task_struct(_intf->task);
+ kvm_put_kvm(_intf->kvm);
+ kfree(_intf);
+}
+
+struct kvm_xinterface_ops _xinterface_ops = {
+ .copy_to = xinterface_copy_to,
+ .copy_from = xinterface_copy_from,
+ .vmap = xinterface_vmap,
+ .release = xinterface_release,
+};
+
+struct kvm_xinterface *
+kvm_xinterface_alloc(struct kvm *kvm, struct module *owner)
+{
+ struct _xinterface *_intf;
+ struct kvm_xinterface *intf;
+
+ _intf = kzalloc(sizeof(*_intf), GFP_KERNEL);
+ if (!_intf)
+ return ERR_PTR(-ENOMEM);
+
+ intf = &_intf->intf;
+
+ __module_get(owner);
+ intf->owner = owner;
+ kref_init(&intf->kref);
+ intf->ops = &_xinterface_ops;
+
+ kvm_get_kvm(kvm);
+ _intf->kvm = kvm;
+
+ _intf->task = current;
+ get_task_struct(_intf->task);
+
+ _intf->mm = get_task_mm(_intf->task);
+
+ return intf;
+}

2009-10-02 20:19:42

by Gregory Haskins

Subject: [PATCH v2 3/4] KVM: add io services to xinterface

We want to add a more efficient way to get PIO signals out of the guest,
so we add an "xioevent" interface. This allows a client to register
for notifications when a specific MMIO/PIO address is touched by
the guest. This is an alternative interface to ioeventfd, whose
performance is limited by io-bus scaling and its eventfd wait-queue
based notification mechanism. It also has the advantage of retaining
the full PIO data payload and passing it to the recipient.
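
As an illustrative sketch only (struct foo_dev, FOO_PIO_PORT, and
handle_guest_kick() are invented names), a client might register and later
tear down an xioevent like this:

    #include <linux/err.h>
    #include <linux/kvm_xinterface.h>

    #define FOO_PIO_PORT 0xc000   /* hypothetical PIO port the guest writes to */

    struct foo_dev {
            struct kvm_xinterface *intf;     /* from kvm_xinterface_bind() */
            struct kvm_xioevent   *ioevent;
    };

    /* invoked from the io-bus whenever the guest writes FOO_PIO_PORT */
    static void foo_signal(struct kvm_xioevent *ioevent, const void *val)
    {
            struct foo_dev *foo = ioevent->priv;
            u16 data = *(const u16 *)val;    /* full PIO payload, unlike ioeventfd */

            handle_guest_kick(foo, data);    /* hypothetical client-side handler */
    }

    static int foo_assign_ioevent(struct foo_dev *foo)
    {
            struct kvm_xioevent *ioevent;

            ioevent = foo->intf->ops->ioevent(foo->intf, FOO_PIO_PORT, 2,
                                              KVM_XIOEVENT_FLAG_PIO);
            if (IS_ERR(ioevent))
                    return PTR_ERR(ioevent);

            ioevent->signal = foo_signal;
            ioevent->priv   = foo;
            foo->ioevent    = ioevent;

            return 0;
    }

    static void foo_deassign_ioevent(struct foo_dev *foo)
    {
            kvm_xioevent_deassign(foo->ioevent);  /* unregisters from the io-bus */
    }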

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/kvm_xinterface.h | 47 ++++++++++++++++++
virt/kvm/xinterface.c | 106 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 153 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_xinterface.h b/include/linux/kvm_xinterface.h
index 01f092b..684b6f8 100644
--- a/include/linux/kvm_xinterface.h
+++ b/include/linux/kvm_xinterface.h
@@ -12,6 +12,16 @@

struct kvm_xinterface;
struct kvm_xvmap;
+struct kvm_xioevent;
+
+enum {
+ kvm_xioevent_flag_nr_pio,
+ kvm_xioevent_flag_nr_max,
+};
+
+#define KVM_XIOEVENT_FLAG_PIO (1 << kvm_xioevent_flag_nr_pio)
+
+#define KVM_XIOEVENT_VALID_FLAG_MASK ((1 << kvm_xioevent_flag_nr_max) - 1)

struct kvm_xinterface_ops {
unsigned long (*copy_to)(struct kvm_xinterface *intf,
@@ -22,6 +32,10 @@ struct kvm_xinterface_ops {
struct kvm_xvmap* (*vmap)(struct kvm_xinterface *intf,
unsigned long gpa,
unsigned long len);
+ struct kvm_xioevent* (*ioevent)(struct kvm_xinterface *intf,
+ u64 addr,
+ unsigned long len,
+ unsigned long flags);
void (*release)(struct kvm_xinterface *);
};

@@ -109,6 +123,39 @@ kvm_xvmap_put(struct kvm_xvmap *vmap)
kref_put(&vmap->kref, _kvm_xvmap_release);
}

+struct kvm_xioevent_ops {
+ void (*deassign)(struct kvm_xioevent *ioevent);
+};
+
+struct kvm_xioevent {
+ const struct kvm_xioevent_ops *ops;
+ struct kvm_xinterface *intf;
+ void (*signal)(struct kvm_xioevent *ioevent, const void *val);
+ void *priv;
+};
+
+static inline void
+kvm_xioevent_init(struct kvm_xioevent *ioevent,
+ const struct kvm_xioevent_ops *ops,
+ struct kvm_xinterface *intf)
+{
+ memset(ioevent, 0, sizeof(*ioevent));
+ ioevent->ops = ops;
+ ioevent->intf = intf;
+
+ kvm_xinterface_get(intf);
+}
+
+static inline void
+kvm_xioevent_deassign(struct kvm_xioevent *ioevent)
+{
+ struct kvm_xinterface *intf = ioevent->intf;
+ rmb();
+
+ ioevent->ops->deassign(ioevent);
+ kvm_xinterface_put(intf);
+}
+
struct kvm_xinterface *kvm_xinterface_bind(int fd);

#endif /* __KVM_XINTERFACE_H */
diff --git a/virt/kvm/xinterface.c b/virt/kvm/xinterface.c
index 3b586c5..c356835 100644
--- a/virt/kvm/xinterface.c
+++ b/virt/kvm/xinterface.c
@@ -28,6 +28,8 @@
#include <linux/kvm_host.h>
#include <linux/kvm_xinterface.h>

+#include "iodev.h"
+
struct _xinterface {
struct kvm *kvm;
struct task_struct *task;
@@ -42,6 +44,14 @@ struct _xvmap {
struct kvm_xvmap vmap;
};

+struct _ioevent {
+ u64 addr;
+ int length;
+ struct kvm_io_bus *bus;
+ struct kvm_io_device dev;
+ struct kvm_xioevent ioevent;
+};
+
static struct _xinterface *
to_intf(struct kvm_xinterface *intf)
{
@@ -362,6 +372,101 @@ fail:
return ERR_PTR(ret);
}

+/* MMIO/PIO writes trigger an event if the addr/val match */
+static int
+ioevent_write(struct kvm_io_device *dev, gpa_t addr, int len, const void *val)
+{
+ struct _ioevent *p = container_of(dev, struct _ioevent, dev);
+ struct kvm_xioevent *ioevent = &p->ioevent;
+
+ if (!(addr == p->addr && len == p->length))
+ return -EOPNOTSUPP;
+
+ if (!ioevent->signal)
+ return 0;
+
+ ioevent->signal(ioevent, val);
+ return 0;
+}
+
+static const struct kvm_io_device_ops ioevent_device_ops = {
+ .write = ioevent_write,
+};
+
+static void
+ioevent_deassign(struct kvm_xioevent *ioevent)
+{
+ struct _ioevent *p = container_of(ioevent, struct _ioevent, ioevent);
+ struct _xinterface *_intf = to_intf(ioevent->intf);
+ struct kvm *kvm = _intf->kvm;
+
+ kvm_io_bus_unregister_dev(kvm, p->bus, &p->dev);
+ kfree(p);
+}
+
+static const struct kvm_xioevent_ops ioevent_intf_ops = {
+ .deassign = ioevent_deassign,
+};
+
+static struct kvm_xioevent*
+xinterface_ioevent(struct kvm_xinterface *intf,
+ u64 addr,
+ unsigned long len,
+ unsigned long flags)
+{
+ struct _xinterface *_intf = to_intf(intf);
+ struct kvm *kvm = _intf->kvm;
+ int pio = flags & KVM_XIOEVENT_FLAG_PIO;
+ struct kvm_io_bus *bus = pio ? &kvm->pio_bus : &kvm->mmio_bus;
+ struct _ioevent *p;
+ int ret;
+
+ /* must be natural-word sized */
+ switch (len) {
+ case 1:
+ case 2:
+ case 4:
+ case 8:
+ break;
+ default:
+ return ERR_PTR(-EINVAL);
+ }
+
+ /* check for range overflow */
+ if (addr + len < addr)
+ return ERR_PTR(-EINVAL);
+
+ /* check for extra flags that we don't understand */
+ if (flags & ~KVM_XIOEVENT_VALID_FLAG_MASK)
+ return ERR_PTR(-EINVAL);
+
+ p = kzalloc(sizeof(*p), GFP_KERNEL);
+ if (!p) {
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ p->addr = addr;
+ p->length = len;
+ p->bus = bus;
+
+ kvm_iodevice_init(&p->dev, &ioevent_device_ops);
+
+ ret = kvm_io_bus_register_dev(kvm, bus, &p->dev);
+ if (ret < 0)
+ goto fail;
+
+ kvm_xioevent_init(&p->ioevent, &ioevent_intf_ops, intf);
+
+ return &p->ioevent;
+
+fail:
+ kfree(p);
+
+ return ERR_PTR(ret);
+
+}
+
static void
xinterface_release(struct kvm_xinterface *intf)
{
@@ -377,6 +482,7 @@ struct kvm_xinterface_ops _xinterface_ops = {
.copy_to = xinterface_copy_to,
.copy_from = xinterface_copy_from,
.vmap = xinterface_vmap,
+ .ioevent = xinterface_ioevent,
.release = xinterface_release,
};

2009-10-02 20:19:55

by Gregory Haskins

Subject: [PATCH v2 4/4] KVM: add scatterlist support to xinterface

This allows a scatter-gather approach to IO, which will be useful for
building high performance interfaces, like zero-copy and low-latency
copy (avoiding multiple calls to copy_to/from).

The interface is based on the existing scatterlist infrastructure. The
caller is expected to pass in a scatterlist with its "dma" field
populated with valid GPAs. The xinterface will then populate each
entry by translating the GPA to a page*.

The caller signifies completion by simply performing a put_page() on
each page returned in the list.
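
A usage sketch, assuming an already-bound intf and two guest buffers that
each fit within a single page (foo_map_buffers() and its arguments are
illustrative only, not part of this series):

    #include <linux/mm.h>
    #include <linux/scatterlist.h>
    #include <linux/kvm_xinterface.h>

    static int foo_map_buffers(struct kvm_xinterface *intf,
                               unsigned long gpa1, unsigned long len1,
                               unsigned long gpa2, unsigned long len2)
    {
            struct scatterlist sgl[2];
            unsigned long ret;
            int i;

            sg_init_table(sgl, 2);

            /* the caller populates the "dma" fields with GPAs... */
            sg_dma_address(&sgl[0]) = gpa1;
            sg_dma_len(&sgl[0])     = len1;
            sg_dma_address(&sgl[1]) = gpa2;
            sg_dma_len(&sgl[1])     = len2;

            /* ...and sgmap() fills in the corresponding page* for each entry */
            ret = intf->ops->sgmap(intf, sgl, 2, 0);
            if (ret)
                    return (int)ret;         /* negative errno from the op */

            /* pages may now be used directly, e.g. for zero-copy IO */

            /* completion: drop the references taken by sgmap() */
            for (i = 0; i < 2; i++)
                    put_page(sg_page(&sgl[i]));

            return 0;
    }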

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/kvm_xinterface.h | 4 ++
virt/kvm/xinterface.c | 72 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_xinterface.h b/include/linux/kvm_xinterface.h
index 684b6f8..eefb575 100644
--- a/include/linux/kvm_xinterface.h
+++ b/include/linux/kvm_xinterface.h
@@ -9,6 +9,7 @@
#include <linux/kref.h>
#include <linux/module.h>
#include <linux/file.h>
+#include <linux/scatterlist.h>

struct kvm_xinterface;
struct kvm_xvmap;
@@ -36,6 +37,9 @@ struct kvm_xinterface_ops {
u64 addr,
unsigned long len,
unsigned long flags);
+ unsigned long (*sgmap)(struct kvm_xinterface *intf,
+ struct scatterlist *sgl, int nents,
+ unsigned long flags);
void (*release)(struct kvm_xinterface *);
};

diff --git a/virt/kvm/xinterface.c b/virt/kvm/xinterface.c
index c356835..16729f6 100644
--- a/virt/kvm/xinterface.c
+++ b/virt/kvm/xinterface.c
@@ -467,6 +467,77 @@ fail:

}

+static unsigned long
+xinterface_sgmap(struct kvm_xinterface *intf,
+ struct scatterlist *sgl, int nents,
+ unsigned long flags)
+{
+ struct _xinterface *_intf = to_intf(intf);
+ struct task_struct *p = _intf->task;
+ struct mm_struct *mm = _intf->mm;
+ struct kvm *kvm = _intf->kvm;
+ struct kvm_memory_slot *memslot = NULL;
+ bool kthread = !current->mm;
+ int ret;
+ struct scatterlist *sg;
+ int i;
+
+ down_read(&kvm->slots_lock);
+
+ if (kthread)
+ use_mm(_intf->mm);
+
+ for_each_sg(sgl, sg, nents, i) {
+ unsigned long gpa = sg_dma_address(sg);
+ unsigned long len = sg_dma_len(sg);
+ unsigned long gfn = gpa >> PAGE_SHIFT;
+ off_t offset = offset_in_page(gpa);
+ unsigned long hva;
+ struct page *pg;
+
+ /* ensure that we do not have more than one page per entry */
+ if ((PAGE_ALIGN(len + offset) >> PAGE_SHIFT) != 1) {
+ ret = -EINVAL;
+ break;
+ }
+
+ /* check for a memslot-cache miss */
+ if (!memslot
+ || gfn < memslot->base_gfn
+ || gfn >= memslot->base_gfn + memslot->npages) {
+ memslot = gfn_to_memslot(kvm, gfn);
+ if (!memslot) {
+ ret = -EFAULT;
+ break;
+ }
+ }
+
+ hva = (memslot->userspace_addr +
+ (gfn - memslot->base_gfn) * PAGE_SIZE);
+
+ if (kthread || current->mm == mm)
+ ret = get_user_pages_fast(hva, 1, 1, &pg);
+ else
+ ret = get_user_pages(p, mm, hva, 1, 1, 0, &pg, NULL);
+
+ if (ret != 1) {
+ if (ret >= 0)
+ ret = -EFAULT;
+ break;
+ }
+
+ sg_set_page(sg, pg, len, offset);
+ ret = 0;
+ }
+
+ if (kthread)
+ unuse_mm(_intf->mm);
+
+ up_read(&kvm->slots_lock);
+
+ return ret;
+}
+
static void
xinterface_release(struct kvm_xinterface *intf)
{
@@ -483,6 +554,7 @@ struct kvm_xinterface_ops _xinterface_ops = {
.copy_from = xinterface_copy_from,
.vmap = xinterface_vmap,
.ioevent = xinterface_ioevent,
+ .sgmap = xinterface_sgmap,
.release = xinterface_release,
};

2009-10-03 20:06:11

by Marcelo Tosatti

Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

On Fri, Oct 02, 2009 at 04:19:27PM -0400, Gregory Haskins wrote:
> What: xinterface is a mechanism that allows kernel modules external to
> the kvm.ko proper to interface with a running guest. It accomplishes
> this by creating an abstracted interface which does not expose any
> private details of the guest or its related KVM structures, and provides
> a mechanism to find and bind to this interface at run-time.
>
> Why: There are various subsystems that would like to interact with a KVM
> guest which are ideally suited to exist outside the domain of the kvm.ko
> core logic. For instance, external pci-passthrough, virtual-bus, and
> virtio-net modules are currently under development. In order for these
> modules to successfully interact with the guest, they need, at the very
> least, various interfaces for signaling IO events, pointer translation,
> and possibly memory mapping.
>
> The signaling case is covered by the recent introduction of the
> irqfd/ioeventfd mechanisms. This patch provides a mechanism to cover the
> other cases. Note that today we only expose pointer-translation related
> functions, but more could be added at a future date as needs arise.
>
> Example usage: QEMU instantiates a guest, and an external module "foo"
> that desires the ability to interface with the guest (say via
> open("/dev/foo")). QEMU may then pass the kvmfd to foo via an
> ioctl, such as: ioctl(foofd, FOO_SET_VMID, &kvmfd). Upon receipt, the
> foo module can issue kvm_xinterface_bind(kvmfd) to acquire
> the proper context. Internally, the struct kvm* and associated
> struct module* will remain pinned at least until the foo module calls
> kvm_xinterface_put().

> --- /dev/null
> +++ b/virt/kvm/xinterface.c
> @@ -0,0 +1,409 @@
> +/*
> + * KVM module interface - Allows external modules to interface with a guest
> + *
> + * Copyright 2009 Novell. All Rights Reserved.
> + *
> + * Author:
> + * Gregory Haskins <[email protected]>
> + *
> + * This file is free software; you can redistribute it and/or modify
> + * it under the terms of version 2 of the GNU General Public License
> + * as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software Foundation,
> + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/highmem.h>
> +#include <linux/module.h>
> +#include <linux/mmu_context.h>
> +#include <linux/kvm_host.h>
> +#include <linux/kvm_xinterface.h>
> +
> +struct _xinterface {
> + struct kvm *kvm;
> + struct task_struct *task;
> + struct mm_struct *mm;
> + struct kvm_xinterface intf;
> + struct kvm_memory_slot *slotcache[NR_CPUS];
> +};
> +
> +struct _xvmap {
> + struct kvm_memory_slot *memslot;
> + unsigned long npages;
> + struct kvm_xvmap vmap;
> +};
> +
> +static struct _xinterface *
> +to_intf(struct kvm_xinterface *intf)
> +{
> + return container_of(intf, struct _xinterface, intf);
> +}
> +
> +#define _gfn_to_hva(gfn, memslot) \
> + (memslot->userspace_addr + (gfn - memslot->base_gfn) * PAGE_SIZE)
> +
> +/*
> + * gpa_to_hva() - translate a guest-physical to host-virtual using
> + * a per-cpu cache of the memslot.
> + *
> + * The gfn_to_memslot() call is relatively expensive, and the gpa access
> + * patterns exhibit a high degree of locality. Therefore, lets cache
> + * the last slot used on a per-cpu basis to optimize the lookup
> + *
> + * assumes slots_lock held for read
> + */
> +static unsigned long
> +gpa_to_hva(struct _xinterface *_intf, unsigned long gpa)
> +{
> + int cpu = get_cpu();
> + unsigned long gfn = gpa >> PAGE_SHIFT;
> + struct kvm_memory_slot *memslot = _intf->slotcache[cpu];
> + unsigned long addr = 0;
> +
> + if (!memslot
> + || gfn < memslot->base_gfn
> + || gfn >= memslot->base_gfn + memslot->npages) {
> +
> + memslot = gfn_to_memslot(_intf->kvm, gfn);
> + if (!memslot)
> + goto out;
> +
> + _intf->slotcache[cpu] = memslot;
> + }
> +
> + addr = _gfn_to_hva(gfn, memslot) + offset_in_page(gpa);
> +
> +out:
> + put_cpu();
> +
> + return addr;

Please optimize gfn_to_memslot() instead, so everybody benefits. It
shows very often on profiles.

> +
> + page_list = (struct page **) __get_free_page(GFP_KERNEL);
> + if (!page_list)
> + return NULL;
> +
> + down_write(&mm->mmap_sem);
> +
> + ret = get_user_pages(p, mm, addr, npages, 1, 0, page_list, NULL);
> + if (ret < 0)
> + goto out;
> +
> + ptr = vmap(page_list, npages, VM_MAP, PAGE_KERNEL);
> + if (ptr)
> + mm->locked_vm += npages;

Why don't you use gfn_to_page (here and elsewhere in the patch).

2009-10-04 10:25:49

by Avi Kivity

Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

On 10/02/2009 10:19 PM, Gregory Haskins wrote:
> What: xinterface is a mechanism that allows kernel modules external to
> the kvm.ko proper to interface with a running guest. It accomplishes
> this by creating an abstracted interface which does not expose any
> private details of the guest or its related KVM structures, and provides
> a mechanism to find and bind to this interface at run-time.
>

If this is needed, it should be done as a virt_address_space to which
kvm and other modules bind, instead of as something that kvm exports and
other modules import. The virt_address_space can be identified by an fd
and passed around to kvm and other modules.

> Why: There are various subsystems that would like to interact with a KVM
> guest which are ideally suited to exist outside the domain of the kvm.ko
> core logic. For instance, external pci-passthrough, virtual-bus, and
> virtio-net modules are currently under development. In order for these
> modules to successfully interact with the guest, they need, at the very
> least, various interfaces for signaling IO events, pointer translation,
> and possibly memory mapping.
>
> The signaling case is covered by the recent introduction of the
> irqfd/ioeventfd mechanisms. This patch provides a mechanism to cover the
> other cases. Note that today we only expose pointer-translation related
> functions, but more could be added at a future date as needs arise.
>
> Example usage: QEMU instantiates a guest, and an external module "foo"
> that desires the ability to interface with the guest (say via
> open("/dev/foo")). QEMU may then pass the kvmfd to foo via an
> ioctl, such as: ioctl(foofd, FOO_SET_VMID,&kvmfd). Upon receipt, the
> foo module can issue kvm_xinterface_bind(kvmfd) to acquire
> the proper context. Internally, the struct kvm* and associated
> struct module* will remain pinned at least until the foo module calls
> kvm_xinterface_put().
>
>

So, under my suggestion above, you'd call
sys_create_virt_address_space(), populate it, and pass the result to kvm
and to foo. This allows the use of virt_address_space without kvm and
doesn't require foo to interact with kvm.

> +struct kvm_xinterface_ops {
> + unsigned long (*copy_to)(struct kvm_xinterface *intf,
> + unsigned long gpa, const void *src,
> + unsigned long len);
> + unsigned long (*copy_from)(struct kvm_xinterface *intf, void *dst,
> + unsigned long gpa, unsigned long len);
> + struct kvm_xvmap* (*vmap)(struct kvm_xinterface *intf,
> + unsigned long gpa,
> + unsigned long len);
>

How would vmap() work with live migration?

> +
> +static inline void
> +_kvm_xinterface_release(struct kref *kref)
> +{
> + struct kvm_xinterface *intf;
> + struct module *owner;
> +
> + intf = container_of(kref, struct kvm_xinterface, kref);
> +
> + owner = intf->owner;
> + rmb();
>

Why rmb?

> +
> + intf->ops->release(intf);
> + module_put(owner);
> +}
> +
>
> +
> +/*
> + * gpa_to_hva() - translate a guest-physical to host-virtual using
> + * a per-cpu cache of the memslot.
> + *
> + * The gfn_to_memslot() call is relatively expensive, and the gpa access
> + * patterns exhibit a high degree of locality. Therefore, lets cache
> + * the last slot used on a per-cpu basis to optimize the lookup
> + *
> + * assumes slots_lock held for read
> + */
> +static unsigned long
> +gpa_to_hva(struct _xinterface *_intf, unsigned long gpa)
> +{
> + int cpu = get_cpu();
> + unsigned long gfn = gpa>> PAGE_SHIFT;
> + struct kvm_memory_slot *memslot = _intf->slotcache[cpu];
> + unsigned long addr = 0;
> +
> + if (!memslot
> + || gfn< memslot->base_gfn
> + || gfn>= memslot->base_gfn + memslot->npages) {
> +
> + memslot = gfn_to_memslot(_intf->kvm, gfn);
> + if (!memslot)
> + goto out;
> +
> + _intf->slotcache[cpu] = memslot;
> + }
> +
> + addr = _gfn_to_hva(gfn, memslot) + offset_in_page(gpa);
> +
> +out:
> + put_cpu();
> +
> + return addr;
> +}
>


A simple per-vcpu cache (in struct kvm_vcpu) is likely to give better
results.

> +static unsigned long
> +xinterface_copy_to(struct kvm_xinterface *intf, unsigned long gpa,
> + const void *src, unsigned long n)
> +{
> + struct _xinterface *_intf = to_intf(intf);
> + unsigned long dst;
> + bool kthread = !current->mm;
> +
> + down_read(&_intf->kvm->slots_lock);
> +
> + dst = gpa_to_hva(_intf, gpa);
> + if (!dst)
> + goto out;
> +
> + if (kthread)
> + use_mm(_intf->mm);
> +
> + if (kthread || _intf->mm == current->mm)
> + n = copy_to_user((void *)dst, src, n);
> + else
> + n = _slow_copy_to_user(_intf, dst, src, n);
>

Can't you switch the mm temporarily instead of this?


--
error compiling committee.c: too many arguments to function

2009-10-04 10:26:56

by Avi Kivity

Subject: Re: [PATCH v2 3/4] KVM: add io services to xinterface

On 10/02/2009 10:19 PM, Gregory Haskins wrote:
> We want to add a more efficient way to get PIO signals out of the guest,
> so we add an "xioevent" interface. This allows a client to register
> for notifications when a specific MMIO/PIO address is touched by
> the guest. This is an alternative interface to ioeventfd, which is
> performance limited by io-bus scaling and eventfd wait-queue based
> notification mechanism. This also has the advantage of retaining
> the full PIO data payload and passing it to the recipient.
>
>

Can you detail the problems with io-bus scaling and eventfd
wait-queues? Maybe we should fix these instead.

--
error compiling committee.c: too many arguments to function

2009-10-04 10:28:46

by Avi Kivity

Subject: Re: [PATCH v2 4/4] KVM: add scatterlist support to xinterface

On 10/02/2009 10:19 PM, Gregory Haskins wrote:
> This allows a scatter-gather approach to IO, which will be useful for
> building high performance interfaces, like zero-copy and low-latency
> copy (avoiding multiple calls to copy_to/from).
>
> The interface is based on the existing scatterlist infrastructure. The
> caller is expected to pass in a scatterlist with its "dma" field
> populated with valid GPAs. The xinterface will then populate each
> entry by translating the GPA to a page*.
>
> The caller signifies completion by simply performing a put_page() on
> each page returned in the list.
>
> Signed-off-by: Gregory Haskins<[email protected]>
> ---
>
> include/linux/kvm_xinterface.h | 4 ++
> virt/kvm/xinterface.c | 72 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 76 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/kvm_xinterface.h b/include/linux/kvm_xinterface.h
> index 684b6f8..eefb575 100644
> --- a/include/linux/kvm_xinterface.h
> +++ b/include/linux/kvm_xinterface.h
> @@ -9,6 +9,7 @@
> #include<linux/kref.h>
> #include<linux/module.h>
> #include<linux/file.h>
> +#include<linux/scatterlist.h>
>
> struct kvm_xinterface;
> struct kvm_xvmap;
> @@ -36,6 +37,9 @@ struct kvm_xinterface_ops {
> u64 addr,
> unsigned long len,
> unsigned long flags);
> + unsigned long (*sgmap)(struct kvm_xinterface *intf,
> + struct scatterlist *sgl, int nents,
> + unsigned long flags);
> void (*release)(struct kvm_xinterface *);
> };
>
> diff --git a/virt/kvm/xinterface.c b/virt/kvm/xinterface.c
> index c356835..16729f6 100644
> --- a/virt/kvm/xinterface.c
> +++ b/virt/kvm/xinterface.c
> @@ -467,6 +467,77 @@ fail:
>
> }
>
> +static unsigned long
> +xinterface_sgmap(struct kvm_xinterface *intf,
> + struct scatterlist *sgl, int nents,
> + unsigned long flags)
> +{
> + struct _xinterface *_intf = to_intf(intf);
> + struct task_struct *p = _intf->task;
> + struct mm_struct *mm = _intf->mm;
> + struct kvm *kvm = _intf->kvm;
> + struct kvm_memory_slot *memslot = NULL;
> + bool kthread = !current->mm;
> + int ret;
> + struct scatterlist *sg;
> + int i;
> +
> + down_read(&kvm->slots_lock);
> +
> + if (kthread)
> + use_mm(_intf->mm);
> +
> + for_each_sg(sgl, sg, nents, i) {
> + unsigned long gpa = sg_dma_address(sg);
> + unsigned long len = sg_dma_len(sg);
> + unsigned long gfn = gpa>> PAGE_SHIFT;
> + off_t offset = offset_in_page(gpa);
> + unsigned long hva;
> + struct page *pg;
> +
> + /* ensure that we do not have more than one page per entry */
> + if ((PAGE_ALIGN(len + offset)>> PAGE_SHIFT) != 1) {
> + ret = -EINVAL;
> + break;
> + }
> +
> + /* check for a memslot-cache miss */
> + if (!memslot
> + || gfn< memslot->base_gfn
> + || gfn>= memslot->base_gfn + memslot->npages) {
> + memslot = gfn_to_memslot(kvm, gfn);
> + if (!memslot) {
> + ret = -EFAULT;
> + break;
> + }
> + }
> +
> + hva = (memslot->userspace_addr +
> + (gfn - memslot->base_gfn) * PAGE_SIZE);
> +
> + if (kthread || current->mm == mm)
> + ret = get_user_pages_fast(hva, 1, 1,&pg);
> + else
> + ret = get_user_pages(p, mm, hva, 1, 1, 0,&pg, NULL);
>

One of these needs the mm semaphore.

> +
> + if (ret != 1) {
> + if (ret>= 0)
> + ret = -EFAULT;
> + break;
> + }
> +
> + sg_set_page(sg, pg, len, offset);
> + ret = 0;
> + }
> +
> + if (kthread)
> + unuse_mm(_intf->mm);
> +
> + up_read(&kvm->slots_lock);
> +
> + return ret;
> +}
> +
> static void
> xinterface_release(struct kvm_xinterface *intf)
> {
> @@ -483,6 +554,7 @@ struct kvm_xinterface_ops _xinterface_ops = {
> .copy_from = xinterface_copy_from,
> .vmap = xinterface_vmap,
> .ioevent = xinterface_ioevent,
> + .sgmap = xinterface_sgmap,
> .release = xinterface_release,
> };
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


--
error compiling committee.c: too many arguments to function

2009-10-05 23:35:09

by Gregory Haskins

Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

Hi Marcelo!

Marcelo Tosatti wrote:
> On Fri, Oct 02, 2009 at 04:19:27PM -0400, Gregory Haskins wrote:
>> What: xinterface is a mechanism that allows kernel modules external to
>> the kvm.ko proper to interface with a running guest. It accomplishes
>> this by creating an abstracted interface which does not expose any
>> private details of the guest or its related KVM structures, and provides
>> a mechanism to find and bind to this interface at run-time.
>>
>> Why: There are various subsystems that would like to interact with a KVM
>> guest which are ideally suited to exist outside the domain of the kvm.ko
>> core logic. For instance, external pci-passthrough, virtual-bus, and
>> virtio-net modules are currently under development. In order for these
>> modules to successfully interact with the guest, they need, at the very
>> least, various interfaces for signaling IO events, pointer translation,
>> and possibly memory mapping.
>>
>> The signaling case is covered by the recent introduction of the
>> irqfd/ioeventfd mechanisms. This patch provides a mechanism to cover the
>> other cases. Note that today we only expose pointer-translation related
>> functions, but more could be added at a future date as needs arise.
>>
>> Example usage: QEMU instantiates a guest, and an external module "foo"
>> that desires the ability to interface with the guest (say via
>> open("/dev/foo")). QEMU may then pass the kvmfd to foo via an
>> ioctl, such as: ioctl(foofd, FOO_SET_VMID, &kvmfd). Upon receipt, the
>> foo module can issue kvm_xinterface_bind(kvmfd) to acquire
>> the proper context. Internally, the struct kvm* and associated
>> struct module* will remain pinned at least until the foo module calls
>> kvm_xinterface_put().
>
>> --- /dev/null
>> +++ b/virt/kvm/xinterface.c
>> @@ -0,0 +1,409 @@
>> +/*
>> + * KVM module interface - Allows external modules to interface with a guest
>> + *
>> + * Copyright 2009 Novell. All Rights Reserved.
>> + *
>> + * Author:
>> + * Gregory Haskins <[email protected]>
>> + *
>> + * This file is free software; you can redistribute it and/or modify
>> + * it under the terms of version 2 of the GNU General Public License
>> + * as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, write to the Free Software Foundation,
>> + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/vmalloc.h>
>> +#include <linux/highmem.h>
>> +#include <linux/module.h>
>> +#include <linux/mmu_context.h>
>> +#include <linux/kvm_host.h>
>> +#include <linux/kvm_xinterface.h>
>> +
>> +struct _xinterface {
>> + struct kvm *kvm;
>> + struct task_struct *task;
>> + struct mm_struct *mm;
>> + struct kvm_xinterface intf;
>> + struct kvm_memory_slot *slotcache[NR_CPUS];
>> +};
>> +
>> +struct _xvmap {
>> + struct kvm_memory_slot *memslot;
>> + unsigned long npages;
>> + struct kvm_xvmap vmap;
>> +};
>> +
>> +static struct _xinterface *
>> +to_intf(struct kvm_xinterface *intf)
>> +{
>> + return container_of(intf, struct _xinterface, intf);
>> +}
>> +
>> +#define _gfn_to_hva(gfn, memslot) \
>> + (memslot->userspace_addr + (gfn - memslot->base_gfn) * PAGE_SIZE)
>> +
>> +/*
>> + * gpa_to_hva() - translate a guest-physical to host-virtual using
>> + * a per-cpu cache of the memslot.
>> + *
>> + * The gfn_to_memslot() call is relatively expensive, and the gpa access
>> + * patterns exhibit a high degree of locality. Therefore, lets cache
>> + * the last slot used on a per-cpu basis to optimize the lookup
>> + *
>> + * assumes slots_lock held for read
>> + */
>> +static unsigned long
>> +gpa_to_hva(struct _xinterface *_intf, unsigned long gpa)
>> +{
>> + int cpu = get_cpu();
>> + unsigned long gfn = gpa >> PAGE_SHIFT;
>> + struct kvm_memory_slot *memslot = _intf->slotcache[cpu];
>> + unsigned long addr = 0;
>> +
>> + if (!memslot
>> + || gfn < memslot->base_gfn
>> + || gfn >= memslot->base_gfn + memslot->npages) {
>> +
>> + memslot = gfn_to_memslot(_intf->kvm, gfn);
>> + if (!memslot)
>> + goto out;
>> +
>> + _intf->slotcache[cpu] = memslot;
>> + }
>> +
>> + addr = _gfn_to_hva(gfn, memslot) + offset_in_page(gpa);
>> +
>> +out:
>> + put_cpu();
>> +
>> + return addr;
>
> Please optimize gfn_to_memslot() instead, so everybody benefits. It
> shows very often on profiles.

Yeah, it's not a bad idea. The reason I did it here is that the
requirements for sync (kvm-vcpu) vs async (xinterface) access are
slightly different. Sync is probably optimal with per-vcpu caching,
whereas async is optimal with per-cpu.

That said, we could probably build the entire algorithm to be per-cpu as
a compromise and still gain benefits. Perhaps I will split this out as
a separate patch for v3.

>
>> +
>> + page_list = (struct page **) __get_free_page(GFP_KERNEL);
>> + if (!page_list)
>> + return NULL;
>> +
>> + down_write(&mm->mmap_sem);
>> +
>> + ret = get_user_pages(p, mm, addr, npages, 1, 0, page_list, NULL);
>> + if (ret < 0)
>> + goto out;
>> +
>> + ptr = vmap(page_list, npages, VM_MAP, PAGE_KERNEL);
>> + if (ptr)
>> + mm->locked_vm += npages;
>
> Why don't you use gfn_to_page (here and elsewhere in the patch).

Primarily ignorance, I suspect ;)

The truth is I ported this from one of our other connectors, which was
more userspace-oriented, so gup() (get_user_pages()) made sense and
gtp() (gfn_to_page()) was not an option. That said, it probably doesn't
matter much in the vmap case, because that is slow-path. However, I will
definitely look to change over to the gtp() variant, especially if it
affects any fast-path code.

Thanks Marcelo,
-Greg



2009-10-05 23:57:53

by Gregory Haskins

Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

Avi Kivity wrote:
> On 10/02/2009 10:19 PM, Gregory Haskins wrote:
>> What: xinterface is a mechanism that allows kernel modules external to
>> the kvm.ko proper to interface with a running guest. It accomplishes
>> this by creating an abstracted interface which does not expose any
>> private details of the guest or its related KVM structures, and provides
>> a mechanism to find and bind to this interface at run-time.
>>
>
> If this is needed, it should be done as a virt_address_space to which
> kvm and other modules bind, instead of as something that kvm exports and
> other modules import. The virt_address_space can be identified by an fd
> and passed around to kvm and other modules.

IIUC, what you are proposing is something similar to generalizing the
vbus::memctx object. I had considered doing something like that in the
early design phase of vbus, but then decided it would be a hard-sell to
the mm crowd, and difficult to generalize.

What do you propose as the interface to program the object?

>
>> Why: There are various subsystems that would like to interact with a KVM
>> guest which are ideally suited to exist outside the domain of the kvm.ko
>> core logic. For instance, external pci-passthrough, virtual-bus, and
>> virtio-net modules are currently under development. In order for these
>> modules to successfully interact with the guest, they need, at the very
>> least, various interfaces for signaling IO events, pointer translation,
>> and possibly memory mapping.
>>
>> The signaling case is covered by the recent introduction of the
>> irqfd/ioeventfd mechanisms. This patch provides a mechanism to cover the
>> other cases. Note that today we only expose pointer-translation related
>> functions, but more could be added at a future date as needs arise.
>>
>> Example usage: QEMU instantiates a guest, and an external module "foo"
>> that desires the ability to interface with the guest (say via
>> open("/dev/foo")). QEMU may then pass the kvmfd to foo via an
>> ioctl, such as: ioctl(foofd, FOO_SET_VMID,&kvmfd). Upon receipt, the
>> foo module can issue kvm_xinterface_bind(kvmfd) to acquire
>> the proper context. Internally, the struct kvm* and associated
>> struct module* will remain pinned at least until the foo module calls
>> kvm_xinterface_put().
>>
>>
>
> So, under my suggestion above, you'd call
> sys_create_virt_address_space(), populate it, and pass the result to kvm
> and to foo. This allows the use of virt_address_space without kvm and
> doesn't require foo to interact with kvm.

The problem I see here is that the only way I can think of to implement
this generally is something that looks very kvm-esque (slots-to-pages
kind of translation). Is there a way you can think of that does not
involve a kvm.ko-originated vtable and is also not kvm-centric?

>
>> +struct kvm_xinterface_ops {
>> + unsigned long (*copy_to)(struct kvm_xinterface *intf,
>> + unsigned long gpa, const void *src,
>> + unsigned long len);
>> + unsigned long (*copy_from)(struct kvm_xinterface *intf, void *dst,
>> + unsigned long gpa, unsigned long len);
>> + struct kvm_xvmap* (*vmap)(struct kvm_xinterface *intf,
>> + unsigned long gpa,
>> + unsigned long len);
>>
>
> How would vmap() work with live migration?

vmap represents shmem regions, and is a per-guest-instance resource. So
my plan there is that the new and old guest instances would each have the
vmap region established at the same GPA location (assumption: GPAs are
stable across migration), and any state-relevant data local to the shmem
(like ring head/tail position) is conveyed in the serialized stream for
the device model.

>
>> +
>> +static inline void
>> +_kvm_xinterface_release(struct kref *kref)
>> +{
>> + struct kvm_xinterface *intf;
>> + struct module *owner;
>> +
>> + intf = container_of(kref, struct kvm_xinterface, kref);
>> +
>> + owner = intf->owner;
>> + rmb();
>>
>
> Why rmb?

the intf->ops->release() line may invalidate the intf pointer, so we
want to ensure that the read of intf->owner completes before release()
is called.

TBH: I'm not 100% sure it's needed, but I was being conservative.

>
>> +
>> + intf->ops->release(intf);
>> + module_put(owner);
>> +}
>> +
>>
>> +
>> +/*
>> + * gpa_to_hva() - translate a guest-physical to host-virtual using
>> + * a per-cpu cache of the memslot.
>> + *
>> + * The gfn_to_memslot() call is relatively expensive, and the gpa access
>> + * patterns exhibit a high degree of locality. Therefore, lets cache
>> + * the last slot used on a per-cpu basis to optimize the lookup
>> + *
>> + * assumes slots_lock held for read
>> + */
>> +static unsigned long
>> +gpa_to_hva(struct _xinterface *_intf, unsigned long gpa)
>> +{
>> + int cpu = get_cpu();
>> + unsigned long gfn = gpa>> PAGE_SHIFT;
>> + struct kvm_memory_slot *memslot = _intf->slotcache[cpu];
>> + unsigned long addr = 0;
>> +
>> + if (!memslot
>> + || gfn< memslot->base_gfn
>> + || gfn>= memslot->base_gfn + memslot->npages) {
>> +
>> + memslot = gfn_to_memslot(_intf->kvm, gfn);
>> + if (!memslot)
>> + goto out;
>> +
>> + _intf->slotcache[cpu] = memslot;
>> + }
>> +
>> + addr = _gfn_to_hva(gfn, memslot) + offset_in_page(gpa);
>> +
>> +out:
>> + put_cpu();
>> +
>> + return addr;
>> +}
>>
>
>
> A simple per-vcpu cache (in struct kvm_vcpu) is likely to give better
> results.

Per-vcpu will not work well here, unfortunately, since this is an
external interface mechanism. The callers will generally be in a
kthread or some other non-vcpu-related context. Even if we could figure
out a vcpu to use as a basis, we would require some kind of
heavier-weight synchronization, which would not be as desirable.

Therefore, I opted to go per-cpu and use the presumably lighter-weight
get_cpu()/put_cpu() instead.


>
>> +static unsigned long
>> +xinterface_copy_to(struct kvm_xinterface *intf, unsigned long gpa,
>> + const void *src, unsigned long n)
>> +{
>> + struct _xinterface *_intf = to_intf(intf);
>> + unsigned long dst;
>> + bool kthread = !current->mm;
>> +
>> + down_read(&_intf->kvm->slots_lock);
>> +
>> + dst = gpa_to_hva(_intf, gpa);
>> + if (!dst)
>> + goto out;
>> +
>> + if (kthread)
>> + use_mm(_intf->mm);
>> +
>> + if (kthread || _intf->mm == current->mm)
>> + n = copy_to_user((void *)dst, src, n);
>> + else
>> + n = _slow_copy_to_user(_intf, dst, src, n);
>>
>
> Can't you switch the mm temporarily instead of this?

That's actually what I do for the fast path (use_mm() does a switch_to()
internally).

The slow path is only there for completeness, for when switching is not
possible (such as being called with an mm already active, i.e. process
context). In practice, however, this doesn't happen. Virtually 100% of
the calls in vbus hit the fast path here, and I suspect most xinterface
clients would find the same conditions as well.

Thanks Avi,
-Greg



2009-10-05 23:58:59

by Gregory Haskins

Subject: Re: [PATCH v2 4/4] KVM: add scatterlist support to xinterface

Avi Kivity wrote:
> On 10/02/2009 10:19 PM, Gregory Haskins wrote:
>> This allows a scatter-gather approach to IO, which will be useful for
>> building high performance interfaces, like zero-copy and low-latency
>> copy (avoiding multiple calls to copy_to/from).
>>
>> The interface is based on the existing scatterlist infrastructure. The
>> caller is expected to pass in a scatterlist with its "dma" field
>> populated with valid GPAs. The xinterface will then populate each
>> entry by translating the GPA to a page*.
>>
>> The caller signifies completion by simply performing a put_page() on
>> each page returned in the list.
>>
>> Signed-off-by: Gregory Haskins<[email protected]>
>> ---
>>
>> include/linux/kvm_xinterface.h | 4 ++
>> virt/kvm/xinterface.c | 72
>> ++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 76 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/kvm_xinterface.h
>> b/include/linux/kvm_xinterface.h
>> index 684b6f8..eefb575 100644
>> --- a/include/linux/kvm_xinterface.h
>> +++ b/include/linux/kvm_xinterface.h
>> @@ -9,6 +9,7 @@
>> #include<linux/kref.h>
>> #include<linux/module.h>
>> #include<linux/file.h>
>> +#include<linux/scatterlist.h>
>>
>> struct kvm_xinterface;
>> struct kvm_xvmap;
>> @@ -36,6 +37,9 @@ struct kvm_xinterface_ops {
>> u64 addr,
>> unsigned long len,
>> unsigned long flags);
>> + unsigned long (*sgmap)(struct kvm_xinterface *intf,
>> + struct scatterlist *sgl, int nents,
>> + unsigned long flags);
>> void (*release)(struct kvm_xinterface *);
>> };
>>
>> diff --git a/virt/kvm/xinterface.c b/virt/kvm/xinterface.c
>> index c356835..16729f6 100644
>> --- a/virt/kvm/xinterface.c
>> +++ b/virt/kvm/xinterface.c
>> @@ -467,6 +467,77 @@ fail:
>>
>> }
>>
>> +static unsigned long
>> +xinterface_sgmap(struct kvm_xinterface *intf,
>> + struct scatterlist *sgl, int nents,
>> + unsigned long flags)
>> +{
>> + struct _xinterface *_intf = to_intf(intf);
>> + struct task_struct *p = _intf->task;
>> + struct mm_struct *mm = _intf->mm;
>> + struct kvm *kvm = _intf->kvm;
>> + struct kvm_memory_slot *memslot = NULL;
>> + bool kthread = !current->mm;
>> + int ret;
>> + struct scatterlist *sg;
>> + int i;
>> +
>> + down_read(&kvm->slots_lock);
>> +
>> + if (kthread)
>> + use_mm(_intf->mm);
>> +
>> + for_each_sg(sgl, sg, nents, i) {
>> + unsigned long gpa = sg_dma_address(sg);
>> + unsigned long len = sg_dma_len(sg);
>> + unsigned long gfn = gpa >> PAGE_SHIFT;
>> + off_t offset = offset_in_page(gpa);
>> + unsigned long hva;
>> + struct page *pg;
>> +
>> + /* ensure that we do not have more than one page per entry */
>> + if ((PAGE_ALIGN(len + offset) >> PAGE_SHIFT) != 1) {
>> + ret = -EINVAL;
>> + break;
>> + }
>> +
>> + /* check for a memslot-cache miss */
>> + if (!memslot
>> + || gfn < memslot->base_gfn
>> + || gfn >= memslot->base_gfn + memslot->npages) {
>> + memslot = gfn_to_memslot(kvm, gfn);
>> + if (!memslot) {
>> + ret = -EFAULT;
>> + break;
>> + }
>> + }
>> +
>> + hva = (memslot->userspace_addr +
>> + (gfn - memslot->base_gfn) * PAGE_SIZE);
>> +
>> + if (kthread || current->mm == mm)
>> + ret = get_user_pages_fast(hva, 1, 1, &pg);
>> + else
>> + ret = get_user_pages(p, mm, hva, 1, 1, 0, &pg, NULL);
>>
>
> One of these needs the mm semaphore.

Indeed. Good catch.
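
Concretely, I take it you mean the slow get_user_pages() path should be
bracketed like this (sketch, reusing the variable names from above):

	down_read(&mm->mmap_sem);
	ret = get_user_pages(p, mm, hva, 1, 1, 0, &pg, NULL);
	up_read(&mm->mmap_sem);

get_user_pages_fast() does not require the caller to hold mmap_sem (it
takes it internally if it has to fall back), so the fast path should be
fine as-is.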

>
>> +
>> + if (ret != 1) {
>> + if (ret >= 0)
>> + ret = -EFAULT;
>> + break;
>> + }
>> +
>> + sg_set_page(sg, pg, len, offset);
>> + ret = 0;
>> + }
>> +
>> + if (kthread)
>> + unuse_mm(_intf->mm);
>> +
>> + up_read(&kvm->slots_lock);
>> +
>> + return ret;
>> +}
>> +
>> static void
>> xinterface_release(struct kvm_xinterface *intf)
>> {
>> @@ -483,6 +554,7 @@ struct kvm_xinterface_ops _xinterface_ops = {
>> .copy_from = xinterface_copy_from,
>> .vmap = xinterface_vmap,
>> .ioevent = xinterface_ioevent,
>> + .sgmap = xinterface_sgmap,
>> .release = xinterface_release,
>> };
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
>



2009-10-06 09:35:46

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

On 10/06/2009 01:57 AM, Gregory Haskins wrote:
> Avi Kivity wrote:
>
>> On 10/02/2009 10:19 PM, Gregory Haskins wrote:
>>
>>> What: xinterface is a mechanism that allows kernel modules external to
>>> the kvm.ko proper to interface with a running guest. It accomplishes
>>> this by creating an abstracted interface which does not expose any
>>> private details of the guest or its related KVM structures, and provides
>>> a mechanism to find and bind to this interface at run-time.
>>>
>>>
>> If this is needed, it should be done as a virt_address_space to which
>> kvm and other modules bind, instead of as something that kvm exports and
>> other modules import. The virt_address_space can be identified by an fd
>> and passed around to kvm and other modules.
>>
> IIUC, what you are proposing is something similar to generalizing the
> vbus::memctx object. I had considered doing something like that in the
> early design phase of vbus, but then decided it would be a hard-sell to
> the mm crowd, and difficult to generalize.
>
> What do you propose as the interface to program the object?
>

Something like the current kvm interfaces, de-warted. It will be a hard
sell indeed, for good reasons.

>> So, under my suggestion above, you'd call
>> sys_create_virt_address_space(), populate it, and pass the result to kvm
>> and to foo. This allows the use of virt_address_space without kvm and
>> doesn't require foo to interact with kvm.
>>
> The problem I see here is that the only way I can think to implement
> this generally is something that looks very kvm-esque (slots-to-pages
> kind of translation). Is there a way you can think of that does not
> involve a kvm.ko originated vtable that is also not kvm centric?
>

slots would be one implementation, if you can think of others then you'd
add them.

If you can't, I think it indicates that the whole thing isn't necessary
and we're better off with slots and virtual memory. The only thing
missing is dma, which you don't deal with anyway.

>>> +struct kvm_xinterface_ops {
>>> + unsigned long (*copy_to)(struct kvm_xinterface *intf,
>>> + unsigned long gpa, const void *src,
>>> + unsigned long len);
>>> + unsigned long (*copy_from)(struct kvm_xinterface *intf, void *dst,
>>> + unsigned long gpa, unsigned long len);
>>> + struct kvm_xvmap* (*vmap)(struct kvm_xinterface *intf,
>>> + unsigned long gpa,
>>> + unsigned long len);
>>>
>>>
>> How would vmap() work with live migration?
>>
> vmap represents shmem regions, and is a per-guest-instance resource. So
> my plan there is that the new and old guest instance would each have the
> vmap region instated at the same GPA location (assumption: gpas are
> stable across migration), and any state relevant data local to the shmem
> (like ring head/tail position) is conveyed in the serialized stream for
> the device model.
>

You'd have to copy the entire range since you don't know what the guest
might put there. I guess it's acceptable for small areas.

>>> +
>>> +static inline void
>>> +_kvm_xinterface_release(struct kref *kref)
>>> +{
>>> + struct kvm_xinterface *intf;
>>> + struct module *owner;
>>> +
>>> + intf = container_of(kref, struct kvm_xinterface, kref);
>>> +
>>> + owner = intf->owner;
>>> + rmb();
>>>
>>>
>> Why rmb?
>>
> the intf->ops->release() line may invalidate the intf pointer, so we
> want to ensure that the read completes before the release() is called.
>
> TBH: I'm not 100% its needed, but I was being conservative.
>

rmb()s are only needed if an external agent can issue writes, otherwise
you'd need one after every statement.




>>
>> A simple per-vcpu cache (in struct kvm_vcpu) is likely to give better
>> results.
>>
> per-vcpu will not work well here, unfortunately, since this is an
> external interface mechanism. The callers will generally be from a
> kthread or some other non-vcpu related context. Even if we could figure
> out a vcpu to use as a basis, we would require some kind of
> heavier-weight synchronization which would not be as desirable.
>
> Therefore, I opted to go per-cpu and use the presumably lighterweight
> get_cpu/put_cpu() instead.
>

This just assumes a low context switch rate.

How about a gfn_to_pfn_cached(..., struct gfn_to_pfn_cache *cache)?
Each user can place it in a natural place.

>>> +static unsigned long
>>> +xinterface_copy_to(struct kvm_xinterface *intf, unsigned long gpa,
>>> + const void *src, unsigned long n)
>>> +{
>>> + struct _xinterface *_intf = to_intf(intf);
>>> + unsigned long dst;
>>> + bool kthread = !current->mm;
>>> +
>>> + down_read(&_intf->kvm->slots_lock);
>>> +
>>> + dst = gpa_to_hva(_intf, gpa);
>>> + if (!dst)
>>> + goto out;
>>> +
>>> + if (kthread)
>>> + use_mm(_intf->mm);
>>> +
>>> + if (kthread || _intf->mm == current->mm)
>>> + n = copy_to_user((void *)dst, src, n);
>>> + else
>>> + n = _slow_copy_to_user(_intf, dst, src, n);
>>>
>>>
>> Can't you switch the mm temporarily instead of this?
>>
> Thats actually what I do for the fast-path (use_mm() does a switch_to()
> internally).
>
> The slow-path is only there for completeness for when switching is not
> possible (such as if called with an mm already active i.e.
> process-context).

Still, why can't you switch temporarily?

> In practice, however, this doesnt happen. Virtually
> 100% of the calls in vbus hit the fast-path here, and I suspect most
> xinterface clients would find the same conditions as well.
>

So you have 100% untested code here.

--
error compiling committee.c: too many arguments to function

2009-10-06 13:32:52

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

Avi Kivity wrote:
> On 10/06/2009 01:57 AM, Gregory Haskins wrote:
>> Avi Kivity wrote:
>>
>>> On 10/02/2009 10:19 PM, Gregory Haskins wrote:
>>>
>>>> What: xinterface is a mechanism that allows kernel modules external to
>>>> the kvm.ko proper to interface with a running guest. It accomplishes
>>>> this by creating an abstracted interface which does not expose any
>>>> private details of the guest or its related KVM structures, and
>>>> provides
>>>> a mechanism to find and bind to this interface at run-time.
>>>>
>>>>
>>> If this is needed, it should be done as a virt_address_space to which
>>> kvm and other modules bind, instead of as something that kvm exports and
>>> other modules import. The virt_address_space can be identified by an fd
>>> and passed around to kvm and other modules.
>>>
>> IIUC, what you are proposing is something similar to generalizing the
>> vbus::memctx object. I had considered doing something like that in the
>> early design phase of vbus, but then decided it would be a hard-sell to
>> the mm crowd, and difficult to generalize.
>>
>> What do you propose as the interface to program the object?
>>
>
> Something like the current kvm interfaces, de-warted. It will be a hard
> sell indeed, for good reasons.

I am not convinced generalizing this at this point is the best step
forward, since we only have a KVM client. Let me put some more thought
into it.


>
>>> So, under my suggestion above, you'd call
>>> sys_create_virt_address_space(), populate it, and pass the result to kvm
>>> and to foo. This allows the use of virt_address_space without kvm and
>>> doesn't require foo to interact with kvm.
>>>
>> The problem I see here is that the only way I can think to implement
>> this generally is something that looks very kvm-esque (slots-to-pages
>> kind of translation). Is there a way you can think of that does not
>> involve a kvm.ko originated vtable that is also not kvm centric?
>>
>
> slots would be one implementation, if you can think of others then you'd
> add them.

I'm more interested in *how* you'd add them more than "if" we would add
them. What I am getting at are the logistics of such a beast.

For instance, would I have /dev/slots-vas with ioctls for adding slots,
and /dev/foo-vas for adding foos? And each one would instantiate a
different vas_struct object with its own vas_struct->ops? Or were you
thinking of something different?

>
> If you can't, I think it indicates that the whole thing isn't necessary
> and we're better off with slots and virtual memory.

I'm not sure if we are talking about the same thing yet, but if we are,
there are uses of a generalized interface outside of slots/virtual
memory (Ira's physical box being a good example).

In any case, I think the best approach is what I already proposed.
KVM's arrangement of memory is going to tend to be KVM-specific, and
what better place to implement the interface than close to the kvm.ko core?

> The only thing missing is dma, which you don't deal with anyway.
>

AFAICT I do support dma in the generalized vbus::memctx, though I do not
use it on anything related to KVM or xinterface. Can you elaborate on
the problem here? Does the SG interface in 4/4 help get us closer to
what you envision?

>>>> +struct kvm_xinterface_ops {
>>>> + unsigned long (*copy_to)(struct kvm_xinterface *intf,
>>>> + unsigned long gpa, const void *src,
>>>> + unsigned long len);
>>>> + unsigned long (*copy_from)(struct kvm_xinterface *intf, void *dst,
>>>> + unsigned long gpa, unsigned long len);
>>>> + struct kvm_xvmap* (*vmap)(struct kvm_xinterface *intf,
>>>> + unsigned long gpa,
>>>> + unsigned long len);
>>>>
>>>>
>>> How would vmap() work with live migration?
>>>
>> vmap represents shmem regions, and is a per-guest-instance resource. So
>> my plan there is that the new and old guest instance would each have the
>> vmap region instated at the same GPA location (assumption: gpas are
>> stable across migration), and any state relevant data local to the shmem
>> (like ring head/tail position) is conveyed in the serialized stream for
>> the device model.
>>
>
> You'd have to copy the entire range since you don't know what the guest
> might put there. I guess it's acceptable for small areas.

? The vmap is presumably part of an ABI between guest and host, so the
host should always know what structure is present within the region, and
what is relevant from within that structure to migrate once that state
is "frozen".

These regions (for vbus, anyway) equate to things like virtqueue
metadata, and presumably the same problem exists for virtio-net in
userspace as it does here, since that is another form of a "vmap". So
whatever solution works for virtio-net migrating its virtqueues in
userspace should be close to what will work here. The primary
difference is the location of the serializer.

>
>>>> +
>>>> +static inline void
>>>> +_kvm_xinterface_release(struct kref *kref)
>>>> +{
>>>> + struct kvm_xinterface *intf;
>>>> + struct module *owner;
>>>> +
>>>> + intf = container_of(kref, struct kvm_xinterface, kref);
>>>> +
>>>> + owner = intf->owner;
>>>> + rmb();
>>>>
>>>>
>>> Why rmb?
>>>
>> the intf->ops->release() line may invalidate the intf pointer, so we
>> want to ensure that the read completes before the release() is called.
>>
>> TBH: I'm not 100% its needed, but I was being conservative.
>>
>
> rmb()s are only needed if an external agent can issue writes, otherwise
> you'd need one after every statement.

I was following lessons learned here:

http://lkml.org/lkml/2009/7/7/175

Perhaps mb() or barrier() are more appropriate than rmb()? I'm CC'ing
David Howells in case he has more insight.

>
>
>
>
>>>
>>> A simple per-vcpu cache (in struct kvm_vcpu) is likely to give better
>>> results.
>>>
>> per-vcpu will not work well here, unfortunately, since this is an
>> external interface mechanism. The callers will generally be from a
>> kthread or some other non-vcpu related context. Even if we could figure
>> out a vcpu to use as a basis, we would require some kind of
>> heavier-weight synchronization which would not be as desirable.
>>
>> Therefore, I opted to go per-cpu and use the presumably lighterweight
>> get_cpu/put_cpu() instead.
>>
>
> This just assumes a low context switch rate.

It primarily assumes a low _migration_ rate, since you do not typically
have two contexts on the same cpu pounding on the memslots. And even if
you did, there's a good chance for locality between the threads, since
the IO activity is likely related. For the odd times where locality
fails to yield a hit, the time-slice or migration rate should be
sufficiently infrequent to still yield high-90s hit rates for
when it matters. For low volume, the lookup is in the noise, so I am not
as concerned.

IOW: Where the lookup hurts the most is trying to walk an SG list, since
each pointer is usually within the same slot. For this case, at least,
this cache helps immensely, at least according to profiles.

>
> How about a gfn_to_pfn_cached(..., struct gfn_to_pfn_cache *cache)?
> Each user can place it in a natural place.

Sounds good. I will incorporate this into the split patch.
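
Just to confirm we are picturing the same shape of thing: something along
these lines, I assume (purely a hypothetical sketch; I did it at the
memslot level since that is what xinterface caches today, but a pfn-level
variant would look the same):

struct gfn_to_memslot_cache {
	struct kvm_memory_slot *slot;	/* last slot that satisfied a lookup */
};

static struct kvm_memory_slot *
gfn_to_memslot_cached(struct kvm *kvm, gfn_t gfn,
		      struct gfn_to_memslot_cache *cache)
{
	struct kvm_memory_slot *slot = cache->slot;

	/* hit: the gfn still falls within the cached slot */
	if (slot
	    && gfn >= slot->base_gfn
	    && gfn < slot->base_gfn + slot->npages)
		return slot;

	/* miss: do the full lookup and remember it for next time */
	slot = gfn_to_memslot(kvm, gfn);
	cache->slot = slot;

	return slot;
}

Each caller would then embed the cache wherever is natural (in the
xinterface object, or on the stack of an SG walk), which sidesteps the
per-cpu data entirely.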

>
>>>> +static unsigned long
>>>> +xinterface_copy_to(struct kvm_xinterface *intf, unsigned long gpa,
>>>> + const void *src, unsigned long n)
>>>> +{
>>>> + struct _xinterface *_intf = to_intf(intf);
>>>> + unsigned long dst;
>>>> + bool kthread = !current->mm;
>>>> +
>>>> + down_read(&_intf->kvm->slots_lock);
>>>> +
>>>> + dst = gpa_to_hva(_intf, gpa);
>>>> + if (!dst)
>>>> + goto out;
>>>> +
>>>> + if (kthread)
>>>> + use_mm(_intf->mm);
>>>> +
>>>> + if (kthread || _intf->mm == current->mm)
>>>> + n = copy_to_user((void *)dst, src, n);
>>>> + else
>>>> + n = _slow_copy_to_user(_intf, dst, src, n);
>>>>
>>>>
>>> Can't you switch the mm temporarily instead of this?
>>>
>> Thats actually what I do for the fast-path (use_mm() does a switch_to()
>> internally).
>>
>> The slow-path is only there for completeness for when switching is not
>> possible (such as if called with an mm already active i.e.
>> process-context).
>
> Still, why can't you switch temporarily?

I am not an mm expert, but IIUC you cannot call switch_to() from
anything other than kthread context. That's what the doc says, anyway.

>
>> In practice, however, this doesnt happen. Virtually
>> 100% of the calls in vbus hit the fast-path here, and I suspect most
>> xinterface clients would find the same conditions as well.
>>
>
> So you have 100% untested code here.
>

Actually, no. Before Michael enlightened me recently regarding
switch_to/use_mm, the '_slow_xx' functions were my _only_ path. So they
have indeed multiple months (and multiple GB) of testing, as it turns
out. I only recently optimized them away to "backup" duty.

Thanks Avi,
-Greg


2009-10-06 14:22:48

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

Gregory Haskins wrote:
> Avi Kivity wrote:
>> On 10/06/2009 01:57 AM, Gregory Haskins wrote:
>>> Avi Kivity wrote:
>>>
>>>> On 10/02/2009 10:19 PM, Gregory Haskins wrote:

>>>>> +
>>>>> +static inline void
>>>>> +_kvm_xinterface_release(struct kref *kref)
>>>>> +{
>>>>> + struct kvm_xinterface *intf;
>>>>> + struct module *owner;
>>>>> +
>>>>> + intf = container_of(kref, struct kvm_xinterface, kref);
>>>>> +
>>>>> + owner = intf->owner;
>>>>> + rmb();
>>>>>
>>>>>
>>>> Why rmb?
>>>>
>>> the intf->ops->release() line may invalidate the intf pointer, so we
>>> want to ensure that the read completes before the release() is called.
>>>
>>> TBH: I'm not 100% its needed, but I was being conservative.
>>>
>> rmb()s are only needed if an external agent can issue writes, otherwise
>> you'd need one after every statement.
>
> I was following lessons learned here:
>
> http://lkml.org/lkml/2009/7/7/175
>
> Perhaps mb() or barrier() are more appropriate than rmb()? I'm CC'ing
> David Howells in case he has more insight.

BTW: In case it is not clear, the rationale as I understand it is we
worry about the case where one cpu reorders the read to be after the
->release(), and another cpu might grab the memory that was kfree()'d
within the ->release() and scribble something else on it before the read
completes.

I know rmb() typically needs to be paired with wmb() to be correct, so
you are probably right to say that the rmb() itself is not appropriate.
This problem in general makes my head hurt, which is why I said I am
not 100% sure of what is required. As David mentions, perhaps
"smp_mb()" is more appropriate for this application. I also speculate
barrier() may be all that we need.
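
To spell the sequence out (the rest of the function is not quoted above,
so the module_put(owner) at the end is my assumption about what follows;
saving the owner first only makes sense for something like that):

	intf = container_of(kref, struct kvm_xinterface, kref);

	owner = intf->owner;		/* (1) read owner while intf is valid */
	rmb();				/* (2) the barrier under discussion   */
	intf->ops->release(intf);	/* (3) may kfree(intf)                */
	module_put(owner);		/* (4) must not touch intf anymore    */

The worry is that (1) could complete after (3) has freed (and some other
cpu has recycled) the memory.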

Kind Regards,
-Greg


2009-10-06 16:20:31

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

On 10/06/2009 03:31 PM, Gregory Haskins wrote:
>
>> slots would be one implementation, if you can think of others then you'd
>> add them.
>>
> I'm more interested in *how* you'd add them more than "if" we would add
> them. What I am getting at are the logistics of such a beast.
>

Add alternative ioctls, or have one ioctl with a 'type' field.

> For instance, would I have /dev/slots-vas with ioctls for adding slots,
> and /dev/foo-vas for adding foos? And each one would instantiate a
> different vas_struct object with its own vas_struct->ops? Or were you
> thinking of something different.
>

I think a single /dev/foo is sufficient, unless some of those address
spaces are backed by real devices.

>> If you can't, I think it indicates that the whole thing isn't necessary
>> and we're better off with slots and virtual memory.
>>
> I'm not sure if we are talking about the same thing yet, but if we are,
> there are uses of a generalized interface outside of slots/virtual
> memory (Ira's physical box being a good example).
>

I'm not sure Ira's case is not best supported by virtual memory.

> In any case, I think the best approach is what I already proposed.
> KVM's arrangement of memory is going to tend to be KVM specific, and
> what better place to implement the interface than close to the kvm.ko core.
>
>
>> The only thing missing is dma, which you don't deal with anyway.
>>
>>
> Afaict I do support dma in the generalized vbus::memctx, though I do not
> use it on anything related to KVM or xinterface. Can you elaborate on
> the problem here? Does the SG interface in 4/4 help get us closer to
> what you envision?
>

There's no dma method in there. Mapping to physical addresses is
equivalent to get_user_pages(), so it doesn't add anything over virtual
memory slots.

>> You'd have to copy the entire range since you don't know what the guest
>> might put there. I guess it's acceptable for small areas.
>>
> ? The vmap is presumably part of an ABI between guest and host, so the
> host should always know what structure is present within the region, and
> what is relevant from within that structure to migrate once that state
> is "frozen".
>

I was thinking of the case where the page is shared with some other
(guest-private) data. But dirtying that data would be tracked by kvm,
so it isn't a problem.

> These regions (for vbus, anyway) equate to things like virtqueue
> metadata, and presumably the same problem exists for virtio-net in
> userspace as it does here, since that is another form of a "vmap". So
> whatever solution works for virtio-net migrating its virtqueues in
> userspace should be close to what will work here. The primary
> difference is the location of the serializer.
>

Actually, virtio doesn't serialize this data, instead it marks the page
dirty and lets normal RAM migration move it.


>> rmb()s are only needed if an external agent can issue writes, otherwise
>> you'd need one after every statement.
>>
> I was following lessons learned here:
>
> http://lkml.org/lkml/2009/7/7/175
>
> Perhaps mb() or barrier() are more appropriate than rmb()? I'm CC'ing
> David Howells in case he has more insight.
>

I'm not sure I understand the reference. Callers and callees don't need
memory barriers since they're guaranteed program order. You only need
memory barriers when you have an external agent (a device, or another
cpu). What external agent can touch things during ->release()?

>>>> A simple per-vcpu cache (in struct kvm_vcpu) is likely to give better
>>>> results.
>>>>
>>>>
>>> per-vcpu will not work well here, unfortunately, since this is an
>>> external interface mechanism. The callers will generally be from a
>>> kthread or some other non-vcpu related context. Even if we could figure
>>> out a vcpu to use as a basis, we would require some kind of
>>> heavier-weight synchronization which would not be as desirable.
>>>
>>> Therefore, I opted to go per-cpu and use the presumably lighterweight
>>> get_cpu/put_cpu() instead.
>>>
>>>
>> This just assumes a low context switch rate.
>>
> It primarily assumes a low _migration_ rate, since you do not typically
> have two contexts on the same cpu pounding on the memslots.

Why not? If you're overcommitted, you will, especially if you have
multiple guests doing request/reply interaction (perhaps with each other).

> And even if
> you did, there's a good chance for locality between the threads, since
> the IO activity is likely related. For the odd times where locality
> fails to yield a hit, the time-slice or migration rate should be
> sufficiently infrequent enough to still yield high 90s hit rates for
> when it matters. For low-volume, the lookup is in the noise so I am not
> as concerned.
>
> IOW: Where the lookup hurts the most is trying to walk an SG list, since
> each pointer is usually within the same slot. For this case, at least,
> this cache helps immensely, at least according to profiles.
>

I'm wary of adding per-cpu data where it's easier to have a per-caller
or per-vcpu cache.

>> How about a gfn_to_pfn_cached(..., struct gfn_to_pfn_cache *cache)?
>> Each user can place it in a natural place.
>>
> Sounds good. I will incorporate this into the split patch.
>

Note that for slots, typically most accesses hit just one slot (the
largest). For guests below 4G, there's only one large slot, for guests
above 4G there are two.

>>>>> +static unsigned long
>>>>> +xinterface_copy_to(struct kvm_xinterface *intf, unsigned long gpa,
>>>>> + const void *src, unsigned long n)
>>>>> +{
>>>>> + struct _xinterface *_intf = to_intf(intf);
>>>>> + unsigned long dst;
>>>>> + bool kthread = !current->mm;
>>>>> +
>>>>> + down_read(&_intf->kvm->slots_lock);
>>>>> +
>>>>> + dst = gpa_to_hva(_intf, gpa);
>>>>> + if (!dst)
>>>>> + goto out;
>>>>> +
>>>>> + if (kthread)
>>>>> + use_mm(_intf->mm);
>>>>> +
>>>>> + if (kthread || _intf->mm == current->mm)
>>>>> + n = copy_to_user((void *)dst, src, n);
>>>>> + else
>>>>> + n = _slow_copy_to_user(_intf, dst, src, n);
>>>>>
>>>>>
>>>>>
>>>> Can't you switch the mm temporarily instead of this?
>>>>
>>>>
>>> Thats actually what I do for the fast-path (use_mm() does a switch_to()
>>> internally).
>>>
>>> The slow-path is only there for completeness for when switching is not
>>> possible (such as if called with an mm already active i.e.
>>> process-context).
>>>
>> Still, why can't you switch temporarily?
>>
> I am not an mm expert, but iiuc you cannot call switch_to() from
> anything other than kthread context. Thats what the doc says, anyway.
>

Can you provide a pointer? I don't see why this limitation exists.

>>> In practice, however, this doesnt happen. Virtually
>>> 100% of the calls in vbus hit the fast-path here, and I suspect most
>>> xinterface clients would find the same conditions as well.
>>>
>>>
>> So you have 100% untested code here.
>>
>>
> Actually, no. Before Michael enlightened me recently regarding
> switch_to/use_mm, the '_slow_xx' functions were my _only_ path. So they
> have indeed multiple months (and multiple GB) of testing, as it turns
> out. I only recently optimized them away to "backup" duty.
>

I meant, unexercised. This will bitrot quickly.

--
error compiling committee.c: too many arguments to function

2009-10-06 16:24:28

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

On 10/06/2009 04:22 PM, Gregory Haskins wrote:
>>>>>> +
>>>>>> +static inline void
>>>>>> +_kvm_xinterface_release(struct kref *kref)
>>>>>> +{
>>>>>> + struct kvm_xinterface *intf;
>>>>>> + struct module *owner;
>>>>>> +
>>>>>> + intf = container_of(kref, struct kvm_xinterface, kref);
>>>>>> +
>>>>>> + owner = intf->owner;
>>>>>> + rmb();
>>>>>>
>>>>>>
>>>>>>
>>>>> Why rmb?
>>>>>
>>>>>
>>>> the intf->ops->release() line may invalidate the intf pointer, so we
>>>> want to ensure that the read completes before the release() is called.
>>>>
>>>> TBH: I'm not 100% its needed, but I was being conservative.
>>>>
>>>>
>>> rmb()s are only needed if an external agent can issue writes, otherwise
>>> you'd need one after every statement.
>>>
>> I was following lessons learned here:
>>
>> http://lkml.org/lkml/2009/7/7/175
>>
>> Perhaps mb() or barrier() are more appropriate than rmb()? I'm CC'ing
>> David Howells in case he has more insight.
>>
> BTW: In case it is not clear, the rationale as I understand it is we
> worry about the case where one cpu reorders the read to be after the
> ->release(), and another cpu might grab the memory that was kfree()'d
> within the ->release() and scribble something else on it before the read
> completes.
>
> I know rmb() typically needs to be paired with wmb() to be correct, so
> you are probably right to say that the rmb() itself is not appropriate.
> This problem in general makes my head hurt, which is why I said I am
> not 100% sure of what is required. As David mentions, perhaps
> "smp_mb()" is more appropriate for this application. I also speculate
> barrier() may be all that we need.
>

barrier() is the operation for a compiler barrier. But it's unneeded
here - unless the compiler can prove that ->release(intf) will not
modify intf->owner it is not allowed to move the access afterwards. An
indirect function call is generally a barrier() since the compiler can't
assume memory has not been modified.

--
error compiling committee.c: too many arguments to function

2009-10-06 16:58:50

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

Avi Kivity wrote:
> On 10/06/2009 03:31 PM, Gregory Haskins wrote:
>>
>>> slots would be one implementation, if you can think of others then you'd
>>> add them.
>>>
>> I'm more interested in *how* you'd add them more than "if" we would add
>> them. What I am getting at are the logistics of such a beast.
>>
>
> Add alternative ioctls, or have one ioctl with a 'type' field.
>
>> For instance, would I have /dev/slots-vas with ioctls for adding slots,
>> and /dev/foo-vas for adding foos? And each one would instantiate a
>> different vas_struct object with its own vas_struct->ops? Or were you
>> thinking of something different.
>>
>
> I think a single /dev/foo is sufficient, unless some of those address
> spaces are backed by real devices.
>
>>> If you can't, I think it indicates that the whole thing isn't necessary
>>> and we're better off with slots and virtual memory.
>>>
>> I'm not sure if we are talking about the same thing yet, but if we are,
>> there are uses of a generalized interface outside of slots/virtual
>> memory (Ira's physical box being a good example).
>>
>
> I'm not sure Ira's case is not best supported by virtual memory.

Perhaps, but there are surely some cases where the memory is not
pageable, but is accessible indirectly through some DMA controller. So
if we require it to be pageable, we will limit the utility of the
interface, though admittedly it will probably cover most cases.

>
>> In any case, I think the best approach is what I already proposed.
>> KVM's arrangement of memory is going to tend to be KVM specific, and
>> what better place to implement the interface than close to the kvm.ko
>> core.
>>
>>
>>> The only thing missing is dma, which you don't deal with anyway.
>>>
>>>
>> Afaict I do support dma in the generalized vbus::memctx, though I do not
>> use it on anything related to KVM or xinterface. Can you elaborate on
>> the problem here? Does the SG interface in 4/4 help get us closer to
>> what you envision?
>>
>
> There's no dma method in there. Mapping to physical addresses is
> equivalent to get_user_pages(), so it doesn't add anything over virtual
> memory slots.

I am not following you at all. What kind of interface do we need for
DMA? Since the KVM guest is pageable, the support for DMA would come
from building a scatterlist (patch 4/4) and passing the resulting pages
out using standard sg mechanisms, like sg_dma_address(). This is what I
do today to support zero-copy in AlacrityVM.

What more do we need?
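
To make that concrete, a zero-copy consumer of the sgmap op ends up doing
roughly this (a sketch against the 4/4 interface; the function name and
the gpa/len arrays are illustrative):

#include <linux/kvm_xinterface.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

static int
xfer_guest_pages(struct kvm_xinterface *intf, u64 *gpas, u32 *lens,
		 int nents)
{
	struct scatterlist *sgl, *sg;
	int i, ret;

	sgl = kcalloc(nents, sizeof(*sgl), GFP_KERNEL);
	if (!sgl)
		return -ENOMEM;

	sg_init_table(sgl, nents);

	/* per the 4/4 contract, the "dma" fields carry GPAs on input */
	for_each_sg(sgl, sg, nents, i) {
		sg_dma_address(sg) = gpas[i];
		sg_dma_len(sg)     = lens[i];
	}

	/* xinterface pins the backing guest pages and fills page/offset/len */
	ret = intf->ops->sgmap(intf, sgl, nents, 0);
	if (!ret) {
		/*
		 * The list now references pinned guest pages and can be
		 * handed to dma_map_sg()/a device, or kmapped for copies.
		 * Completion is signaled by dropping the page references.
		 */
		for_each_sg(sgl, sg, nents, i)
			put_page(sg_page(sg));
	}

	kfree(sgl);
	return ret;
}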

>
>>> You'd have to copy the entire range since you don't know what the guest
>>> might put there. I guess it's acceptable for small areas.
>>>
>> ? The vmap is presumably part of an ABI between guest and host, so the
>> host should always know what structure is present within the region, and
>> what is relevant from within that structure to migrate once that state
>> is "frozen".
>>
>
> I was thinking of the case where the page is shared with some other
> (guest-private) data. But dirtying that data would be tracked by kvm,
> so it isn't a problem.

Ok.

>
>> These regions (for vbus, anyway) equate to things like virtqueue
>> metadata, and presumably the same problem exists for virtio-net in
>> userspace as it does here, since that is another form of a "vmap". So
>> whatever solution works for virtio-net migrating its virtqueues in
>> userspace should be close to what will work here. The primary
>> difference is the location of the serializer.
>>
>
> Actually, virtio doesn't serialize this data, instead it marks the page
> dirty and lets normal RAM migration move it.

Ok, so it's effectively serializing the entire region indirectly. That
works too.

>
>
>>> rmb()s are only needed if an external agent can issue writes, otherwise
>>> you'd need one after every statement.
>>>
>> I was following lessons learned here:
>>
>> http://lkml.org/lkml/2009/7/7/175
>>
>> Perhaps mb() or barrier() are more appropriate than rmb()? I'm CC'ing
>> David Howells in case he has more insight.
>>
>
> I'm not sure I understand the reference. Callers and callees don't need
> memory barriers since they're guaranteed program order. You only need
> memory barriers when you have an external agent (a device, or another
> cpu). What external agent can touch things during ->release()?

We already have a different subthread started on this, so I'll pick this
up there.

>
>>>>> A simple per-vcpu cache (in struct kvm_vcpu) is likely to give better
>>>>> results.
>>>>>
>>>>>
>>>> per-vcpu will not work well here, unfortunately, since this is an
>>>> external interface mechanism. The callers will generally be from a
>>>> kthread or some other non-vcpu related context. Even if we could
>>>> figure
>>>> out a vcpu to use as a basis, we would require some kind of
>>>> heavier-weight synchronization which would not be as desirable.
>>>>
>>>> Therefore, I opted to go per-cpu and use the presumably lighterweight
>>>> get_cpu/put_cpu() instead.
>>>>
>>>>
>>> This just assumes a low context switch rate.
>>>
>> It primarily assumes a low _migration_ rate, since you do not typically
>> have two contexts on the same cpu pounding on the memslots.
>
> Why not? If you're overcommitted, you will, especially if you have
> multiple guests doing request/reply interaction (perhaps with each other).

Right, but you can probably do a *lot* of gfn_to_memslot() calls within
your timeslice to still make it highly worthwhile. We are talking
microseconds per lookup vs milliseconds per context switch.

>
>> And even if
>> you did, there's a good chance for locality between the threads, since
>> the IO activity is likely related. For the odd times where locality
>> fails to yield a hit, the time-slice or migration rate should be
>> sufficiently infrequent enough to still yield high 90s hit rates for
>> when it matters. For low-volume, the lookup is in the noise so I am not
>> as concerned.
>>
>> IOW: Where the lookup hurts the most is trying to walk an SG list, since
>> each pointer is usually within the same slot. For this case, at least,
>> this cache helps immensely, at least according to profiles.
>>
>
> I'm wary of adding per-cpu data where it's easier to have a per-caller
> or per-vcpu cache.

I like your per-caller suggestion. That makes the most sense to me.

>
>>> How about a gfn_to_pfn_cached(..., struct gfn_to_pfn_cache *cache)?
>>> Each user can place it in a natural place.
>>>
>> Sounds good. I will incorporate this into the split patch.
>>
>
> Note that for slots, typically most accesses hit just one slot (the
> largest). For guests below 4G, there's only one large slot, for guests
> above 4G there are two.

Right, I find that the cache misses are pretty infrequent, so that jibes
with your statement.

>
>>>>>> +static unsigned long
>>>>>> +xinterface_copy_to(struct kvm_xinterface *intf, unsigned long gpa,
>>>>>> + const void *src, unsigned long n)
>>>>>> +{
>>>>>> + struct _xinterface *_intf = to_intf(intf);
>>>>>> + unsigned long dst;
>>>>>> + bool kthread = !current->mm;
>>>>>> +
>>>>>> + down_read(&_intf->kvm->slots_lock);
>>>>>> +
>>>>>> + dst = gpa_to_hva(_intf, gpa);
>>>>>> + if (!dst)
>>>>>> + goto out;
>>>>>> +
>>>>>> + if (kthread)
>>>>>> + use_mm(_intf->mm);
>>>>>> +
>>>>>> + if (kthread || _intf->mm == current->mm)
>>>>>> + n = copy_to_user((void *)dst, src, n);
>>>>>> + else
>>>>>> + n = _slow_copy_to_user(_intf, dst, src, n);
>>>>>>
>>>>>>
>>>>>>
>>>>> Can't you switch the mm temporarily instead of this?
>>>>>
>>>>>
>>>> Thats actually what I do for the fast-path (use_mm() does a switch_to()
>>>> internally).
>>>>
>>>> The slow-path is only there for completeness for when switching is not
>>>> possible (such as if called with an mm already active i.e.
>>>> process-context).
>>>>
>>> Still, why can't you switch temporarily?
>>>
>> I am not an mm expert, but iiuc you cannot call switch_to() from
>> anything other than kthread context. Thats what the doc says, anyway.
>>
>
> Can you provide a pointer? I don't see why this limitation exists.

See the comments for "use_mm()"

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=mm/mmu_context.c;h=ded9081f40219d888e48e41d4b665a712a188d28;hb=HEAD

>
>>>> In practice, however, this doesnt happen. Virtually
>>>> 100% of the calls in vbus hit the fast-path here, and I suspect most
>>>> xinterface clients would find the same conditions as well.
>>>>
>>>>
>>> So you have 100% untested code here.
>>>
>>>
>> Actually, no. Before Michael enlightened me recently regarding
>> switch_to/use_mm, the '_slow_xx' functions were my _only_ path. So they
>> have indeed multiple months (and multiple GB) of testing, as it turns
>> out. I only recently optimized them away to "backup" duty.
>>
>
> I meant, unexercised. This will bitrot quickly.
>

Ok, since I don't need it, should I just fail the cases that do not
qualify to use switch_to() (if any)?


2009-10-06 17:00:47

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

Avi Kivity wrote:
> On 10/06/2009 04:22 PM, Gregory Haskins wrote:
>>>>>>> +
>>>>>>> +static inline void
>>>>>>> +_kvm_xinterface_release(struct kref *kref)
>>>>>>> +{
>>>>>>> + struct kvm_xinterface *intf;
>>>>>>> + struct module *owner;
>>>>>>> +
>>>>>>> + intf = container_of(kref, struct kvm_xinterface, kref);
>>>>>>> +
>>>>>>> + owner = intf->owner;
>>>>>>> + rmb();
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> Why rmb?
>>>>>>
>>>>>>
>>>>> the intf->ops->release() line may invalidate the intf pointer, so we
>>>>> want to ensure that the read completes before the release() is called.
>>>>>
>>>>> TBH: I'm not 100% its needed, but I was being conservative.
>>>>>
>>>>>
>>>> rmb()s are only needed if an external agent can issue writes, otherwise
>>>> you'd need one after every statement.
>>>>
>>> I was following lessons learned here:
>>>
>>> http://lkml.org/lkml/2009/7/7/175
>>>
>>> Perhaps mb() or barrier() are more appropriate than rmb()? I'm CC'ing
>>> David Howells in case he has more insight.
>>>
>> BTW: In case it is not clear, the rationale as I understand it is we
>> worry about the case where one cpu reorders the read to be after the
>> ->release(), and another cpu might grab the memory that was kfree()'d
>> within the ->release() and scribble something else on it before the read
>> completes.
>>
>> I know rmb() typically needs to be paired with wmb() to be correct, so
>> you are probably right to say that the rmb() itself is not appropriate.
>> This problem in general makes my head hurt, which is why I said I am
>> not 100% sure of what is required. As David mentions, perhaps
>> "smp_mb()" is more appropriate for this application. I also speculate
>> barrier() may be all that we need.
>>
>
> barrier() is the operation for a compiler barrier. But it's unneeded
> here - unless the compiler can prove that ->release(intf) will not
> modify intf->owner it is not allowed to move the access afterwards. An
> indirect function call is generally a barrier() since the compiler can't
> assume memory has not been modified.
>

You're logic seems reasonable to me. I will defer to David, since he
brought up the issue with the similar logic originally.

Kind Regards,
-Greg


2009-10-06 17:01:39

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

Gregory Haskins wrote:
> Avi Kivity wrote:
>> On 10/06/2009 04:22 PM, Gregory Haskins wrote:
>>>>>>>> +
>>>>>>>> +static inline void
>>>>>>>> +_kvm_xinterface_release(struct kref *kref)
>>>>>>>> +{
>>>>>>>> + struct kvm_xinterface *intf;
>>>>>>>> + struct module *owner;
>>>>>>>> +
>>>>>>>> + intf = container_of(kref, struct kvm_xinterface, kref);
>>>>>>>> +
>>>>>>>> + owner = intf->owner;
>>>>>>>> + rmb();
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Why rmb?
>>>>>>>
>>>>>>>
>>>>>> the intf->ops->release() line may invalidate the intf pointer, so we
>>>>>> want to ensure that the read completes before the release() is called.
>>>>>>
>>>>>> TBH: I'm not 100% its needed, but I was being conservative.
>>>>>>
>>>>>>
>>>>> rmb()s are only needed if an external agent can issue writes, otherwise
>>>>> you'd need one after every statement.
>>>>>
>>>> I was following lessons learned here:
>>>>
>>>> http://lkml.org/lkml/2009/7/7/175
>>>>
>>>> Perhaps mb() or barrier() are more appropriate than rmb()? I'm CC'ing
>>>> David Howells in case he has more insight.
>>>>
>>> BTW: In case it is not clear, the rationale as I understand it is we
>>> worry about the case where one cpu reorders the read to be after the
>>> ->release(), and another cpu might grab the memory that was kfree()'d
>>> within the ->release() and scribble something else on it before the read
>>> completes.
>>>
>>> I know rmb() typically needs to be paired with wmb() to be correct, so
>>> you are probably right to say that the rmb() itself is not appropriate.
>>> This problem in general makes my head hurt, which is why I said I am
>>> not 100% sure of what is required. As David mentions, perhaps
>>> "smp_mb()" is more appropriate for this application. I also speculate
>>> barrier() may be all that we need.
>>>
>> barrier() is the operation for a compiler barrier. But it's unneeded
>> here - unless the compiler can prove that ->release(intf) will not
>> modify intf->owner it is not allowed to move the access afterwards. An
>> indirect function call is generally a barrier() since the compiler can't
>> assume memory has not been modified.
>>
>
> You're logic

gak. or "your logic" even.

> seems reasonable to me. I will defer to David, since he
> brought up the issue with the similar logic originally.
>
> Kind Regards,
> -Greg
>



2009-10-06 18:19:39

by Ira W. Snyder

[permalink] [raw]
Subject: Re: [Alacrityvm-devel] [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

On Tue, Oct 06, 2009 at 12:58:06PM -0400, Gregory Haskins wrote:
> Avi Kivity wrote:
> > On 10/06/2009 03:31 PM, Gregory Haskins wrote:
> >>
> >>> slots would be one implementation, if you can think of others then you'd
> >>> add them.
> >>>
> >> I'm more interested in *how* you'd add them more than "if" we would add
> >> them. What I am getting at are the logistics of such a beast.
> >>
> >
> > Add alternative ioctls, or have one ioctl with a 'type' field.
> >
> >> For instance, would I have /dev/slots-vas with ioctls for adding slots,
> >> and /dev/foo-vas for adding foos? And each one would instantiate a
> >> different vas_struct object with its own vas_struct->ops? Or were you
> >> thinking of something different.
> >>
> >
> > I think a single /dev/foo is sufficient, unless some of those address
> > spaces are backed by real devices.
> >
> >>> If you can't, I think it indicates that the whole thing isn't necessary
> >>> and we're better off with slots and virtual memory.
> >>>
> >> I'm not sure if we are talking about the same thing yet, but if we are,
> >> there are uses of a generalized interface outside of slots/virtual
> >> memory (Ira's physical box being a good example).
> >>
> >
> > I'm not sure Ira's case is not best supported by virtual memory.
>
> Perhaps, but there are surely some cases where the memory is not
> pageable, but is accessible indirectly through some DMA controller. So
> if we require it to be pagable we will limit the utility of the
> interface, though admittedly it will probably cover most cases.
>

The limitation I have is that memory made available from the host system
(PCI card) as PCI BAR1 must not be migrated around in memory. I can only
change the address decoding to hit a specific physical address. AFAIK,
this means it cannot be userspace memory (since the underlying physical
page could change, or it could be in swap), and must be allocated with
something like __get_free_pages() or dma_alloc_coherent().

This is how all 83xx powerpc boards work, and I'd bet that the 85xx and
86xx boards work almost exactly the same. I can't say anything about
non-powerpc boards.
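
In code terms, the constraint is that the buffer behind BAR1 has to come
from something with a fixed bus address, e.g. (sketch; names are
illustrative):

#include <linux/dma-mapping.h>

static void *bar1_buf;
static dma_addr_t bar1_bus;

static int alloc_bar1_backing(struct device *dev, size_t size)
{
	/* fixed, non-pageable backing store for the BAR1 window */
	bar1_buf = dma_alloc_coherent(dev, size, &bar1_bus, GFP_KERNEL);
	if (!bar1_buf)
		return -ENOMEM;

	/* ...then program the card's inbound address decoder to hit bar1_bus */
	return 0;
}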

Ira

2009-10-06 19:41:04

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

Gregory Haskins wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>> On 10/06/2009 04:22 PM, Gregory Haskins wrote:
>>>>>>>>> +
>>>>>>>>> +static inline void
>>>>>>>>> +_kvm_xinterface_release(struct kref *kref)
>>>>>>>>> +{
>>>>>>>>> + struct kvm_xinterface *intf;
>>>>>>>>> + struct module *owner;
>>>>>>>>> +
>>>>>>>>> + intf = container_of(kref, struct kvm_xinterface, kref);
>>>>>>>>> +
>>>>>>>>> + owner = intf->owner;
>>>>>>>>> + rmb();
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Why rmb?
>>>>>>>>
>>>>>>>>
>>>>>>> the intf->ops->release() line may invalidate the intf pointer, so we
>>>>>>> want to ensure that the read completes before the release() is called.
>>>>>>>
>>>>>>> TBH: I'm not 100% its needed, but I was being conservative.
>>>>>>>
>>>>>>>
>>>>>> rmb()s are only needed if an external agent can issue writes, otherwise
>>>>>> you'd need one after every statement.
>>>>>>
>>>>> I was following lessons learned here:
>>>>>
>>>>> http://lkml.org/lkml/2009/7/7/175
>>>>>
>>>>> Perhaps mb() or barrier() are more appropriate than rmb()? I'm CC'ing
>>>>> David Howells in case he has more insight.
>>>>>
>>>> BTW: In case it is not clear, the rationale as I understand it is we
>>>> worry about the case where one cpu reorders the read to be after the
>>>> ->release(), and another cpu might grab the memory that was kfree()'d
>>>> within the ->release() and scribble something else on it before the read
>>>> completes.
>>>>
>>>> I know rmb() typically needs to be paired with wmb() to be correct, so
>>>> you are probably right to say that the rmb() itself is not appropriate.
>>>> This problem in general makes my head hurt, which is why I said I am
>>>> not 100% sure of what is required. As David mentions, perhaps
>>>> "smp_mb()" is more appropriate for this application. I also speculate
>>>> barrier() may be all that we need.
>>>>
>>> barrier() is the operation for a compiler barrier. But it's unneeded
>>> here - unless the compiler can prove that ->release(intf) will not
>>> modify intf->owner it is not allowed to move the access afterwards. An
>>> indirect function call is generally a barrier() since the compiler can't
>>> assume memory has not been modified.
>>>
>> You're logic
>
> gak. or "your logic" even.
>
>> seems reasonable to me. I will defer to David, since he
>> brought up the issue with the similar logic originally.
>>
>> Kind Regards,
>> -Greg
>>
>
>

Thinking about this some more over lunch, I think we (Avi and I) might
both be wrong (and David is right). Avi is right that we don't need
rmb() or barrier() for the reasons already stated, but I think David is
right that we need an smp_mb() to ensure the cpu doesn't do the
reordering. Otherwise a different cpu could invalidate the memory if it
reuses the freed memory in the meantime, IIUC. IOW: it's not a compiler
issue but a cpu issue.

Or am I still confused?

-Greg


2009-10-07 05:12:17

by Amit Shah

[permalink] [raw]
Subject: Re: [Alacrityvm-devel] [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

On (Tue) Oct 06 2009 [11:18:59], Ira W. Snyder wrote:
>
> The limitation I have is that memory made available from the host system
> (PCI card) as PCI BAR1 must not be migrated around in memory. I can only
> change the address decoding to hit a specific physical address. AFAIK,
> this means it cannot be userspace memory (since the underlying physical
> page could change, or it could be in swap), and must be allocated with
> something like __get_free_pages() or dma_alloc_coherent().

If the dma area is just one page in size, that page can be pinned by the
host (via a hypercall by the guest). If it's more, there is a
reserve-ram patch by Andrea. That'll reserve some RAM area for the guest
that's 1-1 mapped in the host address space. There was a patch posted on
the list sometime back.

Amit

2009-10-07 07:44:14

by Avi Kivity

[permalink] [raw]
Subject: Re: [Alacrityvm-devel] [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

On 10/06/2009 08:18 PM, Ira W. Snyder wrote:
>
> The limitation I have is that memory made available from the host system
> (PCI card) as PCI BAR1 must not be migrated around in memory. I can only
> change the address decoding to hit a specific physical address. AFAIK,
> this means it cannot be userspace memory (since the underlying physical
> page could change, or it could be in swap), and must be allocated with
> something like __get_free_pages() or dma_alloc_coherent().
>

Expose it as /dev/something (/dev/mem, /sys/.../pci/...) and mmap() it,
and it becomes non-pageable user memory.

Not sure about dma_alloc_coherent(); that is meaningless on x86.
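
E.g., from userspace, mapping the BAR through sysfs would look something
like this (sketch; the device path and size are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/bus/pci/devices/0000:01:00.0/resource1";
	size_t len = 1 << 20;		/* assume a 1MB BAR */
	void *bar;
	int fd = open(path, O_RDWR | O_SYNC);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	bar = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (bar == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * 'bar' is now ordinary, non-pageable user memory as far as the
	 * memory-slot machinery is concerned; register it like any other
	 * userspace region.
	 */
	munmap(bar, len);
	close(fd);
	return 0;
}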

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-10-07 08:12:49

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

On 10/06/2009 09:40 PM, Gregory Haskins wrote:
> Thinking about this some more over lunch, I think we (Avi and I) might
> both be wrong (and David is right). Avi is right that we don't need
> rmb() or barrier() for the reasons already stated, but I think David is
> right that we need an smp_mb() to ensure the cpu doesn't do the
> reordering. Otherwise a different cpu could invalidate the memory if it
> reuses the freed memory in the meantime, iiuc. IOW: its not a compiler
> issue but a cpu issue.
>
> Or am I still confused?
>
>

The sequence of operations is:

v = p->v;
f();
// rmb() ?
g(v);

You are worried that the compiler or cpu will fetch p->v after f() has
executed? The compiler may not, since it can't tell whether f() might
change p->v. If f() can cause another agent to write to p (by freeing
it to a global list, for example), then it is its responsibility to
issue the smp_rmb(), otherwise no calculation that took place before f()
and accessed p is safe.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-10-07 12:49:13

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

Avi Kivity wrote:
> On 10/06/2009 09:40 PM, Gregory Haskins wrote:
>> Thinking about this some more over lunch, I think we (Avi and I) might
>> both be wrong (and David is right). Avi is right that we don't need
>> rmb() or barrier() for the reasons already stated, but I think David is
>> right that we need an smp_mb() to ensure the cpu doesn't do the
>> reordering. Otherwise a different cpu could invalidate the memory if it
>> reuses the freed memory in the meantime, iiuc. IOW: its not a compiler
>> issue but a cpu issue.
>>
>> Or am I still confused?
>>
>>
>
> The sequence of operations is:
>
> v = p->v;
> f();
> // rmb() ?
> g(v);
>
> You are worried that the compiler

No

> or cpu will fetch p->v after f() has executed?

Yes.

> The compiler may not, since it can't tell whether f() might
> change p->v.

Right, you were correct to say my barrier() suggestion was wrong.

> If f() can cause another agent to write to p (by freeing
> it to a global list, for example), then it is its responsibility to
> issue the smp_rmb(), otherwise no calculation that took place before f()
> and accessed p is safe.
>

IOW: David is right. You need a cpu-barrier one way or the other. We
can either allow ->release() to imply one (and probably document it that
way, like we did for slow-work), or we can be explicit. I chose to be
explicit since it is kind of self-documenting, and there is no need to
be worried about performance since the release is slow-path.

OTOH: If you feel strongly about it, we can take it out, knowing that
most anything that properly invalidates the memory will likely include an
implicit barrier of some kind.

Kind Regards,
-Greg



2009-10-08 14:46:35

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] KVM: introduce "xinterface" API for external interaction with guests

On 10/07/2009 02:48 PM, Gregory Haskins wrote:
>
>> If f() can cause another agent to write to p (by freeing
>> it to a global list, for example), then it is its responsibility to
>> issue the smp_rmb(), otherwise no calculation that took place before f()
>> and accessed p is safe.
>>
>>
> IOW: David is right. You need a cpu-barrier one way or the other. We
> can either allow ->release() to imply one (and probably document it that
> way, like we did for slow-work), or we can be explicit.

No, ->release() must do it or it becomes impossible to program. And in
fact it will: to place the freed structure on a global list it must take
a lock, which implies an smp_rmb().

> I chose to be
> explicit since it is kind of self-documenting, and there is no need to
> be worried about performance since the release is slow-path.
>

It's so self-documenting I had no idea what it was there for.

--
error compiling committee.c: too many arguments to function