2009-06-25 13:28:23

by Gregory Haskins

[permalink] [raw]
Subject: [KVM PATCH v5 0/4] irqfd fixes and enhancements

(Applies to kvm.git/master:4631e094)

The following is the latest attempt to fix the races in irqfd/eventfd, as
well as restore DEASSIGN support. For more details, please read the patch
headers.

This series has been tested against the kvm-eventfd unit test, and
appears to be functioning properly. You can download this test here:

ftp://ftp.novell.com/dev/ghaskins/kvm-eventfd.tar.bz2

I've included version 4 of Davide's eventfd patch (ported to kvm.git) so
that it's a complete, reviewable series. Note, however, that there may be
later versions of his patch to consider for merging, so we should
coordinate with him.

-Greg

---

Davide Libenzi (1):
eventfd - revised interface and cleanups (4th rev)

Gregory Haskins (3):
KVM: add irqfd DEASSIGN feature
KVM: Fix races in irqfd using new eventfd_kref_get interface
kvm: prepare irqfd for having interrupts disabled during eventfd->release


drivers/lguest/lg.h | 2
drivers/lguest/lguest_user.c | 4 -
fs/aio.c | 24 +---
fs/eventfd.c | 126 ++++++++++++++++---
include/linux/aio.h | 4 -
include/linux/eventfd.h | 35 ++++-
include/linux/kvm.h | 2
include/linux/kvm_host.h | 7 +
virt/kvm/Kconfig | 1
virt/kvm/eventfd.c | 284 +++++++++++++++++++++++++++++++++---------
10 files changed, 379 insertions(+), 110 deletions(-)

--
Signature


2009-06-25 13:28:36

by Gregory Haskins

[permalink] [raw]
Subject: [KVM PATCH v5 1/4] kvm: prepare irqfd for having interrupts disabled during eventfd->release

We need to plug some race conditions on eventfd shutdown. In order to
do this, we need to change the context in which the release notification
is delivered so that the wqh lock is now held. However, there is currently
code in the release callback that assumes it can sleep.

We have a slight chicken-and-egg problem where we can't fix the race
without adding the lock, and we can't add the lock without breaking
the sleepy code. Normally we could deal with this by making both
changes in an atomic changeset. However, we want to keep the eventfd
and kvm specific changes isolated to ease the reviewer burden on upstream
eventfd (at the specific request of upstream). Therefore, we have this
intermediate patch.

This intermediate patch allows the release() method to work in an atomic
context, at the expense of correctness w.r.t. memory-leaks. Today we have
a race condition. With this patch applied we leak. Both issues will be
resolved later in the series. It is the author's opinion that a leak is
better for bisectability than the hang would be, should we leave the sleepy
code in place after the locking changeover.

Signed-off-by: Gregory Haskins <[email protected]>
---

virt/kvm/eventfd.c | 89 ++++++++++++++++------------------------------------
1 files changed, 27 insertions(+), 62 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index a9e7de7..9656027 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -28,7 +28,6 @@
#include <linux/file.h>
#include <linux/list.h>
#include <linux/eventfd.h>
-#include <linux/srcu.h>

/*
* --------------------------------------------------------------------
@@ -38,8 +37,6 @@
* --------------------------------------------------------------------
*/
struct _irqfd {
- struct mutex lock;
- struct srcu_struct srcu;
struct kvm *kvm;
int gsi;
struct list_head list;
@@ -53,48 +50,12 @@ static void
irqfd_inject(struct work_struct *work)
{
struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
- struct kvm *kvm;
- int idx;
+ struct kvm *kvm = irqfd->kvm;

- idx = srcu_read_lock(&irqfd->srcu);
-
- kvm = rcu_dereference(irqfd->kvm);
- if (kvm) {
- mutex_lock(&kvm->irq_lock);
- kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
- kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
- mutex_unlock(&kvm->irq_lock);
- }
-
- srcu_read_unlock(&irqfd->srcu, idx);
-}
-
-static void
-irqfd_disconnect(struct _irqfd *irqfd)
-{
- struct kvm *kvm;
-
- mutex_lock(&irqfd->lock);
-
- kvm = rcu_dereference(irqfd->kvm);
- rcu_assign_pointer(irqfd->kvm, NULL);
-
- mutex_unlock(&irqfd->lock);
-
- if (!kvm)
- return;
-
- mutex_lock(&kvm->lock);
- list_del(&irqfd->list);
- mutex_unlock(&kvm->lock);
-
- /*
- * It is important to not drop the kvm reference until the next grace
- * period because there might be lockless references in flight up
- * until then
- */
- synchronize_srcu(&irqfd->srcu);
- kvm_put_kvm(kvm);
+ mutex_lock(&kvm->irq_lock);
+ kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
+ kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
+ mutex_unlock(&kvm->irq_lock);
}

static int
@@ -103,26 +64,24 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
unsigned long flags = (unsigned long)key;

+ /*
+ * Assume we will be called with interrupts disabled
+ */
if (flags & POLLIN)
/*
- * The POLLIN wake_up is called with interrupts disabled.
- * Therefore we need to defer the IRQ injection until later
- * since we need to acquire the kvm->lock to do so.
+ * Defer the IRQ injection until later since we need to
+ * acquire the kvm->lock to do so.
*/
schedule_work(&irqfd->inject);

if (flags & POLLHUP) {
/*
- * The POLLHUP is called unlocked, so it theoretically should
- * be safe to remove ourselves from the wqh using the locked
- * variant of remove_wait_queue()
+ * for now, just remove ourselves from the list and let
+ * the rest dangle. We will fix this up later once
+ * the races in eventfd are fixed
*/
- remove_wait_queue(irqfd->wqh, &irqfd->wait);
- flush_work(&irqfd->inject);
- irqfd_disconnect(irqfd);
-
- cleanup_srcu_struct(&irqfd->srcu);
- kfree(irqfd);
+ __remove_wait_queue(irqfd->wqh, &irqfd->wait);
+ irqfd->wqh = NULL;
}

return 0;
@@ -150,8 +109,6 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
if (!irqfd)
return -ENOMEM;

- mutex_init(&irqfd->lock);
- init_srcu_struct(&irqfd->srcu);
irqfd->kvm = kvm;
irqfd->gsi = gsi;
INIT_LIST_HEAD(&irqfd->list);
@@ -172,8 +129,6 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)

events = file->f_op->poll(file, &irqfd->pt);

- kvm_get_kvm(kvm);
-
mutex_lock(&kvm->lock);
list_add_tail(&irqfd->list, &kvm->irqfds);
mutex_unlock(&kvm->lock);
@@ -211,6 +166,16 @@ kvm_irqfd_release(struct kvm *kvm)
{
struct _irqfd *irqfd, *tmp;

- list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list)
- irqfd_disconnect(irqfd);
+ list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
+ if (irqfd->wqh)
+ remove_wait_queue(irqfd->wqh, &irqfd->wait);
+
+ flush_work(&irqfd->inject);
+
+ mutex_lock(&kvm->lock);
+ list_del(&irqfd->list);
+ mutex_unlock(&kvm->lock);
+
+ kfree(irqfd);
+ }
}

2009-06-25 13:29:15

by Gregory Haskins

[permalink] [raw]
Subject: [KVM PATCH v5 2/4] eventfd - revised interface and cleanups (4th rev)

From: Davide Libenzi <[email protected]>

The following patch changes the eventfd interface to decouple the eventfd
memory context from the file pointer instance.
Without such a change, there is no clean, race-free way to handle the
POLLHUP event sent when the last instance of the file* goes away.
Also, the internal eventfd APIs now use the eventfd context instead
of the file*.
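
As a minimal sketch of the new context-based API (purely illustrative; the
struct my_client and its callers below are hypothetical and not part of this
patch), a kernel-side user now pins the eventfd context rather than the file:

#include <linux/err.h>
#include <linux/eventfd.h>

struct my_client {
        struct eventfd_ctx *trigger;
};

/* Bind to an eventfd handed in from userspace as a file descriptor. */
static int my_client_bind(struct my_client *c, int fd)
{
        struct eventfd_ctx *ctx;

        ctx = eventfd_ctx_fdget(fd);    /* references the context, not the file */
        if (IS_ERR(ctx))
                return PTR_ERR(ctx);

        c->trigger = ctx;
        return 0;
}

/* Kick userspace: add 1 to the counter and wake up any poll()ers. */
static void my_client_notify(struct my_client *c)
{
        eventfd_signal(c->trigger, 1);
}

/* Drop our reference; the context is freed only once the file and all
 * kernel-side references are gone. */
static void my_client_unbind(struct my_client *c)
{
        eventfd_ctx_put(c->trigger);
        c->trigger = NULL;
}

Because the reference is on the small eventfd_ctx rather than the file*,
an in-kernel user no longer keeps the descriptor's file pinned after
userspace closes it; it merely delays freeing of the context itself.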

Gregory, none of the APIs changed, so your patches do not need to be
rebased over this one. The last three revisions just updated comments.

Andrew, I'm not posting the "AIO select EVENTFD" patch at this time. Will
eventually try another push in your skull once I come back from vacation ;)

[gmh: resolved merge conflict against prior POLLHUP patch ]

Signed-off-by: Davide Libenzi <[email protected]>
Signed-off-by: Gregory Haskins <[email protected]>
---

drivers/lguest/lg.h | 2 -
drivers/lguest/lguest_user.c | 4 +
fs/aio.c | 24 ++------
fs/eventfd.c | 126 ++++++++++++++++++++++++++++++++++++------
include/linux/aio.h | 4 +
include/linux/eventfd.h | 35 +++++++++---
6 files changed, 147 insertions(+), 48 deletions(-)

diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h
index d4e8979..9c31382 100644
--- a/drivers/lguest/lg.h
+++ b/drivers/lguest/lg.h
@@ -82,7 +82,7 @@ struct lg_cpu {

struct lg_eventfd {
unsigned long addr;
- struct file *event;
+ struct eventfd_ctx *event;
};

struct lg_eventfd_map {
diff --git a/drivers/lguest/lguest_user.c b/drivers/lguest/lguest_user.c
index 32e2971..9f9a295 100644
--- a/drivers/lguest/lguest_user.c
+++ b/drivers/lguest/lguest_user.c
@@ -50,7 +50,7 @@ static int add_eventfd(struct lguest *lg, unsigned long addr, int fd)

/* Now append new entry. */
new->map[new->num].addr = addr;
- new->map[new->num].event = eventfd_fget(fd);
+ new->map[new->num].event = eventfd_ctx_fdget(fd);
if (IS_ERR(new->map[new->num].event)) {
kfree(new);
return PTR_ERR(new->map[new->num].event);
@@ -357,7 +357,7 @@ static int close(struct inode *inode, struct file *file)

/* Release any eventfds they registered. */
for (i = 0; i < lg->eventfds->num; i++)
- fput(lg->eventfds->map[i].event);
+ eventfd_ctx_put(lg->eventfds->map[i].event);
kfree(lg->eventfds);

/* If lg->dead doesn't contain an error code it will be NULL or a
diff --git a/fs/aio.c b/fs/aio.c
index 76da125..d065b2c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -485,6 +485,8 @@ static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
{
assert_spin_locked(&ctx->ctx_lock);

+ if (req->ki_eventfd != NULL)
+ eventfd_ctx_put(req->ki_eventfd);
if (req->ki_dtor)
req->ki_dtor(req);
if (req->ki_iovec != &req->ki_inline_vec)
@@ -509,8 +511,6 @@ static void aio_fput_routine(struct work_struct *data)
/* Complete the fput(s) */
if (req->ki_filp != NULL)
__fput(req->ki_filp);
- if (req->ki_eventfd != NULL)
- __fput(req->ki_eventfd);

/* Link the iocb into the context's free list */
spin_lock_irq(&ctx->ctx_lock);
@@ -528,8 +528,6 @@ static void aio_fput_routine(struct work_struct *data)
*/
static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
{
- int schedule_putreq = 0;
-
dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
req, atomic_long_read(&req->ki_filp->f_count));

@@ -549,24 +547,16 @@ static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
* we would not be holding the last reference to the file*, so
* this function will be executed w/out any aio kthread wakeup.
*/
- if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
- schedule_putreq++;
- else
- req->ki_filp = NULL;
- if (req->ki_eventfd != NULL) {
- if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
- schedule_putreq++;
- else
- req->ki_eventfd = NULL;
- }
- if (unlikely(schedule_putreq)) {
+ if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
get_ioctx(ctx);
spin_lock(&fput_lock);
list_add(&req->ki_list, &fput_head);
spin_unlock(&fput_lock);
queue_work(aio_wq, &fput_work);
- } else
+ } else {
+ req->ki_filp = NULL;
really_put_req(ctx, req);
+ }
return 1;
}

@@ -1622,7 +1612,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
* an eventfd() fd, and will be signaled for each completed
* event using the eventfd_signal() function.
*/
- req->ki_eventfd = eventfd_fget((int) iocb->aio_resfd);
+ req->ki_eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
if (IS_ERR(req->ki_eventfd)) {
ret = PTR_ERR(req->ki_eventfd);
req->ki_eventfd = NULL;
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 72f5f8d..31d12de 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -14,35 +14,44 @@
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/anon_inodes.h>
-#include <linux/eventfd.h>
#include <linux/syscalls.h>
#include <linux/module.h>
+#include <linux/kref.h>
+#include <linux/eventfd.h>

struct eventfd_ctx {
+ struct kref kref;
wait_queue_head_t wqh;
/*
* Every time that a write(2) is performed on an eventfd, the
* value of the __u64 being written is added to "count" and a
* wakeup is performed on "wqh". A read(2) will return the "count"
* value to userspace, and will reset "count" to zero. The kernel
- * size eventfd_signal() also, adds to the "count" counter and
+ * side eventfd_signal() also, adds to the "count" counter and
* issue a wakeup.
*/
__u64 count;
unsigned int flags;
};

-/*
- * Adds "n" to the eventfd counter "count". Returns "n" in case of
- * success, or a value lower then "n" in case of coutner overflow.
- * This function is supposed to be called by the kernel in paths
- * that do not allow sleeping. In this function we allow the counter
- * to reach the ULLONG_MAX value, and we signal this as overflow
- * condition by returining a POLLERR to poll(2).
+/**
+ * eventfd_signal - Adds @n to the eventfd counter.
+ * @ctx: [in] Pointer to the eventfd context.
+ * @n: [in] Value of the counter to be added to the eventfd internal counter.
+ * The value cannot be negative.
+ *
+ * This function is supposed to be called by the kernel in paths that do not
+ * allow sleeping. In this function we allow the counter to reach the ULLONG_MAX
+ * value, and we signal this as overflow condition by returining a POLLERR
+ * to poll(2).
+ *
+ * Returns @n in case of success, a non-negative number lower than @n in case
+ * of overflow, or the following error codes:
+ *
+ * -EINVAL : The value of @n is negative.
*/
-int eventfd_signal(struct file *file, int n)
+int eventfd_signal(struct eventfd_ctx *ctx, int n)
{
- struct eventfd_ctx *ctx = file->private_data;
unsigned long flags;

if (n < 0)
@@ -59,17 +68,45 @@ int eventfd_signal(struct file *file, int n)
}
EXPORT_SYMBOL_GPL(eventfd_signal);

+static void eventfd_free(struct kref *kref)
+{
+ struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+ kfree(ctx);
+}
+
+/**
+ * eventfd_ctx_get - Acquires a reference to the internal eventfd context.
+ * @ctx: [in] Pointer to the eventfd context.
+ *
+ * Returns: In case of success, returns a pointer to the eventfd context.
+ */
+struct eventfd_ctx *eventfd_ctx_get(struct eventfd_ctx *ctx)
+{
+ kref_get(&ctx->kref);
+ return ctx;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_get);
+
+/**
+ * eventfd_ctx_put - Releases a reference to the internal eventfd context.
+ * @ctx: [in] Pointer to eventfd context.
+ *
+ * The eventfd context reference must have been previously acquired either
+ * with eventfd_ctx_get() or eventfd_ctx_fdget()).
+ */
+void eventfd_ctx_put(struct eventfd_ctx *ctx)
+{
+ kref_put(&ctx->kref, eventfd_free);
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_put);
+
static int eventfd_release(struct inode *inode, struct file *file)
{
struct eventfd_ctx *ctx = file->private_data;

- /*
- * No need to hold the lock here, since we are on the file cleanup
- * path and the ones still attached to the wait queue will be
- * serialized by wake_up_locked_poll().
- */
- wake_up_locked_poll(&ctx->wqh, POLLHUP);
- kfree(ctx);
+ wake_up_poll(&ctx->wqh, POLLHUP);
+ eventfd_ctx_put(ctx);
return 0;
}

@@ -193,6 +230,16 @@ static const struct file_operations eventfd_fops = {
.write = eventfd_write,
};

+/**
+ * eventfd_fget - Acquire a reference of an eventfd file descriptor.
+ * @fd: [in] Eventfd file descriptor.
+ *
+ * Returns a pointer to the eventfd file structure in case of success, or the
+ * following error pointer:
+ *
+ * -EBADF : Invalid @fd file descriptor.
+ * -EINVAL : The @fd file descriptor is not an eventfd file.
+ */
struct file *eventfd_fget(int fd)
{
struct file *file;
@@ -209,6 +256,48 @@ struct file *eventfd_fget(int fd)
}
EXPORT_SYMBOL_GPL(eventfd_fget);

+/**
+ * eventfd_ctx_fdget - Acquires a reference to the internal eventfd context.
+ * @fd: [in] Eventfd file descriptor.
+ *
+ * Returns a pointer to the internal eventfd context, otherwise the error
+ * pointers returned by the following functions:
+ *
+ * eventfd_fget
+ */
+struct eventfd_ctx *eventfd_ctx_fdget(int fd)
+{
+ struct file *file;
+ struct eventfd_ctx *ctx;
+
+ file = eventfd_fget(fd);
+ if (IS_ERR(file))
+ return (struct eventfd_ctx *) file;
+ ctx = eventfd_ctx_get(file->private_data);
+ fput(file);
+
+ return ctx;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_fdget);
+
+/**
+ * eventfd_ctx_fileget - Acquires a reference to the internal eventfd context.
+ * @file: [in] Eventfd file pointer.
+ *
+ * Returns a pointer to the internal eventfd context, otherwise the error
+ * pointer:
+ *
+ * -EINVAL : The @fd file descriptor is not an eventfd file.
+ */
+struct eventfd_ctx *eventfd_ctx_fileget(struct file *file)
+{
+ if (file->f_op != &eventfd_fops)
+ return ERR_PTR(-EINVAL);
+
+ return eventfd_ctx_get(file->private_data);
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
+
SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
{
int fd;
@@ -225,6 +314,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
if (!ctx)
return -ENOMEM;

+ kref_init(&ctx->kref);
init_waitqueue_head(&ctx->wqh);
ctx->count = count;
ctx->flags = flags;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index b16a957..47f7d93 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -121,9 +121,9 @@ struct kiocb {

/*
* If the aio_resfd field of the userspace iocb is not zero,
- * this is the underlying file* to deliver event to.
+ * this is the underlying eventfd context to deliver events to.
*/
- struct file *ki_eventfd;
+ struct eventfd_ctx *ki_eventfd;
};

#define is_sync_kiocb(iocb) ((iocb)->ki_key == KIOCB_SYNC_KEY)
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index f45a8ae..3b85ba6 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -8,10 +8,8 @@
#ifndef _LINUX_EVENTFD_H
#define _LINUX_EVENTFD_H

-#ifdef CONFIG_EVENTFD
-
-/* For O_CLOEXEC and O_NONBLOCK */
#include <linux/fcntl.h>
+#include <linux/file.h>

/*
* CAREFUL: Check include/asm-generic/fcntl.h when defining
@@ -27,16 +25,37 @@
#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)

+#ifdef CONFIG_EVENTFD
+
+struct eventfd_ctx *eventfd_ctx_get(struct eventfd_ctx *ctx);
+void eventfd_ctx_put(struct eventfd_ctx *ctx);
struct file *eventfd_fget(int fd);
-int eventfd_signal(struct file *file, int n);
+struct eventfd_ctx *eventfd_ctx_fdget(int fd);
+struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
+int eventfd_signal(struct eventfd_ctx *ctx, int n);

#else /* CONFIG_EVENTFD */

-#define eventfd_fget(fd) ERR_PTR(-ENOSYS)
-static inline int eventfd_signal(struct file *file, int n)
-{ return 0; }
+/*
+ * Ugly ugly ugly error layer to support modules that uses eventfd but
+ * pretend to work in !CONFIG_EVENTFD configurations. Namely, AIO.
+ */
+static inline struct eventfd_ctx *eventfd_ctx_fdget(int fd)
+{
+ return ERR_PTR(-ENOSYS);
+}
+
+static inline int eventfd_signal(struct eventfd_ctx *ctx, int n)
+{
+ return -ENOSYS;
+}
+
+static inline void eventfd_ctx_put(struct eventfd_ctx *ctx)
+{
+
+}

-#endif /* CONFIG_EVENTFD */
+#endif

#endif /* _LINUX_EVENTFD_H */

2009-06-25 13:29:28

by Gregory Haskins

[permalink] [raw]
Subject: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface

eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
"release" callback. This lets eventfd clients know if the eventfd is about
to go away and is very useful particularly for in-kernel clients. However,
until recently it was not possible to use this feature of eventfd in a
race-free way. This patch utilizes a new eventfd interface to rectify
the problem.

Note that one final race is known to exist: the slow-work thread may race
with module removal. We are currently working with slow-work upstream
to fix this issue as well. Since the code prior to this patch also
races with module_put(), we are not making anything worse, but rather
shifting the cause of the race. Once the slow-work code is patched we
will be fixing the last remaining issue.
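
For readers unfamiliar with the mechanism, the pattern an in-kernel client
uses to receive the POLLIN/POLLHUP notifications is sketched below; it
mirrors what irqfd itself does in the diff that follows, and the my_listener
naming is hypothetical:

#include <linux/err.h>
#include <linux/eventfd.h>
#include <linux/file.h>
#include <linux/kernel.h>
#include <linux/poll.h>
#include <linux/wait.h>

struct my_listener {
        struct eventfd_ctx      *ctx;
        wait_queue_head_t       *wqh;
        wait_queue_t             wait;
        poll_table               pt;
};

static int
my_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
        struct my_listener *l = container_of(wait, struct my_listener, wait);
        unsigned long flags = (unsigned long)key;

        if (flags & POLLIN) {
                /* counter was signalled; kick any deferred work from here */
        }

        if (flags & POLLHUP)
                /*
                 * f_ops->release() is running and wqh->lock is held, so
                 * only the locked dequeue variant is safe here.
                 */
                __remove_wait_queue(l->wqh, &l->wait);

        return 0;
}

static void
my_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh, poll_table *pt)
{
        struct my_listener *l = container_of(pt, struct my_listener, pt);

        l->wqh = wqh;
        add_wait_queue(wqh, &l->wait);
}

static int my_listen(struct my_listener *l, struct file *eventfd_file)
{
        l->ctx = eventfd_ctx_fileget(eventfd_file);
        if (IS_ERR(l->ctx))
                return PTR_ERR(l->ctx);

        init_waitqueue_func_entry(&l->wait, my_wakeup);
        init_poll_funcptr(&l->pt, my_ptable_queue_proc);
        eventfd_file->f_op->poll(eventfd_file, &l->pt);

        return 0;
}

The teardown side (dropping the context reference and quiescing any deferred
work) is exactly what the rest of this patch deals with and is omitted here.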

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/kvm_host.h | 7 +-
virt/kvm/Kconfig | 1
virt/kvm/eventfd.c | 199 ++++++++++++++++++++++++++++++++++++++++------
3 files changed, 179 insertions(+), 28 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2451f48..d94ee72 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -141,7 +141,12 @@ struct kvm {
struct kvm_io_bus mmio_bus;
struct kvm_io_bus pio_bus;
#ifdef CONFIG_HAVE_KVM_EVENTFD
- struct list_head irqfds;
+ struct {
+ spinlock_t lock;
+ struct list_head items;
+ atomic_t outstanding;
+ wait_queue_head_t wqh;
+ } irqfds;
#endif
struct kvm_vm_stat stat;
struct kvm_arch arch;
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index daece36..ab7848a 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -9,6 +9,7 @@ config HAVE_KVM_IRQCHIP
config HAVE_KVM_EVENTFD
bool
select EVENTFD
+ select SLOW_WORK

config KVM_APIC_ARCHITECTURE
bool
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 9656027..ca21e8a 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -28,6 +28,7 @@
#include <linux/file.h>
#include <linux/list.h>
#include <linux/eventfd.h>
+#include <linux/slow-work.h>

/*
* --------------------------------------------------------------------
@@ -36,17 +37,36 @@
* Credit goes to Avi Kivity for the original idea.
* --------------------------------------------------------------------
*/
+
struct _irqfd {
struct kvm *kvm;
+ struct eventfd_ctx *eventfd;
int gsi;
struct list_head list;
poll_table pt;
wait_queue_head_t *wqh;
wait_queue_t wait;
struct work_struct inject;
+ struct slow_work shutdown;
+ int active:1;
};

static void
+irqfd_release(struct _irqfd *irqfd)
+{
+ eventfd_ctx_put(irqfd->eventfd);
+ kfree(irqfd);
+}
+
+/* assumes kvm->irqfds.lock is held */
+static void
+irqfd_deactivate(struct _irqfd *irqfd)
+{
+ irqfd->active = false;
+ list_del(&irqfd->list);
+}
+
+static void
irqfd_inject(struct work_struct *work)
{
struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
@@ -58,6 +78,55 @@ irqfd_inject(struct work_struct *work)
mutex_unlock(&kvm->irq_lock);
}

+static struct _irqfd *
+work_to_irqfd(struct slow_work *work)
+{
+ return container_of(work, struct _irqfd, shutdown);
+}
+
+static int
+irqfd_shutdown_get_ref(struct slow_work *work)
+{
+ struct _irqfd *irqfd = work_to_irqfd(work);
+ struct kvm *kvm = irqfd->kvm;
+
+ atomic_inc(&kvm->irqfds.outstanding);
+
+ return 0;
+}
+
+static void
+irqfd_shutdown_put_ref(struct slow_work *work)
+{
+ struct _irqfd *irqfd = work_to_irqfd(work);
+ struct kvm *kvm = irqfd->kvm;
+
+ irqfd_release(irqfd);
+
+ if (atomic_dec_and_test(&kvm->irqfds.outstanding))
+ wake_up(&kvm->irqfds.wqh);
+}
+
+static void
+irqfd_shutdown_execute(struct slow_work *work)
+{
+ struct _irqfd *irqfd = work_to_irqfd(work);
+
+ /*
+ * Ensure there are no outstanding "inject" work-items before we blow
+ * away our state. Once this job completes, the slow_work
+ * infrastructure will drop the irqfd object completely via put_ref
+ */
+ flush_work(&irqfd->inject);
+}
+
+const static struct slow_work_ops irqfd_shutdown_work_ops = {
+ .get_ref = irqfd_shutdown_get_ref,
+ .put_ref = irqfd_shutdown_put_ref,
+ .execute = irqfd_shutdown_execute,
+};
+
+
static int
irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
@@ -65,25 +134,39 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
unsigned long flags = (unsigned long)key;

/*
- * Assume we will be called with interrupts disabled
+ * Called with interrupts disabled
*/
if (flags & POLLIN)
- /*
- * Defer the IRQ injection until later since we need to
- * acquire the kvm->lock to do so.
- */
+ /* An event has been signaled, inject an interrupt */
schedule_work(&irqfd->inject);

if (flags & POLLHUP) {
- /*
- * for now, just remove ourselves from the list and let
- * the rest dangle. We will fix this up later once
- * the races in eventfd are fixed
- */
+ /* The eventfd is closing, detach from KVM */
+ struct kvm *kvm = irqfd->kvm;
+ unsigned long flags;
+
__remove_wait_queue(irqfd->wqh, &irqfd->wait);
- irqfd->wqh = NULL;
+
+ spin_lock_irqsave(&kvm->irqfds.lock, flags);
+
+ if (irqfd->active) {
+ /*
+ * If the item is still active we can be sure that
+ * no-one else is trying to shutdown this object at
+ * the same time.
+ *
+ * Defer the shutdown to a thread so we can flush
+ * all remaining inject jobs. We use a slow-work
+ * item to prevent a deadlock against the work-queue
+ */
+ irqfd_deactivate(irqfd);
+ slow_work_enqueue(&irqfd->shutdown);
+ }
+
+ spin_unlock_irqrestore(&kvm->irqfds.lock, flags);
}

+
return 0;
}

@@ -102,6 +185,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
{
struct _irqfd *irqfd;
struct file *file = NULL;
+ struct eventfd_ctx *eventfd = NULL;
int ret;
unsigned int events;

@@ -113,6 +197,8 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
irqfd->gsi = gsi;
INIT_LIST_HEAD(&irqfd->list);
INIT_WORK(&irqfd->inject, irqfd_inject);
+ slow_work_init(&irqfd->shutdown, &irqfd_shutdown_work_ops);
+ irqfd->active = true;

file = eventfd_fget(fd);
if (IS_ERR(file)) {
@@ -129,12 +215,21 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)

events = file->f_op->poll(file, &irqfd->pt);

- mutex_lock(&kvm->lock);
- list_add_tail(&irqfd->list, &kvm->irqfds);
- mutex_unlock(&kvm->lock);
+ spin_lock_irq(&kvm->irqfds.lock);
+ list_add_tail(&irqfd->list, &kvm->irqfds.items);
+ spin_unlock_irq(&kvm->irqfds.lock);
+
+ eventfd = eventfd_ctx_fileget(file);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ goto fail;
+ }
+
+ irqfd->eventfd = eventfd;

/*
- * Check if there was an event already queued
+ * Check if there was an event already pending on the eventfd
+ * before we registered, and trigger it as if we didn't miss it.
*/
if (events & POLLIN)
schedule_work(&irqfd->inject);
@@ -148,6 +243,9 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
return 0;

fail:
+ if (eventfd && !IS_ERR(eventfd))
+ eventfd_ctx_put(eventfd);
+
if (file && !IS_ERR(file))
fput(file);

@@ -158,24 +256,71 @@ fail:
void
kvm_irqfd_init(struct kvm *kvm)
{
- INIT_LIST_HEAD(&kvm->irqfds);
+ slow_work_register_user();
+
+ spin_lock_init(&kvm->irqfds.lock);
+ INIT_LIST_HEAD(&kvm->irqfds.items);
+ atomic_set(&kvm->irqfds.outstanding, 0);
+ init_waitqueue_head(&kvm->irqfds.wqh);
+}
+
+static struct _irqfd *
+irqfd_pop(struct kvm *kvm)
+{
+ struct _irqfd *irqfd = NULL;
+
+ spin_lock_irq(&kvm->irqfds.lock);
+
+ if (!list_empty(&kvm->irqfds.items)) {
+ irqfd = list_first_entry(&kvm->irqfds.items,
+ struct _irqfd, list);
+ irqfd_deactivate(irqfd);
+ }
+
+ spin_unlock_irq(&kvm->irqfds.lock);
+
+ return irqfd;
+}
+
+/*
+ * locally releases the irqfd
+ *
+ * This function is called when KVM won the race with eventfd (signalled by
+ * finding the item active on the kvm->irqfds.item list). We are now guaranteed
+ * that we will never schedule a deferred shutdown task against this object,
+ * so we take steps to perform the shutdown ourselves.
+ *
+ * 1) We must remove ourselves from the wait-queue to prevent further events,
+ * which will simultaneously act to sync us with eventfd (via wqh->lock)
+ * 2) Flush any outstanding inject-tasks to ensure its safe to free memory
+ * 3) Delete the object
+ */
+static void
+irqfd_shutdown(struct _irqfd *irqfd)
+{
+ remove_wait_queue(irqfd->wqh, &irqfd->wait);
+ flush_work(&irqfd->inject);
+ irqfd_release(irqfd);
}

void
kvm_irqfd_release(struct kvm *kvm)
{
- struct _irqfd *irqfd, *tmp;
-
- list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
- if (irqfd->wqh)
- remove_wait_queue(irqfd->wqh, &irqfd->wait);
+ struct _irqfd *irqfd;

- flush_work(&irqfd->inject);
+ /*
+ * Shutdown all irqfds that still remain
+ */
+ while ((irqfd = irqfd_pop(kvm)))
+ irqfd_shutdown(irqfd);

- mutex_lock(&kvm->lock);
- list_del(&irqfd->list);
- mutex_unlock(&kvm->lock);
+ /*
+ * irqfds.outstanding tracks the number of outstanding "shutdown"
+ * jobs pending at any given time. Once we get here, we know that
+ * no more jobs will get scheduled, so go ahead and block until all
+ * of them complete
+ */
+ wait_event(kvm->irqfds.wqh, (!atomic_read(&kvm->irqfds.outstanding)));

- kfree(irqfd);
- }
+ slow_work_unregister_user();
}

2009-06-25 13:29:41

by Gregory Haskins

[permalink] [raw]
Subject: [KVM PATCH v5 4/4] KVM: add irqfd DEASSIGN feature

DEASSIGN allows us to optionally disassociate an IRQFD from its underlying
eventfd without destroying the eventfd in the process. This is useful
for conditions like live migration, which may have an eventfd associated
with a device and an IRQFD. We need to be able to decouple the guest
from the event source while not perturbing the event source itself.
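
For illustration, here is a minimal userspace sketch of the intended call
(hedged: the vmfd/efd variables are hypothetical, and the layout of struct
kvm_irqfd beyond the fd/gsi/flags members used here is assumed rather than
shown in this patch):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Detach gsi from efd without closing the eventfd itself, so the event
 * source backing efd keeps running while the guest is decoupled from it.
 * vmfd is the KVM VM file descriptor; efd was previously bound with a
 * plain KVM_IRQFD call (flags == 0).
 */
static int irqfd_deassign(int vmfd, int efd, unsigned int gsi)
{
        struct kvm_irqfd data = {
                .fd    = efd,
                .gsi   = gsi,
                .flags = KVM_IRQFD_FLAG_DEASSIGN,
        };

        return ioctl(vmfd, KVM_IRQFD, &data);
}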

Signed-off-by: Gregory Haskins <[email protected]>
CC: Michael S. Tsirkin <[email protected]>
---

include/linux/kvm.h | 2 ++
virt/kvm/eventfd.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 38ff31e..6710518 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -490,6 +490,8 @@ struct kvm_x86_mce {
};
#endif

+#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+
struct kvm_irqfd {
__u32 fd;
__u32 gsi;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index ca21e8a..2d4549c 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -180,8 +180,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
add_wait_queue(wqh, &irqfd->wait);
}

-int
-kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
+static int
+kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
{
struct _irqfd *irqfd;
struct file *file = NULL;
@@ -303,6 +303,58 @@ irqfd_shutdown(struct _irqfd *irqfd)
irqfd_release(irqfd);
}

+/*
+ * assumes kvm->irqfds.lock is held
+ */
+static struct _irqfd *
+irqfd_find(struct kvm *kvm, int fd, int gsi)
+{
+ struct _irqfd *irqfd, *tmp, *ret = ERR_PTR(-ENOENT);
+ struct eventfd_ctx *eventfd;
+
+ eventfd = eventfd_ctx_fdget(fd);
+ if (IS_ERR(eventfd))
+ return ERR_PTR(PTR_ERR(eventfd));
+
+ spin_lock_irq(&kvm->irqfds.lock);
+
+ list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
+ if (irqfd->eventfd == eventfd && irqfd->gsi == gsi) {
+ irqfd_deactivate(irqfd);
+ ret = irqfd;
+ break;
+ }
+ }
+
+ spin_unlock_irq(&kvm->irqfds.lock);
+ eventfd_ctx_put(eventfd);
+
+ return ret;
+}
+
+static int
+kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
+{
+ struct _irqfd *irqfd;
+
+ irqfd = irqfd_find(kvm, fd, gsi);
+ if (IS_ERR(irqfd))
+ return PTR_ERR(irqfd);
+
+ irqfd_shutdown(irqfd);
+
+ return 0;
+}
+
+int
+kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
+{
+ if (flags & KVM_IRQFD_FLAG_DEASSIGN)
+ return kvm_irqfd_deassign(kvm, fd, gsi);
+
+ return kvm_irqfd_assign(kvm, fd, gsi);
+}
+
void
kvm_irqfd_release(struct kvm *kvm)
{

2009-06-25 13:59:38

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v5 0/4] irqfd fixes and enhancements

Gregory Haskins wrote:
> (Applies to kvm.git/master:4631e094)
>
> The following is the latest attempt to fix the races in irqfd/eventfd, as
> well as restore DEASSIGN support. For more details, please read the patch
> headers.
>
> This series has been tested against the kvm-eventfd unit test, and
> appears to be functioning properly. You can download this test here:
>
> ftp://ftp.novell.com/dev/ghaskins/kvm-eventfd.tar.bz2
>
> I've included version 4 of Davide's eventfd patch (ported to kvm.git) so
> that it's a complete, reviewable series. Note, however, that there may be
> later versions of his patch to consider for merging, so we should
> coordinate with him.
>

So I know we talked yesterday in the review session about whether it was
actually worth all this complexity to deal with the POLLHUP or if we
should just revert to the prior "two syscall" model and be done with
it. Rusty reflected these same sentiments this morning in response to
Davide's patch in a different thread.

I am a bit torn myself, tbh. I do feel as though I have a good handle
on the issue and that it is indeed now fixed (at least, if this series
is applied and the slow-work issue is fixed, still pending upstream
ACK). I have a lot invested in going the POLLHUP direction having spent
so much time thinking about the problem and working on the patches, so I
have a bit of a biased opinion, I know.

The reason why I am pushing this series out now is at least partly so we
can tie up these loose ends. We have both solutions in front of us and
can make a decision either way. At least the solution is formally
documented in the internet archives forever this way ;)

I took the review comments to heart that the shutdown code was
substantially larger and more complex than the actual fast-path code. I
went through it last night and simplified and clarified it. I think the
latest result is leaner and clearer, so please give it another review
(particularly for races) before dismissing it.

Ultimately, I think the concept of a release notification for eventfd is
a good thing for all eventfd users, so I don't think this thing should
go away per se even if irqfd decides to not use it.

-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-25 16:49:48

by Davide Libenzi

[permalink] [raw]
Subject: Re: [KVM PATCH v5 0/4] irqfd fixes and enhancements

On Thu, 25 Jun 2009, Gregory Haskins wrote:

> So I know we talked yesterday in the review session about whether it was
> actually worth all this complexity to deal with the POLLHUP or if we
> should just revert to the prior "two syscall" model and be done with
> it. Rusty reflected these same sentiments this morning in response to
> Davide's patch in a different thread.
>
> I am a bit torn myself, tbh. I do feel as though I have a good handle
> on the issue and that it is indeed now fixed (at least, if this series
> is applied and the slow-work issue is fixed, still pending upstream
> ACK). I have a lot invested in going the POLLHUP direction having spent
> so much time thinking about the problem and working on the patches, so I
> have a bit of a biased opinion, I know.
>
> The reason why I am pushing this series out now is at least partly so we
> can tie up these loose ends. We have both solutions in front of us and
> can make a decision either way. At least the solution is formally
> documented in the internet archives forever this way ;)
>
> I took the review comments to heart that the shutdown code was
> substantially larger and more complex than the actual fast-path code. I
> went through it last night and simplified and clarified it. I think the
> latest result is leaner and clearer, so please give it another review
> (particularly for races) before dismissing it.
>
> Ultimately, I think the concept of a release notification for eventfd is
> a good thing for all eventfd users, so I don't think this thing should
> go away per se even if irqfd decides to not use it.

Whatever you guys decide is fine for me. Next time though, I think I'll
wait a month or so after taking any action :)


- Davide

2009-06-26 14:06:15

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface

Gregory Haskins wrote:
> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> "release" callback. This lets eventfd clients know if the eventfd is about
> to go away and is very useful particularly for in-kernel clients. However,
> until recently it was not possible to use this feature of eventfd in a
> race-free way. This patch utilizes a new eventfd interface to rectify
> the problem.
>
> Note that one final race is known to exist: the slow-work thread may race
> with module removal. We are currently working with slow-work upstream
> to fix this issue as well. Since the code prior to this patch also
> races with module_put(), we are not making anything worse, but rather
> shifting the cause of the race. Once the slow-work code is patched we
> will be fixing the last remaining issue.
>
> Signed-off-by: Gregory Haskins <[email protected]>
> ---
>
> include/linux/kvm_host.h | 7 +-
> virt/kvm/Kconfig | 1
> virt/kvm/eventfd.c | 199 ++++++++++++++++++++++++++++++++++++++++------
> 3 files changed, 179 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 2451f48..d94ee72 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -141,7 +141,12 @@ struct kvm {
> struct kvm_io_bus mmio_bus;
> struct kvm_io_bus pio_bus;
> #ifdef CONFIG_HAVE_KVM_EVENTFD
> - struct list_head irqfds;
> + struct {
> + spinlock_t lock;
> + struct list_head items;
> + atomic_t outstanding;
> + wait_queue_head_t wqh;
> + } irqfds;
> #endif
> struct kvm_vm_stat stat;
> struct kvm_arch arch;
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index daece36..ab7848a 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -9,6 +9,7 @@ config HAVE_KVM_IRQCHIP
> config HAVE_KVM_EVENTFD
> bool
> select EVENTFD
> + select SLOW_WORK
>
> config KVM_APIC_ARCHITECTURE
> bool
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 9656027..ca21e8a 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -28,6 +28,7 @@
> #include <linux/file.h>
> #include <linux/list.h>
> #include <linux/eventfd.h>
> +#include <linux/slow-work.h>
>
> /*
> * --------------------------------------------------------------------
> @@ -36,17 +37,36 @@
> * Credit goes to Avi Kivity for the original idea.
> * --------------------------------------------------------------------
> */
> +
> struct _irqfd {
> struct kvm *kvm;
> + struct eventfd_ctx *eventfd;
> int gsi;
> struct list_head list;
> poll_table pt;
> wait_queue_head_t *wqh;
> wait_queue_t wait;
> struct work_struct inject;
> + struct slow_work shutdown;
> + int active:1;
> };
>
> static void
> +irqfd_release(struct _irqfd *irqfd)
> +{
> + eventfd_ctx_put(irqfd->eventfd);
> + kfree(irqfd);
> +}
> +
> +/* assumes kvm->irqfds.lock is held */
> +static void
> +irqfd_deactivate(struct _irqfd *irqfd)
> +{
> + irqfd->active = false;
> + list_del(&irqfd->list);
> +}
> +
> +static void
> irqfd_inject(struct work_struct *work)
> {
> struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> @@ -58,6 +78,55 @@ irqfd_inject(struct work_struct *work)
> mutex_unlock(&kvm->irq_lock);
> }
>
> +static struct _irqfd *
> +work_to_irqfd(struct slow_work *work)
> +{
> + return container_of(work, struct _irqfd, shutdown);
> +}
> +
> +static int
> +irqfd_shutdown_get_ref(struct slow_work *work)
> +{
> + struct _irqfd *irqfd = work_to_irqfd(work);
> + struct kvm *kvm = irqfd->kvm;
> +
> + atomic_inc(&kvm->irqfds.outstanding);
> +
> + return 0;
> +}
> +
> +static void
> +irqfd_shutdown_put_ref(struct slow_work *work)
> +{
> + struct _irqfd *irqfd = work_to_irqfd(work);
> + struct kvm *kvm = irqfd->kvm;
> +
> + irqfd_release(irqfd);
> +
> + if (atomic_dec_and_test(&kvm->irqfds.outstanding))
> + wake_up(&kvm->irqfds.wqh);
> +}
> +
> +static void
> +irqfd_shutdown_execute(struct slow_work *work)
> +{
> + struct _irqfd *irqfd = work_to_irqfd(work);
> +
> + /*
> + * Ensure there are no outstanding "inject" work-items before we blow
> + * away our state. Once this job completes, the slow_work
> + * infrastructure will drop the irqfd object completely via put_ref
> + */
> + flush_work(&irqfd->inject);
> +}
> +
> +const static struct slow_work_ops irqfd_shutdown_work_ops = {
> + .get_ref = irqfd_shutdown_get_ref,
> + .put_ref = irqfd_shutdown_put_ref,
> + .execute = irqfd_shutdown_execute,
> +};
> +
> +
> static int
> irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> {
> @@ -65,25 +134,39 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> unsigned long flags = (unsigned long)key;
>
> /*
> - * Assume we will be called with interrupts disabled
> + * Called with interrupts disabled
> */
> if (flags & POLLIN)
> - /*
> - * Defer the IRQ injection until later since we need to
> - * acquire the kvm->lock to do so.
> - */
> + /* An event has been signaled, inject an interrupt */
> schedule_work(&irqfd->inject);
>
> if (flags & POLLHUP) {
> - /*
> - * for now, just remove ourselves from the list and let
> - * the rest dangle. We will fix this up later once
> - * the races in eventfd are fixed
> - */
> + /* The eventfd is closing, detach from KVM */
> + struct kvm *kvm = irqfd->kvm;
> + unsigned long flags;
> +
> __remove_wait_queue(irqfd->wqh, &irqfd->wait);
> - irqfd->wqh = NULL;
> +
> + spin_lock_irqsave(&kvm->irqfds.lock, flags);
> +
> + if (irqfd->active) {
> + /*
> + * If the item is still active we can be sure that
> + * no-one else is trying to shutdown this object at
> + * the same time.
> + *
> + * Defer the shutdown to a thread so we can flush
> + * all remaining inject jobs. We use a slow-work
> + * item to prevent a deadlock against the work-queue
> + */
> + irqfd_deactivate(irqfd);
> + slow_work_enqueue(&irqfd->shutdown);
> + }
> +
> + spin_unlock_irqrestore(&kvm->irqfds.lock, flags);
> }
>
> +
> return 0;
> }
>
> @@ -102,6 +185,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> {
> struct _irqfd *irqfd;
> struct file *file = NULL;
> + struct eventfd_ctx *eventfd = NULL;
> int ret;
> unsigned int events;
>
> @@ -113,6 +197,8 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> irqfd->gsi = gsi;
> INIT_LIST_HEAD(&irqfd->list);
> INIT_WORK(&irqfd->inject, irqfd_inject);
> + slow_work_init(&irqfd->shutdown, &irqfd_shutdown_work_ops);
> + irqfd->active = true;
>
> file = eventfd_fget(fd);
> if (IS_ERR(file)) {
> @@ -129,12 +215,21 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>
> events = file->f_op->poll(file, &irqfd->pt);
>
> - mutex_lock(&kvm->lock);
> - list_add_tail(&irqfd->list, &kvm->irqfds);
> - mutex_unlock(&kvm->lock);
> + spin_lock_irq(&kvm->irqfds.lock);
> + list_add_tail(&irqfd->list, &kvm->irqfds.items);
> + spin_unlock_irq(&kvm->irqfds.lock);
> +
> + eventfd = eventfd_ctx_fileget(file);
> + if (IS_ERR(file)) {
> + ret = PTR_ERR(file);
> + goto fail;
> + }
> +
> + irqfd->eventfd = eventfd;
>

<sigh> I just noticed this while doing a self review: I need to assign
the eventfd context *before* putting the item on the list. Not sure why
I even did that. I suspect I re-arranged the code at the last minute
and didn't notice what a dumb thing I was doing.

So this will need at least a v6, but I will wait to hear if there are
any other comments from Michael et. al.

-Greg

>
> /*
> - * Check if there was an event already queued
> + * Check if there was an event already pending on the eventfd
> + * before we registered, and trigger it as if we didn't miss it.
> */
> if (events & POLLIN)
> schedule_work(&irqfd->inject);
> @@ -148,6 +243,9 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> return 0;
>
> fail:
> + if (eventfd && !IS_ERR(eventfd))
> + eventfd_ctx_put(eventfd);
> +
> if (file && !IS_ERR(file))
> fput(file);
>
> @@ -158,24 +256,71 @@ fail:
> void
> kvm_irqfd_init(struct kvm *kvm)
> {
> - INIT_LIST_HEAD(&kvm->irqfds);
> + slow_work_register_user();
> +
> + spin_lock_init(&kvm->irqfds.lock);
> + INIT_LIST_HEAD(&kvm->irqfds.items);
> + atomic_set(&kvm->irqfds.outstanding, 0);
> + init_waitqueue_head(&kvm->irqfds.wqh);
> +}
> +
> +static struct _irqfd *
> +irqfd_pop(struct kvm *kvm)
> +{
> + struct _irqfd *irqfd = NULL;
> +
> + spin_lock_irq(&kvm->irqfds.lock);
> +
> + if (!list_empty(&kvm->irqfds.items)) {
> + irqfd = list_first_entry(&kvm->irqfds.items,
> + struct _irqfd, list);
> + irqfd_deactivate(irqfd);
> + }
> +
> + spin_unlock_irq(&kvm->irqfds.lock);
> +
> + return irqfd;
> +}
> +
> +/*
> + * locally releases the irqfd
> + *
> + * This function is called when KVM won the race with eventfd (signalled by
> + * finding the item active on the kvm->irqfds.item list). We are now guaranteed
> + * that we will never schedule a deferred shutdown task against this object,
> + * so we take steps to perform the shutdown ourselves.
> + *
> + * 1) We must remove ourselves from the wait-queue to prevent further events,
> + * which will simultaneously act to sync us with eventfd (via wqh->lock)
> + * 2) Flush any outstanding inject-tasks to ensure its safe to free memory
> + * 3) Delete the object
> + */
> +static void
> +irqfd_shutdown(struct _irqfd *irqfd)
> +{
> + remove_wait_queue(irqfd->wqh, &irqfd->wait);
> + flush_work(&irqfd->inject);
> + irqfd_release(irqfd);
> }
>
> void
> kvm_irqfd_release(struct kvm *kvm)
> {
> - struct _irqfd *irqfd, *tmp;
> -
> - list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
> - if (irqfd->wqh)
> - remove_wait_queue(irqfd->wqh, &irqfd->wait);
> + struct _irqfd *irqfd;
>
> - flush_work(&irqfd->inject);
> + /*
> + * Shutdown all irqfds that still remain
> + */
> + while ((irqfd = irqfd_pop(kvm)))
> + irqfd_shutdown(irqfd);
>
> - mutex_lock(&kvm->lock);
> - list_del(&irqfd->list);
> - mutex_unlock(&kvm->lock);
> + /*
> + * irqfds.outstanding tracks the number of outstanding "shutdown"
> + * jobs pending at any given time. Once we get here, we know that
> + * no more jobs will get scheduled, so go ahead and block until all
> + * of them complete
> + */
> + wait_event(kvm->irqfds.wqh, (!atomic_read(&kvm->irqfds.outstanding)));
>
> - kfree(irqfd);
> - }
> + slow_work_unregister_user();
> }
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-28 10:47:03

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM PATCH v5 4/4] KVM: add irqfd DEASSIGN feature

On Thu, Jun 25, 2009 at 09:28:32AM -0400, Gregory Haskins wrote:
> DEASSIGN allows us to optionally disassociate an IRQFD from its underlying
> eventfd without destroying the eventfd in the process. This is useful
> for conditions like live migration, which may have an eventfd associated
> with a device and an IRQFD. We need to be able to decouple the guest
> from the event source while not perturbing the event source itself.
>
> Signed-off-by: Gregory Haskins <[email protected]>
> CC: Michael S. Tsirkin <[email protected]>
> ---
>
> include/linux/kvm.h | 2 ++
> virt/kvm/eventfd.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 56 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 38ff31e..6710518 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -490,6 +490,8 @@ struct kvm_x86_mce {
> };
> #endif
>
> +#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> +
> struct kvm_irqfd {
> __u32 fd;
> __u32 gsi;
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index ca21e8a..2d4549c 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -180,8 +180,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
> add_wait_queue(wqh, &irqfd->wait);
> }
>
> -int
> -kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> +static int
> +kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> {
> struct _irqfd *irqfd;
> struct file *file = NULL;
> @@ -303,6 +303,58 @@ irqfd_shutdown(struct _irqfd *irqfd)
> irqfd_release(irqfd);
> }
>
> +/*
> + * assumes kvm->irqfds.lock is held
> + */
> +static struct _irqfd *
> +irqfd_find(struct kvm *kvm, int fd, int gsi)
> +{
> + struct _irqfd *irqfd, *tmp, *ret = ERR_PTR(-ENOENT);
> + struct eventfd_ctx *eventfd;
> +
> + eventfd = eventfd_ctx_fdget(fd);
> + if (IS_ERR(eventfd))
> + return ERR_PTR(PTR_ERR(eventfd));
> +
> + spin_lock_irq(&kvm->irqfds.lock);
> +
> + list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
> + if (irqfd->eventfd == eventfd && irqfd->gsi == gsi) {
> + irqfd_deactivate(irqfd);
> + ret = irqfd;
> + break;
> + }
> + }
> +
> + spin_unlock_irq(&kvm->irqfds.lock);
> + eventfd_ctx_put(eventfd);
> +
> + return ret;
> +}
> +
> +static int
> +kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
> +{
> + struct _irqfd *irqfd;
> +
> + irqfd = irqfd_find(kvm, fd, gsi);
> + if (IS_ERR(irqfd))
> + return PTR_ERR(irqfd);
> +
> + irqfd_shutdown(irqfd);
> +
> + return 0;
> +}


I think that, to make this work properly, you must
add the irqfd to the list as the last thing you do.
As it is, when you assign irqfd, the last thing you do is

irqfd->eventfd = eventfd;

I think you should move this to within a spinlock.
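
In other words, a rough sketch of how the middle of kvm_irqfd_assign()
would be reordered (untested; it also assumes the IS_ERR check is made
against eventfd rather than file):

        eventfd = eventfd_ctx_fileget(file);
        if (IS_ERR(eventfd)) {
                ret = PTR_ERR(eventfd);
                goto fail;
        }

        irqfd->eventfd = eventfd;

        /*
         * Publish the irqfd only once it is fully initialized, so anything
         * walking kvm->irqfds.items (e.g. a concurrent deassign) never sees
         * a half-constructed entry.
         */
        spin_lock_irq(&kvm->irqfds.lock);
        list_add_tail(&irqfd->list, &kvm->irqfds.items);
        spin_unlock_irq(&kvm->irqfds.lock);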

> +
> +int
> +kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> +{
> + if (flags & KVM_IRQFD_FLAG_DEASSIGN)
> + return kvm_irqfd_deassign(kvm, fd, gsi);
> +
> + return kvm_irqfd_assign(kvm, fd, gsi);
> +}
> +


At some point we discussed limiting the number of
irqfds that can be created in some way, so that userspace
cannot consume an unlimited amount of kernel memory.

What happened to that idea?

This will happen naturally if
- we keep fget on the file until irqfd goes away
- we allow the same file to be bound to only one irqfd
but there might be other ways to do this

--
MST

2009-06-28 11:02:20

by Avi Kivity

[permalink] [raw]
Subject: Re: [KVM PATCH v5 0/4] irqfd fixes and enhancements

On 06/25/2009 04:59 PM, Gregory Haskins wrote:
> Gregory Haskins wrote:
>
>> (Applies to kvm.git/master:4631e094)
>>
>> The following is the latest attempt to fix the races in irqfd/eventfd, as
>> well as restore DEASSIGN support. For more details, please read the patch
>> headers.
>>
>> This series has been tested against the kvm-eventfd unit test, and
>> appears to be functioning properly. You can download this test here:
>>
>> ftp://ftp.novell.com/dev/ghaskins/kvm-eventfd.tar.bz2
>>
>> I've included version 4 of Davide's eventfd patch (ported to kvm.git) so
>> that it's a complete, reviewable series. Note, however, that there may be
>> later versions of his patch to consider for merging, so we should
>> coordinate with him.
>>
>>
>
> So I know we talked yesterday in the review session about whether it was
> actually worth all this complexity to deal with the POLLHUP or if we
> should just revert to the prior "two syscall" model and be done with
> it. Rusty reflected these same sentiments this morning in response to
> Davide's patch in a different thread.
>
> I am a bit torn myself, tbh. I do feel as though I have a good handle
> on the issue and that it is indeed now fixed (at least, if this series
> is applied and the slow-work issue is fixed, still pending upstream
> ACK). I have a lot invested in going the POLLHUP direction having spent
> so much time thinking about the problem and working on the patches, so I
> have a bit of a biased opinion, I know.
>
> The reason why I am pushing this series out now is at least partly so we
> can tie up these loose ends. We have both solutions in front of us and
> can make a decision either way. At least the solution is formally
> documented in the internet archives forever this way ;)
>
> I took the review comments to heart that the shutdown code was
> substantially larger and more complex than the actual fast-path code. I
> went through it last night and simplified and clarified it. I think the
> latest result is leaner and clearer, so please give it another review
> (particularly for races) before dismissing it.
>

Yes, it's much nicer. I can't say I'm certain it's race-free, but it's a
lot more digestible.

> Ultimately, I think the concept of a release notification for eventfd is
> a good thing for all eventfd users, so I don't think this thing should
> go away per se even if irqfd decides to not use it.
>

I agree that we want POLLHUP support; it's better than holding on to the
eventfd. But I think we can make it even cleaner by merging it with
deassign. Basically, when we get POLLHUP, we launch a slow_work (or
something) that does a regular deassign. That slow_work can grab a ref
to the vm, so we don't race with the VM disappearing.
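
A minimal sketch of that direction (untested; it reuses the work_to_irqfd()
and irqfd_release() helpers from this series and simply swaps the
outstanding counter for a VM reference via kvm_get_kvm()/kvm_put_kvm()):

static int
irqfd_shutdown_get_ref(struct slow_work *work)
{
        struct _irqfd *irqfd = work_to_irqfd(work);

        /* Pin the VM for the lifetime of the deferred deassign. */
        kvm_get_kvm(irqfd->kvm);

        return 0;
}

static void
irqfd_shutdown_put_ref(struct slow_work *work)
{
        struct _irqfd *irqfd = work_to_irqfd(work);
        struct kvm *kvm = irqfd->kvm;

        irqfd_release(irqfd);

        /* Safe to let the VM go once the irqfd is fully torn down. */
        kvm_put_kvm(kvm);
}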

But given that the current slow_work does almost nothing, I'm not sure
it's worth it.

--
error compiling committee.c: too many arguments to function

2009-06-28 11:07:34

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface

On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
> @@ -65,25 +134,39 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> unsigned long flags = (unsigned long)key;
>
> /*
> - * Assume we will be called with interrupts disabled
> + * Called with interrupts disabled
> */
> if (flags & POLLIN)
> - /*
> - * Defer the IRQ injection until later since we need to
> - * acquire the kvm->lock to do so.
> - */
> + /* An event has been signaled, inject an interrupt */
> schedule_work(&irqfd->inject);
>
> if (flags & POLLHUP) {
> - /*
> - * for now, just remove ourselves from the list and let
> - * the rest dangle. We will fix this up later once
> - * the races in eventfd are fixed
> - */
> + /* The eventfd is closing, detach from KVM */
> + struct kvm *kvm = irqfd->kvm;
> + unsigned long flags;
> +
> __remove_wait_queue(irqfd->wqh, &irqfd->wait);
> - irqfd->wqh = NULL;
> +
> + spin_lock_irqsave(&kvm->irqfds.lock, flags);
> +
> + if (irqfd->active) {
> + /*
> + * If the item is still active we can be sure that
> + * no-one else is trying to shutdown this object at
> + * the same time.
> + *
> + * Defer the shutdown to a thread so we can flush
> + * all remaining inject jobs. We use a slow-work
> + * item to prevent a deadlock against the work-queue
> + */
> + irqfd_deactivate(irqfd);
> + slow_work_enqueue(&irqfd->shutdown);

Greg, in your patch for slow-work module removal, you write:
"Callers must ensure that their module has at least
one reference held while the work is enqueued."
Where does this guarantee come from, in this case?

--
MST

2009-06-28 12:39:36

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v5 4/4] KVM: add irqfd DEASSIGN feature

Michael S. Tsirkin wrote:
> On Thu, Jun 25, 2009 at 09:28:32AM -0400, Gregory Haskins wrote:
>
>> DEASSIGN allows us to optionally disassociate an IRQFD from its underlying
>> eventfd without destroying the eventfd in the process. This is useful
>> for conditions like live migration, which may have an eventfd associated
>> with a device and an IRQFD. We need to be able to decouple the guest
>> from the event source while not perturbing the event source itself.
>>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> CC: Michael S. Tsirkin <[email protected]>
>> ---
>>
>> include/linux/kvm.h | 2 ++
>> virt/kvm/eventfd.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++--
>> 2 files changed, 56 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
>> index 38ff31e..6710518 100644
>> --- a/include/linux/kvm.h
>> +++ b/include/linux/kvm.h
>> @@ -490,6 +490,8 @@ struct kvm_x86_mce {
>> };
>> #endif
>>
>> +#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
>> +
>> struct kvm_irqfd {
>> __u32 fd;
>> __u32 gsi;
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> index ca21e8a..2d4549c 100644
>> --- a/virt/kvm/eventfd.c
>> +++ b/virt/kvm/eventfd.c
>> @@ -180,8 +180,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
>> add_wait_queue(wqh, &irqfd->wait);
>> }
>>
>> -int
>> -kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>> +static int
>> +kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
>> {
>> struct _irqfd *irqfd;
>> struct file *file = NULL;
>> @@ -303,6 +303,58 @@ irqfd_shutdown(struct _irqfd *irqfd)
>> irqfd_release(irqfd);
>> }
>>
>> +/*
>> + * assumes kvm->irqfds.lock is held
>> + */
>> +static struct _irqfd *
>> +irqfd_find(struct kvm *kvm, int fd, int gsi)
>> +{
>> + struct _irqfd *irqfd, *tmp, *ret = ERR_PTR(-ENOENT);
>> + struct eventfd_ctx *eventfd;
>> +
>> + eventfd = eventfd_ctx_fdget(fd);
>> + if (IS_ERR(eventfd))
>> + return ERR_PTR(PTR_ERR(eventfd));
>> +
>> + spin_lock_irq(&kvm->irqfds.lock);
>> +
>> + list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
>> + if (irqfd->eventfd == eventfd && irqfd->gsi == gsi) {
>> + irqfd_deactivate(irqfd);
>> + ret = irqfd;
>> + break;
>> + }
>> + }
>> +
>> + spin_unlock_irq(&kvm->irqfds.lock);
>> + eventfd_ctx_put(eventfd);
>> +
>> + return ret;
>> +}
>> +
>> +static int
>> +kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
>> +{
>> + struct _irqfd *irqfd;
>> +
>> + irqfd = irqfd_find(kvm, fd, gsi);
>> + if (IS_ERR(irqfd))
>> + return PTR_ERR(irqfd);
>> +
>> + irqfd_shutdown(irqfd);
>> +
>> + return 0;
>> +}
>>
>
>
> I think that, to make this work properly, you must
> add the irqfd to the list as the last thing you do.
> As it is, when you assign irqfd, the last thing you do is
>
> irqfd->eventfd = eventfd;
>

Yeah, I agree. I actually already replied to this effect on the thread
for 3/4. ;)

> I think you should move this to within a spinlock.
>

I think if I fix the ordering, the list spinlock should be sufficient.
Am I missing something?

>
>> +
>> +int
>> +kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>> +{
>> + if (flags & KVM_IRQFD_FLAG_DEASSIGN)
>> + return kvm_irqfd_deassign(kvm, fd, gsi);
>> +
>> + return kvm_irqfd_assign(kvm, fd, gsi);
>> +}
>> +
>>
>
>
> At some point we discussed limiting the number of
> irqfds that can be created in some way, so that userspace
> cannot consume an unlimited amount of kernel memory.
>
> What happened to that idea?
>

Yeah, that is a good question. I thought I had already done that, too,
but now I don't know what happened to the logic. Perhaps it got lost on
a respin somewhere. I will look into this and add the feature.
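
For concreteness, a purely hypothetical sketch of the kind of cap being
discussed (KVM_MAX_IRQFDS and the count field are not in the posted
series; they only illustrate one way to enforce a limit at assign time):

	#define KVM_MAX_IRQFDS	1024		/* made-up value */

	spin_lock_irq(&kvm->irqfds.lock);
	if (kvm->irqfds.count >= KVM_MAX_IRQFDS) {
		spin_unlock_irq(&kvm->irqfds.lock);
		return -EMFILE;			/* too many irqfds for this VM */
	}
	kvm->irqfds.count++;
	list_add_tail(&irqfd->list, &kvm->irqfds.items);
	spin_unlock_irq(&kvm->irqfds.lock);

The deassign/shutdown path would then decrement the count under the same
lock.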

> This will happen naturally if
> - we keep fget on the file until irqfd goes away
> - we allow the same file to be bound to only one irqfd
> but there might be other ways to do this
>
>




2009-06-28 12:50:59

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface

Michael S. Tsirkin wrote:
> On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
>
>> @@ -65,25 +134,39 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>> unsigned long flags = (unsigned long)key;
>>
>> /*
>> - * Assume we will be called with interrupts disabled
>> + * Called with interrupts disabled
>> */
>> if (flags & POLLIN)
>> - /*
>> - * Defer the IRQ injection until later since we need to
>> - * acquire the kvm->lock to do so.
>> - */
>> + /* An event has been signaled, inject an interrupt */
>> schedule_work(&irqfd->inject);
>>
>> if (flags & POLLHUP) {
>> - /*
>> - * for now, just remove ourselves from the list and let
>> - * the rest dangle. We will fix this up later once
>> - * the races in eventfd are fixed
>> - */
>> + /* The eventfd is closing, detach from KVM */
>> + struct kvm *kvm = irqfd->kvm;
>> + unsigned long flags;
>> +
>> __remove_wait_queue(irqfd->wqh, &irqfd->wait);
>> - irqfd->wqh = NULL;
>> +
>> + spin_lock_irqsave(&kvm->irqfds.lock, flags);
>> +
>> + if (irqfd->active) {
>> + /*
>> + * If the item is still active we can be sure that
>> + * no-one else is trying to shutdown this object at
>> + * the same time.
>> + *
>> + * Defer the shutdown to a thread so we can flush
>> + * all remaining inject jobs. We use a slow-work
>> + * item to prevent a deadlock against the work-queue
>> + */
>> + irqfd_deactivate(irqfd);
>> + slow_work_enqueue(&irqfd->shutdown);
>>
>
> Greg, in your patch for slow-work module removal, you write:
> "Callers must ensure that their module has at least
> one reference held while the work is enqueued."
> Where does this guarantee come from, in this case?
>
The general guarantee comes from the fact that modules naturally have to
have a reference to be able to call the enqueue function to begin with,
or the calling function was already racy. In this particular case, we
can guarantee that the kvm vm fd is held while our slow-work is active,
and all slow work is flushed before it is released. (I guess I am
assuming that VFS takes a module reference when an fd is opened, but I
have not verified that it actually does. If it doesn't, I suppose KVM
is already racy w.r.t. unloading, independent of my patches)

-Greg



2009-06-28 12:59:43

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v5 0/4] irqfd fixes and enhancements

Avi Kivity wrote:
> On 06/25/2009 04:59 PM, Gregory Haskins wrote:
>> Gregory Haskins wrote:
>>
>>> (Applies to kvm.git/master:4631e094)
>>>
>>> The following is the latest attempt to fix the races in
>>> irqfd/eventfd, as
>>> well as restore DEASSIGN support. For more details, please read the
>>> patch
>>> headers.
>>>
>>> This series has been tested against the kvm-eventfd unit test, and
>>> appears to be functioning properly. You can download this test here:
>>>
>>> ftp://ftp.novell.com/dev/ghaskins/kvm-eventfd.tar.bz2
>>>
>>> I've included version 4 of Davide's eventfd patch (ported to
>>> kvm.git) so
>>> that its a complete reviewable series. Note, however, that there
>>> may be
>>> later versions of his patch to consider for merging, so we should
>>> coordinate with him.
>>>
>>>
>>
>> So I know we talked yesterday in the review session about whether it was
>> actually worth all this complexity to deal with the POLLHUP or if we
>> should just revert to the prior "two syscall" model and be done with
>> it. Rusty reflected these same sentiments this morning in response to
>> Davide's patch in a different thread.
>>
>> I am a bit torn myself, tbh. I do feel as though I have a good handle
>> on the issue and that it is indeed now fixed (at least, if this series
>> is applied and the slow-work issue is fixed, still pending upstream
>> ACK). I have a lot invested in going the POLLHUP direction having spent
>> so much time thinking about the problem and working on the patches, so I
>> have a bit of a biased opinion, I know.
>>
>> The reason why I am pushing this series out now is at least partly so we
>> can tie up these loose ends. We have both solutions in front of us and
>> can make a decision either way. At least the solution is formally
>> documented in the internet archives forever this way ;)
>>
>> I took the review comments to heart that the shutdown code was
>> substantially larger and more complex than the actual fast-path code. I
>> went through it last night and simplified and clarified it. I think the
>> latest result is leaner and clearer, so please give it another review
>> (particularly for races) before dismissing it.
>>
>
> Yes, it's much nicer. I can't say I'm certain it's race free but it's
> a lot more digestible.
>
>> Ultimately, I think the concept of a release notification for eventfd is
>> a good thing for all eventfd users, so I don't think this thing should
>> go away per se even if irqfd decides to not use it.
>>
>
> I agree that we want POLLHUP support, it's better than holding on to
> the eventfd. But I think we can make it even cleaner by merging it
> with deassign. Basically, when we get POLLHUP, we launch a slow_work
> (or something) that does a regular deassign. That slow_work can grab
> a ref to the vm, so we don't race with the VM disappearing.
>
> But given that the current slow_work does almost nothing, I'm not sure
> it's worth it.

Yeah, and also note that the algorithm to unhook each side is not quite
symmetrical. I think I've captured all the common parts (in things like
irqfd_deactivate(), etc). A minor change in kvm_irqfd_release() could
technically use a deferred job to release instead of doing it inline,
but I do not think it buys us very much to do so (as you pointed out,
the deferred part is actually fairly simple). The important parts of the
protocol lie outside of the work we can do in the work-item anyway.


-Greg



2009-06-28 13:19:42

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface

On Sun, Jun 28, 2009 at 08:50:28AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
> >
> >> @@ -65,25 +134,39 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> >> unsigned long flags = (unsigned long)key;
> >>
> >> /*
> >> - * Assume we will be called with interrupts disabled
> >> + * Called with interrupts disabled
> >> */
> >> if (flags & POLLIN)
> >> - /*
> >> - * Defer the IRQ injection until later since we need to
> >> - * acquire the kvm->lock to do so.
> >> - */
> >> + /* An event has been signaled, inject an interrupt */
> >> schedule_work(&irqfd->inject);
> >>
> >> if (flags & POLLHUP) {
> >> - /*
> >> - * for now, just remove ourselves from the list and let
> >> - * the rest dangle. We will fix this up later once
> >> - * the races in eventfd are fixed
> >> - */
> >> + /* The eventfd is closing, detach from KVM */
> >> + struct kvm *kvm = irqfd->kvm;
> >> + unsigned long flags;
> >> +
> >> __remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >> - irqfd->wqh = NULL;
> >> +
> >> + spin_lock_irqsave(&kvm->irqfds.lock, flags);
> >> +
> >> + if (irqfd->active) {
> >> + /*
> >> + * If the item is still active we can be sure that
> >> + * no-one else is trying to shutdown this object at
> >> + * the same time.
> >> + *
> >> + * Defer the shutdown to a thread so we can flush
> >> + * all remaining inject jobs. We use a slow-work
> >> + * item to prevent a deadlock against the work-queue
> >> + */
> >> + irqfd_deactivate(irqfd);
> >> + slow_work_enqueue(&irqfd->shutdown);
> >>
> >
> > Greg, in your patch for slow-work module removal, you write:
> > "Callers must ensure that their module has at least
> > one reference held while the work is enqueued."
> > Where does this guarantee come from, in this case?
> >
> The general guarantee comes from the fact that modules naturally have to
> have a reference to be able to call the enqueue function to begin with,
> or the calling function was already racy. In this particular case, we
> can guarantee that the kvm vm fd is held while our slow-work is active,
> and all slow work is flushed before it is released. (I guess I am
> assuming that VFS takes a module reference when an fd is opened, but I
> have not verified that it actually does. If it doesn't, I suppose KVM
> is already racy w.r.t. unloading, independent of my patches)
>
> -Greg
>

That could be the case, as we have, for example:

static struct file_operations kvm_vm_fops = {
.release = kvm_vm_release,
.unlocked_ioctl = kvm_vm_ioctl,
.compat_ioctl = kvm_vm_ioctl,
.mmap = kvm_vm_mmap,
};

with no owner field.

Avi, shouldn't we initialize the owner field to prevent
the kvm module from going away while files are open?

--
MST

2009-06-28 13:24:03

by Avi Kivity

[permalink] [raw]
Subject: Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface

On 06/28/2009 04:18 PM, Michael S. Tsirkin wrote:
>
> that could be the case, as we have, for example:
>
> static struct file_operations kvm_vm_fops = {
> .release = kvm_vm_release,
> .unlocked_ioctl = kvm_vm_ioctl,
> .compat_ioctl = kvm_vm_ioctl,
> .mmap = kvm_vm_mmap,
> };
>
> with no owner field.
>
> Avi, shouldn't we initialize the owner field to prevent
> the kvm module from going away while files are open?
>

We do initialize it:

kvm_chardev_ops.owner = module;
kvm_vm_fops.owner = module;
kvm_vcpu_fops.owner = module;

The reason it's not done through the initializer is that we set the
owner to the vendor module (e.g. kvm-intel.ko) so that you can't remove
the vendor module when a guest is running.
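
For reference, the VFS side of why setting .owner is enough, roughly
(paraphrasing fops_get()/fops_put() from include/linux/fs.h; not code
from this series): the open path takes a reference on the fops owner and
__fput() drops it after ->release(), which is what keeps the owner module
pinned while fds are open.

	#define fops_get(fops) \
		(((fops) && try_module_get((fops)->owner) ? (fops) : NULL))
	#define fops_put(fops) \
		do { if (fops) module_put((fops)->owner); } while (0)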


--
error compiling committee.c: too many arguments to function

2009-06-28 13:38:50

by Avi Kivity

[permalink] [raw]
Subject: Re: [KVM PATCH v5 0/4] irqfd fixes and enhancements

On 06/28/2009 03:59 PM, Gregory Haskins wrote:
>> I agree that we want POLLHUP support, it's better than holding on to
>> the eventfd. But I think we can make it even cleaner by merging it
>> with deassign. Basically, when we get POLLHUP, we launch a slow_work
>> (or something) that does a regular deassign. That slow_work can grab
>> a ref to the vm, so we don't race with the VM disappearing.
>>
>> But given that the current slow_work does almost nothing, I'm not sure
>> it's worth it.
>>
>
> Yeah, and also note that the algorithm to unhook each side is not quite
> symmetrical. I think I've captured all the common parts (in things like
> irqfd_deactivate(), etc). A minor change in kvm_irqfd_release() could
> technically use a deferred job to release instead of doing it inline,
> but I do not think it buys us very much to do so (as you pointed out,
> the deferred part is actually fairly simple). The important parts of the
> protocol lie outside of the work we can do in the work-item anyway.
>

Is the case of deassign vs POLLHUP covered?

Reusing deassign in POLLHUP at least makes it easy to verify that it is.
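
For concreteness, a rough sketch of that shape (irqfd_deassign_one() is a
hypothetical helper shared with the ioctl path; kvm_get_kvm()/kvm_put_kvm()
and the slow-work hook exist, but none of this is code from the posted
series):

	/* In irqfd_wakeup(), on POLLHUP: pin the VM and defer. */
	if (flags & POLLHUP) {
		__remove_wait_queue(irqfd->wqh, &irqfd->wait);
		kvm_get_kvm(irqfd->kvm);
		slow_work_enqueue(&irqfd->shutdown);
	}

	/* Deferred side: reuse the exact teardown the ioctl uses. */
	static void irqfd_shutdown_execute(struct slow_work *work)
	{
		struct _irqfd *irqfd = container_of(work, struct _irqfd, shutdown);
		struct kvm *kvm = irqfd->kvm;

		irqfd_deassign_one(kvm, irqfd);	/* same path as KVM_IRQFD_FLAG_DEASSIGN */
		kvm_put_kvm(kvm);
	}

With that structure, POLLHUP and the DEASSIGN ioctl are equivalent by
construction, which makes the "deassign vs POLLHUP" race question easy to
answer.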

--
error compiling committee.c: too many arguments to function