2009-07-02 15:38:23

by Gregory Haskins

Subject: [KVM PATCH v9 0/5] irqfd fixes and enhancements

(Applies to kvm.git/master:1f9050fd)

The following is the latest attempt to fix the races in irqfd/eventfd, as
well as restore DEASSIGN support. For more details, please read the patch
headers.

As always, this series has been tested against the kvm-eventfd unit test
and everything appears to be functioning properly. You can download this
test here:

ftp://ftp.novell.com/dev/ghaskins/kvm-eventfd.tar.bz2

Kind Regards,
-Greg

[Changelog:

v9:
*) Re-split v8:1/3 into v9[1/5, 2/5, 3/5]
*) Fixed bad error handling in 3/5
*) Changed "init" bitfield in 5/5 to a bool
*) Rebased to kvm.git/master:1f9050fd

v8:
*) Rebased to kvm.git/master:beeaacd1
*) Dropped Davide's patch (2/5 in v7) since it's now upstream
*) Folded v7's 1/5 and 3/5 together, and added a single
eventfd hunk to convert wake_up_locked_polled to wake_up_polled
*) Dropped irqfd->active bit in favor of irqfd_is_active() function
*) Cleaned up comments in 1/3
*) Dropped v7's 5/5 (slow-work)
*) Added new patch (3/3) which makes the cleanup-wq's creation
dynamic so to avoid the resource penalty for guests that do
not use irqfd.

v7:
*) Addressed minor-nit feedback from Michael
*) Cleaned up patch headers
*) Re-added separate slow-work feature patch to end for comparison

v6:
*) Removed slow-work in favor of using a dedicated single-thread
workqueue.
*) Condensed cleanup path to always use deferred shutdown
*) Saved about 56 lines over v5, with the following diffstat:

include/linux/kvm_host.h | 2
virt/kvm/eventfd.c | 248 ++++++++++++++++++-----------------------------
2 files changed, 97 insertions(+), 153 deletions(-)
v5:
Untracked..
]


---

Gregory Haskins (5):
KVM: create irqfd-cleanup-wq on demand
KVM: add irqfd DEASSIGN feature
KVM: Fix races in irqfd using new eventfd_kref_get interface
eventfd: use locked POLLHUP
kvm: prepare irqfd for having interrupts disabled during eventfd->release


fs/eventfd.c | 7 -
include/linux/kvm.h | 2
include/linux/kvm_host.h | 6 +
virt/kvm/eventfd.c | 281 ++++++++++++++++++++++++++++++++++++----------
4 files changed, 229 insertions(+), 67 deletions(-)

--
Signature


2009-07-02 15:38:37

by Gregory Haskins

Subject: [KVM PATCH v9 1/5] kvm: prepare irqfd for having interrupts disabled during eventfd->release

We need to plug some race conditions on eventfd shutdown. In order to
do this, we need to change the context in which the release notification
is delivered so that the wqh lock is now held. However, there is currently
code in the release callback that assumes it can sleep.

We have a slight chicken-and-egg problem: we can't fix the race
without adding the lock, and we can't add the lock without breaking
the sleepy code. Normally we could deal with this by making both
changes in an atomic changeset. However, we want to keep the eventfd
and kvm specific changes isolated to ease the reviewer burden on upstream
eventfd (at the specific request of upstream). Therefore, we have this
intermediate patch.

This intermediate patch allows the release() method to work in an atomic
context, at the expense of correctness w.r.t. memory-leaks. Today we have
a race condition. With this patch applied we leak. Both issues will be
resolved later in the series. It is the author's opinion that a leak is
better for bisectability than the hang we would see if we left the sleepy
code in place after the locking changeover.

Signed-off-by: Gregory Haskins <[email protected]>
---

virt/kvm/eventfd.c | 89 ++++++++++++++++------------------------------------
1 files changed, 27 insertions(+), 62 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index a9e7de7..9656027 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -28,7 +28,6 @@
#include <linux/file.h>
#include <linux/list.h>
#include <linux/eventfd.h>
-#include <linux/srcu.h>

/*
* --------------------------------------------------------------------
@@ -38,8 +37,6 @@
* --------------------------------------------------------------------
*/
struct _irqfd {
- struct mutex lock;
- struct srcu_struct srcu;
struct kvm *kvm;
int gsi;
struct list_head list;
@@ -53,48 +50,12 @@ static void
irqfd_inject(struct work_struct *work)
{
struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
- struct kvm *kvm;
- int idx;
+ struct kvm *kvm = irqfd->kvm;

- idx = srcu_read_lock(&irqfd->srcu);
-
- kvm = rcu_dereference(irqfd->kvm);
- if (kvm) {
- mutex_lock(&kvm->irq_lock);
- kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
- kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
- mutex_unlock(&kvm->irq_lock);
- }
-
- srcu_read_unlock(&irqfd->srcu, idx);
-}
-
-static void
-irqfd_disconnect(struct _irqfd *irqfd)
-{
- struct kvm *kvm;
-
- mutex_lock(&irqfd->lock);
-
- kvm = rcu_dereference(irqfd->kvm);
- rcu_assign_pointer(irqfd->kvm, NULL);
-
- mutex_unlock(&irqfd->lock);
-
- if (!kvm)
- return;
-
- mutex_lock(&kvm->lock);
- list_del(&irqfd->list);
- mutex_unlock(&kvm->lock);
-
- /*
- * It is important to not drop the kvm reference until the next grace
- * period because there might be lockless references in flight up
- * until then
- */
- synchronize_srcu(&irqfd->srcu);
- kvm_put_kvm(kvm);
+ mutex_lock(&kvm->irq_lock);
+ kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
+ kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
+ mutex_unlock(&kvm->irq_lock);
}

static int
@@ -103,26 +64,24 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
unsigned long flags = (unsigned long)key;

+ /*
+ * Assume we will be called with interrupts disabled
+ */
if (flags & POLLIN)
/*
- * The POLLIN wake_up is called with interrupts disabled.
- * Therefore we need to defer the IRQ injection until later
- * since we need to acquire the kvm->lock to do so.
+ * Defer the IRQ injection until later since we need to
+ * acquire the kvm->lock to do so.
*/
schedule_work(&irqfd->inject);

if (flags & POLLHUP) {
/*
- * The POLLHUP is called unlocked, so it theoretically should
- * be safe to remove ourselves from the wqh using the locked
- * variant of remove_wait_queue()
+ * for now, just remove ourselves from the list and let
+ * the rest dangle. We will fix this up later once
+ * the races in eventfd are fixed
*/
- remove_wait_queue(irqfd->wqh, &irqfd->wait);
- flush_work(&irqfd->inject);
- irqfd_disconnect(irqfd);
-
- cleanup_srcu_struct(&irqfd->srcu);
- kfree(irqfd);
+ __remove_wait_queue(irqfd->wqh, &irqfd->wait);
+ irqfd->wqh = NULL;
}

return 0;
@@ -150,8 +109,6 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
if (!irqfd)
return -ENOMEM;

- mutex_init(&irqfd->lock);
- init_srcu_struct(&irqfd->srcu);
irqfd->kvm = kvm;
irqfd->gsi = gsi;
INIT_LIST_HEAD(&irqfd->list);
@@ -172,8 +129,6 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)

events = file->f_op->poll(file, &irqfd->pt);

- kvm_get_kvm(kvm);
-
mutex_lock(&kvm->lock);
list_add_tail(&irqfd->list, &kvm->irqfds);
mutex_unlock(&kvm->lock);
@@ -211,6 +166,16 @@ kvm_irqfd_release(struct kvm *kvm)
{
struct _irqfd *irqfd, *tmp;

- list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list)
- irqfd_disconnect(irqfd);
+ list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
+ if (irqfd->wqh)
+ remove_wait_queue(irqfd->wqh, &irqfd->wait);
+
+ flush_work(&irqfd->inject);
+
+ mutex_lock(&kvm->lock);
+ list_del(&irqfd->list);
+ mutex_unlock(&kvm->lock);
+
+ kfree(irqfd);
+ }
}

2009-07-02 15:39:23

by Gregory Haskins

Subject: [KVM PATCH v9 4/5] KVM: add irqfd DEASSIGN feature

DEASSIGN allows us to optionally disassociate an IRQFD from its underlying
eventfd without destroying the eventfd in the process. This is useful
for conditions like live-migration which may have an eventfd associated
with a device and an IRQFD. We need to be able to decouple the guest
from the event source while not perturbing the event source itself.
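[Editor's sketch: the deassign matching rule above can be illustrated in
plain userspace C. The names (`binding`, `deassign`) are illustrative, not
KVM code; only entries matching both the fd and the gsi are deactivated,
which is what lets the eventfd itself survive the operation.]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for struct _irqfd: one eventfd->GSI binding. */
struct binding {
	int eventfd;	/* the eventfd backing this entry */
	int gsi;	/* the guest interrupt line it is wired to */
	int active;	/* 1 while assigned, 0 once deassigned */
};

/*
 * Deactivate every binding that matches BOTH the fd and the gsi,
 * returning how many were hit.  Entries bound to a different gsi or a
 * different eventfd are left untouched.
 */
static int deassign(struct binding *tab, size_t n, int eventfd, int gsi)
{
	int hits = 0;

	for (size_t i = 0; i < n; i++) {
		if (tab[i].active && tab[i].eventfd == eventfd &&
		    tab[i].gsi == gsi) {
			tab[i].active = 0;
			hits++;
		}
	}
	return hits;
}
```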

Signed-off-by: Gregory Haskins <[email protected]>
CC: Michael S. Tsirkin <[email protected]>
---

include/linux/kvm.h | 2 ++
virt/kvm/eventfd.c | 46 ++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 69d3d73..76c6408 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -461,6 +461,8 @@ struct kvm_x86_mce {
};
#endif

+#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+
struct kvm_irqfd {
__u32 fd;
__u32 gsi;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 9da9286..4092b8d 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -161,8 +161,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
add_wait_queue(wqh, &irqfd->wait);
}

-int
-kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
+static int
+kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
{
struct _irqfd *irqfd;
struct file *file = NULL;
@@ -241,6 +241,48 @@ kvm_irqfd_init(struct kvm *kvm)
}

/*
+ * shutdown any irqfd's that match fd+gsi
+ */
+static int
+kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
+{
+ struct _irqfd *irqfd, *tmp;
+ struct eventfd_ctx *eventfd;
+
+ eventfd = eventfd_ctx_fdget(fd);
+ if (IS_ERR(eventfd))
+ return PTR_ERR(eventfd);
+
+ spin_lock_irq(&kvm->irqfds.lock);
+
+ list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
+ if (irqfd->eventfd == eventfd && irqfd->gsi == gsi)
+ irqfd_deactivate(irqfd);
+ }
+
+ spin_unlock_irq(&kvm->irqfds.lock);
+ eventfd_ctx_put(eventfd);
+
+ /*
+ * Block until we know all outstanding shutdown jobs have completed
+ * so that we guarantee there will not be any more interrupts on this
+ * gsi once this deassign function returns.
+ */
+ flush_workqueue(irqfd_cleanup_wq);
+
+ return 0;
+}
+
+int
+kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
+{
+ if (flags & KVM_IRQFD_FLAG_DEASSIGN)
+ return kvm_irqfd_deassign(kvm, fd, gsi);
+
+ return kvm_irqfd_assign(kvm, fd, gsi);
+}
+
+/*
* This function is called as the kvm VM fd is being released. Shutdown all
* irqfds that still remain open
*/

2009-07-02 15:38:50

by Gregory Haskins

Subject: [KVM PATCH v9 2/5] eventfd: use locked POLLHUP

eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
"release" callback. This lets eventfd clients know if the eventfd is about
to go away and is very useful particularly for in-kernel clients. However,
as it stands today it is not possible to use this feature of eventfd in a
race-free way. This patch changes the POLLHUP code to use the locked variant
to rectify this problem.
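[Editor's sketch: a userspace analogue of why the locked wakeup matters,
using pthreads rather than the kernel wait-queue code. The mutex plays the
role of wqh->lock and the `hangup` flag plays the role of POLLHUP: because
the signaller sets the flag and wakes waiters while holding the lock, a
waiter can never consume the wakeup without also observing the flag.]

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool hangup;		/* analogue of the POLLHUP condition */

/* A client blocked until the "hangup" notification arrives. */
static void *waiter(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!hangup)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* The "locked" wakeup: state change and wakeup serialized by the lock. */
static void signal_hangup(void)
{
	pthread_mutex_lock(&lock);
	hangup = true;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
}
```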

Signed-off-by: Gregory Haskins <[email protected]>
CC: Davide Libenzi <[email protected]>
---

fs/eventfd.c | 7 +------
1 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index d9849a1..31d12de 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -105,12 +105,7 @@ static int eventfd_release(struct inode *inode, struct file *file)
{
struct eventfd_ctx *ctx = file->private_data;

- /*
- * No need to hold the lock here, since we are on the file cleanup
- * path and the ones still attached to the wait queue will be
- * serialized by wake_up_locked_poll().
- */
- wake_up_locked_poll(&ctx->wqh, POLLHUP);
+ wake_up_poll(&ctx->wqh, POLLHUP);
eventfd_ctx_put(ctx);
return 0;
}

2009-07-02 15:39:04

by Gregory Haskins

Subject: [KVM PATCH v9 3/5] KVM: Fix races in irqfd using new eventfd_kref_get interface

eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
"release" callback. This lets eventfd clients know if the eventfd is about
to go away and is very useful particularly for in-kernel clients. However,
until recently it was not possible to use this feature of eventfd in a
race-free way.

This patch utilizes a new eventfd interface to rectify the problem. It
also carefully coordinates the shutdown path using a deferred work-item
so that we may support a full mutual-disassociate protocol between KVM and
eventfd.
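[Editor's sketch: the deferred-shutdown idea, reduced to a minimal
single-thread work queue in userspace C with pthreads. This is not the
kernel workqueue implementation; it only shows the protocol the patch
relies on: jobs queued from any context run in order on one worker, and
flush blocks until every previously queued job has completed.]

```c
#include <assert.h>
#include <pthread.h>

#define MAXQ 16

static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;
static void (*jobs[MAXQ])(void);
static int head, tail, busy, stopping;
static pthread_t worker;

/* Single worker: runs queued jobs in order, one at a time. */
static void *worker_fn(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&qlock);
	for (;;) {
		while (head == tail && !stopping)
			pthread_cond_wait(&qcond, &qlock);
		if (head == tail)
			break;			/* drained and told to stop */
		void (*job)(void) = jobs[head % MAXQ];
		head++;
		busy = 1;
		pthread_mutex_unlock(&qlock);
		job();				/* run without the lock held */
		pthread_mutex_lock(&qlock);
		busy = 0;
		pthread_cond_broadcast(&qcond);	/* wake any flusher */
	}
	pthread_mutex_unlock(&qlock);
	return NULL;
}

static void start_worker(void)
{
	pthread_create(&worker, NULL, worker_fn, NULL);
}

/* Analogue of queue_work(): may be called from any thread. */
static void queue_job(void (*job)(void))
{
	pthread_mutex_lock(&qlock);
	jobs[tail % MAXQ] = job;
	tail++;
	pthread_cond_broadcast(&qcond);
	pthread_mutex_unlock(&qlock);
}

/* Analogue of flush_workqueue(): wait for all queued jobs to finish. */
static void flush_jobs(void)
{
	pthread_mutex_lock(&qlock);
	while (head != tail || busy)
		pthread_cond_wait(&qcond, &qlock);
	pthread_mutex_unlock(&qlock);
}

static void stop_worker(void)
{
	pthread_mutex_lock(&qlock);
	stopping = 1;
	pthread_cond_broadcast(&qcond);
	pthread_mutex_unlock(&qlock);
	pthread_join(worker, NULL);
}

/* Sample "shutdown" job for demonstration. */
static int shutdowns_run;
static void shutdown_job(void) { shutdowns_run++; }
```

After flush_jobs() returns, the caller knows no previously queued shutdown
is still in flight, which is exactly the guarantee kvm_irqfd_release()
needs before it can stop assuming the kvm pointer is valid.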

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/kvm_host.h | 5 +
virt/kvm/eventfd.c | 158 ++++++++++++++++++++++++++++++++++++++--------
2 files changed, 136 insertions(+), 27 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9634f31..8e04a34 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -146,7 +146,10 @@ struct kvm {
struct kvm_io_bus mmio_bus;
struct kvm_io_bus pio_bus;
#ifdef CONFIG_HAVE_KVM_EVENTFD
- struct list_head irqfds;
+ struct {
+ spinlock_t lock;
+ struct list_head items;
+ } irqfds;
#endif
struct kvm_vm_stat stat;
struct kvm_arch arch;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 9656027..9da9286 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -36,16 +36,21 @@
* Credit goes to Avi Kivity for the original idea.
* --------------------------------------------------------------------
*/
+
struct _irqfd {
struct kvm *kvm;
+ struct eventfd_ctx *eventfd;
int gsi;
struct list_head list;
poll_table pt;
wait_queue_head_t *wqh;
wait_queue_t wait;
struct work_struct inject;
+ struct work_struct shutdown;
};

+static struct workqueue_struct *irqfd_cleanup_wq;
+
static void
irqfd_inject(struct work_struct *work)
{
@@ -58,30 +63,89 @@ irqfd_inject(struct work_struct *work)
mutex_unlock(&kvm->irq_lock);
}

+/*
+ * Race-free decouple logic (ordering is critical)
+ */
+static void
+irqfd_shutdown(struct work_struct *work)
+{
+ struct _irqfd *irqfd = container_of(work, struct _irqfd, shutdown);
+
+ /*
+ * Synchronize with the wait-queue and unhook ourselves to prevent
+ * further events.
+ */
+ remove_wait_queue(irqfd->wqh, &irqfd->wait);
+
+ /*
+ * We know no new events will be scheduled at this point, so block
+ * until all previously outstanding events have completed
+ */
+ flush_work(&irqfd->inject);
+
+ /*
+ * It is now safe to release the object's resources
+ */
+ eventfd_ctx_put(irqfd->eventfd);
+ kfree(irqfd);
+}
+
+
+/* assumes kvm->irqfds.lock is held */
+static bool
+irqfd_is_active(struct _irqfd *irqfd)
+{
+ return list_empty(&irqfd->list) ? false : true;
+}
+
+/*
+ * Mark the irqfd as inactive and schedule it for removal
+ *
+ * assumes kvm->irqfds.lock is held
+ */
+static void
+irqfd_deactivate(struct _irqfd *irqfd)
+{
+ BUG_ON(!irqfd_is_active(irqfd));
+
+ list_del_init(&irqfd->list);
+
+ queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
+}
+
+/*
+ * Called with wqh->lock held and interrupts disabled
+ */
static int
irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
unsigned long flags = (unsigned long)key;

- /*
- * Assume we will be called with interrupts disabled
- */
if (flags & POLLIN)
- /*
- * Defer the IRQ injection until later since we need to
- * acquire the kvm->lock to do so.
- */
+ /* An event has been signaled, inject an interrupt */
schedule_work(&irqfd->inject);

if (flags & POLLHUP) {
+ /* The eventfd is closing, detach from KVM */
+ struct kvm *kvm = irqfd->kvm;
+ unsigned long flags;
+
+ spin_lock_irqsave(&kvm->irqfds.lock, flags);
+
/*
- * for now, just remove ourselves from the list and let
- * the rest dangle. We will fix this up later once
- * the races in eventfd are fixed
+ * We must check if someone deactivated the irqfd before
+ * we could acquire the irqfds.lock since the item is
+ * deactivated from the KVM side before it is unhooked from
+ * the wait-queue. If it is already deactivated, we can
+ * simply return knowing the other side will cleanup for us.
+ * We cannot race against the irqfd going away since the
+ * other side is required to acquire wqh->lock, which we hold
*/
- __remove_wait_queue(irqfd->wqh, &irqfd->wait);
- irqfd->wqh = NULL;
+ if (irqfd_is_active(irqfd))
+ irqfd_deactivate(irqfd);
+
+ spin_unlock_irqrestore(&kvm->irqfds.lock, flags);
}

return 0;
@@ -102,6 +166,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
{
struct _irqfd *irqfd;
struct file *file = NULL;
+ struct eventfd_ctx *eventfd = NULL;
int ret;
unsigned int events;

@@ -113,6 +178,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
irqfd->gsi = gsi;
INIT_LIST_HEAD(&irqfd->list);
INIT_WORK(&irqfd->inject, irqfd_inject);
+ INIT_WORK(&irqfd->shutdown, irqfd_shutdown);

file = eventfd_fget(fd);
if (IS_ERR(file)) {
@@ -120,6 +186,14 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
goto fail;
}

+ eventfd = eventfd_ctx_fileget(file);
+ if (IS_ERR(eventfd)) {
+ ret = PTR_ERR(eventfd);
+ goto fail;
+ }
+
+ irqfd->eventfd = eventfd;
+
/*
* Install our own custom wake-up handling so we are notified via
* a callback whenever someone signals the underlying eventfd
@@ -129,12 +203,13 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)

events = file->f_op->poll(file, &irqfd->pt);

- mutex_lock(&kvm->lock);
- list_add_tail(&irqfd->list, &kvm->irqfds);
- mutex_unlock(&kvm->lock);
+ spin_lock_irq(&kvm->irqfds.lock);
+ list_add_tail(&irqfd->list, &kvm->irqfds.items);
+ spin_unlock_irq(&kvm->irqfds.lock);

/*
- * Check if there was an event already queued
+ * Check if there was an event already pending on the eventfd
+ * before we registered, and trigger it as if we didn't miss it.
*/
if (events & POLLIN)
schedule_work(&irqfd->inject);
@@ -148,6 +223,9 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
return 0;

fail:
+ if (eventfd && !IS_ERR(eventfd))
+ eventfd_ctx_put(eventfd);
+
if (file && !IS_ERR(file))
fput(file);

@@ -158,24 +236,52 @@ fail:
void
kvm_irqfd_init(struct kvm *kvm)
{
- INIT_LIST_HEAD(&kvm->irqfds);
+ spin_lock_init(&kvm->irqfds.lock);
+ INIT_LIST_HEAD(&kvm->irqfds.items);
}

+/*
+ * This function is called as the kvm VM fd is being released. Shutdown all
+ * irqfds that still remain open
+ */
void
kvm_irqfd_release(struct kvm *kvm)
{
struct _irqfd *irqfd, *tmp;

- list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
- if (irqfd->wqh)
- remove_wait_queue(irqfd->wqh, &irqfd->wait);
+ spin_lock_irq(&kvm->irqfds.lock);

- flush_work(&irqfd->inject);
+ list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list)
+ irqfd_deactivate(irqfd);

- mutex_lock(&kvm->lock);
- list_del(&irqfd->list);
- mutex_unlock(&kvm->lock);
+ spin_unlock_irq(&kvm->irqfds.lock);
+
+ /*
+ * Block until we know all outstanding shutdown jobs have completed
+ * since we do not take a kvm* reference.
+ */
+ flush_workqueue(irqfd_cleanup_wq);

- kfree(irqfd);
- }
}
+
+/*
+ * create a host-wide workqueue for issuing deferred shutdown requests
+ * aggregated from all vm* instances. We need our own isolated single-thread
+ * queue to prevent deadlock against flushing the normal work-queue.
+ */
+static int __init irqfd_module_init(void)
+{
+ irqfd_cleanup_wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
+ if (!irqfd_cleanup_wq)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void __exit irqfd_module_exit(void)
+{
+ destroy_workqueue(irqfd_cleanup_wq);
+}
+
+module_init(irqfd_module_init);
+module_exit(irqfd_module_exit);

2009-07-02 15:39:35

by Gregory Haskins

Subject: [KVM PATCH v9 5/5] KVM: create irqfd-cleanup-wq on demand

We currently create this wq on module_init, which may be wasteful if the
host never creates a guest that uses irqfd. This patch changes the
algorithm so that the workqueue is only created when at least one guest
is using irqfd. The queue is cleaned up when the last guest using irqfd
is shutdown.

To keep things simple, we only check whether the guest has tried to create
an irqfd, not whether there are actually irqfds active.
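[Editor's sketch: the refcounting core of the on-demand pattern, in plain
userspace C. `resource` stands in for the cleanup workqueue and `malloc`/
`free` for create_singlethread_workqueue()/destroy_workqueue(); the
per-guest `init` flag mirrors kvm->irqfds.init, so each guest takes and
drops exactly one reference, and the resource exists only while at least
one guest holds one. This shows the intended symmetric semantics, not the
posted patch verbatim.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct guest { int init; };	/* mirrors kvm->irqfds.init */

static int refs;
static void *resource;		/* stand-in for the cleanup workqueue */

static int acquire(struct guest *g)
{
	if (g->init)
		return 0;		/* this guest already holds a ref */
	if (!refs) {
		resource = malloc(1);	/* 0 -> 1: create the resource */
		if (!resource)
			return -1;
	}
	refs++;
	g->init = 1;
	return 0;
}

static void release(struct guest *g)
{
	if (!g->init)
		return;			/* never initialized: nothing to drop */
	if (!--refs) {
		free(resource);		/* 1 -> 0: destroy the resource */
		resource = NULL;
	}
	g->init = 0;
}
```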

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/kvm_host.h | 1
virt/kvm/eventfd.c | 100 ++++++++++++++++++++++++++++++++++------------
2 files changed, 75 insertions(+), 26 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8e04a34..cd1a0f3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -149,6 +149,7 @@ struct kvm {
struct {
spinlock_t lock;
struct list_head items;
+ bool init;
} irqfds;
#endif
struct kvm_vm_stat stat;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 4092b8d..fcc3469 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -49,7 +49,16 @@ struct _irqfd {
struct work_struct shutdown;
};

-static struct workqueue_struct *irqfd_cleanup_wq;
+struct _irqfd_cleanup {
+ struct mutex lock;
+ int refs;
+ struct workqueue_struct *wq;
+};
+
+static struct _irqfd_cleanup irqfd_cleanup = {
+ .lock = __MUTEX_INITIALIZER(irqfd_cleanup.lock),
+ .refs = 0,
+};

static void
irqfd_inject(struct work_struct *work)
@@ -110,7 +119,7 @@ irqfd_deactivate(struct _irqfd *irqfd)

list_del_init(&irqfd->list);

- queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
+ queue_work(irqfd_cleanup.wq, &irqfd->shutdown);
}

/*
@@ -161,6 +170,62 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
add_wait_queue(wqh, &irqfd->wait);
}

+/*
+ * create a host-wide workqueue for issuing deferred shutdown requests
+ * aggregated from all vm* instances. We need our own isolated single-thread
+ * queue to prevent deadlock against flushing the normal work-queue.
+ */
+static int
+irqfd_cleanup_init(struct kvm *kvm)
+{
+ int ret = 0;
+
+ mutex_lock(&irqfd_cleanup.lock);
+
+ /*
+ * Check the current init state from within the lock so that we
+ * sync all users to the thread creation.
+ */
+ if (kvm->irqfds.init)
+ goto out;
+
+ if (!irqfd_cleanup.refs) {
+ struct workqueue_struct *wq;
+
+ wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
+ if (!wq) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ irqfd_cleanup.wq = wq;
+ }
+
+ irqfd_cleanup.refs++;
+ kvm->irqfds.init = true;
+
+out:
+ mutex_unlock(&irqfd_cleanup.lock);
+
+ return ret;
+}
+
+static void
+irqfd_cleanup_release(struct kvm *kvm)
+{
+ if (!kvm->irqfds.init)
+ return;
+
+ mutex_lock(&irqfd_cleanup.lock);
+
+ if (!(--irqfd_cleanup.refs))
+ destroy_workqueue(irqfd_cleanup.wq);
+
+ mutex_unlock(&irqfd_cleanup.lock);
+
+ kvm->irqfds.init = false;
+}
+
static int
kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
{
@@ -170,6 +235,10 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
int ret;
unsigned int events;

+ ret = irqfd_cleanup_init(kvm);
+ if (ret < 0)
+ return ret;
+
irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
if (!irqfd)
return -ENOMEM;
@@ -268,7 +337,7 @@ kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
* so that we guarantee there will not be any more interrupts on this
* gsi once this deassign function returns.
*/
- flush_workqueue(irqfd_cleanup_wq);
+ flush_workqueue(irqfd_cleanup.wq);

return 0;
}
@@ -302,28 +371,7 @@ kvm_irqfd_release(struct kvm *kvm)
* Block until we know all outstanding shutdown jobs have completed
* since we do not take a kvm* reference.
*/
- flush_workqueue(irqfd_cleanup_wq);
-
-}
-
-/*
- * create a host-wide workqueue for issuing deferred shutdown requests
- * aggregated from all vm* instances. We need our own isolated single-thread
- * queue to prevent deadlock against flushing the normal work-queue.
- */
-static int __init irqfd_module_init(void)
-{
- irqfd_cleanup_wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
- if (!irqfd_cleanup_wq)
- return -ENOMEM;
-
- return 0;
-}
+ flush_workqueue(irqfd_cleanup.wq);
+ irqfd_cleanup_release(kvm);

-static void __exit irqfd_module_exit(void)
-{
- destroy_workqueue(irqfd_cleanup_wq);
}
-
-module_init(irqfd_module_init);
-module_exit(irqfd_module_exit);

2009-07-02 15:48:19

by Avi Kivity

Subject: Re: [KVM PATCH v9 0/5] irqfd fixes and enhancements

On 07/02/2009 06:37 PM, Gregory Haskins wrote:
> (Applies to kvm.git/master:1f9050fd)
>
> The following is the latest attempt to fix the races in irqfd/eventfd, as
> well as restore DEASSIGN support. For more details, please read the patch
> headers.
>
> As always, this series has been tested against the kvm-eventfd unit test
> and everything appears to be functioning properly. You can download this
> test here:
>

Applied, thanks.

--
error compiling committee.c: too many arguments to function

2009-07-02 16:43:33

by Davide Libenzi

Subject: Re: [KVM PATCH v9 2/5] eventfd: use locked POLLHUP

On Thu, 2 Jul 2009, Gregory Haskins wrote:

> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> "release" callback. This lets eventfd clients know if the eventfd is about
> to go away and is very useful particularly for in-kernel clients. However,
> as it stands today it is not possible to use this feature of eventfd in a
> race-free way. This patch changes the POLLHUP code to use the locked variant
> to rectify this problem.
>
> Signed-off-by: Gregory Haskins <[email protected]>
> CC: Davide Libenzi <[email protected]>
> ---
>
> fs/eventfd.c | 7 +------
> 1 files changed, 1 insertions(+), 6 deletions(-)
>
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index d9849a1..31d12de 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -105,12 +105,7 @@ static int eventfd_release(struct inode *inode, struct file *file)
> {
> struct eventfd_ctx *ctx = file->private_data;
>
> - /*
> - * No need to hold the lock here, since we are on the file cleanup
> - * path and the ones still attached to the wait queue will be
> - * serialized by wake_up_locked_poll().
> - */
> - wake_up_locked_poll(&ctx->wqh, POLLHUP);
> + wake_up_poll(&ctx->wqh, POLLHUP);
> eventfd_ctx_put(ctx);
> return 0;
> }

ACK.
BTW, mainline already has the correct wake_up_poll().


- Davide

2009-07-05 09:26:57

by Avi Kivity

Subject: Re: [KVM PATCH v9 0/5] irqfd fixes and enhancements

On 07/02/2009 06:50 PM, Avi Kivity wrote:
> On 07/02/2009 06:37 PM, Gregory Haskins wrote:
>> (Applies to kvm.git/master:1f9050fd)
>>
>> The following is the latest attempt to fix the races in
>> irqfd/eventfd, as
>> well as restore DEASSIGN support. For more details, please read the
>> patch
>> headers.
>>
>> As always, this series has been tested against the kvm-eventfd unit test
>> and everything appears to be functioning properly. You can download this
>> test here:
>
> Applied, thanks.
>

... and unapplied. There's a refcounting mismatch in irqfd_cleanup: a
reference is taken for each irqfd, but dropped for each guest. This
causes an oops if a guest with no irqfds is created and destroyed:

IP: [<ffffffff8105254a>] flush_workqueue+0x23/0x6e
PGD 0
Oops: 0000 [4] SMP
CPU 1
Modules linked in: kvm_intel kvm nfs lockd nfs_acl sco bridge stp bnep
l2cap bluetooth autofs4 sunrpc ipv6 dm_multipath uinput i5000_edac
e1000e edac_core iTCO_wdt
iTCO_vendor_support i2c_i801 i2c_core e100 mii floppy pcspkr shpchp
serio_raw ata_generic pata_acpi [last unloaded: kvm]
Pid: 2088, comm: qemu Tainted: G D 2.6.27.19-170.2.35.fc10.x86_64
#1 TYAN Transport GT20-B5372
RIP: 0010:[<ffffffff8105254a>] [<ffffffff8105254a>]
flush_workqueue+0x23/0x6e
RSP: 0018:ffff8801077d1b08 EFLAGS: 00010292
RAX: ffffffff8156de08 RBX: ffff8801097e8a50 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000000000019f RDI: 0000000000000000
RBP: ffff8801077d1b28 R08: 0000000000000000 R09: ffffffff81140027
R10: ffff88012f402340 R11: ffff880125daf820 R12: ffffffff8156de10
R13: 0000000000000000 R14: ffff88012f449cd8 R15: ffff88012f002900
FS: 00007f9b67295950(0000) GS:ffff88012fc04980(0000) knlGS:0000000000000000
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000020 CR3: 0000000000201000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu (pid: 2088, threadinfo ffff8801077d0000, task ffff88012984c530)
Stack: ffff8801077d1b28 ffff8801097e8a50 ffff8801097e8000 ffff8801097e8a68
ffff8801077d1b58 ffffffffa01c4eec 0000000800000000 ffff8801097e8000
ffff88012f449cd8 ffff88012c80c600 ffff8801077d1b78 ffffffffa01af34b
Call Trace:
[<ffffffffa01c4eec>] kvm_irqfd_release+0x7a/0xcc [kvm]
[<ffffffffa01af34b>] kvm_vm_release+0x18/0x27 [kvm]
[<ffffffff810c14c7>] __fput+0xca/0x16d
[<ffffffff810c157f>] fput+0x15/0x17
[<ffffffff810bea29>] filp_close+0x67/0x72
[<ffffffff810433ec>] put_files_struct+0x74/0xc8
[<ffffffff81043488>] exit_files+0x48/0x51
[<ffffffff81044de9>] do_exit+0x26a/0x8a0
[<ffffffffa01da330>] ? vmx_vcpu_put+0x9/0xb [kvm_intel]
[<ffffffff810454a1>] do_group_exit+0x82/0xaf
[<ffffffff8104eabb>] get_signal_to_deliver+0x2b0/0x2dc
[<ffffffff81010379>] ? sysret_signal+0x42/0x71
[<ffffffff8100f45f>] do_notify_resume+0x90/0x93f
[<ffffffff81060cca>] ? do_futex+0x90/0x973
[<ffffffffa01ad956>] ? kvm_vcpu_ioctl+0x470/0x485 [kvm]
[<ffffffff81333801>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[<ffffffff810616a2>] ? sys_futex+0xf5/0x113
[<ffffffff81010379>] ? sysret_signal+0x42/0x71
[<ffffffff81010737>] ptregscall_common+0x67/0xb0

irqfd_cleanup.wq has never been initialized, but is destroyed.

--
error compiling committee.c: too many arguments to function

2009-07-05 10:17:12

by Michael S. Tsirkin

Subject: Re: [KVM PATCH v9 0/5] irqfd fixes and enhancements

On Sun, Jul 05, 2009 at 12:28:30PM +0300, Avi Kivity wrote:
> On 07/02/2009 06:50 PM, Avi Kivity wrote:
>> On 07/02/2009 06:37 PM, Gregory Haskins wrote:
>>> (Applies to kvm.git/master:1f9050fd)
>>>
>>> The following is the latest attempt to fix the races in
>>> irqfd/eventfd, as
>>> well as restore DEASSIGN support. For more details, please read the
>>> patch
>>> headers.
>>>
>>> As always, this series has been tested against the kvm-eventfd unit test
>>> and everything appears to be functioning properly. You can download this
>>> test here:
>>
>> Applied, thanks.
>>
>
> ... and unapplied. There's a refcounting mismatch in irqfd_cleanup: a
> reference is taken for each irqfd, but dropped for each guest. This
> causes an oops if a guest with no irqfds is created and destroyed:

Ugh, apparently this logic changed between when I acked v7 of the
patches and when Avi applied v9. I will have to find the time to redo
the review - or maybe just go back to v7? Is on-demand wq creation
really that important?

--
MST

2009-07-05 10:21:29

by Michael S. Tsirkin

Subject: Re: [KVM PATCH v9 0/5] irqfd fixes and enhancements

On Sun, Jul 05, 2009 at 01:16:24PM +0300, Michael S. Tsirkin wrote:
> On Sun, Jul 05, 2009 at 12:28:30PM +0300, Avi Kivity wrote:
> > On 07/02/2009 06:50 PM, Avi Kivity wrote:
> >> On 07/02/2009 06:37 PM, Gregory Haskins wrote:
> >>> (Applies to kvm.git/master:1f9050fd)
> >>>
> >>> The following is the latest attempt to fix the races in
> >>> irqfd/eventfd, as
> >>> well as restore DEASSIGN support. For more details, please read the
> >>> patch
> >>> headers.
> >>>
> >>> As always, this series has been tested against the kvm-eventfd unit test
> >>> and everything appears to be functioning properly. You can download this
> >>> test here:
> >>
> >> Applied, thanks.
> >>
> >
> > ... and unapplied. There's a refcounting mismatch in irqfd_cleanup: a
> > reference is taken for each irqfd, but dropped for each guest. This
> > causes an oops if a guest with no irqfds is created and destroyed:
>
> Ugh, apparently this logic has been changed between I acked v7 of the
> patches and between Avi applied v9. Will have to find the time to redo
> the review - or maybe just go back to v7? Is on-demand wq creation
> really that important?

Avi, is it true that just reverting the last patch in series,
c9a2686e39e9095772ec6453f89c417a8e166f11, fixes the issue?

--
MST

2009-07-05 10:40:55

by Avi Kivity

Subject: Re: [KVM PATCH v9 0/5] irqfd fixes and enhancements

On 07/05/2009 01:38 PM, Michael S. Tsirkin wrote:
>> ... and unapplied. There's a refcounting mismatch in irqfd_cleanup: a
>> reference is taken for each irqfd, but dropped for each guest. This
>> causes an oops if a guest with no irqfds is created and destroyed:
>>
>
> Avi, I verified and it seems that reverting just the last patch
> in series should be enough. Comments?
>

I'll reapply the first N-1 patches.

--
error compiling committee.c: too many arguments to function

2009-07-05 10:39:42

by Michael S. Tsirkin

Subject: Re: [KVM PATCH v9 0/5] irqfd fixes and enhancements

On Sun, Jul 05, 2009 at 12:28:30PM +0300, Avi Kivity wrote:
> On 07/02/2009 06:50 PM, Avi Kivity wrote:
>> On 07/02/2009 06:37 PM, Gregory Haskins wrote:
>>> (Applies to kvm.git/master:1f9050fd)
>>>
>>> The following is the latest attempt to fix the races in
>>> irqfd/eventfd, as
>>> well as restore DEASSIGN support. For more details, please read the
>>> patch
>>> headers.
>>>
>>> As always, this series has been tested against the kvm-eventfd unit test
>>> and everything appears to be functioning properly. You can download this
>>> test here:
>>
>> Applied, thanks.
>>
>
> ... and unapplied. There's a refcounting mismatch in irqfd_cleanup: a
> reference is taken for each irqfd, but dropped for each guest. This
> causes an oops if a guest with no irqfds is created and destroyed:

Avi, I verified and it seems that reverting just the last patch
in the series should be enough. Comments?

--
MST

2009-07-05 21:21:37

by Gregory Haskins

Subject: Re: [KVM PATCH v9 0/5] irqfd fixes and enhancements

Avi Kivity wrote:
> On 07/02/2009 06:50 PM, Avi Kivity wrote:
>> On 07/02/2009 06:37 PM, Gregory Haskins wrote:
>>> (Applies to kvm.git/master:1f9050fd)
>>>
>>> The following is the latest attempt to fix the races in
>>> irqfd/eventfd, as
>>> well as restore DEASSIGN support. For more details, please read the
>>> patch
>>> headers.
>>>
>>> As always, this series has been tested against the kvm-eventfd unit
>>> test
>>> and everything appears to be functioning properly. You can download
>>> this
>>> test here:
>>
>> Applied, thanks.
>>
>
> ... and unapplied. There's a refcounting mismatch in irqfd_cleanup: a
> reference is taken for each irqfd, but dropped for each guest.

Doh!

Note that the kvm->irqfds.init boolean was supposed to make it both
acquire and release the reference on a per-vm basis. Obviously,
something is wrong. :)

But I see later in the thread you guys have already figured out that
only the last patch needs to be reverted. I'll take a look at the last
patch and fix the issue once I am back in the office tomorrow. Sorry
for the trouble.

Kind Regards,
-Greg

> This causes an oops if a guest with no irqfds is created and destroyed:
>
> IP: [<ffffffff8105254a>] flush_workqueue+0x23/0x6e
> PGD 0
> Oops: 0000 [4] SMP
> CPU 1
> Modules linked in: kvm_intel kvm nfs lockd nfs_acl sco bridge stp bnep
> l2cap bluetooth autofs4 sunrpc ipv6 dm_multipath uinput i5000_edac
> e1000e edac_core iTCO_wdt
> iTCO_vendor_support i2c_i801 i2c_core e100 mii floppy pcspkr shpchp
> serio_raw ata_generic pata_acpi [last unloaded: kvm]
> Pid: 2088, comm: qemu Tainted: G D
> 2.6.27.19-170.2.35.fc10.x86_64 #1 TYAN Transport GT20-B5372
> RIP: 0010:[<ffffffff8105254a>] [<ffffffff8105254a>]
> flush_workqueue+0x23/0x6e
> RSP: 0018:ffff8801077d1b08 EFLAGS: 00010292
> RAX: ffffffff8156de08 RBX: ffff8801097e8a50 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 000000000000019f RDI: 0000000000000000
> RBP: ffff8801077d1b28 R08: 0000000000000000 R09: ffffffff81140027
> R10: ffff88012f402340 R11: ffff880125daf820 R12: ffffffff8156de10
> R13: 0000000000000000 R14: ffff88012f449cd8 R15: ffff88012f002900
> FS: 00007f9b67295950(0000) GS:ffff88012fc04980(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
> CR2: 0000000000000020 CR3: 0000000000201000 CR4: 00000000000026e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process qemu (pid: 2088, threadinfo ffff8801077d0000, task
> ffff88012984c530)
> Stack: ffff8801077d1b28 ffff8801097e8a50 ffff8801097e8000
> ffff8801097e8a68
> ffff8801077d1b58 ffffffffa01c4eec 0000000800000000 ffff8801097e8000
> ffff88012f449cd8 ffff88012c80c600 ffff8801077d1b78 ffffffffa01af34b
> Call Trace:
> [<ffffffffa01c4eec>] kvm_irqfd_release+0x7a/0xcc [kvm]
> [<ffffffffa01af34b>] kvm_vm_release+0x18/0x27 [kvm]
> [<ffffffff810c14c7>] __fput+0xca/0x16d
> [<ffffffff810c157f>] fput+0x15/0x17
> [<ffffffff810bea29>] filp_close+0x67/0x72
> [<ffffffff810433ec>] put_files_struct+0x74/0xc8
> [<ffffffff81043488>] exit_files+0x48/0x51
> [<ffffffff81044de9>] do_exit+0x26a/0x8a0
> [<ffffffffa01da330>] ? vmx_vcpu_put+0x9/0xb [kvm_intel]
> [<ffffffff810454a1>] do_group_exit+0x82/0xaf
> [<ffffffff8104eabb>] get_signal_to_deliver+0x2b0/0x2dc
> [<ffffffff81010379>] ? sysret_signal+0x42/0x71
> [<ffffffff8100f45f>] do_notify_resume+0x90/0x93f
> [<ffffffff81060cca>] ? do_futex+0x90/0x973
> [<ffffffffa01ad956>] ? kvm_vcpu_ioctl+0x470/0x485 [kvm]
> [<ffffffff81333801>] ? trace_hardirqs_on_thunk+0x3a/0x3c
> [<ffffffff810616a2>] ? sys_futex+0xf5/0x113
> [<ffffffff81010379>] ? sysret_signal+0x42/0x71
> [<ffffffff81010737>] ptregscall_common+0x67/0xb0
>
> irqfd_cleanup.wq has never been initialized, but is destroyed.
>



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-07-06 14:56:32

by Gregory Haskins

Subject: Re: [KVM PATCH v9 0/5] irqfd fixes and enhancements

Avi Kivity wrote:
> On 07/02/2009 06:50 PM, Avi Kivity wrote:
>> On 07/02/2009 06:37 PM, Gregory Haskins wrote:
>>> (Applies to kvm.git/master:1f9050fd)
>>>
>>> The following is the latest attempt to fix the races in
>>> irqfd/eventfd, as
>>> well as restore DEASSIGN support. For more details, please read the
>>> patch
>>> headers.
>>>
>>> As always, this series has been tested against the kvm-eventfd unit
>>> test
>>> and everything appears to be functioning properly. You can download
>>> this
>>> test here:
>>
>> Applied, thanks.
>>
>
> ... and unapplied. There's a refcounting mismatch in irqfd_cleanup: a
> reference is taken for each irqfd, but dropped for each guest. This
> causes an oops if a guest with no irqfds is created and destroyed:

I was able to reproduce this issue. The problem turned out to be that I
inadvertently always did a flush_workqueue(), even if the work-queue was
never initialized.

The following interdiff applied to the reverted patch has been confirmed
to fix the issue:

-------------------

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index fcc3469..52b0e04 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -318,6 +318,9 @@ kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
struct _irqfd *irqfd, *tmp;
struct eventfd_ctx *eventfd;

+ if (!kvm->irqfds.init)
+ return -ENOENT;
+
eventfd = eventfd_ctx_fdget(fd);
if (IS_ERR(eventfd))
return PTR_ERR(eventfd);
@@ -360,6 +363,9 @@ kvm_irqfd_release(struct kvm *kvm)
{
struct _irqfd *irqfd, *tmp;

+ if (!kvm->irqfds.init)
+ return;
+
spin_lock_irq(&kvm->irqfds.lock);

list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list)

---------------------

You can pick up this fix folded into the original v9:5/5 patch here:

git pull
git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git
for-avi

Sorry for the sloppy patch in v9. :( I will strive to do better next time.

Regards,
-Greg



2009-07-06 15:59:40

by Michael S. Tsirkin

Subject: Re: [KVM PATCH v9 5/5] KVM: create irqfd-cleanup-wq on demand

On Thu, Jul 02, 2009 at 11:38:22AM -0400, Gregory Haskins wrote:
> We currently create this wq on module_init, which may be wasteful if the
> host never creates a guest that uses irqfd. This patch changes the
> algorithm so that the workqueue is only created when at least one guest
> is using irqfd. The queue is cleaned up when the last guest using irqfd
> is shutdown.
>
> To keep things simple, we only check whether the guest has tried to create
> an irqfd, not whether there are actually irqfds active.
>
> Signed-off-by: Gregory Haskins <[email protected]>
> ---
>
> include/linux/kvm_host.h | 1
> virt/kvm/eventfd.c | 100 ++++++++++++++++++++++++++++++++++------------
> 2 files changed, 75 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8e04a34..cd1a0f3 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -149,6 +149,7 @@ struct kvm {
> struct {
> spinlock_t lock;
> struct list_head items;
> + bool init;
> } irqfds;
> #endif
> struct kvm_vm_stat stat;
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 4092b8d..fcc3469 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -49,7 +49,16 @@ struct _irqfd {
> struct work_struct shutdown;
> };
>
> -static struct workqueue_struct *irqfd_cleanup_wq;
> +struct _irqfd_cleanup {
> + struct mutex lock;
> + int refs;
> + struct workqueue_struct *wq;
> +};
> +
> +static struct _irqfd_cleanup irqfd_cleanup = {
> + .lock = __MUTEX_INITIALIZER(irqfd_cleanup.lock),
> + .refs = 0,
> +};
>
> static void
> irqfd_inject(struct work_struct *work)
> @@ -110,7 +119,7 @@ irqfd_deactivate(struct _irqfd *irqfd)
>
> list_del_init(&irqfd->list);
>
> - queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
> + queue_work(irqfd_cleanup.wq, &irqfd->shutdown);
> }
>
> /*
> @@ -161,6 +170,62 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
> add_wait_queue(wqh, &irqfd->wait);
> }
>
> +/*
> + * create a host-wide workqueue for issuing deferred shutdown requests
> + * aggregated from all vm* instances. We need our own isolated single-thread
> + * queue to prevent deadlock against flushing the normal work-queue.
> + */
> +static int
> +irqfd_cleanup_init(struct kvm *kvm)
> +{
> + int ret = 0;
> +
> + mutex_lock(&irqfd_cleanup.lock);
> +
> + /*
> + * Check the current init state from within the lock so that we
> + * sync all users to the thread creation.
> + */
> + if (kvm->irqfds.init)
> + goto out;
> +
> + if (!irqfd_cleanup.refs) {
> + struct workqueue_struct *wq;
> +
> + wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
> + if (!wq) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + irqfd_cleanup.wq = wq;
> + }
> +
> + irqfd_cleanup.refs++;
> + kvm->irqfds.init = true;
> +
> +out:
> + mutex_unlock(&irqfd_cleanup.lock);
> +
> + return ret;
> +}
> +
> +static void
> +irqfd_cleanup_release(struct kvm *kvm)
> +{
> + if (!kvm->irqfds.init)
> + return;

init is checked outside the lock here.
Why?

> +
> + mutex_lock(&irqfd_cleanup.lock);
> +
> + if (!(--irqfd_cleanup.refs))
> + destroy_workqueue(irqfd_cleanup.wq);
> +
> + mutex_unlock(&irqfd_cleanup.lock);
> +
> + kvm->irqfds.init = false;

... and cleaned outside the lock as well.

> +}
> +
> static int
> kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> {
> @@ -170,6 +235,10 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> int ret;
> unsigned int events;
>
> + ret = irqfd_cleanup_init(kvm);
> + if (ret < 0)
> + return ret;
> +
> irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> if (!irqfd)
> return -ENOMEM;
> @@ -268,7 +337,7 @@ kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
> * so that we guarantee there will not be any more interrupts on this
> * gsi once this deassign function returns.
> */
> - flush_workqueue(irqfd_cleanup_wq);
> + flush_workqueue(irqfd_cleanup.wq);
>
> return 0;
> }
> @@ -302,28 +371,7 @@ kvm_irqfd_release(struct kvm *kvm)
> * Block until we know all outstanding shutdown jobs have completed
> * since we do not take a kvm* reference.
> */
> - flush_workqueue(irqfd_cleanup_wq);
> -
> -}
> -
> -/*
> - * create a host-wide workqueue for issuing deferred shutdown requests
> - * aggregated from all vm* instances. We need our own isolated single-thread
> - * queue to prevent deadlock against flushing the normal work-queue.
> - */
> -static int __init irqfd_module_init(void)
> -{
> - irqfd_cleanup_wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
> - if (!irqfd_cleanup_wq)
> - return -ENOMEM;
> -
> - return 0;
> -}
> + flush_workqueue(irqfd_cleanup.wq);
> + irqfd_cleanup_release(kvm);
>
> -static void __exit irqfd_module_exit(void)
> -{
> - destroy_workqueue(irqfd_cleanup_wq);
> }
> -
> -module_init(irqfd_module_init);
> -module_exit(irqfd_module_exit);

2009-07-06 16:03:51

by Gregory Haskins

Subject: Re: [KVM PATCH v9 5/5] KVM: create irqfd-cleanup-wq on demand

Michael S. Tsirkin wrote:
> On Thu, Jul 02, 2009 at 11:38:22AM -0400, Gregory Haskins wrote:
>
>> We currently create this wq on module_init, which may be wasteful if the
>> host never creates a guest that uses irqfd. This patch changes the
>> algorithm so that the workqueue is only created when at least one guest
>> is using irqfd. The queue is cleaned up when the last guest using irqfd
>> is shutdown.
>>
>> To keep things simple, we only check whether the guest has tried to create
>> an irqfd, not whether there are actually irqfds active.
>>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> ---
>>
>> include/linux/kvm_host.h | 1
>> virt/kvm/eventfd.c | 100 ++++++++++++++++++++++++++++++++++------------
>> 2 files changed, 75 insertions(+), 26 deletions(-)
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 8e04a34..cd1a0f3 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -149,6 +149,7 @@ struct kvm {
>> struct {
>> spinlock_t lock;
>> struct list_head items;
>> + bool init;
>> } irqfds;
>> #endif
>> struct kvm_vm_stat stat;
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> index 4092b8d..fcc3469 100644
>> --- a/virt/kvm/eventfd.c
>> +++ b/virt/kvm/eventfd.c
>> @@ -49,7 +49,16 @@ struct _irqfd {
>> struct work_struct shutdown;
>> };
>>
>> -static struct workqueue_struct *irqfd_cleanup_wq;
>> +struct _irqfd_cleanup {
>> + struct mutex lock;
>> + int refs;
>> + struct workqueue_struct *wq;
>> +};
>> +
>> +static struct _irqfd_cleanup irqfd_cleanup = {
>> + .lock = __MUTEX_INITIALIZER(irqfd_cleanup.lock),
>> + .refs = 0,
>> +};
>>
>> static void
>> irqfd_inject(struct work_struct *work)
>> @@ -110,7 +119,7 @@ irqfd_deactivate(struct _irqfd *irqfd)
>>
>> list_del_init(&irqfd->list);
>>
>> - queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
>> + queue_work(irqfd_cleanup.wq, &irqfd->shutdown);
>> }
>>
>> /*
>> @@ -161,6 +170,62 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
>> add_wait_queue(wqh, &irqfd->wait);
>> }
>>
>> +/*
>> + * create a host-wide workqueue for issuing deferred shutdown requests
>> + * aggregated from all vm* instances. We need our own isolated single-thread
>> + * queue to prevent deadlock against flushing the normal work-queue.
>> + */
>> +static int
>> +irqfd_cleanup_init(struct kvm *kvm)
>> +{
>> + int ret = 0;
>> +
>> + mutex_lock(&irqfd_cleanup.lock);
>> +
>> + /*
>> + * Check the current init state from within the lock so that we
>> + * sync all users to the thread creation.
>> + */
>> + if (kvm->irqfds.init)
>> + goto out;
>> +
>> + if (!irqfd_cleanup.refs) {
>> + struct workqueue_struct *wq;
>> +
>> + wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
>> + if (!wq) {
>> + ret = -ENOMEM;
>> + goto out;
>> + }
>> +
>> + irqfd_cleanup.wq = wq;
>> + }
>> +
>> + irqfd_cleanup.refs++;
>> + kvm->irqfds.init = true;
>> +
>> +out:
>> + mutex_unlock(&irqfd_cleanup.lock);
>> +
>> + return ret;
>> +}
>> +
>> +static void
>> +irqfd_cleanup_release(struct kvm *kvm)
>> +{
>> + if (!kvm->irqfds.init)
>> + return;
>>
>
> init is checked outside the lock here.
> Why?
>

The guest is shutting down via vmfd->f_ops->release() and is already
guaranteed to be single-threaded, so proper locking is not really
needed here. I should probably document that, though ;)
>
>> +
>> + mutex_lock(&irqfd_cleanup.lock);
>> +
>> + if (!(--irqfd_cleanup.refs))
>> + destroy_workqueue(irqfd_cleanup.wq);
>> +
>> + mutex_unlock(&irqfd_cleanup.lock);
>> +
>> + kvm->irqfds.init = false;
>>
>
> ... and cleaned outside the lock as well.
>
>
Ditto

>> +}
>> +
>> static int
>> kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
>> {
>> @@ -170,6 +235,10 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
>> int ret;
>> unsigned int events;
>>
>> + ret = irqfd_cleanup_init(kvm);
>> + if (ret < 0)
>> + return ret;
>> +
>> irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
>> if (!irqfd)
>> return -ENOMEM;
>> @@ -268,7 +337,7 @@ kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
>> * so that we guarantee there will not be any more interrupts on this
>> * gsi once this deassign function returns.
>> */
>> - flush_workqueue(irqfd_cleanup_wq);
>> + flush_workqueue(irqfd_cleanup.wq);
>>
>> return 0;
>> }
>> @@ -302,28 +371,7 @@ kvm_irqfd_release(struct kvm *kvm)
>> * Block until we know all outstanding shutdown jobs have completed
>> * since we do not take a kvm* reference.
>> */
>> - flush_workqueue(irqfd_cleanup_wq);
>> -
>> -}
>> -
>> -/*
>> - * create a host-wide workqueue for issuing deferred shutdown requests
>> - * aggregated from all vm* instances. We need our own isolated single-thread
>> - * queue to prevent deadlock against flushing the normal work-queue.
>> - */
>> -static int __init irqfd_module_init(void)
>> -{
>> - irqfd_cleanup_wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
>> - if (!irqfd_cleanup_wq)
>> - return -ENOMEM;
>> -
>> - return 0;
>> -}
>> + flush_workqueue(irqfd_cleanup.wq);
>> + irqfd_cleanup_release(kvm);
>>
>> -static void __exit irqfd_module_exit(void)
>> -{
>> - destroy_workqueue(irqfd_cleanup_wq);
>> }
>> -
>> -module_init(irqfd_module_init);
>> -module_exit(irqfd_module_exit);
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>




2009-07-06 16:14:20

by Michael S. Tsirkin

Subject: Re: [KVM PATCH v9 0/5] irqfd fixes and enhancements

On Mon, Jul 06, 2009 at 10:56:02AM -0400, Gregory Haskins wrote:
> Avi Kivity wrote:
> > On 07/02/2009 06:50 PM, Avi Kivity wrote:
> >> On 07/02/2009 06:37 PM, Gregory Haskins wrote:
> >>> (Applies to kvm.git/master:1f9050fd)
> >>>
> >>> The following is the latest attempt to fix the races in
> >>> irqfd/eventfd, as
> >>> well as restore DEASSIGN support. For more details, please read the
> >>> patch
> >>> headers.
> >>>
> >>> As always, this series has been tested against the kvm-eventfd unit
> >>> test
> >>> and everything appears to be functioning properly. You can download
> >>> this
> >>> test here:
> >>
> >> Applied, thanks.
> >>
> >
> > ... and unapplied. There's a refcounting mismatch in irqfd_cleanup: a
> > reference is taken for each irqfd, but dropped for each guest. This
> > causes an oops if a guest with no irqfds is created and destroyed:
>
> I was able to reproduce this issue. The problem turned out to be that I
> inadvertently always did a flush_workqueue(), even if the work-queue was
> never initialized.
>
> The following interdiff applied to the reverted patch has been confirmed
> to fix the issue:

Could you document the init boolean and its locking rules?
The best place to put it would be where the field is declared btw.
Is it true that init === list_empty(&kvm->irqfds.items)?
If yes maybe we don't need this field at all.


> -------------------
>
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index fcc3469..52b0e04 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -318,6 +318,9 @@ kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
> struct _irqfd *irqfd, *tmp;
> struct eventfd_ctx *eventfd;
>
> + if (!kvm->irqfds.init)
> + return -ENOENT;
> +
> eventfd = eventfd_ctx_fdget(fd);
> if (IS_ERR(eventfd))
> return PTR_ERR(eventfd);

Wouldn't it be cleaner to error out in the for_each loop if we don't
find an entry to deactivate? It might be helpful for apps to get an error
if they didn't deassign anything.

> @@ -360,6 +363,9 @@ kvm_irqfd_release(struct kvm *kvm)
> {
> struct _irqfd *irqfd, *tmp;
>
> + if (!kvm->irqfds.init)
> + return;
> +

So here, I recall some old comment saying that the flush below was
needed even if the list is empty. Is this no longer true? If the flush
is no longer needed, it might be cleaner to only flush when the list is
not empty.


> spin_lock_irq(&kvm->irqfds.lock);
>
> list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list)
>
> ---------------------
>
> You can pick up this fix folded into the original v9:5/5 patch here:
>
> git pull
> git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git
> for-avi
>
> Sorry for the sloppy patch in v9. :( Will strive to do better next time.
>
> Regards,
> -Greg
>

2009-07-06 16:15:30

by Michael S. Tsirkin

Subject: Re: [KVM PATCH v9 5/5] KVM: create irqfd-cleanup-wq on demand

On Mon, Jul 06, 2009 at 12:03:28PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jul 02, 2009 at 11:38:22AM -0400, Gregory Haskins wrote:
> >
> >> We currently create this wq on module_init, which may be wasteful if the
> >> host never creates a guest that uses irqfd. This patch changes the
> >> algorithm so that the workqueue is only created when at least one guest
> >> is using irqfd. The queue is cleaned up when the last guest using irqfd
> >> is shutdown.
> >>
> >> To keep things simple, we only check whether the guest has tried to create
> >> an irqfd, not whether there are actually irqfds active.
> >>
> >> Signed-off-by: Gregory Haskins <[email protected]>
> >> ---
> >>
> >> include/linux/kvm_host.h | 1
> >> virt/kvm/eventfd.c | 100 ++++++++++++++++++++++++++++++++++------------
> >> 2 files changed, 75 insertions(+), 26 deletions(-)
> >>
> >> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >> index 8e04a34..cd1a0f3 100644
> >> --- a/include/linux/kvm_host.h
> >> +++ b/include/linux/kvm_host.h
> >> @@ -149,6 +149,7 @@ struct kvm {
> >> struct {
> >> spinlock_t lock;
> >> struct list_head items;
> >> + bool init;
> >> } irqfds;
> >> #endif
> >> struct kvm_vm_stat stat;
> >> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> >> index 4092b8d..fcc3469 100644
> >> --- a/virt/kvm/eventfd.c
> >> +++ b/virt/kvm/eventfd.c
> >> @@ -49,7 +49,16 @@ struct _irqfd {
> >> struct work_struct shutdown;
> >> };
> >>
> >> -static struct workqueue_struct *irqfd_cleanup_wq;
> >> +struct _irqfd_cleanup {
> >> + struct mutex lock;
> >> + int refs;
> >> + struct workqueue_struct *wq;
> >> +};
> >> +
> >> +static struct _irqfd_cleanup irqfd_cleanup = {
> >> + .lock = __MUTEX_INITIALIZER(irqfd_cleanup.lock),
> >> + .refs = 0,
> >> +};
> >>
> >> static void
> >> irqfd_inject(struct work_struct *work)
> >> @@ -110,7 +119,7 @@ irqfd_deactivate(struct _irqfd *irqfd)
> >>
> >> list_del_init(&irqfd->list);
> >>
> >> - queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
> >> + queue_work(irqfd_cleanup.wq, &irqfd->shutdown);
> >> }
> >>
> >> /*
> >> @@ -161,6 +170,62 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
> >> add_wait_queue(wqh, &irqfd->wait);
> >> }
> >>
> >> +/*
> >> + * create a host-wide workqueue for issuing deferred shutdown requests
> >> + * aggregated from all vm* instances. We need our own isolated single-thread
> >> + * queue to prevent deadlock against flushing the normal work-queue.
> >> + */
> >> +static int
> >> +irqfd_cleanup_init(struct kvm *kvm)
> >> +{
> >> + int ret = 0;
> >> +
> >> + mutex_lock(&irqfd_cleanup.lock);
> >> +
> >> + /*
> >> + * Check the current init state from within the lock so that we
> >> + * sync all users to the thread creation.
> >> + */
> >> + if (kvm->irqfds.init)
> >> + goto out;
> >> +
> >> + if (!irqfd_cleanup.refs) {
> >> + struct workqueue_struct *wq;
> >> +
> >> + wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
> >> + if (!wq) {
> >> + ret = -ENOMEM;
> >> + goto out;
> >> + }
> >> +
> >> + irqfd_cleanup.wq = wq;
> >> + }
> >> +
> >> + irqfd_cleanup.refs++;
> >> + kvm->irqfds.init = true;
> >> +
> >> +out:
> >> + mutex_unlock(&irqfd_cleanup.lock);
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +static void
> >> +irqfd_cleanup_release(struct kvm *kvm)
> >> +{
> >> + if (!kvm->irqfds.init)
> >> + return;
> >>
> >
> > init is checked outside the lock here.
> > Why?
> >
>
> Guest is shutting down via vmfd->f_ops->release() and is already
> guaranteed to be single-threaded, so proper locking is not really
> important. Probably should document that, though ;)

Is this an important optimization? Simple locking is better...

> >> +
> >> + mutex_lock(&irqfd_cleanup.lock);
> >> +
> >> + if (!(--irqfd_cleanup.refs))
> >> + destroy_workqueue(irqfd_cleanup.wq);
> >> +
> >> + mutex_unlock(&irqfd_cleanup.lock);
> >> +
> >> + kvm->irqfds.init = false;
> >>
> >
> > ... and cleaned outside the lock as well.
> >
> >
> Ditto
>
> >> +}
> >> +
> >> static int
> >> kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> >> {
> >> @@ -170,6 +235,10 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> >> int ret;
> >> unsigned int events;
> >>
> >> + ret = irqfd_cleanup_init(kvm);
> >> + if (ret < 0)
> >> + return ret;
> >> +
> >> irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> >> if (!irqfd)
> >> return -ENOMEM;
> >> @@ -268,7 +337,7 @@ kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
> >> * so that we guarantee there will not be any more interrupts on this
> >> * gsi once this deassign function returns.
> >> */
> >> - flush_workqueue(irqfd_cleanup_wq);
> >> + flush_workqueue(irqfd_cleanup.wq);
> >>
> >> return 0;
> >> }
> >> @@ -302,28 +371,7 @@ kvm_irqfd_release(struct kvm *kvm)
> >> * Block until we know all outstanding shutdown jobs have completed
> >> * since we do not take a kvm* reference.
> >> */
> >> - flush_workqueue(irqfd_cleanup_wq);
> >> -
> >> -}
> >> -
> >> -/*
> >> - * create a host-wide workqueue for issuing deferred shutdown requests
> >> - * aggregated from all vm* instances. We need our own isolated single-thread
> >> - * queue to prevent deadlock against flushing the normal work-queue.
> >> - */
> >> -static int __init irqfd_module_init(void)
> >> -{
> >> - irqfd_cleanup_wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
> >> - if (!irqfd_cleanup_wq)
> >> - return -ENOMEM;
> >> -
> >> - return 0;
> >> -}
> >> + flush_workqueue(irqfd_cleanup.wq);
> >> + irqfd_cleanup_release(kvm);
> >>
> >> -static void __exit irqfd_module_exit(void)
> >> -{
> >> - destroy_workqueue(irqfd_cleanup_wq);
> >> }
> >> -
> >> -module_init(irqfd_module_init);
> >> -module_exit(irqfd_module_exit);
> >>
> >
>
>

2009-07-06 16:33:10

by Gregory Haskins

Subject: Re: [KVM PATCH v9 5/5] KVM: create irqfd-cleanup-wq on demand

Michael S. Tsirkin wrote:
> On Mon, Jul 06, 2009 at 12:03:28PM -0400, Gregory Haskins wrote:
>
>> Michael S. Tsirkin wrote:
>>
>>> On Thu, Jul 02, 2009 at 11:38:22AM -0400, Gregory Haskins wrote:
>>>
>>>
>>>> We currently create this wq on module_init, which may be wasteful if the
>>>> host never creates a guest that uses irqfd. This patch changes the
>>>> algorithm so that the workqueue is only created when at least one guest
>>>> is using irqfd. The queue is cleaned up when the last guest using irqfd
>>>> is shutdown.
>>>>
>>>> To keep things simple, we only check whether the guest has tried to create
>>>> an irqfd, not whether there are actually irqfds active.
>>>>
>>>> Signed-off-by: Gregory Haskins <[email protected]>
>>>> ---
>>>>
>>>> include/linux/kvm_host.h | 1
>>>> virt/kvm/eventfd.c | 100 ++++++++++++++++++++++++++++++++++------------
>>>> 2 files changed, 75 insertions(+), 26 deletions(-)
>>>>
>>>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>>>> index 8e04a34..cd1a0f3 100644
>>>> --- a/include/linux/kvm_host.h
>>>> +++ b/include/linux/kvm_host.h
>>>> @@ -149,6 +149,7 @@ struct kvm {
>>>> struct {
>>>> spinlock_t lock;
>>>> struct list_head items;
>>>> + bool init;
>>>> } irqfds;
>>>> #endif
>>>> struct kvm_vm_stat stat;
>>>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>>>> index 4092b8d..fcc3469 100644
>>>> --- a/virt/kvm/eventfd.c
>>>> +++ b/virt/kvm/eventfd.c
>>>> @@ -49,7 +49,16 @@ struct _irqfd {
>>>> struct work_struct shutdown;
>>>> };
>>>>
>>>> -static struct workqueue_struct *irqfd_cleanup_wq;
>>>> +struct _irqfd_cleanup {
>>>> + struct mutex lock;
>>>> + int refs;
>>>> + struct workqueue_struct *wq;
>>>> +};
>>>> +
>>>> +static struct _irqfd_cleanup irqfd_cleanup = {
>>>> + .lock = __MUTEX_INITIALIZER(irqfd_cleanup.lock),
>>>> + .refs = 0,
>>>> +};
>>>>
>>>> static void
>>>> irqfd_inject(struct work_struct *work)
>>>> @@ -110,7 +119,7 @@ irqfd_deactivate(struct _irqfd *irqfd)
>>>>
>>>> list_del_init(&irqfd->list);
>>>>
>>>> - queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
>>>> + queue_work(irqfd_cleanup.wq, &irqfd->shutdown);
>>>> }
>>>>
>>>> /*
>>>> @@ -161,6 +170,62 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
>>>> add_wait_queue(wqh, &irqfd->wait);
>>>> }
>>>>
>>>> +/*
>>>> + * create a host-wide workqueue for issuing deferred shutdown requests
>>>> + * aggregated from all vm* instances. We need our own isolated single-thread
>>>> + * queue to prevent deadlock against flushing the normal work-queue.
>>>> + */
>>>> +static int
>>>> +irqfd_cleanup_init(struct kvm *kvm)
>>>> +{
>>>> + int ret = 0;
>>>> +
>>>> + mutex_lock(&irqfd_cleanup.lock);
>>>> +
>>>> + /*
>>>> + * Check the current init state from within the lock so that we
>>>> + * sync all users to the thread creation.
>>>> + */
>>>> + if (kvm->irqfds.init)
>>>> + goto out;
>>>> +
>>>> + if (!irqfd_cleanup.refs) {
>>>> + struct workqueue_struct *wq;
>>>> +
>>>> + wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
>>>> + if (!wq) {
>>>> + ret = -ENOMEM;
>>>> + goto out;
>>>> + }
>>>> +
>>>> + irqfd_cleanup.wq = wq;
>>>> + }
>>>> +
>>>> + irqfd_cleanup.refs++;
>>>> + kvm->irqfds.init = true;
>>>> +
>>>> +out:
>>>> + mutex_unlock(&irqfd_cleanup.lock);
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +static void
>>>> +irqfd_cleanup_release(struct kvm *kvm)
>>>> +{
>>>> + if (!kvm->irqfds.init)
>>>> + return;
>>>>
>>>>
>>> init is checked outside the lock here.
>>> Why?
>>>
>>>
>> Guest is shutting down via vmfd->f_ops->release() and is already
>> guaranteed to be single-threaded, so proper locking is not really
>> important. Probably should document that, though ;)
>>
>
> Is this an important optimization? Simple locking is better...
>

It's not "important" from a performance perspective or anything like
that. It's just that locking here is inappropriate. The init value is
per-vm, and we only need to guard it going from false->true in the assign
path to cover the unlikely case that more than one thread tries to
assign an irqfd at the same time. The shutdown path has no such
requirement, since the VM has already implicitly ceased to exist and
therefore cannot possibly assign another irqfd. So locking here,
while harmless in every dimension, is simply gratuitous. I would rather
just document it, personally; i.e., I think overlocking is just sloppy.
Otherwise we should be documenting that the lock isn't really needed, etc.

For the record, I would rather have just used my originally proposed
slow-work thread instead of all this ;)

-Greg

>
>>>> +
>>>> + mutex_lock(&irqfd_cleanup.lock);
>>>> +
>>>> + if (!(--irqfd_cleanup.refs))
>>>> + destroy_workqueue(irqfd_cleanup.wq);
>>>> +
>>>> + mutex_unlock(&irqfd_cleanup.lock);
>>>> +
>>>> + kvm->irqfds.init = false;
>>>>
>>>>
>>> ... and cleaned outside the lock as well.
>>>
>>>
>>>
>> Ditto
>>
>>
>>>> +}
>>>> +
>>>> static int
>>>> kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
>>>> {
>>>> @@ -170,6 +235,10 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
>>>> int ret;
>>>> unsigned int events;
>>>>
>>>> + ret = irqfd_cleanup_init(kvm);
>>>> + if (ret < 0)
>>>> + return ret;
>>>> +
>>>> irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
>>>> if (!irqfd)
>>>> return -ENOMEM;
>>>> @@ -268,7 +337,7 @@ kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
>>>> * so that we guarantee there will not be any more interrupts on this
>>>> * gsi once this deassign function returns.
>>>> */
>>>> - flush_workqueue(irqfd_cleanup_wq);
>>>> + flush_workqueue(irqfd_cleanup.wq);
>>>>
>>>> return 0;
>>>> }
>>>> @@ -302,28 +371,7 @@ kvm_irqfd_release(struct kvm *kvm)
>>>> * Block until we know all outstanding shutdown jobs have completed
>>>> * since we do not take a kvm* reference.
>>>> */
>>>> - flush_workqueue(irqfd_cleanup_wq);
>>>> -
>>>> -}
>>>> -
>>>> -/*
>>>> - * create a host-wide workqueue for issuing deferred shutdown requests
>>>> - * aggregated from all vm* instances. We need our own isolated single-thread
>>>> - * queue to prevent deadlock against flushing the normal work-queue.
>>>> - */
>>>> -static int __init irqfd_module_init(void)
>>>> -{
>>>> - irqfd_cleanup_wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
>>>> - if (!irqfd_cleanup_wq)
>>>> - return -ENOMEM;
>>>> -
>>>> - return 0;
>>>> -}
>>>> + flush_workqueue(irqfd_cleanup.wq);
>>>> + irqfd_cleanup_release(kvm);
>>>>
>>>> -static void __exit irqfd_module_exit(void)
>>>> -{
>>>> - destroy_workqueue(irqfd_cleanup_wq);
>>>> }
>>>> -
>>>> -module_init(irqfd_module_init);
>>>> -module_exit(irqfd_module_exit);
>>>>
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at http://www.tux.org/lkml/
>>>
>>>
>>
>
>
>



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-07-06 16:42:25

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v9 0/5] irqfd fixes and enhancements

Michael S. Tsirkin wrote:
> On Mon, Jul 06, 2009 at 10:56:02AM -0400, Gregory Haskins wrote:
>
>> Avi Kivity wrote:
>>
>>> On 07/02/2009 06:50 PM, Avi Kivity wrote:
>>>
>>>> On 07/02/2009 06:37 PM, Gregory Haskins wrote:
>>>>
>>>>> (Applies to kvm.git/master:1f9050fd)
>>>>>
>>>>> The following is the latest attempt to fix the races in
>>>>> irqfd/eventfd, as
>>>>> well as restore DEASSIGN support. For more details, please read the
>>>>> patch
>>>>> headers.
>>>>>
>>>>> As always, this series has been tested against the kvm-eventfd unit
>>>>> test
>>>>> and everything appears to be functioning properly. You can download
>>>>> this
>>>>> test here:
>>>>>
>>>> Applied, thanks.
>>>>
>>>>
>>> ... and unapplied. There's a refcounting mismatch in irqfd_cleanup: a
>>> reference is taken for each irqfd, but dropped for each guest. This
>>> causes an oops if a guest with no irqfds is created and destroyed:
>>>
>> I was able to reproduce this issue. The problem turned out to be that I
>> inadvertently always did a flush_workqueue(), even if the work-queue was
>> never initialized.
>>
>> The following interdiff applied to the reverted patch has been confirmed
>> to fix the issue:
>>
>
> Could you document the init boolean and its locking rules?
> The best place to put it would be where the field is declared btw.
>

Will do

> Is it true that init === list_empty(&kvm->irqfds.items)?
> If yes maybe we don't need this field at all.
>
>
No, because it's more difficult to maintain the work-queue when
referenced against active irqfds (*). So instead, it's maintained
against guests that use irqfd, whether they have an active irqfd or
not. Otherwise you have to contend with the eventfd-side release, which
is a little tricky.

(*) I'm sure it's not rocket science to get this working, but it was
getting more complex than I thought it was worth, so I simplified the
model to be per-vm. Note that this design decision/limitation is
declared in the patch header.
>
>> -------------------
>>
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> index fcc3469..52b0e04 100644
>> --- a/virt/kvm/eventfd.c
>> +++ b/virt/kvm/eventfd.c
>> @@ -318,6 +318,9 @@ kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
>> struct _irqfd *irqfd, *tmp;
>> struct eventfd_ctx *eventfd;
>>
>> + if (!kvm->irqfds.init)
>> + return -ENOENT;
>> +
>> eventfd = eventfd_ctx_fdget(fd);
>> if (IS_ERR(eventfd))
>> return PTR_ERR(eventfd);
>>
>
> wouldn't it be cleaner to error out in the for each loop if we don't
> find an entry to deactivate? Might be helpful for apps to get an error
> if they didn't deassign anything.
>

Again, irqfds.init is somewhat orthogonal to whether the list is
populated or not. This check is for sanity (how can you deassign if you
didn't assign, etc.). Normally this would be a simple BUG_ON() sanity
check, but I don't want a malicious/broken userspace to gain an easy
attack vector ;)

>
>> @@ -360,6 +363,9 @@ kvm_irqfd_release(struct kvm *kvm)
>> {
>> struct _irqfd *irqfd, *tmp;
>>
>> + if (!kvm->irqfds.init)
>> + return;
>> +
>>
>
> So here, I recall some old comment that flush below was
> needed even if list is empty. Is this no longer true?
>

If you are using irqfd, it's true. If irqfds.init == false, you are not
using irqfd and thus the flush cannot be needed.

> If not it might be cleaner to only flush if list is not empty.
>
>
You have to flush if irqfds.init == true even if the list is empty
because you need to be sure that eventfd-side releases complete. They
may have already removed themselves from the list, but the work-item is
still in flight.

Regards,
-Greg



2009-07-06 16:50:44

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM PATCH v9 0/5] irqfd fixes and enhancements

On Mon, Jul 06, 2009 at 12:41:59PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Mon, Jul 06, 2009 at 10:56:02AM -0400, Gregory Haskins wrote:
> >
> >> Avi Kivity wrote:
> >>
> >>> On 07/02/2009 06:50 PM, Avi Kivity wrote:
> >>>
> >>>> On 07/02/2009 06:37 PM, Gregory Haskins wrote:
> >>>>
> >>>>> (Applies to kvm.git/master:1f9050fd)
> >>>>>
> >>>>> The following is the latest attempt to fix the races in
> >>>>> irqfd/eventfd, as
> >>>>> well as restore DEASSIGN support. For more details, please read the
> >>>>> patch
> >>>>> headers.
> >>>>>
> >>>>> As always, this series has been tested against the kvm-eventfd unit
> >>>>> test
> >>>>> and everything appears to be functioning properly. You can download
> >>>>> this
> >>>>> test here:
> >>>>>
> >>>> Applied, thanks.
> >>>>
> >>>>
> >>> ... and unapplied. There's a refcounting mismatch in irqfd_cleanup: a
> >>> reference is taken for each irqfd, but dropped for each guest. This
> >>> causes an oops if a guest with no irqfds is created and destroyed:
> >>>
> >> I was able to reproduce this issue. The problem turned out to be that I
> >> inadvertently always did a flush_workqueue(), even if the work-queue was
> >> never initialized.
> >>
> >> The following interdiff applied to the reverted patch has been confirmed
> >> to fix the issue:
> >>
> >
> > Could you document the init boolean and its locking rules?
> > The best place to put it would be where the field is declared btw.
> >
>
> Will do
>
> > Is it true that init === list_empty(&kvm->irqfds.items)?
> > If yes maybe we don't need this field at all.
> >
> >
> No,

OK, I thought it is. I'll wait for the documentation patch then.

> because it's more difficult to maintain the work-queue when
> referenced against active irqfds (*). So instead, it's maintained
> against guests that use irqfd, whether they have an active irqfd or
> not. Otherwise you have to contend with the eventfd-side release, which
> is a little tricky.
>
> (*) I'm sure it's not rocket science to get this working, but it was
> getting more complex than I thought it was worth, so I simplified the
> model to be per-vm. Note that this design decision/limitation is
> declared in the patch header.
> >
> >> -------------------
> >>
> >> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> >> index fcc3469..52b0e04 100644
> >> --- a/virt/kvm/eventfd.c
> >> +++ b/virt/kvm/eventfd.c
> >> @@ -318,6 +318,9 @@ kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
> >> struct _irqfd *irqfd, *tmp;
> >> struct eventfd_ctx *eventfd;
> >>
> >> + if (!kvm->irqfds.init)
> >> + return -ENOENT;
> >> +
> >> eventfd = eventfd_ctx_fdget(fd);
> >> if (IS_ERR(eventfd))
> >> return PTR_ERR(eventfd);
> >>
> >
> > wouldn't it be cleaner to error out in the for each loop if we don't
> > find an entry to deactivate? Might be helpful for apps to get an error
> > if they didn't deassign anything.
> >
>
> Again, irqfds.init is somewhat orthogonal to whether the list is
> populated or not. This check is for sanity (how can you deassign if you
> didn't assign, etc.). Normally this would be a simple BUG_ON() sanity
> check, but I don't want a malicious/broken userspace to gain an easy
> attack vector ;)

What I'm saying is that deassign should return an error if it's passed
an entry that is not on the list. And if you do this and return before
the flush, this check won't be needed.

> >
> >> @@ -360,6 +363,9 @@ kvm_irqfd_release(struct kvm *kvm)
> >> {
> >> struct _irqfd *irqfd, *tmp;
> >>
> >> + if (!kvm->irqfds.init)
> >> + return;
> >> +
> >>
> >
> > So here, I recall some old comment that flush below was
> > needed even if list is empty. Is this no longer true?
> >
>
> If you are using irqfd, it's true. If irqfds.init == false, you are not
> using irqfd and thus the flush cannot be needed.
>
> > If not it might be cleaner to only flush if list is not empty.
> >
> >
> You have to flush if irqfds.init == true even if the list is empty
> because you need to be sure that eventfd-side releases complete. They
> may have already removed themselves from the list, but the work-item is
> still in flight.
>
> Regards,
> -Greg
>

2009-07-06 16:51:31

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM PATCH v9 5/5] KVM: create irqfd-cleanup-wq on demand

On Mon, Jul 06, 2009 at 12:32:42PM -0400, Gregory Haskins wrote:
> For the record, I would have rather just used my originally proposed
> slow-work thread instead of all this ;)

Or just keep the idle thread around, it's not a big deal.

--
MST

2009-07-06 18:28:41

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v9 5/5] KVM: create irqfd-cleanup-wq on demand

Michael S. Tsirkin wrote:
> On Mon, Jul 06, 2009 at 12:32:42PM -0400, Gregory Haskins wrote:
>
>> For the record, I would have rather just used my originally proposed
>> slow-work thread instead of all this ;)
>>
>
> Or just keep the idle thread around, it's not a big deal.
>
>

If everyone is fine with that, let's do that then. It's certainly the
simplest solution to just drop this patch.

Avi?

-Greg



2009-07-06 18:48:52

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v9 0/5] irqfd fixes and enhancements

Michael S. Tsirkin wrote:
> On Mon, Jul 06, 2009 at 12:41:59PM -0400, Gregory Haskins wrote:
>
>> Michael S. Tsirkin wrote:
>>
>>>>
>>>>
>>> wouldn't it be cleaner to error out in the for each loop if we don't
>>> find an entry to deactivate? Might be helpful for apps to get an error
>>> if they didn't deassign anything.
>>>
>>>
>> Again, irqfds.init is somewhat orthogonal to whether the list is
>> populated or not. This check is for sanity (how can you deassign if you
>> didn't assign, etc.). Normally this would be a simple BUG_ON() sanity
>> check, but I don't want a malicious/broken userspace to gain an easy
>> attack vector ;)
>>
>
> What I'm saying is that deassign should return an error if it's passed
> an entry that is not on the list.

This isn't an unreasonable request, and I believe this is actually the
way the original deassign logic worked before we yanked the feature a
few weeks ago. It's only slightly complicated by the fact that we may
match multiple irqfds, but I think we solved that before by returning
the number we matched.

If Avi answers the other mail stating he still wants to see the
on-demand work go in, let's use your suggestion.

Regards,
-Greg




2009-07-07 05:17:19

by Avi Kivity

[permalink] [raw]
Subject: Re: [KVM PATCH v9 5/5] KVM: create irqfd-cleanup-wq on demand

On 07/06/2009 09:28 PM, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
>
>> On Mon, Jul 06, 2009 at 12:32:42PM -0400, Gregory Haskins wrote:
>>
>>
>>> For the record, I would have rather just used my originally proposed
>>> slow-work thread instead of all this ;)
>>>
>>>
>> Or just keep the idle thread around, it's not a big deal.
>>
>>
>>
>
>> If everyone is fine with that, let's do that then. It's certainly the
>> simplest solution to just drop this patch.
>
> Avi?
>
>

Sure. If someone complains we can address it then.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-07-07 11:26:31

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v9 5/5] KVM: create irqfd-cleanup-wq on demand

Avi Kivity wrote:
> On 07/06/2009 09:28 PM, Gregory Haskins wrote:
>> Michael S. Tsirkin wrote:
>>
>>> On Mon, Jul 06, 2009 at 12:32:42PM -0400, Gregory Haskins wrote:
>>>
>>>
>>>> For the record, I would have rather just used my originally proposed
>>>> slow-work thread instead of all this ;)
>>>>
>>>>
>>> Or just keep the idle thread around, it's not a big deal.
>>>
>>>
>>>
>>
>> If everyone is fine with that, let's do that then. It's certainly the
>> simplest solution to just drop this patch.
>>
>> Avi?
>>
>>
>
> Sure. If someone complains we can address it then.
>
Ok, sounds good.

Thanks guys.
-Greg

