2009-06-16 02:30:18

by Gregory Haskins

[permalink] [raw]
Subject: [KVM-RFC PATCH 0/2] eventfd enhancements for irqfd/iosignalfd

(Applies to kvm.git/master:c27b64a0)

For more details, please read the patch headers. This series adds
a new feature to eventfd (explicit notifications) which addresses
wqh based limitations, as well as provides fixes for several issues
in irqfd pointed out by Michael Tsirkin. You can find reference to these
issues here:

http://lkml.org/lkml/2009/6/15/218
http://lkml.org/lkml/2009/6/11/249

This series has been tested against the kvm-eventfd unit test, and
appears to be functioning properly. You can download this test here:

ftp://ftp.novell.com/dev/ghaskins/kvm-eventfd.tar.bz2

---

Gregory Haskins (2):
eventfd: add module reference counting support for registered notifiers
eventfd: add an explicit srcu based notifier interface


fs/eventfd.c | 123 ++++++++++++++++++++++++++++++++++++++++++++---
include/linux/eventfd.h | 33 +++++++++++++
virt/kvm/eventfd.c | 114 ++++++++++++++++++++------------------------
3 files changed, 199 insertions(+), 71 deletions(-)

--
Signature


2009-06-16 02:30:39

by Gregory Haskins

[permalink] [raw]
Subject: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

irqfd and its underlying implementation, eventfd, currently utilize
the embedded wait-queue in eventfd for signal notification. The nice thing
about this design decision is that it re-uses the existing
eventfd/wait-queue code and it generally works well....with several
limitations.

One of the limitations is that notification callbacks are always called
inside a spin_lock_irqsave critical section. Another limitation is
that it is very difficult to build a system that can recieve release
notification without being racy.

Therefore, we introduce a new registration interface that is SRCU based
instead of wait-queue based, and implement the internal wait-queue
infrastructure in terms of this new interface. We then convert irqfd
to use this new interface instead of the existing wait-queue code.

The end result is that we now have the opportunity to run the interrupt
injection code serially to the callback (when the signal is raised from
process-context, at least) instead of always deferring the injection to a
work-queue.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Paul E. McKenney <[email protected]>
CC: Davide Libenzi <[email protected]>
---

fs/eventfd.c | 115 +++++++++++++++++++++++++++++++++++++++++++----
include/linux/eventfd.h | 30 ++++++++++++
virt/kvm/eventfd.c | 114 +++++++++++++++++++++--------------------------
3 files changed, 188 insertions(+), 71 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 72f5f8d..505d5de 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -30,8 +30,47 @@ struct eventfd_ctx {
*/
__u64 count;
unsigned int flags;
+ struct srcu_struct srcu;
+ struct list_head nh;
+ struct eventfd_notifier notifier;
};

+static void _eventfd_wqh_notify(struct eventfd_notifier *en)
+{
+ struct eventfd_ctx *ctx = container_of(en,
+ struct eventfd_ctx,
+ notifier);
+
+ if (waitqueue_active(&ctx->wqh))
+ wake_up_poll(&ctx->wqh, POLLIN);
+}
+
+static void _eventfd_notify(struct eventfd_ctx *ctx)
+{
+ struct eventfd_notifier *en;
+ int idx;
+
+ idx = srcu_read_lock(&ctx->srcu);
+
+ /*
+ * The goal here is to allow the notification to be preemptible
+ * as often as possible. We cannot achieve this with the basic
+ * wqh mechanism because it requires the wqh->lock. Therefore
+ * we have an internal srcu list mechanism of which the wqh is
+ * a client.
+ *
+ * Not all paths will invoke this function in process context.
+ * Callers should check for suitable state before assuming they
+ * can sleep (such as with preemptible()). Paul McKenney assures
+ * me that srcu_read_lock is compatible with in-atomic, as long as
+ * the code within the critical section is also compatible.
+ */
+ list_for_each_entry_rcu(en, &ctx->nh, list)
+ en->ops->signal(en);
+
+ srcu_read_unlock(&ctx->srcu, idx);
+}
+
/*
* Adds "n" to the eventfd counter "count". Returns "n" in case of
* success, or a value lower then "n" in case of coutner overflow.
@@ -51,10 +90,10 @@ int eventfd_signal(struct file *file, int n)
if (ULLONG_MAX - ctx->count < n)
n = (int) (ULLONG_MAX - ctx->count);
ctx->count += n;
- if (waitqueue_active(&ctx->wqh))
- wake_up_locked_poll(&ctx->wqh, POLLIN);
spin_unlock_irqrestore(&ctx->wqh.lock, flags);

+ _eventfd_notify(ctx);
+
return n;
}
EXPORT_SYMBOL_GPL(eventfd_signal);
@@ -62,13 +101,21 @@ EXPORT_SYMBOL_GPL(eventfd_signal);
static int eventfd_release(struct inode *inode, struct file *file)
{
struct eventfd_ctx *ctx = file->private_data;
+ struct eventfd_notifier *en, *tmp;

/*
* No need to hold the lock here, since we are on the file cleanup
- * path and the ones still attached to the wait queue will be
- * serialized by wake_up_locked_poll().
+ * path
*/
- wake_up_locked_poll(&ctx->wqh, POLLHUP);
+ list_for_each_entry_safe(en, tmp, &ctx->nh, list) {
+ list_del(&en->list);
+ if (en->ops->release)
+ en->ops->release(en);
+ }
+
+ synchronize_srcu(&ctx->srcu);
+ cleanup_srcu_struct(&ctx->srcu);
+
kfree(ctx);
return 0;
}
@@ -176,13 +223,13 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
__remove_wait_queue(&ctx->wqh, &wait);
__set_current_state(TASK_RUNNING);
}
- if (likely(res > 0)) {
+ if (likely(res > 0))
ctx->count += ucnt;
- if (waitqueue_active(&ctx->wqh))
- wake_up_locked_poll(&ctx->wqh, POLLIN);
- }
spin_unlock_irq(&ctx->wqh.lock);

+ if (likely(res > 0))
+ _eventfd_notify(ctx);
+
return res;
}

@@ -209,6 +256,50 @@ struct file *eventfd_fget(int fd)
}
EXPORT_SYMBOL_GPL(eventfd_fget);

+static int _eventfd_notifier_register(struct eventfd_ctx *ctx,
+ struct eventfd_notifier *en)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&ctx->wqh.lock, flags);
+ list_add_tail_rcu(&en->list, &ctx->nh);
+ spin_unlock_irqrestore(&ctx->wqh.lock, flags);
+
+ return 0;
+}
+
+int eventfd_notifier_register(struct file *file, struct eventfd_notifier *en)
+{
+ struct eventfd_ctx *ctx = file->private_data;
+
+ if (file->f_op != &eventfd_fops)
+ return -EINVAL;
+
+ return _eventfd_notifier_register(ctx, en);
+}
+EXPORT_SYMBOL_GPL(eventfd_notifier_register);
+
+int eventfd_notifier_unregister(struct file *file, struct eventfd_notifier *en)
+{
+ struct eventfd_ctx *ctx = file->private_data;
+
+ if (file->f_op != &eventfd_fops)
+ return -EINVAL;
+
+ spin_lock_irq(&ctx->wqh.lock);
+ list_del_rcu(&en->list);
+ spin_unlock_irq(&ctx->wqh.lock);
+
+ synchronize_srcu(&ctx->srcu);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(eventfd_notifier_unregister);
+
+static const struct eventfd_notifier_ops eventfd_wqh_ops = {
+ .signal = _eventfd_wqh_notify,
+};
+
SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
{
int fd;
@@ -229,6 +320,12 @@ SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
ctx->count = count;
ctx->flags = flags;

+ init_srcu_struct(&ctx->srcu);
+ INIT_LIST_HEAD(&ctx->nh);
+
+ eventfd_notifier_init(&ctx->notifier, &eventfd_wqh_ops);
+ _eventfd_notifier_register(ctx, &ctx->notifier);
+
/*
* When we call this, the initialization must be complete, since
* anon_inode_getfd() will install the fd.
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index f45a8ae..0218cf6 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -8,6 +8,28 @@
#ifndef _LINUX_EVENTFD_H
#define _LINUX_EVENTFD_H

+#include <linux/list.h>
+
+struct eventfd_notifier;
+
+struct eventfd_notifier_ops {
+ void (*signal)(struct eventfd_notifier *en);
+ void (*release)(struct eventfd_notifier *en);
+};
+
+struct eventfd_notifier {
+ struct list_head list;
+ const struct eventfd_notifier_ops *ops;
+};
+
+static inline void eventfd_notifier_init(struct eventfd_notifier *en,
+ const struct eventfd_notifier_ops *ops)
+{
+ memset(en, 0, sizeof(*en));
+ INIT_LIST_HEAD(&en->list);
+ en->ops = ops;
+}
+
#ifdef CONFIG_EVENTFD

/* For O_CLOEXEC and O_NONBLOCK */
@@ -29,12 +51,20 @@

struct file *eventfd_fget(int fd);
int eventfd_signal(struct file *file, int n);
+int eventfd_notifier_register(struct file *file, struct eventfd_notifier *en);
+int eventfd_notifier_unregister(struct file *file, struct eventfd_notifier *en);

#else /* CONFIG_EVENTFD */

#define eventfd_fget(fd) ERR_PTR(-ENOSYS)
static inline int eventfd_signal(struct file *file, int n)
{ return 0; }
+static inline int eventfd_notifier_register(struct file *file,
+ struct eventfd_notifier *en)
+{ return -ENOENT; }
+static inline int eventfd_notifier_unregister(struct file *file,
+ struct eventfd_notifier *en)
+{ return -ENOENT; }

#endif /* CONFIG_EVENTFD */

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 2c8028c..efc3d77 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -43,33 +43,11 @@ struct _irqfd {
struct kvm *kvm;
int gsi;
struct list_head list;
- poll_table pt;
- wait_queue_head_t *wqh;
- wait_queue_t wait;
struct work_struct inject;
+ struct eventfd_notifier notifier;
};

static void
-irqfd_inject(struct work_struct *work)
-{
- struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
- struct kvm *kvm;
- int idx;
-
- idx = srcu_read_lock(&irqfd->srcu);
-
- kvm = rcu_dereference(irqfd->kvm);
- if (kvm) {
- mutex_lock(&kvm->irq_lock);
- kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
- kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
- mutex_unlock(&kvm->irq_lock);
- }
-
- srcu_read_unlock(&irqfd->srcu, idx);
-}
-
-static void
irqfd_disconnect(struct _irqfd *irqfd)
{
struct kvm *kvm;
@@ -97,47 +75,67 @@ irqfd_disconnect(struct _irqfd *irqfd)
kvm_put_kvm(kvm);
}

-static int
-irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
+static void
+_irqfd_inject(struct _irqfd *irqfd)
{
- struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
- unsigned long flags = (unsigned long)key;
-
- if (flags & POLLIN)
- /*
- * The POLLIN wake_up is called with interrupts disabled.
- * Therefore we need to defer the IRQ injection until later
- * since we need to acquire the kvm->lock to do so.
- */
- schedule_work(&irqfd->inject);
+ struct kvm *kvm;
+ int idx;

- if (flags & POLLHUP) {
- /*
- * The POLLHUP is called unlocked, so it theoretically should
- * be safe to remove ourselves from the wqh using the locked
- * variant of remove_wait_queue()
- */
- remove_wait_queue(irqfd->wqh, &irqfd->wait);
- flush_work(&irqfd->inject);
- irqfd_disconnect(irqfd);
+ idx = srcu_read_lock(&irqfd->srcu);

- cleanup_srcu_struct(&irqfd->srcu);
- kfree(irqfd);
+ kvm = rcu_dereference(irqfd->kvm);
+ if (kvm) {
+ mutex_lock(&kvm->irq_lock);
+ kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
+ kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
+ mutex_unlock(&kvm->irq_lock);
}

- return 0;
+ srcu_read_unlock(&irqfd->srcu, idx);
+}
+
+static void
+irqfd_inject(struct work_struct *work)
+{
+ struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
+
+ _irqfd_inject(irqfd);
+}
+
+static void
+irqfd_eventfd_signal(struct eventfd_notifier *notifier)
+{
+ struct _irqfd *irqfd = container_of(notifier, struct _irqfd, notifier);
+
+ /*
+ * Our signal may or may not be called from atomic context. We can
+ * detect if this can safely sleep by checking preemptible() on
+ * CONFIG_PREEMPT kernels. CONFIG_PREEMPT=n, or in_atomic() will fail
+ * this check and therefore defer the injection via a work-queue
+ */
+ if (preemptible())
+ _irqfd_inject(irqfd);
+ else
+ schedule_work(&irqfd->inject);
}

static void
-irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
- poll_table *pt)
+irqfd_eventfd_release(struct eventfd_notifier *notifier)
{
- struct _irqfd *irqfd = container_of(pt, struct _irqfd, pt);
+ struct _irqfd *irqfd = container_of(notifier, struct _irqfd, notifier);

- irqfd->wqh = wqh;
- add_wait_queue(wqh, &irqfd->wait);
+ flush_work(&irqfd->inject);
+ irqfd_disconnect(irqfd);
+
+ cleanup_srcu_struct(&irqfd->srcu);
+ kfree(irqfd);
}

+static const struct eventfd_notifier_ops irqfd_notifier_ops = {
+ .signal = irqfd_eventfd_signal,
+ .release = irqfd_eventfd_release,
+};
+
int
kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
{
@@ -165,14 +163,9 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
goto fail;
}

- /*
- * Install our own custom wake-up handling so we are notified via
- * a callback whenever someone signals the underlying eventfd
- */
- init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
- init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
+ eventfd_notifier_init(&irqfd->notifier, &irqfd_notifier_ops);

- ret = file->f_op->poll(file, &irqfd->pt);
+ ret = eventfd_notifier_register(file, &irqfd->notifier);
if (ret < 0)
goto fail;

@@ -191,9 +184,6 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
return 0;

fail:
- if (irqfd->wqh)
- remove_wait_queue(irqfd->wqh, &irqfd->wait);
-
if (file && !IS_ERR(file))
fput(file);

2009-06-16 02:30:52

by Gregory Haskins

[permalink] [raw]
Subject: [KVM-RFC PATCH 2/2] eventfd: add module reference counting support for registered notifiers

Michael Tsirkin found a race condition in the irqfd code where we may
allow the underlying eventfd object to race with the rmmod of kvm.ko.

Since we now use eventfd_notifier for irqfd, lets add a struct module *owner
field to properly maintain references to our registered signal handlers.

Found-by: Michael S. Tsirkin <[email protected]>
CC: Davide Libenzi <[email protected]>
Signed-off-by: Gregory Haskins <[email protected]>
---

fs/eventfd.c | 8 ++++++++
include/linux/eventfd.h | 3 +++
2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 505d5de..babedb3 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -108,9 +108,12 @@ static int eventfd_release(struct inode *inode, struct file *file)
* path
*/
list_for_each_entry_safe(en, tmp, &ctx->nh, list) {
+ struct module *owner = en->owner;
+
list_del(&en->list);
if (en->ops->release)
en->ops->release(en);
+ module_put(owner);
}

synchronize_srcu(&ctx->srcu);
@@ -261,6 +264,9 @@ static int _eventfd_notifier_register(struct eventfd_ctx *ctx,
{
unsigned long flags;

+ if (!try_module_get(en->owner))
+ return -EINVAL;
+
spin_lock_irqsave(&ctx->wqh.lock, flags);
list_add_tail_rcu(&en->list, &ctx->nh);
spin_unlock_irqrestore(&ctx->wqh.lock, flags);
@@ -292,6 +298,8 @@ int eventfd_notifier_unregister(struct file *file, struct eventfd_notifier *en)

synchronize_srcu(&ctx->srcu);

+ module_put(en->owner);
+
return 0;
}
EXPORT_SYMBOL_GPL(eventfd_notifier_unregister);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 0218cf6..f534bcd 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -9,6 +9,7 @@
#define _LINUX_EVENTFD_H

#include <linux/list.h>
+#include <linux/module.h>

struct eventfd_notifier;

@@ -18,6 +19,7 @@ struct eventfd_notifier_ops {
};

struct eventfd_notifier {
+ struct module *owner;
struct list_head list;
const struct eventfd_notifier_ops *ops;
};
@@ -26,6 +28,7 @@ static inline void eventfd_notifier_init(struct eventfd_notifier *en,
const struct eventfd_notifier_ops *ops)
{
memset(en, 0, sizeof(*en));
+ en->owner = THIS_MODULE;
INIT_LIST_HEAD(&en->list);
en->ops = ops;
}

2009-06-16 14:03:17

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Mon, Jun 15, 2009 at 10:29:56PM -0400, Gregory Haskins wrote:
> irqfd and its underlying implementation, eventfd, currently utilize
> the embedded wait-queue in eventfd for signal notification. The nice thing
> about this design decision is that it re-uses the existing
> eventfd/wait-queue code and it generally works well....with several
> limitations.
>
> One of the limitations is that notification callbacks are always called
> inside a spin_lock_irqsave critical section. Another limitation is
> that it is very difficult to build a system that can recieve release
> notification without being racy.
>
> Therefore, we introduce a new registration interface that is SRCU based
> instead of wait-queue based, and implement the internal wait-queue
> infrastructure in terms of this new interface. We then convert irqfd
> to use this new interface instead of the existing wait-queue code.
>
> The end result is that we now have the opportunity to run the interrupt
> injection code serially to the callback (when the signal is raised from
> process-context, at least) instead of always deferring the injection to a
> work-queue.
>
> Signed-off-by: Gregory Haskins <[email protected]>
> CC: Paul E. McKenney <[email protected]>
> CC: Davide Libenzi <[email protected]>
> ---
>
> fs/eventfd.c | 115 +++++++++++++++++++++++++++++++++++++++++++----
> include/linux/eventfd.h | 30 ++++++++++++
> virt/kvm/eventfd.c | 114 +++++++++++++++++++++--------------------------
> 3 files changed, 188 insertions(+), 71 deletions(-)
>
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index 72f5f8d..505d5de 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -30,8 +30,47 @@ struct eventfd_ctx {
> */
> __u64 count;
> unsigned int flags;
> + struct srcu_struct srcu;
> + struct list_head nh;
> + struct eventfd_notifier notifier;
> };
>
> +static void _eventfd_wqh_notify(struct eventfd_notifier *en)
> +{
> + struct eventfd_ctx *ctx = container_of(en,
> + struct eventfd_ctx,
> + notifier);
> +
> + if (waitqueue_active(&ctx->wqh))
> + wake_up_poll(&ctx->wqh, POLLIN);
> +}
> +
> +static void _eventfd_notify(struct eventfd_ctx *ctx)
> +{
> + struct eventfd_notifier *en;
> + int idx;
> +
> + idx = srcu_read_lock(&ctx->srcu);
> +
> + /*
> + * The goal here is to allow the notification to be preemptible
> + * as often as possible. We cannot achieve this with the basic
> + * wqh mechanism because it requires the wqh->lock. Therefore
> + * we have an internal srcu list mechanism of which the wqh is
> + * a client.
> + *
> + * Not all paths will invoke this function in process context.
> + * Callers should check for suitable state before assuming they
> + * can sleep (such as with preemptible()). Paul McKenney assures
> + * me that srcu_read_lock is compatible with in-atomic, as long as
> + * the code within the critical section is also compatible.
> + */
> + list_for_each_entry_rcu(en, &ctx->nh, list)
> + en->ops->signal(en);
> +
> + srcu_read_unlock(&ctx->srcu, idx);
> +}
> +
> /*
> * Adds "n" to the eventfd counter "count". Returns "n" in case of
> * success, or a value lower then "n" in case of coutner overflow.

This is ugly, isn't it? With CONFIG_PREEMPT=no preemptible() is always false.

Further, to do useful things it might not be enough that you can sleep:
with iofd you also want to access current task with e.g. copy from user.

Here's an idea: let's pass a flag to ->signal, along the lines of
signal_is_task, that tells us that it is safe to use current, and add
eventfd_signal_task() which is the same as eventfd_signal but lets everyone
know that it's safe to both sleep and use current->mm.

Makes sense?

--
MST

2009-06-16 14:11:29

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Michael S. Tsirkin wrote:
> On Mon, Jun 15, 2009 at 10:29:56PM -0400, Gregory Haskins wrote:
>
>> irqfd and its underlying implementation, eventfd, currently utilize
>> the embedded wait-queue in eventfd for signal notification. The nice thing
>> about this design decision is that it re-uses the existing
>> eventfd/wait-queue code and it generally works well....with several
>> limitations.
>>
>> One of the limitations is that notification callbacks are always called
>> inside a spin_lock_irqsave critical section. Another limitation is
>> that it is very difficult to build a system that can recieve release
>> notification without being racy.
>>
>> Therefore, we introduce a new registration interface that is SRCU based
>> instead of wait-queue based, and implement the internal wait-queue
>> infrastructure in terms of this new interface. We then convert irqfd
>> to use this new interface instead of the existing wait-queue code.
>>
>> The end result is that we now have the opportunity to run the interrupt
>> injection code serially to the callback (when the signal is raised from
>> process-context, at least) instead of always deferring the injection to a
>> work-queue.
>>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> CC: Paul E. McKenney <[email protected]>
>> CC: Davide Libenzi <[email protected]>
>> ---
>>
>> fs/eventfd.c | 115 +++++++++++++++++++++++++++++++++++++++++++----
>> include/linux/eventfd.h | 30 ++++++++++++
>> virt/kvm/eventfd.c | 114 +++++++++++++++++++++--------------------------
>> 3 files changed, 188 insertions(+), 71 deletions(-)
>>
>> diff --git a/fs/eventfd.c b/fs/eventfd.c
>> index 72f5f8d..505d5de 100644
>> --- a/fs/eventfd.c
>> +++ b/fs/eventfd.c
>> @@ -30,8 +30,47 @@ struct eventfd_ctx {
>> */
>> __u64 count;
>> unsigned int flags;
>> + struct srcu_struct srcu;
>> + struct list_head nh;
>> + struct eventfd_notifier notifier;
>> };
>>
>> +static void _eventfd_wqh_notify(struct eventfd_notifier *en)
>> +{
>> + struct eventfd_ctx *ctx = container_of(en,
>> + struct eventfd_ctx,
>> + notifier);
>> +
>> + if (waitqueue_active(&ctx->wqh))
>> + wake_up_poll(&ctx->wqh, POLLIN);
>> +}
>> +
>> +static void _eventfd_notify(struct eventfd_ctx *ctx)
>> +{
>> + struct eventfd_notifier *en;
>> + int idx;
>> +
>> + idx = srcu_read_lock(&ctx->srcu);
>> +
>> + /*
>> + * The goal here is to allow the notification to be preemptible
>> + * as often as possible. We cannot achieve this with the basic
>> + * wqh mechanism because it requires the wqh->lock. Therefore
>> + * we have an internal srcu list mechanism of which the wqh is
>> + * a client.
>> + *
>> + * Not all paths will invoke this function in process context.
>> + * Callers should check for suitable state before assuming they
>> + * can sleep (such as with preemptible()). Paul McKenney assures
>> + * me that srcu_read_lock is compatible with in-atomic, as long as
>> + * the code within the critical section is also compatible.
>> + */
>> + list_for_each_entry_rcu(en, &ctx->nh, list)
>> + en->ops->signal(en);
>> +
>> + srcu_read_unlock(&ctx->srcu, idx);
>> +}
>> +
>> /*
>> * Adds "n" to the eventfd counter "count". Returns "n" in case of
>> * success, or a value lower then "n" in case of coutner overflow.
>>
>
> This is ugly, isn't it? With CONFIG_PREEMPT=no preemptible() is always false.
>
> Further, to do useful things it might not be enough that you can sleep:
> with iofd you also want to access current task with e.g. copy from user.
>
> Here's an idea: let's pass a flag to ->signal, along the lines of
> signal_is_task, that tells us that it is safe to use current, and add
> eventfd_signal_task() which is the same as eventfd_signal but lets everyone
> know that it's safe to both sleep and use current->mm.
>
> Makes sense?
>

It does make sense, yes. What I am not clear on is how would eventfd
detect this state such as to populate such flags, and why cant the
->signal() CB do the same?

Thanks Michael,
-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-16 14:17:53

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Michael S. Tsirkin wrote:
> On Mon, Jun 15, 2009 at 10:29:56PM -0400, Gregory Haskins wrote:
>
>> irqfd and its underlying implementation, eventfd, currently utilize
>> the embedded wait-queue in eventfd for signal notification. The nice thing
>> about this design decision is that it re-uses the existing
>> eventfd/wait-queue code and it generally works well....with several
>> limitations.
>>
>> One of the limitations is that notification callbacks are always called
>> inside a spin_lock_irqsave critical section. Another limitation is
>> that it is very difficult to build a system that can recieve release
>> notification without being racy.
>>
>> Therefore, we introduce a new registration interface that is SRCU based
>> instead of wait-queue based, and implement the internal wait-queue
>> infrastructure in terms of this new interface. We then convert irqfd
>> to use this new interface instead of the existing wait-queue code.
>>
>> The end result is that we now have the opportunity to run the interrupt
>> injection code serially to the callback (when the signal is raised from
>> process-context, at least) instead of always deferring the injection to a
>> work-queue.
>>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> CC: Paul E. McKenney <[email protected]>
>> CC: Davide Libenzi <[email protected]>
>> ---
>>
>> fs/eventfd.c | 115 +++++++++++++++++++++++++++++++++++++++++++----
>> include/linux/eventfd.h | 30 ++++++++++++
>> virt/kvm/eventfd.c | 114 +++++++++++++++++++++--------------------------
>> 3 files changed, 188 insertions(+), 71 deletions(-)
>>
>> diff --git a/fs/eventfd.c b/fs/eventfd.c
>> index 72f5f8d..505d5de 100644
>> --- a/fs/eventfd.c
>> +++ b/fs/eventfd.c
>> @@ -30,8 +30,47 @@ struct eventfd_ctx {
>> */
>> __u64 count;
>> unsigned int flags;
>> + struct srcu_struct srcu;
>> + struct list_head nh;
>> + struct eventfd_notifier notifier;
>> };
>>
>> +static void _eventfd_wqh_notify(struct eventfd_notifier *en)
>> +{
>> + struct eventfd_ctx *ctx = container_of(en,
>> + struct eventfd_ctx,
>> + notifier);
>> +
>> + if (waitqueue_active(&ctx->wqh))
>> + wake_up_poll(&ctx->wqh, POLLIN);
>> +}
>> +
>> +static void _eventfd_notify(struct eventfd_ctx *ctx)
>> +{
>> + struct eventfd_notifier *en;
>> + int idx;
>> +
>> + idx = srcu_read_lock(&ctx->srcu);
>> +
>> + /*
>> + * The goal here is to allow the notification to be preemptible
>> + * as often as possible. We cannot achieve this with the basic
>> + * wqh mechanism because it requires the wqh->lock. Therefore
>> + * we have an internal srcu list mechanism of which the wqh is
>> + * a client.
>> + *
>> + * Not all paths will invoke this function in process context.
>> + * Callers should check for suitable state before assuming they
>> + * can sleep (such as with preemptible()). Paul McKenney assures
>> + * me that srcu_read_lock is compatible with in-atomic, as long as
>> + * the code within the critical section is also compatible.
>> + */
>> + list_for_each_entry_rcu(en, &ctx->nh, list)
>> + en->ops->signal(en);
>> +
>> + srcu_read_unlock(&ctx->srcu, idx);
>> +}
>> +
>> /*
>> * Adds "n" to the eventfd counter "count". Returns "n" in case of
>> * success, or a value lower then "n" in case of coutner overflow.
>>
>
> This is ugly, isn't it? With CONFIG_PREEMPT=no preemptible() is always false.
>

As an aside, this is something I would like to address. I keep running
into this pattern where I could do something in-line if I had a
"can_sleep()" context. Otherwise, I have to punt to something like a
workqueue which adds latency. The closest thing I have to "can_sleep()"
is preemptible(), which, as you correctly pointed out is limited to only
working with CONFIG_PREEMPT=y.

Its been a while since I looked into it, but one of the barriers that
would need to be overcome is the fact that the preempt_count stuff gets
compiled away with CONFIG_PREEMPT=n. It is conceivable that we could
make the preempt_count logic an independent config variable from
CONFIG_PREEMPT to provide a can_sleep() macro without requiring
full-blow preemption to be enabled. So my questions would be as follows:

a) Is the community conducive to such an idea?
b) Are there other things to consider/fix besides the lack of
preempt_count in order to implement such a beast?

-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-16 14:22:38

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

[Adding Ingo]

Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
>
>> On Mon, Jun 15, 2009 at 10:29:56PM -0400, Gregory Haskins wrote:
>>
>>
>>> irqfd and its underlying implementation, eventfd, currently utilize
>>> the embedded wait-queue in eventfd for signal notification. The nice thing
>>> about this design decision is that it re-uses the existing
>>> eventfd/wait-queue code and it generally works well....with several
>>> limitations.
>>>
>>> One of the limitations is that notification callbacks are always called
>>> inside a spin_lock_irqsave critical section. Another limitation is
>>> that it is very difficult to build a system that can recieve release
>>> notification without being racy.
>>>
>>> Therefore, we introduce a new registration interface that is SRCU based
>>> instead of wait-queue based, and implement the internal wait-queue
>>> infrastructure in terms of this new interface. We then convert irqfd
>>> to use this new interface instead of the existing wait-queue code.
>>>
>>> The end result is that we now have the opportunity to run the interrupt
>>> injection code serially to the callback (when the signal is raised from
>>> process-context, at least) instead of always deferring the injection to a
>>> work-queue.
>>>
>>> Signed-off-by: Gregory Haskins <[email protected]>
>>> CC: Paul E. McKenney <[email protected]>
>>> CC: Davide Libenzi <[email protected]>
>>> ---
>>>
>>> fs/eventfd.c | 115 +++++++++++++++++++++++++++++++++++++++++++----
>>> include/linux/eventfd.h | 30 ++++++++++++
>>> virt/kvm/eventfd.c | 114 +++++++++++++++++++++--------------------------
>>> 3 files changed, 188 insertions(+), 71 deletions(-)
>>>
>>> diff --git a/fs/eventfd.c b/fs/eventfd.c
>>> index 72f5f8d..505d5de 100644
>>> --- a/fs/eventfd.c
>>> +++ b/fs/eventfd.c
>>> @@ -30,8 +30,47 @@ struct eventfd_ctx {
>>> */
>>> __u64 count;
>>> unsigned int flags;
>>> + struct srcu_struct srcu;
>>> + struct list_head nh;
>>> + struct eventfd_notifier notifier;
>>> };
>>>
>>> +static void _eventfd_wqh_notify(struct eventfd_notifier *en)
>>> +{
>>> + struct eventfd_ctx *ctx = container_of(en,
>>> + struct eventfd_ctx,
>>> + notifier);
>>> +
>>> + if (waitqueue_active(&ctx->wqh))
>>> + wake_up_poll(&ctx->wqh, POLLIN);
>>> +}
>>> +
>>> +static void _eventfd_notify(struct eventfd_ctx *ctx)
>>> +{
>>> + struct eventfd_notifier *en;
>>> + int idx;
>>> +
>>> + idx = srcu_read_lock(&ctx->srcu);
>>> +
>>> + /*
>>> + * The goal here is to allow the notification to be preemptible
>>> + * as often as possible. We cannot achieve this with the basic
>>> + * wqh mechanism because it requires the wqh->lock. Therefore
>>> + * we have an internal srcu list mechanism of which the wqh is
>>> + * a client.
>>> + *
>>> + * Not all paths will invoke this function in process context.
>>> + * Callers should check for suitable state before assuming they
>>> + * can sleep (such as with preemptible()). Paul McKenney assures
>>> + * me that srcu_read_lock is compatible with in-atomic, as long as
>>> + * the code within the critical section is also compatible.
>>> + */
>>> + list_for_each_entry_rcu(en, &ctx->nh, list)
>>> + en->ops->signal(en);
>>> +
>>> + srcu_read_unlock(&ctx->srcu, idx);
>>> +}
>>> +
>>> /*
>>> * Adds "n" to the eventfd counter "count". Returns "n" in case of
>>> * success, or a value lower then "n" in case of coutner overflow.
>>>
>>>
>> This is ugly, isn't it? With CONFIG_PREEMPT=no preemptible() is always false.
>>
>>
>
> As an aside, this is something I would like to address. I keep running
> into this pattern where I could do something in-line if I had a
> "can_sleep()" context. Otherwise, I have to punt to something like a
> workqueue which adds latency. The closest thing I have to "can_sleep()"
> is preemptible(), which, as you correctly pointed out is limited to only
> working with CONFIG_PREEMPT=y.
>
> Its been a while since I looked into it, but one of the barriers that
> would need to be overcome is the fact that the preempt_count stuff gets
> compiled away with CONFIG_PREEMPT=n. It is conceivable that we could
> make the preempt_count logic an independent config variable from
> CONFIG_PREEMPT to provide a can_sleep() macro without requiring
> full-blow preemption to be enabled. So my questions would be as follows:
>
> a) Is the community conducive to such an idea?
> b) Are there other things to consider/fix besides the lack of
> preempt_count in order to implement such a beast?
>
> -Greg
>
>
>
Hi Ingo,

Any thoughts here?

-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-16 14:38:50

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Tue, Jun 16, 2009 at 10:11:08AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Mon, Jun 15, 2009 at 10:29:56PM -0400, Gregory Haskins wrote:
> >
> >> irqfd and its underlying implementation, eventfd, currently utilize
> >> the embedded wait-queue in eventfd for signal notification. The nice thing
> >> about this design decision is that it re-uses the existing
> >> eventfd/wait-queue code and it generally works well....with several
> >> limitations.
> >>
> >> One of the limitations is that notification callbacks are always called
> >> inside a spin_lock_irqsave critical section. Another limitation is
> >> that it is very difficult to build a system that can recieve release
> >> notification without being racy.
> >>
> >> Therefore, we introduce a new registration interface that is SRCU based
> >> instead of wait-queue based, and implement the internal wait-queue
> >> infrastructure in terms of this new interface. We then convert irqfd
> >> to use this new interface instead of the existing wait-queue code.
> >>
> >> The end result is that we now have the opportunity to run the interrupt
> >> injection code serially to the callback (when the signal is raised from
> >> process-context, at least) instead of always deferring the injection to a
> >> work-queue.
> >>
> >> Signed-off-by: Gregory Haskins <[email protected]>
> >> CC: Paul E. McKenney <[email protected]>
> >> CC: Davide Libenzi <[email protected]>
> >> ---
> >>
> >> fs/eventfd.c | 115 +++++++++++++++++++++++++++++++++++++++++++----
> >> include/linux/eventfd.h | 30 ++++++++++++
> >> virt/kvm/eventfd.c | 114 +++++++++++++++++++++--------------------------
> >> 3 files changed, 188 insertions(+), 71 deletions(-)
> >>
> >> diff --git a/fs/eventfd.c b/fs/eventfd.c
> >> index 72f5f8d..505d5de 100644
> >> --- a/fs/eventfd.c
> >> +++ b/fs/eventfd.c
> >> @@ -30,8 +30,47 @@ struct eventfd_ctx {
> >> */
> >> __u64 count;
> >> unsigned int flags;
> >> + struct srcu_struct srcu;
> >> + struct list_head nh;
> >> + struct eventfd_notifier notifier;
> >> };
> >>
> >> +static void _eventfd_wqh_notify(struct eventfd_notifier *en)
> >> +{
> >> + struct eventfd_ctx *ctx = container_of(en,
> >> + struct eventfd_ctx,
> >> + notifier);
> >> +
> >> + if (waitqueue_active(&ctx->wqh))
> >> + wake_up_poll(&ctx->wqh, POLLIN);
> >> +}
> >> +
> >> +static void _eventfd_notify(struct eventfd_ctx *ctx)
> >> +{
> >> + struct eventfd_notifier *en;
> >> + int idx;
> >> +
> >> + idx = srcu_read_lock(&ctx->srcu);
> >> +
> >> + /*
> >> + * The goal here is to allow the notification to be preemptible
> >> + * as often as possible. We cannot achieve this with the basic
> >> + * wqh mechanism because it requires the wqh->lock. Therefore
> >> + * we have an internal srcu list mechanism of which the wqh is
> >> + * a client.
> >> + *
> >> + * Not all paths will invoke this function in process context.
> >> + * Callers should check for suitable state before assuming they
> >> + * can sleep (such as with preemptible()). Paul McKenney assures
> >> + * me that srcu_read_lock is compatible with in-atomic, as long as
> >> + * the code within the critical section is also compatible.
> >> + */
> >> + list_for_each_entry_rcu(en, &ctx->nh, list)
> >> + en->ops->signal(en);
> >> +
> >> + srcu_read_unlock(&ctx->srcu, idx);
> >> +}
> >> +
> >> /*
> >> * Adds "n" to the eventfd counter "count". Returns "n" in case of
> >> * success, or a value lower then "n" in case of coutner overflow.
> >>
> >
> > This is ugly, isn't it? With CONFIG_PREEMPT=no preemptible() is always false.
> >
> > Further, to do useful things it might not be enough that you can sleep:
> > with iofd you also want to access current task with e.g. copy from user.
> >
> > Here's an idea: let's pass a flag to ->signal, along the lines of
> > signal_is_task, that tells us that it is safe to use current, and add
> > eventfd_signal_task() which is the same as eventfd_signal but lets everyone
> > know that it's safe to both sleep and use current->mm.
> >
> > Makes sense?
> >
>
> It does make sense, yes. What I am not clear on is how would eventfd
> detect this state such as to populate such flags, and why cant the
> ->signal() CB do the same?
>
> Thanks Michael,
> -Greg
>

eventfd can't detect this state. But the callers know in what context they are.
So the *caller* of eventfd_signal_task makes sure of this: if you are in a task,
you can call eventfd_signal_task() if not, you must call eventfd_signal.


--
MST

2009-06-16 14:41:20

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Michael S. Tsirkin wrote:
> On Mon, Jun 15, 2009 at 10:29:56PM -0400, Gregory Haskins wrote:
>
>> irqfd and its underlying implementation, eventfd, currently utilize
>> the embedded wait-queue in eventfd for signal notification. The nice thing
>> about this design decision is that it re-uses the existing
>> eventfd/wait-queue code and it generally works well....with several
>> limitations.
>>
>> One of the limitations is that notification callbacks are always called
>> inside a spin_lock_irqsave critical section. Another limitation is
>> that it is very difficult to build a system that can recieve release
>> notification without being racy.
>>
>> Therefore, we introduce a new registration interface that is SRCU based
>> instead of wait-queue based, and implement the internal wait-queue
>> infrastructure in terms of this new interface. We then convert irqfd
>> to use this new interface instead of the existing wait-queue code.
>>
>> The end result is that we now have the opportunity to run the interrupt
>> injection code serially to the callback (when the signal is raised from
>> process-context, at least) instead of always deferring the injection to a
>> work-queue.
>>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> CC: Paul E. McKenney <[email protected]>
>> CC: Davide Libenzi <[email protected]>
>> ---
>>
>> fs/eventfd.c | 115 +++++++++++++++++++++++++++++++++++++++++++----
>> include/linux/eventfd.h | 30 ++++++++++++
>> virt/kvm/eventfd.c | 114 +++++++++++++++++++++--------------------------
>> 3 files changed, 188 insertions(+), 71 deletions(-)
>>
>> diff --git a/fs/eventfd.c b/fs/eventfd.c
>> index 72f5f8d..505d5de 100644
>> --- a/fs/eventfd.c
>> +++ b/fs/eventfd.c
>> @@ -30,8 +30,47 @@ struct eventfd_ctx {
>> */
>> __u64 count;
>> unsigned int flags;
>> + struct srcu_struct srcu;
>> + struct list_head nh;
>> + struct eventfd_notifier notifier;
>> };
>>
>> +static void _eventfd_wqh_notify(struct eventfd_notifier *en)
>> +{
>> + struct eventfd_ctx *ctx = container_of(en,
>> + struct eventfd_ctx,
>> + notifier);
>> +
>> + if (waitqueue_active(&ctx->wqh))
>> + wake_up_poll(&ctx->wqh, POLLIN);
>> +}
>> +
>> +static void _eventfd_notify(struct eventfd_ctx *ctx)
>> +{
>> + struct eventfd_notifier *en;
>> + int idx;
>> +
>> + idx = srcu_read_lock(&ctx->srcu);
>> +
>> + /*
>> + * The goal here is to allow the notification to be preemptible
>> + * as often as possible. We cannot achieve this with the basic
>> + * wqh mechanism because it requires the wqh->lock. Therefore
>> + * we have an internal srcu list mechanism of which the wqh is
>> + * a client.
>> + *
>> + * Not all paths will invoke this function in process context.
>> + * Callers should check for suitable state before assuming they
>> + * can sleep (such as with preemptible()). Paul McKenney assures
>> + * me that srcu_read_lock is compatible with in-atomic, as long as
>> + * the code within the critical section is also compatible.
>> + */
>> + list_for_each_entry_rcu(en, &ctx->nh, list)
>> + en->ops->signal(en);
>> +
>> + srcu_read_unlock(&ctx->srcu, idx);
>> +}
>> +
>> /*
>> * Adds "n" to the eventfd counter "count". Returns "n" in case of
>> * success, or a value lower then "n" in case of coutner overflow.
>>
>
> This is ugly, isn't it? With CONFIG_PREEMPT=no preemptible() is always false.
>
> Further, to do useful things it might not be enough that you can sleep:
> with iofd you also want to access current task with e.g. copy from user.
>

Something else to consider: For iosignalfd, we can assume we will
always be called from vcpu process context so we might not really need
official affirmation from the system. For irqfd, we cannot predict who
may be injecting the interrupt (for instance, it might be a
PCI-passthrough hard-irq context). I am not sure if this knowledge
actually helps to simplify the problem space, but I thought I should
mention it.

-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-16 14:46:47

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Tue, Jun 16, 2009 at 10:40:55AM -0400, Gregory Haskins wrote:
> > This is ugly, isn't it? With CONFIG_PREEMPT=no preemptible() is always false.
> >
> > Further, to do useful things it might not be enough that you can sleep:
> > with iofd you also want to access current task with e.g. copy from user.
> >
>
> Something else to consider: For iosignalfd, we can assume we will
> always be called from vcpu process context so we might not really need
> official affirmation from the system. For irqfd, we cannot predict who
> may be injecting the interrupt (for instance, it might be a
> PCI-passthrough hard-irq context). I am not sure if this knowledge
> actually helps to simplify the problem space, but I thought I should
> mention it.
>
> -Greg
>
>

The way this is addressed with eventfd_signal_task proposal is:
- user calls eventfd_signal_task
we look at current->mm and figure out whether this is the right
context or we need a context switch
- everyone else calls eventfd_signal
we know that we need a context switch

2009-06-16 14:48:49

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Michael S. Tsirkin wrote:
> On Tue, Jun 16, 2009 at 10:11:08AM -0400, Gregory Haskins wrote:
>
>> Michael S. Tsirkin wrote:
>>
>>> On Mon, Jun 15, 2009 at 10:29:56PM -0400, Gregory Haskins wrote:
>>>
>>>
>>>> irqfd and its underlying implementation, eventfd, currently utilize
>>>> the embedded wait-queue in eventfd for signal notification. The nice thing
>>>> about this design decision is that it re-uses the existing
>>>> eventfd/wait-queue code and it generally works well....with several
>>>> limitations.
>>>>
>>>> One of the limitations is that notification callbacks are always called
>>>> inside a spin_lock_irqsave critical section. Another limitation is
>>>> that it is very difficult to build a system that can recieve release
>>>> notification without being racy.
>>>>
>>>> Therefore, we introduce a new registration interface that is SRCU based
>>>> instead of wait-queue based, and implement the internal wait-queue
>>>> infrastructure in terms of this new interface. We then convert irqfd
>>>> to use this new interface instead of the existing wait-queue code.
>>>>
>>>> The end result is that we now have the opportunity to run the interrupt
>>>> injection code serially to the callback (when the signal is raised from
>>>> process-context, at least) instead of always deferring the injection to a
>>>> work-queue.
>>>>
>>>> Signed-off-by: Gregory Haskins <[email protected]>
>>>> CC: Paul E. McKenney <[email protected]>
>>>> CC: Davide Libenzi <[email protected]>
>>>> ---
>>>>
>>>> fs/eventfd.c | 115 +++++++++++++++++++++++++++++++++++++++++++----
>>>> include/linux/eventfd.h | 30 ++++++++++++
>>>> virt/kvm/eventfd.c | 114 +++++++++++++++++++++--------------------------
>>>> 3 files changed, 188 insertions(+), 71 deletions(-)
>>>>
>>>> diff --git a/fs/eventfd.c b/fs/eventfd.c
>>>> index 72f5f8d..505d5de 100644
>>>> --- a/fs/eventfd.c
>>>> +++ b/fs/eventfd.c
>>>> @@ -30,8 +30,47 @@ struct eventfd_ctx {
>>>> */
>>>> __u64 count;
>>>> unsigned int flags;
>>>> + struct srcu_struct srcu;
>>>> + struct list_head nh;
>>>> + struct eventfd_notifier notifier;
>>>> };
>>>>
>>>> +static void _eventfd_wqh_notify(struct eventfd_notifier *en)
>>>> +{
>>>> + struct eventfd_ctx *ctx = container_of(en,
>>>> + struct eventfd_ctx,
>>>> + notifier);
>>>> +
>>>> + if (waitqueue_active(&ctx->wqh))
>>>> + wake_up_poll(&ctx->wqh, POLLIN);
>>>> +}
>>>> +
>>>> +static void _eventfd_notify(struct eventfd_ctx *ctx)
>>>> +{
>>>> + struct eventfd_notifier *en;
>>>> + int idx;
>>>> +
>>>> + idx = srcu_read_lock(&ctx->srcu);
>>>> +
>>>> + /*
>>>> + * The goal here is to allow the notification to be preemptible
>>>> + * as often as possible. We cannot achieve this with the basic
>>>> + * wqh mechanism because it requires the wqh->lock. Therefore
>>>> + * we have an internal srcu list mechanism of which the wqh is
>>>> + * a client.
>>>> + *
>>>> + * Not all paths will invoke this function in process context.
>>>> + * Callers should check for suitable state before assuming they
>>>> + * can sleep (such as with preemptible()). Paul McKenney assures
>>>> + * me that srcu_read_lock is compatible with in-atomic, as long as
>>>> + * the code within the critical section is also compatible.
>>>> + */
>>>> + list_for_each_entry_rcu(en, &ctx->nh, list)
>>>> + en->ops->signal(en);
>>>> +
>>>> + srcu_read_unlock(&ctx->srcu, idx);
>>>> +}
>>>> +
>>>> /*
>>>> * Adds "n" to the eventfd counter "count". Returns "n" in case of
>>>> * success, or a value lower then "n" in case of coutner overflow.
>>>>
>>>>
>>> This is ugly, isn't it? With CONFIG_PREEMPT=no preemptible() is always false.
>>>
>>> Further, to do useful things it might not be enough that you can sleep:
>>> with iofd you also want to access current task with e.g. copy from user.
>>>
>>> Here's an idea: let's pass a flag to ->signal, along the lines of
>>> signal_is_task, that tells us that it is safe to use current, and add
>>> eventfd_signal_task() which is the same as eventfd_signal but lets everyone
>>> know that it's safe to both sleep and use current->mm.
>>>
>>> Makes sense?
>>>
>>>
>> It does make sense, yes. What I am not clear on is how would eventfd
>> detect this state such as to populate such flags, and why cant the
>> ->signal() CB do the same?
>>
>> Thanks Michael,
>> -Greg
>>
>>
>
> eventfd can't detect this state. But the callers know in what context they are.
> So the *caller* of eventfd_signal_task makes sure of this: if you are in a task,
> you can call eventfd_signal_task() if not, you must call eventfd_signal.
>
>
>
Hmm, this is an interesting idea, but I think it would be problematic in
real-world applications for the long-term. For instance, the -rt tree
and irq-threads .config option in the process of merging into mainline
changes context types for established code. Therefore, what might be
"hardirq/softirq" logic today may execute in a kthread tomorrow. I
think its dangerous to try to solve the problem with caller provided
info: the caller may be ignorant of its true state. IMO, the ideal
solution needs to be something we can detect at run-time.

Thanks Michael,
-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-16 14:54:46

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
>
>> On Tue, Jun 16, 2009 at 10:11:08AM -0400, Gregory Haskins wrote:
>>
>>
>>> Michael S. Tsirkin wrote:
>>>
>>>
>>>> On Mon, Jun 15, 2009 at 10:29:56PM -0400, Gregory Haskins wrote:
>>>>
>>>>
>>>>
>>>>> irqfd and its underlying implementation, eventfd, currently utilize
>>>>> the embedded wait-queue in eventfd for signal notification. The nice thing
>>>>> about this design decision is that it re-uses the existing
>>>>> eventfd/wait-queue code and it generally works well....with several
>>>>> limitations.
>>>>>
>>>>> One of the limitations is that notification callbacks are always called
>>>>> inside a spin_lock_irqsave critical section. Another limitation is
>>>>> that it is very difficult to build a system that can recieve release
>>>>> notification without being racy.
>>>>>
>>>>> Therefore, we introduce a new registration interface that is SRCU based
>>>>> instead of wait-queue based, and implement the internal wait-queue
>>>>> infrastructure in terms of this new interface. We then convert irqfd
>>>>> to use this new interface instead of the existing wait-queue code.
>>>>>
>>>>> The end result is that we now have the opportunity to run the interrupt
>>>>> injection code serially to the callback (when the signal is raised from
>>>>> process-context, at least) instead of always deferring the injection to a
>>>>> work-queue.
>>>>>
>>>>> Signed-off-by: Gregory Haskins <[email protected]>
>>>>> CC: Paul E. McKenney <[email protected]>
>>>>> CC: Davide Libenzi <[email protected]>
>>>>> ---
>>>>>
>>>>> fs/eventfd.c | 115 +++++++++++++++++++++++++++++++++++++++++++----
>>>>> include/linux/eventfd.h | 30 ++++++++++++
>>>>> virt/kvm/eventfd.c | 114 +++++++++++++++++++++--------------------------
>>>>> 3 files changed, 188 insertions(+), 71 deletions(-)
>>>>>
>>>>> diff --git a/fs/eventfd.c b/fs/eventfd.c
>>>>> index 72f5f8d..505d5de 100644
>>>>> --- a/fs/eventfd.c
>>>>> +++ b/fs/eventfd.c
>>>>> @@ -30,8 +30,47 @@ struct eventfd_ctx {
>>>>> */
>>>>> __u64 count;
>>>>> unsigned int flags;
>>>>> + struct srcu_struct srcu;
>>>>> + struct list_head nh;
>>>>> + struct eventfd_notifier notifier;
>>>>> };
>>>>>
>>>>> +static void _eventfd_wqh_notify(struct eventfd_notifier *en)
>>>>> +{
>>>>> + struct eventfd_ctx *ctx = container_of(en,
>>>>> + struct eventfd_ctx,
>>>>> + notifier);
>>>>> +
>>>>> + if (waitqueue_active(&ctx->wqh))
>>>>> + wake_up_poll(&ctx->wqh, POLLIN);
>>>>> +}
>>>>> +
>>>>> +static void _eventfd_notify(struct eventfd_ctx *ctx)
>>>>> +{
>>>>> + struct eventfd_notifier *en;
>>>>> + int idx;
>>>>> +
>>>>> + idx = srcu_read_lock(&ctx->srcu);
>>>>> +
>>>>> + /*
>>>>> + * The goal here is to allow the notification to be preemptible
>>>>> + * as often as possible. We cannot achieve this with the basic
>>>>> + * wqh mechanism because it requires the wqh->lock. Therefore
>>>>> + * we have an internal srcu list mechanism of which the wqh is
>>>>> + * a client.
>>>>> + *
>>>>> + * Not all paths will invoke this function in process context.
>>>>> + * Callers should check for suitable state before assuming they
>>>>> + * can sleep (such as with preemptible()). Paul McKenney assures
>>>>> + * me that srcu_read_lock is compatible with in-atomic, as long as
>>>>> + * the code within the critical section is also compatible.
>>>>> + */
>>>>> + list_for_each_entry_rcu(en, &ctx->nh, list)
>>>>> + en->ops->signal(en);
>>>>> +
>>>>> + srcu_read_unlock(&ctx->srcu, idx);
>>>>> +}
>>>>> +
>>>>> /*
>>>>> * Adds "n" to the eventfd counter "count". Returns "n" in case of
>>>>> * success, or a value lower then "n" in case of coutner overflow.
>>>>>
>>>>>
>>>>>
>>>> This is ugly, isn't it? With CONFIG_PREEMPT=no preemptible() is always false.
>>>>
>>>> Further, to do useful things it might not be enough that you can sleep:
>>>> with iofd you also want to access current task with e.g. copy from user.
>>>>
>>>> Here's an idea: let's pass a flag to ->signal, along the lines of
>>>> signal_is_task, that tells us that it is safe to use current, and add
>>>> eventfd_signal_task() which is the same as eventfd_signal but lets everyone
>>>> know that it's safe to both sleep and use current->mm.
>>>>
>>>> Makes sense?
>>>>
>>>>
>>>>
>>> It does make sense, yes. What I am not clear on is how would eventfd
>>> detect this state such as to populate such flags, and why cant the
>>> ->signal() CB do the same?
>>>
>>> Thanks Michael,
>>> -Greg
>>>
>>>
>>>
>> eventfd can't detect this state. But the callers know in what context they are.
>> So the *caller* of eventfd_signal_task makes sure of this: if you are in a task,
>> you can call eventfd_signal_task() if not, you must call eventfd_signal.
>>
>>
>>
>>
> Hmm, this is an interesting idea, but I think it would be problematic in
> real-world applications for the long-term. For instance, the -rt tree
> and irq-threads .config option in the process of merging into mainline
> changes context types for established code. Therefore, what might be
> "hardirq/softirq" logic today may execute in a kthread tomorrow. I
> think its dangerous to try to solve the problem with caller provided
> info: the caller may be ignorant of its true state.

Also, we need to consider that a process context can still be in-atomic
if the user did something like disabled interrupts, preemption, used a
spinlock, etc, before calling the eventfd_signal_task() function.
Perhaps we can put a stake in the ground that says you must not call
this from atomic context, but I still prefer just being able to detect
this from our state.

-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-16 14:55:43

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Tue, Jun 16, 2009 at 10:48:27AM -0400, Gregory Haskins wrote:
> >>>> +static void _eventfd_notify(struct eventfd_ctx *ctx)
> >>>> +{
> >>>> + struct eventfd_notifier *en;
> >>>> + int idx;
> >>>> +
> >>>> + idx = srcu_read_lock(&ctx->srcu);
> >>>> +
> >>>> + /*
> >>>> + * The goal here is to allow the notification to be preemptible
> >>>> + * as often as possible. We cannot achieve this with the basic
> >>>> + * wqh mechanism because it requires the wqh->lock. Therefore
> >>>> + * we have an internal srcu list mechanism of which the wqh is
> >>>> + * a client.
> >>>> + *
> >>>> + * Not all paths will invoke this function in process context.
> >>>> + * Callers should check for suitable state before assuming they
> >>>> + * can sleep (such as with preemptible()). Paul McKenney assures
> >>>> + * me that srcu_read_lock is compatible with in-atomic, as long as
> >>>> + * the code within the critical section is also compatible.
> >>>> + */
> >>>> + list_for_each_entry_rcu(en, &ctx->nh, list)
> >>>> + en->ops->signal(en);
> >>>> +
> >>>> + srcu_read_unlock(&ctx->srcu, idx);
> >>>> +}
> >>>> +
> >>>> /*
> >>>> * Adds "n" to the eventfd counter "count". Returns "n" in case of
> >>>> * success, or a value lower then "n" in case of coutner overflow.
> >>>>
> >>>>
> >>> This is ugly, isn't it? With CONFIG_PREEMPT=no preemptible() is always false.
> >>>
> >>> Further, to do useful things it might not be enough that you can sleep:
> >>> with iofd you also want to access current task with e.g. copy from user.
> >>>
> >>> Here's an idea: let's pass a flag to ->signal, along the lines of
> >>> signal_is_task, that tells us that it is safe to use current, and add
> >>> eventfd_signal_task() which is the same as eventfd_signal but lets everyone
> >>> know that it's safe to both sleep and use current->mm.
> >>>
> >>> Makes sense?
> >>>
> >>>
> >> It does make sense, yes. What I am not clear on is how would eventfd
> >> detect this state such as to populate such flags, and why cant the
> >> ->signal() CB do the same?
> >>
> >> Thanks Michael,
> >> -Greg
> >>
> >>
> >
> > eventfd can't detect this state. But the callers know in what context they are.
> > So the *caller* of eventfd_signal_task makes sure of this: if you are in a task,
> > you can call eventfd_signal_task() if not, you must call eventfd_signal.
> >
> >
> >
> Hmm, this is an interesting idea, but I think it would be problematic in
> real-world applications for the long-term. For instance, the -rt tree
> and irq-threads .config option in the process of merging into mainline
> changes context types for established code. Therefore, what might be
> "hardirq/softirq" logic today may execute in a kthread tomorrow.

That's OK, it's always safe to call eventfd_signal: eventfd_signal_task is just
an optimization. I think everyone not in the context of a system call or vmexit
can just call eventfd_signal_task.

> I
> think its dangerous to try to solve the problem with caller provided
> info: the caller may be ignorant of its true state.

I assume this wasn't clear enough: the idea is that you only
calls eventfd_signal_task if you know you are on a systemcall path.
If you are ignorant of the state, call eventfd_signal.

> IMO, the ideal
> solution needs to be something we can detect at run-time.
>
> Thanks Michael,
> -Greg
>


--
MST

2009-06-16 15:17:08

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Tue, Jun 16, 2009 at 10:54:28AM -0400, Gregory Haskins wrote:
> >>>>> +static void _eventfd_notify(struct eventfd_ctx *ctx)
> >>>>> +{
> >>>>> + struct eventfd_notifier *en;
> >>>>> + int idx;
> >>>>> +
> >>>>> + idx = srcu_read_lock(&ctx->srcu);
> >>>>> +
> >>>>> + /*
> >>>>> + * The goal here is to allow the notification to be preemptible
> >>>>> + * as often as possible. We cannot achieve this with the basic
> >>>>> + * wqh mechanism because it requires the wqh->lock. Therefore
> >>>>> + * we have an internal srcu list mechanism of which the wqh is
> >>>>> + * a client.
> >>>>> + *
> >>>>> + * Not all paths will invoke this function in process context.
> >>>>> + * Callers should check for suitable state before assuming they
> >>>>> + * can sleep (such as with preemptible()). Paul McKenney assures
> >>>>> + * me that srcu_read_lock is compatible with in-atomic, as long as
> >>>>> + * the code within the critical section is also compatible.
> >>>>> + */
> >>>>> + list_for_each_entry_rcu(en, &ctx->nh, list)
> >>>>> + en->ops->signal(en);
> >>>>> +
> >>>>> + srcu_read_unlock(&ctx->srcu, idx);
> >>>>> +}
> >>>>> +
> >>>>> /*
> >>>>> * Adds "n" to the eventfd counter "count". Returns "n" in case of
> >>>>> * success, or a value lower then "n" in case of coutner overflow.
> >>>>>
> >>>>>
> >>>>>
> >>>> This is ugly, isn't it? With CONFIG_PREEMPT=no preemptible() is always false.
> >>>>
> >>>> Further, to do useful things it might not be enough that you can sleep:
> >>>> with iofd you also want to access current task with e.g. copy from user.
> >>>>
> >>>> Here's an idea: let's pass a flag to ->signal, along the lines of
> >>>> signal_is_task, that tells us that it is safe to use current, and add
> >>>> eventfd_signal_task() which is the same as eventfd_signal but lets everyone
> >>>> know that it's safe to both sleep and use current->mm.
> >>>>
> >>>> Makes sense?
> >>>>
> >>>>
> >>>>
> >>> It does make sense, yes. What I am not clear on is how would eventfd
> >>> detect this state such as to populate such flags, and why cant the
> >>> ->signal() CB do the same?
> >>>
> >>> Thanks Michael,
> >>> -Greg
> >>>
> >>>
> >>>
> >> eventfd can't detect this state. But the callers know in what context they are.
> >> So the *caller* of eventfd_signal_task makes sure of this: if you are in a task,
> >> you can call eventfd_signal_task() if not, you must call eventfd_signal.
> >>
> >>
> >>
> >>
> > Hmm, this is an interesting idea, but I think it would be problematic in
> > real-world applications for the long-term. For instance, the -rt tree
> > and irq-threads .config option in the process of merging into mainline
> > changes context types for established code. Therefore, what might be
> > "hardirq/softirq" logic today may execute in a kthread tomorrow. I
> > think its dangerous to try to solve the problem with caller provided
> > info: the caller may be ignorant of its true state.
>
> Also, we need to consider that a process context can still be in-atomic
> if the user did something like disabled interrupts, preemption, used a
> spinlock, etc, before calling the eventfd_signal_task() function.
> Perhaps we can put a stake in the ground that says you must not call
> this from atomic context,

That's the ticket.

> but I still prefer just being able to detect
> this from our state.
>
> -Greg
>
>

--
MST

2009-06-16 15:20:37

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Michael S. Tsirkin wrote:
> On Tue, Jun 16, 2009 at 10:48:27AM -0400, Gregory Haskins wrote:
>
>>>>>> +static void _eventfd_notify(struct eventfd_ctx *ctx)
>>>>>> +{
>>>>>> + struct eventfd_notifier *en;
>>>>>> + int idx;
>>>>>> +
>>>>>> + idx = srcu_read_lock(&ctx->srcu);
>>>>>> +
>>>>>> + /*
>>>>>> + * The goal here is to allow the notification to be preemptible
>>>>>> + * as often as possible. We cannot achieve this with the basic
>>>>>> + * wqh mechanism because it requires the wqh->lock. Therefore
>>>>>> + * we have an internal srcu list mechanism of which the wqh is
>>>>>> + * a client.
>>>>>> + *
>>>>>> + * Not all paths will invoke this function in process context.
>>>>>> + * Callers should check for suitable state before assuming they
>>>>>> + * can sleep (such as with preemptible()). Paul McKenney assures
>>>>>> + * me that srcu_read_lock is compatible with in-atomic, as long as
>>>>>> + * the code within the critical section is also compatible.
>>>>>> + */
>>>>>> + list_for_each_entry_rcu(en, &ctx->nh, list)
>>>>>> + en->ops->signal(en);
>>>>>> +
>>>>>> + srcu_read_unlock(&ctx->srcu, idx);
>>>>>> +}
>>>>>> +
>>>>>> /*
>>>>>> * Adds "n" to the eventfd counter "count". Returns "n" in case of
>>>>>> * success, or a value lower then "n" in case of coutner overflow.
>>>>>>
>>>>>>
>>>>>>
>>>>> This is ugly, isn't it? With CONFIG_PREEMPT=no preemptible() is always false.
>>>>>
>>>>> Further, to do useful things it might not be enough that you can sleep:
>>>>> with iofd you also want to access current task with e.g. copy from user.
>>>>>
>>>>> Here's an idea: let's pass a flag to ->signal, along the lines of
>>>>> signal_is_task, that tells us that it is safe to use current, and add
>>>>> eventfd_signal_task() which is the same as eventfd_signal but lets everyone
>>>>> know that it's safe to both sleep and use current->mm.
>>>>>
>>>>> Makes sense?
>>>>>
>>>>>
>>>>>
>>>> It does make sense, yes. What I am not clear on is how would eventfd
>>>> detect this state such as to populate such flags, and why cant the
>>>> ->signal() CB do the same?
>>>>
>>>> Thanks Michael,
>>>> -Greg
>>>>
>>>>
>>>>
>>> eventfd can't detect this state. But the callers know in what context they are.
>>> So the *caller* of eventfd_signal_task makes sure of this: if you are in a task,
>>> you can call eventfd_signal_task() if not, you must call eventfd_signal.
>>>
>>>
>>>
>>>
>> Hmm, this is an interesting idea, but I think it would be problematic in
>> real-world applications for the long-term. For instance, the -rt tree
>> and irq-threads .config option in the process of merging into mainline
>> changes context types for established code. Therefore, what might be
>> "hardirq/softirq" logic today may execute in a kthread tomorrow.
>>
>
> That's OK, it's always safe to call eventfd_signal: eventfd_signal_task is just
> an optimization. I think everyone not in the context of a system call or vmexit
> can just call eventfd_signal_task.
>
^^^^^^^^^^^^^^^^^^^^

I assume you meant s/eventfd_signal_task/eventfd_signal there?

>
>> I
>> think its dangerous to try to solve the problem with caller provided
>> info: the caller may be ignorant of its true state.
>>
>
> I assume this wasn't clear enough: the idea is that you only
> calls eventfd_signal_task if you know you are on a systemcall path.
> If you are ignorant of the state, call eventfd_signal.
>

Well, its not a matter of correctness. Its more for optimal
performance. If I have PCI pass-though injecting interrupts from
hardirq in mainline, clearly eventfd_signal() is proper. In -rt, the
hardirq is transparently converted to a kthread, so technically
eventfd_signal_task() would work (at least for the can_sleep() part, not
for current->mm per se). But in this case, the PCI logic would not know
it was converted to a kthread. It all happens transparently in the
low-level code and the pci code is unmodified.

In this case, your proposal would have the passthrough path invoking
irqfd with eventfd_signal(). It would therefore still shunt to a
workqueue to inject the interrupt, even though it would have been
perfectly fine to just inject it directly because taking
mutex_lock(&kvm->irq_lock) is legal. Perhaps I am over-optimizing, but
this is the scenario I am concerned about and what I was trying to
address with preemptible()/can_sleep().

I think your idea is a good one to address the current->mm portion. It
would only ever be safe to access the MM context from syscall/vmexit
context, as you point out. Therefore, I see no problem with
implementing something like iosignalfd with eventfd_signal_task().

But accessing current->mm is only a subset of the use-cases. The other
use-cases would include the ability to sleep, and possibly the ability
to address other->mm. For these latter cases, I really only need the
"can_sleep()" behavior, not the full blown "can_access_current_mm()".
Additionally, the eventfd_signal_task() data at least for iosignalfd is
superfluous: I already know that I can access current->mm by virtue of
the design.

So since I cannot use it accurately for the hardirq/threaded-irq type
case, and I don't actually need it for the iosignalfd case, I am not
sure its the right direction (at least for now). I do think it might
have merit for syscal/vmexit uses outside of iosignalfd, but I do not
see a current use-case for it so perhaps it can wait until one arises.

-Greg

>
>> IMO, the ideal
>> solution needs to be something we can detect at run-time.
>>
>> Thanks Michael,
>> -Greg
>>
>>
>
>
>



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-16 15:42:43

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Tue, Jun 16, 2009 at 11:20:18AM -0400, Gregory Haskins wrote:
> >>> eventfd can't detect this state. But the callers know in what context they are.
> >>> So the *caller* of eventfd_signal_task makes sure of this: if you are in a task,
> >>> you can call eventfd_signal_task() if not, you must call eventfd_signal.
> >>>
> >>>
> >>>
> >>>
> >> Hmm, this is an interesting idea, but I think it would be problematic in
> >> real-world applications for the long-term. For instance, the -rt tree
> >> and irq-threads .config option in the process of merging into mainline
> >> changes context types for established code. Therefore, what might be
> >> "hardirq/softirq" logic today may execute in a kthread tomorrow.
> >>
> >
> > That's OK, it's always safe to call eventfd_signal: eventfd_signal_task is just
> > an optimization. I think everyone not in the context of a system call or vmexit
> > can just call eventfd_signal_task.
> >
> ^^^^^^^^^^^^^^^^^^^^
>
> I assume you meant s/eventfd_signal_task/eventfd_signal there?

Yea. Sorry.

> >
> >> I
> >> think its dangerous to try to solve the problem with caller provided
> >> info: the caller may be ignorant of its true state.
> >>
> >
> > I assume this wasn't clear enough: the idea is that you only
> > calls eventfd_signal_task if you know you are on a systemcall path.
> > If you are ignorant of the state, call eventfd_signal.
> >
>
> Well, its not a matter of correctness. Its more for optimal
> performance. If I have PCI pass-though injecting interrupts from
> hardirq in mainline, clearly eventfd_signal() is proper. In -rt, the
> hardirq is transparently converted to a kthread, so technically
> eventfd_signal_task() would work

I think it's wrong to sleep for a long time while handling interrupts even if
you technically can with threaded interrupts. For that matter, I think you can
sleep while within a spinlock if preempt is on, but that does not mean you
should - and I think you will get warnings in the log if you do. No?

> (at least for the can_sleep() part, not
> for current->mm per se). But in this case, the PCI logic would not know
> it was converted to a kthread. It all happens transparently in the
> low-level code and the pci code is unmodified.
>
> In this case, your proposal would have the passthrough path invoking
> irqfd with eventfd_signal(). It would therefore still shunt to a
> workqueue to inject the interrupt, even though it would have been
> perfectly fine to just inject it directly because taking
> mutex_lock(&kvm->irq_lock) is legal.

This specific issue should just be addressed by making it possible
to inject the interrupt from an atomic context.

> Perhaps I am over-optimizing, but
> this is the scenario I am concerned about and what I was trying to
> address with preemptible()/can_sleep().

What, a path that is never invoked without threaded interrupts?
Yes, I would say it currently looks like an over-optimization.

> I think your idea is a good one to address the current->mm portion. It
> would only ever be safe to access the MM context from syscall/vmexit
> context, as you point out. Therefore, I see no problem with
> implementing something like iosignalfd with eventfd_signal_task().
>
> But accessing current->mm is only a subset of the use-cases. The other
> use-cases would include the ability to sleep, and possibly the ability
> to address other->mm. For these latter cases, I really only need the
> "can_sleep()" behavior, not the full blown "can_access_current_mm()".
> Additionally, the eventfd_signal_task() data at least for iosignalfd is
> superfluous: I already know that I can access current->mm by virtue of
> the design.

Maybe I misunderstand what iosignalfd is - isn't it true that you get eventfd
and kvm will signal that when guest performs io write to a specific
address, and then drivers can get eventfd and poll it to detect
these writes?

If yes, you have no way to know that the other end of eventfd
is connected to kvm, so you don't know you can access current->mm.

> So since I cannot use it accurately for the hardirq/threaded-irq type
> case, and I don't actually need it for the iosignalfd case, I am not
> sure its the right direction (at least for now). I do think it might
> have merit for syscal/vmexit uses outside of iosignalfd, but I do not
> see a current use-case for it so perhaps it can wait until one arises.
>
> -Greg

But, it addresses the CONFIG_PREEMPT off case, which your design doesn't
seem to.

> >
> >> IMO, the ideal
> >> solution needs to be something we can detect at run-time.
> >>
> >> Thanks Michael,
> >> -Greg
> >>
> >>
> >
> >
> >
>
>

2009-06-16 16:17:44

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Michael S. Tsirkin wrote:
> On Tue, Jun 16, 2009 at 11:20:18AM -0400, Gregory Haskins wrote:
>
>>>>> eventfd can't detect this state. But the callers know in what context they are.
>>>>> So the *caller* of eventfd_signal_task makes sure of this: if you are in a task,
>>>>> you can call eventfd_signal_task() if not, you must call eventfd_signal.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>> Hmm, this is an interesting idea, but I think it would be problematic in
>>>> real-world applications for the long-term. For instance, the -rt tree
>>>> and irq-threads .config option in the process of merging into mainline
>>>> changes context types for established code. Therefore, what might be
>>>> "hardirq/softirq" logic today may execute in a kthread tomorrow.
>>>>
>>>>
>>> That's OK, it's always safe to call eventfd_signal: eventfd_signal_task is just
>>> an optimization. I think everyone not in the context of a system call or vmexit
>>> can just call eventfd_signal_task.
>>>
>>>
>> ^^^^^^^^^^^^^^^^^^^^
>>
>> I assume you meant s/eventfd_signal_task/eventfd_signal there?
>>
>
> Yea. Sorry.
>

np. I knew what you meant ;)
>
>>>
>>>
>>>> I
>>>> think its dangerous to try to solve the problem with caller provided
>>>> info: the caller may be ignorant of its true state.
>>>>
>>>>
>>> I assume this wasn't clear enough: the idea is that you only
>>> calls eventfd_signal_task if you know you are on a systemcall path.
>>> If you are ignorant of the state, call eventfd_signal.
>>>
>>>
>> Well, its not a matter of correctness. Its more for optimal
>> performance. If I have PCI pass-though injecting interrupts from
>> hardirq in mainline, clearly eventfd_signal() is proper. In -rt, the
>> hardirq is transparently converted to a kthread, so technically
>> eventfd_signal_task() would work
>>
>
> I think it's wrong to sleep for a long time while handling interrupts even if
> you technically can with threaded interrupts.

Well, this is somewhat of an orthogonal issue so I don't want to open
this can of worms per se.

But, one of the long-term goals of the threaded-irq construct is to
eliminate the need for the traditional "bh" contexts to do work. The
idea, of course, is that the threaded-irq context can do all of the work
traditionally shunted to tasklets/softirqs/workqueues directly, so why
bother switching contexts. So, the short answer is that its not
necessarily wrong to sleep or to do significant processing from a
threaded-irq.

In any case, threaded-irqs are just one example. I will try to
highlight others below.

> For that matter, I think you can
> sleep while within a spinlock if preempt is on
Yes, in fact the spinlocks are mutexes in this mode, so the locks
themselves can sleep.


> , but that does not mean you
> should - and I think you will get warnings in the log if you do. No?
>
>
Nope, sleeping is fine (voluntary or involuntary). The idea is its all
governed by priority, and priority-inheritance, and a scheduler that is
free to make fine-grained decisions (due to the broadly preemptible
kernel). But this is getting off-topic so I will stop.

>> (at least for the can_sleep() part, not
>> for current->mm per se). But in this case, the PCI logic would not know
>> it was converted to a kthread. It all happens transparently in the
>> low-level code and the pci code is unmodified.
>>
>> In this case, your proposal would have the passthrough path invoking
>> irqfd with eventfd_signal(). It would therefore still shunt to a
>> workqueue to inject the interrupt, even though it would have been
>> perfectly fine to just inject it directly because taking
>> mutex_lock(&kvm->irq_lock) is legal.
>>
>
> This specific issue should just be addressed by making it possible
> to inject the interrupt from an atomic context.
>

I assume you mean
s/mutex_lock(&kvm->irq_lock)/spin_lock(&kvm->irq_lock)? If so, I agree
this is a good idea. TBH: I am more concerned about the iosignalfd path
w.r.t. the premptiblity of the callback. We can optimize the virtio-net
interface, for instance, once we make this ->signal() callback
preemptible. Having irqfd ->signal() preemptible is just a bonus, but
we could work around it by fixing irq_lock as well, I agree.

>
>> Perhaps I am over-optimizing, but
>> this is the scenario I am concerned about and what I was trying to
>> address with preemptible()/can_sleep().
>>
>
> What, a path that is never invoked without threaded interrupts?
> Yes, I would say it currently looks like an over-optimization.
>

You are only seeing part of the picture though. This is a cascading
pattern.
>
>> I think your idea is a good one to address the current->mm portion. It
>> would only ever be safe to access the MM context from syscall/vmexit
>> context, as you point out. Therefore, I see no problem with
>> implementing something like iosignalfd with eventfd_signal_task().
>>
>> But accessing current->mm is only a subset of the use-cases. The other
>> use-cases would include the ability to sleep, and possibly the ability
>> to address other->mm. For these latter cases, I really only need the
>> "can_sleep()" behavior, not the full blown "can_access_current_mm()".
>> Additionally, the eventfd_signal_task() data at least for iosignalfd is
>> superfluous: I already know that I can access current->mm by virtue of
>> the design.
>>
>
> Maybe I misunderstand what iosignalfd is - isn't it true that you get eventfd
> and kvm will signal that when guest performs io write to a specific
> address, and then drivers can get eventfd and poll it to detect
> these writes?
>

Correct.

> If yes, you have no way to know that the other end of eventfd
> is connected to kvm, so you don't know you can access current->mm.
>

Well, my intended use for them *does* know its connected to KVM.
Perhaps there will be others that do not in the future, but like I said
it could be addressed as those actually arise.


>
>> So since I cannot use it accurately for the hardirq/threaded-irq type
>> case, and I don't actually need it for the iosignalfd case, I am not
>> sure its the right direction (at least for now). I do think it might
>> have merit for syscal/vmexit uses outside of iosignalfd, but I do not
>> see a current use-case for it so perhaps it can wait until one arises.
>>
>> -Greg
>>
>
> But, it addresses the CONFIG_PREEMPT off case, which your design doesn't
> seem to.
>

Well, the problem is that it only addresses it partially in a limited
set of circumstances, and doesn't address the broader problem. I'll
give you an example:

(First off, lets assume that we are not going to have
eventfd_signal_task(), but rather eventfd_signal() with two option
flags: EVENTFD_SIGNAL_CANSLEEP, and EVENTFD_SIGNAL_CURRENTVALID

Today vbus-enet has a rx-thread and a tx-thread at least partially
because I need process-context to do the copy_to_user(other->mm) type
stuff (and we will need similar for our virtio-net backend as well). I
also utilize irqfd to do interrupt injection. Now, since I know that I
have kthread_context, I could modify vbus-enet to inject interrupts with
EVENTFD_SIGNAL_CANSLEEP set, and this is fine. I know that is safe.

However, the problem is above that! I would like to optimize out the
tx-thread to begin with with a similar "can_sleep()" pattern whenever
possible.

For instance, if the netif stack calls me to do a transmit and it
happens to be in a sleepable context, it would be nice to just skip
waking up the tx-thread. I would therefore just do the
copy_to_user(other->mm) + irqfd directly in the netif callback, thereby
skipping the context switch.

So the problem is a pattern that I would like to address generally
outside of just eventfd: "can I be sleepy"? If yes, do it now, if not
defer it.

So if the stack calls me in a preemptible state, I would like to detect
that and optimize my deferment mechanisms away. This isn't something
that happens readily today given the way the stacks locking and
softirq-net work, but its moving in that direction (for instance,
threaded softirqs).

This is why I am pushing for a run-time detection mechanism (like
can_sleep()), as opposed to something in the eventfd interface level
(like EVENTFD_SIGNAL_CANSLEEP). I think at least the CURRENTVALID idea
is a great one, I just don't have a current need for it. That said, I
do not have a problem with modifing eventfd to provide such information
if Davide et. al. ack it as well.

Does this all make sense?

-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-16 16:26:19

by Davide Libenzi

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Tue, 16 Jun 2009, Gregory Haskins wrote:

> Does this all make sense?

This conversation has been *really* long, and I haven't had time to look
at the patch yet. But looking at the amount of changes, and the amount of
even more changes talked in this thread, there's a very slim chance that
I'll ACK the eventfd code.
You may want to consider a solution that does not litter eventfd code that
much.


- Davide

2009-06-16 17:01:27

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Davide Libenzi wrote:
> On Tue, 16 Jun 2009, Gregory Haskins wrote:
>
>
>> Does this all make sense?
>>
>
> This conversation has been *really* long, and I haven't had time to look
> at the patch yet. But looking at the amount of changes, and the amount of
> even more changes talked in this thread, there's a very slim chance that
> I'll ACK the eventfd code.
> You may want to consider a solution that does not litter eventfd code that
> much.
>
>
> - Davide
>
>
>
Hi Davide,

I understand your position and value your time/insight into looking at
this things.

Despite the current ongoing discussion, I still stand that the current
patch is my proposed solution (though I have yet to convince Michael).
But in any case, if you have the time, please look it over because I
still think its the right direction to head in.

The general solution is that we use an srcu list instead of the
wait-queue, and thats really it. If we can't eliminate that spinlock
held over the notification, it has usability implications at least for
irqfd/iosignalfd. The only way I can think of to solve the problem
without modifying eventfd is to not use eventfd at all. :(

Since using eventfd really captures the concept we are going for here
really well, reusing it has a ton of advantages including interface
compatibility and, of course, code-reuse of a tested/debugged code
base. Heck, we may hopefully even improve eventfd for other users in
doing this work. It would therefore be a shame to walk away from it if
it can be avoided.

So if what I proposed is not acceptable but you are willing to work with
me to find a solution that is, that would be ideal from my perspective.
Otherwise, I apologize for all the noise. You have been quite the
patient and helpful gentleman with me to date and I appreciate that.

Kind Regards,
-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-16 17:54:54

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Tue, Jun 16, 2009 at 12:17:22PM -0400, Gregory Haskins wrote:
> > Maybe I misunderstand what iosignalfd is - isn't it true that you get eventfd
> > and kvm will signal that when guest performs io write to a specific
> > address, and then drivers can get eventfd and poll it to detect
> > these writes?
> >
>
> Correct.
>
> > If yes, you have no way to know that the other end of eventfd
> > is connected to kvm, so you don't know you can access current->mm.
> >
>
> Well, my intended use for them *does* know its connected to KVM.

How does it know? You get a file descriptor and you even figure out it's an
eventfd. Now what? Any process can write there.

> Perhaps there will be others that do not in the future, but like I said
> it could be addressed as those actually arise.
>
>
> >
> >> So since I cannot use it accurately for the hardirq/threaded-irq type
> >> case, and I don't actually need it for the iosignalfd case, I am not
> >> sure its the right direction (at least for now). I do think it might
> >> have merit for syscal/vmexit uses outside of iosignalfd, but I do not
> >> see a current use-case for it so perhaps it can wait until one arises.
> >>
> >> -Greg
> >>
> >
> > But, it addresses the CONFIG_PREEMPT off case, which your design doesn't
> > seem to.
> >
>
> Well, the problem is that it only addresses it partially in a limited
> set of circumstances, and doesn't address the broader problem. I'll
> give you an example:
>
> (First off, lets assume that we are not going to have
> eventfd_signal_task(), but rather eventfd_signal() with two option
> flags: EVENTFD_SIGNAL_CANSLEEP, and EVENTFD_SIGNAL_CURRENTVALID
>
> Today vbus-enet has a rx-thread and a tx-thread at least partially
> because I need process-context to do the copy_to_user(other->mm) type
> stuff (and we will need similar for our virtio-net backend as well). I
> also utilize irqfd to do interrupt injection. Now, since I know that I
> have kthread_context, I could modify vbus-enet to inject interrupts with
> EVENTFD_SIGNAL_CANSLEEP set, and this is fine. I know that is safe.
>
> However, the problem is above that! I would like to optimize out the
> tx-thread to begin with with a similar "can_sleep()" pattern whenever
> possible.
>
> For instance, if the netif stack calls me to do a transmit and it
> happens to be in a sleepable context, it would be nice to just skip
> waking up the tx-thread. I would therefore just do the
> copy_to_user(other->mm) + irqfd directly in the netif callback, thereby
> skipping the context switch.

What do you mean by copy_to_user(other->mm) here? If you are going to switch
to another mm, then I think current->mm must be valid (I think it's not enough
that you can sleep). So preemptible() might not be enough.

--
MST

2009-06-16 18:09:56

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Michael S. Tsirkin wrote:
> On Tue, Jun 16, 2009 at 12:17:22PM -0400, Gregory Haskins wrote:
>
>>> Maybe I misunderstand what iosignalfd is - isn't it true that you get eventfd
>>> and kvm will signal that when guest performs io write to a specific
>>> address, and then drivers can get eventfd and poll it to detect
>>> these writes?
>>>
>>>
>> Correct.
>>
>>
>>> If yes, you have no way to know that the other end of eventfd
>>> is connected to kvm, so you don't know you can access current->mm.
>>>
>>>
>> Well, my intended use for them *does* know its connected to KVM.
>>
>
> How does it know? You get a file descriptor and you even figure out it's an
> eventfd. Now what? Any process can write there.
>

Well, I suppose that is true. Unlike my current hypercall coupling
deployed in vbus v3, anyone can signal the event for the iosignalfd.
However, it wouldn't matter other than looking like an errant signal as
I do not use "current" for anything. (e.g. all the virtio pointers are
stored independently to the iosignalfd)
>
>> Perhaps there will be others that do not in the future, but like I said
>> it could be addressed as those actually arise.
>>
>>
>>
>>>
>>>
>>>> So since I cannot use it accurately for the hardirq/threaded-irq type
>>>> case, and I don't actually need it for the iosignalfd case, I am not
>>>> sure its the right direction (at least for now). I do think it might
>>>> have merit for syscal/vmexit uses outside of iosignalfd, but I do not
>>>> see a current use-case for it so perhaps it can wait until one arises.
>>>>
>>>> -Greg
>>>>
>>>>
>>> But, it addresses the CONFIG_PREEMPT off case, which your design doesn't
>>> seem to.
>>>
>>>
>> Well, the problem is that it only addresses it partially in a limited
>> set of circumstances, and doesn't address the broader problem. I'll
>> give you an example:
>>
>> (First off, lets assume that we are not going to have
>> eventfd_signal_task(), but rather eventfd_signal() with two option
>> flags: EVENTFD_SIGNAL_CANSLEEP, and EVENTFD_SIGNAL_CURRENTVALID
>>
>> Today vbus-enet has a rx-thread and a tx-thread at least partially
>> because I need process-context to do the copy_to_user(other->mm) type
>> stuff (and we will need similar for our virtio-net backend as well). I
>> also utilize irqfd to do interrupt injection. Now, since I know that I
>> have kthread_context, I could modify vbus-enet to inject interrupts with
>> EVENTFD_SIGNAL_CANSLEEP set, and this is fine. I know that is safe.
>>
>> However, the problem is above that! I would like to optimize out the
>> tx-thread to begin with with a similar "can_sleep()" pattern whenever
>> possible.
>>
>> For instance, if the netif stack calls me to do a transmit and it
>> happens to be in a sleepable context, it would be nice to just skip
>> waking up the tx-thread. I would therefore just do the
>> copy_to_user(other->mm) + irqfd directly in the netif callback, thereby
>> skipping the context switch.
>>
>
> What do you mean by copy_to_user(other->mm) here? If you are going to switch
> to another mm, then I think current->mm must be valid (I think it's not enough
> that you can sleep). So preemptible() might not be enough.
>

I dont currently use switch_mm, if that is what you mean. I save the
task_struct into the appropriate context so current->mm doesn't matter
to me. I never use it. All I need (afaik) is to acquire the proper
mutex first. I am not an MM expert, so perhaps I have this wrong but it
does appear to work properly even from kthread context.

-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-17 14:46:26

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Tue, Jun 16, 2009 at 02:09:38PM -0400, Gregory Haskins wrote:
> > What do you mean by copy_to_user(other->mm) here? If you are going to switch
> > to another mm, then I think current->mm must be valid (I think it's not enough
> > that you can sleep). So preemptible() might not be enough.
> >
>
> I dont currently use switch_mm, if that is what you mean. I save the
> task_struct into the appropriate context so current->mm doesn't matter
> to me. I never use it. All I need (afaik) is to acquire the proper
> mutex first. I am not an MM expert, so perhaps I have this wrong but it
> does appear to work properly even from kthread context.
>
> -Greg
>
>

I think I saw get_user_pages + memcpy in your patch. Yes, that works
without switching contexts but it's slower than copy to user if you are
in the right context, and AFAIK it's slower than get_user_pages_fast.

--
MST

2009-06-17 15:02:21

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Michael S. Tsirkin wrote:
> On Tue, Jun 16, 2009 at 02:09:38PM -0400, Gregory Haskins wrote:
>
>>> What do you mean by copy_to_user(other->mm) here? If you are going to switch
>>> to another mm, then I think current->mm must be valid (I think it's not enough
>>> that you can sleep). So preemptible() might not be enough.
>>>
>>>
>> I dont currently use switch_mm, if that is what you mean. I save the
>> task_struct into the appropriate context so current->mm doesn't matter
>> to me. I never use it. All I need (afaik) is to acquire the proper
>> mutex first. I am not an MM expert, so perhaps I have this wrong but it
>> does appear to work properly even from kthread context.
>>
>> -Greg
>>
>>
>>
>
> I think I saw get_user_pages + memcpy in your patch. Yes, that works
> without switching contexts but it's slower than copy to user if you are
> in the right context, and AFAIK it's slower than get_user_pages_fast.
>
>
Yeah, understood. It will definitely be interesting to try that
optimization with switch_mm that you suggested.

On that front, would "if (p == current)" still work even if we don't
have the "signal_task()" hint?


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-17 16:26:23

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Wed, Jun 17, 2009 at 11:02:06AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Tue, Jun 16, 2009 at 02:09:38PM -0400, Gregory Haskins wrote:
> >
> >>> What do you mean by copy_to_user(other->mm) here? If you are going to switch
> >>> to another mm, then I think current->mm must be valid (I think it's not enough
> >>> that you can sleep). So preemptible() might not be enough.
> >>>
> >>>
> >> I dont currently use switch_mm, if that is what you mean. I save the
> >> task_struct into the appropriate context so current->mm doesn't matter
> >> to me. I never use it. All I need (afaik) is to acquire the proper
> >> mutex first. I am not an MM expert, so perhaps I have this wrong but it
> >> does appear to work properly even from kthread context.
> >>
> >> -Greg
> >>
> >>
> >>
> >
> > I think I saw get_user_pages + memcpy in your patch. Yes, that works
> > without switching contexts but it's slower than copy to user if you are
> > in the right context, and AFAIK it's slower than get_user_pages_fast.
> >
> >
> Yeah, understood. It will definitely be interesting to try that
> optimization with switch_mm that you suggested.

BTW, I'm kind of confused - in your patch you do get_task_struct: does this
guarantee that mm is not going aways?

> On that front, would "if (p == current)" still work even if we don't
> have the "signal_task()" hint?
>

Donnu.

--
MST

2009-06-17 16:41:43

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Michael S. Tsirkin wrote:
> On Wed, Jun 17, 2009 at 11:02:06AM -0400, Gregory Haskins wrote:
>
>> Michael S. Tsirkin wrote:
>>
>>> On Tue, Jun 16, 2009 at 02:09:38PM -0400, Gregory Haskins wrote:
>>>
>>>
>>>>> What do you mean by copy_to_user(other->mm) here? If you are going to switch
>>>>> to another mm, then I think current->mm must be valid (I think it's not enough
>>>>> that you can sleep). So preemptible() might not be enough.
>>>>>
>>>>>
>>>>>
>>>> I dont currently use switch_mm, if that is what you mean. I save the
>>>> task_struct into the appropriate context so current->mm doesn't matter
>>>> to me. I never use it. All I need (afaik) is to acquire the proper
>>>> mutex first. I am not an MM expert, so perhaps I have this wrong but it
>>>> does appear to work properly even from kthread context.
>>>>
>>>> -Greg
>>>>
>>>>
>>>>
>>>>
>>> I think I saw get_user_pages + memcpy in your patch. Yes, that works
>>> without switching contexts but it's slower than copy to user if you are
>>> in the right context, and AFAIK it's slower than get_user_pages_fast.
>>>
>>>
>>>
>> Yeah, understood. It will definitely be interesting to try that
>> optimization with switch_mm that you suggested.
>>
>
> BTW, I'm kind of confused - in your patch you do get_task_struct: does this
> guarantee that mm is not going aways?
>

Well, again, I am not an expert on MM, but I would think so. If p holds
a reference to p->mm (which I would think it would have to), and we hold
a reference to p, p->mm should be indirectly stable coincident with the
lifetime of our p reference. OTOH, I have not actually looked at
whether p actually takes a p->mm reference or not via inspection, so
this is just an educated guess at this point. ;)

I guess if this is broken, the solution is simple: Modify the code to
take an mm reference as well.

-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-17 16:45:18

by Davide Libenzi

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Tue, 16 Jun 2009, Gregory Haskins wrote:

> Davide Libenzi wrote:
> > On Tue, 16 Jun 2009, Gregory Haskins wrote:
> >
> >
> >> Does this all make sense?
> >>
> >
> > This conversation has been *really* long, and I haven't had time to look
> > at the patch yet. But looking at the amount of changes, and the amount of
> > even more changes talked in this thread, there's a very slim chance that
> > I'll ACK the eventfd code.
> > You may want to consider a solution that does not litter eventfd code that
> > much.
> >
> >
> > - Davide
> >
> >
> >
> Hi Davide,
>
> I understand your position and value your time/insight into looking at
> this things.
>
> Despite the current ongoing discussion, I still stand that the current
> patch is my proposed solution (though I have yet to convince Michael).
> But in any case, if you have the time, please look it over because I
> still think its the right direction to head in.

I don't think so. You basically upload a bunch of stuff it could have been
inside your irqfd into eventfd. Now the eventfd_signal() can magically
sleep, or not, depending on what the signal functions do. This makes up a
pretty aweful interface if you ask me.
A lot simpler and cleaner if eventfd_signal(), like all the wake up
functions inside the kernel, can be called from atomic context. Always,
not sometimes.



- Davide

2009-06-17 17:29:19

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Hi Davide,

Davide Libenzi wrote:
> On Tue, 16 Jun 2009, Gregory Haskins wrote:
>
>
>> Davide Libenzi wrote:
>>
>>> On Tue, 16 Jun 2009, Gregory Haskins wrote:
>>>
>>>
>>>
>>>> Does this all make sense?
>>>>
>>>>
>>> This conversation has been *really* long, and I haven't had time to look
>>> at the patch yet. But looking at the amount of changes, and the amount of
>>> even more changes talked in this thread, there's a very slim chance that
>>> I'll ACK the eventfd code.
>>> You may want to consider a solution that does not litter eventfd code that
>>> much.
>>>
>>>
>>> - Davide
>>>
>>>
>>>
>>>
>> Hi Davide,
>>
>> I understand your position and value your time/insight into looking at
>> this things.
>>
>> Despite the current ongoing discussion, I still stand that the current
>> patch is my proposed solution (though I have yet to convince Michael).
>> But in any case, if you have the time, please look it over because I
>> still think its the right direction to head in.
>>
>
> I don't think so. You basically upload a bunch of stuff it could have been
> inside your irqfd into eventfd.

Can you elaborate? I currently do not see how I could do the proposed
concept inside of irqfd while still using eventfd. Of course, that
would be possible if we fork irqfd from eventfd, and perhaps this is
what you are proposing. As previously stated I don't want to give up on
the prospect of re-using it quite yet, so bear with me. :)

The issue with eventfd, as I see it, is that eventfd uses a
spin_lock_irqsave (by virtue of the wait-queue stuff) across the
"signal" callback (which today is implemented as a wake-up). This
spin_lock implicitly creates a non-preemptible critical section that
occurs independently of whether eventfd_signal() itself is invoked from
a sleepable context or not.

What I strive to achieve is to remove the creation of this internal
critical section. If eventfd_signal() is called from atomic context, so
be it. We will detect this in the callback and be forced to take the
slow-path, and I am ok with that. *But*, if eventfd_signal() (or
f_ops->write(), for that matter) are called from a sleepable context
*and* eventfd doesn't introduce its own critical section (such as with
my srcu patch), we can potentially optimize within the callback by
executing serially instead of deferring (e.g. via a workqueue).

(Note: The issue of changing eventfd_signal interface is a separate
tangent that Michael proposed, and is not something I am advocating. In
the current proposal, eventfd_signal() retains its exact semantics as it
has in mainline).

> Now the eventfd_signal() can magically
> sleep, or not, depending on what the signal functions do. This makes up a
> pretty aweful interface if you ask me.
> A lot simpler and cleaner if eventfd_signal(), like all the wake up
> functions inside the kernel, can be called from atomic context. Always,
> not sometimes.
>

It can! :) This is not changing from whats in mainline today (covered
above).

Thanks Davide,
-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-17 17:50:52

by Davide Libenzi

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Wed, 17 Jun 2009, Gregory Haskins wrote:

> Can you elaborate? I currently do not see how I could do the proposed
> concept inside of irqfd while still using eventfd. Of course, that
> would be possible if we fork irqfd from eventfd, and perhaps this is
> what you are proposing. As previously stated I don't want to give up on
> the prospect of re-using it quite yet, so bear with me. :)
>
> The issue with eventfd, as I see it, is that eventfd uses a
> spin_lock_irqsave (by virtue of the wait-queue stuff) across the
> "signal" callback (which today is implemented as a wake-up). This
> spin_lock implicitly creates a non-preemptible critical section that
> occurs independently of whether eventfd_signal() itself is invoked from
> a sleepable context or not.
>
> What I strive to achieve is to remove the creation of this internal
> critical section. If eventfd_signal() is called from atomic context, so
> be it. We will detect this in the callback and be forced to take the
> slow-path, and I am ok with that. *But*, if eventfd_signal() (or
> f_ops->write(), for that matter) are called from a sleepable context
> *and* eventfd doesn't introduce its own critical section (such as with
> my srcu patch), we can potentially optimize within the callback by
> executing serially instead of deferring (e.g. via a workqueue).

Since when the scheduling (assuming it's not permanently running on
another core due to high frequency work post) of a kernel thread is such
a big impact that interfaces need to be redesigned for that?
How much the (possible, but not certain) kernel thread context switch time
weighs in the overall KVM IRQ service time?



> It can! :) This is not changing from whats in mainline today (covered
> above).

It can/could, if the signal() function takes very accurate care of doing
the magic atomic check.



- Davide

2009-06-17 19:18:01

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Davide Libenzi wrote:
> On Wed, 17 Jun 2009, Gregory Haskins wrote:
>
>
>> Can you elaborate? I currently do not see how I could do the proposed
>> concept inside of irqfd while still using eventfd. Of course, that
>> would be possible if we fork irqfd from eventfd, and perhaps this is
>> what you are proposing. As previously stated I don't want to give up on
>> the prospect of re-using it quite yet, so bear with me. :)
>>
>> The issue with eventfd, as I see it, is that eventfd uses a
>> spin_lock_irqsave (by virtue of the wait-queue stuff) across the
>> "signal" callback (which today is implemented as a wake-up). This
>> spin_lock implicitly creates a non-preemptible critical section that
>> occurs independently of whether eventfd_signal() itself is invoked from
>> a sleepable context or not.
>>
>> What I strive to achieve is to remove the creation of this internal
>> critical section. If eventfd_signal() is called from atomic context, so
>> be it. We will detect this in the callback and be forced to take the
>> slow-path, and I am ok with that. *But*, if eventfd_signal() (or
>> f_ops->write(), for that matter) are called from a sleepable context
>> *and* eventfd doesn't introduce its own critical section (such as with
>> my srcu patch), we can potentially optimize within the callback by
>> executing serially instead of deferring (e.g. via a workqueue).
>>
>
> Since when the scheduling (assuming it's not permanently running on
> another core due to high frequency work post) of a kernel thread is such
> a big impact that interfaces need to be redesigned for that?
>

Low-latency applications, for instance, care about this. Even one
context switch can add a substantial overhead.

> How much the (possible, but not certain) kernel thread context switch time
> weighs in the overall KVM IRQ service time?
>

Generally each one is costing me about 7us on average. For something
like high-speed networking, we have a path that has about 30us of
base-line overhead. So one additional ctx-switch puts me at base+7 ( =
~37us), two puts me in base+2*7 (= ~44us). So in that context (no pun
intended ;), it hurts quite a bit. I'll be the first to admit that not
everyone (most?) will care about latency, though. But FWIW, I do.

>
>
>
>> It can! :) This is not changing from whats in mainline today (covered
>> above).
>>
>
> It can/could, if the signal() function takes very accurate care of doing
> the magic atomic check.
>

True, but thats the notifiee's burden, not eventfd's. And its always
going to be opt-in. Even today, someone is free to either try to sleep
(which will oops on the might_sleep()), or try to check if they can
sleep (it will always say they can't due to the eventfd wqh spinlock).
The only thing that changes with my patch is that someone that opts in
to check if they can sleep may find that they sometimes (mostly?) can.

In any case, I suspect that there are no other clients of eventfd other
than standard wait-queue sleepers, and irqfd. The wake_up() code is
never going to care anyway, so this really comes down to future users of
the notification interfaces (irqfd today, iosignalfd in the near-term).
Therefore, there really aren't any legacy issues to deal with that I am
aware of, even if they mattered.

Thanks Davide,
-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-17 19:56:27

by Davide Libenzi

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Wed, 17 Jun 2009, Gregory Haskins wrote:

> Davide Libenzi wrote:
>
> > How much the (possible, but not certain) kernel thread context switch time
> > weighs in the overall KVM IRQ service time?
> >
>
> Generally each one is costing me about 7us on average. For something
> like high-speed networking, we have a path that has about 30us of
> base-line overhead. So one additional ctx-switch puts me at base+7 ( =
> ~37us), two puts me in base+2*7 (= ~44us). So in that context (no pun
> intended ;), it hurts quite a bit. I'll be the first to admit that not
> everyone (most?) will care about latency, though. But FWIW, I do.

And how a frame reception is handled in Linux nowadays?



> True, but thats the notifiee's burden, not eventfd's. And its always
> going to be opt-in. Even today, someone is free to either try to sleep
> (which will oops on the might_sleep()), ...

No, today you just can't sleep. As you can't sleep in any
callback-registered wakeups, like epoll, for example.



- Davide

2009-06-17 21:48:26

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Davide Libenzi wrote:
> On Wed, 17 Jun 2009, Gregory Haskins wrote:
>
>
>> Davide Libenzi wrote:
>>
>>
>>> How much the (possible, but not certain) kernel thread context switch time
>>> weighs in the overall KVM IRQ service time?
>>>
>>>
>> Generally each one is costing me about 7us on average. For something
>> like high-speed networking, we have a path that has about 30us of
>> base-line overhead. So one additional ctx-switch puts me at base+7 ( =
>> ~37us), two puts me in base+2*7 (= ~44us). So in that context (no pun
>> intended ;), it hurts quite a bit. I'll be the first to admit that not
>> everyone (most?) will care about latency, though. But FWIW, I do.
>>
>
> And how a frame reception is handled in Linux nowadays?
>

I am not clear on what you are asking here. So in case this was a
sincere question on how things work, here is a highlight of the typical
flow of a packet that ingresses on a physical adapter and routes to KVM
via vbus.

a) interrupt from eth to host wakes the cpu out of idle, enters
interrupt-context.
b) ISR runs, disables interrupts on the eth, schedules softirq-net-rx
c) ISR completes, kernel checks softirq state before IRET, runs pending
softirq-net-rx in interrupt context to NAPI-poll the ethernet
d) packet is pulled out of eth into a layer-2 bridge, and switched to
the appropriate switch-port (which happens to be a venet-tap (soon to be
virtio-net based) device. The packet is queued to the tap as an xmit
request, and the tap's tx-thread is awoken.
e) the xmit call returns, the napi-poll completes, and the
softirq-net-rx terminates. The kernel does an IRET to exit interrupt
context.
f) the scheduler runs and sees the tap's tx-thread is ready to run,
schedules it in.
g) the tx-thread awakens, dequeues the posted skb, copies it to the
virtio-ring, and finally raises an interrupt on irqfd with eventfd_signal().

At this point, all of the data has been posted to the virtio-ring in
shared memory between the host and guest. All that is left is to inject
the interrupt so the guest knows to process the ring. We call the
eventfd_signal() from kthread context. However, the callback to inject
the interrupt is invoked with the wqh spinlock held so we are forced to
defer the interrupt injection to a workqueue so the kvm->lock can be
safely acquired. This adds additional latency (~7us) in a path that is
only a handful of microseconds to being with. I can post LTTV
screenshots, if it would be helpful to visualize this.

The reasons we need the workqueue is:

A) kvm is currently implemented with a mutex for IRQ injection, so its
sleepy
B) the wqh used in eventfd wake_up() creates an non-preemptible section
when the callback is invoked.

We can (and should) address (A), but this is but one example in a common
pattern. We will run into similar issues in other places (such as with
iosignalfd), so I would like to address (B) as well. My patch attempts
to do that by re-implementing the callback mechanism as something other
than wait-queue based. It then adds a wait-queue as a default client of
the new interface.

Therefore, the eventfd clients that want traditional vanilla wakeups can
go right ahead and use the non-preemptible wait-queue methods as they
always have. However, the clients that want (potentially) preemptive
callbacks can use the new interface: eventfd_notifier_register(), armed
with the knowledge that it can only sleep if the eventfd_signal() caller
was not in-atomic at the time. Thus the dynamic state check ala
preemptible(). Without the check, the client should assume it is not
preemptible, as always.
>
>
>
>> True, but thats the notifiee's burden, not eventfd's. And its always
>> going to be opt-in. Even today, someone is free to either try to sleep
>> (which will oops on the might_sleep()), ...
>>
>
> No, today you just can't sleep. As you can't sleep in any
> callback-registered wakeups, like epoll, for example.
>

Right, understood, and note that this is precisely why I said it would
oops. What I was specifically trying to highlight is that its not like
this change imposes new requirements on the existing callbacks. I also
wanted to highlight that its not really eventfd's concern what the
callback tries to do, per se (if it wants to sleep it can try, it just
wont work). Any reasonable wakeup callback in existence would already
assume it cannot sleep, and they wouldn't even try to begin with.

On the other hand, what I am introducing here (specifically to eventfd
callbacks, not wait-queues [*]) is the possibility of removing this
restriction under the proper circumstances. It would only be apparent
of this change iff the callback in question tried to test for this (e.g.
checking preemptible()) state. It is thus opt-in, and existing code
does not break.

Thanks Davide,
-Greg

[*]: I contemplated solving this problem generally at the wait-queue
level, but quickly abandoned the idea. The reason is that something
like *RCU has the properties of being particularly lightweight in the
fast/common/read-path, but being particularly heavyweight in the
slow/uncommon/writer-path. Contrast that with something like a
traditional lock/mutex which is middleweight in all circumstances.

Something like a wait-queue has a heavy dose of both "writers" (tasks
being added to the queue) as well as "readers" (walking the queue to
invoke callbacks). RCU would be particularly painful here. Contrast
this to the way I wrote the patch for eventfd: The "writer" path is
pretty rare ("add me to be notified" generally happens once at eventfd
init), whereas the "reader" path ("tell me when someone signals") is
fairly common. RCU is a good candidate here to get rid of the spinlock
held during the walk, while still maintaining thread-saftey w.r.t. the
notifier list. Therefore I think updating eventfd makes more sense than
the more general case of wait-queues. More importantly, I do not think
this compromises the integrity of eventfd. Its applicable for a general
signaling idiom, IMHO, and its why I am comfortable pushing the patch
out to you.





Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-17 23:28:33

by Davide Libenzi

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Wed, 17 Jun 2009, Gregory Haskins wrote:

> I am not clear on what you are asking here. So in case this was a
> sincere question on how things work, here is a highlight of the typical
> flow of a packet that ingresses on a physical adapter and routes to KVM
> via vbus.
>
> a) interrupt from eth to host wakes the cpu out of idle, enters
> interrupt-context.
> b) ISR runs, disables interrupts on the eth, schedules softirq-net-rx
> c) ISR completes, kernel checks softirq state before IRET, runs pending
> softirq-net-rx in interrupt context to NAPI-poll the ethernet
> d) packet is pulled out of eth into a layer-2 bridge, and switched to
> the appropriate switch-port (which happens to be a venet-tap (soon to be
> virtio-net based) device. The packet is queued to the tap as an xmit
> request, and the tap's tx-thread is awoken.
> e) the xmit call returns, the napi-poll completes, and the
> softirq-net-rx terminates. The kernel does an IRET to exit interrupt
> context.
> f) the scheduler runs and sees the tap's tx-thread is ready to run,
> schedules it in.
> g) the tx-thread awakens, dequeues the posted skb, copies it to the
> virtio-ring, and finally raises an interrupt on irqfd with eventfd_signal().

Heh, Gregory, this isn't a job interview. You didn't have to actually
detail everything ;) Glad you did though, so we've something to talk
later.



> At this point, all of the data has been posted to the virtio-ring in
> shared memory between the host and guest. All that is left is to inject
> the interrupt so the guest knows to process the ring. We call the
> eventfd_signal() from kthread context. However, the callback to inject
> the interrupt is invoked with the wqh spinlock held so we are forced to
> defer the interrupt injection to a workqueue so the kvm->lock can be
> safely acquired. This adds additional latency (~7us) in a path that is
> only a handful of microseconds to being with. I can post LTTV
> screenshots, if it would be helpful to visualize this.

So, what you're trying to say, is that the extra wakeup required by your
schedule_work() processing, makes actually a difference in the time it
takes to go from a) to g), where there are at least two other kernel
thread wakeups?
Don't think in terms of microbenchs, think in how much are those 7us (are
they? really? this is a sync, on-cpu, kernel thread, wake up) are
impacting the whole path.
Everything looks worthwhile in microbenches.




> Right, understood, and note that this is precisely why I said it would
> oops. What I was specifically trying to highlight is that its not like
> this change imposes new requirements on the existing callbacks. I also
> wanted to highlight that its not really eventfd's concern what the
> callback tries to do, per se (if it wants to sleep it can try, it just
> wont work). Any reasonable wakeup callback in existence would already
> assume it cannot sleep, and they wouldn't even try to begin with.
>
> On the other hand, what I am introducing here (specifically to eventfd
> callbacks, not wait-queues [*]) is the possibility of removing this
> restriction under the proper circumstances. It would only be apparent
> of this change iff the callback in question tried to test for this (e.g.
> checking preemptible()) state. It is thus opt-in, and existing code
> does not break.

The interface is just ugly IMO. You have eventfd_signal() that can sleep,
or not, depending on the registered ->signal() function implementations.
This is pretty bad, a lot worse than the theoretical us spent in the
schedule_work() processing.



- Davide

2009-06-18 06:23:40

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Wed, Jun 17, 2009 at 04:21:19PM -0700, Davide Libenzi wrote:
> The interface is just ugly IMO. You have eventfd_signal() that can sleep,
> or not, depending on the registered ->signal() function implementations.
> This is pretty bad, a lot worse than the theoretical us spent in the
> schedule_work() processing.

I agree. How about the idea of introducing eventfd_signal_from_task
which can sleep? Does this sound same?

--
MST

2009-06-18 09:03:42

by Avi Kivity

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On 06/16/2009 05:40 PM, Gregory Haskins wrote:
>
> Something else to consider: For iosignalfd, we can assume we will
> always be called from vcpu process context so we might not really need
> official affirmation from the system. For irqfd, we cannot predict who
> may be injecting the interrupt (for instance, it might be a
> PCI-passthrough hard-irq context). I am not sure if this knowledge
> actually helps to simplify the problem space, but I thought I should
> mention it.
>

We can't assume anything. Userspace might hand out that eventfd to anyone.

--
error compiling committee.c: too many arguments to function

2009-06-18 11:44:19

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Avi Kivity wrote:
> On 06/16/2009 05:40 PM, Gregory Haskins wrote:
>>
>> Something else to consider: For iosignalfd, we can assume we will
>> always be called from vcpu process context so we might not really need
>> official affirmation from the system. For irqfd, we cannot predict who
>> may be injecting the interrupt (for instance, it might be a
>> PCI-passthrough hard-irq context). I am not sure if this knowledge
>> actually helps to simplify the problem space, but I thought I should
>> mention it.
>>
>
> We can't assume anything. Userspace might hand out that eventfd to
> anyone.

Yeah agreed, and I misspoke (mistyped? ;). It's not that I assume its
from a vcpu. Its that it doesn't matter (at least for vbus).

The reason is that the memory context is established in vbus via a
different channel (i.e. I never look at "current" in the context of the
eventfd callback). The eventfd is just telling me that the previously
established memory context needs to be looked at (e.g. the virtio-ring
changed). Therefore, the origin of the signal doesn't actually matter
(at least w.r.t. safe handling of the request).

Obviously, the question of whether something has really changed in the
memory and whether an errant signal could cause problems is a separate
issue. But we need to be robust against that anyway, for a multitude of
other reasons (e.g. malicious/broken guest scribbling over the
virtio-ring, etc).

-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-18 14:01:28

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Davide Libenzi wrote:
> On Wed, 17 Jun 2009, Gregory Haskins wrote:
>
>
>> I am not clear on what you are asking here. So in case this was a
>> sincere question on how things work, here is a highlight of the typical
>> flow of a packet that ingresses on a physical adapter and routes to KVM
>> via vbus.
>>
>> a) interrupt from eth to host wakes the cpu out of idle, enters
>> interrupt-context.
>> b) ISR runs, disables interrupts on the eth, schedules softirq-net-rx
>> c) ISR completes, kernel checks softirq state before IRET, runs pending
>> softirq-net-rx in interrupt context to NAPI-poll the ethernet
>> d) packet is pulled out of eth into a layer-2 bridge, and switched to
>> the appropriate switch-port (which happens to be a venet-tap (soon to be
>> virtio-net based) device. The packet is queued to the tap as an xmit
>> request, and the tap's tx-thread is awoken.
>> e) the xmit call returns, the napi-poll completes, and the
>> softirq-net-rx terminates. The kernel does an IRET to exit interrupt
>> context.
>> f) the scheduler runs and sees the tap's tx-thread is ready to run,
>> schedules it in.
>> g) the tx-thread awakens, dequeues the posted skb, copies it to the
>> virtio-ring, and finally raises an interrupt on irqfd with eventfd_signal().
>>
>
> Heh, Gregory, this isn't a job interview. You didn't have to actually
> detail everything ;) Glad you did though, so we've something to talk
> later.
>

:)

>
>
>
>> At this point, all of the data has been posted to the virtio-ring in
>> shared memory between the host and guest. All that is left is to inject
>> the interrupt so the guest knows to process the ring. We call the
>> eventfd_signal() from kthread context. However, the callback to inject
>> the interrupt is invoked with the wqh spinlock held so we are forced to
>> defer the interrupt injection to a workqueue so the kvm->lock can be
>> safely acquired. This adds additional latency (~7us) in a path that is
>> only a handful of microseconds to being with. I can post LTTV
>> screenshots, if it would be helpful to visualize this.
>>
>
> So, what you're trying to say, is that the extra wakeup required by your
> schedule_work() processing, makes actually a difference in the time it
> takes to go from a) to g),

Yes

> where there are at least two other kernel
> thread wakeups?
>
Actually there is only one (the tx-thread) aside from the eventfd
imposed workqueue one. Incidentally, I would love to get rid of the
other thread too, so I am not just picking on eventfd ;). The other is
a lot harder since it has to update the virtio-ring and may need to page
in guest memory to do so.


> Don't think in terms of microbenchs,

I'm not. This goes directly to end-application visible performance.
These types of bottlenecks directly affect the IOPS rate in quantifiable
ways of the subsystem in question (and once I re-stablize the vbus tree
against the latest kvm/*fd code, I will post some numbers). This is
particularly true for request-response type operations where stack
latency is the key gating factor. Consider things like high-performance
shared-nothing clusters to a ramdisk (yes, these exist and have
relevance). The overhead of the subsystem can be very low, and is
usually gated by the transaction throughput, which is gated by the IOPS
rate of the medium. A 7us bump when we are only talking about dozens of
microseconds overall is a huge percentage. Other examples might be
high-resolution clock sync (ala ptpd) or FSI trading applications.

The ultimate goal here is to get something like a KVM guest on par with
baremetal to allow the convergence of the HPC space with the
data-center/cloud trend of utilizing VMs as the compute fabric.
Baremetal doesn't have a similar context switch in its stack, and these
little microsecond bumps here and there are the reason why we are not
quite at baremetal levels today with KVM. Therefore, I am working
through trying to eliminate them.

To flip it around on you: try telling a group like the netdev guys that
they should put extra context switches into the stack because they don't
really matter. Be sure to wear extra thick asbestos. ;)

> think in how much are those 7us (are
> they? really? this is a sync, on-cpu, kernel thread, wake up) are
> impacting the whole path.
> Everything looks worthwhile in microbenches.
>
>
>
>
>
>> Right, understood, and note that this is precisely why I said it would
>> oops. What I was specifically trying to highlight is that its not like
>> this change imposes new requirements on the existing callbacks. I also
>> wanted to highlight that its not really eventfd's concern what the
>> callback tries to do, per se (if it wants to sleep it can try, it just
>> wont work). Any reasonable wakeup callback in existence would already
>> assume it cannot sleep, and they wouldn't even try to begin with.
>>
>> On the other hand, what I am introducing here (specifically to eventfd
>> callbacks, not wait-queues [*]) is the possibility of removing this
>> restriction under the proper circumstances. It would only be apparent
>> of this change iff the callback in question tried to test for this (e.g.
>> checking preemptible()) state. It is thus opt-in, and existing code
>> does not break.
>>
>
> The interface is just ugly IMO.

Well, everyone is of course entitled to an opinion, but with all due
respect I think you are looking at this backwards. ;)

> You have eventfd_signal() that can sleep,
> or not, depending on the registered ->signal() function implementations.
> This is pretty bad, a lot worse than the theoretical us spent in the
> schedule_work() processing.
>

I wouldn't describe it that way. Whether eventfd_signal() sleeps or not
even under your control as it is. Really what we have is an interface
(eventfd_signal()) that must support being invoked from atomic-context.
As an interface designer, you should never be saying "can this sleep",
but rather "what contexts is it legal to call this interface". You
have already spec'd that eventfd_signal() is atomic-safe, and I am not
proposing to change that. It is, and always will be (unless we screw
up) atomic safe.

Consider kmalloc(GFP_KERNEL), and kmalloc(GFP_ATOMIC). The former says
you must call this from process context, and the latter says you may
call it from atomic context. If you think about it, this is technically
orthogonal to whether it can sleep or not, even though people have
become accustomed to associating atomic == cant-sleep, because today
that is true (at least in mainline). As a counter example, in -rt,
atomic context *is* process context, and kmalloc(GFP_ATOMIC) can in fact
sleep. But all the code still works unmodified because -rt is still
ensuring that the interface contract is met: it works in atomic-context
(its just that atomic-context is redefined ;)

So, all that aside: I restate that eventfd should *not* care about what
its clients do, as long as they meet the requirements of the interface.
In this case, the requirement is that they need to work from atomic
context (and with my proposal, they still will). Its really a question
of should eventfd artificially create an atomic context in some kind of
attempt to enforce that? The answer, IMO, is that it shouldn't if it
doesn't have to, and there are already community accepted patterns (e.g.
RCU) for accomplishing that.

One could also make a secondary argument that holding a spinlock across
a critical section of arbitrary complexity is probably not ideal. Its
fine for quick wake_ups as you originally designed eventfd for.
However, now that we are exploring the generalization of the callback
interface beyond simple wake-ups, this should probably be re-evaluated
independent of my current requirements for low-latency.

The fact is that eventfd is a really neat general signaling idiom.
However, its currently geared towards "signaling = wakeup". As we have
proven with this KVM *fd effort, this is not necessarily accurate to
describe all use cases, nor is it optimal. I'd like to address that.
An alternative, of course, is that we make a private anon-fd solution
within KVM. However, it will be so similar to eventfd so it just seems
wasteful if it can be avoided.

Kind Regards,
-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-18 17:50:29

by Davide Libenzi

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Thu, 18 Jun 2009, Gregory Haskins wrote:

> Actually there is only one (the tx-thread) aside from the eventfd
> imposed workqueue one. Incidentally, I would love to get rid of the
> other thread too, so I am not just picking on eventfd ;). The other is
> a lot harder since it has to update the virtio-ring and may need to page
> in guest memory to do so.

No, there is the interface rx softirq too, that makes two. Plus, the
process of delivering (especially for KVM & Co.) does not involve only ctx
switching, there's other stuff in the middle too.
You also talk about latency. Really? Probably RT people aren't looking
into KVM if RT is even a mild requirement.



> To flip it around on you: try telling a group like the netdev guys that
> they should put extra context switches into the stack because they don't
> really matter. Be sure to wear extra thick asbestos. ;)

They already do. So you've got to ask yourself why they can achieve Gbps
throughput already, why can't KVM live with it and being compelled to
litter an existing interface.



> The fact is that eventfd is a really neat general signaling idiom.
> However, its currently geared towards "signaling = wakeup". As we have
> proven with this KVM *fd effort, this is not necessarily accurate to
> describe all use cases, nor is it optimal. I'd like to address that.
> An alternative, of course, is that we make a private anon-fd solution
> within KVM. However, it will be so similar to eventfd so it just seems
> wasteful if it can be avoided.

Even though you take that route, you'll have to prove with replicable real
life benchmarks, that the bloating make sense.



- Davide

2009-06-18 17:58:46

by Davide Libenzi

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Thu, 18 Jun 2009, Michael S. Tsirkin wrote:

> On Wed, Jun 17, 2009 at 04:21:19PM -0700, Davide Libenzi wrote:
> > The interface is just ugly IMO. You have eventfd_signal() that can sleep,
> > or not, depending on the registered ->signal() function implementations.
> > This is pretty bad, a lot worse than the theoretical us spent in the
> > schedule_work() processing.
>
> I agree. How about the idea of introducing eventfd_signal_from_task
> which can sleep? Does this sound same?

You're basically asking to double the size of eventfd, make the signal
path heavier, make the eventf size bigger, w/out having provided any *real
life* measurement whatsoever to build a case for it.
WAY too much stuff went in by just branding the latest coolest names as
reasons for them.
And all this to remove the wakeup of a *kernel* thread going to run in the
same CPU where the work has been scheduled.
Come back with *replicable* real life benchmarks, and then we'll see what
the best approach to address it will be.



- Davide

2009-06-18 19:04:43

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Davide Libenzi wrote:
> On Thu, 18 Jun 2009, Gregory Haskins wrote:
>
>
>> Actually there is only one (the tx-thread) aside from the eventfd
>> imposed workqueue one. Incidentally, I would love to get rid of the
>> other thread too, so I am not just picking on eventfd ;). The other is
>> a lot harder since it has to update the virtio-ring and may need to page
>> in guest memory to do so.
>>
>
> No, there is the interface rx softirq too, that makes two.

Actually, I believe you are mistaken. It normally executes the softirq
in interrupt context, not a thread.

But I digress. Lets just shelve the SRCU conversation for another day.
It was my bad for introducing it now prematurely to solve a mostly
unrelated problem: the module-reference thing. I didn't realize the
SRCU change would be so controversial, and I didn't think to split
things apart as I have done today.

But the fact is: I do not see any way to actually use your referenceless
POLLHUP release code in a race free way without doing something like I
propose in 3/4, 4/4. Lets keep the discussion focused on that for now,
if we could.

Later, after we get this thing all built, I will re-run the numbers and
post some results so that Davide may have better proof that the context
switch overhead isn't just lost in the noise. I think that is all he is
asking for anyway, which is understandable.

Regards,
-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-18 22:09:33

by Davide Libenzi

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

On Thu, 18 Jun 2009, Gregory Haskins wrote:

> Davide Libenzi wrote:
> > On Thu, 18 Jun 2009, Gregory Haskins wrote:
> >
> >
> >> Actually there is only one (the tx-thread) aside from the eventfd
> >> imposed workqueue one. Incidentally, I would love to get rid of the
> >> other thread too, so I am not just picking on eventfd ;). The other is
> >> a lot harder since it has to update the virtio-ring and may need to page
> >> in guest memory to do so.
> >>
> >
> > No, there is the interface rx softirq too, that makes two.
>
> Actually, I believe you are mistaken. It normally executes the softirq
> in interrupt context, not a thread.
>
> But I digress. Lets just shelve the SRCU conversation for another day.
> It was my bad for introducing it now prematurely to solve a mostly
> unrelated problem: the module-reference thing. I didn't realize the
> SRCU change would be so controversial, and I didn't think to split
> things apart as I have done today.
>
> But the fact is: I do not see any way to actually use your referenceless
> POLLHUP release code in a race free way without doing something like I
> propose in 3/4, 4/4. Lets keep the discussion focused on that for now,
> if we could.

OK, since I got literally swamped by the amount of talks and patches over
this theoretically simple topic, would you mind 1) posting the global
patch over eventfd 2) describe exactly what races are you talking about
3) explain why this should be any concern of eventfd at all?



- Davide

2009-06-18 22:47:41

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Davide Libenzi wrote:
> On Thu, 18 Jun 2009, Gregory Haskins wrote:
>
>
>> Davide Libenzi wrote:
>>
>>> On Thu, 18 Jun 2009, Gregory Haskins wrote:
>>>
>>>
>>>
>>>> Actually there is only one (the tx-thread) aside from the eventfd
>>>> imposed workqueue one. Incidentally, I would love to get rid of the
>>>> other thread too, so I am not just picking on eventfd ;). The other is
>>>> a lot harder since it has to update the virtio-ring and may need to page
>>>> in guest memory to do so.
>>>>
>>>>
>>> No, there is the interface rx softirq too, that makes two.
>>>
>> Actually, I believe you are mistaken. It normally executes the softirq
>> in interrupt context, not a thread.
>>
>> But I digress. Lets just shelve the SRCU conversation for another day.
>> It was my bad for introducing it now prematurely to solve a mostly
>> unrelated problem: the module-reference thing. I didn't realize the
>> SRCU change would be so controversial, and I didn't think to split
>> things apart as I have done today.
>>
>> But the fact is: I do not see any way to actually use your referenceless
>> POLLHUP release code in a race free way without doing something like I
>> propose in 3/4, 4/4. Lets keep the discussion focused on that for now,
>> if we could.
>>
>
> OK, since I got literally swamped by the amount of talks and patches over
> this theoretically simple topic, would you mind 1) posting the global
> patch over eventfd 2) describe exactly what races are you talking about
> 3) explain why this should be any concern of eventfd at all?
>
>
>
Yes, I can do that. Thanks Davide,

-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-19 18:51:46

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

Davide Libenzi wrote:
> On Thu, 18 Jun 2009, Gregory Haskins wrote:
>
>>
>> But the fact is: I do not see any way to actually use your referenceless
>> POLLHUP release code in a race free way without doing something like I
>> propose in 3/4, 4/4. Lets keep the discussion focused on that for now,
>> if we could.
>
> OK, since I got literally swamped by the amount of talks and patches over
> this theoretically simple topic, would you mind 1) posting the global
> patch over eventfd

Done. Should show up as replies to this email. These apply to v2.6.30 and
are devoid of any KVM-isms. :)

(Note-1: I've built and run these patches against additional kvm/irqfd
patches and verified they all work properly)

(Note-2: I included your POLLHUP patch as 1/3 since it currently only exists
in kvm.git, and is a pre-req for the other two)

> 2) describe exactly what races are you talking about


Covered in the patch headers (3/3 probably has the most detail)


> 3) explain why this should be any concern of eventfd at all?
>

To support any in-kernel eventfd client that wants to get both "signal"
and "release" notifications (based on using your POLLHUP patch) in a
race-free way. Without 2/3, 3/3, I cannot see how this can be done. Yet
your 1/3 patch is an important enhancement (to at least one client). I
suspect any other in-kernel users will find this a good feature as well.

Thanks Davide,
-Greg

2009-06-19 18:52:00

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH 1/3] eventfd: Allow waiters to be notified about the eventfd file* going away

From: Davide Libenzi <[email protected]>

And give them a change to unregister from the wait queue.

This is turn allows eventfd users to use the eventfd file* w/out holding a
live reference to it.

After the eventfd user callbacks returns, any usage of the eventfd file*
should be dropped. The eventfd user callback can acquire sleepy locks
since it is invoked lockless.

This is a feature, needed by KVM to avoid an awkward workaround when using
eventfd.

Signed-off-by: Davide Libenzi <[email protected]>
Tested-by: Gregory Haskins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Gregory Haskins <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
---

fs/eventfd.c | 10 +++++++++-
1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 2a701d5..c71f51d 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -59,7 +59,15 @@ int eventfd_signal(struct file *file, int n)

static int eventfd_release(struct inode *inode, struct file *file)
{
- kfree(file->private_data);
+ struct eventfd_ctx *ctx = file->private_data;
+
+ /*
+ * No need to hold the lock here, since we are on the file cleanup
+ * path and the ones still attached to the wait queue will be
+ * serialized by wake_up_locked_poll().
+ */
+ wake_up_locked_poll(&ctx->wqh, POLLHUP);
+ kfree(ctx);
return 0;
}

2009-06-19 18:52:21

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH 2/3] eventfd: add generalized notifier interface

Users that want to register for signal notifications with eventfd have
several choices today: They can do a standard sleep+wakeup against a
->read(), or they can provide their own wakeup handling using the wait-queue
callback mechanism coupled with the the eventfd->poll() interface.

In fact, Davide recently published a patch that allows eventfd to transmit
a "release" event when the underlying eventfd is closed via a POLLHUP
wakeup. This type of event is extremely useful for in-kernel notification
clients. However the wait-queue based notification interface alone is not
sufficient to use this new information race-free since it requires
operating lockless and referenceless. We need to track some additional
data that is independent of the file* pointer, since we need
f_ops->release() to still function.

Therefore, this patch lays the groundwork to try and fix these issues. It
accomplishes this by abstracting eventfd's wait-queue based notification
interface behind eventfd specific register()/unregister() verbs. It also
provides an eventfd specific object (eventfd_notifier) that is intended to
be embedded in the client, but used by eventfd to track proper state.

We will use this interface later in the series to fix the current races.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Davide Libenzi <[email protected]>
CC: Michael S. Tsirkin <[email protected]>
---

fs/eventfd.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/eventfd.h | 33 ++++++++++++++++++++++++
2 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index c71f51d..3d7fb16 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -242,3 +242,67 @@ SYSCALL_DEFINE1(eventfd, unsigned int, count)
return sys_eventfd2(count, 0);
}

+static int eventfd_notifier_wakeup(wait_queue_t *wait, unsigned mode,
+ int sync, void *key)
+{
+ struct eventfd_notifier *en;
+ unsigned long flags = (unsigned long)key;
+
+ en = container_of(wait, struct eventfd_notifier, wait);
+
+ if (flags & POLLIN)
+ /*
+ * The POLLIN wake_up is called with interrupts disabled.
+ */
+ en->ops->signal(en);
+
+ if (flags & POLLHUP) {
+ /*
+ * The POLLHUP is called unlocked, so it theoretically should
+ * be safe to remove ourselves from the wqh using the locked
+ * variant of remove_wait_queue()
+ */
+ remove_wait_queue(en->wqh, &en->wait);
+ en->ops->release(en);
+ }
+
+ return 0;
+}
+
+static void eventfd_notifier_ptable_enqueue(struct file *file,
+ wait_queue_head_t *wqh,
+ poll_table *pt)
+{
+ struct eventfd_notifier *en;
+
+ en = container_of(pt, struct eventfd_notifier, pt);
+
+ en->wqh = wqh;
+ add_wait_queue(wqh, &en->wait);
+}
+
+int eventfd_notifier_register(struct file *file, struct eventfd_notifier *en)
+{
+ unsigned int events;
+
+ if (file->f_op != &eventfd_fops)
+ return -EINVAL;
+
+ /*
+ * Install our own custom wake-up handling so we are notified via
+ * a callback whenever someone signals the underlying eventfd
+ */
+ init_waitqueue_func_entry(&en->wait, eventfd_notifier_wakeup);
+ init_poll_funcptr(&en->pt, eventfd_notifier_ptable_enqueue);
+
+ events = file->f_op->poll(file, &en->pt);
+
+ return (events & POLLIN) ? 1 : 0;
+}
+EXPORT_SYMBOL_GPL(eventfd_notifier_register);
+
+void eventfd_notifier_unregister(struct eventfd_notifier *en)
+{
+ remove_wait_queue(en->wqh, &en->wait);
+}
+EXPORT_SYMBOL_GPL(eventfd_notifier_unregister);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index f45a8ae..cb23969 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -8,6 +8,32 @@
#ifndef _LINUX_EVENTFD_H
#define _LINUX_EVENTFD_H

+#include <linux/wait.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/list.h>
+
+struct eventfd_notifier;
+
+struct eventfd_notifier_ops {
+ void (*signal)(struct eventfd_notifier *en);
+ void (*release)(struct eventfd_notifier *en);
+};
+
+struct eventfd_notifier {
+ poll_table pt;
+ wait_queue_head_t *wqh;
+ wait_queue_t wait;
+ const struct eventfd_notifier_ops *ops;
+};
+
+static inline void eventfd_notifier_init(struct eventfd_notifier *en,
+ const struct eventfd_notifier_ops *ops)
+{
+ memset(en, 0, sizeof(*en));
+ en->ops = ops;
+}
+
#ifdef CONFIG_EVENTFD

/* For O_CLOEXEC and O_NONBLOCK */
@@ -29,12 +55,19 @@

struct file *eventfd_fget(int fd);
int eventfd_signal(struct file *file, int n);
+int eventfd_notifier_register(struct file *file, struct eventfd_notifier *en);
+void eventfd_notifier_unregister(struct eventfd_notifier *en);

#else /* CONFIG_EVENTFD */

#define eventfd_fget(fd) ERR_PTR(-ENOSYS)
static inline int eventfd_signal(struct file *file, int n)
{ return 0; }
+static inline int eventfd_notifier_register(struct file *file,
+ struct eventfd_notifier *en)
+{ return -ENOSYS; }
+static inline int eventfd_notifier_unregister(struct eventfd_notifier *en)
+{ return -ENOSYS; }

#endif /* CONFIG_EVENTFD */

2009-06-19 18:52:32

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
notifier->release() callback. This lets notification clients know if
the eventfd is about to go away and is very useful particularly for
in-kernel clients. However, as it stands today it is not possible to
use the notification API in a race-free way. This patch adds some
additional logic to the notification subsystem to rectify this problem.

Background:
-----------------------
Eventfd currently only has one reference count mechanism: fget/fput. This
in of itself is normally fine. However, if a client expects to be
notified if the eventfd is closed, it cannot hold a fget() reference
itself or the underlying f_ops->release() callback will never be invoked
by VFS. Therefore we have this somewhat unusual situation where we may
hold a pointer to an eventfd object (by virtue of having a waiter registered
in its wait-queue), but no reference. This makes it nearly impossible to
design a mutual decoupling algorithm: you cannot unhook one side from the
other (or vice versa) without racing.

The first problem was dealt with by essentially "donating" a surrogate
object to eventfd. In other words, when a client attached to eventfd
and then later detached, it would decouple internally in a race free way
and then leave part of the object still attached to the eventfd. This
decoupled object would correctly detect the end-of-life of the eventfd
object at some point in the future and be deallocated: However, we cannot
guarantee that this operation would not race with a potential rmmod of the
client, and is therefore broken.

Solution Details:
-----------------------

1) We add a private kref to the internal eventfd_ctx object. This
reference can be (transparently) held by notification clients without
affecting the ability for VFS to indicate ->release() notification.

2) We convert the current lockless POLLHUP to a more traditional locked
variant (*) so that we can ensure a race free mutual-decouple
algorithm without requiring an surrogate object.

3) We guard the decouple algorithm with an atomic bit-clear to ensure
mutual exclusion of the decoupling and reference-drop.

4) We hold a reference to the underlying eventfd_ctx until all paths
have satisfactorily completed to ensure we do not race with eventfd
going away.

Between these points, we believe we now have a race-free release
mechanism.

[*] Clients that previously assumed the ->release() could sleep will
need to be refactored.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Davide Libenzi <[email protected]>
CC: Michael S. Tsirkin <[email protected]>
---

fs/eventfd.c | 62 +++++++++++++++++++++++++++++++++--------------
include/linux/eventfd.h | 3 ++
2 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 3d7fb16..934efee 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -16,8 +16,10 @@
#include <linux/anon_inodes.h>
#include <linux/eventfd.h>
#include <linux/syscalls.h>
+#include <linux/kref.h>

struct eventfd_ctx {
+ struct kref kref;
wait_queue_head_t wqh;
/*
* Every time that a write(2) is performed on an eventfd, the
@@ -57,17 +59,24 @@ int eventfd_signal(struct file *file, int n)
return n;
}

+static void _eventfd_release(struct kref *kref)
+{
+ struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+ kfree(ctx);
+}
+
+static void _eventfd_put(struct kref *kref)
+{
+ kref_put(kref, &_eventfd_release);
+}
+
static int eventfd_release(struct inode *inode, struct file *file)
{
struct eventfd_ctx *ctx = file->private_data;

- /*
- * No need to hold the lock here, since we are on the file cleanup
- * path and the ones still attached to the wait queue will be
- * serialized by wake_up_locked_poll().
- */
- wake_up_locked_poll(&ctx->wqh, POLLHUP);
- kfree(ctx);
+ wake_up_poll(&ctx->wqh, POLLHUP);
+ _eventfd_put(&ctx->kref);
return 0;
}

@@ -222,6 +231,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
if (!ctx)
return -ENOMEM;

+ kref_init(&ctx->kref);
init_waitqueue_head(&ctx->wqh);
ctx->count = count;
ctx->flags = flags;
@@ -242,6 +252,15 @@ SYSCALL_DEFINE1(eventfd, unsigned int, count)
return sys_eventfd2(count, 0);
}

+enum {
+ eventfd_notifier_flag_active,
+};
+
+static int test_and_clear_active(struct eventfd_notifier *en)
+{
+ return test_and_clear_bit(eventfd_notifier_flag_active, &en->flags);
+}
+
static int eventfd_notifier_wakeup(wait_queue_t *wait, unsigned mode,
int sync, void *key)
{
@@ -251,19 +270,15 @@ static int eventfd_notifier_wakeup(wait_queue_t *wait, unsigned mode,
en = container_of(wait, struct eventfd_notifier, wait);

if (flags & POLLIN)
- /*
- * The POLLIN wake_up is called with interrupts disabled.
- */
en->ops->signal(en);

if (flags & POLLHUP) {
- /*
- * The POLLHUP is called unlocked, so it theoretically should
- * be safe to remove ourselves from the wqh using the locked
- * variant of remove_wait_queue()
- */
- remove_wait_queue(en->wqh, &en->wait);
- en->ops->release(en);
+
+ if (test_and_clear_active(en)) {
+ __remove_wait_queue(en->wqh, &en->wait);
+ _eventfd_put(en->eventfd);
+ en->ops->release(en);
+ }
}

return 0;
@@ -283,11 +298,14 @@ static void eventfd_notifier_ptable_enqueue(struct file *file,

int eventfd_notifier_register(struct file *file, struct eventfd_notifier *en)
{
+ struct eventfd_ctx *ctx;
unsigned int events;

if (file->f_op != &eventfd_fops)
return -EINVAL;

+ ctx = file->private_data;
+
/*
* Install our own custom wake-up handling so we are notified via
* a callback whenever someone signals the underlying eventfd
@@ -297,12 +315,20 @@ int eventfd_notifier_register(struct file *file, struct eventfd_notifier *en)

events = file->f_op->poll(file, &en->pt);

+ kref_get(&ctx->kref);
+ en->eventfd = &ctx->kref;
+
+ set_bit(eventfd_notifier_flag_active, &en->flags);
+
return (events & POLLIN) ? 1 : 0;
}
EXPORT_SYMBOL_GPL(eventfd_notifier_register);

void eventfd_notifier_unregister(struct eventfd_notifier *en)
{
- remove_wait_queue(en->wqh, &en->wait);
+ if (test_and_clear_active(en)) {
+ remove_wait_queue(en->wqh, &en->wait);
+ _eventfd_put(en->eventfd);
+ }
}
EXPORT_SYMBOL_GPL(eventfd_notifier_unregister);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index cb23969..2b6e239 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -12,6 +12,7 @@
#include <linux/poll.h>
#include <linux/file.h>
#include <linux/list.h>
+#include <linux/kref.h>

struct eventfd_notifier;

@@ -24,6 +25,8 @@ struct eventfd_notifier {
poll_table pt;
wait_queue_head_t *wqh;
wait_queue_t wait;
+ struct kref *eventfd;
+ unsigned long flags;
const struct eventfd_notifier_ops *ops;
};

2009-06-19 19:17:11

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Fri, 19 Jun 2009, Gregory Haskins wrote:

> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> notifier->release() callback. This lets notification clients know if
> the eventfd is about to go away and is very useful particularly for
> in-kernel clients. However, as it stands today it is not possible to
> use the notification API in a race-free way. This patch adds some
> additional logic to the notification subsystem to rectify this problem.
>
> Background:
> -----------------------
> Eventfd currently only has one reference count mechanism: fget/fput. This
> in of itself is normally fine. However, if a client expects to be
> notified if the eventfd is closed, it cannot hold a fget() reference
> itself or the underlying f_ops->release() callback will never be invoked
> by VFS. Therefore we have this somewhat unusual situation where we may
> hold a pointer to an eventfd object (by virtue of having a waiter registered
> in its wait-queue), but no reference. This makes it nearly impossible to
> design a mutual decoupling algorithm: you cannot unhook one side from the
> other (or vice versa) without racing.

And why is that?

struct xxx {
struct mutex mtx;
struct file *file;
...
};

struct file *xxx_get_file(struct xxx *x) {
struct file *file;

mutex_lock(&x->mtx);
file = x->file;
if (!file)
mutex_unlock(&x->mtx);
return file;
}

void xxx_release_file(struct xxx *x) {
mutex_unlock(&x->mtx);
}

void handle_POLLHUP(struct xxx *x) {
struct file *file;

file = xxx_get_file(x);
if (file) {
unhook_waitqueue(file, ...);
x->file = NULL;
xxx_release_file(x);
}
}


Every time you need to "use" file, you call xxx_get_file(), and if you get
NULL, it means it's gone and you handle it accordigly to your IRQ fd
policies. As soon as you done with the file, you call xxx_release_file().
Replace "mtx" with the lock that fits your needs.



- Davide

2009-06-19 21:17:19

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>
>
>> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
>> notifier->release() callback. This lets notification clients know if
>> the eventfd is about to go away and is very useful particularly for
>> in-kernel clients. However, as it stands today it is not possible to
>> use the notification API in a race-free way. This patch adds some
>> additional logic to the notification subsystem to rectify this problem.
>>
>> Background:
>> -----------------------
>> Eventfd currently only has one reference count mechanism: fget/fput. This
>> in of itself is normally fine. However, if a client expects to be
>> notified if the eventfd is closed, it cannot hold a fget() reference
>> itself or the underlying f_ops->release() callback will never be invoked
>> by VFS. Therefore we have this somewhat unusual situation where we may
>> hold a pointer to an eventfd object (by virtue of having a waiter registered
>> in its wait-queue), but no reference. This makes it nearly impossible to
>> design a mutual decoupling algorithm: you cannot unhook one side from the
>> other (or vice versa) without racing.
>>
>
> And why is that?
>
> struct xxx {
> struct mutex mtx;
> struct file *file;
> ...
> };
>
> struct file *xxx_get_file(struct xxx *x) {
> struct file *file;
>
> mutex_lock(&x->mtx);
> file = x->file;
> if (!file)
> mutex_unlock(&x->mtx);
> return file;
> }
>
> void xxx_release_file(struct xxx *x) {
> mutex_unlock(&x->mtx);
> }
>
> void handle_POLLHUP(struct xxx *x) {
> struct file *file;
>
> file = xxx_get_file(x);
> if (file) {
> unhook_waitqueue(file, ...);
> x->file = NULL;
> xxx_release_file(x);
> }
> }
>
>
> Every time you need to "use" file, you call xxx_get_file(), and if you get
> NULL, it means it's gone and you handle it accordigly to your IRQ fd
> policies. As soon as you done with the file, you call xxx_release_file().
> Replace "mtx" with the lock that fits your needs.
>

Consider what would happen if the f_ops->release() was preempted inside
the wake_up_locked_polled() after it dereferenced the xxx from the list,
but before it calls the callback(POLLHUP). The xxx object, and/or the
.text for the xxx object may be long gone by the time it comes back
around. Afaict, there is no way to guard against that scenario unless
you do something like 2/3+3/3. Or am I missing something?

-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-19 21:33:15

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Fri, 19 Jun 2009, Gregory Haskins wrote:

> Davide Libenzi wrote:
> > On Fri, 19 Jun 2009, Gregory Haskins wrote:
> >
> >
> >> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> >> notifier->release() callback. This lets notification clients know if
> >> the eventfd is about to go away and is very useful particularly for
> >> in-kernel clients. However, as it stands today it is not possible to
> >> use the notification API in a race-free way. This patch adds some
> >> additional logic to the notification subsystem to rectify this problem.
> >>
> >> Background:
> >> -----------------------
> >> Eventfd currently only has one reference count mechanism: fget/fput. This
> >> in of itself is normally fine. However, if a client expects to be
> >> notified if the eventfd is closed, it cannot hold a fget() reference
> >> itself or the underlying f_ops->release() callback will never be invoked
> >> by VFS. Therefore we have this somewhat unusual situation where we may
> >> hold a pointer to an eventfd object (by virtue of having a waiter registered
> >> in its wait-queue), but no reference. This makes it nearly impossible to
> >> design a mutual decoupling algorithm: you cannot unhook one side from the
> >> other (or vice versa) without racing.
> >>
> >
> > And why is that?
> >
> > struct xxx {
> > struct mutex mtx;
> > struct file *file;
> > ...
> > };
> >
> > struct file *xxx_get_file(struct xxx *x) {
> > struct file *file;
> >
> > mutex_lock(&x->mtx);
> > file = x->file;
> > if (!file)
> > mutex_unlock(&x->mtx);
> > return file;
> > }
> >
> > void xxx_release_file(struct xxx *x) {
> > mutex_unlock(&x->mtx);
> > }
> >
> > void handle_POLLHUP(struct xxx *x) {
> > struct file *file;
> >
> > file = xxx_get_file(x);
> > if (file) {
> > unhook_waitqueue(file, ...);
> > x->file = NULL;
> > xxx_release_file(x);
> > }
> > }
> >
> >
> > Every time you need to "use" file, you call xxx_get_file(), and if you get
> > NULL, it means it's gone and you handle it accordigly to your IRQ fd
> > policies. As soon as you done with the file, you call xxx_release_file().
> > Replace "mtx" with the lock that fits your needs.
> >
>
> Consider what would happen if the f_ops->release() was preempted inside
> the wake_up_locked_polled() after it dereferenced the xxx from the list,
> but before it calls the callback(POLLHUP). The xxx object, and/or the
> .text for the xxx object may be long gone by the time it comes back
> around. Afaict, there is no way to guard against that scenario unless
> you do something like 2/3+3/3. Or am I missing something?

Right. Don't you see an easier answer to that, instead of adding 300 lines
of code to eventfd?
For example, turning wake_up_locked() into a nornal wake_up().



- Davide

2009-06-19 21:50:08

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>
>
>> Davide Libenzi wrote:
>>
>>> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>>>
>>>
>>>
>>>> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
>>>> notifier->release() callback. This lets notification clients know if
>>>> the eventfd is about to go away and is very useful particularly for
>>>> in-kernel clients. However, as it stands today it is not possible to
>>>> use the notification API in a race-free way. This patch adds some
>>>> additional logic to the notification subsystem to rectify this problem.
>>>>
>>>> Background:
>>>> -----------------------
>>>> Eventfd currently only has one reference count mechanism: fget/fput. This
>>>> in of itself is normally fine. However, if a client expects to be
>>>> notified if the eventfd is closed, it cannot hold a fget() reference
>>>> itself or the underlying f_ops->release() callback will never be invoked
>>>> by VFS. Therefore we have this somewhat unusual situation where we may
>>>> hold a pointer to an eventfd object (by virtue of having a waiter registered
>>>> in its wait-queue), but no reference. This makes it nearly impossible to
>>>> design a mutual decoupling algorithm: you cannot unhook one side from the
>>>> other (or vice versa) without racing.
>>>>
>>>>
>>> And why is that?
>>>
>>> struct xxx {
>>> struct mutex mtx;
>>> struct file *file;
>>> ...
>>> };
>>>
>>> struct file *xxx_get_file(struct xxx *x) {
>>> struct file *file;
>>>
>>> mutex_lock(&x->mtx);
>>> file = x->file;
>>> if (!file)
>>> mutex_unlock(&x->mtx);
>>> return file;
>>> }
>>>
>>> void xxx_release_file(struct xxx *x) {
>>> mutex_unlock(&x->mtx);
>>> }
>>>
>>> void handle_POLLHUP(struct xxx *x) {
>>> struct file *file;
>>>
>>> file = xxx_get_file(x);
>>> if (file) {
>>> unhook_waitqueue(file, ...);
>>> x->file = NULL;
>>> xxx_release_file(x);
>>> }
>>> }
>>>
>>>
>>> Every time you need to "use" file, you call xxx_get_file(), and if you get
>>> NULL, it means it's gone and you handle it accordigly to your IRQ fd
>>> policies. As soon as you done with the file, you call xxx_release_file().
>>> Replace "mtx" with the lock that fits your needs.
>>>
>>>
>> Consider what would happen if the f_ops->release() was preempted inside
>> the wake_up_locked_polled() after it dereferenced the xxx from the list,
>> but before it calls the callback(POLLHUP). The xxx object, and/or the
>> .text for the xxx object may be long gone by the time it comes back
>> around. Afaict, there is no way to guard against that scenario unless
>> you do something like 2/3+3/3. Or am I missing something?
>>
>
> Right. Don't you see an easier answer to that, instead of adding 300 lines
> of code to eventfd?
>

I tried, but this problem made my head hurt and this was what I came up
with that I felt closes the holes all the way. Also keep in mind that
while I added X lines to eventfd, I took Y lines *out* of irqfd in the
process, too. I just excluded the KVM portions in this thread per your
request, so its not apparent. But technically, any other clients that
may come along can reuse that notification code instead of coding it
again. One way or the other, *someone* has to do that ptable_proc stuff
;) FYI: Its more like 133 lines, fwiw.

fs/eventfd.c | 104
++++++++++++++++++++++++++++++++++++++++++++----
include/linux/eventfd.h | 36 ++++++++++++++++
2 files changed, 133 insertions(+), 7 deletions(-)

In case you care, heres what the complete solution when I include KVM
currently looks like:

fs/eventfd.c | 104 +++++++++++++++++++++++++--
include/linux/eventfd.h | 36 +++++++++
virt/kvm/eventfd.c | 181
+++++++++++++++++++++++++-----------------------
3 files changed, 228 insertions(+), 93 deletions(-)

> For example, turning wake_up_locked() into a nornal wake_up().
>

I am fairly confident it is not that simple after having thought about
this issue over the last few days. But I've been wrong in the past.
Propose a patch and I will review it for races/correctness, if you
like. Perhaps a combination of that plus your asymmetrical locking
scheme would work. One of the challenges you will hit is avoiding ABBA
between your "get" lock and the wqh, but good luck!

-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-19 22:00:40

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Fri, 19 Jun 2009, Gregory Haskins wrote:

> I am fairly confident it is not that simple after having thought about
> this issue over the last few days. But I've been wrong in the past.
> Propose a patch and I will review it for races/correctness, if you
> like. Perhaps a combination of that plus your asymmetrical locking
> scheme would work. One of the challenges you will hit is avoiding ABBA
> between your "get" lock and the wqh, but good luck!

A patch for what? The eventfd patch is a one-liner.
It seems hard to believe that the thing cannot be handled on your side.
Once the wake_up_locked() is turned into a wake_up(), what other races are
there?
Let's not try to find the problem that fits and justify the "cool"
solution, but let's see if a problem is there at all.


- Davide

2009-06-19 22:53:31

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Fri, 19 Jun 2009, Davide Libenzi wrote:

> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>
> > I am fairly confident it is not that simple after having thought about
> > this issue over the last few days. But I've been wrong in the past.
> > Propose a patch and I will review it for races/correctness, if you
> > like. Perhaps a combination of that plus your asymmetrical locking
> > scheme would work. One of the challenges you will hit is avoiding ABBA
> > between your "get" lock and the wqh, but good luck!
>
> A patch for what? The eventfd patch is a one-liner.
> It seems hard to believe that the thing cannot be handled on your side.
> Once the wake_up_locked() is turned into a wake_up(), what other races are
> there?

AFAICS, the IRQfd code simply registers the callback to ->poll() and waits
for two events.
In the POLLIN event, you schedule_work(&irqfd->inject) and there are no
races there AFAICS (you basically do not care of anything eventfd memory
related at all).
For POLLHUP, you do:

spin_lock(irqfd->slock);
if (irqfd->wqh)
schedule_work(&irqfd->inject);
irqfd->wqh = NULL;
spin_unlock(irqfd->slock);

In your work function you notice the POLLHUP condition and take proper
action (dunno what it is in your case).
In your kvm_irqfd_release() function:

spin_lock(irqfd->slock);
if (irqfd->wqh)
remove_wait_queue(irqfd->wqh, &irqfd->wait);
irqfd->wqh = NULL;
spin_unlock(irqfd->slock);

Any races in there?



- Davide

2009-06-20 02:09:50

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Fri, 19 Jun 2009, Davide Libenzi wrote:
>
>
>> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>>
>>
>>> I am fairly confident it is not that simple after having thought about
>>> this issue over the last few days. But I've been wrong in the past.
>>> Propose a patch and I will review it for races/correctness, if you
>>> like. Perhaps a combination of that plus your asymmetrical locking
>>> scheme would work. One of the challenges you will hit is avoiding ABBA
>>> between your "get" lock and the wqh, but good luck!
>>>
>> A patch for what? The eventfd patch is a one-liner.
>>

Yes, understood. What I was trying to gently say is that the one-liner
proposal alone is still broken afaict. However, if there is another
solution that works that you like better than 133-liner I posted, I am
more than willing to help analyze it. In the end, I only care that this
is fixed.

>> It seems hard to believe that the thing cannot be handled on your side.
>> Once the wake_up_locked() is turned into a wake_up(), what other races are
>> there?
>>
>
> AFAICS, the IRQfd code simply registers the callback to ->poll() and waits
> for two events.
> In the POLLIN event, you schedule_work(&irqfd->inject) and there are no
> races there AFAICS (you basically do not care of anything eventfd memory
> related at all).
> For POLLHUP, you do:
>
> spin_lock(irqfd->slock);
> if (irqfd->wqh)
> schedule_work(&irqfd->inject);
> irqfd->wqh = NULL;
> spin_unlock(irqfd->slock);
>
> In your work function you notice the POLLHUP condition and take proper
> action (dunno what it is in your case).
> In your kvm_irqfd_release() function:
>
> spin_lock(irqfd->slock);
> if (irqfd->wqh)
> remove_wait_queue(irqfd->wqh, &irqfd->wait);
> irqfd->wqh = NULL;
> spin_unlock(irqfd->slock);
>
> Any races in there?
>

Yes, for one you have an ABBA deadlock on the irqfd->slock and wqh->lock
(recall wqh has to be locked to fix that other race I mentioned).

(As a hint, I think I fixed 4-5 races with these patches, so there are a
few others still lurking as well)

-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-20 21:24:03

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Fri, 19 Jun 2009, Gregory Haskins wrote:

> > In the POLLIN event, you schedule_work(&irqfd->inject) and there are no
> > races there AFAICS (you basically do not care of anything eventfd memory
> > related at all).
> > For POLLHUP, you do:
> >
> > spin_lock(irqfd->slock);
> > if (irqfd->wqh)
> > schedule_work(&irqfd->inject);
> > irqfd->wqh = NULL;
> > spin_unlock(irqfd->slock);
> >
> > In your work function you notice the POLLHUP condition and take proper
> > action (dunno what it is in your case).
> > In your kvm_irqfd_release() function:
> >
> > spin_lock(irqfd->slock);
> > if (irqfd->wqh)
> > remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > irqfd->wqh = NULL;
> > spin_unlock(irqfd->slock);
> >
> > Any races in there?
> >
>
> Yes, for one you have an ABBA deadlock on the irqfd->slock and wqh->lock
> (recall wqh has to be locked to fix that other race I mentioned).

Yep, true. How about we go in little steps?
How about the one below?
When you register your poll callback, you get a reference with
eventfd_refget().
At that point we have no more worries about the eventfd memory (namely
"wqh") going away.
Symmetrically, you use eventfd_refput() once you done with it.
So we basically de-couple the internal VFS refcount, from the eventfd
context memory refcount.
Where are the non-solveable races on the IRQfd side, of an approach like
this one?



> Yes, understood. What I was trying to gently say is that the one-liner
> proposal alone is still broken afaict. However, if there is another
> solution that works that you like better than 133-liner I posted, I am
> more than willing to help analyze it. In the end, I only care that this
> is fixed.

Is it so?
What I noticed here, is that you went from "Detailed Mode" (in the emails
where you were pushing the new bits), to "Hint Mode" (as below) when it
was time to see if there were simpler solutions to the problem.

> (As a hint, I think I fixed 4-5 races with these patches, so there are a
> few others still lurking as well)

Not knowing the details of IRQfd, hinting only is not going to help
anything in this case.



- Davide



---
fs/eventfd.c | 37 ++++++++++++++++++++++++++++++++++++-
include/linux/eventfd.h | 6 ++++++
2 files changed, 42 insertions(+), 1 deletion(-)

Index: linux-2.6.mod/fs/eventfd.c
===================================================================
--- linux-2.6.mod.orig/fs/eventfd.c 2009-06-20 13:08:22.000000000 -0700
+++ linux-2.6.mod/fs/eventfd.c 2009-06-20 14:00:23.000000000 -0700
@@ -17,8 +17,10 @@
#include <linux/eventfd.h>
#include <linux/syscalls.h>
#include <linux/module.h>
+#include <linux/kref.h>

struct eventfd_ctx {
+ struct kref kref;
wait_queue_head_t wqh;
/*
* Every time that a write(2) is performed on an eventfd, the
@@ -59,9 +61,19 @@ int eventfd_signal(struct file *file, in
}
EXPORT_SYMBOL_GPL(eventfd_signal);

+static void eventfd_free(struct kref *kref)
+{
+ struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+ kfree(ctx);
+}
+
static int eventfd_release(struct inode *inode, struct file *file)
{
- kfree(file->private_data);
+ struct eventfd_ctx *ctx = file->private_data;
+
+ wake_up_poll(&ctx->wqh, POLLHUP);
+ kref_put(&ctx->kref, eventfd_free);
return 0;
}

@@ -201,6 +213,28 @@ struct file *eventfd_fget(int fd)
}
EXPORT_SYMBOL_GPL(eventfd_fget);

+int eventfd_refget(struct file *file)
+{
+ struct eventfd_ctx *ctx = file->private_data;
+
+ if (file->f_op != &eventfd_fops)
+ return -EINVAL;
+ kref_get(&ctx->kref);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(eventfd_refget);
+
+int eventfd_refput(struct file *file)
+{
+ struct eventfd_ctx *ctx = file->private_data;
+
+ if (file->f_op != &eventfd_fops)
+ return -EINVAL;
+ kref_put(&ctx->kref, eventfd_free);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(eventfd_refput);
+
SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
{
int fd;
@@ -217,6 +251,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int,
if (!ctx)
return -ENOMEM;

+ kref_init(&ctx->kref);
init_waitqueue_head(&ctx->wqh);
ctx->count = count;
ctx->flags = flags;
Index: linux-2.6.mod/include/linux/eventfd.h
===================================================================
--- linux-2.6.mod.orig/include/linux/eventfd.h 2009-06-20 13:08:22.000000000 -0700
+++ linux-2.6.mod/include/linux/eventfd.h 2009-06-20 13:59:09.000000000 -0700
@@ -29,12 +29,18 @@

struct file *eventfd_fget(int fd);
int eventfd_signal(struct file *file, int n);
+int eventfd_refget(struct file *file);
+int eventfd_refput(struct file *file);

#else /* CONFIG_EVENTFD */

#define eventfd_fget(fd) ERR_PTR(-ENOSYS)
static inline int eventfd_signal(struct file *file, int n)
{ return 0; }
+static inline int eventfd_refget(struct file *file)
+{ return -ENOSYS; }
+static inline int eventfd_refput(struct file *file)
+{ return -ENOSYS; }

#endif /* CONFIG_EVENTFD */

2009-06-20 22:17:29

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Sat, 20 Jun 2009, Davide Libenzi wrote:

> How about the one below?

Maybe with an interface that can be undone w/out a file* :)


- Davide


---
fs/eventfd.c | 34 +++++++++++++++++++++++++++++++++-
include/linux/eventfd.h | 8 ++++++++
2 files changed, 41 insertions(+), 1 deletion(-)

Index: linux-2.6.mod/fs/eventfd.c
===================================================================
--- linux-2.6.mod.orig/fs/eventfd.c 2009-06-20 13:08:22.000000000 -0700
+++ linux-2.6.mod/fs/eventfd.c 2009-06-20 15:07:18.000000000 -0700
@@ -17,8 +17,10 @@
#include <linux/eventfd.h>
#include <linux/syscalls.h>
#include <linux/module.h>
+#include <linux/kref.h>

struct eventfd_ctx {
+ struct kref kref;
wait_queue_head_t wqh;
/*
* Every time that a write(2) is performed on an eventfd, the
@@ -59,9 +61,19 @@ int eventfd_signal(struct file *file, in
}
EXPORT_SYMBOL_GPL(eventfd_signal);

+static void eventfd_free(struct kref *kref)
+{
+ struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+ kfree(ctx);
+}
+
static int eventfd_release(struct inode *inode, struct file *file)
{
- kfree(file->private_data);
+ struct eventfd_ctx *ctx = file->private_data;
+
+ wake_up_poll(&ctx->wqh, POLLHUP);
+ kref_put(&ctx->kref, eventfd_free);
return 0;
}

@@ -201,6 +213,25 @@ struct file *eventfd_fget(int fd)
}
EXPORT_SYMBOL_GPL(eventfd_fget);

+struct eventfd_kref *eventfd_refget(struct file *file)
+{
+ struct eventfd_ctx *ctx = file->private_data;
+
+ if (file->f_op != &eventfd_fops)
+ return ERR_PTR(-EINVAL);
+ kref_get(&ctx->kref);
+ return (struct eventfd_kref *) ctx;
+}
+EXPORT_SYMBOL_GPL(eventfd_refget);
+
+void eventfd_refput(struct eventfd_kref *ref)
+{
+ struct eventfd_ctx *ctx = (struct eventfd_ctx *) ref;
+
+ kref_put(&ctx->kref, eventfd_free);
+}
+EXPORT_SYMBOL_GPL(eventfd_refput);
+
SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
{
int fd;
@@ -217,6 +248,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int,
if (!ctx)
return -ENOMEM;

+ kref_init(&ctx->kref);
init_waitqueue_head(&ctx->wqh);
ctx->count = count;
ctx->flags = flags;
Index: linux-2.6.mod/include/linux/eventfd.h
===================================================================
--- linux-2.6.mod.orig/include/linux/eventfd.h 2009-06-20 13:08:22.000000000 -0700
+++ linux-2.6.mod/include/linux/eventfd.h 2009-06-20 15:06:03.000000000 -0700
@@ -8,6 +8,8 @@
#ifndef _LINUX_EVENTFD_H
#define _LINUX_EVENTFD_H

+struct eventfd_kref;
+
#ifdef CONFIG_EVENTFD

/* For O_CLOEXEC and O_NONBLOCK */
@@ -29,12 +31,18 @@

struct file *eventfd_fget(int fd);
int eventfd_signal(struct file *file, int n);
+struct eventfd_kref *eventfd_refget(struct file *file);
+void eventfd_refput(struct eventfd_kref *ref);

#else /* CONFIG_EVENTFD */

#define eventfd_fget(fd) ERR_PTR(-ENOSYS)
static inline int eventfd_signal(struct file *file, int n)
{ return 0; }
+static inline struct eventfd_kref *eventfd_refget(struct file *file)
+{ return ERR_PTR(-ENOSYS); }
+static inline void eventfd_refput(struct eventfd_kref *ref)
+{ }

#endif /* CONFIG_EVENTFD */

2009-06-20 23:55:15

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Sat, 20 Jun 2009, Davide Libenzi wrote:

> On Sat, 20 Jun 2009, Davide Libenzi wrote:
>
> > How about the one below?
>
> Maybe with an interface that can be undone w/out a file* :)

This is another alternative, based on a low-carb diet of your notifier
patch.
Same concept of de-coupling VFS refcount from eventfd memory context, and
allowing a poll callback register/unregister.
AFAICS, based on my limited knowledge of the IRQfd policies, your
->release() path needs to eventfd_pollcb_unregister() and wait for all
pending works to be done.




- Davide



---
fs/eventfd.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/eventfd.h | 23 ++++++++++++++++
2 files changed, 91 insertions(+), 1 deletion(-)

Index: linux-2.6.mod/fs/eventfd.c
===================================================================
--- linux-2.6.mod.orig/fs/eventfd.c 2009-06-20 16:25:45.000000000 -0700
+++ linux-2.6.mod/fs/eventfd.c 2009-06-20 16:35:22.000000000 -0700
@@ -17,8 +17,10 @@
#include <linux/eventfd.h>
#include <linux/syscalls.h>
#include <linux/module.h>
+#include <linux/kref.h>

struct eventfd_ctx {
+ struct kref kref;
wait_queue_head_t wqh;
/*
* Every time that a write(2) is performed on an eventfd, the
@@ -59,9 +61,29 @@ int eventfd_signal(struct file *file, in
}
EXPORT_SYMBOL_GPL(eventfd_signal);

+static void eventfd_free(struct kref *kref)
+{
+ struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+ kfree(ctx);
+}
+
+static void eventfd_get(struct eventfd_ctx *ctx)
+{
+ kref_get(&ctx->kref);
+}
+
+static void eventfd_put(struct eventfd_ctx *ctx)
+{
+ kref_put(&ctx->kref, eventfd_free);
+}
+
static int eventfd_release(struct inode *inode, struct file *file)
{
- kfree(file->private_data);
+ struct eventfd_ctx *ctx = file->private_data;
+
+ wake_up_poll(&ctx->wqh, POLLHUP);
+ eventfd_put(ctx);
return 0;
}

@@ -217,6 +239,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int,
if (!ctx)
return -ENOMEM;

+ kref_init(&ctx->kref);
init_waitqueue_head(&ctx->wqh);
ctx->count = count;
ctx->flags = flags;
@@ -237,3 +260,47 @@ SYSCALL_DEFINE1(eventfd, unsigned int, c
return sys_eventfd2(count, 0);
}

+static void eventfd_pollcb_ptqueue(struct file *file, wait_queue_head_t *wqh,
+ poll_table *pt)
+{
+ struct eventfd_pollcb *ecb;
+
+ ecb = container_of(pt, struct eventfd_pollcb, pt);
+
+ add_wait_queue(wqh, &ecb->wait);
+}
+
+int eventfd_pollcb_register(struct file *file, struct eventfd_pollcb *ecb,
+ wait_queue_func_t cbf)
+{
+ struct eventfd_ctx *ctx;
+ unsigned int events;
+
+ if (file->f_op != &eventfd_fops)
+ return -EINVAL;
+
+ ctx = file->private_data;
+
+ /*
+ * Install our own custom wake-up handling so we are notified via
+ * a callback whenever someone signals the underlying eventfd.
+ */
+ init_waitqueue_func_entry(&ecb->wait, cbf);
+ init_poll_funcptr(&ecb->pt, eventfd_pollcb_ptqueue);
+
+ events = file->f_op->poll(file, &ecb->pt);
+
+ eventfd_get(ctx);
+ ecb->ctx = ctx;
+
+ return (events & POLLIN) ? 1 : 0;
+}
+EXPORT_SYMBOL_GPL(eventfd_pollcb_register);
+
+void eventfd_pollcb_unregister(struct eventfd_pollcb *ecb)
+{
+ remove_wait_queue(&ecb->ctx->wqh, &ecb->wait);
+ eventfd_put(ecb->ctx);
+}
+EXPORT_SYMBOL_GPL(eventfd_pollcb_unregister);
+
Index: linux-2.6.mod/include/linux/eventfd.h
===================================================================
--- linux-2.6.mod.orig/include/linux/eventfd.h 2009-06-20 16:25:45.000000000 -0700
+++ linux-2.6.mod/include/linux/eventfd.h 2009-06-20 16:38:20.000000000 -0700
@@ -8,6 +8,20 @@
#ifndef _LINUX_EVENTFD_H
#define _LINUX_EVENTFD_H

+#include <linux/wait.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/list.h>
+#include <linux/kref.h>
+
+struct eventfd_ctx;
+
+struct eventfd_pollcb {
+ poll_table pt;
+ struct eventfd_ctx *ctx;
+ wait_queue_t wait;
+};
+
#ifdef CONFIG_EVENTFD

/* For O_CLOEXEC and O_NONBLOCK */
@@ -29,12 +43,21 @@

struct file *eventfd_fget(int fd);
int eventfd_signal(struct file *file, int n);
+int eventfd_pollcb_register(struct file *file, struct eventfd_pollcb *ecb,
+ wait_queue_func_t cbf);
+void eventfd_pollcb_unregister(struct eventfd_pollcb *ecb);

#else /* CONFIG_EVENTFD */

#define eventfd_fget(fd) ERR_PTR(-ENOSYS)
static inline int eventfd_signal(struct file *file, int n)
{ return 0; }
+static inline int eventfd_pollcb_register(struct file *file,
+ struct eventfd_pollcb *ecb,
+ wait_queue_func_t cbf)
+{ return -ENOSYS; }
+static inline void eventfd_pollcb_unregister(struct eventfd_pollcb *ecb)
+{ }

#endif /* CONFIG_EVENTFD */

2009-06-21 01:05:46

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>
>
>>> In the POLLIN event, you schedule_work(&irqfd->inject) and there are no
>>> races there AFAICS (you basically do not care of anything eventfd memory
>>> related at all).
>>> For POLLHUP, you do:
>>>
>>> spin_lock(irqfd->slock);
>>> if (irqfd->wqh)
>>> schedule_work(&irqfd->inject);
>>> irqfd->wqh = NULL;
>>> spin_unlock(irqfd->slock);
>>>
>>> In your work function you notice the POLLHUP condition and take proper
>>> action (dunno what it is in your case).
>>> In your kvm_irqfd_release() function:
>>>
>>> spin_lock(irqfd->slock);
>>> if (irqfd->wqh)
>>> remove_wait_queue(irqfd->wqh, &irqfd->wait);
>>> irqfd->wqh = NULL;
>>> spin_unlock(irqfd->slock);
>>>
>>> Any races in there?
>>>
>>>
>> Yes, for one you have an ABBA deadlock on the irqfd->slock and wqh->lock
>> (recall wqh has to be locked to fix that other race I mentioned).
>>
>
> Yep, true. How about we go in little steps?
> How about the one below?
> When you register your poll callback, you get a reference with
> eventfd_refget().
>

I was thinking about this last night/this morning and that was
essentially what I was going to propose next as a more palatable+simpler
solution.

> At that point we have no more worries about the eventfd memory (namely
> "wqh") going away.
> Symmetrically, you use eventfd_refput() once you done with it.
> So we basically de-couple the internal VFS refcount, from the eventfd
> context memory refcount.
> Where are the non-solveable races on the IRQfd side, of an approach like
> this one?
>
None... I think that can work. The core requirements boil down to: 1)
locking the POLLHUP (as we already discussed), and 2) holding a kref (or
equivelent) to the eventfd_ctx that the client can release once they are
fully out. After that, it looks like any vanilla wait-queue in the kernel.


>
>
>
>> Yes, understood. What I was trying to gently say is that the one-liner
>> proposal alone is still broken afaict. However, if there is another
>> solution that works that you like better than 133-liner I posted, I am
>> more than willing to help analyze it. In the end, I only care that this
>> is fixed.
>>
>
> Is it so?
> What I noticed here, is that you went from "Detailed Mode" (in the emails
> where you were pushing the new bits), to "Hint Mode" (as below) when it
> was time to see if there were simpler solutions to the problem.
>

I wasn't being purposely coy. It was late friday and I was getting
pressure from the wife to get off the computer ;) Plus, all the details
were in my patch, so I figured there wasn't a giant need to rehash since
it could just be gleened. In any case, based on this current reply I
think we are on the same page now.

>
>> (As a hint, I think I fixed 4-5 races with these patches, so there are a
>> few others still lurking as well)
>>
>
> Not knowing the details of IRQfd, hinting only is not going to help
> anything in this case.
>
>
>
> - Davide
>
>
>
> ---
> fs/eventfd.c | 37 ++++++++++++++++++++++++++++++++++++-
> include/linux/eventfd.h | 6 ++++++
> 2 files changed, 42 insertions(+), 1 deletion(-)
>

I haven't had a chance to go over the patch below in detail, but the
30,000 foot view is that it is in line with what I was thinking this
morning. So, its a preliminary "looks good". I'll take a closer look ASAP

Thanks Davide,
-Greg

> Index: linux-2.6.mod/fs/eventfd.c
> ===================================================================
> --- linux-2.6.mod.orig/fs/eventfd.c 2009-06-20 13:08:22.000000000 -0700
> +++ linux-2.6.mod/fs/eventfd.c 2009-06-20 14:00:23.000000000 -0700
> @@ -17,8 +17,10 @@
> #include <linux/eventfd.h>
> #include <linux/syscalls.h>
> #include <linux/module.h>
> +#include <linux/kref.h>
>
> struct eventfd_ctx {
> + struct kref kref;
> wait_queue_head_t wqh;
> /*
> * Every time that a write(2) is performed on an eventfd, the
> @@ -59,9 +61,19 @@ int eventfd_signal(struct file *file, in
> }
> EXPORT_SYMBOL_GPL(eventfd_signal);
>
> +static void eventfd_free(struct kref *kref)
> +{
> + struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
> +
> + kfree(ctx);
> +}
> +
> static int eventfd_release(struct inode *inode, struct file *file)
> {
> - kfree(file->private_data);
> + struct eventfd_ctx *ctx = file->private_data;
> +
> + wake_up_poll(&ctx->wqh, POLLHUP);
> + kref_put(&ctx->kref, eventfd_free);
> return 0;
> }
>
> @@ -201,6 +213,28 @@ struct file *eventfd_fget(int fd)
> }
> EXPORT_SYMBOL_GPL(eventfd_fget);
>
> +int eventfd_refget(struct file *file)
> +{
> + struct eventfd_ctx *ctx = file->private_data;
> +
> + if (file->f_op != &eventfd_fops)
> + return -EINVAL;
> + kref_get(&ctx->kref);
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(eventfd_refget);
> +
> +int eventfd_refput(struct file *file)
> +{
> + struct eventfd_ctx *ctx = file->private_data;
> +
> + if (file->f_op != &eventfd_fops)
> + return -EINVAL;
> + kref_put(&ctx->kref, eventfd_free);
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(eventfd_refput);
> +
> SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
> {
> int fd;
> @@ -217,6 +251,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int,
> if (!ctx)
> return -ENOMEM;
>
> + kref_init(&ctx->kref);
> init_waitqueue_head(&ctx->wqh);
> ctx->count = count;
> ctx->flags = flags;
> Index: linux-2.6.mod/include/linux/eventfd.h
> ===================================================================
> --- linux-2.6.mod.orig/include/linux/eventfd.h 2009-06-20 13:08:22.000000000 -0700
> +++ linux-2.6.mod/include/linux/eventfd.h 2009-06-20 13:59:09.000000000 -0700
> @@ -29,12 +29,18 @@
>
> struct file *eventfd_fget(int fd);
> int eventfd_signal(struct file *file, int n);
> +int eventfd_refget(struct file *file);
> +int eventfd_refput(struct file *file);
>
> #else /* CONFIG_EVENTFD */
>
> #define eventfd_fget(fd) ERR_PTR(-ENOSYS)
> static inline int eventfd_signal(struct file *file, int n)
> { return 0; }
> +static inline int eventfd_refget(struct file *file)
> +{ return -ENOSYS; }
> +static inline int eventfd_refput(struct file *file)
> +{ return -ENOSYS; }
>
> #endif /* CONFIG_EVENTFD */
>
>



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-21 01:14:19

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Sat, 20 Jun 2009, Davide Libenzi wrote:
>
>
>> On Sat, 20 Jun 2009, Davide Libenzi wrote:
>>
>>
>>> How about the one below?
>>>
>> Maybe with an interface that can be undone w/out a file* :)
>>
>
> This is another alternative, based on a low-carb diet of your notifier
> patch.
>
Ah, I should always check if I have more mail before responding to a now
stale patch ;)

Will review this version at the next chance I get.

Thanks again, Davide,
-Greg

> Same concept of de-coupling VFS refcount from eventfd memory context, and
> allowing a poll callback register/unregister.
> AFAICS, based on my limited knowledge of the IRQfd policies, your
> ->release() path needs to eventfd_pollcb_unregister() and wait for all
> pending works to be done.
>
>
>
>
> - Davide
>
>
>
> ---
> fs/eventfd.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/eventfd.h | 23 ++++++++++++++++
> 2 files changed, 91 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.mod/fs/eventfd.c
> ===================================================================
> --- linux-2.6.mod.orig/fs/eventfd.c 2009-06-20 16:25:45.000000000 -0700
> +++ linux-2.6.mod/fs/eventfd.c 2009-06-20 16:35:22.000000000 -0700
> @@ -17,8 +17,10 @@
> #include <linux/eventfd.h>
> #include <linux/syscalls.h>
> #include <linux/module.h>
> +#include <linux/kref.h>
>
> struct eventfd_ctx {
> + struct kref kref;
> wait_queue_head_t wqh;
> /*
> * Every time that a write(2) is performed on an eventfd, the
> @@ -59,9 +61,29 @@ int eventfd_signal(struct file *file, in
> }
> EXPORT_SYMBOL_GPL(eventfd_signal);
>
> +static void eventfd_free(struct kref *kref)
> +{
> + struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
> +
> + kfree(ctx);
> +}
> +
> +static void eventfd_get(struct eventfd_ctx *ctx)
> +{
> + kref_get(&ctx->kref);
> +}
> +
> +static void eventfd_put(struct eventfd_ctx *ctx)
> +{
> + kref_put(&ctx->kref, eventfd_free);
> +}
> +
> static int eventfd_release(struct inode *inode, struct file *file)
> {
> - kfree(file->private_data);
> + struct eventfd_ctx *ctx = file->private_data;
> +
> + wake_up_poll(&ctx->wqh, POLLHUP);
> + eventfd_put(ctx);
> return 0;
> }
>
> @@ -217,6 +239,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int,
> if (!ctx)
> return -ENOMEM;
>
> + kref_init(&ctx->kref);
> init_waitqueue_head(&ctx->wqh);
> ctx->count = count;
> ctx->flags = flags;
> @@ -237,3 +260,47 @@ SYSCALL_DEFINE1(eventfd, unsigned int, c
> return sys_eventfd2(count, 0);
> }
>
> +static void eventfd_pollcb_ptqueue(struct file *file, wait_queue_head_t *wqh,
> + poll_table *pt)
> +{
> + struct eventfd_pollcb *ecb;
> +
> + ecb = container_of(pt, struct eventfd_pollcb, pt);
> +
> + add_wait_queue(wqh, &ecb->wait);
> +}
> +
> +int eventfd_pollcb_register(struct file *file, struct eventfd_pollcb *ecb,
> + wait_queue_func_t cbf)
> +{
> + struct eventfd_ctx *ctx;
> + unsigned int events;
> +
> + if (file->f_op != &eventfd_fops)
> + return -EINVAL;
> +
> + ctx = file->private_data;
> +
> + /*
> + * Install our own custom wake-up handling so we are notified via
> + * a callback whenever someone signals the underlying eventfd.
> + */
> + init_waitqueue_func_entry(&ecb->wait, cbf);
> + init_poll_funcptr(&ecb->pt, eventfd_pollcb_ptqueue);
> +
> + events = file->f_op->poll(file, &ecb->pt);
> +
> + eventfd_get(ctx);
> + ecb->ctx = ctx;
> +
> + return (events & POLLIN) ? 1 : 0;
> +}
> +EXPORT_SYMBOL_GPL(eventfd_pollcb_register);
> +
> +void eventfd_pollcb_unregister(struct eventfd_pollcb *ecb)
> +{
> + remove_wait_queue(&ecb->ctx->wqh, &ecb->wait);
> + eventfd_put(ecb->ctx);
> +}
> +EXPORT_SYMBOL_GPL(eventfd_pollcb_unregister);
> +
> Index: linux-2.6.mod/include/linux/eventfd.h
> ===================================================================
> --- linux-2.6.mod.orig/include/linux/eventfd.h 2009-06-20 16:25:45.000000000 -0700
> +++ linux-2.6.mod/include/linux/eventfd.h 2009-06-20 16:38:20.000000000 -0700
> @@ -8,6 +8,20 @@
> #ifndef _LINUX_EVENTFD_H
> #define _LINUX_EVENTFD_H
>
> +#include <linux/wait.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/list.h>
> +#include <linux/kref.h>
> +
> +struct eventfd_ctx;
> +
> +struct eventfd_pollcb {
> + poll_table pt;
> + struct eventfd_ctx *ctx;
> + wait_queue_t wait;
> +};
> +
> #ifdef CONFIG_EVENTFD
>
> /* For O_CLOEXEC and O_NONBLOCK */
> @@ -29,12 +43,21 @@
>
> struct file *eventfd_fget(int fd);
> int eventfd_signal(struct file *file, int n);
> +int eventfd_pollcb_register(struct file *file, struct eventfd_pollcb *ecb,
> + wait_queue_func_t cbf);
> +void eventfd_pollcb_unregister(struct eventfd_pollcb *ecb);
>
> #else /* CONFIG_EVENTFD */
>
> #define eventfd_fget(fd) ERR_PTR(-ENOSYS)
> static inline int eventfd_signal(struct file *file, int n)
> { return 0; }
> +static inline int eventfd_pollcb_register(struct file *file,
> + struct eventfd_pollcb *ecb,
> + wait_queue_func_t cbf)
> +{ return -ENOSYS; }
> +static inline void eventfd_pollcb_unregister(struct eventfd_pollcb *ecb)
> +{ }
>
> #endif /* CONFIG_EVENTFD */
>
>



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-21 16:57:47

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Sat, 20 Jun 2009, Gregory Haskins wrote:

> Davide Libenzi wrote:
> > On Sat, 20 Jun 2009, Davide Libenzi wrote:
> >
> >
> >> On Sat, 20 Jun 2009, Davide Libenzi wrote:
> >>
> >>
> >>> How about the one below?
> >>>
> >> Maybe with an interface that can be undone w/out a file* :)
> >>
> >
> > This is another alternative, based on a low-carb diet of your notifier
> > patch.
> >
> Ah, I should always check if I have more mail before responding to a now
> stale patch ;)

Here. I changed the eventfd_pollcb_register() prototype to return the full
registeration time event mask, instead of the POLLIN explicit check.
Let's give this one some more thought before I push it to Andrew.



- Davide


---
fs/eventfd.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/eventfd.h | 24 ++++++++++++
2 files changed, 117 insertions(+), 1 deletion(-)

Index: linux-2.6.mod/fs/eventfd.c
===================================================================
--- linux-2.6.mod.orig/fs/eventfd.c 2009-06-20 16:25:45.000000000 -0700
+++ linux-2.6.mod/fs/eventfd.c 2009-06-21 09:36:46.000000000 -0700
@@ -17,8 +17,10 @@
#include <linux/eventfd.h>
#include <linux/syscalls.h>
#include <linux/module.h>
+#include <linux/kref.h>

struct eventfd_ctx {
+ struct kref kref;
wait_queue_head_t wqh;
/*
* Every time that a write(2) is performed on an eventfd, the
@@ -59,9 +61,30 @@ int eventfd_signal(struct file *file, in
}
EXPORT_SYMBOL_GPL(eventfd_signal);

+static void eventfd_free(struct kref *kref)
+{
+ struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+ kfree(ctx);
+}
+
+static struct eventfd_ctx *eventfd_get(struct eventfd_ctx *ctx)
+{
+ kref_get(&ctx->kref);
+ return ctx;
+}
+
+static void eventfd_put(struct eventfd_ctx *ctx)
+{
+ kref_put(&ctx->kref, eventfd_free);
+}
+
static int eventfd_release(struct inode *inode, struct file *file)
{
- kfree(file->private_data);
+ struct eventfd_ctx *ctx = file->private_data;
+
+ wake_up_poll(&ctx->wqh, POLLHUP);
+ eventfd_put(ctx);
return 0;
}

@@ -185,6 +208,14 @@ static const struct file_operations even
.write = eventfd_write,
};

+/**
+ * eventfd_fget - Acquire a reference of an eventfd file descriptor.
+ *
+ * @fd: [in] Eventfd file descriptor.
+ *
+ * Returns: A pointer to the eventfd file structure in case of success, or a
+ * proper error pointer in case of failure.
+ */
struct file *eventfd_fget(int fd)
{
struct file *file;
@@ -217,6 +248,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int,
if (!ctx)
return -ENOMEM;

+ kref_init(&ctx->kref);
init_waitqueue_head(&ctx->wqh);
ctx->count = count;
ctx->flags = flags;
@@ -237,3 +269,63 @@ SYSCALL_DEFINE1(eventfd, unsigned int, c
return sys_eventfd2(count, 0);
}

+static void eventfd_pollcb_ptqueue(struct file *file, wait_queue_head_t *wqh,
+ poll_table *pt)
+{
+ struct eventfd_pollcb *ecb;
+
+ ecb = container_of(pt, struct eventfd_pollcb, pt);
+
+ add_wait_queue(wqh, &ecb->wait);
+}
+
+/**
+ * eventfd_pollcb_register - Registers a wakeup callback with the eventfd
+ * file. The wakeup callback will be called from
+ * atomic context, and should handle the POLLHUP
+ * event accordingly in order to notice the last
+ * instance of the eventfd descriptor going away.
+ *
+ * @file: [in] Pointer to the eventfd file.
+ * @ecb: [out] Pointer to the eventfd callback structure.
+ * @cbf: [in] Pointer to the wakeup callback function.
+ * @events: [out] Pointer to the events that were ready at the time of the
+ * callback registration.
+ *
+ * Returns: Zero in case of success, or a proper error code.
+ */
+int eventfd_pollcb_register(struct file *file, struct eventfd_pollcb *ecb,
+ wait_queue_func_t cbf, unsigned int *events)
+{
+ struct eventfd_ctx *ctx;
+
+ ctx = file->private_data;
+ if (file->f_op != &eventfd_fops)
+ return -EINVAL;
+
+ init_waitqueue_func_entry(&ecb->wait, cbf);
+ init_poll_funcptr(&ecb->pt, eventfd_pollcb_ptqueue);
+ ecb->ctx = eventfd_get(ctx);
+
+ *events = file->f_op->poll(file, &ecb->pt);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(eventfd_pollcb_register);
+
+/**
+ * eventfd_pollcb_unregister - Unregisters a wakeup callback previously registered
+ * with eventfd_pollcb_register().
+ *
+ * @ecb: [in] Pointer to the eventfd callback structure previously registered with
+ * eventfd_pollcb_register().
+ *
+ * Returns: Nothing.
+ */
+void eventfd_pollcb_unregister(struct eventfd_pollcb *ecb)
+{
+ remove_wait_queue(&ecb->ctx->wqh, &ecb->wait);
+ eventfd_put(ecb->ctx);
+}
+EXPORT_SYMBOL_GPL(eventfd_pollcb_unregister);
+
Index: linux-2.6.mod/include/linux/eventfd.h
===================================================================
--- linux-2.6.mod.orig/include/linux/eventfd.h 2009-06-20 16:25:45.000000000 -0700
+++ linux-2.6.mod/include/linux/eventfd.h 2009-06-21 09:37:12.000000000 -0700
@@ -8,6 +8,20 @@
#ifndef _LINUX_EVENTFD_H
#define _LINUX_EVENTFD_H

+#include <linux/wait.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/list.h>
+#include <linux/kref.h>
+
+struct eventfd_ctx;
+
+struct eventfd_pollcb {
+ poll_table pt;
+ struct eventfd_ctx *ctx;
+ wait_queue_t wait;
+};
+
#ifdef CONFIG_EVENTFD

/* For O_CLOEXEC and O_NONBLOCK */
@@ -29,12 +43,22 @@

struct file *eventfd_fget(int fd);
int eventfd_signal(struct file *file, int n);
+int eventfd_pollcb_register(struct file *file, struct eventfd_pollcb *ecb,
+ wait_queue_func_t cbf, unsigned int *events);
+void eventfd_pollcb_unregister(struct eventfd_pollcb *ecb);

#else /* CONFIG_EVENTFD */

#define eventfd_fget(fd) ERR_PTR(-ENOSYS)
static inline int eventfd_signal(struct file *file, int n)
{ return 0; }
+static inline int eventfd_pollcb_register(struct file *file,
+ struct eventfd_pollcb *ecb,
+ wait_queue_func_t cbf,
+ unsigned int *events)
+{ return -ENOSYS; }
+static inline void eventfd_pollcb_unregister(struct eventfd_pollcb *ecb)
+{ }

#endif /* CONFIG_EVENTFD */

2009-06-21 18:39:49

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Sat, 20 Jun 2009, Gregory Haskins wrote:
>
>
>> Davide Libenzi wrote:
>>
>>> On Sat, 20 Jun 2009, Davide Libenzi wrote:
>>>
>>>
>>>
>>>> On Sat, 20 Jun 2009, Davide Libenzi wrote:
>>>>
>>>>
>>>>
>>>>> How about the one below?
>>>>>
>>>>>
>>>> Maybe with an interface that can be undone w/out a file* :)
>>>>
>>>>
>>> This is another alternative, based on a low-carb diet of your notifier
>>> patch.
>>>
>>>
>> Ah, I should always check if I have more mail before responding to a now
>> stale patch ;)
>>
>
> Here. I changed the eventfd_pollcb_register() prototype to return the full
> registeration time event mask, instead of the POLLIN explicit check.
> Let's give this one some more thought before I push it to Andrew.
>

This looks great, Davide. I am fairly certain I can now solve the races
and even implement Michael's DEASSIGN feature with this patch in place.
I will actually fire it up tomorrow when I am back in the office and
give it a spin, but I do not spy any more races via visual inspection.

Kind Regards,
-Greg

PS: I was wrong with my previous statement about requiring an embeddable
object (eventfd_notifier for me, eventfd_pollcb for you). I think you
can technically solve this issue minimally by merely locking the POLLHUP
and exposing the kref. However, I think that leads to an more awkward
interface (e.g. we already have eventfd_fget() plus we add a new one
like eventfd_refget(), which might confuse users), so I prefer what you
did here. Just thought I would throw that out there in case you would
prefer to change even fewer lines.



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-22 00:00:48

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Sun, 21 Jun 2009, Gregory Haskins wrote:

> This looks great, Davide. I am fairly certain I can now solve the races
> and even implement Michael's DEASSIGN feature with this patch in place.
> I will actually fire it up tomorrow when I am back in the office and
> give it a spin, but I do not spy any more races via visual inspection.
>
> Kind Regards,
> -Greg
>
> PS: I was wrong with my previous statement about requiring an embeddable
> object (eventfd_notifier for me, eventfd_pollcb for you). I think you
> can technically solve this issue minimally by merely locking the POLLHUP
> and exposing the kref. However, I think that leads to an more awkward
> interface (e.g. we already have eventfd_fget() plus we add a new one
> like eventfd_refget(), which might confuse users), so I prefer what you
> did here. Just thought I would throw that out there in case you would
> prefer to change even fewer lines.

I actually ended up exposing the eventfd context anyway, since IMO is a
better option instead of holding references to the eventfd file (that
makes VFS people uneasy).
So eventfd_signal() now accepts the eventfd context instead of the file
pointer.
Inside KAIO we had it holding a reference to the underlying file*, same
thing LGUEST (Rusty Cc-ed), that has been changed to fit the new API.
And since eventfd_ctx_put() doesn't sleep, KAIO code got simpler a little.
The other change have been to make KAIO select EVENTFD, instead of making
the ugly empty stubs inside eventfd.h to accomodate (KAIO && !EVENTFD).
The eventfd_pollcb_register() remained to be accepting a file*, first
because it needs it ;) , second because it can unracely detect POLLHUP due
to the caller holding a reference to the file*.




- Davide


---
drivers/lguest/lg.h | 4 -
drivers/lguest/lguest_user.c | 4 -
fs/aio.c | 24 +-----
fs/eventfd.c | 170 ++++++++++++++++++++++++++++++++++++++++---
include/linux/aio.h | 5 -
include/linux/eventfd.h | 28 ++++---
init/Kconfig | 1
7 files changed, 192 insertions(+), 44 deletions(-)

Index: linux-2.6.mod/fs/eventfd.c
===================================================================
--- linux-2.6.mod.orig/fs/eventfd.c 2009-06-21 09:52:10.000000000 -0700
+++ linux-2.6.mod/fs/eventfd.c 2009-06-21 16:51:27.000000000 -0700
@@ -17,32 +17,38 @@
#include <linux/eventfd.h>
#include <linux/syscalls.h>
#include <linux/module.h>
+#include <linux/kref.h>

struct eventfd_ctx {
+ struct kref kref;
wait_queue_head_t wqh;
/*
* Every time that a write(2) is performed on an eventfd, the
* value of the __u64 being written is added to "count" and a
* wakeup is performed on "wqh". A read(2) will return the "count"
* value to userspace, and will reset "count" to zero. The kernel
- * size eventfd_signal() also, adds to the "count" counter and
+ * side eventfd_signal() also, adds to the "count" counter and
* issue a wakeup.
*/
__u64 count;
unsigned int flags;
};

-/*
- * Adds "n" to the eventfd counter "count". Returns "n" in case of
- * success, or a value lower then "n" in case of coutner overflow.
- * This function is supposed to be called by the kernel in paths
- * that do not allow sleeping. In this function we allow the counter
- * to reach the ULLONG_MAX value, and we signal this as overflow
- * condition by returining a POLLERR to poll(2).
+/**
+ * eventfd_signal - Adds @n to the eventfd counter. This function is
+ * supposed to be called by the kernel in paths that do not
+ * allow sleeping. In this function we allow the counter
+ * to reach the ULLONG_MAX value, and we signal this as
+ * overflow condition by returining a POLLERR to poll(2).
+ *
+ * @ctx: [in] Pointer to the eventfd context.
+ * @n: [in] Value of the counter to be added to the eventfd internal counter.
+ *
+ * Returns: In case of success, it returns @n, otherwise (in case of overflow
+ * of the eventfd 64bit internal counter) a value lower than @n.
*/
-int eventfd_signal(struct file *file, int n)
+int eventfd_signal(struct eventfd_ctx *ctx, int n)
{
- struct eventfd_ctx *ctx = file->private_data;
unsigned long flags;

if (n < 0)
@@ -59,9 +65,30 @@ int eventfd_signal(struct file *file, in
}
EXPORT_SYMBOL_GPL(eventfd_signal);

+static void eventfd_free(struct kref *kref)
+{
+ struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+ kfree(ctx);
+}
+
+static struct eventfd_ctx *eventfd_get(struct eventfd_ctx *ctx)
+{
+ kref_get(&ctx->kref);
+ return ctx;
+}
+
+static void eventfd_put(struct eventfd_ctx *ctx)
+{
+ kref_put(&ctx->kref, eventfd_free);
+}
+
static int eventfd_release(struct inode *inode, struct file *file)
{
- kfree(file->private_data);
+ struct eventfd_ctx *ctx = file->private_data;
+
+ wake_up_poll(&ctx->wqh, POLLHUP);
+ eventfd_put(ctx);
return 0;
}

@@ -185,6 +212,14 @@ static const struct file_operations even
.write = eventfd_write,
};

+/**
+ * eventfd_fget - Acquire a reference of an eventfd file descriptor.
+ *
+ * @fd: [in] Eventfd file descriptor.
+ *
+ * Returns: A pointer to the eventfd file structure in case of success, or a
+ * proper error pointer in case of failure.
+ */
struct file *eventfd_fget(int fd)
{
struct file *file;
@@ -201,6 +236,59 @@ struct file *eventfd_fget(int fd)
}
EXPORT_SYMBOL_GPL(eventfd_fget);

+/**
+ * eventfd_ctx_get - Acquires a reference to the internal eventfd context.
+ *
+ * @file: [in] Pointer to the eventfd file.
+ *
+ * Returns: In case of success, returns a pointer to the eventfd context,
+ * otherwise a proper error code.
+ */
+struct eventfd_ctx *eventfd_ctx_get(struct file *file)
+{
+ return file->f_op == &eventfd_fops ? eventfd_get(file->private_data) :
+ ERR_PTR(-EINVAL);
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_get);
+
+/**
+ * eventfd_ctx_put - Releases a reference to the internal eventfd context
+ * (previously acquired with eventfd_ctx_get()).
+ *
+ * @ctx: [in] Pointer to eventfd context.
+ *
+ * Returns: Nothing.
+ */
+void eventfd_ctx_put(struct eventfd_ctx *ctx)
+{
+ eventfd_put(ctx);
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_put);
+
+/**
+ * eventfd_ctx_fdget - Acquires a reference to the internal eventfd context
+ * given an eventfd file descriptor.
+ *
+ * @fd: [in] Eventfd file descriptor.
+ *
+ * Returns: In case of success, it returns a pointer to the internal eventfd
+ * context, otherwise a proper error code.
+ */
+struct eventfd_ctx *eventfd_ctx_fdget(int fd)
+{
+ struct file *file;
+ struct eventfd_ctx *ctx;
+
+ file = eventfd_fget(fd);
+ if (IS_ERR(file))
+ return (struct eventfd_ctx *) file;
+ ctx = eventfd_ctx_get(file);
+ fput(file);
+
+ return ctx;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_fdget);
+
SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
{
int fd;
@@ -217,6 +305,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int,
if (!ctx)
return -ENOMEM;

+ kref_init(&ctx->kref);
init_waitqueue_head(&ctx->wqh);
ctx->count = count;
ctx->flags = flags;
@@ -237,3 +326,62 @@ SYSCALL_DEFINE1(eventfd, unsigned int, c
return sys_eventfd2(count, 0);
}

+static void eventfd_pollcb_ptqueue(struct file *file, wait_queue_head_t *wqh,
+ poll_table *pt)
+{
+ struct eventfd_pollcb *ecb;
+
+ ecb = container_of(pt, struct eventfd_pollcb, pt);
+ add_wait_queue(wqh, &ecb->wait);
+}
+
+/**
+ * eventfd_pollcb_register - Registers a wakeup callback with the eventfd
+ * file. The wakeup callback will be called from
+ * atomic context, and should handle the POLLHUP
+ * event accordingly in order to notice the last
+ * instance of the eventfd descriptor going away.
+ *
+ * @file: [in] Pointer to the eventfd file.
+ * @ecb: [out] Pointer to the eventfd callback structure.
+ * @cbf: [in] Pointer to the wakeup callback function.
+ * @events: [out] Pointer to the events that were ready at the time of the
+ * callback registration.
+ *
+ * Returns: Zero in case of success, or a proper error code.
+ */
+int eventfd_pollcb_register(struct file *file, struct eventfd_pollcb *ecb,
+ wait_queue_func_t cbf, unsigned int *events)
+{
+ struct eventfd_ctx *ctx;
+
+ ctx = file->private_data;
+ if (file->f_op != &eventfd_fops)
+ return -EINVAL;
+
+ init_waitqueue_func_entry(&ecb->wait, cbf);
+ init_poll_funcptr(&ecb->pt, eventfd_pollcb_ptqueue);
+ ecb->ctx = eventfd_get(ctx);
+
+ *events = file->f_op->poll(file, &ecb->pt);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(eventfd_pollcb_register);
+
+/**
+ * eventfd_pollcb_unregister - Unregisters a wakeup callback previously registered
+ * with eventfd_pollcb_register().
+ *
+ * @ecb: [in] Pointer to the eventfd callback structure previously registered with
+ * eventfd_pollcb_register().
+ *
+ * Returns: Nothing.
+ */
+void eventfd_pollcb_unregister(struct eventfd_pollcb *ecb)
+{
+ remove_wait_queue(&ecb->ctx->wqh, &ecb->wait);
+ eventfd_put(ecb->ctx);
+}
+EXPORT_SYMBOL_GPL(eventfd_pollcb_unregister);
+
Index: linux-2.6.mod/include/linux/eventfd.h
===================================================================
--- linux-2.6.mod.orig/include/linux/eventfd.h 2009-06-21 09:52:10.000000000 -0700
+++ linux-2.6.mod/include/linux/eventfd.h 2009-06-21 16:49:19.000000000 -0700
@@ -8,10 +8,10 @@
#ifndef _LINUX_EVENTFD_H
#define _LINUX_EVENTFD_H

-#ifdef CONFIG_EVENTFD
-
-/* For O_CLOEXEC and O_NONBLOCK */
#include <linux/fcntl.h>
+#include <linux/wait.h>
+#include <linux/poll.h>
+#include <linux/file.h>

/*
* CAREFUL: Check include/asm-generic/fcntl.h when defining
@@ -27,16 +27,22 @@
#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)

-struct file *eventfd_fget(int fd);
-int eventfd_signal(struct file *file, int n);
-
-#else /* CONFIG_EVENTFD */
+struct eventfd_ctx;

-#define eventfd_fget(fd) ERR_PTR(-ENOSYS)
-static inline int eventfd_signal(struct file *file, int n)
-{ return 0; }
+struct eventfd_pollcb {
+ poll_table pt;
+ struct eventfd_ctx *ctx;
+ wait_queue_t wait;
+};

-#endif /* CONFIG_EVENTFD */
+struct file *eventfd_fget(int fd);
+struct eventfd_ctx *eventfd_ctx_fdget(int fd);
+struct eventfd_ctx *eventfd_ctx_get(struct file *file);
+void eventfd_ctx_put(struct eventfd_ctx *ctx);
+int eventfd_signal(struct eventfd_ctx *ctx, int n);
+int eventfd_pollcb_register(struct file *file, struct eventfd_pollcb *ecb,
+ wait_queue_func_t cbf, unsigned int *events);
+void eventfd_pollcb_unregister(struct eventfd_pollcb *ecb);

#endif /* _LINUX_EVENTFD_H */

Index: linux-2.6.mod/fs/aio.c
===================================================================
--- linux-2.6.mod.orig/fs/aio.c 2009-06-21 14:36:52.000000000 -0700
+++ linux-2.6.mod/fs/aio.c 2009-06-21 15:25:44.000000000 -0700
@@ -485,6 +485,8 @@ static inline void really_put_req(struct
{
assert_spin_locked(&ctx->ctx_lock);

+ if (req->ki_eventfd != NULL)
+ eventfd_ctx_put(req->ki_eventfd);
if (req->ki_dtor)
req->ki_dtor(req);
if (req->ki_iovec != &req->ki_inline_vec)
@@ -509,8 +511,6 @@ static void aio_fput_routine(struct work
/* Complete the fput(s) */
if (req->ki_filp != NULL)
__fput(req->ki_filp);
- if (req->ki_eventfd != NULL)
- __fput(req->ki_eventfd);

/* Link the iocb into the context's free list */
spin_lock_irq(&ctx->ctx_lock);
@@ -528,8 +528,6 @@ static void aio_fput_routine(struct work
*/
static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
{
- int schedule_putreq = 0;
-
dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
req, atomic_long_read(&req->ki_filp->f_count));

@@ -549,24 +547,16 @@ static int __aio_put_req(struct kioctx *
* we would not be holding the last reference to the file*, so
* this function will be executed w/out any aio kthread wakeup.
*/
- if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
- schedule_putreq++;
- else
- req->ki_filp = NULL;
- if (req->ki_eventfd != NULL) {
- if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
- schedule_putreq++;
- else
- req->ki_eventfd = NULL;
- }
- if (unlikely(schedule_putreq)) {
+ if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
get_ioctx(ctx);
spin_lock(&fput_lock);
list_add(&req->ki_list, &fput_head);
spin_unlock(&fput_lock);
queue_work(aio_wq, &fput_work);
- } else
+ } else {
+ req->ki_filp = NULL;
really_put_req(ctx, req);
+ }
return 1;
}

@@ -1622,7 +1612,7 @@ static int io_submit_one(struct kioctx *
* an eventfd() fd, and will be signaled for each completed
* event using the eventfd_signal() function.
*/
- req->ki_eventfd = eventfd_fget((int) iocb->aio_resfd);
+ req->ki_eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
if (IS_ERR(req->ki_eventfd)) {
ret = PTR_ERR(req->ki_eventfd);
req->ki_eventfd = NULL;
Index: linux-2.6.mod/include/linux/aio.h
===================================================================
--- linux-2.6.mod.orig/include/linux/aio.h 2009-06-21 14:36:52.000000000 -0700
+++ linux-2.6.mod/include/linux/aio.h 2009-06-21 15:32:42.000000000 -0700
@@ -12,6 +12,7 @@
#define AIO_MAXSEGS 4
#define AIO_KIOGRP_NR_ATOMIC 8

+struct eventfd_ctx;
struct kioctx;

/* Notes on cancelling a kiocb:
@@ -121,9 +122,9 @@ struct kiocb {

/*
* If the aio_resfd field of the userspace iocb is not zero,
- * this is the underlying file* to deliver event to.
+ * this is the underlying eventfd context to deliver events to.
*/
- struct file *ki_eventfd;
+ struct eventfd_ctx *ki_eventfd;
};

#define is_sync_kiocb(iocb) ((iocb)->ki_key == KIOCB_SYNC_KEY)
Index: linux-2.6.mod/init/Kconfig
===================================================================
--- linux-2.6.mod.orig/init/Kconfig 2009-06-21 15:42:52.000000000 -0700
+++ linux-2.6.mod/init/Kconfig 2009-06-21 15:43:25.000000000 -0700
@@ -925,6 +925,7 @@ config SHMEM

config AIO
bool "Enable AIO support" if EMBEDDED
+ select EVENTFD
default y
help
This option enables POSIX asynchronous I/O which may by used
Index: linux-2.6.mod/drivers/lguest/lg.h
===================================================================
--- linux-2.6.mod.orig/drivers/lguest/lg.h 2009-06-21 15:55:17.000000000 -0700
+++ linux-2.6.mod/drivers/lguest/lg.h 2009-06-21 15:56:00.000000000 -0700
@@ -80,9 +80,11 @@ struct lg_cpu {
struct lg_cpu_arch arch;
};

+struct eventfd_ctx;
+
struct lg_eventfd {
unsigned long addr;
- struct file *event;
+ struct eventfd_ctx *event;
};

struct lg_eventfd_map {
Index: linux-2.6.mod/drivers/lguest/lguest_user.c
===================================================================
--- linux-2.6.mod.orig/drivers/lguest/lguest_user.c 2009-06-21 15:55:17.000000000 -0700
+++ linux-2.6.mod/drivers/lguest/lguest_user.c 2009-06-21 15:57:11.000000000 -0700
@@ -50,7 +50,7 @@ static int add_eventfd(struct lguest *lg

/* Now append new entry. */
new->map[new->num].addr = addr;
- new->map[new->num].event = eventfd_fget(fd);
+ new->map[new->num].event = eventfd_ctx_fdget(fd);
if (IS_ERR(new->map[new->num].event)) {
kfree(new);
return PTR_ERR(new->map[new->num].event);
@@ -357,7 +357,7 @@ static int close(struct inode *inode, st

/* Release any eventfds they registered. */
for (i = 0; i < lg->eventfds->num; i++)
- fput(lg->eventfds->map[i].event);
+ eventfd_ctx_put(lg->eventfds->map[i].event);
kfree(lg->eventfds);

/* If lg->dead doesn't contain an error code it will be NULL or a

2009-06-22 16:07:11

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Sun, 21 Jun 2009, Gregory Haskins wrote:
>
>
>> This looks great, Davide. I am fairly certain I can now solve the races
>> and even implement Michael's DEASSIGN feature with this patch in place.
>> I will actually fire it up tomorrow when I am back in the office and
>> give it a spin, but I do not spy any more races via visual inspection.
>>
>> Kind Regards,
>> -Greg
>>
>> PS: I was wrong with my previous statement about requiring an embeddable
>> object (eventfd_notifier for me, eventfd_pollcb for you). I think you
>> can technically solve this issue minimally by merely locking the POLLHUP
>> and exposing the kref. However, I think that leads to an more awkward
>> interface (e.g. we already have eventfd_fget() plus we add a new one
>> like eventfd_refget(), which might confuse users), so I prefer what you
>> did here. Just thought I would throw that out there in case you would
>> prefer to change even fewer lines.
>>
>
> I actually ended up exposing the eventfd context anyway, since IMO is a
> better option instead of holding references to the eventfd file (that
> makes VFS people uneasy).
>

I liked "version - 1" better ;)

I think ultimately we still want to hold the fget() for
eventfd_signal(), as it is the producer side. Without it, we have no
way of knowing when the last producer goes away if they happen to be an
in-kernel user.

Also, thinking about it some more, I think I like the non-wrapped pollcb
variant better now. This is because I can uhook the wqh in the POLLHUP
and defer the kref_put() until later. I think this combination gives me
the most ideal environment (even though technically I should not expect
more activity on the wait-queue after the POLLHUP).

In either case, I whipped up a super-lite patch with just the bare
minimum and then developed the KVM solution on top. Things look really
good, now, and I am happy with the result. As an aside, it only has the
following impact on the eventfd core:

fs/eventfd.c | 43 ++++++++++++++++++++++++++++++++++++-------
include/linux/eventfd.h | 7 +++++++
2 files changed, 43 insertions(+), 7 deletions(-)

Ill post the series, including the KVM portions, for review.

Thanks Davide,
-Greg




Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-22 17:07:41

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Mon, 22 Jun 2009, Gregory Haskins wrote:

> Davide Libenzi wrote:
> > On Sun, 21 Jun 2009, Gregory Haskins wrote:
> >
> >
> >> This looks great, Davide. I am fairly certain I can now solve the races
> >> and even implement Michael's DEASSIGN feature with this patch in place.
> >> I will actually fire it up tomorrow when I am back in the office and
> >> give it a spin, but I do not spy any more races via visual inspection.
> >>
> >> Kind Regards,
> >> -Greg
> >>
> >> PS: I was wrong with my previous statement about requiring an embeddable
> >> object (eventfd_notifier for me, eventfd_pollcb for you). I think you
> >> can technically solve this issue minimally by merely locking the POLLHUP
> >> and exposing the kref. However, I think that leads to an more awkward
> >> interface (e.g. we already have eventfd_fget() plus we add a new one
> >> like eventfd_refget(), which might confuse users), so I prefer what you
> >> did here. Just thought I would throw that out there in case you would
> >> prefer to change even fewer lines.
> >>
> >
> > I actually ended up exposing the eventfd context anyway, since IMO is a
> > better option instead of holding references to the eventfd file (that
> > makes VFS people uneasy).
> >
>
> I liked "version - 1" better ;)
>
> I think ultimately we still want to hold the fget() for
> eventfd_signal(), as it is the producer side. Without it, we have no
> way of knowing when the last producer goes away if they happen to be an
> in-kernel user.

No you don't. Holding a file* reference does not give you any assurance
whatsoever that the last consumer did not go away. The f_count value will
simply be 1 (just you).
If a producer side want to take proper action when the last consumer
leaves the building, it needs to register a pollcb hook and handle
POLLHUP.
Exposing the opaque eventfd context, and basing the eventfd operations on
that one (instead of the live referenced file*) is a cleaner interface.
Can you please base your IRQfd bits on top of that, so that we can give
this IRQfd thread a closure?


- Davide

2009-06-22 17:43:55

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>
>
>> Davide Libenzi wrote:
>>
>>> On Sun, 21 Jun 2009, Gregory Haskins wrote:
>>>
>>>
>>>
>>>> This looks great, Davide. I am fairly certain I can now solve the races
>>>> and even implement Michael's DEASSIGN feature with this patch in place.
>>>> I will actually fire it up tomorrow when I am back in the office and
>>>> give it a spin, but I do not spy any more races via visual inspection.
>>>>
>>>> Kind Regards,
>>>> -Greg
>>>>
>>>> PS: I was wrong with my previous statement about requiring an embeddable
>>>> object (eventfd_notifier for me, eventfd_pollcb for you). I think you
>>>> can technically solve this issue minimally by merely locking the POLLHUP
>>>> and exposing the kref. However, I think that leads to an more awkward
>>>> interface (e.g. we already have eventfd_fget() plus we add a new one
>>>> like eventfd_refget(), which might confuse users), so I prefer what you
>>>> did here. Just thought I would throw that out there in case you would
>>>> prefer to change even fewer lines.
>>>>
>>>>
>>> I actually ended up exposing the eventfd context anyway, since IMO is a
>>> better option instead of holding references to the eventfd file (that
>>> makes VFS people uneasy).
>>>
>>>
>> I liked "version - 1" better ;)
>>
>> I think ultimately we still want to hold the fget() for
>> eventfd_signal(), as it is the producer side. Without it, we have no
>> way of knowing when the last producer goes away if they happen to be an
>> in-kernel user.
>>
>
> No you don't. Holding a file* reference does not give you any assurance
> whatsoever that the last consumer did not go away. The f_count value will
> simply be 1 (just you).
>

I am probably confused or perhaps have the wrong terminology, but isnt
that "ok". I am concerned about the consumer (the guy getting the
POLLINs) to be able to detect POLLHUP when the last producer
(f_ops->write() from userspace, eventfd_signal() from kernel) goes away.

Consider the following sequence:

-------------------

userspace calls "fd = eventfd()", and gives one to KVM as an irqfd, and
the other to some PCI-passthrough device.

The kvm/irqfd side acquires a kref, the pci side acquires a file. At
this moment, userspace has the fd, and the pci device has the file (for
eventfd_signal()). The fget() count is 2. Userspace closes the fd
because its done with it, and the count drops to 1.

Some time later, pci does an fput(), and KVM sees the POLLHUP and cleans up.

-------------------

In this new model, the POLLHUP would have gone out as soon as userspace
closed the fd, even though the intended producer (the PCI device) and
the consumer (the KVM guest) are still up and running. This doesnt seem
right to me. Or am I missing something?

Thanks Davide,
-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-22 18:10:15

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Mon, 22 Jun 2009, Gregory Haskins wrote:

> I am probably confused or perhaps have the wrong terminology, but isnt
> that "ok". I am concerned about the consumer (the guy getting the
> POLLINs) to be able to detect POLLHUP when the last producer
> (f_ops->write() from userspace, eventfd_signal() from kernel) goes away.
>
> Consider the following sequence:
>
> -------------------
>
> userspace calls "fd = eventfd()", and gives one to KVM as an irqfd, and
> the other to some PCI-passthrough device.
>
> The kvm/irqfd side acquires a kref, the pci side acquires a file. At
> this moment, userspace has the fd, and the pci device has the file (for
> eventfd_signal()). The fget() count is 2. Userspace closes the fd
> because its done with it, and the count drops to 1.
>
> Some time later, pci does an fput(), and KVM sees the POLLHUP and cleans up.
>
> -------------------
>
> In this new model, the POLLHUP would have gone out as soon as userspace
> closed the fd, even though the intended producer (the PCI device) and
> the consumer (the KVM guest) are still up and running. This doesnt seem
> right to me. Or am I missing something?

What you're doing there, is setting up a kernel-to-kernel (since
userspace only role is to create the eventfd) communication, using a file*
as accessory. That IMO is plain wrong.
If userspace is either the producer, or the consumer, and you need to
handle userspace leaving the building, you need to:

file = eventfd_fget(fd);
ctx = eventfd_ctx_get(file); /* Eventually, if producer */
eventfd_pollcb_register(file, ...);
fput(file);

In your case of kernel-to-kernel scenario, why would you need eventfd at
all, if userspace role in that model is simply to create it?
There are more effective ways to have in kernel communication channels,
than resorting to userspace link facilities like eventfd.



- Davide

2009-06-22 18:32:18

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>
>
>> I am probably confused or perhaps have the wrong terminology, but isnt
>> that "ok". I am concerned about the consumer (the guy getting the
>> POLLINs) to be able to detect POLLHUP when the last producer
>> (f_ops->write() from userspace, eventfd_signal() from kernel) goes away.
>>
>> Consider the following sequence:
>>
>> -------------------
>>
>> userspace calls "fd = eventfd()", and gives one to KVM as an irqfd, and
>> the other to some PCI-passthrough device.
>>
>> The kvm/irqfd side acquires a kref, the pci side acquires a file. At
>> this moment, userspace has the fd, and the pci device has the file (for
>> eventfd_signal()). The fget() count is 2. Userspace closes the fd
>> because its done with it, and the count drops to 1.
>>
>> Some time later, pci does an fput(), and KVM sees the POLLHUP and cleans up.
>>
>> -------------------
>>
>> In this new model, the POLLHUP would have gone out as soon as userspace
>> closed the fd, even though the intended producer (the PCI device) and
>> the consumer (the KVM guest) are still up and running. This doesnt seem
>> right to me. Or am I missing something?
>>
>
> What you're doing there, is setting up a kernel-to-kernel (since
> userspace only role is to create the eventfd) communication, using a file*
> as accessory. That IMO is plain wrong.
> If userspace is either the producer, or the consumer, and you need to
> handle userspace leaving the building, you need to:
>
> file = eventfd_fget(fd);
> ctx = eventfd_ctx_get(file); /* Eventually, if producer */
> eventfd_pollcb_register(file, ...);
> fput(file);
>
> In your case of kernel-to-kernel scenario, why would you need eventfd at
> all, if userspace role in that model is simply to create it?
>

The general thesis is for decoupling of the two subsystems. In order to
do this, you need some form of polymorphism and an intermediate "handle"
mechanism which is userspace friendly. File-descriptors already fit
this role neatly, with the "int fd" being the handle, and the f_ops
being the polymorphic interface. Eventfd is of course, a subclass of
this concept in that it has these same general properties but with
signaling semantics (non-blocking collapsible events, etc).

Say, for example, you wanted disk IO completion events to generate an
interrupt into a guest. One way to do this would, of course, modify all
the disk-io code so it knows how to directly inject a KVM guest
interrupt. While this would work, someone would undoubtedly get flamed
for such a suggestion ;)

Another way to do it is to treat the AIO eventfd as the hook point.
IIUC AIO already knows how to be an eventfd producer. KVM, by virtue of
irqfd, already knows how to be an eventfd consumer. So now kvm can
consume AIO, or it can consume userspace events equally well, and
without modification. Neither side needs to know about the other per
se, other than the details on how to use the eventfd interface.

Don't get me wrong: We expect userspace to use all this stuff too. I
just expect that we will see all permutations of producer/consumer +
userspace/kernel combinations, so I want to retain that "all producers
have left" notification feature set. Today eventfd supports producers
or consumers in userspace, and producers in the kernel. This new work
we are doing adds consumer support in the kernel. Kernel to kernel is
just a natural extension of that.

-Greg

> There are more effective ways to have in kernel communication channels,
> than resorting to userspace link facilities like eventfd.
>
>
>
> - Davide
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-22 18:42:27

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Mon, Jun 22, 2009 at 11:03:22AM -0700, Davide Libenzi wrote:
> In your case of kernel-to-kernel scenario, why would you need eventfd at
> all, if userspace role in that model is simply to create it?

That's not 100% true. We have a mode where userspace is the producer
and/or consumer (migration mode) and we switch between that and
direct kernel-to-kernel communication.

2009-06-22 18:47:20

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Mon, 22 Jun 2009, Gregory Haskins wrote:

> The general thesis is for decoupling of the two subsystems. In order to
> do this, you need some form of polymorphism and an intermediate "handle"
> mechanism which is userspace friendly. File-descriptors already fit
> this role neatly, with the "int fd" being the handle, and the f_ops
> being the polymorphic interface. Eventfd is of course, a subclass of
> this concept in that it has these same general properties but with
> signaling semantics (non-blocking collapsible events, etc).
>
> Say, for example, you wanted disk IO completion events to generate an
> interrupt into a guest. One way to do this would, of course, modify all
> the disk-io code so it knows how to directly inject a KVM guest
> interrupt. While this would work, someone would undoubtedly get flamed
> for such a suggestion ;)
>
> Another way to do it is to treat the AIO eventfd as the hook point.
> IIUC AIO already knows how to be an eventfd producer. KVM, by virtue of
> irqfd, already knows how to be an eventfd consumer. So now kvm can
> consume AIO, or it can consume userspace events equally well, and
> without modification. Neither side needs to know about the other per
> se, other than the details on how to use the eventfd interface.
>
> Don't get me wrong: We expect userspace to use all this stuff too. I
> just expect that we will see all permutations of producer/consumer +
> userspace/kernel combinations, so I want to retain that "all producers
> have left" notification feature set. Today eventfd supports producers
> or consumers in userspace, and producers in the kernel. This new work
> we are doing adds consumer support in the kernel. Kernel to kernel is
> just a natural extension of that.

A file* is the VFS link between userspace and the kernel. Is not a magical
polymorphic interface to be used for whatever kernel side reasons.
Basing a kernel internal API over it is flawed.
On top of that, a single reference count does not put you on cover about
the possible combinations of producers and consumers. For that, you'd need
a pipe-like reference handling logic, that is way far from the eventfd scope.
So please stop making hypothetical cases about interface usages, and
*when* we will have a real case, we'll see what the better handling for it
will be.



- Davide

2009-06-22 18:58:00

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Mon, 22 Jun 2009, Michael S. Tsirkin wrote:

> On Mon, Jun 22, 2009 at 11:03:22AM -0700, Davide Libenzi wrote:
> > In your case of kernel-to-kernel scenario, why would you need eventfd at
> > all, if userspace role in that model is simply to create it?
>
> That's not 100% true. We have a mode where userspace is the producer
> and/or consumer (migration mode) and we switch between that and
> direct kernel-to-kernel communication.

Then you'd need to ask yourself how to handle your complex case inside the
KVM code, so that other eventfd users are not affected by the extra fat
needed to handle your scenarios. Thing that seem to be continuosly tried.
A file* based kernel-to-kernel interface is rather wrong IMO.



- Davide

2009-06-22 19:06:27

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote:
> A file* based kernel-to-kernel interface is rather wrong IMO.

But eventfd_ctx should work fine.

2009-06-22 19:22:46

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Michael S. Tsirkin wrote:
>
>
>> On Mon, Jun 22, 2009 at 11:03:22AM -0700, Davide Libenzi wrote:
>>
>>> In your case of kernel-to-kernel scenario, why would you need eventfd at
>>> all, if userspace role in that model is simply to create it?
>>>
>> That's not 100% true. We have a mode where userspace is the producer
>> and/or consumer (migration mode) and we switch between that and
>> direct kernel-to-kernel communication.
>>
>
> Then you'd need to ask yourself how to handle your complex case inside the
> KVM code, so that other eventfd users are not affected by the extra fat
> needed to handle your scenarios. Thing that seem to be continuosly tried.
> A file* based kernel-to-kernel interface is rather wrong IMO.
>

Well, I will point out that the interface in question is
eventfd_signal(struct file *), and you were the one that invented it
afaict. Can't help it if we like it :)

BTW: The termination point of that call is incidental. Afterall, it
works the same for kernel-kernel or kernel-userspace. The only relevant
discussion point here is that your proposal breaks POLLHUP for
eventfd_signal() users, and we have a real use case that cares.

Let me ask the question a different way: Lets say we all agree that
eventfd_signal() cannot be file* based to be correct. Is there a way
you can think of that satisfies both the need to get rid of file* AND
still deliver producer-release notification to consumers. Ultimately
that is all I care about.

-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-22 19:27:22

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Michael S. Tsirkin wrote:
> On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote:
>
>> A file* based kernel-to-kernel interface is rather wrong IMO.
>>
>
> But eventfd_ctx should work fine.
>

Yeah, and I guess we can always just say that qemu can't close the fd or
something. Seems hacky, but it might work if Davide insists we need his
change.

-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-22 19:36:24

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Mon, 22 Jun 2009, Gregory Haskins wrote:

> Michael S. Tsirkin wrote:
> > On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote:
> >
> >> A file* based kernel-to-kernel interface is rather wrong IMO.
> >>
> >
> > But eventfd_ctx should work fine.
> >
>
> Yeah, and I guess we can always just say that qemu can't close the fd or
> something. Seems hacky, but it might work if Davide insists we need his
> change.

Continuing here, since there's no reason of having many subthreads talking
about the same thing.
Can you make a detailed example of what you're trying to achieve (no Hint
Mode, please)?
As it sounds to me, that you need a consumer/producer reference counting,
to cover your scenario correctly.



- Davide

2009-06-22 20:00:45

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Mon, 22 Jun 2009, Gregory Haskins wrote:

> Davide Libenzi wrote:
> > On Mon, 22 Jun 2009, Michael S. Tsirkin wrote:
> >
> >
> >> On Mon, Jun 22, 2009 at 11:03:22AM -0700, Davide Libenzi wrote:
> >>
> >>> In your case of kernel-to-kernel scenario, why would you need eventfd at
> >>> all, if userspace role in that model is simply to create it?
> >>>
> >> That's not 100% true. We have a mode where userspace is the producer
> >> and/or consumer (migration mode) and we switch between that and
> >> direct kernel-to-kernel communication.
> >>
> >
> > Then you'd need to ask yourself how to handle your complex case inside the
> > KVM code, so that other eventfd users are not affected by the extra fat
> > needed to handle your scenarios. Thing that seem to be continuosly tried.
> > A file* based kernel-to-kernel interface is rather wrong IMO.
> >
>
> Well, I will point out that the interface in question is
> eventfd_signal(struct file *), and you were the one that invented it
> afaict. Can't help it if we like it :)

Yes, I did. The case for eventfd was dual. First, it was a communication
link between userspace and kernel, for things like KAIO. Second, it was a
faster and smaller userspace replacement for things people used pipes
before.
In all those cases, at least one reference of the file* is alive in
userspace.
Even with the above case in mind, today I'd still use the eventfd_ctx as
internal kernel API accessory.



- Davide

2009-06-22 20:06:41

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>
>
>> Michael S. Tsirkin wrote:
>>
>>> On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote:
>>>
>>>
>>>> A file* based kernel-to-kernel interface is rather wrong IMO.
>>>>
>>>>
>>> But eventfd_ctx should work fine.
>>>
>>>
>> Yeah, and I guess we can always just say that qemu can't close the fd or
>> something. Seems hacky, but it might work if Davide insists we need his
>> change.
>>
>
> Continuing here, since there's no reason of having many subthreads talking
> about the same thing.
> Can you make a detailed example of what you're trying to achieve (no Hint
> Mode, please)?
> As it sounds to me, that you need a consumer/producer reference counting,
> to cover your scenario correctly.
>

Well, one of them was already briefly mentioned (the PCI-passthrough
thing). I am not personally working on this part (yet, anyway).

Another example of something I am actually working on as we speak would
be for this thing we are building called "virtual-bus". It is a way to
build/deploy device models directly in the kernel.

In either of these cases, we have this concept of allowing the guest to
notify the host, or vice versa, that something happened. Typically this
would be in reference to some chunk of shared memory, and the signaling
is telling the other side "I changed something, go look".

Without going into a ton of detail (unless, of course, you want it) is
that we are generalizing the signaling infrastructure (irqfd and
iosignalfd) so that something like PCI-passthrough or vbus are not
directly coupled to KVM. They communicate to KVM purely in terms of
(among other things) these irqfd/iosignalfd interfaces.

Using vbus as an example (though others are similar): vbus would
primarily exists as a kernel-model. However, there would be a small
device model in qemu-kem userspace to publish something like a PCI
device that declares its resource requirements to the guest. Some of
those requirements would be things like how many interrupts it needs,
and what IO ranges it supports, etc. When the guest programs the PCI
space, it maps the resources from its own world into the virtual PCI
resources emulated in qemu.

So up in userspace, the vbus pci-device would have an open reference to
the kvm guest (derived from /dev/kvm) and an open reference to a vbus
(derived from /dev/vbus). Lets call these kvmfd, and vbusfd,
respectively. For something like an interrupt, we would hook the point
where the PCI-MSI interrupt is assigned, and would do the following:

gsi = kvm_irq_route_gsi();
fd = eventfd(0, 0);
ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi});
ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd});

So userspace orchestrated the assignment of this one eventfd to a KVM
consumer, and a VBUS producer. The two subsystems do not care about the
details of the other side of the link, per se. VBUS just knows that it
can eventfd_signal() its memory region to tell whomever is listening
that it changed. Likewise, KVM just knows to inject "gsi" when it gets
signalled. You could equally have given "fd" to a userspace thread for
either producer or consumer roles, or any other combination.

If we were doing PCI-passthough, substitute the last SHMSIGNAL_ASSIGN
ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea.

The important thing is that once this is established, userspace doesn't
necessarily care about the fd anymore. So now the question is: do we
keep it around for other things? Do we keep it around because we don't
want KVM to see the POLLHUP, or do we address the "release" code so that
it works even if userspace issued close(fd) at this point. I am not
sure what the answer is, but this is the scenario we are concerned with
in this thread. In the example above, vbus is free to produce events on
its eventfd until it gets a SHMSIGNAL_DEASSIGN request.

-Greg

>
>
> - Davide
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-22 20:29:47

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Mon, Jun 22, 2009 at 03:26:41PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote:
> >
> >> A file* based kernel-to-kernel interface is rather wrong IMO.
> >>
> >
> > But eventfd_ctx should work fine.
> >
>
> Yeah, and I guess we can always just say that qemu can't close the fd or
> something. Seems hacky, but it might work if Davide insists we need his
> change.

Yes, I think that's fine.

--
MST

2009-06-22 23:00:23

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Mon, 22 Jun 2009, Gregory Haskins wrote:

> So up in userspace, the vbus pci-device would have an open reference to
> the kvm guest (derived from /dev/kvm) and an open reference to a vbus
> (derived from /dev/vbus). Lets call these kvmfd, and vbusfd,
> respectively. For something like an interrupt, we would hook the point
> where the PCI-MSI interrupt is assigned, and would do the following:
>
> gsi = kvm_irq_route_gsi();
> fd = eventfd(0, 0);
> ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi});
> ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd});
>
> So userspace orchestrated the assignment of this one eventfd to a KVM
> consumer, and a VBUS producer. The two subsystems do not care about the
> details of the other side of the link, per se. VBUS just knows that it
> can eventfd_signal() its memory region to tell whomever is listening
> that it changed. Likewise, KVM just knows to inject "gsi" when it gets
> signalled. You could equally have given "fd" to a userspace thread for
> either producer or consumer roles, or any other combination.
>
> If we were doing PCI-passthough, substitute the last SHMSIGNAL_ASSIGN
> ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea.
>
> The important thing is that once this is established, userspace doesn't
> necessarily care about the fd anymore. So now the question is: do we
> keep it around for other things? Do we keep it around because we don't
> want KVM to see the POLLHUP, or do we address the "release" code so that
> it works even if userspace issued close(fd) at this point. I am not
> sure what the answer is, but this is the scenario we are concerned with
> in this thread. In the example above, vbus is free to produce events on
> its eventfd until it gets a SHMSIGNAL_DEASSIGN request.

I see.
The thing remains, that in order to reliably handle generic
producer/consumer scenarios you'd need a reference counting similar to
pipes, where the notion of producer and consumer is very well defined.
In your case, it'd be sufficent to expose an:

int eventfd_wakeup(struct eventfd_ctx *ctx, void *key);

So that each end can signal, in an file* f_count independent way, when
they're about to leave. Say with a new status bit.
The problem, that I felt coming since when I introduced the key-based
wakeups, is that right now the "key" start to become a little restrictive
in terms of amount of information carried by it.
If there are other key-aware waiters on "wqh", your new bit cannot
conflict with existing ones, and you (and future users of the interface)
will be polluting the global bit namespace.
Probably turning "key" into a pointer to a structure like:

struct wakeup_key {
unsigned long type;
void *key;
};

So the current:

wake_up_poll(wq, POLLIN);

Would become:

key.type = KT_POLLMASK;
key.key = POLLIN;
wake_up_key(wq, &key);

At that point the waiters (like poll/epoll) can simply discard the wakeup
types they are not interested into (poll/epoll would care only about
KT_POLLMASK), and something more than a simple bitmask can be carried by
the wakups. Like:

key.type = KT_SOMEDATA;
key.key = &data;
wake_up_key(wq, &key);

Of course, we could build another layer on top, to fit this model, but
then we'd have the dual citizenship among normal wakeups and new-interface
wakeups/signaling.
Another thing. Now that the interface exposes the eventfd_ctx context
directly, the pollcb register/unregister does not have any reason to be
inside eventfd (they're totally generic bits at that point), so my thought
was about to drop them.



- Davide

2009-06-23 01:04:03

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>
>
>> So up in userspace, the vbus pci-device would have an open reference to
>> the kvm guest (derived from /dev/kvm) and an open reference to a vbus
>> (derived from /dev/vbus). Lets call these kvmfd, and vbusfd,
>> respectively. For something like an interrupt, we would hook the point
>> where the PCI-MSI interrupt is assigned, and would do the following:
>>
>> gsi = kvm_irq_route_gsi();
>> fd = eventfd(0, 0);
>> ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi});
>> ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd});
>>
>> So userspace orchestrated the assignment of this one eventfd to a KVM
>> consumer, and a VBUS producer. The two subsystems do not care about the
>> details of the other side of the link, per se. VBUS just knows that it
>> can eventfd_signal() its memory region to tell whomever is listening
>> that it changed. Likewise, KVM just knows to inject "gsi" when it gets
>> signalled. You could equally have given "fd" to a userspace thread for
>> either producer or consumer roles, or any other combination.
>>
>> If we were doing PCI-passthough, substitute the last SHMSIGNAL_ASSIGN
>> ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea.
>>
>> The important thing is that once this is established, userspace doesn't
>> necessarily care about the fd anymore. So now the question is: do we
>> keep it around for other things? Do we keep it around because we don't
>> want KVM to see the POLLHUP, or do we address the "release" code so that
>> it works even if userspace issued close(fd) at this point. I am not
>> sure what the answer is, but this is the scenario we are concerned with
>> in this thread. In the example above, vbus is free to produce events on
>> its eventfd until it gets a SHMSIGNAL_DEASSIGN request.
>>
>
> I see.
> The thing remains, that in order to reliably handle generic
> producer/consumer scenarios you'd need a reference counting similar to
> pipes, where the notion of producer and consumer is very well defined.
>

I see your point.

Well, I think the more important thing here is that we address the
races, and add support for DEASSIGN. We can do both of those things
with any of the patches that you and I have been kicking around. So
what I propose is that we move forward with whatever patch you bless as
proper for now. This producer-release issue is pretty minor in the
grand scheme of things. We can always just have userspace hold the fd.

I can either take in the last one you sent, or it sounds like you wanted
to possibly do another round of cleanup? Whatever the case may be, let
me know and we can coordinate with Andrew/Avi on what tree to pull it
into. It sounds like riding in kvm.git is the perhaps the most logical.

Thanks Davide,
-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-23 01:24:19

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Mon, 22 Jun 2009, Gregory Haskins wrote:

> Davide Libenzi wrote:
> > On Mon, 22 Jun 2009, Gregory Haskins wrote:
> >
> >
> >> So up in userspace, the vbus pci-device would have an open reference to
> >> the kvm guest (derived from /dev/kvm) and an open reference to a vbus
> >> (derived from /dev/vbus). Lets call these kvmfd, and vbusfd,
> >> respectively. For something like an interrupt, we would hook the point
> >> where the PCI-MSI interrupt is assigned, and would do the following:
> >>
> >> gsi = kvm_irq_route_gsi();
> >> fd = eventfd(0, 0);
> >> ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi});
> >> ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd});
> >>
> >> So userspace orchestrated the assignment of this one eventfd to a KVM
> >> consumer, and a VBUS producer. The two subsystems do not care about the
> >> details of the other side of the link, per se. VBUS just knows that it
> >> can eventfd_signal() its memory region to tell whomever is listening
> >> that it changed. Likewise, KVM just knows to inject "gsi" when it gets
> >> signalled. You could equally have given "fd" to a userspace thread for
> >> either producer or consumer roles, or any other combination.
> >>
> >> If we were doing PCI-passthough, substitute the last SHMSIGNAL_ASSIGN
> >> ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea.
> >>
> >> The important thing is that once this is established, userspace doesn't
> >> necessarily care about the fd anymore. So now the question is: do we
> >> keep it around for other things? Do we keep it around because we don't
> >> want KVM to see the POLLHUP, or do we address the "release" code so that
> >> it works even if userspace issued close(fd) at this point. I am not
> >> sure what the answer is, but this is the scenario we are concerned with
> >> in this thread. In the example above, vbus is free to produce events on
> >> its eventfd until it gets a SHMSIGNAL_DEASSIGN request.
> >>
> >
> > I see.
> > The thing remains, that in order to reliably handle generic
> > producer/consumer scenarios you'd need a reference counting similar to
> > pipes, where the notion of producer and consumer is very well defined.
> >
>
> I see your point.
>
> Well, I think the more important thing here is that we address the
> races, and add support for DEASSIGN. We can do both of those things
> with any of the patches that you and I have been kicking around. So
> what I propose is that we move forward with whatever patch you bless as
> proper for now. This producer-release issue is pretty minor in the
> grand scheme of things. We can always just have userspace hold the fd.

Is it a real problem? Can it be decently handled on the KVM side?
Reason I'm asking, is that I wouldn't want to change the interface, only
to find it unsuitable a few weeks later.




> I can either take in the last one you sent, or it sounds like you wanted
> to possibly do another round of cleanup? Whatever the case may be, let
> me know and we can coordinate with Andrew/Avi on what tree to pull it
> into. It sounds like riding in kvm.git is the perhaps the most logical.

Yes, I'd like to drop the pollcb bits (that you can implement KVM-side),
at least.
And yes, going kvm.git is probably the best path.



- Davide

2009-06-23 01:26:56

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>
>
>> Davide Libenzi wrote:
>>
>>> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>>>
>>>
>>>
>>>> So up in userspace, the vbus pci-device would have an open reference to
>>>> the kvm guest (derived from /dev/kvm) and an open reference to a vbus
>>>> (derived from /dev/vbus). Lets call these kvmfd, and vbusfd,
>>>> respectively. For something like an interrupt, we would hook the point
>>>> where the PCI-MSI interrupt is assigned, and would do the following:
>>>>
>>>> gsi = kvm_irq_route_gsi();
>>>> fd = eventfd(0, 0);
>>>> ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi});
>>>> ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd});
>>>>
>>>> So userspace orchestrated the assignment of this one eventfd to a KVM
>>>> consumer, and a VBUS producer. The two subsystems do not care about the
>>>> details of the other side of the link, per se. VBUS just knows that it
>>>> can eventfd_signal() its memory region to tell whomever is listening
>>>> that it changed. Likewise, KVM just knows to inject "gsi" when it gets
>>>> signalled. You could equally have given "fd" to a userspace thread for
>>>> either producer or consumer roles, or any other combination.
>>>>
>>>> If we were doing PCI-passthough, substitute the last SHMSIGNAL_ASSIGN
>>>> ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea.
>>>>
>>>> The important thing is that once this is established, userspace doesn't
>>>> necessarily care about the fd anymore. So now the question is: do we
>>>> keep it around for other things? Do we keep it around because we don't
>>>> want KVM to see the POLLHUP, or do we address the "release" code so that
>>>> it works even if userspace issued close(fd) at this point. I am not
>>>> sure what the answer is, but this is the scenario we are concerned with
>>>> in this thread. In the example above, vbus is free to produce events on
>>>> its eventfd until it gets a SHMSIGNAL_DEASSIGN request.
>>>>
>>>>
>>> I see.
>>> The thing remains, that in order to reliably handle generic
>>> producer/consumer scenarios you'd need a reference counting similar to
>>> pipes, where the notion of producer and consumer is very well defined.
>>>
>>>
>> I see your point.
>>
>> Well, I think the more important thing here is that we address the
>> races, and add support for DEASSIGN. We can do both of those things
>> with any of the patches that you and I have been kicking around. So
>> what I propose is that we move forward with whatever patch you bless as
>> proper for now. This producer-release issue is pretty minor in the
>> grand scheme of things. We can always just have userspace hold the fd.
>>
>
> Is it a real problem? Can it be decently handled on the KVM side?
> Reason I'm asking, is that I wouldn't want to change the interface, only
> to find it unsuitable a few weeks later.
>

To be honest, I am not sure. I would guess its not a *huge* deal,
though it was obviously enough of a concern to at least discuss it. I
can definitely say that I think the other issues which are being fixed
are substantially more important.

>
>
>
>
>> I can either take in the last one you sent, or it sounds like you wanted
>> to possibly do another round of cleanup? Whatever the case may be, let
>> me know and we can coordinate with Andrew/Avi on what tree to pull it
>> into. It sounds like riding in kvm.git is the perhaps the most logical.
>>
>
> Yes, I'd like to drop the pollcb bits (that you can implement KVM-side),
> at least.
> And yes, going kvm.git is probably the best path.
>

Sounds good. Thanks for your patience, Davide. I think we are almost
there! ;)

-Greg

>
>
> - Davide
>
>
>



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-23 03:25:34

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Mon, 22 Jun 2009 09:24:20 am Davide Libenzi wrote:
> I actually ended up exposing the eventfd context anyway, since IMO is a
> better option instead of holding references to the eventfd file (that
> makes VFS people uneasy).

lguest is a special case since we don't let them close the fds, except by
closing /dev/lguest.

Lguest change seems fine, but:

> Index: linux-2.6.mod/drivers/lguest/lg.h
> ===================================================================
> --- linux-2.6.mod.orig/drivers/lguest/lg.h 2009-06-21 15:55:17.000000000
> -0700 +++ linux-2.6.mod/drivers/lguest/lg.h 2009-06-21 15:56:00.000000000
> -0700 @@ -80,9 +80,11 @@ struct lg_cpu {
> struct lg_cpu_arch arch;
> };
>
> +struct eventfd_ctx;
> +
> struct lg_eventfd {
> unsigned long addr;
> - struct file *event;
> + struct eventfd_ctx *event;
> };

The first 'struct eventfd_ctx;' line is not required.

Thanks,
Rusty.

2009-06-23 14:35:50

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Mon, 22 Jun 2009, Gregory Haskins wrote:

> To be honest, I am not sure. I would guess its not a *huge* deal,
> though it was obviously enough of a concern to at least discuss it. I
> can definitely say that I think the other issues which are being fixed
> are substantially more important.

Ok then, will repost the revised patch later today.


- Davide

2009-06-23 14:37:30

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>
>
>> To be honest, I am not sure. I would guess its not a *huge* deal,
>> though it was obviously enough of a concern to at least discuss it. I
>> can definitely say that I think the other issues which are being fixed
>> are substantially more important.
>>
>
> Ok then, will repost the revised patch later today.
>

Ok sounds good. I did have a chance to take a closer look at your
proposal for the key data, and I think it makes a lot of sense. Do you
see it as a problem if we defer adding that? Or should we try to add
that notion now?

-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-23 14:38:13

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Tue, 23 Jun 2009, Rusty Russell wrote:

> The first 'struct eventfd_ctx;' line is not required.

Will repost dropping that.


- Davide

2009-06-23 14:41:25

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Tue, 23 Jun 2009, Gregory Haskins wrote:

> Davide Libenzi wrote:
> > On Mon, 22 Jun 2009, Gregory Haskins wrote:
> >
> >
> >> To be honest, I am not sure. I would guess its not a *huge* deal,
> >> though it was obviously enough of a concern to at least discuss it. I
> >> can definitely say that I think the other issues which are being fixed
> >> are substantially more important.
> >>
> >
> > Ok then, will repost the revised patch later today.
> >
>
> Ok sounds good. I did have a chance to take a closer look at your
> proposal for the key data, and I think it makes a lot of sense. Do you
> see it as a problem if we defer adding that? Or should we try to add
> that notion now?

That would need to go eventually via mainline, after some discussion. But
yes, I believe that using the "key" as simple bitmask is a little
restrictive.


- Davide

2009-06-23 14:42:47

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Tue, 23 Jun 2009, Gregory Haskins wrote:
>
>
>> Davide Libenzi wrote:
>>
>>> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>>>
>>>
>>>
>>>> To be honest, I am not sure. I would guess its not a *huge* deal,
>>>> though it was obviously enough of a concern to at least discuss it. I
>>>> can definitely say that I think the other issues which are being fixed
>>>> are substantially more important.
>>>>
>>>>
>>> Ok then, will repost the revised patch later today.
>>>
>>>
>> Ok sounds good. I did have a chance to take a closer look at your
>> proposal for the key data, and I think it makes a lot of sense. Do you
>> see it as a problem if we defer adding that? Or should we try to add
>> that notion now?
>>
>
> That would need to go eventually via mainline, after some discussion. But
> yes, I believe that using the "key" as simple bitmask is a little
> restrictive.
>

Ok. As long as you do not think we are somehow painting ourselves into
a corner, lets just go with your original plan for the revised patch
sans key changes.

Thanks Davide,
-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-23 15:05:53

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Tue, Jun 23, 2009 at 07:35:00AM -0700, Davide Libenzi wrote:
> On Tue, 23 Jun 2009, Gregory Haskins wrote:
>
> > Davide Libenzi wrote:
> > > On Mon, 22 Jun 2009, Gregory Haskins wrote:
> > >
> > >
> > >> To be honest, I am not sure. I would guess its not a *huge* deal,
> > >> though it was obviously enough of a concern to at least discuss it. I
> > >> can definitely say that I think the other issues which are being fixed
> > >> are substantially more important.
> > >>
> > >
> > > Ok then, will repost the revised patch later today.
> > >
> >
> > Ok sounds good. I did have a chance to take a closer look at your
> > proposal for the key data, and I think it makes a lot of sense. Do you
> > see it as a problem if we defer adding that? Or should we try to add
> > that notion now?
>
> That would need to go eventually via mainline, after some discussion. But
> yes, I believe that using the "key" as simple bitmask is a little
> restrictive.
>
>
> - Davide

I agree.

2009-06-24 03:26:04

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Tue, 23 Jun 2009 03:33:22 am Davide Libenzi wrote:
> What you're doing there, is setting up a kernel-to-kernel (since
> userspace only role is to create the eventfd) communication, using a file*
> as accessory. That IMO is plain wrong.

The most sensible is that userspace can use these fds; an in-kernel variant is
possible too, but not primary IMHO.

It's nice that userspace create the fds; it can then use the same fd for
multiple event sources.

But I didn't see anything wrong with the way eventfd used to work: you have a
kvm ioctl to say "attach this eventfd to this guest notification" and that does
the eventfd_fget. A detach ioctl does the fput (as does release of the kvm
fd).

If they close the eventfd and don't do the detach ioctl, it's their problem.

Rusty.

2009-06-24 22:50:45

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Wed, 24 Jun 2009, Rusty Russell wrote:

> On Tue, 23 Jun 2009 03:33:22 am Davide Libenzi wrote:
> > What you're doing there, is setting up a kernel-to-kernel (since
> > userspace only role is to create the eventfd) communication, using a file*
> > as accessory. That IMO is plain wrong.
>
> The most sensible is that userspace can use these fds; an in-kernel variant is
> possible too, but not primary IMHO.
>
> It's nice that userspace create the fds; it can then use the same fd for
> multiple event sources.
>
> But I didn't see anything wrong with the way eventfd used to work: you have a
> kvm ioctl to say "attach this eventfd to this guest notification" and that does
> the eventfd_fget. A detach ioctl does the fput (as does release of the kvm
> fd).
>
> If they close the eventfd and don't do the detach ioctl, it's their problem.

Some components would like to know if userspace dropped the fd, and take
proper action accordingly (release resources, drop module instances, etc...).
The POLLHUP helps with that, but w/out decoupling file* memory from
eventfd_ctx memory, it becomes pretty tricky (if feasible at all) to
handle the event in a race-free way.
Once the file* is decoupled from the eventfd_ctx, it becomes saner to
expose the internal kernel API via the eventfd_ctx.
Another thing that comes in my mind (that for some components might not
matter) is considering the effect of userspace doing things like:

for (;;) {
fd = eventfd(...);
ioctl(xfd, XXX_ADD, fd);
close(fd);
}

That might lead to unprivileged users drawing kernel memory w/out any
userspace accountability, if not properly handled.



- Davide

2009-06-25 00:24:50

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Tue, 23 Jun 2009, Davide Libenzi wrote:

> On Tue, 23 Jun 2009, Rusty Russell wrote:
>
> > The first 'struct eventfd_ctx;' line is not required.
>
> Will repost dropping that.

Almost forgot. While fixing lg.h to drop the fwd declaration, I noticed
there's another one ;)


- Davide


--- a/drivers/lguest/lg.h 2009-06-15 15:35:24.000000000 -0700
+++ b/drivers/lguest/lg.h 2009-06-24 17:08:26.000000000 -0700
@@ -38,8 +38,6 @@
#define CHANGED_GDT_TLS 4 /* Actually a subset of CHANGED_GDT */
#define CHANGED_ALL 3

-struct lguest;
-
struct lg_cpu {
unsigned int id;
struct lguest *lg;

2009-06-25 11:42:36

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Thu, 25 Jun 2009 08:15:11 am Davide Libenzi wrote:
> On Wed, 24 Jun 2009, Rusty Russell wrote:
> > On Tue, 23 Jun 2009 03:33:22 am Davide Libenzi wrote:
> > > What you're doing there, is setting up a kernel-to-kernel (since
> > > userspace only role is to create the eventfd) communication, using a
> > > file* as accessory. That IMO is plain wrong.
> >
> > The most sensible is that userspace can use these fds; an in-kernel
> > variant is possible too, but not primary IMHO.
> >
> > It's nice that userspace create the fds; it can then use the same fd for
> > multiple event sources.
> >
> > But I didn't see anything wrong with the way eventfd used to work: you
> > have a kvm ioctl to say "attach this eventfd to this guest notification"
> > and that does the eventfd_fget. A detach ioctl does the fput (as does
> > release of the kvm fd).
> >
> > If they close the eventfd and don't do the detach ioctl, it's their
> > problem.
>
> Some components would like to know if userspace dropped the fd, and take
> proper action accordingly (release resources, drop module instances,
> etc...).

Like to know? Possibly. Need to know? Not anything I've seen so far.

If userspace creates the fd, component grab a ref and if userspace wants that
fd completely freed must close the fd *and* tell component. Simple, race free
and explicit. All wins.

As this discussion shows, doing some kind of implies non-reference is hard,
complex and racy.

> Another thing that comes in my mind (that for some components might not
> matter) is considering the effect of userspace doing things like:
>
> for (;;) {
> fd = eventfd(...);
> ioctl(xfd, XXX_ADD, fd);
> close(fd);
> }
>
> That might lead to unprivileged users drawing kernel memory w/out any
> userspace accountability, if not properly handled.

No, fget_eventfd covers this exactly as expected. Don't doubt your ability to
design sane kernel interfaces; eventfd is nice! All lguest needed was a
couple of EXPORT_SYMBOLS and it fitted in beautifully.

Thanks,
Rusty.

2009-06-25 16:39:27

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Thu, 25 Jun 2009, Rusty Russell wrote:

> On Thu, 25 Jun 2009 08:15:11 am Davide Libenzi wrote:
> >
> > Some components would like to know if userspace dropped the fd, and take
> > proper action accordingly (release resources, drop module instances,
> > etc...).
>
> Like to know? Possibly. Need to know? Not anything I've seen so far.
>
> If userspace creates the fd, component grab a ref and if userspace wants that
> fd completely freed must close the fd *and* tell component. Simple, race free
> and explicit. All wins.
>
> As this discussion shows, doing some kind of implies non-reference is hard,
> complex and racy.

Easier, we agree. Not doing anything is always easier, provided the
userspace interface allows for it.
Cleaner, I'm not sure. Again, it depends from the userspace interface, but
usually when you close(2) something, you expect the kernel to react
accordingly, and not on relying on userspace issuing extra calls in order
to proper cleanup the kernel context.
This is even more true when the eventfd is the sole handle to the visisble
userspace interface.
In such cases, not taking proper action on close(2) and requiring extra
calls, would lead to designing interface with the close-no-really-i-mean-it
patterns.



- Davide

2009-06-25 17:32:42

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Davide Libenzi wrote:
> On Thu, 25 Jun 2009, Rusty Russell wrote:
>
>
>> On Thu, 25 Jun 2009 08:15:11 am Davide Libenzi wrote:
>>
>>> Some components would like to know if userspace dropped the fd, and take
>>> proper action accordingly (release resources, drop module instances,
>>> etc...).
>>>
>> Like to know? Possibly. Need to know? Not anything I've seen so far.
>>
>> If userspace creates the fd, component grab a ref and if userspace wants that
>> fd completely freed must close the fd *and* tell component. Simple, race free
>> and explicit. All wins.
>>
>> As this discussion shows, doing some kind of implies non-reference is hard,
>> complex and racy.
>>
>
> Easier, we agree. Not doing anything is always easier, provided the
> userspace interface allows for it.
> Cleaner, I'm not sure. Again, it depends from the userspace interface, but
> usually when you close(2) something, you expect the kernel to react
> accordingly, and not on relying on userspace issuing extra calls in order
> to proper cleanup the kernel context.
> This is even more true when the eventfd is the sole handle to the visisble
> userspace interface.
> In such cases, not taking proper action on close(2) and requiring extra
> calls, would lead to designing interface with the close-no-really-i-mean-it
> patterns.
>

Avi,
Davide's argument is compelling in favor of considering the latest
irqfd fixes I pushed. You could run into weirdness if we tried to
revert to the old requirement on explicit DEASSIGN ioctl once we start
doing things like handing the fd to other apps/components. Its cleaner
IMO to retain the implicit-deassign-on-close behavior.

Since v5 also restores optional explicit DEASSIGN support, the only
downside to accepting my patches (vs reverting POLLHUP completely) is
the complexity required to deal with POLLHUP. But this has been
substantially reduced and clarified since yesterdays review, and I am
reasonably confident the protocol is sound.

Please take a close look at it and consider for merging, if you would.

Thanks,
-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-25 18:27:27

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Thu, Jun 25, 2009 at 01:32:22PM -0400, Gregory Haskins wrote:
> Please take a close look at it and consider for merging, if you would.

With all the incremental patching, I kind of lost track of what the
complete file would look like. Is there a git tree I could pull?

--
MST

2009-06-25 18:41:24

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

Michael S. Tsirkin wrote:
> On Thu, Jun 25, 2009 at 01:32:22PM -0400, Gregory Haskins wrote:
>
>> Please take a close look at it and consider for merging, if you would.
>>
>
> With all the incremental patching, I kind of lost track of what the
> complete file would look like. Is there a git tree I could pull?
>
>

Ask and ye shall receive :)

git pull
ssh://master.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git
kvm/irqfd

If you dont have a kernel.org account, it should rsync to the
gitweb/git: server soon, in which case
s|ssh://master|git://git should get you the proper path.

Note that there is one additional race that I found that is not included
in that tree in the slow-work infrastructure. You can find my latest
submission to David Howells to resolve that issue here:

http://patchwork.kernel.org/patch/32287/

Thanks Michael!
-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-26 11:24:21

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

On Thu, Jun 25, 2009 at 02:41:09PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jun 25, 2009 at 01:32:22PM -0400, Gregory Haskins wrote:
> >
> >> Please take a close look at it and consider for merging, if you would.
> >>
> >
> > With all the incremental patching, I kind of lost track of what the
> > complete file would look like. Is there a git tree I could pull?
> >
> >
>
> Ask and ye shall receive :)
>
> git pull
> ssh://master.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git
> kvm/irqfd
>
> If you dont have a kernel.org account, it should rsync to the
> gitweb/git: server soon, in which case
> s|ssh://master|git://git should get you the proper path.
>
> Note that there is one additional race that I found that is not included
> in that tree in the slow-work infrastructure. You can find my latest
> submission to David Howells to resolve that issue here:
>
> http://patchwork.kernel.org/patch/32287/
>
> Thanks Michael!
> -Greg
>

Yes, I saw that. I agree your patch solves that issue.

--
MST