2010-01-21 16:29:57

by Michael S. Tsirkin

[permalink] [raw]
Subject: [PATCHv2 1/3] eventfd: allow atomic read and waitqueue remove

This is a backport of commit: 03db343a6320f780937078433fa7d8da955e6fce
modified in a way that introduces some code duplication on the one hand,
but reduces the risk of regressing existing eventfd users on the other
hand.

KVM needs a wait to atomically remove themselves from the eventfd
->poll() wait queue head, in order to handle correctly their IRQfd
deassign operation.

This patch introduces such API, plus a way to read an eventfd from its
context.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---

Avi, Davidel, how about only including the following part for -stable
then? Reason is, I still would like to be able to use irqfd there, and
getting spurious interrupts 100% of times unmask is done isn't a very
good idea IMO ...


fs/eventfd.c | 35 +++++++++++++++++++++++++++++++++++
include/linux/eventfd.h | 9 +++++++++
2 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8b47e42..ea9c18a 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -135,6 +135,41 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait)
return events;
}

+static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
+{
+ *cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
+ ctx->count -= *cnt;
+}
+
+/**
+ * eventfd_ctx_remove_wait_queue - Read the current counter and removes wait queue.
+ * @ctx: [in] Pointer to eventfd context.
+ * @wait: [in] Wait queue to be removed.
+ * @cnt: [out] Pointer to the 64bit conter value.
+ *
+ * Returns zero if successful, or the following error codes:
+ *
+ * -EAGAIN : The operation would have blocked.
+ *
+ * This is used to atomically remove a wait queue entry from the eventfd wait
+ * queue head, and read/reset the counter value.
+ */
+int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
+ __u64 *cnt)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&ctx->wqh.lock, flags);
+ eventfd_ctx_do_read(ctx, cnt);
+ __remove_wait_queue(&ctx->wqh, wait);
+ if (*cnt != 0 && waitqueue_active(&ctx->wqh))
+ wake_up_locked_poll(&ctx->wqh, POLLOUT);
+ spin_unlock_irqrestore(&ctx->wqh.lock, flags);
+
+ return *cnt != 0 ? 0 : -EAGAIN;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_remove_wait_queue);
+
static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
loff_t *ppos)
{
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 94dd103..85eac48 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -10,6 +10,7 @@

#include <linux/fcntl.h>
#include <linux/file.h>
+#include <linux/wait.h>

/*
* CAREFUL: Check include/asm-generic/fcntl.h when defining
@@ -34,6 +35,8 @@ struct file *eventfd_fget(int fd);
struct eventfd_ctx *eventfd_ctx_fdget(int fd);
struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
int eventfd_signal(struct eventfd_ctx *ctx, int n);
+int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
+ __u64 *cnt);

#else /* CONFIG_EVENTFD */

@@ -61,6 +64,12 @@ static inline void eventfd_ctx_put(struct eventfd_ctx *ctx)

}

+static inline int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx,
+ wait_queue_t *wait, __u64 *cnt)
+{
+ return -ENOSYS;
+}
+
#endif

#endif /* _LINUX_EVENTFD_H */
--
1.6.6.144.g5c3af


2010-01-21 16:58:53

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCHv2 1/3] eventfd: allow atomic read and waitqueue remove

On Thu, 21 Jan 2010, Michael S. Tsirkin wrote:

> This is a backport of commit: 03db343a6320f780937078433fa7d8da955e6fce
> modified in a way that introduces some code duplication on the one hand,
> but reduces the risk of regressing existing eventfd users on the other
> hand.
>
> KVM needs a wait to atomically remove themselves from the eventfd
> ->poll() wait queue head, in order to handle correctly their IRQfd
> deassign operation.
>
> This patch introduces such API, plus a way to read an eventfd from its
> context.
>
> Signed-off-by: Michael S. Tsirkin <[email protected]>
> ---
>
> Avi, Davidel, how about only including the following part for -stable
> then? Reason is, I still would like to be able to use irqfd there, and
> getting spurious interrupts 100% of times unmask is done isn't a very
> good idea IMO ...

It's the same thing. Unless there are *real* problems in KVM due to the
spurious ints, I still think this is .33 material.


- Davide

2010-01-21 17:13:19

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCHv2 1/3] eventfd: allow atomic read and waitqueue remove

On 01/21/2010 06:58 PM, Davide Libenzi wrote:
> On Thu, 21 Jan 2010, Michael S. Tsirkin wrote:
>
>
>> This is a backport of commit: 03db343a6320f780937078433fa7d8da955e6fce
>> modified in a way that introduces some code duplication on the one hand,
>> but reduces the risk of regressing existing eventfd users on the other
>> hand.
>>
>> KVM needs a wait to atomically remove themselves from the eventfd
>> ->poll() wait queue head, in order to handle correctly their IRQfd
>> deassign operation.
>>
>> This patch introduces such API, plus a way to read an eventfd from its
>> context.
>>
>> Signed-off-by: Michael S. Tsirkin<[email protected]>
>> ---
>>
>> Avi, Davidel, how about only including the following part for -stable
>> then? Reason is, I still would like to be able to use irqfd there, and
>> getting spurious interrupts 100% of times unmask is done isn't a very
>> good idea IMO ...
>>
> It's the same thing. Unless there are *real* problems in KVM due to the
> spurious ints, I still think this is .33 material.
>

I agree.

But I think we can solve this in another way in .32: we can clear the
eventfd from irqfd->inject work, which is in process context. The new
stuff is only needed for lockless clearing, no?

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-01-21 17:14:13

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCHv2 1/3] eventfd: allow atomic read and waitqueue remove

On 01/21/2010 07:13 PM, Avi Kivity wrote:
>
> But I think we can solve this in another way in .32: we can clear the
> eventfd from irqfd->inject work, which is in process context. The new
> stuff is only needed for lockless clearing, no?
>

I meant atomic clearing, when we inject interrupts from the irqfd atomic
context.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-01-21 17:26:44

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCHv2 1/3] eventfd: allow atomic read and waitqueue remove

On Thu, Jan 21, 2010 at 07:13:13PM +0200, Avi Kivity wrote:
> On 01/21/2010 06:58 PM, Davide Libenzi wrote:
>> On Thu, 21 Jan 2010, Michael S. Tsirkin wrote:
>>
>>
>>> This is a backport of commit: 03db343a6320f780937078433fa7d8da955e6fce
>>> modified in a way that introduces some code duplication on the one hand,
>>> but reduces the risk of regressing existing eventfd users on the other
>>> hand.
>>>
>>> KVM needs a wait to atomically remove themselves from the eventfd
>>> ->poll() wait queue head, in order to handle correctly their IRQfd
>>> deassign operation.
>>>
>>> This patch introduces such API, plus a way to read an eventfd from its
>>> context.
>>>
>>> Signed-off-by: Michael S. Tsirkin<[email protected]>
>>> ---
>>>
>>> Avi, Davidel, how about only including the following part for -stable
>>> then? Reason is, I still would like to be able to use irqfd there, and
>>> getting spurious interrupts 100% of times unmask is done isn't a very
>>> good idea IMO ...
>>>
>> It's the same thing. Unless there are *real* problems in KVM due to the
>> spurious ints, I still think this is .33 material.
>>
>
> I agree.
>
> But I think we can solve this in another way in .32: we can clear the
> eventfd from irqfd->inject work, which is in process context. The new
> stuff is only needed for lockless clearing, no?

No, AFAIK there's no way to clear the counter from kernel without
this patch.

> --
> Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-01-21 17:33:09

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCHv2 1/3] eventfd: allow atomic read and waitqueue remove

On 01/21/2010 07:23 PM, Michael S. Tsirkin wrote:
>
>> I agree.
>>
>> But I think we can solve this in another way in .32: we can clear the
>> eventfd from irqfd->inject work, which is in process context. The new
>> stuff is only needed for lockless clearing, no?
>>
> No, AFAIK there's no way to clear the counter from kernel without
> this patch.
>

Can't you read from the file?

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-01-21 17:36:15

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCHv2 1/3] eventfd: allow atomic read and waitqueue remove

On Thu, Jan 21, 2010 at 07:33:02PM +0200, Avi Kivity wrote:
> On 01/21/2010 07:23 PM, Michael S. Tsirkin wrote:
>>
>>> I agree.
>>>
>>> But I think we can solve this in another way in .32: we can clear the
>>> eventfd from irqfd->inject work, which is in process context. The new
>>> stuff is only needed for lockless clearing, no?
>>>
>> No, AFAIK there's no way to clear the counter from kernel without
>> this patch.
>>
>
> Can't you read from the file?

IMO no, the read could block.

> --
> Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-01-21 17:47:48

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCHv2 1/3] eventfd: allow atomic read and waitqueue remove

On 01/21/2010 07:32 PM, Michael S. Tsirkin wrote:
>
>> Can't you read from the file?
>>
> IMO no, the read could block.
>

But you're in process context. An eventfd never blocks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-01-21 17:48:45

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCHv2 1/3] eventfd: allow atomic read and waitqueue remove

On Thu, Jan 21, 2010 at 07:47:40PM +0200, Avi Kivity wrote:
> On 01/21/2010 07:32 PM, Michael S. Tsirkin wrote:
>>
>>> Can't you read from the file?
>>>
>> IMO no, the read could block.
>>
>
> But you're in process context. An eventfd never blocks.

Yes it blocks if counter is 0. And we don't know
it's not 0 unless we read :) catch-22.

> --
> Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-01-21 17:50:42

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCHv2 1/3] eventfd: allow atomic read and waitqueue remove

On Thu, 21 Jan 2010, Avi Kivity wrote:

> On 01/21/2010 07:32 PM, Michael S. Tsirkin wrote:
> >
> > > Can't you read from the file?
> > >
> > IMO no, the read could block.
> >
>
> But you're in process context. An eventfd never blocks.

Can you control the eventfd flags? Because if yes, O_NONBLOCK will never
block.



- Davide

2010-01-21 17:53:22

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCHv2 1/3] eventfd: allow atomic read and waitqueue remove

On Thu, Jan 21, 2010 at 09:50:34AM -0800, Davide Libenzi wrote:
> On Thu, 21 Jan 2010, Avi Kivity wrote:
>
> > On 01/21/2010 07:32 PM, Michael S. Tsirkin wrote:
> > >
> > > > Can't you read from the file?
> > > >
> > > IMO no, the read could block.
> > >
> >
> > But you're in process context. An eventfd never blocks.
>
> Can you control the eventfd flags? Because if yes, O_NONBLOCK will never
> block.
>

Userspace can but kvm can't.

>
> - Davide
>

2010-01-21 17:56:16

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCHv2 1/3] eventfd: allow atomic read and waitqueue remove

On 01/21/2010 07:45 PM, Michael S. Tsirkin wrote:
>
>> But you're in process context. An eventfd never blocks.
>>
> Yes it blocks if counter is 0. And we don't know
> it's not 0 unless we read :) catch-22.
>

Ah yes, I forgot.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-01-21 17:57:29

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCHv2 1/3] eventfd: allow atomic read and waitqueue remove

On 01/21/2010 07:56 PM, Avi Kivity wrote:
> On 01/21/2010 07:45 PM, Michael S. Tsirkin wrote:
>>
>>> But you're in process context. An eventfd never blocks.
>> Yes it blocks if counter is 0. And we don't know
>> it's not 0 unless we read :) catch-22.
>
> Ah yes, I forgot.
>

Well, you can poll it and then read it... this introduces a new race (if
userspace does a read in parallel) but it's limited to kvm and buggy
userspace.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-01-21 18:05:57

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCHv2 1/3] eventfd: allow atomic read and waitqueue remove

On Thu, Jan 21, 2010 at 07:57:22PM +0200, Avi Kivity wrote:
> On 01/21/2010 07:56 PM, Avi Kivity wrote:
>> On 01/21/2010 07:45 PM, Michael S. Tsirkin wrote:
>>>
>>>> But you're in process context. An eventfd never blocks.
>>> Yes it blocks if counter is 0. And we don't know
>>> it's not 0 unless we read :) catch-22.
>>
>> Ah yes, I forgot.
>>
>
> Well, you can poll it and then read it... this introduces a new race (if
> userspace does a read in parallel) but it's limited to kvm and buggy
> userspace.

I would rather not require that userspace never reads this fd.
You are right that it does not now, but adding this as requirement
looks like exporting an implementation bug to userspace.

--
MST

2010-01-24 11:06:41

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCHv2 1/3] eventfd: allow atomic read and waitqueue remove

On 01/21/2010 08:02 PM, Michael S. Tsirkin wrote:
> On Thu, Jan 21, 2010 at 07:57:22PM +0200, Avi Kivity wrote:
>
>> On 01/21/2010 07:56 PM, Avi Kivity wrote:
>>
>>> On 01/21/2010 07:45 PM, Michael S. Tsirkin wrote:
>>>
>>>>
>>>>> But you're in process context. An eventfd never blocks.
>>>>>
>>>> Yes it blocks if counter is 0. And we don't know
>>>> it's not 0 unless we read :) catch-22.
>>>>
>>> Ah yes, I forgot.
>>>
>>>
>> Well, you can poll it and then read it... this introduces a new race (if
>> userspace does a read in parallel) but it's limited to kvm and buggy
>> userspace.
>>
> I would rather not require that userspace never reads this fd.
> You are right that it does not now, but adding this as requirement
> looks like exporting an implementation bug to userspace.
>

Well, I don't like risking 2.6.32 non-kvm users for a bug that doesn't
happen in practice now.

After it gets some time in 2.6.33, we can backport it to 2.6.32 (since
that will be maintained long term).

--
error compiling committee.c: too many arguments to function