2020-04-24 02:53:47

by Khazhismel Kumykov

Subject: [PATCH] eventpoll: fix missing wakeup for ovflist in ep_poll_callback

In the event that we add to ovflist, before 339ddb53d373 we would be
woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
With that wakeup removed, if we add to ovflist here, we may never wake
up. Rather than adding back the ep_scan_ready_list wakeup - which was
resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.

We noticed that one of our workloads was missing wakeups starting with
339ddb53d373 and upon manual inspection, this wakeup seemed missing to
me. With this patch added, we no longer see missing wakeups. I haven't
yet tried to make a small reproducer, but the existing kselftests in
filesystem/epoll passed for me with this patch.

Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
Signed-off-by: Khazhismel Kumykov <[email protected]>
---
fs/eventpoll.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 8c596641a72b..40cc89559cf6 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1240,7 +1240,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
         if (epi->next == EP_UNACTIVE_PTR &&
             chain_epi_lockless(epi))
             ep_pm_stay_awake_rcu(epi);
-        goto out_unlock;
+        goto out_wakeup_unlock;
     }
 
     /* If this file is already in the ready list we exit soon */
@@ -1249,6 +1249,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
         ep_pm_stay_awake_rcu(epi);
     }
 
+out_wakeup_unlock:
     /*
      * Wake up ( if active ) both the eventpoll wait list and the ->poll()
      * wait list.
--
2.26.2.303.gf8c07b1a785-goog


2020-04-24 03:33:55

by Andrew Morton

Subject: Re: [PATCH] eventpoll: fix missing wakeup for ovflist in ep_poll_callback

On Thu, 23 Apr 2020 19:50:57 -0700 Khazhismel Kumykov <[email protected]> wrote:

> In the event that we add to ovflist, before 339ddb53d373 we would be
> woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
> With that wakeup removed, if we add to ovflist here, we may never wake
> up. Rather than adding back the ep_scan_ready_list wakeup - which was
> resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
>
> We noticed that one of our workloads was missing wakeups starting with
> 339ddb53d373 and upon manual inspection, this wakeup seemed missing to
> me. With this patch added, we no longer see missing wakeups. I haven't
> yet tried to make a small reproducer, but the existing kselftests in
> filesystem/epoll passed for me with this patch.

I'm no longer familiar with this code, so I'll await input from others.

> Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
> Signed-off-by: Khazhismel Kumykov <[email protected]>

However it sounds like a cc:stable would be warranted here, so that
earlier affected kernels get the fix?

2020-04-24 10:15:37

by Roman Penyaev

Subject: Re: [PATCH] eventpoll: fix missing wakeup for ovflist in ep_poll_callback

Hi Khazhismel,

That seems to be correct. The patch you refer to, 339ddb53d373,
relies on the callback path, which *should* do the wakeup, not the
path which harvests events (hence the unnecessary wakeups). When we
add a new event to the ->ovflist nobody wakes up the waiters, thus
the missing wakeup. You are right.
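
For reference, the harvest path in question looks roughly like this
(a simplified sketch of the ep_scan_ready_list() flow, with locking
and error handling trimmed):

    /* events arriving from now on are chained to ->ovflist */
    list_splice_init(&ep->rdllist, &txlist);
    WRITE_ONCE(ep->ovflist, NULL);

    res = (*sproc)(ep, &txlist, priv);    /* harvest events in txlist */

    /* re-inject events that were chained to ->ovflist meanwhile */
    for (nepi = READ_ONCE(ep->ovflist); (epi = nepi) != NULL;
         nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
        if (!ep_is_linked(epi))
            list_add(&epi->rdllink, &ep->rdllist);
    }
    WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
    list_splice(&txlist, &ep->rdllist);
    /* before 339ddb53d373 the wakeup happened here */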

May I suggest a small change in order to avoid one new goto?
We can add a new event to either ->ovflist or ->rdllist, and
then a wakeup should happen. So a simple 'else if' branch should
do things right, something like the following:

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 8c596641a72b..7d566667c6ad 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1171,6 +1171,10 @@ static inline bool chain_epi_lockless(struct epitem *epi)
 {
     struct eventpoll *ep = epi->ep;
 
+    /* Fast preliminary check */
+    if (epi->next != EP_UNACTIVE_PTR)
+        return false;
+
     /* Check that the same epi has not been just chained from another CPU */
     if (cmpxchg(&epi->next, EP_UNACTIVE_PTR, NULL) != EP_UNACTIVE_PTR)
         return false;
@@ -1237,16 +1241,13 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
      * chained in ep->ovflist and requeued later on.
      */
     if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
-        if (epi->next == EP_UNACTIVE_PTR &&
-            chain_epi_lockless(epi))
+        if (chain_epi_lockless(epi))
             ep_pm_stay_awake_rcu(epi);
-        goto out_unlock;
     }
-
-    /* If this file is already in the ready list we exit soon */
-    if (!ep_is_linked(epi) &&
-        list_add_tail_lockless(&epi->rdllink, &ep->rdllist)) {
-        ep_pm_stay_awake_rcu(epi);
+    /* Otherwise take usual path and add event to ready list */
+    else if (!ep_is_linked(epi)) {
+        if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist))
+            ep_pm_stay_awake_rcu(epi);
     }


I also moved the 'epi->next == EP_UNACTIVE_PTR' check directly
into chain_epi_lockless(), where it belongs.

This is minor, of course, you are free to keep it as is.

Reviewed-by: Roman Penyaev <[email protected]>

--
Roman


On 2020-04-24 04:50, Khazhismel Kumykov wrote:
> In the event that we add to ovflist, before 339ddb53d373 we would be
> woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
> With that wakeup removed, if we add to ovflist here, we may never wake
> up. Rather than adding back the ep_scan_ready_list wakeup - which was
> resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
>
> We noticed that one of our workloads was missing wakeups starting with
> 339ddb53d373 and upon manual inspection, this wakeup seemed missing to
> me. With this patch added, we no longer see missing wakeups. I haven't
> yet tried to make a small reproducer, but the existing kselftests in
> filesystem/epoll passed for me with this patch.
>
> Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested
> epoll")
> Signed-off-by: Khazhismel Kumykov <[email protected]>
> ---
> fs/eventpoll.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 8c596641a72b..40cc89559cf6 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1240,7 +1240,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>          if (epi->next == EP_UNACTIVE_PTR &&
>              chain_epi_lockless(epi))
>              ep_pm_stay_awake_rcu(epi);
> -        goto out_unlock;
> +        goto out_wakeup_unlock;
>      }
>
>      /* If this file is already in the ready list we exit soon */
> @@ -1249,6 +1249,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>          ep_pm_stay_awake_rcu(epi);
>      }
>
> +out_wakeup_unlock:
>      /*
>       * Wake up ( if active ) both the eventpoll wait list and the ->poll()
>       * wait list.

2020-04-24 18:52:50

by Khazhismel Kumykov

Subject: Re: [PATCH] eventpoll: fix missing wakeup for ovflist in ep_poll_callback

On Fri, Apr 24, 2020 at 3:11 AM Roman Penyaev <[email protected]> wrote:
>
> Hi Khazhismel,
>
> That seems to be correct. The patch you refer to, 339ddb53d373,
> relies on the callback path, which *should* do the wakeup, not the
> path which harvests events (hence the unnecessary wakeups). When we
> add a new event to the ->ovflist nobody wakes up the waiters, thus
> the missing wakeup. You are right.
>
> May I suggest a small change in order to avoid one new goto?
> We can add a new event to either ->ovflist or ->rdllist, and
> then a wakeup should happen. So a simple 'else if' branch should
> do things right, something like the following:
>

Thanks for the review! I agree, I'll send a v2 without the new goto.



2020-04-24 19:04:19

by Khazhismel Kumykov

Subject: [PATCH v2] eventpoll: fix missing wakeup for ovflist in ep_poll_callback

In the event that we add to ovflist, before 339ddb53d373 we would be
woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
With that wakeup removed, if we add to ovflist here, we may never wake
up. Rather than adding back the ep_scan_ready_list wakeup - which was
resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.

We noticed that one of our workloads was missing wakeups starting with
339ddb53d373 and upon manual inspection, this wakeup seemed missing to
me. With this patch added, we no longer see missing wakeups. I haven't
yet tried to make a small reproducer, but the existing kselftests in
filesystem/epoll passed for me with this patch.

Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")

Signed-off-by: Khazhismel Kumykov <[email protected]>
Reviewed-by: Roman Penyaev <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Heiher <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: <[email protected]>
---
v2: use if/elif instead of goto + cleanup suggested by Roman
fs/eventpoll.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 8c596641a72b..d6ba0e52439b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1171,6 +1171,10 @@ static inline bool chain_epi_lockless(struct epitem *epi)
 {
     struct eventpoll *ep = epi->ep;
 
+    /* Fast preliminary check */
+    if (epi->next != EP_UNACTIVE_PTR)
+        return false;
+
     /* Check that the same epi has not been just chained from another CPU */
     if (cmpxchg(&epi->next, EP_UNACTIVE_PTR, NULL) != EP_UNACTIVE_PTR)
         return false;
@@ -1237,16 +1241,12 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
      * chained in ep->ovflist and requeued later on.
      */
     if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
-        if (epi->next == EP_UNACTIVE_PTR &&
-            chain_epi_lockless(epi))
+        if (chain_epi_lockless(epi))
+            ep_pm_stay_awake_rcu(epi);
+    } else if (!ep_is_linked(epi)) {
+        /* In the usual case, add event to ready list. */
+        if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist))
             ep_pm_stay_awake_rcu(epi);
-        goto out_unlock;
-    }
-
-    /* If this file is already in the ready list we exit soon */
-    if (!ep_is_linked(epi) &&
-        list_add_tail_lockless(&epi->rdllink, &ep->rdllist)) {
-        ep_pm_stay_awake_rcu(epi);
     }
 
     /*
--
2.26.2.303.gf8c07b1a785-goog

2020-04-25 16:19:53

by Jason Baron

Subject: Re: [PATCH v2] eventpoll: fix missing wakeup for ovflist in ep_poll_callback



On 4/24/20 3:00 PM, Khazhismel Kumykov wrote:
> In the event that we add to ovflist, before 339ddb53d373 we would be
> woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
> With that wakeup removed, if we add to ovflist here, we may never wake
> up. Rather than adding back the ep_scan_ready_list wakeup - which was
> resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.

I'm just curious which 'wakeup' we are talking about here? There is:
wake_up(&ep->wq) for the 'ep' and then there is the nested one via:
ep_poll_safewake(ep, epi). It seems to me that it's only about the latter
one being missing, not both? Is your workload using nested epoll?

If so, it might make sense to just do the latter, since the point of
the original patch was to minimize unnecessary wakeups.

Thanks,

-Jason

>
> We noticed that one of our workloads was missing wakeups starting with
> 339ddb53d373 and upon manual inspection, this wakeup seemed missing to
> me. With this patch added, we no longer see missing wakeups. I haven't
> yet tried to make a small reproducer, but the existing kselftests in
> filesystem/epoll passed for me with this patch.
>
> Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
>
> Signed-off-by: Khazhismel Kumykov <[email protected]>
> Reviewed-by: Roman Penyaev <[email protected]>
> Cc: Alexander Viro <[email protected]>
> Cc: Heiher <[email protected]>
> Cc: Jason Baron <[email protected]>
> Cc: <[email protected]>
> ---
> v2: use if/elif instead of goto + cleanup suggested by Roman
> fs/eventpoll.c | 18 +++++++++---------
> 1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 8c596641a72b..d6ba0e52439b 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1171,6 +1171,10 @@ static inline bool chain_epi_lockless(struct epitem *epi)
> {
> struct eventpoll *ep = epi->ep;
>
> + /* Fast preliminary check */
> + if (epi->next != EP_UNACTIVE_PTR)
> + return false;
> +
> /* Check that the same epi has not been just chained from another CPU */
> if (cmpxchg(&epi->next, EP_UNACTIVE_PTR, NULL) != EP_UNACTIVE_PTR)
> return false;
> @@ -1237,16 +1241,12 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
> * chained in ep->ovflist and requeued later on.
> */
> if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
> - if (epi->next == EP_UNACTIVE_PTR &&
> - chain_epi_lockless(epi))
> + if (chain_epi_lockless(epi))
> + ep_pm_stay_awake_rcu(epi);
> + } else if (!ep_is_linked(epi)) {
> + /* In the usual case, add event to ready list. */
> + if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist))
> ep_pm_stay_awake_rcu(epi);
> - goto out_unlock;
> - }
> -
> - /* If this file is already in the ready list we exit soon */
> - if (!ep_is_linked(epi) &&
> - list_add_tail_lockless(&epi->rdllink, &ep->rdllist)) {
> - ep_pm_stay_awake_rcu(epi);
> }
>
> /*
>

2020-04-25 21:01:56

by Khazhismel Kumykov

Subject: Re: [PATCH v2] eventpoll: fix missing wakeup for ovflist in ep_poll_callback

On Sat, Apr 25, 2020 at 9:17 AM Jason Baron <[email protected]> wrote:
>
>
>
> On 4/24/20 3:00 PM, Khazhismel Kumykov wrote:
> > In the event that we add to ovflist, before 339ddb53d373 we would be
> > woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
> > With that wakeup removed, if we add to ovflist here, we may never wake
> > up. Rather than adding back the ep_scan_ready_list wakeup - which was
> > resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
>
> I'm just curious which 'wakeup' we are talking about here? There is:
> wake_up(&ep->wq) for the 'ep' and then there is the nested one via:
> ep_poll_safewake(ep, epi). It seems to me that it's only about the latter
> one being missing, not both? Is your workload using nested epoll?
>
> If so, it might make sense to just do the latter, since the point of
> the original patch was to minimize unnecessary wakeups.

The missing wake-ups were when we added to ovflist instead of rdllist.
Both are "the ready list" together - so I'd think we'd want the same
wakeups regardless of which specific list we added to.
ep_poll_callback isn't nested specific?

>
> Thanks,
>
> -Jason
>
> >
> > We noticed that one of our workloads was missing wakeups starting with
> > 339ddb53d373 and upon manual inspection, this wakeup seemed missing to
> > me. With this patch added, we no longer see missing wakeups. I haven't
> > yet tried to make a small reproducer, but the existing kselftests in
> > filesystem/epoll passed for me with this patch.
> >
> > Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
> >
> > Signed-off-by: Khazhismel Kumykov <[email protected]>
> > Reviewed-by: Roman Penyaev <[email protected]>
> > Cc: Alexander Viro <[email protected]>
> > Cc: Heiher <[email protected]>
> > Cc: Jason Baron <[email protected]>
> > Cc: <[email protected]>
> > ---
> > v2: use if/elif instead of goto + cleanup suggested by Roman
> > fs/eventpoll.c | 18 +++++++++---------
> > 1 file changed, 9 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index 8c596641a72b..d6ba0e52439b 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -1171,6 +1171,10 @@ static inline bool chain_epi_lockless(struct epitem *epi)
> > {
> > struct eventpoll *ep = epi->ep;
> >
> > + /* Fast preliminary check */
> > + if (epi->next != EP_UNACTIVE_PTR)
> > + return false;
> > +
> > /* Check that the same epi has not been just chained from another CPU */
> > if (cmpxchg(&epi->next, EP_UNACTIVE_PTR, NULL) != EP_UNACTIVE_PTR)
> > return false;
> > @@ -1237,16 +1241,12 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
> > * chained in ep->ovflist and requeued later on.
> > */
> > if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
> > - if (epi->next == EP_UNACTIVE_PTR &&
> > - chain_epi_lockless(epi))
> > + if (chain_epi_lockless(epi))
> > + ep_pm_stay_awake_rcu(epi);
> > + } else if (!ep_is_linked(epi)) {
> > + /* In the usual case, add event to ready list. */
> > + if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist))
> > ep_pm_stay_awake_rcu(epi);
> > - goto out_unlock;
> > - }
> > -
> > - /* If this file is already in the ready list we exit soon */
> > - if (!ep_is_linked(epi) &&
> > - list_add_tail_lockless(&epi->rdllink, &ep->rdllist)) {
> > - ep_pm_stay_awake_rcu(epi);
> > }
> >
> > /*
> >



2020-04-27 20:41:57

by Jason Baron

Subject: Re: [PATCH v2] eventpoll: fix missing wakeup for ovflist in ep_poll_callback



On 4/25/20 4:59 PM, Khazhismel Kumykov wrote:
> On Sat, Apr 25, 2020 at 9:17 AM Jason Baron <[email protected]> wrote:
>>
>>
>>
>> On 4/24/20 3:00 PM, Khazhismel Kumykov wrote:
>>> In the event that we add to ovflist, before 339ddb53d373 we would be
>>> woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
>>> With that wakeup removed, if we add to ovflist here, we may never wake
>>> up. Rather than adding back the ep_scan_ready_list wakeup - which was
>>> resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
>>
>> I'm just curious which 'wakeup' we are talking about here? There is:
>> wake_up(&ep->wq) for the 'ep' and then there is the nested one via:
>> ep_poll_safewake(ep, epi). It seems to me that it's only about the latter
>> one being missing, not both? Is your workload using nested epoll?
>>
>> If so, it might make sense to just do the latter, since the point of
>> the original patch was to minimize unnecessary wakeups.
>
> The missing wake-ups were when we added to ovflist instead of rdllist.
> Both are "the ready list" together - so I'd think we'd want the same
> wakeups regardless of which specific list we added to.
> ep_poll_callback isn't nested specific?
>

So I was thinking that ep_poll() would see these events on the
ovflist without an explicit wakeup, b/c the overflow list being active
implies that the ep_poll() path would add them to the rdllist in
ep_scan_ready_list(). Thus, it will see the events either in the
current ep_poll() context or via a subsequent syscall to epoll_wait().

However, there are other paths that can call ep_scan_ready_list() thus
I agree with you that both wakeups here are necessary.

I do think there are still (smaller) potential races in ep_scan_ready_list()
where we have:

    write_lock_irq(&ep->lock);
    list_splice_init(&ep->rdllist, &txlist);
    WRITE_ONCE(ep->ovflist, NULL);
    write_unlock_irq(&ep->lock);

And in the ep_poll path we have:

static inline int ep_events_available(struct eventpoll *ep)
{
    return !list_empty_careful(&ep->rdllist) ||
        READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
}


Seems to me that first bit needs to be the following, since
ep_events_available() is now checked in a lockless way:


    write_lock_irq(&ep->lock);
    WRITE_ONCE(ep->ovflist, NULL);
    smp_wmb();
    list_splice_init(&ep->rdllist, &txlist);
    write_unlock_irq(&ep->lock);
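
For that smp_wmb() to order anything for the lockless reader, I think
ep_events_available() would also need the pairing read barrier between
its two loads, something like (untested sketch):

    static inline int ep_events_available(struct eventpoll *ep)
    {
        if (!list_empty_careful(&ep->rdllist))
            return 1;
        /* Pairs with the smp_wmb() in ep_scan_ready_list() */
        smp_rmb();
        return READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
    }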


And also this bit:

    WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);

    /*
     * Quickly re-inject items left on "txlist".
     */
    list_splice(&txlist, &ep->rdllist);

Should, I think, better be reversed as well to:

    list_splice(&txlist, &ep->rdllist);
    smp_wmb();
    WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);


I can send those as a separate patch followup.

Thanks,

-Jason


>>> We noticed that one of our workloads was missing wakeups starting with
>>> 339ddb53d373 and upon manual inspection, this wakeup seemed missing to
>>> me. With this patch added, we no longer see missing wakeups. I haven't
>>> yet tried to make a small reproducer, but the existing kselftests in
>>> filesystem/epoll passed for me with this patch.
>>>
>>> Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
>>>
>>> Signed-off-by: Khazhismel Kumykov <[email protected]>
>>> Reviewed-by: Roman Penyaev <[email protected]>
>>> Cc: Alexander Viro <[email protected]>
>>> Cc: Heiher <[email protected]>
>>> Cc: Jason Baron <[email protected]>
>>> Cc: <[email protected]>
>>> ---
>>> v2: use if/elif instead of goto + cleanup suggested by Roman
>>> fs/eventpoll.c | 18 +++++++++---------
>>> 1 file changed, 9 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>>> index 8c596641a72b..d6ba0e52439b 100644
>>> --- a/fs/eventpoll.c
>>> +++ b/fs/eventpoll.c
>>> @@ -1171,6 +1171,10 @@ static inline bool chain_epi_lockless(struct epitem *epi)
>>> {
>>> struct eventpoll *ep = epi->ep;
>>>
>>> + /* Fast preliminary check */
>>> + if (epi->next != EP_UNACTIVE_PTR)
>>> + return false;
>>> +
>>> /* Check that the same epi has not been just chained from another CPU */
>>> if (cmpxchg(&epi->next, EP_UNACTIVE_PTR, NULL) != EP_UNACTIVE_PTR)
>>> return false;
>>> @@ -1237,16 +1241,12 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>>> * chained in ep->ovflist and requeued later on.
>>> */
>>> if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
>>> - if (epi->next == EP_UNACTIVE_PTR &&
>>> - chain_epi_lockless(epi))
>>> + if (chain_epi_lockless(epi))
>>> + ep_pm_stay_awake_rcu(epi);
>>> + } else if (!ep_is_linked(epi)) {
>>> + /* In the usual case, add event to ready list. */
>>> + if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist))
>>> ep_pm_stay_awake_rcu(epi);
>>> - goto out_unlock;
>>> - }
>>> -
>>> - /* If this file is already in the ready list we exit soon */
>>> - if (!ep_is_linked(epi) &&
>>> - list_add_tail_lockless(&epi->rdllink, &ep->rdllist)) {
>>> - ep_pm_stay_awake_rcu(epi);
>>> }
>>>
>>> /*
>>>

2020-04-28 18:15:04

by Roman Penyaev

Subject: Re: [PATCH v2] eventpoll: fix missing wakeup for ovflist in ep_poll_callback

On 2020-04-27 22:38, Jason Baron wrote:
> On 4/25/20 4:59 PM, Khazhismel Kumykov wrote:
>> On Sat, Apr 25, 2020 at 9:17 AM Jason Baron <[email protected]> wrote:
>>>
>>>
>>>
>>> On 4/24/20 3:00 PM, Khazhismel Kumykov wrote:
>>>> In the event that we add to ovflist, before 339ddb53d373 we would be
>>>> woken up by ep_scan_ready_list, and did no wakeup in
>>>> ep_poll_callback.
>>>> With that wakeup removed, if we add to ovflist here, we may never
>>>> wake
>>>> up. Rather than adding back the ep_scan_ready_list wakeup - which
>>>> was
>>>> resulting in unnecessary wakeups, trigger a wake-up in
>>>> ep_poll_callback.
>>>
>>> I'm just curious which 'wakeup' we are talking about here? There is:
>>> wake_up(&ep->wq) for the 'ep' and then there is the nested one via:
>>> ep_poll_safewake(ep, epi). It seems to me that it's only about the latter
>>> one being missing, not both? Is your workload using nested epoll?
>>>
>>> If so, it might make sense to just do the latter, since the point of
>>> the original patch was to minimize unnecessary wakeups.
>>
>> The missing wake-ups were when we added to ovflist instead of rdllist.
>> Both are "the ready list" together - so I'd think we'd want the same
>> wakeups regardless of which specific list we added to.
>> ep_poll_callback isn't nested specific?
>>
>
> So I was thinking that ep_poll() would see these events on the
> ovflist without an explicit wakeup, b/c the overflow list being active
> implies that the ep_poll() path would add them to the rdllist in
> ep_scan_ready_list(). Thus, it will see the events either in the
> current ep_poll() context or via a subsequent syscall to epoll_wait().
>
> However, there are other paths that can call ep_scan_ready_list() thus
> I agree with you that both wakeups here are necessary.
>
> I do think there are still (smaller) potential races in
> ep_scan_ready_list()
> where we have:
>
> write_lock_irq(&ep->lock);
> list_splice_init(&ep->rdllist, &txlist);
> WRITE_ONCE(ep->ovflist, NULL);
> write_unlock_irq(&ep->lock);
>
> And in the ep_poll path we have:
>
> static inline int ep_events_available(struct eventpoll *ep)
> {
> return !list_empty_careful(&ep->rdllist) ||
> READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
> }
>
>
> Seems to me that first bit needs to be the following, since
> ep_events_available() is now checked in a lockless way:
>
>
> write_lock_irq(&ep->lock);
> WRITE_ONCE(ep->ovflist, NULL);
> smp_wmb();
> list_splice_init(&ep->rdllist, &txlist);
> write_unlock_irq(&ep->lock);


Hi Jason,

For the first chunk you refer to, the order seems irrelevant.
Either you see something not empty, in which case you take the
lock and then check the lists under the lock, or you see that
all lists are empty.

>
> And also this bit:
>
> WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>
> /*
> * Quickly re-inject items left on "txlist".
> */
> list_splice(&txlist, &ep->rdllist);
>
> Should I think better be reversed as well to:
>
> list_splice(&txlist, &ep->rdllist);
> smp_wmb();
> WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);

But this one is much more interesting. I understand what you
are trying to achieve: we can't leave both lists empty for a
short period of time; if there is something left, the caller
of ep_events_available() should be able to see that one of the
lists is not empty, otherwise it can be too late.

But the problem you've spotted is much worse. Some events can
remain in the txlist (this can happen if the caller of epoll_wait
wants to harvest only 1 event, but there are more in the ->rdllist).
And we again get the lost wakeup.

The problem is reproduced by the test below. The idea is simple:
we have 10 threads and 10 event fds. Each thread can harvest
only 1 event. 1 producer fires all 10 events at once and waits
for all 10 events to be observed by the 10 threads.

The fix is basically a revert of 339ddb53d373 with 1 major
exception: we do wakeups from ep_scan_ready_list(), but only
if the txlist is not empty and only if ep_scan_ready_list() is
called from the routine which sends events, not the one which
reads them (thus we protect ourselves from repeated wakeups).

I will send the code a bit later.

--
Roman

---- test -------
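
/*
 * Harness context assumed by this fragment (not shown here): the
 * kselftest harness providing TEST(), ASSERT_EQ()/ASSERT_GE(),
 * ARRAY_SIZE() and signal_handler(), plus <sys/epoll.h>,
 * <sys/eventfd.h>, <pthread.h>, <signal.h>, <unistd.h>, <time.h>,
 * <assert.h> and <stdint.h>.
 */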

enum {
    EPOLL60_EVENTS_NR = 10,
};

struct epoll60_ctx {
    volatile int stopped;
    int ready;
    int waiters;
    int epfd;
    int evfd[EPOLL60_EVENTS_NR];
};

static inline int count_waiters(struct epoll60_ctx *ctx)
{
    return __atomic_load_n(&ctx->waiters, __ATOMIC_ACQUIRE);
}

static void *epoll60_wait_thread(void *ctx_)
{
    struct epoll60_ctx *ctx = ctx_;
    struct epoll_event e;
    uint64_t v;
    int ret;

    while (!ctx->stopped) {
        /* Mark we are ready */
        __atomic_fetch_add(&ctx->ready, 1, __ATOMIC_ACQUIRE);

        /* Start when all are ready */
        while (__atomic_load_n(&ctx->ready, __ATOMIC_ACQUIRE) &&
               !ctx->stopped);

        /* Account this waiter */
        __atomic_fetch_add(&ctx->waiters, 1, __ATOMIC_ACQUIRE);
again_wait:
        ret = epoll_wait(ctx->epfd, &e, 1, 1000);
        if (ret != 1) {
            /* Should be stopped, otherwise we lost wakeup */
            assert(ctx->stopped);
            return NULL;
        }

        ret = read(e.data.fd, &v, sizeof(v));
        if (ret != sizeof(v)) {
            /* Event was stolen by other thread */
            goto again_wait;
        }
        __atomic_fetch_sub(&ctx->waiters, 1, __ATOMIC_RELEASE);
    }

    return NULL;
}

static inline unsigned long long msecs(void)
{
    struct timespec ts;
    unsigned long long msecs;

    clock_gettime(CLOCK_REALTIME, &ts);
    msecs = ts.tv_sec * 1000ull;
    msecs += ts.tv_nsec / 1000000ull;

    return msecs;
}

TEST(epoll60)
{
    struct epoll60_ctx ctx = { 0 };
    pthread_t waiters[ARRAY_SIZE(ctx.evfd)];
    struct epoll_event e;
    int i, n, ret;

    signal(SIGUSR1, signal_handler);

    ctx.epfd = epoll_create1(0);
    ASSERT_GE(ctx.epfd, 0);

    /* Create event fds */
    for (i = 0; i < ARRAY_SIZE(ctx.evfd); i++) {
        ctx.evfd[i] = eventfd(0, EFD_NONBLOCK);
        ASSERT_GE(ctx.evfd[i], 0);

        e.events = EPOLLIN | EPOLLERR;
        e.data.fd = ctx.evfd[i];
        ASSERT_EQ(epoll_ctl(ctx.epfd, EPOLL_CTL_ADD, ctx.evfd[i], &e), 0);
    }

    /* Create waiter threads */
    for (i = 0; i < ARRAY_SIZE(waiters); i++)
        ASSERT_EQ(pthread_create(&waiters[i], NULL,
                                 epoll60_wait_thread, &ctx), 0);

    for (i = 0; i < 300; i++) {
        uint64_t v = 1, ms;

        /* Wait for all to be ready */
        while (__atomic_load_n(&ctx.ready, __ATOMIC_ACQUIRE) !=
               ARRAY_SIZE(ctx.evfd))
            ;

        /* Steady, go */
        __atomic_fetch_sub(&ctx.ready, ARRAY_SIZE(ctx.evfd),
                           __ATOMIC_ACQUIRE);

        /* Wait all have gone to kernel */
        while (count_waiters(&ctx) != ARRAY_SIZE(ctx.evfd))
            ;

        /* 1ms should be enough to schedule out */
        usleep(1000);

        /* Quickly signal all handles at once */
        for (n = 0; n < ARRAY_SIZE(ctx.evfd); n++) {
            ret = write(ctx.evfd[n], &v, sizeof(v));
            ASSERT_EQ(ret, sizeof(v));
        }

        /* Busy loop for up to 3s and wait for all waiters to wake up */
        ms = msecs();
        while (count_waiters(&ctx) && msecs() < ms + 3000)
            ;

        ASSERT_EQ(count_waiters(&ctx), 0);
    }
    ctx.stopped = 1;
    /* Stop waiters */
    for (i = 0; i < ARRAY_SIZE(waiters); i++) {
        pthread_kill(waiters[i], SIGUSR1);
        pthread_join(waiters[i], NULL);
    }

    for (i = 0; i < ARRAY_SIZE(waiters); i++)
        close(ctx.evfd[i]);
    close(ctx.epfd);
}


2020-04-29 04:14:26

by Jason Baron

Subject: Re: [PATCH v2] eventpoll: fix missing wakeup for ovflist in ep_poll_callback



On 4/28/20 2:10 PM, Roman Penyaev wrote:
> On 2020-04-27 22:38, Jason Baron wrote:
>> On 4/25/20 4:59 PM, Khazhismel Kumykov wrote:
>>> On Sat, Apr 25, 2020 at 9:17 AM Jason Baron <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 4/24/20 3:00 PM, Khazhismel Kumykov wrote:
>>>>> In the event that we add to ovflist, before 339ddb53d373 we would be
>>>>> woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
>>>>> With that wakeup removed, if we add to ovflist here, we may never wake
>>>>> up. Rather than adding back the ep_scan_ready_list wakeup - which was
>>>>> resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
>>>>
>>>> I'm just curious which 'wakeup' we are talking about here? There is:
>>>> wake_up(&ep->wq) for the 'ep' and then there is the nested one via:
>>>> ep_poll_safewake(ep, epi). It seems to me that it's only about the latter
>>>> one being missing, not both? Is your workload using nested epoll?
>>>>
>>>> If so, it might make sense to just do the latter, since the point of
>>>> the original patch was to minimize unnecessary wakeups.
>>>
>>> The missing wake-ups were when we added to ovflist instead of rdllist.
>>> Both are "the ready list" together - so I'd think we'd want the same
>>> wakeups regardless of which specific list we added to.
>>> ep_poll_callback isn't nested specific?
>>>
>>
>> So I was thinking that ep_poll() would see these events on the
>> ovflist without an explicit wakeup, b/c the overflow list being active
>> implies that the ep_poll() path would add them to the rdllist in
>> ep_scan_ready_list(). Thus, it will see the events either in the
>> current ep_poll() context or via a subsequent syscall to epoll_wait().
>>
>> However, there are other paths that can call ep_scan_ready_list() thus
>> I agree with you that both wakeups here are necessary.
>>
>> I do think there are still (smaller) potential races in ep_scan_ready_list()
>> where we have:
>>
>>         write_lock_irq(&ep->lock);
>>         list_splice_init(&ep->rdllist, &txlist);
>>         WRITE_ONCE(ep->ovflist, NULL);
>>         write_unlock_irq(&ep->lock);
>>
>> And in the ep_poll path we have:
>>
>> static inline int ep_events_available(struct eventpoll *ep)
>> {
>>         return !list_empty_careful(&ep->rdllist) ||
>>                 READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
>> }
>>
>>
>> Seems to me that first bit needs to be the following, since
>> ep_events_available() is now checked in a lockless way:
>>
>>
>>         write_lock_irq(&ep->lock);
>>     WRITE_ONCE(ep->ovflist, NULL);
>>     smp_wmb();
>>         list_splice_init(&ep->rdllist, &txlist);
>>         write_unlock_irq(&ep->lock);
>
>
> Hi Jason,
>
> For the first chunk you refer to, the order seems irrelevant.
> Either you see something not empty, in which case you take the
> lock and then check the lists under the lock, or you see that
> all lists are empty.
>

Hi Roman,

It does matter. Let's say we have:

epfd1->epfd2->socket. And thread a is doing an
epoll_wait() on epfd1, and thread b is doing
epoll_wait() on epfd2. Then:

1) data comes in on socket

ep_poll_callback() wakes up threads a and b

2) thread a runs

ep_poll()
  ep_scan_ready_list()
    ep_send_events_proc()
      ep_item_poll()
        ep_scan_ready_list()
          list_splice_init(&ep->rdllist, &txlist);

3) now thread b is running

ep_poll()
  ep_events_available()
    returns false
  schedule_hrtimeout_range()

Thus, thread b has missed a wakeup and will never
get it.
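
Concretely, the setup is something like this userspace sketch
(illustrative fragment; 'sock' is any pollable fd):

    /* epfd1 -> epfd2 -> socket */
    int epfd2 = epoll_create1(0);
    int epfd1 = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN };

    ev.data.fd = sock;
    epoll_ctl(epfd2, EPOLL_CTL_ADD, sock, &ev);  /* epfd2 watches socket */
    ev.data.fd = epfd2;
    epoll_ctl(epfd1, EPOLL_CTL_ADD, epfd2, &ev); /* epfd1 watches epfd2 */

    /* thread a: */ epoll_wait(epfd1, &ev, 1, -1);
    /* thread b: */ epoll_wait(epfd2, &ev, 1, -1);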


Similarly, for the second chunk I referenced.

>>
>> And also this bit:
>>
>>         WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>>
>>         /*
>>          * Quickly re-inject items left on "txlist".
>>          */
>>         list_splice(&txlist, &ep->rdllist);
>>
>> Should, I think, better be reversed as well to:
>>
>> list_splice(&txlist, &ep->rdllist);
>> smp_wmb();
>> WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>
> But this one is much more interesting. I understand what you
> are trying to achieve: we can't leave both lists empty for a
> short period of time; if there is something left, the caller
> of ep_events_available() should be able to see that one of the
> lists is not empty, otherwise it can be too late.
>
> But the problem you've spotted is much worse. Some events can
> remain in the txlist (this can happen if the caller of epoll_wait
> wants to harvest only 1 event, but there are more in the ->rdllist).
> And we again get the lost wakeup.
>
> The problem is reproduced by the test below. The idea is simple:
> we have 10 threads and 10 event fds. Each thread can harvest
> only 1 event. 1 producer fires all 10 events at once and waits
> for all 10 events to be observed by the 10 threads.
>
> The fix is basically a revert of 339ddb53d373 with 1 major
> exception: we do wakeups from ep_scan_ready_list(), but only
> if the txlist is not empty and only if ep_scan_ready_list() is
> called from the routine which sends events, not the one which
> reads them (thus we protect ourselves from repeated wakeups).
>
> I will send the code a bit later.

This was discussed as part of the original discussion around
339ddb53d373: https://lkml.org/lkml/2019/10/7/905

The context was more a performance difference than a semantic
one, but as I pointed out I believe that behavior pre-dates the
above commit and goes back to:
86c0517 fs/epoll: deal with wait_queue only once

There, since the thread is left on the waitqueue over the
ep_scan_ready_list(), the ep_wakeup() (that was removed
in 339ddb53d373) would no longer wake up other potential
waiters.

So since I think this behavior change goes back to 5.0 and there
really haven't been any reports, I don't think there are
too many apps relying on these semantics that your test
case is showing. It would be interesting to confirm that
your test does indeed succeed/fail before/after that patch.

Also, as part of that original discussion, you had a patch
that I think addresses this. I would be ok with that, in
addition to a patch to address the ordering issue I pointed
out. I can post a patch for the former, if you think this
plan makes sense?

Thanks,

-Jason


>
> --
> Roman
>
> ---- test -------
>
> enum {
>     EPOLL60_EVENTS_NR = 10,
> };
>
> struct epoll60_ctx {
>     volatile int stopped;
>     int ready;
>     int waiters;
>     int epfd;
>     int evfd[EPOLL60_EVENTS_NR];
> };
>
> static inline int count_waiters(struct epoll60_ctx *ctx)
> {
>     return __atomic_load_n(&ctx->waiters, __ATOMIC_ACQUIRE);
> }
>
> static void *epoll60_wait_thread(void *ctx_)
> {
>     struct epoll60_ctx *ctx = ctx_;
>     struct epoll_event e;
>     uint64_t v;
>     int ret;
>
>     while (!ctx->stopped) {
>         /* Mark we are ready */
>         __atomic_fetch_add(&ctx->ready, 1, __ATOMIC_ACQUIRE);
>
>         /* Start when all are ready */
>         while (__atomic_load_n(&ctx->ready, __ATOMIC_ACQUIRE) &&
>                !ctx->stopped);
>
>         /* Account this waiter */
>         __atomic_fetch_add(&ctx->waiters, 1, __ATOMIC_ACQUIRE);
> again_wait:
>         ret = epoll_wait(ctx->epfd, &e, 1, 1000);
>         if (ret != 1) {
>             /* Should be stopped, otherwise we lost wakeup */
>             assert(ctx->stopped);
>             return NULL;
>         }
>
>         ret = read(e.data.fd, &v, sizeof(v));
>         if (ret != sizeof(v)) {
>             /* Event was stolen by other thread */
>             goto again_wait;
>         }
>         __atomic_fetch_sub(&ctx->waiters, 1, __ATOMIC_RELEASE);
>     }
>
>     return NULL;
> }
>
> static inline unsigned long long msecs(void)
> {
>     struct timespec ts;
>     unsigned long long msecs;
>
>     clock_gettime(CLOCK_REALTIME, &ts);
>     msecs = ts.tv_sec * 1000ull;
>     msecs += ts.tv_nsec / 1000000ull;
>
>     return msecs;
> }
>
> TEST(epoll60)
> {
>     struct epoll60_ctx ctx = { 0 };
>     pthread_t waiters[ARRAY_SIZE(ctx.evfd)];
>     struct epoll_event e;
>     int i, n, ret;
>
>     signal(SIGUSR1, signal_handler);
>
>     ctx.epfd = epoll_create1(0);
>     ASSERT_GE(ctx.epfd, 0);
>
>     /* Create event fds */
>     for (i = 0; i < ARRAY_SIZE(ctx.evfd); i++) {
>         ctx.evfd[i] = eventfd(0, EFD_NONBLOCK);
>         ASSERT_GE(ctx.evfd[i], 0);
>
>         e.events = EPOLLIN | EPOLLERR;
>         e.data.fd = ctx.evfd[i];
>         ASSERT_EQ(epoll_ctl(ctx.epfd, EPOLL_CTL_ADD, ctx.evfd[i], &e), 0);
>     }
>
>     /* Create waiter threads */
>     for (i = 0; i < ARRAY_SIZE(waiters); i++)
>         ASSERT_EQ(pthread_create(&waiters[i], NULL,
>                      epoll60_wait_thread, &ctx), 0);
>
>     for (i = 0; i < 300; i++) {
>         uint64_t v = 1, ms;
>
>         /* Wait for all to be ready */
>         while (__atomic_load_n(&ctx.ready, __ATOMIC_ACQUIRE) !=
>                ARRAY_SIZE(ctx.evfd))
>             ;
>
>         /* Steady, go */
>         __atomic_fetch_sub(&ctx.ready, ARRAY_SIZE(ctx.evfd),
>                    __ATOMIC_ACQUIRE);
>
>         /* Wait all have gone to kernel */
>         while (count_waiters(&ctx) != ARRAY_SIZE(ctx.evfd))
>             ;
>
>         /* 1ms should be enough to schedule out */
>         usleep(1000);
>
>         /* Quickly signal all handles at once */
>         for (n = 0; n < ARRAY_SIZE(ctx.evfd); n++) {
>             ret = write(ctx.evfd[n], &v, sizeof(v));
>             ASSERT_EQ(ret, sizeof(v));
>         }
>
>         /* Busy loop for up to 3s and wait for all waiters to wake up */
>         ms = msecs();
>         while (count_waiters(&ctx) && msecs() < ms + 3000)
>             ;
>
>         ASSERT_EQ(count_waiters(&ctx), 0);
>     }
>     ctx.stopped = 1;
>     /* Stop waiters */
>     for (i = 0; i < ARRAY_SIZE(waiters); i++) {
>         pthread_kill(waiters[i], SIGUSR1);
>         pthread_join(waiters[i], NULL);
>     }
>
>     for (i = 0; i < ARRAY_SIZE(waiters); i++)
>         close(ctx.evfd[i]);
>     close(ctx.epfd);
> }
>
>

2020-04-29 14:40:34

by Roman Penyaev

Subject: Re: [PATCH v2] eventpoll: fix missing wakeup for ovflist in ep_poll_callback

On 2020-04-29 06:12, Jason Baron wrote:
> On 4/28/20 2:10 PM, Roman Penyaev wrote:
>> On 2020-04-27 22:38, Jason Baron wrote:
>>> On 4/25/20 4:59 PM, Khazhismel Kumykov wrote:
>>>> On Sat, Apr 25, 2020 at 9:17 AM Jason Baron <[email protected]>
>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 4/24/20 3:00 PM, Khazhismel Kumykov wrote:
>>>>>> In the event that we add to ovflist, before 339ddb53d373 we would
>>>>>> be
>>>>>> woken up by ep_scan_ready_list, and did no wakeup in
>>>>>> ep_poll_callback.
>>>>>> With that wakeup removed, if we add to ovflist here, we may never
>>>>>> wake
>>>>>> up. Rather than adding back the ep_scan_ready_list wakeup - which
>>>>>> was
>>>>>> resulting in unnecessary wakeups, trigger a wake-up in
>>>>>> ep_poll_callback.
>>>>>
>>>>> I'm just curious which 'wakeup' we are talking about here? There
>>>>> is:
>>>>> wake_up(&ep->wq) for the 'ep' and then there is the nested one via:
>>>>> ep_poll_safewake(ep, epi). It seems to me that it's only about the latter
>>>>> one being missing, not both? Is your workload using nested epoll?
>>>>>
>>>>> If so, it might make sense to just do the latter, since the point of
>>>>> the original patch was to minimize unnecessary wakeups.
>>>>
>>>> The missing wake-ups were when we added to ovflist instead of
>>>> rdllist.
>>>> Both are "the ready list" together - so I'd think we'd want the same
>>>> wakeups regardless of which specific list we added to.
>>>> ep_poll_callback isn't nested specific?
>>>>
>>>
>>> So I was thinking that ep_poll() would see these events on the
>>> ovflist without an explicit wakeup, b/c the overflow list being
>>> active
>>> implies that the ep_poll() path would add them to the rdllist in
>>> ep_scan_ready_list(). Thus, it will see the events either in the
>>> current ep_poll() context or via a subsequent syscall to
>>> epoll_wait().
>>>
>>> However, there are other paths that can call ep_scan_ready_list()
>>> thus
>>> I agree with you that both wakeups here are necessary.
>>>
>>> I do think there are still (smaller) potential races in
>>> ep_scan_ready_list()
>>> where we have:
>>>
>>>         write_lock_irq(&ep->lock);
>>>         list_splice_init(&ep->rdllist, &txlist);
>>>         WRITE_ONCE(ep->ovflist, NULL);
>>>         write_unlock_irq(&ep->lock);
>>>
>>> And in the ep_poll path we have:
>>>
>>> static inline int ep_events_available(struct eventpoll *ep)
>>> {
>>>         return !list_empty_careful(&ep->rdllist) ||
>>>                 READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
>>> }
>>>
>>>
>>> Seems to me that first bit needs to be the following, since
>>> ep_events_available() is now checked in a lockless way:
>>>
>>>
>>>         write_lock_irq(&ep->lock);
>>>     WRITE_ONCE(ep->ovflist, NULL);
>>>     smp_wmb();
>>>         list_splice_init(&ep->rdllist, &txlist);
>>>         write_unlock_irq(&ep->lock);
>>
>>
>> Hi Jason,
>>
>> For the first chunk you refer to, the order seems irrelevant.
>> Either you see something not empty, in which case you take the
>> lock and then check the lists under the lock, or you see that
>> all lists are empty.
>>
>
> Hi Roman,
>
> It does matter. Let's say we have:
>
> epfd1->epfd2->socket. And thread a is doing an
> epoll_wait() on epfd1, and thread b is doing
> epoll_wait() on epfd2. Then:
>
> 1) data comes in on socket
>
> ep_poll_callback() wakes up threads a and b
>
> 2) thread a runs
>
> ep_poll()
>   ep_scan_ready_list()
>     ep_send_events_proc()
>       ep_item_poll()
>         ep_scan_ready_list()
>           list_splice_init(&ep->rdllist, &txlist);
>
> 3) now thread b is running
>
> ep_poll()
>   ep_events_available()
>     returns false
>   schedule_hrtimeout_range()
>
> Thus, thread b has missed a wakeup and will never
> get it.
>
>
> Similarly, for the second chunk I referenced.

Hi Jason,

Yes, that makes sense.

>>>
>>> And also this bit:
>>>
>>>         WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>>>
>>>         /*
>>>          * Quickly re-inject items left on "txlist".
>>>          */
>>>         list_splice(&txlist, &ep->rdllist);
>>>
>>> Should, I think, better be reversed as well to:
>>>
>>> list_splice(&txlist, &ep->rdllist);
>>> smp_wmb();
>>> WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>>
>> But this one is much more interesting. I understand what you
>> are trying to achieve: we can't leave both lists empty for a
>> short period of time; if there is something left, the caller
>> of ep_events_available() should be able to see that one of the
>> lists is not empty, otherwise it can be too late.
>>
>> But the problem you've spotted is much worse. Some events can
>> remain in the txlist (this can happen if the caller of epoll_wait
>> wants to harvest only 1 event, but there are more in the ->rdllist).
>> And we again get the lost wakeup.
>>
>> The problem is reproduced by the test below. The idea is simple:
>> we have 10 threads and 10 event fds. Each thread can harvest
>> only 1 event. 1 producer fires all 10 events at once and waits
>> for all 10 events to be observed by the 10 threads.
>>
>> The fix is basically a revert of 339ddb53d373 with 1 major
>> exception: we do wakeups from ep_scan_ready_list(), but only
>> if the txlist is not empty and only if ep_scan_ready_list() is
>> called from the routine which sends events, not the one which
>> reads them (thus we protect ourselves from repeated wakeups).
>>
>> I will send the code a bit later.
>
> This was discussed as part of the original discussion around
> 339ddb53d373: https://lkml.org/lkml/2019/10/7/905

True! This is the exact scenario which is covered by the
test from my previous email. And the test fails. I forgot
about this discussion.

> The context was more a performance difference than a semantic
> one, but as I pointed out I believe that behavior pre-dates the
> above commit and goes back to:
> 86c0517 fs/epoll: deal with wait_queue only once
>
> There, since the thread is left on the waitqueue over the
> ep_scan_ready_list(), the ep_wakeup() (that was removed
> in 339ddb53d373) would no longer wake up other potential
> waiters.
>
> So since I think this behavior change goes back to 5.0 and there
> really haven't been any reports, I don't think there are
> too many apps relying on these semantics that your test
> case is showing. It would be interesting to confirm that
> your test does indeed succeed/fail before/after that patch.

The main problem is that two sequential wakeups can hit the
same thread, while the wait queue entry is still in the list.
You also described this in https://lkml.org/lkml/2019/10/7/905

If we wakeup only from ep_poll_callback (current code state)
we have to be sure each wakeup hits a new wait queue entry.
This behavior can be achieved with autoremove_wake_function,
which should also be good performance-wise, since after
an explicit wakeup you don't need to remove the wait entry
from the list under the lock.
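
For reference, init_wait() sets the wake function to
autoremove_wake_function(), which is roughly (kernel/sched/wait.c):

    int autoremove_wake_function(struct wait_queue_entry *wq_entry,
                                 unsigned mode, int sync, void *key)
    {
        int ret = default_wake_function(wq_entry, mode, sync, key);

        /* the entry leaves the waitqueue as part of the wakeup itself */
        if (ret)
            list_del_init(&wq_entry->entry);

        return ret;
    }

so a second wakeup cannot hit the same, already-removed entry.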

In the end it comes down to this:

1. either we do an additional wakeup from ep_scan_ready_list(),
which is what we had prior to 339ddb53d373,

2. or we wait with autoremove_wake_function, thus guaranteeing
that each new wakeup hits a new thread.


I'm testing both variants using the last test I sent yesterday;
both work so far.

I personally lean towards the second variant, I really don't like
the "we-wakeup-from-all-the-places" solution.

Could you please take a look at the patch below? It probably
needs some tweaks, but the idea should be clear.

> Also, as part of that original discussion, you had a patch
> that I think addresses this. I would be ok with that, in
> addition to a patch to address the ordering issue I pointed
> out. I can post a patch for the former, if you think this
> plan makes sense?

Go ahead with your reordering findings! That is correct for
a single wakeup, which we have in ep_poll_callback().

--
Roman


diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d6ba0e52439b..1057598cd299 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1822,7 +1822,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 {
     int res = 0, eavail, timed_out = 0;
     u64 slack = 0;
-    bool waiter = false;
     wait_queue_entry_t wait;
     ktime_t expires, *to = NULL;
 
@@ -1867,21 +1866,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
      */
     ep_reset_busy_poll_napi_id(ep);
 
-    /*
-     * We don't have any available event to return to the caller. We need
-     * to sleep here, and we will be woken by ep_poll_callback() when events
-     * become available.
-     */
-    if (!waiter) {
-        waiter = true;
-        init_waitqueue_entry(&wait, current);
-
+    do {
+        init_wait(&wait);
         write_lock_irq(&ep->lock);
         __add_wait_queue_exclusive(&ep->wq, &wait);
         write_unlock_irq(&ep->lock);
-    }
 
-    for (;;) {
         /*
          * We don't want to sleep if the ep_poll_callback() sends us
          * a wakeup in between. That's why we set the task state
@@ -1911,6 +1901,16 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
             timed_out = 1;
             break;
         }
+
+        /* We were woken up, thus go and try to harvest some events */
+        eavail = 1;
+
+    } while (0);
+
+    if (!list_empty_careful(&wait.entry)) {
+        write_lock_irq(&ep->lock);
+        __remove_wait_queue(&ep->wq, &wait);
+        write_unlock_irq(&ep->lock);
     }
     __set_current_state(TASK_RUNNING);
@@ -1925,12 +1925,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
         !(res = ep_send_events(ep, events, maxevents)) &&
         !timed_out)
         goto fetch_events;
 
-    if (waiter) {
-        write_lock_irq(&ep->lock);
-        __remove_wait_queue(&ep->wq, &wait);
-        write_unlock_irq(&ep->lock);
-    }
-
     return res;
 }