2020-05-01 19:23:43

by Jason Baron

Subject: [PATCH] epoll: ensure ep_poll() doesn't miss wakeup events

Now that the ep_events_available() check is done in a lockless way, and
we no longer perform wakeups from ep_scan_ready_list(), we need to ensure
that either ep->rdllist has items or the overflow list is active. Prior to
commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested
epoll"), we did a wake_up(&ep->wq) after manipulating ep->rdllist and the
overflow list, so any waiters would observe the correct state. With that
wake_up() now removed, we need to be more careful to ensure that condition
still holds.
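
For reference, the lockless check under discussion currently reads as
follows (it is also quoted further down the thread):

static inline int ep_events_available(struct eventpoll *ep)
{
	return !list_empty_careful(&ep->rdllist) ||
		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
}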

Here's an example of what could go wrong:

We have epoll fds epfd1 and epfd2; epfd1 is added to epfd2, and epfd2 is
added to a socket: epfd1->epfd2->socket. Thread a is doing epoll_wait() on
epfd1, and thread b is doing epoll_wait() on epfd2. Then:

1) data comes in on socket

ep_poll_callback() wakes up threads a and b

2) thread a runs

ep_poll()
  ep_scan_ready_list()
    ep_send_events_proc()
      ep_item_poll()
        ep_scan_ready_list()
          list_splice_init(&ep->rdllist, &txlist);

3) now thread b is running

ep_poll()
  ep_events_available()
    returns false
  schedule_hrtimeout_range()

Thus, thread b has now scheduled and missed the wakeup.
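
A rough userspace sketch of that setup, just to make the scenario concrete
(illustrative only: error handling and the socket plumbing are omitted, and
the names simply follow the description above):

#include <sys/epoll.h>
#include <pthread.h>
#include <stdio.h>

static int epfd1, epfd2, sock;	/* sock: any fd that can become readable */

static void *waiter(void *arg)
{
	int epfd = *(int *)arg;
	struct epoll_event ev;

	/* both threads block here; the bug is a missed wakeup for one of them */
	int n = epoll_wait(epfd, &ev, 1, -1);
	printf("epoll_wait(%d) returned %d\n", epfd, n);
	return NULL;
}

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN };
	pthread_t a, b;

	/* build the nesting: epfd1 -> epfd2 -> socket */
	epfd1 = epoll_create1(0);
	epfd2 = epoll_create1(0);
	/* sock = socket(...); setup omitted */
	ev.data.fd = sock;
	epoll_ctl(epfd2, EPOLL_CTL_ADD, sock, &ev);
	ev.data.fd = epfd2;
	epoll_ctl(epfd1, EPOLL_CTL_ADD, epfd2, &ev);

	pthread_create(&a, NULL, waiter, &epfd1);	/* thread a */
	pthread_create(&b, NULL, waiter, &epfd2);	/* thread b */
	/* ... now make data arrive on sock ... */
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}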

Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
Signed-off-by: Jason Baron <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Heiher <[email protected]>
Cc: Roman Penyaev <[email protected]>
Cc: Khazhismel Kumykov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: <[email protected]>
---
fs/eventpoll.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index aba03ee749f8..4af2d020f548 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -704,8 +704,14 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	 * in a lockless way.
 	 */
 	write_lock_irq(&ep->lock);
-	list_splice_init(&ep->rdllist, &txlist);
 	WRITE_ONCE(ep->ovflist, NULL);
+	/*
+	 * In ep_poll() we use ep_events_available() in a lockless way to decide
+	 * if events are available. So we need to preserve that either
+	 * ep->ovflist != EP_UNACTIVE_PTR or there are events on the ep->rdllist.
+	 */
+	smp_wmb();
+	list_splice_init(&ep->rdllist, &txlist);
 	write_unlock_irq(&ep->lock);

 	/*
@@ -737,16 +743,21 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 		}
 	}
 	/*
+	 * Quickly re-inject items left on "txlist".
+	 */
+	list_splice(&txlist, &ep->rdllist);
+	/*
+	 * In ep_poll() we use ep_events_available() in a lockless way to decide
+	 * if events are available. So we need to preserve that either
+	 * ep->ovflist != EP_UNACTIVE_PTR or there are events on the ep->rdllist.
+	 */
+	smp_wmb();
+	/*
 	 * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
 	 * releasing the lock, events will be queued in the normal way inside
 	 * ep->rdllist.
 	 */
 	WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
-
-	/*
-	 * Quickly re-inject items left on "txlist".
-	 */
-	list_splice(&txlist, &ep->rdllist);
 	__pm_relax(ep->ws);
 	write_unlock_irq(&ep->lock);

--
2.7.4


2020-05-01 21:06:47

by Roman Penyaev

Subject: Re: [PATCH] epoll: ensure ep_poll() doesn't miss wakeup events

Hi Jason,

That is indeed a nice catch.
Seems we need an smp_rmb() pair between list_empty_careful(&ep->rdllist)
and READ_ONCE(ep->ovflist) in ep_events_available(), don't we?
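
Something like the following on the read side is what I mean (just a
sketch, pairing with the smp_wmb() added on the write side):

static inline int ep_events_available(struct eventpoll *ep)
{
	if (!list_empty_careful(&ep->rdllist))
		return 1;
	/* pairs with the smp_wmb() in ep_scan_ready_list() */
	smp_rmb();
	return READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
}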

Other than that:

Reviewed-by: Roman Penyaev <[email protected]>

--
Roman


2020-05-01 22:15:12

by Jason Baron

Subject: Re: [PATCH] epoll: ensure ep_poll() doesn't miss wakeup events



On 5/1/20 5:02 PM, Roman Penyaev wrote:
> Hi Jason,
>
> That is indeed a nice catch.
> Seems we need an smp_rmb() pair between list_empty_careful(&ep->rdllist) and
> READ_ONCE(ep->ovflist) in ep_events_available(), don't we?
>

Hi Roman,

Good point, but even if we order those reads it's still racy, since the
read of the ready list could come after it's been cleared, and the
read of the overflow list could likewise come after it's been cleared.

So I'm afraid we might need instead something like this to make
sure they are read together:

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d6ba0e5..31c5f14 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1899,7 +1899,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 			break;
 		}

+		write_lock_irq(&ep->lock);
 		eavail = ep_events_available(ep);
+		write_unlock_irq(&ep->lock);
 		if (eavail)
 			break;
 		if (signal_pending(current)) {


Thanks,

-Jason




2020-05-03 10:27:11

by Roman Penyaev

Subject: Re: [PATCH] epoll: ensure ep_poll() doesn't miss wakeup events

On 2020-05-02 00:09, Jason Baron wrote:
> On 5/1/20 5:02 PM, Roman Penyaev wrote:
>> Hi Jason,
>>
>> That is indeed a nice catch.
>> Seems we need an smp_rmb() pair between list_empty_careful(&ep->rdllist) and
>> READ_ONCE(ep->ovflist) in ep_events_available(), don't we?
>>
>
> Hi Roman,
>
> Good point, but even if we order those reads it's still racy, since the
> read of the ready list could come after it's been cleared, and the
> read of the overflow list could likewise come after it's been cleared.

You mean the second chunk? True. Sigh.

> So I'm afraid we might need instead something like this to make
> sure they are read together:

No, impossible, I can't believe that :) We can't give up.

All we need is to keep a mark that ep->rdllist is not empty,
even though we've just spliced it. ep_poll_callback() always takes
the ->ovflist path if ->ovflist is not EP_UNACTIVE_PTR, but
ep_events_available() does not need to observe ->ovflist at
all, just ->rdllist.
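
(For reference, the ->ovflist path in ep_poll_callback() is roughly the
following - simplified, not the verbatim code:)

	if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
		/* a scan is in flight: park the item on the overflow list */
		chain_epi_lockless(epi);
	} else if (!ep_is_linked(epi)) {
		/* usual case: add the item to the ready list */
		list_add_tail_lockless(&epi->rdllink, &ep->rdllist);
	}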

Take a look at this, am I missing something? :

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index aba03ee749f8..a8770f9a917e 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -376,8 +376,7 @@ static void ep_nested_calls_init(struct nested_calls *ncalls)
  */
 static inline int ep_events_available(struct eventpoll *ep)
 {
-	return !list_empty_careful(&ep->rdllist) ||
-		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
+	return !list_empty_careful(&ep->rdllist);
 }

 #ifdef CONFIG_NET_RX_BUSY_POLL
@@ -683,7 +682,8 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 {
 	__poll_t res;
 	struct epitem *epi, *nepi;
-	LIST_HEAD(txlist);
+	LIST_HEAD(rdllist);
+	LIST_HEAD(ovflist);

 	lockdep_assert_irqs_enabled();

@@ -704,14 +704,22 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	 * in a lockless way.
 	 */
 	write_lock_irq(&ep->lock);
-	list_splice_init(&ep->rdllist, &txlist);
+	/*
+	 * We do not call list_splice_init() because for lockless
+	 * ep_events_available() ->rdllist is still "not empty".
+	 * Otherwise the feature that there is something left in
+	 * the list can be lost which causes missed wakeup.
+	 */
+	list_splice(&ep->rdllist, &rdllist);
+	/*
+	 * If ->rdllist was empty we should pretend it was not,
+	 * because after the unlock ->ovflist comes into play,
+	 * which is invisible for lockless ep_events_available().
+	 */
+	ep->rdllist.next = LIST_POISON1;
 	WRITE_ONCE(ep->ovflist, NULL);
 	write_unlock_irq(&ep->lock);

 	/*
 	 * Now call the callback function.
 	 */
-	res = (*sproc)(ep, &txlist, priv);
+	res = (*sproc)(ep, &rdllist, priv);

 	write_lock_irq(&ep->lock);
 	/*
@@ -724,7 +732,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 		/*
 		 * We need to check if the item is already in the list.
 		 * During the "sproc" callback execution time, items are
-		 * queued into ->ovflist but the "txlist" might already
+		 * queued into ->ovflist but the "rdllist" might already
 		 * contain them, and the list_splice() below takes care of them.
 		 */
 		if (!ep_is_linked(epi)) {
@@ -732,7 +740,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 			 * ->ovflist is LIFO, so we have to reverse it in order
 			 * to keep in FIFO.
 			 */
-			list_add(&epi->rdllink, &ep->rdllist);
+			list_add(&epi->rdllink, &ovflist);
 			ep_pm_stay_awake(epi);
 		}
 	}
@@ -743,10 +751,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	 */
 	WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);

-	/*
-	 * Quickly re-inject items left on "txlist".
-	 */
-	list_splice(&txlist, &ep->rdllist);
+	/* Events from ->ovflist happened later, thus splice to the tail */
+	list_splice_tail(&ovflist, &rdllist);
+	/* Just replace list */
+	list_replace(&rdllist, &ep->rdllist);
+
 	__pm_relax(ep->ws);
 	write_unlock_irq(&ep->lock);

@@ -1763,13 +1772,13 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head
 			 * Trigger mode, we need to insert back inside
 			 * the ready list, so that the next call to
 			 * epoll_wait() will check again the events
-			 * availability. At this point, no one can insert
-			 * into ep->rdllist besides us. The epoll_ctl()
-			 * callers are locked out by
-			 * ep_scan_ready_list() holding "mtx" and the
-			 * poll callback will queue them in ep->ovflist.
+			 * availability. What we do here is simply
+			 * return the epi to the same position where
+			 * it was, the ep_scan_ready_list() will
+			 * re-inject the leftovers to the ->rdllist
+			 * under the proper lock.
 			 */
-			list_add_tail(&epi->rdllink, &ep->rdllist);
+			list_add_tail(&epi->rdllink, &tmp->rdllink);
 			ep_pm_stay_awake(epi);
 		}
 	}


--
Roman

2020-05-03 13:09:33

by David Laight

Subject: RE: [PATCH] epoll: ensure ep_poll() doesn't miss wakeup events

From: Jason Baron
> Sent: 01 May 2020 20:16
>
> Now that the ep_events_available() check is done in a lockless way, and
> we no longer perform wakeups from ep_scan_ready_list(), we need to ensure
> that either ep->rdllist has items or the overflow list is active. Prior to:
> commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested
> epoll"), we did wake_up(&ep->wq) after manipulating the ep->rdllist and the
> overflow list. Thus, any waiters would observe the correct state. However,
> with that wake_up() now removed we need to be more careful to ensure that
> condition.

I'm wondering how much all this affects the (probably) more common
case of a process reading events from a lot of sockets in 'level'
mode.

Even the change to an rwlock may have had an adverse effect
on such programs.

In 'level' mode it doesn't make any sense to have multiple
readers of the event queue.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-05-04 05:03:05

by Jason Baron

Subject: Re: [PATCH] epoll: ensure ep_poll() doesn't miss wakeup events



On 5/3/20 6:24 AM, Roman Penyaev wrote:
> [...]
>  static inline int ep_events_available(struct eventpoll *ep)
>  {
> -       return !list_empty_careful(&ep->rdllist) ||
> -               READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
> +       return !list_empty_careful(&ep->rdllist);
>  }
> [...]

Hi Roman,

I think this misses an important case - the initial ep_poll_callback()
may queue to the overflow list. In that case ep_poll() has no visibility
into the event, since it's only checking ep->rdllist.

Here's a different approach using seqcount that avoids expanding the
ep->lock region.

Thanks,

-Jason


--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -221,6 +221,8 @@ struct eventpoll {
 	struct list_head visited_list_link;
 	int visited;

+	seqcount_t seq;
+
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	/* used to track busy poll napi_id */
 	unsigned int napi_id;
@@ -704,8 +706,10 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	 * in a lockless way.
 	 */
 	write_lock_irq(&ep->lock);
+	write_seqcount_begin(&ep->seq);
 	list_splice_init(&ep->rdllist, &txlist);
 	WRITE_ONCE(ep->ovflist, NULL);
+	write_seqcount_end(&ep->seq);
 	write_unlock_irq(&ep->lock);

 	/*
@@ -741,12 +745,14 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	 * releasing the lock, events will be queued in the normal way inside
 	 * ep->rdllist.
 	 */
+	write_seqcount_begin(&ep->seq);
 	WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);

 	/*
 	 * Quickly re-inject items left on "txlist".
 	 */
 	list_splice(&txlist, &ep->rdllist);
+	write_seqcount_end(&ep->seq);
 	__pm_relax(ep->ws);
 	write_unlock_irq(&ep->lock);

@@ -1025,6 +1031,7 @@ static int ep_alloc(struct eventpoll **pep)
 	ep->rbr = RB_ROOT_CACHED;
 	ep->ovflist = EP_UNACTIVE_PTR;
 	ep->user = user;
+	seqcount_init(&ep->seq);

 	*pep = ep;

@@ -1824,6 +1831,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	u64 slack = 0;
 	wait_queue_entry_t wait;
 	ktime_t expires, *to = NULL;
+	unsigned int seq;

 	lockdep_assert_irqs_enabled();

@@ -1900,7 +1908,10 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 			break;
 		}

-		eavail = ep_events_available(ep);
+		do {
+			seq = read_seqcount_begin(&ep->seq);
+			eavail = ep_events_available(ep);
+		} while (read_seqcount_retry(&ep->seq, seq));
 		if (eavail)
 			break;
 		if (signal_pending(current)) {
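
For anyone not familiar with it, the hunk above relies on the generic
seqcount idiom - an illustrative sketch, not epoll code (the variable
names here are made up):

static seqcount_t seq;		/* writers serialized by an external lock */
static int value_a, value_b;

static void writer(int a, int b)
{
	write_seqcount_begin(&seq);	/* readers retry while this is open */
	value_a = a;
	value_b = b;
	write_seqcount_end(&seq);
}

static void reader(int *a, int *b)
{
	unsigned int start;

	do {
		start = read_seqcount_begin(&seq);
		*a = value_a;
		*b = value_b;
	} while (read_seqcount_retry(&seq, start));	/* raced with a writer */
}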

2020-05-04 05:06:56

by Jason Baron

Subject: Re: [PATCH] epoll: ensure ep_poll() doesn't miss wakeup events



On 5/4/20 12:29 AM, Jason Baron wrote:
> [...]
> Hi Roman,
>
> I think this misses an important case - the initial ep_poll_callback()
> may queue to the overflow list. In that case ep_poll() has no visibility
> into the event, since it's only checking ep->rdllist.

Ok, my mistake - I see it sets: ep->rdllist.next = LIST_POISON1; for that
case. Ok I think this approach makes sense then.

Thanks,

-Jason

2020-05-04 09:43:52

by Roman Penyaev

Subject: Re: [PATCH] epoll: ensure ep_poll() doesn't miss wakeup events

On 2020-05-04 06:59, Jason Baron wrote:
> [...]
>> Hi Roman,
>>
>> I think this misses an important case - the initial ep_poll_callback()
>> may queue to the overflow list. In that case ep_poll() has no visibility
>> into the event, since it's only checking ep->rdllist.
>
> Ok, my mistake - I see it sets: ep->rdllist.next = LIST_POISON1; for that
> case. Ok I think this approach makes sense then.

Hi Jason,

I also want to emphasize that the original behavior should not change
(correct me if I'm wrong): callers of ep_events_available() can get a
false positive (something looks available, but when ->ovflist is resolved
under the lock it turns out there are no events to harvest). It has
always been like that, and this is fine, since the final decision is made
under the proper lock.

(I speak it out loud to accurately cover all the cases).
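
Schematically, the caller side then looks like this (a simplified sketch
of the ep_poll() flow, not the exact code):

fetch_events:
	/* lockless check - may be a false positive */
	eavail = ep_events_available(ep);
	if (!eavail)
		/* ... sleep until woken up or timed out, then re-check ... */ ;

	/*
	 * Harvesting happens under the proper locks; a false positive
	 * simply yields zero events here and we go back to waiting.
	 */
	res = ep_send_events(ep, events, maxevents);
	if (!res && !timed_out)
		goto fetch_events;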

I will take a couple of days for testing and then send out a patch.
No objections to that?

--
Roman