2020-02-10 09:42:42

by Roman Penyaev

Subject: [PATCH v2 1/3] epoll: fix possible lost wakeup on epoll_ctl() path

This fixes a possible lost wakeup introduced by commit a218cc491420.
Originally modifications to ep->wq were serialized by ep->wq.lock, but
in commit a218cc491420 a new rw lock was introduced in order to relax
the fd event path, i.e. callers of the ep_poll_callback() function.

After that change ep_modify() and ep_insert() (both called on the
epoll_ctl() path) were switched to ep->lock, but ep_poll()
(epoll_wait) still used ep->wq.lock for wqueue list modification.

The bug doesn't lead to any wqueue list corruption, because the wake
up path and the list modifications are still serialized by ep->wq.lock
internally, but the waitqueue_active() check prior to the wake_up()
call can be reordered with the modifications of the ep ready list,
thus a wake up can be lost.
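
To illustrate, here is a minimal userspace analogy of that interleaving
(C11 relaxed atomics, hypothetical names, not the kernel code; the
kernel waiter gets its ordering from set_current_state(), while the
waker side has none without a common lock or smp_mb()):

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool event_ready;   /* ~ "ep->rdllist is not empty"  */
    static atomic_bool waiter_queued; /* ~ "waitqueue_active(&ep->wq)" */

    /* epoll_wait() side: announce ourselves, then look for events */
    static bool waiter_should_sleep(void)
    {
        atomic_store_explicit(&waiter_queued, true, memory_order_relaxed);
        /* the kernel waiter is ordered by the barrier in set_current_state() */
        return !atomic_load_explicit(&event_ready, memory_order_relaxed);
    }

    /* epoll_ctl() side: publish the event, then look for waiters */
    static bool waker_should_wake(void)
    {
        atomic_store_explicit(&event_ready, true, memory_order_relaxed);
        /*
         * Without smp_mb() (or a lock shared with the waiter) this load
         * can be satisfied before the store above becomes visible, so
         * both sides see "nothing to do" and the wakeup is lost.
         */
        return atomic_load_explicit(&waiter_queued, memory_order_relaxed);
    }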

And yes, it can be healed by an explicit smp_mb():

    list_add_tail(&epi->rdllink, &ep->rdllist);
    smp_mb();
    if (waitqueue_active(&ep->wq))
        wake_up(&ep->wq);

But let's keep it simple: this patch replaces ep->wq.lock with
ep->lock for the wqueue modifications, so the wake up path always
observes the activeness of the wqueue correctly.

Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
References: https://bugzilla.kernel.org/show_bug.cgi?id=205933
Signed-off-by: Roman Penyaev <[email protected]>
Reported-by: Max Neunhoeffer <[email protected]>
Bisected-by: Max Neunhoeffer <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Christopher Kohlhoff <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Nothing interesting in v2:
changed the comment a bit and specified Reported-by and Bisected-by tags

fs/eventpoll.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index b041b66002db..eee3c92a9ebf 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1854,9 +1854,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
waiter = true;
init_waitqueue_entry(&wait, current);

- spin_lock_irq(&ep->wq.lock);
+ write_lock_irq(&ep->lock);
__add_wait_queue_exclusive(&ep->wq, &wait);
- spin_unlock_irq(&ep->wq.lock);
+ write_unlock_irq(&ep->lock);
}

for (;;) {
@@ -1904,9 +1904,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
goto fetch_events;

if (waiter) {
- spin_lock_irq(&ep->wq.lock);
+ write_lock_irq(&ep->lock);
__remove_wait_queue(&ep->wq, &wait);
- spin_unlock_irq(&ep->wq.lock);
+ write_unlock_irq(&ep->lock);
}

return res;
--
2.24.1


2020-02-10 09:43:19

by Roman Penyaev

Subject: [PATCH v2 2/3] epoll: ep->wq can be woken up unlocked in certain cases

Now ep->lock is responsible for wqueue serialization, thus if ep->lock
is taken on the write path, wake_up_locked() can be invoked.

The read path is different, though. Since concurrent CPUs can enter
the wake up function, it needs to be internally serialized, thus the
wake_up() variant is used, which takes the internal wqueue spin lock.
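
To illustrate the intended discipline, a rough sketch (simplified, not
the literal fs/eventpoll.c code; the poll_wait side and error handling
are omitted):

    /* Write path (ep_insert()/ep_modify()): ep->lock is held for write,
     * so no other ep->wq user can run concurrently and the unlocked
     * wakeup variant is sufficient.
     */
    write_lock_irq(&ep->lock);
    list_add_tail(&epi->rdllink, &ep->rdllist);
    if (waitqueue_active(&ep->wq))
        wake_up_locked(&ep->wq);
    write_unlock_irq(&ep->lock);

    /* Read path (ep_poll_callback()): several CPUs may hold ep->lock
     * for read at the same time, so the wakeup must take the
     * wq-internal lock itself.
     */
    read_lock_irqsave(&ep->lock, flags);
    /* ... queue the event to ep->rdllist (lockless) ... */
    if (waitqueue_active(&ep->wq))
        wake_up(&ep->wq);
    read_unlock_irqrestore(&ep->lock, flags);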

Signed-off-by: Roman Penyaev <[email protected]>
Cc: Max Neunhoeffer <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Christopher Kohlhoff <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Nothing interesting in v2:
changed the comment a bit

fs/eventpoll.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index eee3c92a9ebf..6e218234bd4a 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1173,7 +1173,7 @@ static inline bool chain_epi_lockless(struct epitem *epi)
* Another thing worth to mention is that ep_poll_callback() can be called
* concurrently for the same @epi from different CPUs if poll table was inited
* with several wait queues entries. Plural wakeup from different CPUs of a
- * single wait queue is serialized by wq.lock, but the case when multiple wait
+ * single wait queue is serialized by ep->lock, but the case when multiple wait
* queues are used should be detected accordingly. This is detected using
* cmpxchg() operation.
*/
@@ -1248,6 +1248,12 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
break;
}
}
+ /*
+ * Since here we have the read lock (ep->lock) taken, plural
+ * wakeup from different CPUs can occur, thus we call wake_up()
+ * variant which implies its own lock on wqueue. All other paths
+ * take write lock.
+ */
wake_up(&ep->wq);
}
if (waitqueue_active(&ep->poll_wait))
@@ -1551,7 +1557,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,

/* Notify waiting tasks that events are available */
if (waitqueue_active(&ep->wq))
- wake_up(&ep->wq);
+ wake_up_locked(&ep->wq);
if (waitqueue_active(&ep->poll_wait))
pwake++;
}
@@ -1657,7 +1663,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,

/* Notify waiting tasks that events are available */
if (waitqueue_active(&ep->wq))
- wake_up(&ep->wq);
+ wake_up_locked(&ep->wq);
if (waitqueue_active(&ep->poll_wait))
pwake++;
}
--
2.24.1

2020-02-10 18:19:11

by Jason Baron

Subject: Re: [PATCH v2 2/3] epoll: ep->wq can be woken up unlocked in certain cases



On 2/10/20 4:41 AM, Roman Penyaev wrote:
> Now ep->lock is responsible for wqueue serialization, thus if ep->lock
> is taken on the write path, wake_up_locked() can be invoked.
>
> The read path is different, though. Since concurrent CPUs can enter
> the wake up function, it needs to be internally serialized, thus the
> wake_up() variant is used, which takes the internal wqueue spin lock.
>
> Signed-off-by: Roman Penyaev <[email protected]>
> Cc: Max Neunhoeffer <[email protected]>
> Cc: Jakub Kicinski <[email protected]>
> Cc: Christopher Kohlhoff <[email protected]>
> Cc: Davidlohr Bueso <[email protected]>
> Cc: Jason Baron <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> Nothing interesting in v2:
> changed the comment a bit
>
> fs/eventpoll.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index eee3c92a9ebf..6e218234bd4a 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1173,7 +1173,7 @@ static inline bool chain_epi_lockless(struct epitem *epi)
> * Another thing worth to mention is that ep_poll_callback() can be called
> * concurrently for the same @epi from different CPUs if poll table was inited
> * with several wait queues entries. Plural wakeup from different CPUs of a
> - * single wait queue is serialized by wq.lock, but the case when multiple wait
> + * single wait queue is serialized by ep->lock, but the case when multiple wait
> * queues are used should be detected accordingly. This is detected using
> * cmpxchg() operation.
> */
> @@ -1248,6 +1248,12 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
> break;
> }
> }
> + /*
> + * Since here we have the read lock (ep->lock) taken, plural
> + * wakeup from different CPUs can occur, thus we call wake_up()
> + * variant which implies its own lock on wqueue. All other paths
> + * take write lock.
> + */
> wake_up(&ep->wq);
> }
> if (waitqueue_active(&ep->poll_wait))
> @@ -1551,7 +1557,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
>
> /* Notify waiting tasks that events are available */
> if (waitqueue_active(&ep->wq))
> - wake_up(&ep->wq);
> + wake_up_locked(&ep->wq);


So I think this will now hit the 'lockdep_assert_held()' in
__wake_up_common()? I agree that it's correct, but I think it will
confuse lockdep here...
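
Roughly what I mean (a simplified sketch, not the literal
kernel/sched/wait.c code):

    static void __wake_up_common(struct wait_queue_head *wq_head /* , ... */)
    {
        /* expects the wait queue's own lock to be held ... */
        lockdep_assert_held(&wq_head->lock);
        /*
         * ... but ep_insert()/ep_modify() now hold ep->lock instead of
         * ep->wq.lock at this point, so the assert fires under
         * CONFIG_LOCKDEP even though the wakeup itself is serialized
         * correctly.
         */
    }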

Thanks,

-Jason

2020-02-10 19:32:31

by Roman Penyaev

Subject: Re: [PATCH v2 2/3] epoll: ep->wq can be woken up unlocked in certain cases

On 2020-02-10 19:16, Jason Baron wrote:
> On 2/10/20 4:41 AM, Roman Penyaev wrote:
>> Now ep->lock is responsible for wqueue serialization, thus if ep->lock
>> is taken on the write path, wake_up_locked() can be invoked.
>>
>> The read path is different, though. Since concurrent CPUs can enter
>> the wake up function, it needs to be internally serialized, thus the
>> wake_up() variant is used, which takes the internal wqueue spin lock.
>>
>> Signed-off-by: Roman Penyaev <[email protected]>
>> Cc: Max Neunhoeffer <[email protected]>
>> Cc: Jakub Kicinski <[email protected]>
>> Cc: Christopher Kohlhoff <[email protected]>
>> Cc: Davidlohr Bueso <[email protected]>
>> Cc: Jason Baron <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> ---
>> Nothing interesting in v2:
>> changed the comment a bit
>>
>> fs/eventpoll.c | 12 +++++++++---
>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index eee3c92a9ebf..6e218234bd4a 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -1173,7 +1173,7 @@ static inline bool chain_epi_lockless(struct epitem *epi)
>> * Another thing worth to mention is that ep_poll_callback() can be called
>> * concurrently for the same @epi from different CPUs if poll table was inited
>> * with several wait queues entries. Plural wakeup from different CPUs of a
>> - * single wait queue is serialized by wq.lock, but the case when multiple wait
>> + * single wait queue is serialized by ep->lock, but the case when multiple wait
>> * queues are used should be detected accordingly. This is detected using
>> * cmpxchg() operation.
>> */
>> @@ -1248,6 +1248,12 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>> break;
>> }
>> }
>> + /*
>> + * Since here we have the read lock (ep->lock) taken, plural
>> + * wakeup from different CPUs can occur, thus we call wake_up()
>> + * variant which implies its own lock on wqueue. All other paths
>> + * take write lock.
>> + */
>> wake_up(&ep->wq);
>> }
>> if (waitqueue_active(&ep->poll_wait))
>> @@ -1551,7 +1557,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
>>
>> /* Notify waiting tasks that events are available */
>> if (waitqueue_active(&ep->wq))
>> - wake_up(&ep->wq);
>> + wake_up_locked(&ep->wq);
>
>
> So I think this will now hit the 'lockdep_assert_held()' in
> __wake_up_common()? I agree that it's correct, but I think it will
> confuse lockdep here...

Argh! True. And I do not see any neat way to shut up lockdep here
(calling lock_acquire() manually does not seem like an option for such
a minor thing).

Then this optimization is not needed, the patch is cancelled.

Thanks for noting that.

--
Roman