2017-12-14 15:24:32

by Christoph Hellwig

Subject: waitqueue lockdep annotation V3

Hi all,

this series adds a strategic lockdep_assert_held to __wake_up_common
to ensure callers really do hold the wait_queue_head lock when calling
the unlocked wake_up variants. It turns out epoll did not do this
for a fairly common path (hit all the time by systemd during bootup),
so the second patch fixed this instance as well.

Changes since V2:
- various typo fixes
- updated comments in epoll to always refer to ep->wq.lock
- add the new userfaultfd patch from Matthew

Changes since V1:
- remove ep->lock in epoll


2017-12-14 15:24:48

by Christoph Hellwig

Subject: [PATCH 2/3] userfaultfd: use fault_wqh lock

From: Matthew Wilcox <[email protected]>

The epoll code currently uses the unlocked waitqueue helpers for managing
fault_wqh, but instead of holding the waitqueue lock for this waitqueue
around these calls, it uses the waitqueue lock of fault_pending_wqh, which is
a different waitqueue instance. Given that the waitqueue is not exposed
to the rest of the kernel this actually works ok at the moment, but
prevents the epoll locking rules from being enforced using lockdep.

Switch to the internally locked waitqueue helpers instead. This means
that the lock inside fault_wqh now nests inside the fault_pending_wqh
lock, but that's not a problem since it was entireyl unused before.
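
In simplified form (a sketch only, mirroring the first hunk below), the
resulting nesting in userfaultfd_release() is:

        spin_lock(&ctx->fault_pending_wqh.lock);
        __wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, &range);
        /* __wake_up() takes ctx->fault_wqh.lock, nested inside the above: */
        __wake_up(&ctx->fault_wqh, TASK_NORMAL, 1, &range);
        spin_unlock(&ctx->fault_pending_wqh.lock);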

Signed-off-by: Matthew Wilcox <[email protected]>
[hch: slight changelog updates]
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/userfaultfd.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index ac9a4e65ca49..a39bc3237b68 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -879,7 +879,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
*/
spin_lock(&ctx->fault_pending_wqh.lock);
__wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, &range);
- __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, &range);
+ __wake_up(&ctx->fault_wqh, TASK_NORMAL, 1, &range);
spin_unlock(&ctx->fault_pending_wqh.lock);

/* Flush pending events that may still wait on event_wqh */
@@ -1045,7 +1045,7 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
* anyway.
*/
list_del(&uwq->wq.entry);
- __add_wait_queue(&ctx->fault_wqh, &uwq->wq);
+ add_wait_queue(&ctx->fault_wqh, &uwq->wq);

write_seqcount_end(&ctx->refile_seq);

@@ -1194,7 +1194,7 @@ static void __wake_userfault(struct userfaultfd_ctx *ctx,
__wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL,
range);
if (waitqueue_active(&ctx->fault_wqh))
- __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, range);
+ __wake_up(&ctx->fault_wqh, TASK_NORMAL, 1, range);
spin_unlock(&ctx->fault_pending_wqh.lock);
}

--
2.14.2

2017-12-14 15:24:55

by Christoph Hellwig

Subject: [PATCH 3/3] sched/wait: assert the wait_queue_head lock is held in __wake_up_common

Better to ensure we actually hold the lock using lockdep than to just comment
on it. Due to the various exported _locked interfaces it is far too easy
to get the locking wrong.
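
For reference, lockdep_assert_held() is a debug-only check that compiles
away when lockdep is disabled; its definition is approximately (from
<linux/lockdep.h> of this era):

        #define lockdep_assert_held(l) \
                do { WARN_ON(debug_locks && !lockdep_is_held(l)); } while (0)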

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
---
kernel/sched/wait.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 929ecb7d6b78..d09c4de4865c 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -76,6 +76,8 @@ static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
wait_queue_entry_t *curr, *next;
int cnt = 0;

+ lockdep_assert_held(&wq_head->lock);
+
if (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) {
curr = list_next_entry(bookmark, entry);

--
2.14.2

2017-12-14 15:25:13

by Christoph Hellwig

Subject: [PATCH 1/3] epoll: use the waitqueue lock to protect ep->wq

The epoll code currently uses the unlocked waitqueue helpers for managing
ep->wq, but instead of holding the waitqueue lock around these calls, it
uses its own ep->lock spinlock. Given that the waitqueue is not exposed
to the rest of the kernel this actually works ok at the moment, but
prevents the epoll locking rules from being enforced using lockdep.
Remove ep->lock and use the waitqueue lock to not only reduce the size of
struct eventpoll but also to make sure we can assert locking invariants in
the waitqueue code.
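
As a sketch of why this works: struct wait_queue_head already embeds the
spinlock, so epoll can serialize on ep->wq.lock directly (the definition is
shown approximately; the usage below is taken from the patch):

        struct wait_queue_head {
                spinlock_t              lock;
                struct list_head        head;
        };

        /* a former ep->lock critical section then becomes, e.g.: */
        spin_lock_irqsave(&ep->wq.lock, flags);
        list_splice_init(&ep->rdllist, &txlist);
        spin_unlock_irqrestore(&ep->wq.lock, flags);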

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jason Baron <[email protected]>
---
fs/eventpoll.c | 65 ++++++++++++++++++++++++++--------------------------------
1 file changed, 29 insertions(+), 36 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index afd548ebc328..d32617582432 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -50,10 +50,10 @@
*
* 1) epmutex (mutex)
* 2) ep->mtx (mutex)
- * 3) ep->lock (spinlock)
+ * 3) ep->wq.lock (spinlock)
*
* The acquire order is the one listed above, from 1 to 3.
- * We need a spinlock (ep->lock) because we manipulate objects
+ * We need a spinlock (ep->wq.lock) because we manipulate objects
* from inside the poll callback, that might be triggered from
* a wake_up() that in turn might be called from IRQ context.
* So we can't sleep inside the poll callback and hence we need
@@ -85,7 +85,7 @@
* of epoll file descriptors, we use the current recursion depth as
* the lockdep subkey.
* It is possible to drop the "ep->mtx" and to use the global
- * mutex "epmutex" (together with "ep->lock") to have it working,
+ * mutex "epmutex" (together with "ep->wq.lock") to have it working,
* but having "ep->mtx" will make the interface more scalable.
* Events that require holding "epmutex" are very rare, while for
* normal operations the epoll private "ep->mtx" will guarantee
@@ -182,11 +182,10 @@ struct epitem {
* This structure is stored inside the "private_data" member of the file
* structure and represents the main data structure for the eventpoll
* interface.
+ *
+ * Access to it is protected by the lock inside wq.
*/
struct eventpoll {
- /* Protect the access to this structure */
- spinlock_t lock;
-
/*
* This mutex is used to ensure that files are not removed
* while epoll is using them. This is held during the event
@@ -210,7 +209,7 @@ struct eventpoll {
/*
* This is a single linked list that chains all the "struct epitem" that
* happened while transferring ready events to userspace w/out
- * holding ->lock.
+ * holding ->wq.lock.
*/
struct epitem *ovflist;

@@ -686,17 +685,17 @@ static int ep_scan_ready_list(struct eventpoll *ep,
* because we want the "sproc" callback to be able to do it
* in a lockless way.
*/
- spin_lock_irqsave(&ep->lock, flags);
+ spin_lock_irqsave(&ep->wq.lock, flags);
list_splice_init(&ep->rdllist, &txlist);
ep->ovflist = NULL;
- spin_unlock_irqrestore(&ep->lock, flags);
+ spin_unlock_irqrestore(&ep->wq.lock, flags);

/*
* Now call the callback function.
*/
error = (*sproc)(ep, &txlist, priv);

- spin_lock_irqsave(&ep->lock, flags);
+ spin_lock_irqsave(&ep->wq.lock, flags);
/*
* During the time we spent inside the "sproc" callback, some
* other events might have been queued by the poll callback.
@@ -738,7 +737,7 @@ static int ep_scan_ready_list(struct eventpoll *ep,
if (waitqueue_active(&ep->poll_wait))
pwake++;
}
- spin_unlock_irqrestore(&ep->lock, flags);
+ spin_unlock_irqrestore(&ep->wq.lock, flags);

if (!ep_locked)
mutex_unlock(&ep->mtx);
@@ -766,12 +765,7 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
struct file *file = epi->ffd.file;

/*
- * Removes poll wait queue hooks. We _have_ to do this without holding
- * the "ep->lock" otherwise a deadlock might occur. This because of the
- * sequence of the lock acquisition. Here we do "ep->lock" then the wait
- * queue head lock when unregistering the wait queue. The wakeup callback
- * will run by holding the wait queue head lock and will call our callback
- * that will try to get "ep->lock".
+ * Removes poll wait queue hooks.
*/
ep_unregister_pollwait(ep, epi);

@@ -782,10 +776,10 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)

rb_erase_cached(&epi->rbn, &ep->rbr);

- spin_lock_irqsave(&ep->lock, flags);
+ spin_lock_irqsave(&ep->wq.lock, flags);
if (ep_is_linked(&epi->rdllink))
list_del_init(&epi->rdllink);
- spin_unlock_irqrestore(&ep->lock, flags);
+ spin_unlock_irqrestore(&ep->wq.lock, flags);

wakeup_source_unregister(ep_wakeup_source(epi));
/*
@@ -835,7 +829,7 @@ static void ep_free(struct eventpoll *ep)
* Walks through the whole tree by freeing each "struct epitem". At this
* point we are sure no poll callbacks will be lingering around, and also by
* holding "epmutex" we can be sure that no file cleanup code will hit
- * us during this operation. So we can avoid the lock on "ep->lock".
+ * us during this operation. So we can avoid the lock on "ep->wq.lock".
* We do not need to lock ep->mtx, either, we only do it to prevent
* a lockdep warning.
*/
@@ -1015,7 +1009,6 @@ static int ep_alloc(struct eventpoll **pep)
if (unlikely(!ep))
goto free_uid;

- spin_lock_init(&ep->lock);
mutex_init(&ep->mtx);
init_waitqueue_head(&ep->wq);
init_waitqueue_head(&ep->poll_wait);
@@ -1119,7 +1112,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
struct eventpoll *ep = epi->ep;
int ewake = 0;

- spin_lock_irqsave(&ep->lock, flags);
+ spin_lock_irqsave(&ep->wq.lock, flags);

ep_set_busy_poll_napi_id(epi);

@@ -1196,7 +1189,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
pwake++;

out_unlock:
- spin_unlock_irqrestore(&ep->lock, flags);
+ spin_unlock_irqrestore(&ep->wq.lock, flags);

/* We have to call this outside the lock */
if (pwake)
@@ -1480,7 +1473,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
goto error_remove_epi;

/* We have to drop the new item inside our item list to keep track of it */
- spin_lock_irqsave(&ep->lock, flags);
+ spin_lock_irqsave(&ep->wq.lock, flags);

/* record NAPI ID of new item if present */
ep_set_busy_poll_napi_id(epi);
@@ -1497,7 +1490,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
pwake++;
}

- spin_unlock_irqrestore(&ep->lock, flags);
+ spin_unlock_irqrestore(&ep->wq.lock, flags);

atomic_long_inc(&ep->user->epoll_watches);

@@ -1523,10 +1516,10 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
* list, since that is used/cleaned only inside a section bound by "mtx".
* And ep_insert() is called with "mtx" held.
*/
- spin_lock_irqsave(&ep->lock, flags);
+ spin_lock_irqsave(&ep->wq.lock, flags);
if (ep_is_linked(&epi->rdllink))
list_del_init(&epi->rdllink);
- spin_unlock_irqrestore(&ep->lock, flags);
+ spin_unlock_irqrestore(&ep->wq.lock, flags);

wakeup_source_unregister(ep_wakeup_source(epi));

@@ -1568,9 +1561,9 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
* 1) Flush epi changes above to other CPUs. This ensures
* we do not miss events from ep_poll_callback if an
* event occurs immediately after we call f_op->poll().
- * We need this because we did not take ep->lock while
+ * We need this because we did not take ep->wq.lock while
* changing epi above (but ep_poll_callback does take
- * ep->lock).
+ * ep->wq.lock).
*
* 2) We also need to ensure we do not miss _past_ events
* when calling f_op->poll(). This barrier also
@@ -1593,7 +1586,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
* list, push it inside.
*/
if (revents & event->events) {
- spin_lock_irq(&ep->lock);
+ spin_lock_irq(&ep->wq.lock);
if (!ep_is_linked(&epi->rdllink)) {
list_add_tail(&epi->rdllink, &ep->rdllist);
ep_pm_stay_awake(epi);
@@ -1604,7 +1597,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
if (waitqueue_active(&ep->poll_wait))
pwake++;
}
- spin_unlock_irq(&ep->lock);
+ spin_unlock_irq(&ep->wq.lock);
}

/* We have to call this outside the lock */
@@ -1754,7 +1747,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
* caller specified a non blocking operation.
*/
timed_out = 1;
- spin_lock_irqsave(&ep->lock, flags);
+ spin_lock_irqsave(&ep->wq.lock, flags);
goto check_events;
}

@@ -1763,7 +1756,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
if (!ep_events_available(ep))
ep_busy_loop(ep, timed_out);

- spin_lock_irqsave(&ep->lock, flags);
+ spin_lock_irqsave(&ep->wq.lock, flags);

if (!ep_events_available(ep)) {
/*
@@ -1805,11 +1798,11 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
break;
}

- spin_unlock_irqrestore(&ep->lock, flags);
+ spin_unlock_irqrestore(&ep->wq.lock, flags);
if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
timed_out = 1;

- spin_lock_irqsave(&ep->lock, flags);
+ spin_lock_irqsave(&ep->wq.lock, flags);
}

__remove_wait_queue(&ep->wq, &wait);
@@ -1819,7 +1812,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
/* Is it worth to try to dig for events ? */
eavail = ep_events_available(ep);

- spin_unlock_irqrestore(&ep->lock, flags);
+ spin_unlock_irqrestore(&ep->wq.lock, flags);

/*
* Try to transfer events to user space. In case we get 0 events and
--
2.14.2

2017-12-17 07:10:24

by Mike Rapoport

Subject: Re: [PATCH 2/3] userfaultfd: use fault_wqh lock

Hi,

On Thu, Dec 14, 2017 at 04:23:43PM +0100, Christoph Hellwig wrote:
> From: Matthew Wilcox <[email protected]>
>
> The epoll code currently uses the unlocked waitqueue helpers for managing

The userfaultfd code

> fault_wqh, but instead of holding the waitqueue lock for this waitqueue
> around these calls, it uses the waitqueue lock of fault_pending_wqh, which is
> a different waitqueue instance. Given that the waitqueue is not exposed
> to the rest of the kernel this actually works ok at the moment, but
> prevents the epoll locking rules from being enforced using lockdep.

ditto

> Switch to the internally locked waitqueue helpers instead. This means
> that the lock inside fault_wqh now nests inside the fault_pending_wqh
> lock, but that's not a problem since it was entireyl unused before.

spelling: entirely

> Signed-off-by: Matthew Wilcox <[email protected]>
> [hch: slight changelog updates]
> Signed-off-by: Christoph Hellwig <[email protected]>

Reviewed-by: Mike Rapoport <[email protected]>

> ---
> fs/userfaultfd.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index ac9a4e65ca49..a39bc3237b68 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -879,7 +879,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
> */
> spin_lock(&ctx->fault_pending_wqh.lock);
> __wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, &range);
> - __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, &range);
> + __wake_up(&ctx->fault_wqh, TASK_NORMAL, 1, &range);
> spin_unlock(&ctx->fault_pending_wqh.lock);
>
> /* Flush pending events that may still wait on event_wqh */
> @@ -1045,7 +1045,7 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
> * anyway.
> */
> list_del(&uwq->wq.entry);
> - __add_wait_queue(&ctx->fault_wqh, &uwq->wq);
> + add_wait_queue(&ctx->fault_wqh, &uwq->wq);
>
> write_seqcount_end(&ctx->refile_seq);
>
> @@ -1194,7 +1194,7 @@ static void __wake_userfault(struct userfaultfd_ctx *ctx,
> __wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL,
> range);
> if (waitqueue_active(&ctx->fault_wqh))
> - __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, range);
> + __wake_up(&ctx->fault_wqh, TASK_NORMAL, 1, range);
> spin_unlock(&ctx->fault_pending_wqh.lock);
> }
>
> --
> 2.14.2
>

--
Sincerely yours,
Mike.

2018-07-12 19:20:08

by Davidlohr Bueso

Subject: Re: waitqueue lockdep annotation V3

On Thu, 14 Dec 2017, Christoph Hellwig wrote:

>Hi all,
>
>this series adds a strategic lockdep_assert_held to __wake_up_common
>to ensure callers really do hold the wait_queue_head lock when calling
>the unlocked wake_up variants. It turns out epoll did not do this
>for a fairly common path (hit all the time by systemd during bootup),
>so the second patch fixed this instance as well.

I ran into these changes because of patch 1 getting rid of ep->lock. Is
there any reason why this series was never picked up?

Thanks,
Davidlohr

2018-07-17 14:23:00

by Christoph Hellwig

Subject: Re: waitqueue lockdep annotation V3

On Thu, Jul 12, 2018 at 12:17:53PM -0700, Davidlohr Bueso wrote:
> On Thu, 14 Dec 2017, Christoph Hellwig wrote:
>
>> Hi all,
>>
>> this series adds a strategic lockdep_assert_held to __wake_up_common
>> to ensure callers really do hold the wait_queue_head lock when calling
>> the unlocked wake_up variants. It turns out epoll did not do this
>> for a fairly common path (hit all the time by systemd during bootup),
>> so the second patch fixed this instance as well.
>
> I ran into these changes because of patch 1 getting rid of ep->lock. Is
> there any reason why this series was never picked up?

I'd love to see this merged, but I never heard back about it.

2018-07-17 15:06:47

by Peter Zijlstra

Subject: Re: waitqueue lockdep annotation V3

On Tue, Jul 17, 2018 at 04:24:37PM +0200, Christoph Hellwig wrote:
> On Thu, Jul 12, 2018 at 12:17:53PM -0700, Davidlohr Bueso wrote:
> > On Thu, 14 Dec 2017, Christoph Hellwig wrote:
> >
> >> Hi all,
> >>
> >> this series adds a strategic lockdep_assert_held to __wake_up_common
> >> to ensure callers really do hold the wait_queue_head lock when calling
> >> the unlocked wake_up variants. It turns out epoll did not do this
> >> for a fairly common path (hit all the time by systemd during bootup),
> >> so the second patch fixed this instance as well.
> >
> > I ran into these changes because of patch 1 getting rid of ep->lock. Is
> > there any reason why this series was never picked up?
>
> I'd love to see this merged, but I never heard back about it.

Seeing how it touched fs bits, I was expecting this to go through the
vfs tree; was that not the intended target?

2018-07-17 15:07:44

by Christoph Hellwig

Subject: Re: waitqueue lockdep annotation V3

On Tue, Jul 17, 2018 at 05:04:39PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 17, 2018 at 04:24:37PM +0200, Christoph Hellwig wrote:
> > On Thu, Jul 12, 2018 at 12:17:53PM -0700, Davidlohr Bueso wrote:
> > > On Thu, 14 Dec 2017, Christoph Hellwig wrote:
> > >
> > >> Hi all,
> > >>
> > >> this series adds a strategic lockdep_assert_held to __wake_up_common
> > >> to ensure callers really do hold the wait_queue_head lock when calling
> > >> the unlocked wake_up variants. It turns out epoll did not do this
> > >> for a fairly common path (hit all the time by systemd during bootup),
> > >> so the second patch fixed this instance as well.
> > >
> > > I ran into these changes because of patch 1 getting rid of ep->lock. Is
> > > there any reason why this series was never picked up?
> >
> > I'd love to see this merged, but I never heard back about it.
>
> Seeing how it touched fs bits, I was expecting this to go through the
> vfs tree; was that not the intended target?

I don't really care which way it goes, although tip is more reliable
at picking up changes than the vfs tree...