LinuxLists.cc - [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On Mon, Sep 11, 2023 at 04:11:35PM +0200, Peter Zijlstra wrote:

> Now.. the reason we need hb->lock at this point is to avoid the
> wake_futex_pi() -EAGAIN case.
>
> This happens when futex_top_waiter() and rt_mutex_top_waiter() state
> becomes inconsistent. The current rules are such that this inconsistency
> will not be observed.
>
> Notably the case that needs to be avoided is where futex_lock_pi() and
> futex_unlock_pi() interleave such that unlock will fail to observe a new
> waiter.
>
> *However* the case at hand is where a waiter is leaving, in this case
> what happens is that we'll fail to observe a waiter that is already
> gone, which is harmless afaict.
>
> Would not something like the below work instead?

I forgot to say this is complete untested and hasn't evne seen a
compiler up close...

> ---
> diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
> index ce2889f12375..8c76a52da9bd 100644
> --- a/kernel/futex/pi.c
> +++ b/kernel/futex/pi.c
> @@ -610,29 +610,16 @@ int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
> /*
> * Caller must hold a reference on @pi_state.
> */
> -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
> +static int wake_futex_pi(u32 __user *uaddr, u32 uval,
> + struct futex_pi_state *pi_state,
> + rt_mutex_waiter *top_waiter)
> {
> - struct rt_mutex_waiter *top_waiter;
> struct task_struct *new_owner;
> bool postunlock = false;
> DEFINE_RT_WAKE_Q(wqh);
> u32 curval, newval;
> int ret = 0;
>
> - top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> - if (WARN_ON_ONCE(!top_waiter)) {
> - /*
> - * As per the comment in futex_unlock_pi() this should not happen.
> - *
> - * When this happens, give up our locks and try again, giving
> - * the futex_lock_pi() instance time to complete, either by
> - * waiting on the rtmutex or removing itself from the futex
> - * queue.
> - */
> - ret = -EAGAIN;
> - goto out_unlock;
> - }
> -
> new_owner = top_waiter->task;
>
> /*
> @@ -1039,19 +1026,27 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
> ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
>
> cleanup:
> - spin_lock(q.lock_ptr);
> /*
> * If we failed to acquire the lock (deadlock/signal/timeout), we must
> * first acquire the hb->lock before removing the lock from the
> * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
> * lists consistent.
> *
> - * In particular; it is important that futex_unlock_pi() can not
> - * observe this inconsistency.
> + * Cannot hold hb->lock because rt_mutex already has a waiter enqueued
> + * and hb->lock can itself try and enqueue an rt_waiter through rtlock.
> + *
> + * Doing the cleanup without holding hb->lock can cause inconsistent
> + * state between hb and pi_state, but only in the direction of seeing a
> + * waiter that is leaving.
> */
> if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
> ret = 0;
>
> + /*
> + * Now that the rt_waiter has been dequeued, it is safe to use
> + * spinlock/rtlock, which will enqueue a new rt_waiter.
> + */
> + spin_lock(q.lock_ptr);
> no_block:
> /*
> * Fixup the pi_state owner and possibly acquire the lock if we
> @@ -1132,6 +1127,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> top_waiter = futex_top_waiter(hb, &key);
> if (top_waiter) {
> struct futex_pi_state *pi_state = top_waiter->pi_state;
> + struct rt_mutex_waiter *rt_waiter;
>
> ret = -EINVAL;
> if (!pi_state)
> @@ -1147,19 +1143,34 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> get_pi_state(pi_state);
> /*
> * By taking wait_lock while still holding hb->lock, we ensure
> - * there is no point where we hold neither; and therefore
> - * wake_futex_p() must observe a state consistent with what we
> - * observed.
> + * there is no point where we hold neither; and thereby
> + * wake_futex_pi() must observe any new waiters.
> + *
> + * Since the cleanup: case in futex_lock_pi() removes the
> + * rt_waiter without holding hb->lock, it is possible for
> + * wake_futex_pi() to not find a waiter while the above does,
> + * in this case the waiter is on the way out and it can be
> + * ignored.
> *
> * In particular; this forces __rt_mutex_start_proxy() to
> * complete such that we're guaranteed to observe the
> - * rt_waiter. Also see the WARN in wake_futex_pi().
> + * rt_waiter.
> */
> raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> +
> + /*
> + * Futex vs rt_mutex waiter state -- if there are on rt_mutex
> + * waiters even though futex thinkgs there are, then the waiter
> + * is leaving and the uncontended path is safe to take.
> + */
> + rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> + if (!rt_waiter)
> + goto do_uncontended;

That ^ needs to drop wait_lock before the goto.

> +
> spin_unlock(&hb->lock);
>
> /* drops pi_state->pi_mutex.wait_lock */
> - ret = wake_futex_pi(uaddr, uval, pi_state);
> + ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
>
> put_pi_state(pi_state);
>
> @@ -1187,6 +1198,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> return ret;
> }
>
> +do_uncontended:
> /*
> * We have no kernel internal state, i.e. no waiters in the
> * kernel. Waiters which are about to queue themselves are stuck
>

2023-09-12 12:54:56

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On 2023-09-12 12:57:45 [+0200], Peter Zijlstra wrote:
> > --- a/kernel/futex/pi.c
> > +++ b/kernel/futex/pi.c
> > @@ -1147,19 +1143,34 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
…
> > + rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> > + if (!rt_waiter)
> > + goto do_uncontended;
>
> That ^ needs to drop wait_lock before the goto.

Ach, you noticed that one. Wrote a reply a few minutes after that one.

Sebastian

2023-09-12 14:54:00

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On Tue, Sep 12, 2023 at 01:17:11PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-09-11 16:11:35 [+0200], Peter Zijlstra wrote:
> > Aside from this being just plain gross, this also destroys determinism
> > of futex_pi, which completely defeats the purpose of the whole thing.
>
> No objections here.
>
> > Now.. the reason we need hb->lock at this point is to avoid the
> > wake_futex_pi() -EAGAIN case.
> >
> > This happens when futex_top_waiter() and rt_mutex_top_waiter() state
> > becomes inconsistent. The current rules are such that this inconsistency
> > will not be observed.
> >
> > Notably the case that needs to be avoided is where futex_lock_pi() and
> > futex_unlock_pi() interleave such that unlock will fail to observe a new
> > waiter.
> >
> > *However* the case at hand is where a waiter is leaving, in this case
> > what happens is that we'll fail to observe a waiter that is already
> > gone, which is harmless afaict.
>
> Sounds harmless. I sure something will be pop up in a few years and we
> will look back this ;)

Oh absolutely, I already hate this... It's dangerous as heck, because
if we ever do trip the race the other way around it will silently
misbehave.

However, compared to the other hack ... :-(

2023-09-12 17:08:00

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On 2023-09-12 13:25:51 [+0200], Peter Zijlstra wrote:
>
> And I think there was a requeue site that wants updating too.. the above
> was more or less a starting point for discussion :-)

Yes. I spent the morning testing it so I can contribute to the
discussion ;)

Sebastian

2023-09-12 18:41:27

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On Tue, Sep 12, 2023 at 01:17:11PM +0200, Sebastian Andrzej Siewior wrote:

> Comments follow…
>
> diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
> index f8e65b27d9d6b..d8866278e92ff 100644
> --- a/kernel/futex/pi.c
> +++ b/kernel/futex/pi.c
> @@ -611,29 +611,16 @@ int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
> /*
> * Caller must hold a reference on @pi_state.
> */
> -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
> +static int wake_futex_pi(u32 __user *uaddr, u32 uval,
> + struct futex_pi_state *pi_state,
> + struct rt_mutex_waiter *top_waiter)
> {
> - struct rt_mutex_waiter *top_waiter;
> struct task_struct *new_owner;
> bool postunlock = false;
> DEFINE_RT_WAKE_Q(wqh);
> u32 curval, newval;
> int ret = 0;
>
> - top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> - if (WARN_ON_ONCE(!top_waiter)) {
> - /*
> - * As per the comment in futex_unlock_pi() this should not happen.
> - *
> - * When this happens, give up our locks and try again, giving
> - * the futex_lock_pi() instance time to complete, either by
> - * waiting on the rtmutex or removing itself from the futex
> - * queue.
> - */
> - ret = -EAGAIN;
> - goto out_unlock;
> - }
> -
> new_owner = top_waiter->task;
>
> /*
> @@ -1046,15 +1033,18 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
> ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
>
> cleanup:
> - spin_lock(q.lock_ptr);
> /*
> * If we failed to acquire the lock (deadlock/signal/timeout), we must
> * first acquire the hb->lock before removing the lock from the
> * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
> * lists consistent.
> *
> - * In particular; it is important that futex_unlock_pi() can not
> - * observe this inconsistency.
> + * Cannot hold hb->lock because rt_mutex already has a waiter enqueued
> + * and hb->lock can itself try and enqueue an rt_waiter through rtlock.
> + *
> + * Doing the cleanup without holding hb->lock can cause inconsistent
> + * state between hb and pi_state, but only in the direction of seeing a
> + * waiter that is leaving.
> */
> if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
> ret = 0;
> @@ -1063,6 +1053,12 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
> * Waiter is unqueued.
> */
> rt_mutex_post_schedule();
> +
> + /*
> + * Now that the rt_waiter has been dequeued, it is safe to use
> + * spinlock/rtlock, which will enqueue a new rt_waiter.
> + */
> + spin_lock(q.lock_ptr);
> no_block:
> /*
> * Fixup the pi_state owner and possibly acquire the lock if we
> @@ -1143,6 +1139,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> top_waiter = futex_top_waiter(hb, &key);
> if (top_waiter) {
> struct futex_pi_state *pi_state = top_waiter->pi_state;
> + struct rt_mutex_waiter *rt_waiter;
>
> ret = -EINVAL;
> if (!pi_state)
> @@ -1158,19 +1155,34 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> get_pi_state(pi_state);
> /*
> * By taking wait_lock while still holding hb->lock, we ensure
> - * there is no point where we hold neither; and therefore
> - * wake_futex_p() must observe a state consistent with what we
> - * observed.
> + * there is no point where we hold neither; and thereby
> + * wake_futex_pi() must observe any new waiters.
> + *
> + * Since the cleanup: case in futex_lock_pi() removes the
> + * rt_waiter without holding hb->lock, it is possible for
> + * wake_futex_pi() to not find a waiter while the above does,
> + * in this case the waiter is on the way out and it can be
> + * ignored.
> *
> * In particular; this forces __rt_mutex_start_proxy() to
> * complete such that we're guaranteed to observe the
> - * rt_waiter. Also see the WARN in wake_futex_pi().
> + * rt_waiter.
> */
> raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> +
> + /*
> + * Futex vs rt_mutex waiter state -- if there are on rt_mutex
> + * waiters even though futex thinkgs there are, then the waiter
> + * is leaving and the uncontended path is safe to take.
> + */
> + rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> + if (!rt_waiter)
> + goto do_uncontended;

That wants to be:

if (!rt_waiter) {
raw_spin_unlock_irQ(&pi_state->pi_mutex.wait_lock);
goto do_uncontended;
}

> +
> spin_unlock(&hb->lock);
>
> /* drops pi_state->pi_mutex.wait_lock */
> - ret = wake_futex_pi(uaddr, uval, pi_state);
> + ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
>
> put_pi_state(pi_state);
>
> @@ -1198,6 +1210,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> return ret;
> }
>
> +do_uncontended:
> /*
> * We have no kernel internal state, i.e. no waiters in the
> * kernel. Waiters which are about to queue themselves are stuck
>

And I think there was a requeue site that wants updating too.. the above
was more or less a starting point for discussion :-)

2023-09-13 00:06:36

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On 2023-09-11 16:11:35 [+0200], Peter Zijlstra wrote:
> Aside from this being just plain gross, this also destroys determinism
> of futex_pi, which completely defeats the purpose of the whole thing.

No objections here.

> Now.. the reason we need hb->lock at this point is to avoid the
> wake_futex_pi() -EAGAIN case.
>
> This happens when futex_top_waiter() and rt_mutex_top_waiter() state
> becomes inconsistent. The current rules are such that this inconsistency
> will not be observed.
>
> Notably the case that needs to be avoided is where futex_lock_pi() and
> futex_unlock_pi() interleave such that unlock will fail to observe a new
> waiter.
>
> *However* the case at hand is where a waiter is leaving, in this case
> what happens is that we'll fail to observe a waiter that is already
> gone, which is harmless afaict.

Sounds harmless. I sure something will be pop up in a few years and we
will look back this ;)

> Would not something like the below work instead?

The following is what I ended up with. It adds the `struct' keywords to
wake_futex_pi() and accounts for rt_mutex_post_schedule() in
futex_lock_pi() (assuming you stuff this at the end.

Comments follow…

diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index f8e65b27d9d6b..d8866278e92ff 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -611,29 +611,16 @@ int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
/*
* Caller must hold a reference on @pi_state.
*/
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval,
+ struct futex_pi_state *pi_state,
+ struct rt_mutex_waiter *top_waiter)
{
- struct rt_mutex_waiter *top_waiter;
struct task_struct *new_owner;
bool postunlock = false;
DEFINE_RT_WAKE_Q(wqh);
u32 curval, newval;
int ret = 0;

- top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
- if (WARN_ON_ONCE(!top_waiter)) {
- /*
- * As per the comment in futex_unlock_pi() this should not happen.
- *
- * When this happens, give up our locks and try again, giving
- * the futex_lock_pi() instance time to complete, either by
- * waiting on the rtmutex or removing itself from the futex
- * queue.
- */
- ret = -EAGAIN;
- goto out_unlock;
- }
-
new_owner = top_waiter->task;

/*
@@ -1046,15 +1033,18 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);

cleanup:
- spin_lock(q.lock_ptr);
/*
* If we failed to acquire the lock (deadlock/signal/timeout), we must
* first acquire the hb->lock before removing the lock from the
* rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
* lists consistent.
*
- * In particular; it is important that futex_unlock_pi() can not
- * observe this inconsistency.
+ * Cannot hold hb->lock because rt_mutex already has a waiter enqueued
+ * and hb->lock can itself try and enqueue an rt_waiter through rtlock.
+ *
+ * Doing the cleanup without holding hb->lock can cause inconsistent
+ * state between hb and pi_state, but only in the direction of seeing a
+ * waiter that is leaving.
*/
if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
ret = 0;
@@ -1063,6 +1053,12 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
* Waiter is unqueued.
*/
rt_mutex_post_schedule();
+
+ /*
+ * Now that the rt_waiter has been dequeued, it is safe to use
+ * spinlock/rtlock, which will enqueue a new rt_waiter.
+ */
+ spin_lock(q.lock_ptr);
no_block:
/*
* Fixup the pi_state owner and possibly acquire the lock if we
@@ -1143,6 +1139,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
top_waiter = futex_top_waiter(hb, &key);
if (top_waiter) {
struct futex_pi_state *pi_state = top_waiter->pi_state;
+ struct rt_mutex_waiter *rt_waiter;

ret = -EINVAL;
if (!pi_state)
@@ -1158,19 +1155,34 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
get_pi_state(pi_state);
/*
* By taking wait_lock while still holding hb->lock, we ensure
- * there is no point where we hold neither; and therefore
- * wake_futex_p() must observe a state consistent with what we
- * observed.
+ * there is no point where we hold neither; and thereby
+ * wake_futex_pi() must observe any new waiters.
+ *
+ * Since the cleanup: case in futex_lock_pi() removes the
+ * rt_waiter without holding hb->lock, it is possible for
+ * wake_futex_pi() to not find a waiter while the above does,
+ * in this case the waiter is on the way out and it can be
+ * ignored.
*
* In particular; this forces __rt_mutex_start_proxy() to
* complete such that we're guaranteed to observe the
- * rt_waiter. Also see the WARN in wake_futex_pi().
+ * rt_waiter.
*/
raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+ /*
+ * Futex vs rt_mutex waiter state -- if there are on rt_mutex
+ * waiters even though futex thinkgs there are, then the waiter
+ * is leaving and the uncontended path is safe to take.
+ */
+ rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
+ if (!rt_waiter)
+ goto do_uncontended;
+
spin_unlock(&hb->lock);

/* drops pi_state->pi_mutex.wait_lock */
- ret = wake_futex_pi(uaddr, uval, pi_state);
+ ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);

put_pi_state(pi_state);

@@ -1198,6 +1210,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
return ret;
}

+do_uncontended:
/*
* We have no kernel internal state, i.e. no waiters in the
* kernel. Waiters which are about to queue themselves are stuck

Sebastian

2023-09-14 06:36:29

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On 2023-09-12 13:17:16 [+0200], To Peter Zijlstra wrote:
> Comments follow…
>
> diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
> index f8e65b27d9d6b..d8866278e92ff 100644
> --- a/kernel/futex/pi.c
> +++ b/kernel/futex/pi.c
> @@ -611,29 +611,16 @@ int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
> /*
> * Caller must hold a reference on @pi_state.
> */
> -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
> +static int wake_futex_pi(u32 __user *uaddr, u32 uval,
> + struct futex_pi_state *pi_state,
> + struct rt_mutex_waiter *top_waiter)
> {
> - struct rt_mutex_waiter *top_waiter;
> struct task_struct *new_owner;
> bool postunlock = false;
> DEFINE_RT_WAKE_Q(wqh);
> u32 curval, newval;
> int ret = 0;
>
> - top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> - if (WARN_ON_ONCE(!top_waiter)) {
> - /*
> - * As per the comment in futex_unlock_pi() this should not happen.
> - *
> - * When this happens, give up our locks and try again, giving
> - * the futex_lock_pi() instance time to complete, either by
> - * waiting on the rtmutex or removing itself from the futex
> - * queue.
> - */
> - ret = -EAGAIN;
> - goto out_unlock;
> - }
> -
> new_owner = top_waiter->task;
>
> /*
> @@ -1046,15 +1033,18 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
> ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
>
> cleanup:
> - spin_lock(q.lock_ptr);
> /*
> * If we failed to acquire the lock (deadlock/signal/timeout), we must
> * first acquire the hb->lock before removing the lock from the
> * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
> * lists consistent.
> *
> - * In particular; it is important that futex_unlock_pi() can not
> - * observe this inconsistency.
> + * Cannot hold hb->lock because rt_mutex already has a waiter enqueued
> + * and hb->lock can itself try and enqueue an rt_waiter through rtlock.
> + *
> + * Doing the cleanup without holding hb->lock can cause inconsistent
> + * state between hb and pi_state, but only in the direction of seeing a
> + * waiter that is leaving.
> */
> if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
> ret = 0;
> @@ -1063,6 +1053,12 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
> * Waiter is unqueued.
> */
> rt_mutex_post_schedule();

So no new lock-task can enter but one can give up and vanish.

> +
> + /*
> + * Now that the rt_waiter has been dequeued, it is safe to use
> + * spinlock/rtlock, which will enqueue a new rt_waiter.
> + */
> + spin_lock(q.lock_ptr);
> no_block:
> /*
> * Fixup the pi_state owner and possibly acquire the lock if we
> @@ -1143,6 +1139,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> top_waiter = futex_top_waiter(hb, &key);
> if (top_waiter) {
> struct futex_pi_state *pi_state = top_waiter->pi_state;
> + struct rt_mutex_waiter *rt_waiter;
>
> ret = -EINVAL;
> if (!pi_state)
> @@ -1158,19 +1155,34 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> get_pi_state(pi_state);
> /*
> * By taking wait_lock while still holding hb->lock, we ensure
> - * there is no point where we hold neither; and therefore
> - * wake_futex_p() must observe a state consistent with what we
> - * observed.
> + * there is no point where we hold neither; and thereby
> + * wake_futex_pi() must observe any new waiters.
> + *
> + * Since the cleanup: case in futex_lock_pi() removes the
> + * rt_waiter without holding hb->lock, it is possible for
> + * wake_futex_pi() to not find a waiter while the above does,
> + * in this case the waiter is on the way out and it can be
> + * ignored.
> *
> * In particular; this forces __rt_mutex_start_proxy() to
> * complete such that we're guaranteed to observe the
> - * rt_waiter. Also see the WARN in wake_futex_pi().
> + * rt_waiter.
> */
> raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> +
> + /*
> + * Futex vs rt_mutex waiter state -- if there are on rt_mutex
> + * waiters even though futex thinkgs there are, then the waiter
> + * is leaving and the uncontended path is safe to take.
> + */
> + rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> + if (!rt_waiter)
> + goto do_uncontended;

So I spent some time to come up with a test case for this and then
hacked the kernel to get into exact this situation. The wait_lock from
above needs to be released since it is done by wake_futex_pi()
otherwise.

Then we do earlier get_pi_state(pi_state) and I think a put here is in
order right?

We own the hb lock here but it is released later in the do_uncontended
label so that looks okay.

The userland state is modified under the hb-lock so that should work.

> +
> spin_unlock(&hb->lock);
>
> /* drops pi_state->pi_mutex.wait_lock */
> - ret = wake_futex_pi(uaddr, uval, pi_state);
> + ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
>
> put_pi_state(pi_state);
>
> @@ -1198,6 +1210,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> return ret;
> }
>
> +do_uncontended:
> /*
> * We have no kernel internal state, i.e. no waiters in the
> * kernel. Waiters which are about to queue themselves are stuck

So I *think* this could work. Then we need something similar for
requeue-pi.

Sebastian

2023-09-15 14:11:50

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On Mon, Sep 11 2023 at 16:11, Peter Zijlstra wrote:
> On Fri, Sep 08, 2023 at 06:22:54PM +0200, Sebastian Andrzej Siewior wrote:
> Aside from this being just plain gross, this also destroys determinism
> of futex_pi, which completely defeats the purpose of the whole thing.

It served the purpose to force the two dudes who wrote most of this
horrorshow to actually look at it. It worked :)

> Now.. the reason we need hb->lock at this point is to avoid the
> wake_futex_pi() -EAGAIN case.
>
> This happens when futex_top_waiter() and rt_mutex_top_waiter() state
> becomes inconsistent. The current rules are such that this inconsistency
> will not be observed.
>
> Notably the case that needs to be avoided is where futex_lock_pi() and
> futex_unlock_pi() interleave such that unlock will fail to observe a new
> waiter.

That's correct.

> *However* the case at hand is where a waiter is leaving, in this case
> what happens is that we'll fail to observe a waiter that is already
> gone, which is harmless afaict.

It is harmless, when the wakee handles the inconsistency between the
futex waiter list and the rtmutex waiter list correctly.

> @@ -1039,19 +1026,27 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
> ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
>
> cleanup:
> - spin_lock(q.lock_ptr);
> /*
> * If we failed to acquire the lock (deadlock/signal/timeout), we must
> * first acquire the hb->lock before removing the lock from the
> * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
> * lists consistent.

This comment does not make sense anymore.

> - * In particular; it is important that futex_unlock_pi() can not
> - * observe this inconsistency.
> + * Cannot hold hb->lock because rt_mutex already has a waiter enqueued
> + * and hb->lock can itself try and enqueue an rt_waiter through rtlock.
> + *
> + * Doing the cleanup without holding hb->lock can cause inconsistent
> + * state between hb and pi_state, but only in the direction of seeing a
> + * waiter that is leaving.

+ * See futex_unlock_pi()

> */
> if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
> ret = 0;
>
> + /*
> + * Now that the rt_waiter has been dequeued, it is safe to use
> + * spinlock/rtlock, which will enqueue a new rt_waiter.
> + */
> + spin_lock(q.lock_ptr);
> no_block:
> /*
> * Fixup the pi_state owner and possibly acquire the lock if we
> @@ -1132,6 +1127,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> top_waiter = futex_top_waiter(hb, &key);
> if (top_waiter) {
> struct futex_pi_state *pi_state = top_waiter->pi_state;
> + struct rt_mutex_waiter *rt_waiter;
>
> ret = -EINVAL;
> if (!pi_state)
> @@ -1147,19 +1143,34 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> get_pi_state(pi_state);
> /*
> * By taking wait_lock while still holding hb->lock, we ensure
> - * there is no point where we hold neither; and therefore
> - * wake_futex_p() must observe a state consistent with what we
> - * observed.
> + * there is no point where we hold neither; and thereby
> + * wake_futex_pi() must observe any new waiters.
> + *
> + * Since the cleanup: case in futex_lock_pi() removes the
> + * rt_waiter without holding hb->lock, it is possible for
> + * wake_futex_pi() to not find a waiter while the above does,
> + * in this case the waiter is on the way out and it can be
> + * ignored.
> *
> * In particular; this forces __rt_mutex_start_proxy() to
> * complete such that we're guaranteed to observe the
> - * rt_waiter. Also see the WARN in wake_futex_pi().
> + * rt_waiter.
> */
> raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> +
> + /*
> + * Futex vs rt_mutex waiter state -- if there are on rt_mutex

s/on/no/

> + * waiters even though futex thinkgs there are, then the waiter

thinkgs :)

> + * is leaving and the uncontended path is safe to take.
> + */
> + rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> + if (!rt_waiter)
> + goto do_uncontended;

Leaks pi_mutex.wait_lock

> +
> spin_unlock(&hb->lock);
>
> /* drops pi_state->pi_mutex.wait_lock */
> - ret = wake_futex_pi(uaddr, uval, pi_state);
> + ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
>
> put_pi_state(pi_state);
>
> @@ -1187,6 +1198,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> return ret;
> }
>
> +do_uncontended:
> /*
> * We have no kernel internal state, i.e. no waiters in the
> * kernel. Waiters which are about to queue themselves are stuck

Plus you need:

diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index cba8b1a6a4cc..af1427689414 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -850,11 +850,11 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
pi_mutex = &q.pi_state->pi_mutex;
ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);

- /* Current is not longer pi_blocked_on */
- spin_lock(q.lock_ptr);
+ /* Add a proper comment */
if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
ret = 0;

+ spin_lock(q.lock_ptr);
debug_rt_mutex_free_waiter(&rt_waiter);
/*
* Fixup the pi_state owner and possibly acquire the lock if we

I spent quite some time to convince myself that this is correct. I was
not able to poke a hole into it. So that really should be safe to
do. Famous last words ...

Thanks,

tglx

2023-09-15 15:32:26

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On 2023-09-15 14:58:35 [+0200], Thomas Gleixner wrote:
> > + * is leaving and the uncontended path is safe to take.
> > + */
> > + rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> > + if (!rt_waiter)
> > + goto do_uncontended;
>
> Leaks pi_mutex.wait_lock

and pi_state.

> Plus you need:
>
> diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
> index cba8b1a6a4cc..af1427689414 100644
> --- a/kernel/futex/requeue.c
> +++ b/kernel/futex/requeue.c
> @@ -850,11 +850,11 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
> pi_mutex = &q.pi_state->pi_mutex;
> ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
>
> - /* Current is not longer pi_blocked_on */
> - spin_lock(q.lock_ptr);
> + /* Add a proper comment */
> if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
> ret = 0;
>
> + spin_lock(q.lock_ptr);
> debug_rt_mutex_free_waiter(&rt_waiter);
> /*
> * Fixup the pi_state owner and possibly acquire the lock if we

Yes, if we do this.

>
> I spent quite some time to convince myself that this is correct. I was
> not able to poke a hole into it. So that really should be safe to
> do. Famous last words ...

Okay. Then let me collect the pieces and post a new round then.

> Thanks,
>
> tglx

Sebastian

2023-09-15 19:24:27

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On Fri, Sep 15, 2023 at 02:58:35PM +0200, Thomas Gleixner wrote:

> I spent quite some time to convince myself that this is correct. I was
> not able to poke a hole into it. So that really should be safe to
> do. Famous last words ...

IKR :-/

Something like so then...

---
Subject: futex/pi: Fix recursive rt_mutex waiter state
From: Peter Zijlstra <[email protected]>
Date: Tue, 12 Sep 2023 13:17:11 +0200

Some new assertions pointed out that the existing code has nested rt_mutex wait
state in the futex code.

Specifically, the futex_lock_pi() cancel case uses spin_lock() while there
still is a rt_waiter enqueued for this task, resulting in a state where there
are two waiters for the same task (and task_struct::pi_blocked_on gets
scrambled).

The reason to take hb->lock at this point is to avoid the wake_futex_pi()
EAGAIN case.

This happens when futex_top_waiter() and rt_mutex_top_waiter() state becomes
inconsistent. The current rules are such that this inconsistency will not be
observed.

Notably the case that needs to be avoided is where futex_lock_pi() and
futex_unlock_pi() interleave such that unlock will fail to observe a new
waiter.

*However* the case at hand is where a waiter is leaving, in this case the race
means a waiter that is going away is not observed -- which is harmless,
provided this race is explicitly handled.

This is a somewhat dangerous proposition because the converse race is not
observing a new waiter, which must absolutely not happen. But since the race is
valid this cannot be asserted.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
pi.c | 76 +++++++++++++++++++++++++++++++++++++++-----------------------
requeue.c | 6 +++-
2 files changed, 52 insertions(+), 30 deletions(-)

Index: linux-2.6/kernel/futex/pi.c
===================================================================
--- linux-2.6.orig/kernel/futex/pi.c
+++ linux-2.6/kernel/futex/pi.c
@@ -610,29 +610,16 @@ int futex_lock_pi_atomic(u32 __user *uad
/*
* Caller must hold a reference on @pi_state.
*/
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval,
+ struct futex_pi_state *pi_state,
+ struct rt_mutex_waiter *top_waiter)
{
- struct rt_mutex_waiter *top_waiter;
struct task_struct *new_owner;
bool postunlock = false;
DEFINE_RT_WAKE_Q(wqh);
u32 curval, newval;
int ret = 0;

- top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
- if (WARN_ON_ONCE(!top_waiter)) {
- /*
- * As per the comment in futex_unlock_pi() this should not happen.
- *
- * When this happens, give up our locks and try again, giving
- * the futex_lock_pi() instance time to complete, either by
- * waiting on the rtmutex or removing itself from the futex
- * queue.
- */
- ret = -EAGAIN;
- goto out_unlock;
- }
-
new_owner = top_waiter->task;

/*
@@ -1039,19 +1026,33 @@ retry_private:
ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);

cleanup:
- spin_lock(q.lock_ptr);
/*
* If we failed to acquire the lock (deadlock/signal/timeout), we must
- * first acquire the hb->lock before removing the lock from the
- * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
- * lists consistent.
+ * must unwind the above, however we canont lock hb->lock because
+ * rt_mutex already has a waiter enqueued and hb->lock can itself try
+ * and enqueue an rt_waiter through rtlock.
+ *
+ * Doing the cleanup without holding hb->lock can cause inconsistent
+ * state between hb and pi_state, but only in the direction of not
+ * seeing a waiter that is leaving.
+ *
+ * See futex_unlock_pi(), it deals with this inconsistency.
+ *
+ * There be dragons here, since we must deal with the inconsistency on
+ * the way out (here), it is impossible to detect/warn about the race
+ * the other way around (missing an incoming waiter).
*
- * In particular; it is important that futex_unlock_pi() can not
- * observe this inconsistency.
+ * What could possibly go wrong...
*/
if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
ret = 0;

+ /*
+ * Now that the rt_waiter has been dequeued, it is safe to use
+ * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
+ * the
+ */
+ spin_lock(q.lock_ptr);
no_block:
/*
* Fixup the pi_state owner and possibly acquire the lock if we
@@ -1132,6 +1133,7 @@ retry:
top_waiter = futex_top_waiter(hb, &key);
if (top_waiter) {
struct futex_pi_state *pi_state = top_waiter->pi_state;
+ struct rt_mutex_waiter *rt_waiter;

ret = -EINVAL;
if (!pi_state)
@@ -1144,22 +1146,39 @@ retry:
if (pi_state->owner != current)
goto out_unlock;

- get_pi_state(pi_state);
/*
* By taking wait_lock while still holding hb->lock, we ensure
- * there is no point where we hold neither; and therefore
- * wake_futex_p() must observe a state consistent with what we
- * observed.
+ * there is no point where we hold neither; and thereby
+ * wake_futex_pi() must observe any new waiters.
+ *
+ * Since the cleanup: case in futex_lock_pi() removes the
+ * rt_waiter without holding hb->lock, it is possible for
+ * wake_futex_pi() to not find a waiter while the above does,
+ * in this case the waiter is on the way out and it can be
+ * ignored.
*
* In particular; this forces __rt_mutex_start_proxy() to
* complete such that we're guaranteed to observe the
- * rt_waiter. Also see the WARN in wake_futex_pi().
+ * rt_waiter.
*/
raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+ /*
+ * Futex vs rt_mutex waiter state -- if there are no rt_mutex
+ * waiters even though futex thinks there are, then the waiter
+ * is leaving and the uncontended path is safe to take.
+ */
+ rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
+ if (!rt_waiter) {
+ raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+ goto do_uncontended;
+ }
+
+ get_pi_state(pi_state);
spin_unlock(&hb->lock);

/* drops pi_state->pi_mutex.wait_lock */
- ret = wake_futex_pi(uaddr, uval, pi_state);
+ ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);

put_pi_state(pi_state);

@@ -1187,6 +1206,7 @@ retry:
return ret;
}

+do_uncontended:
/*
* We have no kernel internal state, i.e. no waiters in the
* kernel. Waiters which are about to queue themselves are stuck
Index: linux-2.6/kernel/futex/requeue.c
===================================================================
--- linux-2.6.orig/kernel/futex/requeue.c
+++ linux-2.6/kernel/futex/requeue.c
@@ -850,11 +850,13 @@ int futex_wait_requeue_pi(u32 __user *ua
pi_mutex = &q.pi_state->pi_mutex;
ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);

- /* Current is not longer pi_blocked_on */
- spin_lock(q.lock_ptr);
+ /*
+ * See futex_unlock_pi()'s cleanup: comment.
+ */
if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
ret = 0;

+ spin_lock(q.lock_ptr);
debug_rt_mutex_free_waiter(&rt_waiter);
/*
* Fixup the pi_state owner and possibly acquire the lock if we

2023-09-16 02:14:57

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On Fri, Sep 15 2023 at 17:19, Peter Zijlstra wrote:
> On Fri, Sep 15, 2023 at 02:58:35PM +0200, Thomas Gleixner wrote:
> *However* the case at hand is where a waiter is leaving, in this case the race
> means a waiter that is going away is not observed -- which is harmless,
> provided this race is explicitly handled.
>
> This is a somewhat dangerous proposition because the converse race is not
> observing a new waiter, which must absolutely not happen. But since the race is
> valid this cannot be asserted.

Correct. But adding a new waiter requires to hold hb::lock which _IS_
held by the unlocking code when it deals with the outgoing race.

So I'm not too worried about it.

> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> /*
> * If we failed to acquire the lock (deadlock/signal/timeout), we must
> - * first acquire the hb->lock before removing the lock from the
> - * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
> - * lists consistent.
> + * must unwind the above, however we canont lock hb->lock because
> + * rt_mutex already has a waiter enqueued and hb->lock can itself try
> + * and enqueue an rt_waiter through rtlock.
> + *
> + * Doing the cleanup without holding hb->lock can cause inconsistent
> + * state between hb and pi_state, but only in the direction of not
> + * seeing a waiter that is leaving.
> + *
> + * See futex_unlock_pi(), it deals with this inconsistency.
> + *
> + * There be dragons here, since we must deal with the inconsistency on
> + * the way out (here), it is impossible to detect/warn about the race
> + * the other way around (missing an incoming waiter).
> *
> - * In particular; it is important that futex_unlock_pi() can not
> - * observe this inconsistency.
> + * What could possibly go wrong...

If some code in the future tries to enqueue a waiter w/o holding
hb::lock then this corner case will be the least of our worries. There
are tons of other things which will insta go south.

Reviewed-by: Thomas Gleixner <[email protected]>

2023-09-19 16:13:34

by tip-bot2 for Alexey Makhalov

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On 2023-09-15 17:19:43 [+0200], Peter Zijlstra wrote:
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

Looked at it, poked at it can't find anything wrong with it.

Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>

Sebastian

2023-09-20 09:37:07

[permalink] [raw]

Subject: [tip: locking/core] futex/pi: Fix recursive rt_mutex waiter state

The following commit has been merged into the locking/core branch of tip:

Commit-ID: fbeb558b0dd0d6348e0872bbbbe96e30c65867b7
Gitweb: https://git.kernel.org/tip/fbeb558b0dd0d6348e0872bbbbe96e30c65867b7
Author: Peter Zijlstra <[email protected]>
AuthorDate: Fri, 15 Sep 2023 17:19:44 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 20 Sep 2023 09:31:14 +02:00

futex/pi: Fix recursive rt_mutex waiter state

Some new assertions pointed out that the existing code has nested rt_mutex wait
state in the futex code.

Specifically, the futex_lock_pi() cancel case uses spin_lock() while there
still is a rt_waiter enqueued for this task, resulting in a state where there
are two waiters for the same task (and task_struct::pi_blocked_on gets
scrambled).

The reason to take hb->lock at this point is to avoid the wake_futex_pi()
EAGAIN case.

This happens when futex_top_waiter() and rt_mutex_top_waiter() state becomes
inconsistent. The current rules are such that this inconsistency will not be
observed.

Notably the case that needs to be avoided is where futex_lock_pi() and
futex_unlock_pi() interleave such that unlock will fail to observe a new
waiter.

*However* the case at hand is where a waiter is leaving, in this case the race
means a waiter that is going away is not observed -- which is harmless,
provided this race is explicitly handled.

This is a somewhat dangerous proposition because the converse race is not
observing a new waiter, which must absolutely not happen. But since the race is
valid this cannot be asserted.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/futex/pi.c | 76 +++++++++++++++++++++++++----------------
kernel/futex/requeue.c | 6 ++-
2 files changed, 52 insertions(+), 30 deletions(-)

diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index f8e65b2..d636a1b 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -611,29 +611,16 @@ int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
/*
* Caller must hold a reference on @pi_state.
*/
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval,
+ struct futex_pi_state *pi_state,
+ struct rt_mutex_waiter *top_waiter)
{
- struct rt_mutex_waiter *top_waiter;
struct task_struct *new_owner;
bool postunlock = false;
DEFINE_RT_WAKE_Q(wqh);
u32 curval, newval;
int ret = 0;

- top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
- if (WARN_ON_ONCE(!top_waiter)) {
- /*
- * As per the comment in futex_unlock_pi() this should not happen.
- *
- * When this happens, give up our locks and try again, giving
- * the futex_lock_pi() instance time to complete, either by
- * waiting on the rtmutex or removing itself from the futex
- * queue.
- */
- ret = -EAGAIN;
- goto out_unlock;
- }
-
new_owner = top_waiter->task;

/*
@@ -1046,20 +1033,34 @@ retry_private:
ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);

cleanup:
- spin_lock(q.lock_ptr);
/*
* If we failed to acquire the lock (deadlock/signal/timeout), we must
- * first acquire the hb->lock before removing the lock from the
- * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
- * lists consistent.
+ * must unwind the above, however we canont lock hb->lock because
+ * rt_mutex already has a waiter enqueued and hb->lock can itself try
+ * and enqueue an rt_waiter through rtlock.
+ *
+ * Doing the cleanup without holding hb->lock can cause inconsistent
+ * state between hb and pi_state, but only in the direction of not
+ * seeing a waiter that is leaving.
+ *
+ * See futex_unlock_pi(), it deals with this inconsistency.
*
- * In particular; it is important that futex_unlock_pi() can not
- * observe this inconsistency.
+ * There be dragons here, since we must deal with the inconsistency on
+ * the way out (here), it is impossible to detect/warn about the race
+ * the other way around (missing an incoming waiter).
+ *
+ * What could possibly go wrong...
*/
if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
ret = 0;

/*
+ * Now that the rt_waiter has been dequeued, it is safe to use
+ * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
+ * the
+ */
+ spin_lock(q.lock_ptr);
+ /*
* Waiter is unqueued.
*/
rt_mutex_post_schedule();
@@ -1143,6 +1144,7 @@ retry:
top_waiter = futex_top_waiter(hb, &key);
if (top_waiter) {
struct futex_pi_state *pi_state = top_waiter->pi_state;
+ struct rt_mutex_waiter *rt_waiter;

ret = -EINVAL;
if (!pi_state)
@@ -1155,22 +1157,39 @@ retry:
if (pi_state->owner != current)
goto out_unlock;

- get_pi_state(pi_state);
/*
* By taking wait_lock while still holding hb->lock, we ensure
- * there is no point where we hold neither; and therefore
- * wake_futex_p() must observe a state consistent with what we
- * observed.
+ * there is no point where we hold neither; and thereby
+ * wake_futex_pi() must observe any new waiters.
+ *
+ * Since the cleanup: case in futex_lock_pi() removes the
+ * rt_waiter without holding hb->lock, it is possible for
+ * wake_futex_pi() to not find a waiter while the above does,
+ * in this case the waiter is on the way out and it can be
+ * ignored.
*
* In particular; this forces __rt_mutex_start_proxy() to
* complete such that we're guaranteed to observe the
- * rt_waiter. Also see the WARN in wake_futex_pi().
+ * rt_waiter.
*/
raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+ /*
+ * Futex vs rt_mutex waiter state -- if there are no rt_mutex
+ * waiters even though futex thinks there are, then the waiter
+ * is leaving and the uncontended path is safe to take.
+ */
+ rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
+ if (!rt_waiter) {
+ raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+ goto do_uncontended;
+ }
+
+ get_pi_state(pi_state);
spin_unlock(&hb->lock);

/* drops pi_state->pi_mutex.wait_lock */
- ret = wake_futex_pi(uaddr, uval, pi_state);
+ ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);

put_pi_state(pi_state);

@@ -1198,6 +1217,7 @@ retry:
return ret;
}

+do_uncontended:
/*
* We have no kernel internal state, i.e. no waiters in the
* kernel. Waiters which are about to queue themselves are stuck
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index cba8b1a..4c73e0b 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -850,11 +850,13 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
pi_mutex = &q.pi_state->pi_mutex;
ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);

- /* Current is not longer pi_blocked_on */
- spin_lock(q.lock_ptr);
+ /*
+ * See futex_unlock_pi()'s cleanup: comment.
+ */
if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
ret = 0;

+ spin_lock(q.lock_ptr);
debug_rt_mutex_free_waiter(&rt_waiter);
/*
* Fixup the pi_state owner and possibly acquire the lock if we

2024-01-15 11:40:26

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On 15. 09. 23, 17:19, Peter Zijlstra wrote:
> On Fri, Sep 15, 2023 at 02:58:35PM +0200, Thomas Gleixner wrote:
>
>> I spent quite some time to convince myself that this is correct. I was
>> not able to poke a hole into it. So that really should be safe to
>> do. Famous last words ...
>
> IKR :-/
>
> Something like so then...
>
> ---
> Subject: futex/pi: Fix recursive rt_mutex waiter state

So this breaks some random test in APR:

From
https://build.opensuse.org/package/live_build_log/openSUSE:Factory:Staging:G/apr/standard/x86_64:
testprocmutex : Line 122: child did not terminate with success

The child in fact terminates on
https://github.com/apache/apr/blob/trunk/test/testprocmutex.c#L93:
while ((rv = apr_proc_mutex_timedlock(proc_lock, 1))) {
if (!APR_STATUS_IS_TIMEUP(rv))
exit(1); <----- here

The test creates 6 children and does some
pthread_mutex_timedlock/unlock() repeatedly (200 times) in parallel
while sleeping 1 us inside the lock. The timeout is 1 us above. And the
test expects all them to fail (to time out). But the time out does not
always happen in 6.7 (it's racy, so the failure is semi-random: like 1
of 1000 attempts is bad).

If I revert this patch (commit fbeb558b0dd0d), the test works.

I know, the test could be broken too, but I have no idea, really. The
testsuite is sort of hairy and I could not come up with a simple repro.

Note APR sets up PTHREAD_PROCESS_SHARED, _ROBUST, and _PRIO_INHERIT
attrs for the mutex.

Anyway:
downstream report: https://bugzilla.suse.com/show_bug.cgi?id=1218801
APR report: https://bz.apache.org/bugzilla/show_bug.cgi?id=68481

Any idea if this patch should cause the above (or even is a desired
behavior)?

Thanks.

> From: Peter Zijlstra <[email protected]>
> Date: Tue, 12 Sep 2023 13:17:11 +0200
>
> Some new assertions pointed out that the existing code has nested rt_mutex wait
> state in the futex code.
>
> Specifically, the futex_lock_pi() cancel case uses spin_lock() while there
> still is a rt_waiter enqueued for this task, resulting in a state where there
> are two waiters for the same task (and task_struct::pi_blocked_on gets
> scrambled).
>
> The reason to take hb->lock at this point is to avoid the wake_futex_pi()
> EAGAIN case.
>
> This happens when futex_top_waiter() and rt_mutex_top_waiter() state becomes
> inconsistent. The current rules are such that this inconsistency will not be
> observed.
>
> Notably the case that needs to be avoided is where futex_lock_pi() and
> futex_unlock_pi() interleave such that unlock will fail to observe a new
> waiter.
>
> *However* the case at hand is where a waiter is leaving, in this case the race
> means a waiter that is going away is not observed -- which is harmless,
> provided this race is explicitly handled.
>
> This is a somewhat dangerous proposition because the converse race is not
> observing a new waiter, which must absolutely not happen. But since the race is
> valid this cannot be asserted.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> pi.c | 76 +++++++++++++++++++++++++++++++++++++++-----------------------
> requeue.c | 6 +++-
> 2 files changed, 52 insertions(+), 30 deletions(-)
>
> Index: linux-2.6/kernel/futex/pi.c
> ===================================================================
> --- linux-2.6.orig/kernel/futex/pi.c
> +++ linux-2.6/kernel/futex/pi.c
> @@ -610,29 +610,16 @@ int futex_lock_pi_atomic(u32 __user *uad
> /*
> * Caller must hold a reference on @pi_state.
> */
> -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
> +static int wake_futex_pi(u32 __user *uaddr, u32 uval,
> + struct futex_pi_state *pi_state,
> + struct rt_mutex_waiter *top_waiter)
> {
> - struct rt_mutex_waiter *top_waiter;
> struct task_struct *new_owner;
> bool postunlock = false;
> DEFINE_RT_WAKE_Q(wqh);
> u32 curval, newval;
> int ret = 0;
>
> - top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> - if (WARN_ON_ONCE(!top_waiter)) {
> - /*
> - * As per the comment in futex_unlock_pi() this should not happen.
> - *
> - * When this happens, give up our locks and try again, giving
> - * the futex_lock_pi() instance time to complete, either by
> - * waiting on the rtmutex or removing itself from the futex
> - * queue.
> - */
> - ret = -EAGAIN;
> - goto out_unlock;
> - }
> -
> new_owner = top_waiter->task;
>
> /*
> @@ -1039,19 +1026,33 @@ retry_private:
> ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
>
> cleanup:
> - spin_lock(q.lock_ptr);
> /*
> * If we failed to acquire the lock (deadlock/signal/timeout), we must
> - * first acquire the hb->lock before removing the lock from the
> - * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
> - * lists consistent.
> + * must unwind the above, however we canont lock hb->lock because
> + * rt_mutex already has a waiter enqueued and hb->lock can itself try
> + * and enqueue an rt_waiter through rtlock.
> + *
> + * Doing the cleanup without holding hb->lock can cause inconsistent
> + * state between hb and pi_state, but only in the direction of not
> + * seeing a waiter that is leaving.
> + *
> + * See futex_unlock_pi(), it deals with this inconsistency.
> + *
> + * There be dragons here, since we must deal with the inconsistency on
> + * the way out (here), it is impossible to detect/warn about the race
> + * the other way around (missing an incoming waiter).
> *
> - * In particular; it is important that futex_unlock_pi() can not
> - * observe this inconsistency.
> + * What could possibly go wrong...
> */
> if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
> ret = 0;
>
> + /*
> + * Now that the rt_waiter has been dequeued, it is safe to use
> + * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
> + * the
> + */
> + spin_lock(q.lock_ptr);
> no_block:
> /*
> * Fixup the pi_state owner and possibly acquire the lock if we
> @@ -1132,6 +1133,7 @@ retry:
> top_waiter = futex_top_waiter(hb, &key);
> if (top_waiter) {
> struct futex_pi_state *pi_state = top_waiter->pi_state;
> + struct rt_mutex_waiter *rt_waiter;
>
> ret = -EINVAL;
> if (!pi_state)
> @@ -1144,22 +1146,39 @@ retry:
> if (pi_state->owner != current)
> goto out_unlock;
>
> - get_pi_state(pi_state);
> /*
> * By taking wait_lock while still holding hb->lock, we ensure
> - * there is no point where we hold neither; and therefore
> - * wake_futex_p() must observe a state consistent with what we
> - * observed.
> + * there is no point where we hold neither; and thereby
> + * wake_futex_pi() must observe any new waiters.
> + *
> + * Since the cleanup: case in futex_lock_pi() removes the
> + * rt_waiter without holding hb->lock, it is possible for
> + * wake_futex_pi() to not find a waiter while the above does,
> + * in this case the waiter is on the way out and it can be
> + * ignored.
> *
> * In particular; this forces __rt_mutex_start_proxy() to
> * complete such that we're guaranteed to observe the
> - * rt_waiter. Also see the WARN in wake_futex_pi().
> + * rt_waiter.
> */
> raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> +
> + /*
> + * Futex vs rt_mutex waiter state -- if there are no rt_mutex
> + * waiters even though futex thinks there are, then the waiter
> + * is leaving and the uncontended path is safe to take.
> + */
> + rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> + if (!rt_waiter) {
> + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
> + goto do_uncontended;
> + }
> +
> + get_pi_state(pi_state);
> spin_unlock(&hb->lock);
>
> /* drops pi_state->pi_mutex.wait_lock */
> - ret = wake_futex_pi(uaddr, uval, pi_state);
> + ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
>
> put_pi_state(pi_state);
>
> @@ -1187,6 +1206,7 @@ retry:
> return ret;
> }
>
> +do_uncontended:
> /*
> * We have no kernel internal state, i.e. no waiters in the
> * kernel. Waiters which are about to queue themselves are stuck
> Index: linux-2.6/kernel/futex/requeue.c
> ===================================================================
> --- linux-2.6.orig/kernel/futex/requeue.c
> +++ linux-2.6/kernel/futex/requeue.c
> @@ -850,11 +850,13 @@ int futex_wait_requeue_pi(u32 __user *ua
> pi_mutex = &q.pi_state->pi_mutex;
> ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
>
> - /* Current is not longer pi_blocked_on */
> - spin_lock(q.lock_ptr);
> + /*
> + * See futex_unlock_pi()'s cleanup: comment.
> + */
> if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
> ret = 0;
>
> + spin_lock(q.lock_ptr);
> debug_rt_mutex_free_waiter(&rt_waiter);
> /*
> * Fixup the pi_state owner and possibly acquire the lock if we
>

--
js
suse labs

2024-01-15 11:53:04

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On 15. 01. 24, 12:40, Jiri Slaby wrote:
> On 15. 09. 23, 17:19, Peter Zijlstra wrote:
>> On Fri, Sep 15, 2023 at 02:58:35PM +0200, Thomas Gleixner wrote:
>>
>>> I spent quite some time to convince myself that this is correct. I was
>>> not able to poke a hole into it. So that really should be safe to
>>> do. Famous last words ...
>>
>> IKR :-/
>>
>> Something like so then...
>>
>> ---
>> Subject: futex/pi: Fix recursive rt_mutex waiter state
>
> So this breaks some random test in APR:
>
> From
> https://build.opensuse.org/package/live_build_log/openSUSE:Factory:Staging:G/apr/standard/x86_64:
> testprocmutex       : Line 122: child did not terminate with success
>
> The child in fact terminates on
> https://github.com/apache/apr/blob/trunk/test/testprocmutex.c#L93:
>                 while ((rv = apr_proc_mutex_timedlock(proc_lock, 1))) {
>                     if (!APR_STATUS_IS_TIMEUP(rv))
>                         exit(1); <----- here
>
> The test creates 6 children and does some
> pthread_mutex_timedlock/unlock() repeatedly (200 times) in parallel
> while sleeping 1 us inside the lock. The timeout is 1 us above. And the
> test expects all them to fail (to time out). But the time out does not
> always happen in 6.7 (it's racy, so the failure is semi-random: like 1
> of 1000 attempts is bad).

This is not precise as I misinterpreted. The test is: either it succeeds
or times out.

But since the commit, futex() yields 22/EINVAL, i.e. fails.

> If I revert this patch (commit fbeb558b0dd0d), the test works.
>
> I know, the test could be broken too, but I have no idea, really. The
> testsuite is sort of hairy and I could not come up with a simple repro.
>
> Note APR sets up PTHREAD_PROCESS_SHARED, _ROBUST, and _PRIO_INHERIT
> attrs for the mutex.
>
> Anyway:
> downstream report: https://bugzilla.suse.com/show_bug.cgi?id=1218801
> APR report: https://bz.apache.org/bugzilla/show_bug.cgi?id=68481
>
> Any idea if this patch should cause the above (or even is a desired
> behavior)?
>
> Thanks.

--
js

2024-01-15 12:16:45

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On 2024-01-15 12:52:49 [+0100], Jiri Slaby wrote:
…
> >
> > The child in fact terminates on
> > https://github.com/apache/apr/blob/trunk/test/testprocmutex.c#L93:
> >                 while ((rv = apr_proc_mutex_timedlock(proc_lock, 1))) {
> >                     if (!APR_STATUS_IS_TIMEUP(rv))
> >                         exit(1); <----- here
> >
> > The test creates 6 children and does some
> > pthread_mutex_timedlock/unlock() repeatedly (200 times) in parallel
> > while sleeping 1 us inside the lock. The timeout is 1 us above. And the
> > test expects all them to fail (to time out). But the time out does not
> > always happen in 6.7 (it's racy, so the failure is semi-random: like 1
> > of 1000 attempts is bad).
>
> This is not precise as I misinterpreted. The test is: either it succeeds or
> times out.
>
> But since the commit, futex() yields 22/EINVAL, i.e. fails.

Let me see if I can reproduce it here…

Sebastian

2024-01-15 13:01:24

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On 15. 01. 24, 12:52, Jiri Slaby wrote:
> On 15. 01. 24, 12:40, Jiri Slaby wrote:
>> On 15. 09. 23, 17:19, Peter Zijlstra wrote:
>>> On Fri, Sep 15, 2023 at 02:58:35PM +0200, Thomas Gleixner wrote:
>>>
>>>> I spent quite some time to convince myself that this is correct. I was
>>>> not able to poke a hole into it. So that really should be safe to
>>>> do. Famous last words ...
>>>
>>> IKR :-/
>>>
>>> Something like so then...
>>>
>>> ---
>>> Subject: futex/pi: Fix recursive rt_mutex waiter state
>>
>> So this breaks some random test in APR:
>>
>> From
>> https://build.opensuse.org/package/live_build_log/openSUSE:Factory:Staging:G/apr/standard/x86_64:
>> testprocmutex       : Line 122: child did not terminate with success
>>
>> The child in fact terminates on
>> https://github.com/apache/apr/blob/trunk/test/testprocmutex.c#L93:
>>                  while ((rv = apr_proc_mutex_timedlock(proc_lock, 1))) {
>>                      if (!APR_STATUS_IS_TIMEUP(rv))
>>                          exit(1); <----- here
>>
>> The test creates 6 children and does some
>> pthread_mutex_timedlock/unlock() repeatedly (200 times) in parallel
>> while sleeping 1 us inside the lock. The timeout is 1 us above. And
>> the test expects all them to fail (to time out). But the time out does
>> not always happen in 6.7 (it's racy, so the failure is semi-random:
>> like 1 of 1000 attempts is bad).
>
> This is not precise as I misinterpreted. The test is: either it succeeds
> or times out.
>
> But since the commit, futex() yields 22/EINVAL, i.e. fails.

A simplified reproducer attached (in particular, no APR anymore). Build
with -pthread, obviously. If you see
BADx rv=22

that's bad.

regards,
--
js
suse labs

Attachments:

pokus2.c (1.70 kB)

2024-01-15 15:33:05

by Yann Ylavic

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On Mon, Jan 15, 2024 at 1:55 PM Jiri Slaby <[email protected]> wrote:
>
> A simplified reproducer attached (in particular, no APR anymore). Build
> with -pthread, obviously.

I think "abstime" should be updated in the the
pthread_mutex_timedlock() loop or it might get in the past after the
first ETIMEDOUT?
Maybe something the attached pokus3.c reproducer is more inline with
the original code.

Regards;
Yann.

Attachments:

pokus3.c (1.78 kB)

2024-01-15 18:33:52

[permalink] [raw]

Subject: Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.

On 2024-01-15 13:54:32 [+0100], Jiri Slaby wrote:
> A simplified reproducer attached (in particular, no APR anymore). Build with
> -pthread, obviously. If you see
> BADx rv=22
>
> that's bad.

So task1 owns the futex, task2 does lock_pi_futex(). task2 setups the pi
state and everything for task1. Then task1 times out on the lock and
does not want the lock no more. And we break user state vs kernel via:

task1 task2

futex_unlock_pi(); futex_lock_pi();
rt_mutex_wait_proxy_lock() // ETIMEDOUT
spin_lock(&hb->lock);
rt_mutex_cleanup_proxy_lock()
/* now the lock is still owned by task1, and the
* task2 removed itself as the waiter but its
* futex_q is still queued
*/
spin_lock(&hb->lock); /* block */

top_waiter = futex_top_waiter(hb, &key);
/* top_wait is task2's */
rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
/* rt_waiter is NULL and the
futex is unlocked in userland via uaddr
*/

and now

task 3 task4
locks in userland
futex_lock_pi();
futex_lock_pi_atomic();
-EINVAL = attach_to_pi_state()
/*
* becauase pi_state says owner
* is task1 but uaddr says task3.
*/

\*/

This is due to the new lock ordering and the guarantees we no longer
have since the commit cited. The pi-state is cleaned/ removed by the last
one that wants the lock so in the unlock path there is either pi-state
with a waiter or nothing.

This duct tape at the end waits until the pi-state leaves or we get a
waiter. So this works but is not a fix.

diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 90e5197f4e56..f504ae864cc9 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1182,6 +1182,9 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
if (!rt_waiter) {
raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+ spin_unlock(&hb->lock);
+ cpu_relax();
+ goto retry;
goto do_uncontended;
}

--
2.43.0

> regards,

Sebastian

2024-01-16 07:07:32