2010-07-13 08:03:43

by Darren Hart

Subject: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI

Thanks to Thomas, Steven, and Mike for hashing this over with me. After an
IRC discussion with Thomas, I put the following together. It resolves
the issue for me; Mike, please test and let us know if it fixes it for
you. A couple of points of discussion before we commit this:

The use of the new state flag, PI_WAKEUP_INPROGRESS, is pretty ugly.
Would a new task_pi_blocked_on_valid() method be preferred (in
rtmutex.c)?
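
Something like this untested sketch is what I have in mind, just wrapping
the open-coded checks from the patch below:

        /*
         * True if p->pi_blocked_on points at a real waiter rather than the
         * PI_WAKEUP_INPROGRESS marker. Callers hold p->pi_lock, as the
         * rtmutex.c sites below already do.
         */
        static inline int task_pi_blocked_on_valid(struct task_struct *p)
        {
                return p->pi_blocked_on &&
                       (long)p->pi_blocked_on != PI_WAKEUP_INPROGRESS;
        }

The rtmutex.c call sites would then test task_pi_blocked_on_valid(owner)
instead of repeating the cast everywhere.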

The new WARN_ON() in task_blocks_on_rt_mutex() is complex. It didn't
exist before, and now that we've closed this gap, should we just drop it?

I've added a couple BUG_ON()s in futex_wait_requeue_pi() dealing with
the race with requeue and q.lock_ptr. I'd like to leave this for the
time being if nobody strongly objects.

Thanks,

Darren


From 93fd3bb97800ebf5e5c1a6a85937bab93256dd42 Mon Sep 17 00:00:00 2001
From: Darren Hart <[email protected]>
Date: Fri, 9 Jul 2010 17:50:23 -0400
Subject: [PATCH 1/2] futex: protect against pi_blocked_on corruption during requeue PI

The requeue_pi mechanism introduced proxy locking of the rtmutex. This creates
a scenario where a task can wake up not knowing it has been enqueued on an
rtmutex. Blocking on hb->lock can then overwrite a valid value in
current->pi_blocked_on, leading to an inconsistent state.

Prevent overwriting pi_blocked_on by serializing on the waiter's pi_lock (a
raw_spinlock) and using the new PI_WAKEUP_INPROGRESS state flag to indicate a
waiter that has been woken by a timeout or signal. This prevents the rtmutex
code from adding the waiter to the rtmutex wait list; it returns -EAGAIN to
futex_requeue(), which in turn ignores the waiter during the requeue. Care
is taken to allow current to block on locks even if PI_WAKEUP_INPROGRESS is
set.

During normal wakeup, this results in one less hb->lock protected section. In
the pre-requeue timeout-or-signal wakeup case, this removes the "greedy locking"
behavior; no attempt will be made to acquire the lock.

Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
---
kernel/futex.c | 50 +++++++++++++++++++++++++++++++++-------------
kernel/rtmutex.c | 45 ++++++++++++++++++++++++++++++++++-------
kernel/rtmutex_common.h | 1 +
kernel/sched.c | 5 +++-
4 files changed, 78 insertions(+), 23 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index a6cec32..c92978d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1336,6 +1336,9 @@ retry_private:
requeue_pi_wake_futex(this, &key2, hb2);
drop_count++;
continue;
+ } else if (ret == -EAGAIN) {
+ /* Waiter woken by timeout or signal. */
+ continue;
} else if (ret) {
/* -EDEADLK */
this->pi_state = NULL;
@@ -2211,9 +2214,9 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
int clockrt, u32 __user *uaddr2)
{
struct hrtimer_sleeper timeout, *to = NULL;
+ struct futex_hash_bucket *hb, *hb2;
struct rt_mutex_waiter rt_waiter;
struct rt_mutex *pi_mutex = NULL;
- struct futex_hash_bucket *hb;
union futex_key key2;
struct futex_q q;
int res, ret;
@@ -2255,18 +2258,33 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
/* Queue the futex_q, drop the hb lock, wait for wakeup. */
futex_wait_queue_me(hb, &q, to);

- spin_lock(&hb->lock);
- ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
- spin_unlock(&hb->lock);
- if (ret)
- goto out_put_keys;
-
/*
- * In order for us to be here, we know our q.key == key2, and since
- * we took the hb->lock above, we also know that futex_requeue() has
- * completed and we no longer have to concern ourselves with a wakeup
- * race with the atomic proxy lock acquition by the requeue code.
+ * Avoid races with requeue and trying to block on two mutexes
+ * (hb->lock and uaddr2's rtmutex) by serializing access to
+ * pi_blocked_on with pi_lock and setting PI_BLOCKED_ON_PENDING.
+ */
+ raw_spin_lock(&current->pi_lock);
+ if (current->pi_blocked_on) {
+ raw_spin_unlock(&current->pi_lock);
+ } else {
+ current->pi_blocked_on = (struct rt_mutex_waiter *)PI_WAKEUP_INPROGRESS;
+ raw_spin_unlock(&current->pi_lock);
+
+ spin_lock(&hb->lock);
+ ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
+ spin_unlock(&hb->lock);
+ if (ret)
+ goto out_put_keys;
+ }
+
+ /*
+ * In order to be here, we have either been requeued, are in the process
+ * of being requeued, or requeue successfully acquired uaddr2 on our
+ * behalf. If pi_blocked_on was non-null above, we may be racing with a
+ * requeue. Do not rely on q->lock_ptr to be hb2->lock until after
+ * blocking on hb->lock or hb2->lock.
*/
+ hb2 = hash_futex(&key2);

/* Check if the requeue code acquired the second futex for us. */
if (!q.rt_waiter) {
@@ -2275,10 +2293,12 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
* did a lock-steal - fix up the PI-state in that case.
*/
if (q.pi_state && (q.pi_state->owner != current)) {
- spin_lock(q.lock_ptr);
+ spin_lock(&hb2->lock);
+ BUG_ON(&hb2->lock != q.lock_ptr);
+
ret = fixup_pi_state_owner(uaddr2, &q, current,
fshared);
- spin_unlock(q.lock_ptr);
+ spin_unlock(&hb2->lock);
}
} else {
/*
@@ -2291,7 +2311,9 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
debug_rt_mutex_free_waiter(&rt_waiter);

- spin_lock(q.lock_ptr);
+ spin_lock(&hb2->lock);
+ BUG_ON(&hb2->lock != q.lock_ptr);
+
/*
* Fixup the pi_state owner and possibly acquire the lock if we
* haven't already.
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 23dd443..0399108 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -227,7 +227,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
* reached or the state of the chain has changed while we
* dropped the locks.
*/
- if (!waiter || !waiter->task)
+ if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS || !waiter->task)
goto out_unlock_pi;

/*
@@ -448,6 +448,21 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
int chain_walk = 0, res;

raw_spin_lock(&task->pi_lock);
+
+ /*
+ * In the case of futex requeue PI, this will be a proxy lock. The task
+ * will wake unaware that it is enqueueed on this lock. Avoid blocking
+ * on two locks and corrupting pi_blocked_on via the
+ * PI_WAKEUP_INPROGRESS flag. futex_wait_requeue_pi() sets this when it
+ * wakes up before requeue (due to a signal or timeout). Do not enqueue
+ * the task if PI_WAKEUP_INPROGRESS is set.
+ */
+ if (task != current &&
+ (long)task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
+ raw_spin_unlock(&task->pi_lock);
+ return -EAGAIN;
+ }
+
__rt_mutex_adjust_prio(task);
waiter->task = task;
waiter->lock = lock;
@@ -459,6 +474,15 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
top_waiter = rt_mutex_top_waiter(lock);
plist_add(&waiter->list_entry, &lock->wait_list);

+ /*
+ * Tasks can only block on one lock at a time. In the case of futex
+ * requeue PI, if task == current it may have set PI_WAKEUP_INPROGRESS
+ * to prevent requeue, but it will still need to acquire locks on its
+ * way out of futex_wait_requeue_pi().
+ */
+ WARN_ON(task->pi_blocked_on != NULL &&
+ (task != current || (long)task->pi_blocked_on != PI_WAKEUP_INPROGRESS));
+
task->pi_blocked_on = waiter;

raw_spin_unlock(&task->pi_lock);
@@ -469,7 +493,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
plist_add(&waiter->pi_list_entry, &owner->pi_waiters);

__rt_mutex_adjust_prio(owner);
- if (owner->pi_blocked_on)
+ if (owner->pi_blocked_on &&
+ (long)owner->pi_blocked_on != PI_WAKEUP_INPROGRESS)
chain_walk = 1;
raw_spin_unlock(&owner->pi_lock);
}
@@ -579,9 +604,11 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)

raw_spin_lock(&pendowner->pi_lock);

- WARN_ON(!pendowner->pi_blocked_on);
- WARN_ON(pendowner->pi_blocked_on != waiter);
- WARN_ON(pendowner->pi_blocked_on->lock != lock);
+ if (!WARN_ON(!pendowner->pi_blocked_on) &&
+ !WARN_ON((long)pendowner->pi_blocked_on == PI_WAKEUP_INPROGRESS)) {
+ WARN_ON(pendowner->pi_blocked_on != waiter);
+ WARN_ON(pendowner->pi_blocked_on->lock != lock);
+ }

pendowner->pi_blocked_on = NULL;

@@ -624,7 +651,8 @@ static void remove_waiter(struct rt_mutex *lock,
}
__rt_mutex_adjust_prio(owner);

- if (owner->pi_blocked_on)
+ if (owner->pi_blocked_on &&
+ (long)owner->pi_blocked_on != PI_WAKEUP_INPROGRESS)
chain_walk = 1;

raw_spin_unlock(&owner->pi_lock);
@@ -658,7 +686,8 @@ void rt_mutex_adjust_pi(struct task_struct *task)
raw_spin_lock_irqsave(&task->pi_lock, flags);

waiter = task->pi_blocked_on;
- if (!waiter || waiter->list_entry.prio == task->prio) {
+ if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS ||
+ waiter->list_entry.prio == task->prio) {
raw_spin_unlock_irqrestore(&task->pi_lock, flags);
return;
}
@@ -1527,7 +1556,7 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock,
flags);

- if (ret && !waiter->task) {
+ if (ret == -EDEADLK && !waiter->task) {
/*
* Reset the return value. We might have
* returned with -EDEADLK and the owner
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 4df690c..94a856f 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -115,6 +115,7 @@ static inline unsigned long rt_mutex_owner_pending(struct rt_mutex *lock)
/*
* PI-futex support (proxy locking functions, etc.):
*/
+#define PI_WAKEUP_INPROGRESS 1
extern struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock);
extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
struct task_struct *proxy_owner);
diff --git a/kernel/sched.c b/kernel/sched.c
index aa5dced..9d4337e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -83,6 +83,8 @@
#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>

+#include "rtmutex_common.h"
+
/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
* to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
@@ -6377,7 +6379,8 @@ void task_setprio(struct task_struct *p, int prio)
*/
if (unlikely(p == rq->idle)) {
WARN_ON(p != rq->curr);
- WARN_ON(p->pi_blocked_on);
+ WARN_ON(p->pi_blocked_on &&
+ (long)p->pi_blocked_on != PI_WAKEUP_INPROGRESS);
goto out_unlock;
}

--
1.7.0.4


2010-07-13 09:26:30

by Thomas Gleixner

Subject: Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI

On Tue, 13 Jul 2010, Darren Hart wrote:

> Thanks to Thomas, Steven, and Mike for hashing this over with me. After an
> IRC discussion with Thomas, I put the following together. It resolves
> the issue for me; Mike, please test and let us know if it fixes it for
> you. A couple of points of discussion before we commit this:
>
> The use of the new state flag, PI_WAKEUP_INPROGRESS, is pretty ugly.
> Would a new task_pi_blocked_on_valid() method be preferred (in
> rtmutex.c)?
>
> The new WARN_ON() in task_blocks_on_rt_mutex() is complex. It didn't
> exist before, and now that we've closed this gap, should we just drop it?

We can simplify it to:

        WARN_ON(task->pi_blocked_on &&
                task->pi_blocked_on != PI_WAKEUP_INPROGRESS);

We already check for task != current and PI_WAKEUP_INPROGRESS just above.

> I've added a couple BUG_ON()s in futex_wait_requeue_pi() dealing with
> the race with requeue and q.lock_ptr. I'd like to leave this for the
> time being if nobody strongly objects.
> -
> /*
> - * In order for us to be here, we know our q.key == key2, and since
> - * we took the hb->lock above, we also know that futex_requeue() has
> - * completed and we no longer have to concern ourselves with a wakeup
> - * race with the atomic proxy lock acquition by the requeue code.
> + * Avoid races with requeue and trying to block on two mutexes
> + * (hb->lock and uaddr2's rtmutex) by serializing access to
> + * pi_blocked_on with pi_lock and setting PI_BLOCKED_ON_PENDING.
> + */
> + raw_spin_lock(&current->pi_lock);

Needs to be raw_spin_lock_irq()

> + if (current->pi_blocked_on) {
> + raw_spin_unlock(&current->pi_lock);
> + } else {
> + current->pi_blocked_on = (struct rt_mutex_waiter *)PI_WAKEUP_INPROGRESS;

#define PI_WAKEUP_INPROGRESS ((struct rt_mutex_waiter *) 1)

perhaps ? That gets rid of all type casts
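
The new checks then read without any casts, roughly:

        current->pi_blocked_on = PI_WAKEUP_INPROGRESS;
        ...
        if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
                raw_spin_unlock(&task->pi_lock);
                return -EAGAIN;
        }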

> + raw_spin_unlock(&current->pi_lock);
> +
> + spin_lock(&hb->lock);

We need to clean up current->pi_blocked_on here. If we succeed in the
hb->lock fast path, then we might leak PI_WAKEUP_INPROGRESS out to user space
and the next requeue will fail.
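
Something like this right after taking hb->lock, so the marker cannot
escape via the fast path:

        spin_lock(&hb->lock);

        /* Reset pi_blocked_on before anyone can see the stale marker */
        raw_spin_lock_irq(&current->pi_lock);
        current->pi_blocked_on = NULL;
        raw_spin_unlock_irq(&current->pi_lock);

        ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);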

> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index 23dd443..0399108 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -227,7 +227,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
> * reached or the state of the chain has changed while we
> * dropped the locks.
> */
> - if (!waiter || !waiter->task)
> + if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS || !waiter->task)
> goto out_unlock_pi;

Why do we need that check? Either the requeue succeeded, in which case
task->pi_blocked_on is set to the real waiter, or the wakeup won and
we are in no lock chain.

If we ever find a waiter with PI_WAKEUP_INPROGRESS set in
rt_mutex_adjust_prio_chain(), then it's a bug, nothing else.

> @@ -469,7 +493,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
> plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
>
> __rt_mutex_adjust_prio(owner);
> - if (owner->pi_blocked_on)
> + if (owner->pi_blocked_on &&
> + (long)owner->pi_blocked_on != PI_WAKEUP_INPROGRESS)

Again, that can never happen

> chain_walk = 1;
> raw_spin_unlock(&owner->pi_lock);
> }
> @@ -579,9 +604,11 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
>
> raw_spin_lock(&pendowner->pi_lock);
>
> - WARN_ON(!pendowner->pi_blocked_on);
> - WARN_ON(pendowner->pi_blocked_on != waiter);
> - WARN_ON(pendowner->pi_blocked_on->lock != lock);
> + if (!WARN_ON(!pendowner->pi_blocked_on) &&
> + !WARN_ON((long)pendowner->pi_blocked_on == PI_WAKEUP_INPROGRESS)) {

Ditto

> + WARN_ON(pendowner->pi_blocked_on != waiter);
> + WARN_ON(pendowner->pi_blocked_on->lock != lock);
> + }
>
> pendowner->pi_blocked_on = NULL;
>
> @@ -624,7 +651,8 @@ static void remove_waiter(struct rt_mutex *lock,
> }
> __rt_mutex_adjust_prio(owner);
>
> - if (owner->pi_blocked_on)
> + if (owner->pi_blocked_on &&
> + (long)owner->pi_blocked_on != PI_WAKEUP_INPROGRESS)
> chain_walk = 1;

Same here.

> raw_spin_unlock(&owner->pi_lock);
> @@ -658,7 +686,8 @@ void rt_mutex_adjust_pi(struct task_struct *task)
> raw_spin_lock_irqsave(&task->pi_lock, flags);
>
> waiter = task->pi_blocked_on;
> - if (!waiter || waiter->list_entry.prio == task->prio) {
> + if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS ||
> + waiter->list_entry.prio == task->prio) {

And here

> /*
> * Convert user-nice values [ -20 ... 0 ... 19 ]
> * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
> @@ -6377,7 +6379,8 @@ void task_setprio(struct task_struct *p, int prio)
> */
> if (unlikely(p == rq->idle)) {
> WARN_ON(p != rq->curr);
> - WARN_ON(p->pi_blocked_on);
> + WARN_ON(p->pi_blocked_on &&
> + (long)p->pi_blocked_on != PI_WAKEUP_INPROGRESS);

Yuck. Paranoia ? If we ever requeue idle, then .....

I'm going to clean up the stuff and send out a new patch for Mike
to test.

Thanks,

tglx

2010-07-13 09:59:35

by Thomas Gleixner

Subject: Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI

On Tue, 13 Jul 2010, Darren Hart wrote:
> diff --git a/kernel/futex.c b/kernel/futex.c
> index a6cec32..c92978d 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1336,6 +1336,9 @@ retry_private:
> requeue_pi_wake_futex(this, &key2, hb2);
> drop_count++;
> continue;
> + } else if (ret == -EAGAIN) {
> + /* Waiter woken by timeout or signal. */

This leaks the pi_state.
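
The -EAGAIN branch needs to drop its reference as well, roughly:

        } else if (ret == -EAGAIN) {
                /* Waiter woken by timeout or signal. */
                this->pi_state = NULL;
                free_pi_state(pi_state);
                continue;
        }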

2010-07-13 10:29:13

by Thomas Gleixner

Subject: Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI

On Tue, 13 Jul 2010, Thomas Gleixner wrote:
> On Tue, 13 Jul 2010, Darren Hart wrote:
>
> > --- a/kernel/rtmutex.c
> > +++ b/kernel/rtmutex.c
> > @@ -227,7 +227,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
> > * reached or the state of the chain has changed while we
> > * dropped the locks.
> > */
> > - if (!waiter || !waiter->task)
> > + if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS || !waiter->task)
> > goto out_unlock_pi;
>
> Why do we need that check? Either the requeue succeeded, in which case
> task->pi_blocked_on is set to the real waiter, or the wakeup won and
> we are in no lock chain.
>
> If we ever find a waiter with PI_WAKEUP_INPROGRESS set in
> rt_mutex_adjust_prio_chain(), then it's a bug, nothing else.

Grrr, I'm wrong. If we take hb->lock in the fast path then something
else might try to boost us and trip over this :(

This code causes braindamage. I really wonder whether we need to
remove it according to the "United Nations Convention against Torture
and Other Cruel, Inhuman or Degrading Treatment or Punishment".

> > @@ -6377,7 +6379,8 @@ void task_setprio(struct task_struct *p, int prio)
> > */
> > if (unlikely(p == rq->idle)) {
> > WARN_ON(p != rq->curr);
> > - WARN_ON(p->pi_blocked_on);
> > + WARN_ON(p->pi_blocked_on &&
> > + (long)p->pi_blocked_on != PI_WAKEUP_INPROGRESS);
>
> Yuck. Paranoia ? If we ever requeue idle, then .....

At least one which is bogus :)

Thanks,

tglx

2010-07-13 11:52:47

by Thomas Gleixner

Subject: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2

On Tue, 13 Jul 2010, Thomas Gleixner wrote:
>
> This code causes braindamage. I really wonder whether we need to
> remove it according to the "United Nations Convention against Torture
> and Other Cruel, Inhuman or Degrading Treatment or Punishment".
>

Ok, finally managed to twist my brain around it. Mike, can you give it
a test ride ?

Thanks,

tglx

-------->
Subject: futex: Protect against pi_blocked_on corruption during requeue PI
From: Darren Hart <[email protected]>
Date: Fri, 9 Jul 2010 17:50:23 -0400

The requeue_pi mechanism introduced proxy locking of the rtmutex. This
creates a scenario where a task can wake up not knowing it has been
enqueued on an rtmutex. Blocking on hb->lock can then overwrite a
valid value in current->pi_blocked_on, leading to an inconsistent
state.

Prevent overwriting pi_blocked_on by serializing on the waiter's
pi_lock and using the new PI_WAKEUP_INPROGRESS state flag to indicate
a waiter that has been woken by a timeout or signal. This prevents the
rtmutex code from adding the waiter to the rtmutex wait list; it
returns -EAGAIN to futex_requeue(), which in turn ignores the waiter
during the requeue. Care is taken to allow current to block on locks
even if PI_WAKEUP_INPROGRESS is set.

During normal wakeup, this results in one less hb->lock protected
section. In the pre-requeue timeout-or-signal wakeup case, this removes
the "greedy locking" behavior; no attempt will be made to acquire the
lock.

[ tglx: take pi_lock with lock_irq(), removed paranoid warning,
plugged pi_state and pi_blocked_on leak, adjusted some
comments ]

Signed-off-by: Darren Hart <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/futex.c | 50 +++++++++++++++++++++++++++++++++-------------
kernel/rtmutex.c | 45 ++++++++++++++++++++++++++++++++++-------
kernel/rtmutex_common.h | 1 +
kernel/sched.c | 5 +++-
4 files changed, 78 insertions(+), 23 deletions(-)

Index: linux-2.6-tip/kernel/futex.c
===================================================================
--- linux-2.6-tip.orig/kernel/futex.c
+++ linux-2.6-tip/kernel/futex.c
@@ -1336,6 +1336,16 @@ retry_private:
requeue_pi_wake_futex(this, &key2, hb2);
drop_count++;
continue;
+ } else if (ret == -EAGAIN) {
+ /*
+ * Waiter was woken by timeout or
+ * signal and has set pi_blocked_on to
+ * PI_WAKEUP_INPROGRESS before we
+ * tried to enqueue it on the rtmutex.
+ */
+ this->pi_state = NULL;
+ free_pi_state(pi_state);
+ continue;
} else if (ret) {
/* -EDEADLK */
this->pi_state = NULL;
@@ -2211,9 +2221,9 @@ static int futex_wait_requeue_pi(u32 __u
int clockrt, u32 __user *uaddr2)
{
struct hrtimer_sleeper timeout, *to = NULL;
+ struct futex_hash_bucket *hb, *hb2;
struct rt_mutex_waiter rt_waiter;
struct rt_mutex *pi_mutex = NULL;
- struct futex_hash_bucket *hb;
union futex_key key2;
struct futex_q q;
int res, ret;
@@ -2255,18 +2265,51 @@ static int futex_wait_requeue_pi(u32 __u
/* Queue the futex_q, drop the hb lock, wait for wakeup. */
futex_wait_queue_me(hb, &q, to);

- spin_lock(&hb->lock);
- ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
- spin_unlock(&hb->lock);
- if (ret)
- goto out_put_keys;
+ /*
+ * Avoid races with requeue and trying to block on two mutexes
+ * (hb->lock and uaddr2's rtmutex) by serializing access to
+ * pi_blocked_on with pi_lock.
+ */
+ raw_spin_lock_irq(&current->pi_lock);
+ if (current->pi_blocked_on) {
+ /* Requeue happened already */
+ raw_spin_unlock_irq(&current->pi_lock);
+ } else {
+ /*
+ * Setting pi_blocked_on to PI_WAKEUP_INPROGRESS
+ * prevents a concurrent requeue from enqueuein us on
+ * the uaddr2 rtmutex. After that we can safely
+ * acquire (and possibly block on) hb->lock.
+ */
+ current->pi_blocked_on = PI_WAKEUP_INPROGRESS;
+ raw_spin_unlock_irq(&current->pi_lock);
+
+ spin_lock(&hb->lock);
+
+ /*
+ * Clean up pi_blocked_on. We might leak it otherwise
+ * when we succeeded with the hb->lock in the fast
+ * path.
+ */
+ raw_spin_lock_irq(&current->pi_lock);
+ current->pi_blocked_on = NULL;
+ raw_spin_unlock_irq(&current->pi_lock);
+
+ ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
+ spin_unlock(&hb->lock);
+ if (ret)
+ goto out_put_keys;
+ }

/*
- * In order for us to be here, we know our q.key == key2, and since
- * we took the hb->lock above, we also know that futex_requeue() has
- * completed and we no longer have to concern ourselves with a wakeup
- * race with the atomic proxy lock acquition by the requeue code.
+ * In order to be here, we have either been requeued, are in
+ * the process of being requeued, or requeue successfully
+ * acquired uaddr2 on our behalf. If pi_blocked_on was
+ * non-null above, we may be racing with a requeue. Do not
+ * rely on q->lock_ptr to be hb2->lock until after blocking on
+ * hb->lock or hb2->lock.
*/
+ hb2 = hash_futex(&key2);

/* Check if the requeue code acquired the second futex for us. */
if (!q.rt_waiter) {
@@ -2275,10 +2318,12 @@ static int futex_wait_requeue_pi(u32 __u
* did a lock-steal - fix up the PI-state in that case.
*/
if (q.pi_state && (q.pi_state->owner != current)) {
- spin_lock(q.lock_ptr);
+ spin_lock(&hb2->lock);
+ BUG_ON(&hb2->lock != q.lock_ptr);
+
ret = fixup_pi_state_owner(uaddr2, &q, current,
fshared);
- spin_unlock(q.lock_ptr);
+ spin_unlock(&hb2->lock);
}
} else {
/*
@@ -2291,7 +2336,9 @@ static int futex_wait_requeue_pi(u32 __u
ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
debug_rt_mutex_free_waiter(&rt_waiter);

- spin_lock(q.lock_ptr);
+ spin_lock(&hb2->lock);
+ BUG_ON(&hb2->lock != q.lock_ptr);
+
/*
* Fixup the pi_state owner and possibly acquire the lock if we
* haven't already.
Index: linux-2.6-tip/kernel/rtmutex.c
===================================================================
--- linux-2.6-tip.orig/kernel/rtmutex.c
+++ linux-2.6-tip/kernel/rtmutex.c
@@ -82,6 +82,11 @@ static void fixup_rt_mutex_waiters(struc
clear_rt_mutex_waiters(lock);
}

+static inline int rt_mutex_real_waiter(struct rt_mutex_waiter *waiter)
+{
+ return waiter && waiter != PI_WAKEUP_INPROGRESS;
+}
+
/*
* We can speed up the acquire/release, if the architecture
* supports cmpxchg and if there's no debugging state to be set up
@@ -227,7 +232,7 @@ static int rt_mutex_adjust_prio_chain(st
* reached or the state of the chain has changed while we
* dropped the locks.
*/
- if (!waiter || !waiter->task)
+ if (!rt_mutex_real_waiter(waiter) || !waiter->task)
goto out_unlock_pi;

/*
@@ -448,6 +453,23 @@ static int task_blocks_on_rt_mutex(struc
int chain_walk = 0, res;

raw_spin_lock(&task->pi_lock);
+
+ /*
+ * In the case of futex requeue PI, this will be a proxy
+ * lock. The task will wake unaware that it is enqueueed on
+ * this lock. Avoid blocking on two locks and corrupting
+ * pi_blocked_on via the PI_WAKEUP_INPROGRESS
+ * flag. futex_wait_requeue_pi() sets this when it wakes up
+ * before requeue (due to a signal or timeout). Do not enqueue
+ * the task if PI_WAKEUP_INPROGRESS is set.
+ */
+ if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
+ raw_spin_unlock(&task->pi_lock);
+ return -EAGAIN;
+ }
+
+ BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on));
+
__rt_mutex_adjust_prio(task);
waiter->task = task;
waiter->lock = lock;
@@ -469,7 +491,7 @@ static int task_blocks_on_rt_mutex(struc
plist_add(&waiter->pi_list_entry, &owner->pi_waiters);

__rt_mutex_adjust_prio(owner);
- if (owner->pi_blocked_on)
+ if (rt_mutex_real_waiter(owner->pi_blocked_on))
chain_walk = 1;
raw_spin_unlock(&owner->pi_lock);
}
@@ -624,7 +646,7 @@ static void remove_waiter(struct rt_mute
}
__rt_mutex_adjust_prio(owner);

- if (owner->pi_blocked_on)
+ if (rt_mutex_real_waiter(owner->pi_blocked_on))
chain_walk = 1;

raw_spin_unlock(&owner->pi_lock);
@@ -658,7 +680,8 @@ void rt_mutex_adjust_pi(struct task_stru
raw_spin_lock_irqsave(&task->pi_lock, flags);

waiter = task->pi_blocked_on;
- if (!waiter || waiter->list_entry.prio == task->prio) {
+ if (!rt_mutex_real_waiter(waiter) ||
+ waiter->list_entry.prio == task->prio) {
raw_spin_unlock_irqrestore(&task->pi_lock, flags);
return;
}
@@ -1527,7 +1550,7 @@ int rt_mutex_start_proxy_lock(struct rt_
ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock,
flags);

- if (ret && !waiter->task) {
+ if (ret == -EDEADLK && !waiter->task) {
/*
* Reset the return value. We might have
* returned with -EDEADLK and the owner
Index: linux-2.6-tip/kernel/rtmutex_common.h
===================================================================
--- linux-2.6-tip.orig/kernel/rtmutex_common.h
+++ linux-2.6-tip/kernel/rtmutex_common.h
@@ -115,6 +115,9 @@ static inline unsigned long rt_mutex_own
/*
* PI-futex support (proxy locking functions, etc.):
*/
+
+#define PI_WAKEUP_INPROGRESS ((struct rt_mutex_waiter *) 1)
+
extern struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock);
extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
struct task_struct *proxy_owner);

2010-07-13 15:56:37

by Mike Galbraith

Subject: Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2

On Tue, 2010-07-13 at 13:52 +0200, Thomas Gleixner wrote:
> On Tue, 13 Jul 2010, Thomas Gleixner wrote:
> >
> > This code causes braindamage. I really wonder whether we need to
> > remove it according to the "United Nations Convention against Torture
> > and Other Cruel, Inhuman or Degrading Treatment or Punishment".
> >
>
> Ok, finally managed to twist my brain around it. Mike, can you give it
> a test ride ?

I'm away from box until Friday evening. I'll test, but it'll likely be
ancient history by then.

-Mike

2010-07-13 19:00:15

by Darren Hart

Subject: Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2

On 07/13/2010 04:52 AM, Thomas Gleixner wrote:
> On Tue, 13 Jul 2010, Thomas Gleixner wrote:
>>
>> This code causes braindamage. I really wonder whether we need to
>> remove it according to the "United Nations Convention against Torture
>> and Other Cruel, Inhuman or Degrading Treatment or Punishment".
>>
>
> Ok, finally managed to twist my brain around it. Mike, can you give it
> a test ride ?

Since Mike is out I built this version and ran it a few times. I saw a
100% reproduction rate previously. I haven't seen any errors in a
handful of runs (~10) with this patch.

I do still see the hrtimer latency message on the first run, so this is
likely unrelated to the issue we're addressing:

Jul 13 14:47:59 elm9m94 kernel: hrtimer: interrupt took 123924 ns

As for Thomas's changes, only a couple of nitpicks below:


> @@ -2255,18 +2265,51 @@ static int futex_wait_requeue_pi(u32 __u
> /* Queue the futex_q, drop the hb lock, wait for wakeup. */
> futex_wait_queue_me(hb,&q, to);
>
> - spin_lock(&hb->lock);
> - ret = handle_early_requeue_pi_wakeup(hb,&q,&key2, to);
> - spin_unlock(&hb->lock);
> - if (ret)
> - goto out_put_keys;
> + /*
> + * Avoid races with requeue and trying to block on two mutexes
> + * (hb->lock and uaddr2's rtmutex) by serializing access to
> + * pi_blocked_on with pi_lock.
> + */
> + raw_spin_lock_irq(&current->pi_lock);
> + if (current->pi_blocked_on) {
> + /* Requeue happened already */

This comment isn't quite accurate. The requeue may be in progress, which
means the q.lock_ptr is not trustworthy as noted below. Consider:

        /*
         * We have been requeued, or are in the process
         * of being requeued.
         */

> + raw_spin_unlock_irq(&current->pi_lock);
> + } else {
> + /*
> + * Setting pi_blocked_on to PI_WAKEUP_INPROGRESS
> + * prevents a concurrent requeue from enqueuein us on

s/enqueuein/enqueueing/, not that my dictionary thinks either one of
them is a word ;-)

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-07-18 08:32:52

by Mike Galbraith

Subject: Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2

On Tue, 2010-07-13 at 13:52 +0200, Thomas Gleixner wrote:
> On Tue, 13 Jul 2010, Thomas Gleixner wrote:

> Ok, finally managed to twist my brain around it. Mike, can you give it
> a test ride ?

It's a bit late, but I took it out for a spin; no problems noted.

-Mike