2010-07-09 22:33:21

by Darren Hart

Subject: [PATCH 0/4][RT] futex: fix tasks blocking on two rt_mutex locks

This patch set addresses the following problem reported by Mike Galbraith:

> Stress testing, looking to trigger RCU stalls, I've managed to find a
> way to repeatably create fireworks. (got RCU stall, see attached)
>
> 1. download ltp-full-20100630. Needs to be this version because of
> testcase bustage in earlier versions, and must be built with gcc > 4.3,
> else testcases will segfault due to a gcc bug.
>
> 2. apply patchlet so you can run testcases/realtime/perf/latency/run.sh
> at all.
>
> --- pthread_cond_many.c.org 2010-07-05 09:05:59.000000000 +0200
> +++ pthread_cond_many.c 2010-07-04 12:12:25.000000000 +0200
> @@ -259,7 +259,7 @@ void usage(void)
>
> int parse_args(int c, char *v)
> {
> - int handled;
> + int handled = 1;
> switch (c) {
> case 'h':
> usage();
>
> 3. add --realtime for no particular reason.
>
> --- run.sh.org 2010-07-06 15:54:58.000000000 +0200
> +++ run.sh 2010-07-06 16:37:34.000000000 +0200
> @@ -22,7 +22,7 @@ make
> # process to run realtime. The remainder of the processes (if any)
> # will run non-realtime in any case.
>
> -nthread=5000
> +nthread=500
> iter=400
> nproc=5
>
> @@ -39,7 +39,7 @@ i=0
> i=1
> while test $i -lt $nproc
> do
> - ./pthread_cond_many --broadcast -i $iter -n $nthread > $nthread.$iter.$nproc.$i.out &
> + ./pthread_cond_many --realtime --broadcast -i $iter -n $nthread > $nthread.$iter.$nproc.$i.out &
> i=`expr $i + 1`
> done
> wait

The LTP wreckage will be addressed separately <ugh>

> 4. run it.
>
> What happens here is we hit WARN_ON(pendowner->pi_blocked_on != waiter),
> this does not make it to consoles (poking sysrq-foo doesn't either).
> Next comes WARN_ON(!pendowner->pi_blocked_on), followed by the NULL
> explosion, which does make it to consoles.
>
> With explosion avoidance, I also see pendowner->pi_blocked_on->task ==
> NULL at times, but that, like !pendowner->pi_blocked_on, seems to be
> fallout. The start of bad juju is always pi_blocked_on != waiter.
>
> [ 141.609268] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
> [ 141.609268] IP: [<ffffffff8106856d>] wakeup_next_waiter+0x12c/0x177
> [ 141.609268] PGD 20e174067 PUD 20e253067 PMD 0
> [ 141.609268] Oops: 0000 [#1] PREEMPT SMP
> [ 141.609268] last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
> [ 141.609268] CPU 0
> [ 141.609268] Pid: 8154, comm: pthread_cond_ma Tainted: G W 2.6.33.6-rt23 #12 MS-7502/MS-7502
> [ 141.609268] RIP: 0010:[<ffffffff8106856d>] [<ffffffff8106856d>] wakeup_next_waiter+0x12c/0x177
> [ 141.609268] RSP: 0018:ffff88020e3cdd78 EFLAGS: 00010097
> [ 141.609268] RAX: 0000000000000000 RBX: ffff8801e8eba5c0 RCX: 0000000000000000
> [ 141.609268] RDX: ffff880028200000 RSI: 0000000000000046 RDI: 0000000000000009
> [ 141.609268] RBP: ffff88020e3cdda8 R08: 0000000000000002 R09: 0000000000000000
> [ 141.609268] R10: 0000000000000005 R11: 0000000000000000 R12: ffffffff81659068
> [ 141.609268] R13: ffff8801e8ebdb58 R14: 0000000000000000 R15: ffff8801e8ebac08
> [ 141.609268] FS: 00007f664d539700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
> [ 141.609268] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 141.609268] CR2: 0000000000000058 CR3: 0000000214266000 CR4: 00000000000006f0
> [ 141.609268] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 141.609268] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 141.609268] Process pthread_cond_ma (pid: 8154, threadinfo ffff88020e3cc000, task ffff88020e2a4700)
> [ 141.609268] Stack:
> [ 141.609268] 0000000000000000 ffffffff81659068 0000000000000202 0000000000000000
> [ 141.609268] <0> 0000000000000000 0000000080001fda ffff88020e3cddc8 ffffffff812fec48
> [ 141.609268] <0> ffffffff81659068 0000000000606300 ffff88020e3cddd8 ffffffff812ff1b9
> [ 141.609268] Call Trace:
> [ 141.609268] [<ffffffff812fec48>] rt_spin_lock_slowunlock+0x43/0x61
> [ 141.609268] [<ffffffff812ff1b9>] rt_spin_unlock+0x46/0x48
> [ 141.609268] [<ffffffff81067d7f>] do_futex+0x83c/0x935
> [ 141.609268] [<ffffffff810c26ce>] ? handle_mm_fault+0x6de/0x6f1
> [ 141.609268] [<ffffffff81067e36>] ? do_futex+0x8f3/0x935
> [ 141.609268] [<ffffffff81067fba>] sys_futex+0x142/0x154
> [ 141.609268] [<ffffffff81020eb0>] ? do_page_fault+0x23e/0x28e
> [ 141.609268] [<ffffffff81004aa7>] ? math_state_restore+0x3d/0x3f
> [ 141.609268] [<ffffffff81004b08>] ? do_device_not_available+0xe/0x12
> [ 141.609268] [<ffffffff81002c5b>] system_call_fastpath+0x16/0x1b
> [ 141.609268] Code: c7 09 6d 41 81 e8 ac 34 fd ff 4c 39 ab 70 06 00 00 74 11 be 47 02 00 00 48 c7 c7 09 6d 41 81 e8 92 34 fd ff 48 8b 83 70 06 00 00 <4c> 39 60 58 74 11 be 48 02 00 00 48 c7 c7 09 6d 41 81 e8 74 34
> [ 141.609268] RIP [<ffffffff8106856d>] wakeup_next_waiter+0x12c/0x177
> [ 141.609268] RSP <ffff88020e3cdd78>
> [ 141.609268] CR2: 0000000000000058
> [ 141.609268] ---[ end trace 58805b944e6f93ce ]---
> [ 141.609268] note: pthread_cond_ma[8154] exited with preempt_count 2


2010-07-09 22:33:30

by Darren Hart

Subject: [PATCH 1/4] rtmutex: avoid null dereference in WARN_ON

If the pi_blocked_on variable is NULL, the subsequent WARN_ONs
will cause an OOPS. Only perform the subsequent checks if
pi_blocked_on is valid.

Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
---
kernel/rtmutex.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 23dd443..baac7d9 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -579,9 +579,10 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)

raw_spin_lock(&pendowner->pi_lock);

- WARN_ON(!pendowner->pi_blocked_on);
- WARN_ON(pendowner->pi_blocked_on != waiter);
- WARN_ON(pendowner->pi_blocked_on->lock != lock);
+ if (!WARN_ON(!pendowner->pi_blocked_on)) {
+ WARN_ON(pendowner->pi_blocked_on != waiter);
+ WARN_ON(pendowner->pi_blocked_on->lock != lock);
+ }

pendowner->pi_blocked_on = NULL;

--
1.7.0.4
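
For readers unfamiliar with the idiom used above: WARN_ON() evaluates to the
truth value of its condition, which is what lets the patch guard the follow-on
dereferences. A minimal userspace sketch of the pattern, with a stand-in
WARN_ON() and illustrative types (not kernel code; the macro mirrors the
kernel's GCC statement-expression form):

#include <stdio.h>

/*
 * Userspace stand-in for the kernel's WARN_ON(): like the kernel macro,
 * it prints a warning and evaluates to the truth value of its condition.
 */
#define WARN_ON(cond) \
        ({ int __ret = !!(cond); \
           if (__ret) \
                   fprintf(stderr, "WARNING: %s\n", #cond); \
           __ret; })

struct waiter { void *lock; };

/* The pattern from the patch: dereference only after the NULL check. */
static void check(struct waiter *pi_blocked_on, struct waiter *waiter,
                  void *lock)
{
        if (!WARN_ON(!pi_blocked_on)) {
                WARN_ON(pi_blocked_on != waiter);
                WARN_ON(pi_blocked_on->lock != lock);
        }
}

int main(void)
{
        struct waiter w = { .lock = (void *)0x1 };

        check(NULL, &w, (void *)0x1);   /* warns once, no NULL dereference */
        check(&w, &w, (void *)0x1);     /* silent */
        return 0;
}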

2010-07-09 22:33:49

by Darren Hart

Subject: [PATCH 2/4] rtmutex: add BUG_ON if a task attempts to block on two locks

rtmutex proxy locking complicates the logic a bit and opens up
the possibility for a task to wake and attempt to take another
sleeping lock without knowing it has been enqueued on another
lock already. Add a BUG_ON to catch this scenario early.

Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
---
kernel/rtmutex.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index baac7d9..22f9d18 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -459,6 +459,9 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
top_waiter = rt_mutex_top_waiter(lock);
plist_add(&waiter->list_entry, &lock->wait_list);

+ /* Tasks can only block on one lock at a time. */
+ BUG_ON(task->pi_blocked_on != NULL);
+
task->pi_blocked_on = waiter;

raw_spin_unlock(&task->pi_lock);
--
1.7.0.4

2010-07-09 22:34:01

by Darren Hart

Subject: [PATCH 3/4] futex: free_pi_state outside of hb->lock sections

free_pi_state() calls kfree() and might sleep. To prepare for raw hb->locks,
get the calls to free_pi_state() out of the hb->lock() sections.

Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
---
kernel/futex.c | 14 ++++++++------
1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index a6cec32..b217972 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -114,7 +114,7 @@ struct futex_q {
struct plist_node list;

struct task_struct *task;
- spinlock_t *lock_ptr;
+ raw_spinlock_t *lock_ptr;
union futex_key key;
struct futex_pi_state *pi_state;
struct rt_mutex_waiter *rt_waiter;
@@ -487,7 +487,7 @@ void exit_pi_state_list(struct task_struct *curr)
* task still owns the PI-state:
*/
if (head->next != next) {
- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);
continue;
}

@@ -1339,7 +1339,6 @@ retry_private:
} else if (ret) {
/* -EDEADLK */
this->pi_state = NULL;
- free_pi_state(pi_state);
goto out_unlock;
}
}
@@ -1437,7 +1436,7 @@ static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
*/
static int unqueue_me(struct futex_q *q)
{
- spinlock_t *lock_ptr;
+ raw_spinlock_t *lock_ptr;
int ret = 0;

/* In the common case we don't take the spinlock, which is nice. */
@@ -1483,16 +1482,19 @@ retry:
*/
static void unqueue_me_pi(struct futex_q *q)
{
+ struct futex_pi_state *pi_state = NULL;
+
WARN_ON(plist_node_empty(&q->list));
plist_del(&q->list, &q->list.plist);

BUG_ON(!q->pi_state);
- free_pi_state(q->pi_state);
+ pi_state = q->pi_state;
q->pi_state = NULL;

spin_unlock(q->lock_ptr);
-
drop_futex_key_refs(&q->key);
+
+ free_pi_state(pi_state);
}

/*
--
1.7.0.4
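
The shape of the unqueue_me_pi() change above is the usual detach-then-free
pattern: take the pointer while protected, drop the lock, and only then call
the possibly-sleeping free. A standalone sketch of the pattern, with a pthread
mutex standing in for hb->lock and illustrative type names:

#include <pthread.h>
#include <stdlib.h>

struct pi_state { int refcount; };

struct futex_q {
        pthread_mutex_t *lock_ptr;      /* stands in for hb->lock */
        struct pi_state *pi_state;
};

/*
 * Detach under the lock, free after unlocking: kfree() may sleep, so it
 * must not run inside what will become a raw-lock section.
 */
static void unqueue_me_pi_sketch(struct futex_q *q)
{
        struct pi_state *pi_state;

        pthread_mutex_lock(q->lock_ptr);
        pi_state = q->pi_state;         /* detach while protected */
        q->pi_state = NULL;
        pthread_mutex_unlock(q->lock_ptr);

        free(pi_state);                 /* lock no longer held */
}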

2010-07-09 22:33:59

by Darren Hart

Subject: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t

The requeue_pi mechanism introduced proxy locking of the rtmutex. This creates
a scenario where a task can wake up, not knowing it has been enqueued on an
rtmutex. In order to detect this, the task would have to be able to take
task->pi_blocked_on->lock->wait_lock and/or the hb->lock. Unfortunately,
without already holding one of these, the pi_blocked_on variable can change
from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
to take a sleeping lock after wakeup, or it could end up trying to block on two
locks, the second overwriting a valid pi_blocked_on value. This obviously
breaks the pi mechanism.

This patch increases latency. While running the ltp pthread_cond_many test
with which Mike reported the bug, I see double-digit microsecond hrtimer
latencies (typically only on the first run after boot):

kernel: hrtimer: interrupt took 75911 ns

This might be addressed by changing the various loops in the futex code to be
incremental, probably at an additional throughput hit. The private hash_bucket
lists discussed in the past could reduce hb->lock contention in some scenarios.
It should be noted that pthread_cond_many is a rather pathological case.

This also introduces problems for plists which want a spinlock_t rather
than a raw_spinlock_t. Any thoughts on how to address this?

Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
---
kernel/futex.c | 67 ++++++++++++++++++++++++++++++--------------------------
1 files changed, 36 insertions(+), 31 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index b217972..0ad5a85 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -128,7 +128,7 @@ struct futex_q {
* waiting on a futex.
*/
struct futex_hash_bucket {
- spinlock_t lock;
+ raw_spinlock_t lock;
struct plist_head chain;
};

@@ -479,7 +479,7 @@ void exit_pi_state_list(struct task_struct *curr)
hb = hash_futex(&key);
raw_spin_unlock_irq(&curr->pi_lock);

- spin_lock(&hb->lock);
+ raw_spin_lock(&hb->lock);

raw_spin_lock_irq(&curr->pi_lock);
/*
@@ -499,7 +499,7 @@ void exit_pi_state_list(struct task_struct *curr)

rt_mutex_unlock(&pi_state->pi_mutex);

- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);

raw_spin_lock_irq(&curr->pi_lock);
}
@@ -860,21 +860,21 @@ static inline void
double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
{
if (hb1 <= hb2) {
- spin_lock(&hb1->lock);
+ raw_spin_lock(&hb1->lock);
if (hb1 < hb2)
- spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING);
+ raw_spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING);
} else { /* hb1 > hb2 */
- spin_lock(&hb2->lock);
- spin_lock_nested(&hb1->lock, SINGLE_DEPTH_NESTING);
+ raw_spin_lock(&hb2->lock);
+ raw_spin_lock_nested(&hb1->lock, SINGLE_DEPTH_NESTING);
}
}

static inline void
double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
{
- spin_unlock(&hb1->lock);
+ raw_spin_unlock(&hb1->lock);
if (hb1 != hb2)
- spin_unlock(&hb2->lock);
+ raw_spin_unlock(&hb2->lock);
}

/*
@@ -896,7 +896,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
goto out;

hb = hash_futex(&key);
- spin_lock(&hb->lock);
+ raw_spin_lock(&hb->lock);
head = &hb->chain;

plist_for_each_entry_safe(this, next, head, list) {
@@ -916,7 +916,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
}
}

- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);
put_futex_key(fshared, &key);
out:
return ret;
@@ -1070,6 +1070,7 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,

q->lock_ptr = &hb->lock;
#ifdef CONFIG_DEBUG_PI_LIST
+ /* FIXME: we're converting this to a raw lock... */
q->list.plist.spinlock = &hb->lock;
#endif

@@ -1377,14 +1378,14 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
hb = hash_futex(&q->key);
q->lock_ptr = &hb->lock;

- spin_lock(&hb->lock);
+ raw_spin_lock(&hb->lock);
return hb;
}

static inline void
queue_unlock(struct futex_q *q, struct futex_hash_bucket *hb)
{
- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);
drop_futex_key_refs(&q->key);
}

@@ -1416,11 +1417,12 @@ static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)

plist_node_init(&q->list, prio);
#ifdef CONFIG_DEBUG_PI_LIST
+ /* FIXME: we're converting this to a raw_spinlock */
q->list.plist.spinlock = &hb->lock;
#endif
plist_add(&q->list, &hb->chain);
q->task = current;
- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);
}

/**
@@ -1444,7 +1446,7 @@ retry:
lock_ptr = q->lock_ptr;
barrier();
if (lock_ptr != NULL) {
- spin_lock(lock_ptr);
+ raw_spin_lock(lock_ptr);
/*
* q->lock_ptr can change between reading it and
* spin_lock(), causing us to take the wrong lock. This
@@ -1459,7 +1461,7 @@ retry:
* we can detect whether we acquired the correct lock.
*/
if (unlikely(lock_ptr != q->lock_ptr)) {
- spin_unlock(lock_ptr);
+ raw_spin_unlock(lock_ptr);
goto retry;
}
WARN_ON(plist_node_empty(&q->list));
@@ -1467,7 +1469,7 @@ retry:

BUG_ON(q->pi_state);

- spin_unlock(lock_ptr);
+ raw_spin_unlock(lock_ptr);
ret = 1;
}

@@ -1491,7 +1493,7 @@ static void unqueue_me_pi(struct futex_q *q)
pi_state = q->pi_state;
q->pi_state = NULL;

- spin_unlock(q->lock_ptr);
+ raw_spin_unlock(q->lock_ptr);
drop_futex_key_refs(&q->key);

free_pi_state(pi_state);
@@ -1579,11 +1581,11 @@ retry:
* simply return.
*/
handle_fault:
- spin_unlock(q->lock_ptr);
+ raw_spin_unlock(q->lock_ptr);

ret = fault_in_user_writeable(uaddr);

- spin_lock(q->lock_ptr);
+ raw_spin_lock(q->lock_ptr);

/*
* Check if someone else fixed it for us:
@@ -1976,7 +1978,7 @@ retry_private:
ret = ret ? 0 : -EWOULDBLOCK;
}

- spin_lock(q.lock_ptr);
+ raw_spin_lock(q.lock_ptr);
/*
* Fixup the pi_state owner and possibly acquire the lock if we
* haven't already.
@@ -2053,7 +2055,7 @@ retry:
goto out;

hb = hash_futex(&key);
- spin_lock(&hb->lock);
+ raw_spin_lock(&hb->lock);

/*
* To avoid races, try to do the TID -> 0 atomic transition
@@ -2102,14 +2104,14 @@ retry:
}

out_unlock:
- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);
put_futex_key(fshared, &key);

out:
return ret;

pi_faulted:
- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);
put_futex_key(fshared, &key);

ret = fault_in_user_writeable(uaddr);
@@ -2257,9 +2259,9 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
/* Queue the futex_q, drop the hb lock, wait for wakeup. */
futex_wait_queue_me(hb, &q, to);

- spin_lock(&hb->lock);
+ raw_spin_lock(&hb->lock);
ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);
if (ret)
goto out_put_keys;

@@ -2277,10 +2279,10 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
* did a lock-steal - fix up the PI-state in that case.
*/
if (q.pi_state && (q.pi_state->owner != current)) {
- spin_lock(q.lock_ptr);
+ raw_spin_lock(q.lock_ptr);
ret = fixup_pi_state_owner(uaddr2, &q, current,
fshared);
- spin_unlock(q.lock_ptr);
+ raw_spin_unlock(q.lock_ptr);
}
} else {
/*
@@ -2293,7 +2295,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
debug_rt_mutex_free_waiter(&rt_waiter);

- spin_lock(q.lock_ptr);
+ raw_spin_lock(q.lock_ptr);
/*
* Fixup the pi_state owner and possibly acquire the lock if we
* haven't already.
@@ -2668,8 +2670,11 @@ static int __init futex_init(void)
futex_cmpxchg_enabled = 1;

for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
- plist_head_init(&futex_queues[i].chain, &futex_queues[i].lock);
- spin_lock_init(&futex_queues[i].lock);
+ /*
+ * FIXME: plist wants a spinlock, but the hb->lock is a raw_spinlock_t
+ */
+ plist_head_init(&futex_queues[i].chain, NULL /*&futex_queues[i].lock*/);
+ raw_spin_lock_init(&futex_queues[i].lock);
}

return 0;
--
1.7.0.4
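
The unqueue_me() hunk above keeps the existing read-lock-revalidate dance
intact: q->lock_ptr can change while we wait on the old lock, so after
acquiring it we re-check the pointer and retry on a mismatch. A userspace
sketch of that dance (the types and the atomic load are illustrative; the
kernel version uses raw_spin_lock() and barrier()):

#include <pthread.h>
#include <stddef.h>

struct futex_q_sketch { pthread_mutex_t *lock_ptr; };

/*
 * Lock whichever bucket q currently hangs off. If the task was requeued
 * while we slept on the old lock, lock_ptr no longer matches, so drop
 * the wrong lock and retry.
 */
static void lock_q(struct futex_q_sketch *q)
{
        pthread_mutex_t *lock_ptr;

retry:
        lock_ptr = __atomic_load_n(&q->lock_ptr, __ATOMIC_ACQUIRE);
        if (lock_ptr != NULL) {
                pthread_mutex_lock(lock_ptr);
                if (lock_ptr != q->lock_ptr) {
                        /* requeued to another bucket: wrong lock */
                        pthread_mutex_unlock(lock_ptr);
                        goto retry;
                }
        }
        /* the caller now holds the current lock, if there was one */
}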

2010-07-09 22:56:04

by Darren Hart

Subject: Re: [PATCH 3/4 V2] futex: free_pi_state outside of hb->lock sections

Apologies, mangled a rebase and this patch had bits of 4/4 in it.
Corrected below:


From f40b8e684df6ed2b1ba9167cad3922ab43f4a717 Mon Sep 17 00:00:00 2001
From: Darren Hart <[email protected]>
Date: Fri, 9 Jul 2010 17:07:10 -0400
Subject: [PATCH 3/4 V2] futex: free_pi_state outside of hb->lock sections

free_pi_state() calls kfree() and might sleep. To prepare for raw hb->locks,
get the calls to free_pi_state() out of the hb->lock() sections.

Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
---
kernel/futex.c | 8 +++++---
1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index a6cec32..2cd58a2 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1339,7 +1339,6 @@ retry_private:
} else if (ret) {
/* -EDEADLK */
this->pi_state = NULL;
- free_pi_state(pi_state);
goto out_unlock;
}
}
@@ -1483,16 +1482,19 @@ retry:
*/
static void unqueue_me_pi(struct futex_q *q)
{
+ struct futex_pi_state *pi_state = NULL;
+
WARN_ON(plist_node_empty(&q->list));
plist_del(&q->list, &q->list.plist);

BUG_ON(!q->pi_state);
- free_pi_state(q->pi_state);
+ pi_state = q->pi_state;
q->pi_state = NULL;

spin_unlock(q->lock_ptr);
-
drop_futex_key_refs(&q->key);
+
+ free_pi_state(pi_state);
}

/*
--
1.7.0.4

2010-07-09 22:57:20

by Darren Hart

Subject: Re: [PATCH 4/4 V2] futex: convert hash_bucket locks to raw_spinlock_t

This version pulls in the bits mistakenly left in 3/4.


From 9f8b4faac79518f98131464c2d21a1c64fb841d2 Mon Sep 17 00:00:00 2001
From: Darren Hart <[email protected]>
Date: Fri, 9 Jul 2010 16:44:47 -0400
Subject: [PATCH 4/4 V2] futex: convert hash_bucket locks to raw_spinlock_t

The requeue_pi mechanism introduced proxy locking of the rtmutex. This creates
a scenario where a task can wake up, not knowing it has been enqueued on an
rtmutex. In order to detect this, the task would have to be able to take
task->pi_blocked_on->lock->wait_lock and/or the hb->lock. Unfortunately,
without already holding one of these, the pi_blocked_on variable can change
from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
to take a sleeping lock after wakeup, or it could end up trying to block on two
locks, the second overwriting a valid pi_blocked_on value. This obviously
breaks the pi mechanism.

This patch increases latency. While running the ltp pthread_cond_many test
with which Mike reported the bug, I see double-digit microsecond hrtimer
latencies (typically only on the first run after boot):

kernel: hrtimer: interrupt took 75911 ns

This might be addressed by changing the various loops in the futex code to be
incremental, probably at an additional throughput hit. The private hash_bucket
lists discussed in the past could reduce hb->lock contention in some scenarios.
It should be noted that pthread_cond_many is a rather pathological case.

This also introduces problems for plists which want a spinlock_t rather
than a raw_spinlock_t. Any thoughts on how to address this?

Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
---
kernel/futex.c | 73 ++++++++++++++++++++++++++++++--------------------------
1 files changed, 39 insertions(+), 34 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 2cd58a2..0ad5a85 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -114,7 +114,7 @@ struct futex_q {
struct plist_node list;

struct task_struct *task;
- spinlock_t *lock_ptr;
+ raw_spinlock_t *lock_ptr;
union futex_key key;
struct futex_pi_state *pi_state;
struct rt_mutex_waiter *rt_waiter;
@@ -128,7 +128,7 @@ struct futex_q {
* waiting on a futex.
*/
struct futex_hash_bucket {
- spinlock_t lock;
+ raw_spinlock_t lock;
struct plist_head chain;
};

@@ -479,7 +479,7 @@ void exit_pi_state_list(struct task_struct *curr)
hb = hash_futex(&key);
raw_spin_unlock_irq(&curr->pi_lock);

- spin_lock(&hb->lock);
+ raw_spin_lock(&hb->lock);

raw_spin_lock_irq(&curr->pi_lock);
/*
@@ -487,7 +487,7 @@ void exit_pi_state_list(struct task_struct *curr)
* task still owns the PI-state:
*/
if (head->next != next) {
- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);
continue;
}

@@ -499,7 +499,7 @@ void exit_pi_state_list(struct task_struct *curr)

rt_mutex_unlock(&pi_state->pi_mutex);

- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);

raw_spin_lock_irq(&curr->pi_lock);
}
@@ -860,21 +860,21 @@ static inline void
double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
{
if (hb1 <= hb2) {
- spin_lock(&hb1->lock);
+ raw_spin_lock(&hb1->lock);
if (hb1 < hb2)
- spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING);
+ raw_spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING);
} else { /* hb1 > hb2 */
- spin_lock(&hb2->lock);
- spin_lock_nested(&hb1->lock, SINGLE_DEPTH_NESTING);
+ raw_spin_lock(&hb2->lock);
+ raw_spin_lock_nested(&hb1->lock, SINGLE_DEPTH_NESTING);
}
}

static inline void
double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
{
- spin_unlock(&hb1->lock);
+ raw_spin_unlock(&hb1->lock);
if (hb1 != hb2)
- spin_unlock(&hb2->lock);
+ raw_spin_unlock(&hb2->lock);
}

/*
@@ -896,7 +896,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
goto out;

hb = hash_futex(&key);
- spin_lock(&hb->lock);
+ raw_spin_lock(&hb->lock);
head = &hb->chain;

plist_for_each_entry_safe(this, next, head, list) {
@@ -916,7 +916,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
}
}

- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);
put_futex_key(fshared, &key);
out:
return ret;
@@ -1070,6 +1070,7 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,

q->lock_ptr = &hb->lock;
#ifdef CONFIG_DEBUG_PI_LIST
+ /* FIXME: we're converting this to a raw lock... */
q->list.plist.spinlock = &hb->lock;
#endif

@@ -1377,14 +1378,14 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
hb = hash_futex(&q->key);
q->lock_ptr = &hb->lock;

- spin_lock(&hb->lock);
+ raw_spin_lock(&hb->lock);
return hb;
}

static inline void
queue_unlock(struct futex_q *q, struct futex_hash_bucket *hb)
{
- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);
drop_futex_key_refs(&q->key);
}

@@ -1416,11 +1417,12 @@ static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)

plist_node_init(&q->list, prio);
#ifdef CONFIG_DEBUG_PI_LIST
+ /* FIXME: we're converting this to a raw_spinlock */
q->list.plist.spinlock = &hb->lock;
#endif
plist_add(&q->list, &hb->chain);
q->task = current;
- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);
}

/**
@@ -1436,7 +1438,7 @@ static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
*/
static int unqueue_me(struct futex_q *q)
{
- spinlock_t *lock_ptr;
+ raw_spinlock_t *lock_ptr;
int ret = 0;

/* In the common case we don't take the spinlock, which is nice. */
@@ -1444,7 +1446,7 @@ retry:
lock_ptr = q->lock_ptr;
barrier();
if (lock_ptr != NULL) {
- spin_lock(lock_ptr);
+ raw_spin_lock(lock_ptr);
/*
* q->lock_ptr can change between reading it and
* spin_lock(), causing us to take the wrong lock. This
@@ -1459,7 +1461,7 @@ retry:
* we can detect whether we acquired the correct lock.
*/
if (unlikely(lock_ptr != q->lock_ptr)) {
- spin_unlock(lock_ptr);
+ raw_spin_unlock(lock_ptr);
goto retry;
}
WARN_ON(plist_node_empty(&q->list));
@@ -1467,7 +1469,7 @@ retry:

BUG_ON(q->pi_state);

- spin_unlock(lock_ptr);
+ raw_spin_unlock(lock_ptr);
ret = 1;
}

@@ -1491,7 +1493,7 @@ static void unqueue_me_pi(struct futex_q *q)
pi_state = q->pi_state;
q->pi_state = NULL;

- spin_unlock(q->lock_ptr);
+ raw_spin_unlock(q->lock_ptr);
drop_futex_key_refs(&q->key);

free_pi_state(pi_state);
@@ -1579,11 +1581,11 @@ retry:
* simply return.
*/
handle_fault:
- spin_unlock(q->lock_ptr);
+ raw_spin_unlock(q->lock_ptr);

ret = fault_in_user_writeable(uaddr);

- spin_lock(q->lock_ptr);
+ raw_spin_lock(q->lock_ptr);

/*
* Check if someone else fixed it for us:
@@ -1976,7 +1978,7 @@ retry_private:
ret = ret ? 0 : -EWOULDBLOCK;
}

- spin_lock(q.lock_ptr);
+ raw_spin_lock(q.lock_ptr);
/*
* Fixup the pi_state owner and possibly acquire the lock if we
* haven't already.
@@ -2053,7 +2055,7 @@ retry:
goto out;

hb = hash_futex(&key);
- spin_lock(&hb->lock);
+ raw_spin_lock(&hb->lock);

/*
* To avoid races, try to do the TID -> 0 atomic transition
@@ -2102,14 +2104,14 @@ retry:
}

out_unlock:
- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);
put_futex_key(fshared, &key);

out:
return ret;

pi_faulted:
- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);
put_futex_key(fshared, &key);

ret = fault_in_user_writeable(uaddr);
@@ -2257,9 +2259,9 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
/* Queue the futex_q, drop the hb lock, wait for wakeup. */
futex_wait_queue_me(hb, &q, to);

- spin_lock(&hb->lock);
+ raw_spin_lock(&hb->lock);
ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
- spin_unlock(&hb->lock);
+ raw_spin_unlock(&hb->lock);
if (ret)
goto out_put_keys;

@@ -2277,10 +2279,10 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
* did a lock-steal - fix up the PI-state in that case.
*/
if (q.pi_state && (q.pi_state->owner != current)) {
- spin_lock(q.lock_ptr);
+ raw_spin_lock(q.lock_ptr);
ret = fixup_pi_state_owner(uaddr2, &q, current,
fshared);
- spin_unlock(q.lock_ptr);
+ raw_spin_unlock(q.lock_ptr);
}
} else {
/*
@@ -2293,7 +2295,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
debug_rt_mutex_free_waiter(&rt_waiter);

- spin_lock(q.lock_ptr);
+ raw_spin_lock(q.lock_ptr);
/*
* Fixup the pi_state owner and possibly acquire the lock if we
* haven't already.
@@ -2668,8 +2670,11 @@ static int __init futex_init(void)
futex_cmpxchg_enabled = 1;

for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
- plist_head_init(&futex_queues[i].chain, &futex_queues[i].lock);
- spin_lock_init(&futex_queues[i].lock);
+ /*
+ * FIXME: plist wants a spinlock, but the hb->lock is a raw_spinlock_t
+ */
+ plist_head_init(&futex_queues[i].chain, NULL /*&futex_queues[i].lock*/);
+ raw_spin_lock_init(&futex_queues[i].lock);
}

return 0;
--
1.7.0.4

2010-07-10 00:29:52

by Steven Rostedt

Subject: Re: [PATCH 1/4] rtmutex: avoid null dereference in WARN_ON

On Fri, 2010-07-09 at 15:32 -0700, Darren Hart wrote:
> If the pi_blocked_on variable is NULL, the subsequent WARN_ONs
> will cause an OOPS. Only perform the subsequent checks if
> pi_blocked_on is valid.
>
> Signed-off-by: Darren Hart <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: John Kacur <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> ---
> kernel/rtmutex.c | 7 ++++---
> 1 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index 23dd443..baac7d9 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -579,9 +579,10 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
>
> raw_spin_lock(&pendowner->pi_lock);
>
> - WARN_ON(!pendowner->pi_blocked_on);
> - WARN_ON(pendowner->pi_blocked_on != waiter);
> - WARN_ON(pendowner->pi_blocked_on->lock != lock);
> + if (!WARN_ON(!pendowner->pi_blocked_on)) {
> + WARN_ON(pendowner->pi_blocked_on != waiter);

The above actually has no issue if pi_blocked_on is NULL.

The below, well yeah.

-- Steve

> + WARN_ON(pendowner->pi_blocked_on->lock != lock);
> + }
>
> pendowner->pi_blocked_on = NULL;
>

2010-07-10 00:30:29

by Steven Rostedt

Subject: Re: [PATCH 2/4] rtmutex: add BUG_ON if a task attempts to block on two locks

On Fri, 2010-07-09 at 15:32 -0700, Darren Hart wrote:
> rtmutex proxy locking complicates the logic a bit and opens up
> the possibility for a task to wake and attempt to take another
> sleeping lock without knowing it has been enqueued on another
> lock already. Add a BUG_ON to catch this scenario early.
>
> Signed-off-by: Darren Hart <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: John Kacur <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> ---
> kernel/rtmutex.c | 3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index baac7d9..22f9d18 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -459,6 +459,9 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
> top_waiter = rt_mutex_top_waiter(lock);
> plist_add(&waiter->list_entry, &lock->wait_list);
>
> + /* Tasks can only block on one lock at a time. */
> + BUG_ON(task->pi_blocked_on != NULL);

WARN_ON may be better, since it may not cause a system crash or other
huge bug if the condition is violated.

-- Steve

> +
> task->pi_blocked_on = waiter;
>
> raw_spin_unlock(&task->pi_lock);

2010-07-10 00:32:49

by Steven Rostedt

Subject: Re: [PATCH 3/4 V2] futex: free_pi_state outside of hb->lock sections

On Fri, 2010-07-09 at 15:55 -0700, Darren Hart wrote:
> Apologies, mangled a rebase and this patch had bits of 4/4 in it.
> Corrected below:
>
>
> From f40b8e684df6ed2b1ba9167cad3922ab43f4a717 Mon Sep 17 00:00:00 2001
> From: Darren Hart <[email protected]>
> Date: Fri, 9 Jul 2010 17:07:10 -0400
> Subject: [PATCH 3/4 V2] futex: free_pi_state outside of hb->lock sections
>
> free_pi_state() calls kfree() and might sleep. To prepare for raw hb->locks,
> get the calls to free_pi_state() out of the hb->lock() sections.
>
> Signed-off-by: Darren Hart <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: John Kacur <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> ---
> kernel/futex.c | 8 +++++---
> 1 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index a6cec32..2cd58a2 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1339,7 +1339,6 @@ retry_private:

This is why I always add a space before labels. I have no idea what
function this was in.

> } else if (ret) {
> /* -EDEADLK */
> this->pi_state = NULL;
> - free_pi_state(pi_state);

Where do we free the pi state here?

-- Steve

> goto out_unlock;
> }
> }
> @@ -1483,16 +1482,19 @@ retry:
> */
> static void unqueue_me_pi(struct futex_q *q)
> {
> + struct futex_pi_state *pi_state = NULL;
> +
> WARN_ON(plist_node_empty(&q->list));
> plist_del(&q->list, &q->list.plist);
>
> BUG_ON(!q->pi_state);
> - free_pi_state(q->pi_state);
> + pi_state = q->pi_state;
> q->pi_state = NULL;
>
> spin_unlock(q->lock_ptr);
> -
> drop_futex_key_refs(&q->key);
> +
> + free_pi_state(pi_state);
> }
>
> /*

2010-07-10 00:34:41

by Steven Rostedt

Subject: Re: [PATCH 4/4 V2] futex: convert hash_bucket locks to raw_spinlock_t

On Fri, 2010-07-09 at 15:57 -0700, Darren Hart wrote:
> This version pulls in the bits mistakenly left in 3/4.
>
>
> From 9f8b4faac79518f98131464c2d21a1c64fb841d2 Mon Sep 17 00:00:00 2001
> From: Darren Hart <[email protected]>
> Date: Fri, 9 Jul 2010 16:44:47 -0400
> Subject: [PATCH 4/4 V2] futex: convert hash_bucket locks to raw_spinlock_t
>
> The requeue_pi mechanism introduced proxy locking of the rtmutex. This creates
> a scenario where a task can wake up, not knowing it has been enqueued on an
> rtmutex. In order to detect this, the task would have to be able to take
> task->pi_blocked_on->lock->wait_lock and/or the hb->lock. Unfortunately,
> without already holding one of these, the pi_blocked_on variable can change
> from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
> to take a sleeping lock after wakeup, or it could end up trying to block on two
> locks, the second overwriting a valid pi_blocked_on value. This obviously
> breaks the pi mechanism.
>
> This patch increases latency. While running the ltp pthread_cond_many test
> with which Mike reported the bug, I see double-digit microsecond hrtimer
> latencies (typically only on the first run after boot):
>
> kernel: hrtimer: interrupt took 75911 ns
>
> This might be addressed by changing the various loops in the futex code to be
> incremental, probably at an additional throughput hit. The private hash_bucket
> lists discussed in the past could reduce hb->lock contention in some scenarios.
> It should be noted that pthread_cond_many is a rather pathological case.
>
> This also introduces problems for plists which want a spinlock_t rather
> than a raw_spinlock_t. Any thoughts on how to address this?
>
> Signed-off-by: Darren Hart <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: John Kacur <[email protected]>

Acked-by: Steven Rostedt <[email protected]>

-- Steve

> Cc: Mike Galbraith <[email protected]>
> ---
> kernel/futex.c | 73 ++++++++++++++++++++++++++++++--------------------------
> 1 files changed, 39 insertions(+), 34 deletions(-)

2010-07-10 14:41:29

by Darren Hart

Subject: Re: [PATCH 3/4 V2] futex: free_pi_state outside of hb->lock sections

On 07/09/2010 05:32 PM, Steven Rostedt wrote:
> On Fri, 2010-07-09 at 15:55 -0700, Darren Hart wrote:
>> Apologies, mangled a rebase and this patch had bits of 4/4 in it.
>> Corrected below:
>>
>>
>> From f40b8e684df6ed2b1ba9167cad3922ab43f4a717 Mon Sep 17 00:00:00 2001
>> From: Darren Hart<[email protected]>
>> Date: Fri, 9 Jul 2010 17:07:10 -0400
>> Subject: [PATCH 3/4 V2] futex: free_pi_state outside of hb->lock sections
>>
>> free_pi_state() calls kfree() and might sleep. To prepare for raw hb->locks,
>> get the calls to free_pi_state() out of the hb->lock() sections.
>>
>> Signed-off-by: Darren Hart<[email protected]>
>> Cc: Thomas Gleixner<[email protected]>
>> Cc: Peter Zijlstra<[email protected]>
>> Cc: Ingo Molnar<[email protected]>
>> Cc: Eric Dumazet<[email protected]>
>> Cc: John Kacur<[email protected]>
>> Cc: Steven Rostedt<[email protected]>
>> Cc: Mike Galbraith<[email protected]>
>> ---
>> kernel/futex.c | 8 +++++---
>> 1 files changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/futex.c b/kernel/futex.c
>> index a6cec32..2cd58a2 100644
>> --- a/kernel/futex.c
>> +++ b/kernel/futex.c
>> @@ -1339,7 +1339,6 @@ retry_private:
>
> This is why I always add a space before labels. I have no idea what
> function this was in.
>
>> } else if (ret) {
>> /* -EDEADLK */
>> this->pi_state = NULL;
>> - free_pi_state(pi_state);
>
> Where do we free the pi state here?

Near the end, after the goto out_unlock, which falls through to the out:
label. In this case I didn't move it; it just wasn't necessary.

out:
        if (pi_state != NULL)
                free_pi_state(pi_state);


--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-07-10 14:42:51

by Darren Hart

Subject: Re: [PATCH 1/4] rtmutex: avoid null dereference in WARN_ON

On 07/09/2010 05:29 PM, Steven Rostedt wrote:
> On Fri, 2010-07-09 at 15:32 -0700, Darren Hart wrote:
>> If the pi_blocked_on variable is NULL, the subsequent WARN_ONs
>> will cause an OOPS. Only perform the subsequent checks if
>> pi_blocked_on is valid.
>>
>> Signed-off-by: Darren Hart<[email protected]>
>> Cc: Thomas Gleixner<[email protected]>
>> Cc: Peter Zijlstra<[email protected]>
>> Cc: Ingo Molnar<[email protected]>
>> Cc: Eric Dumazet<[email protected]>
>> Cc: John Kacur<[email protected]>
>> Cc: Steven Rostedt<[email protected]>
>> Cc: Mike Galbraith<[email protected]>
>> ---
>> kernel/rtmutex.c | 7 ++++---
>> 1 files changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
>> index 23dd443..baac7d9 100644
>> --- a/kernel/rtmutex.c
>> +++ b/kernel/rtmutex.c
>> @@ -579,9 +579,10 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
>>
>> raw_spin_lock(&pendowner->pi_lock);
>>
>> - WARN_ON(!pendowner->pi_blocked_on);
>> - WARN_ON(pendowner->pi_blocked_on != waiter);
>> - WARN_ON(pendowner->pi_blocked_on->lock != lock);
>> + if (!WARN_ON(!pendowner->pi_blocked_on)) {
>> + WARN_ON(pendowner->pi_blocked_on != waiter);
>
> The above actually has no issue if the pi_blocked_on is NULL.

It doesn't, but it's also redundant and makes the console noisier for no
reason. Seemed worthwhile to drop it under the if in the same go.

--
Darren


> The below, well yeah.
>
> -- Steve
>
>> + WARN_ON(pendowner->pi_blocked_on->lock != lock);
>> + }
>>
>> pendowner->pi_blocked_on = NULL;
>>
>
>


--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-07-10 17:30:53

by Darren Hart

Subject: Re: [PATCH 2/4 V2] rtmutex: add BUG_ON if a task attempts to block on two locks

On 07/09/2010 05:30 PM, Steven Rostedt wrote:
> On Fri, 2010-07-09 at 15:32 -0700, Darren Hart wrote:
>> rtmutex proxy locking complicates the logic a bit and opens up
>> the possibility for a task to wake and attempt to take another
>> sleeping lock without knowing it has been enqueued on another
>> lock already. Add a BUG_ON to catch this scenario early.
>>
>> Signed-off-by: Darren Hart<[email protected]>
>> Cc: Thomas Gleixner<[email protected]>
>> Cc: Peter Zijlstra<[email protected]>
>> Cc: Ingo Molnar<[email protected]>
>> Cc: Eric Dumazet<[email protected]>
>> Cc: John Kacur<[email protected]>
>> Cc: Steven Rostedt<[email protected]>
>> Cc: Mike Galbraith<[email protected]>
>> ---
>> kernel/rtmutex.c | 3 +++
>> 1 files changed, 3 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
>> index baac7d9..22f9d18 100644
>> --- a/kernel/rtmutex.c
>> +++ b/kernel/rtmutex.c
>> @@ -459,6 +459,9 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
>> top_waiter = rt_mutex_top_waiter(lock);
>> plist_add(&waiter->list_entry,&lock->wait_list);
>>
>> + /* Tasks can only block on one lock at a time. */
>> + BUG_ON(task->pi_blocked_on != NULL);
>
> WARN_ON may be better, since it may not cause a system crash or other
> huge bug if the condition is violated.
>

No objection.


From 31c7b6c5657bcc897ddc79d7f9bb1942eb4c854a Mon Sep 17 00:00:00 2001
From: Darren Hart <[email protected]>
Date: Fri, 9 Jul 2010 17:46:05 -0400
Subject: [PATCH 2/4] rtmutex: add WARN_ON if a task attempts to block on two locks

rtmutex proxy locking complicates the logic a bit and opens up
the possibility for a task to wake and attempt to take another
sleeping lock without knowing it has been enqueued on another
lock already. Add a WARN_ON to catch this scenario early.

V2: use a WARN_ON instead of a BUG_ON per Steven's request.

Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
---
kernel/rtmutex.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index baac7d9..5262d0f 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -459,6 +459,9 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
top_waiter = rt_mutex_top_waiter(lock);
plist_add(&waiter->list_entry, &lock->wait_list);

+ /* Tasks can only block on one lock at a time. */
+ WARN_ON(task->pi_blocked_on != NULL);
+
task->pi_blocked_on = waiter;

raw_spin_unlock(&task->pi_lock);
--
1.7.0.4

2010-07-10 19:41:28

by Mike Galbraith

Subject: Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t

On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
> The requeue_pi mechanism introduced proxy locking of the rtmutex. This creates
> a scenario where a task can wake up, not knowing it has been enqueued on an
> rtmutex. In order to detect this, the task would have to be able to take
> task->pi_blocked_on->lock->wait_lock and/or the hb->lock. Unfortunately,
> without already holding one of these, the pi_blocked_on variable can change
> from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
> to take a sleeping lock after wakeup, or it could end up trying to block on two
> locks, the second overwriting a valid pi_blocked_on value. This obviously
> breaks the pi mechanism.

Copy/paste of an offline query/reply, at Darren's request:

On Sat, 2010-07-10 at 10:26 -0700, Darren Hart wrote:
On 07/09/2010 09:32 PM, Mike Galbraith wrote:
> > On Fri, 2010-07-09 at 13:05 -0700, Darren Hart wrote:
> >
> >> The core of the problem is that the proxy_lock blocks a task on a lock
> >> the task knows nothing about. So when it wakes up inside of
> >> futex_wait_requeue_pi, it immediately tries to block on hb->lock to
> >> check why it woke up. This has the potential to block the task on two
> >> locks (thus overwriting the pi_blocked_on). Any attempt at preventing
> >> this involves a lock, and ultimately the hb->lock. The only solution I
> >> see is to make the hb->locks raw locks (thanks to Steven Rostedt for the
> >> original idea and for batting this around with me in IRC).
> >
> > Hm, so wakee _was_ munging his own state after all.
> >
> > Out of curiosity, what's wrong with holding his pi_lock across the
> > wakeup? He can _try_ to block, but can't until pi state is stable.
> >
> > I presume there's a big fat gotcha that's just not obvious to a futex
> > locking newbie :)
>
> It'll take me more time than I have right now to be positive, but:
>
>
> rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
>
> raw_spin_unlock(&current->pi_lock);
>
> Your patch moved the unlock before the set_owner. I _believe_ this can
> break the pi boosting logic - current is the owner until it calls
> set_owner to hand ownership to pendowner. I haven't traced this entire
> path yet, but that's my gut feel.

I _think_ it should be fine to do that. Setting an owner seems to only
require holding the wait_lock. I could easily be missing subtleties
though. Looking around, I didn't see any reason not to unlock the
owner's pi_lock after twiddling pi_waiters (and still don't, but...).

> However, your idea has merit, as we have to take our ->pi_lock before
> we can block on the hb->lock (inside task_blocks_on_rt_mutex()).
>
> If we can't move the unlock above before set_owner, then we may need a:
>
> retry:
>         cur->lock()
>         top_waiter = get_top_waiter()
>         cur->unlock()
>
>         double_lock(cur, topwaiter)
>         if top_waiter != get_top_waiter()
>                 double_unlock(cur, topwaiter)
>                 goto retry
>
> Not ideal, but I think I prefer that to making all the hb locks raw.
>
> You dropped the CC list for some reason, probably a good idea to send
> this back out in response to my raw lock patch (4/4) - your question and
> my reply. This is crazy stuff, no harm in putting the question out there.
>
> I'll take a closer look at this when I can, if not tonight, Monday morning.

-Mike
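
For concreteness, the retry/double-lock pseudocode quoted above might be
rendered roughly as follows. This is a sketch only: the task type,
get_top_waiter(), and the double_lock()/double_unlock() helpers are
placeholders for illustration, not existing kernel API, and pthread mutexes
stand in for the raw pi_locks.

#include <pthread.h>
#include <stddef.h>

/* Illustrative task with its own pi_lock and a top-waiter link. */
struct task {
        pthread_mutex_t pi_lock;
        struct task *top_waiter;
};

static struct task *get_top_waiter(struct task *t)
{
        return t->top_waiter;
}

/* Take the locks of two tasks in address order to avoid ABBA deadlock. */
static void double_lock(struct task *a, struct task *b)
{
        if (a > b) { struct task *t = a; a = b; b = t; }
        pthread_mutex_lock(&a->pi_lock);
        if (a != b)
                pthread_mutex_lock(&b->pi_lock);
}

static void double_unlock(struct task *a, struct task *b)
{
        pthread_mutex_unlock(&a->pi_lock);
        if (a != b)
                pthread_mutex_unlock(&b->pi_lock);
}

/* The retry dance from the quoted pseudocode. */
static void lock_cur_and_top_waiter(struct task *cur)
{
        struct task *top;

retry:
        pthread_mutex_lock(&cur->pi_lock);
        top = get_top_waiter(cur);      /* snapshot under our own lock */
        pthread_mutex_unlock(&cur->pi_lock);

        if (top == NULL)
                return;                 /* nothing to do in this sketch */

        double_lock(cur, top);
        if (top != get_top_waiter(cur)) {
                /* top waiter changed while we held neither lock */
                double_unlock(cur, top);
                goto retry;
        }
        /* both pi_locks held and the snapshot is still valid */
        double_unlock(cur, top);
}

Locking both pi_locks in address order is what keeps double_lock()
deadlock-safe; the revalidation catches a top-waiter change in the window
where neither lock was held.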

2010-07-11 13:33:32

by Mike Galbraith

Subject: Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t

On Sat, 2010-07-10 at 21:41 +0200, Mike Galbraith wrote:
> On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:

> > If we can't move the unlock above before set_owner, then we may need a:
> >
> > retry:
> >         cur->lock()
> >         top_waiter = get_top_waiter()
> >         cur->unlock()
> >
> >         double_lock(cur, topwaiter)
> >         if top_waiter != get_top_waiter()
> >                 double_unlock(cur, topwaiter)
> >                 goto retry
> >
> > Not ideal, but I think I prefer that to making all the hb locks raw.

Another option: only scratch the itchy spot.

futex: non-blocking synchronization point for futex_wait_requeue_pi() and futex_requeue().

Problem analysis by Darren Hart:
The requeue_pi mechanism introduced proxy locking of the rtmutex. This creates
a scenario where a task can wake up, not knowing it has been enqueued on an
rtmutex. In order to detect this, the task would have to be able to take
task->pi_blocked_on->lock->wait_lock and/or the hb->lock. Unfortunately,
without already holding one of these, the pi_blocked_on variable can change
from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
to take a sleeping lock after wakeup, or it could end up trying to block on two
locks, the second overwriting a valid pi_blocked_on value. This obviously
breaks the pi mechanism.

Rather than convert the hb->lock to a raw spinlock, do so only in the spot
where blocking cannot be allowed, i.e. before we know that lock handoff has
completed.

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Steven Rostedt <[email protected]>

diff --git a/kernel/futex.c b/kernel/futex.c
index a6cec32..ef489f3 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2255,7 +2255,14 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
/* Queue the futex_q, drop the hb lock, wait for wakeup. */
futex_wait_queue_me(hb, &q, to);

- spin_lock(&hb->lock);
+ /*
+ * Non-blocking synchronization point with futex_requeue().
+ *
+ * We dare not block here because this will alter PI state, possibly
+ * before our waker finishes modifying same in wakeup_next_waiter().
+ */
+ while(!spin_trylock(&hb->lock))
+ cpu_relax();
ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
spin_unlock(&hb->lock);
if (ret)

2010-07-11 15:10:22

by Darren Hart

Subject: Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t

On 07/11/2010 06:33 AM, Mike Galbraith wrote:
> On Sat, 2010-07-10 at 21:41 +0200, Mike Galbraith wrote:
>> On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
>
>>> If we can't move the unlock above before set_owner, then we may need a:
>>>
>>> retry:
>>>         cur->lock()
>>>         top_waiter = get_top_waiter()
>>>         cur->unlock()
>>>
>>>         double_lock(cur, topwaiter)
>>>         if top_waiter != get_top_waiter()
>>>                 double_unlock(cur, topwaiter)
>>>                 goto retry
>>>
>>> Not ideal, but I think I prefer that to making all the hb locks raw.
>
> Another option: only scratch the itchy spot.
>
> futex: non-blocking synchronization point for futex_wait_requeue_pi() and futex_requeue().
>
> Problem analysis by Darren Hart:
> The requeue_pi mechanism introduced proxy locking of the rtmutex. This creates
> a scenario where a task can wake up, not knowing it has been enqueued on an
> rtmutex. In order to detect this, the task would have to be able to take
> task->pi_blocked_on->lock->wait_lock and/or the hb->lock. Unfortunately,
> without already holding one of these, the pi_blocked_on variable can change
> from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
> to take a sleeping lock after wakeup, or it could end up trying to block on two
> locks, the second overwriting a valid pi_blocked_on value. This obviously
> breaks the pi mechanism.
>
> Rather than convert the hb->lock to a raw spinlock, do so only in the spot
> where blocking cannot be allowed, i.e. before we know that lock handoff has
> completed.

I like it. I especially like that the change is only evident if you are
using the code path that introduced the problem in the first place. If
you're doing a lot of requeue_pi operations, the waking waiters have an
advantage over new pending waiters or other tasks with futexes keyed to
the same hash bucket... but that seems acceptable to me.

First, though, I'd like to confirm that holding the pendowner->pi_lock
across the wakeup in wakeup_next_waiter() isn't feasible. If it can work, I
think the impact would be lower. I'll have a look tomorrow.

Nice work, Mike.

--
Darren

> Signed-off-by: Mike Galbraith<[email protected]>
> Cc: Darren Hart<[email protected]>
> Cc: Thomas Gleixner<[email protected]>
> Cc: Peter Zijlstra<[email protected]>
> Cc: Ingo Molnar<[email protected]>
> Cc: Eric Dumazet<[email protected]>
> Cc: John Kacur<[email protected]>
> Cc: Steven Rostedt<[email protected]>
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index a6cec32..ef489f3 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -2255,7 +2255,14 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
> /* Queue the futex_q, drop the hb lock, wait for wakeup. */
> futex_wait_queue_me(hb,&q, to);
>
> - spin_lock(&hb->lock);
> + /*
> + * Non-blocking synchronization point with futex_requeue().
> + *
> + * We dare not block here because this will alter PI state, possibly
> + * before our waker finishes modifying same in wakeup_next_waiter().
> + */
> + while(!spin_trylock(&hb->lock))
> + cpu_relax();
> ret = handle_early_requeue_pi_wakeup(hb,&q,&key2, to);
> spin_unlock(&hb->lock);
> if (ret)
>
>


--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-07-12 10:36:10

by Thomas Gleixner

Subject: Re: [PATCH 3/4] futex: free_pi_state outside of hb->lock sections

On Fri, 9 Jul 2010, Darren Hart wrote:

> free_pi_state() calls kfree() and might sleep. To prepare for raw hb->locks,
> get the calls to free_pi_state() out of the hb->lock() sections.
>
> Signed-off-by: Darren Hart <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: John Kacur <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> ---
> kernel/futex.c | 14 ++++++++------
> 1 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index a6cec32..b217972 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -114,7 +114,7 @@ struct futex_q {
> struct plist_node list;
>
> struct task_struct *task;
> - spinlock_t *lock_ptr;
> + raw_spinlock_t *lock_ptr;

How is this related to the changelog?

Thanks,

tglx

2010-07-12 10:46:10

by Steven Rostedt

Subject: Re: [PATCH 3/4] futex: free_pi_state outside of hb->lock sections

On Mon, 2010-07-12 at 12:35 +0200, Thomas Gleixner wrote:
> On Fri, 9 Jul 2010, Darren Hart wrote:
>
> > free_pi_state() calls kfree() and might sleep. To prepare for raw hb->locks,
> > get the calls to free_pi_state() out of the hb->lock() sections.
> >
> > Signed-off-by: Darren Hart <[email protected]>

> > struct task_struct *task;
> > - spinlock_t *lock_ptr;
> > + raw_spinlock_t *lock_ptr;
>
> How is this related to the changelog?

Read his reply to himself (V2) ;-)

-- Steve

2010-07-12 11:45:24

by Steven Rostedt

Subject: Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t

On Sun, 2010-07-11 at 15:33 +0200, Mike Galbraith wrote:
> On Sat, 2010-07-10 at 21:41 +0200, Mike Galbraith wrote:

> diff --git a/kernel/futex.c b/kernel/futex.c
> index a6cec32..ef489f3 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -2255,7 +2255,14 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
> /* Queue the futex_q, drop the hb lock, wait for wakeup. */
> futex_wait_queue_me(hb, &q, to);
>
> - spin_lock(&hb->lock);
> + /*
> + * Non-blocking synchronization point with futex_requeue().
> + *
> + * We dare not block here because this will alter PI state, possibly
> + * before our waker finishes modifying same in wakeup_next_waiter().
> + */
> + while(!spin_trylock(&hb->lock))
> + cpu_relax();

I agree that this would work. But I wonder if this should have an:

#ifdef PREEMPT_RT
[...]
#else
spin_lock(&hb->lock);
#endif

around it. Or encapsulate this lock in a macro that does the same thing
(just to keep the actual code cleaner)

-- Steve

> ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
> spin_unlock(&hb->lock);
> if (ret)
>
>
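
Encapsulated the way Steve suggests, the helper might look something like
this (a sketch only; the helper name is made up and the exact config symbol
spelling varies across RT trees):

/*
 * On PREEMPT_RT, spin_lock() on hb->lock is a sleeping rtmutex
 * acquisition and could disturb PI state, so spin with trylock
 * instead; elsewhere the plain lock is fine.
 */
#ifdef CONFIG_PREEMPT_RT
# define hb_lock_nonblocking(lock)                      \
        do {                                            \
                while (!spin_trylock(lock))             \
                        cpu_relax();                    \
        } while (0)
#else
# define hb_lock_nonblocking(lock)      spin_lock(lock)
#endif

The call site in futex_wait_requeue_pi() would then read
hb_lock_nonblocking(&hb->lock); in place of the open-coded trylock loop.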

2010-07-12 12:12:40

by Mike Galbraith

Subject: Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t

On Mon, 2010-07-12 at 07:45 -0400, Steven Rostedt wrote:
> On Sun, 2010-07-11 at 15:33 +0200, Mike Galbraith wrote:
> > On Sat, 2010-07-10 at 21:41 +0200, Mike Galbraith wrote:
>
> > diff --git a/kernel/futex.c b/kernel/futex.c
> > index a6cec32..ef489f3 100644
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -2255,7 +2255,14 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
> > /* Queue the futex_q, drop the hb lock, wait for wakeup. */
> > futex_wait_queue_me(hb, &q, to);
> >
> > - spin_lock(&hb->lock);
> > + /*
> > + * Non-blocking synchronization point with futex_requeue().
> > + *
> > + * We dare not block here because this will alter PI state, possibly
> > + * before our waker finishes modifying same in wakeup_next_waiter().
> > + */
> > + while(!spin_trylock(&hb->lock))
> > + cpu_relax();
>
> I agree that this would work. But I wonder if this should have an:
>
> #ifdef PREEMPT_RT
> [...]
> #else
> spin_lock(&hb->lock);
> #endif
>
> around it. Or encapsulate this lock in a macro that does the same thing
> (just to keep the actual code cleaner)

Yeah, it should. I'll wait to see what Darren/others say about holding
the wakee's pi_lock across wakeup to plug it. If he submits something
along that line, I can bin this.

-Mike

2010-07-12 13:06:12

by Thomas Gleixner

Subject: Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t

On Fri, 9 Jul 2010, Darren Hart wrote:

> The requeue_pi mechanism introduced proxy locking of the rtmutex. This creates
> a scenario where a task can wake up, not knowing it has been enqueued on an
> rtmutex. In order to detect this, the task would have to be able to take
> task->pi_blocked_on->lock->wait_lock and/or the hb->lock. Unfortunately,
> without already holding one of these, the pi_blocked_on variable can change
> from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
> to take a sleeping lock after wakeup, or it could end up trying to block on two
> locks, the second overwriting a valid pi_blocked_on value. This obviously
> breaks the pi mechanism.
>
> This patch increases latency. While running the ltp pthread_cond_many test
> with which Mike reported the bug, I see double-digit microsecond hrtimer
> latencies (typically only on the first run after boot):
>
> kernel: hrtimer: interrupt took 75911 ns

Eewwww. There must be some more intelligent and less intrusive way to
detect this.

Thanks,

tglx

2010-07-12 19:11:11

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t

On 07/10/2010 12:41 PM, Mike Galbraith wrote:
> On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
>> The requeue_pi mechanism introduced proxy locking of the rtmutex. This creates
>> a scenario where a task can wake up, not knowing it has been enqueued on an
>> rtmutex. In order to detect this, the task would have to be able to take either
>> task->pi_blocked_on->lock->wait_lock and/or the hb->lock. Unfortunately,
>> without already holding one of these, the pi_blocked_on variable can change
>> from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
>> to take a sleeping lock after wakeup or it could end up trying to block on two
>> locks, the second overwriting a valid pi_blocked_on value. This obviously
>> breaks the pi mechanism.
>
> copy/paste offline query/reply at Darren's request...
>
> On Sat, 2010-07-10 at 10:26 -0700, Darren Hart wrote:
> On 07/09/2010 09:32 PM, Mike Galbraith wrote:
>>> On Fri, 2010-07-09 at 13:05 -0700, Darren Hart wrote:
>>>
>>>> The core of the problem is that the proxy_lock blocks a task on a lock
>>>> the task knows nothing about. So when it wakes up inside of
>>>> futex_wait_requeue_pi, it immediately tries to block on hb->lock to
>>>> check why it woke up. This has the potential to block the task on two
>>>> locks (thus overwriting the pi_blocked_on). Any attempt at preventing this
>>>> involves a lock, and ultimately the hb->lock. The only solution I see
>>>> is to make the hb->locks raw locks (thanks to Steven Rostedt for
>>>> original idea and batting this around with me in IRC).
>>>
>>> Hm, so wakee _was_ munging his own state after all.
>>>
>>> Out of curiosity, what's wrong with holding his pi_lock across the
>>> wakeup? He can _try_ to block, but can't until pi state is stable.
>>>
>>> I presume there's a big fat gotcha that's just not obvious to a futex
>>> locking newbie :)

Nor to some of us who have been engrossed in futexes for the last
couple of years! I discussed the pi_lock across the wakeup issue with
Thomas. While this fixes the problem for this particular failure case,
it doesn't protect against:

<tglx> assume the following:
<tglx> t1 is on the condvar
<tglx> t2 does the requeue dance and t1 is now blocked on the outer futex
<tglx> t3 takes hb->lock for a futex in the same bucket
<tglx> t1 wakes due to signal/timeout
<tglx> t1 blocks on hb->lock

You likely did not hit the above scenario because you only had
one condvar, so the hash_buckets were not heavily shared and you were
unlikely to hit:

<tglx> t3 takes hb->lock for a futex in the same bucket
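
For reference, that sharing comes straight from the futex hash: keys for
unrelated futexes can land in the same bucket. Roughly (paraphrasing
hash_futex() from kernel/futex.c of this era, exact constants aside):

        static struct futex_hash_bucket *hash_futex(union futex_key *key)
        {
                u32 hash = jhash2((u32 *)&key->both.word,
                                  (sizeof(key->both.word) + sizeof(key->both.ptr)) / 4,
                                  key->both.offset);
                /* many futexes hash into only 1 << FUTEX_HASHBITS buckets */
                return &futex_queues[hash & ((1 << FUTEX_HASHBITS) - 1)];
        }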


I'm going to roll up a patchset with your (Mike) spin_trylock patch and
run it through some tests. I'd still prefer a way to detect early wakeup
without having to grab hb->lock, but I haven't found it yet.

+       while (!spin_trylock(&hb->lock))
+               cpu_relax();
        ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
        spin_unlock(&hb->lock);

Thanks,

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-07-12 20:40:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t

On Mon, 12 Jul 2010, Darren Hart wrote:
> On 07/10/2010 12:41 PM, Mike Galbraith wrote:
> > On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
> > > > Out of curiosity, what's wrong with holding his pi_lock across the
> > > > wakeup? He can _try_ to block, but can't until pi state is stable.
> > > >
> > > > I presume there's a big fat gotcha that's just not obvious to a futex
> > > > locking newbie :)
>
> Nor to some of us who have been engrossed in futexes for the last couple
> of years! I discussed the pi_lock across the wakeup issue with Thomas. While this
> fixes the problem for this particular failure case, it doesn't protect
> against:
>
> <tglx> assume the following:
> <tglx> t1 is on the condvar
> <tglx> t2 does the requeue dance and t1 is now blocked on the outer futex
> <tglx> t3 takes hb->lock for a futex in the same bucket
> <tglx> t1 wakes due to signal/timeout
> <tglx> t1 blocks on hb->lock
>
> You likely did not hit the above scenario because you only had one
> condvar, so the hash_buckets were not heavily shared and you were unlikely to
> hit:
>
> <tglx> t3 takes hb->lock for a futex in the same bucket
>
>
> I'm going to roll up a patchset with your (Mike) spin_trylock patch and run it
> through some tests. I'd still prefer a way to detect early wakeup without
> having to grab hb->lock, but I haven't found it yet.
>
> +       while (!spin_trylock(&hb->lock))
> +               cpu_relax();
>         ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
>         spin_unlock(&hb->lock);

And this is nasty as it will create unbounded priority inversion :(

We discussed another solution on IRC in the meantime:

in futex_wait_requeue_pi():

        futex_wait_queue_me(hb, &q, to);

        raw_spin_lock(&current->pi_lock);
        if (current->pi_blocked_on) {
                /*
                 * We know that we can only be blocked on the outer futex,
                 * so we can skip the early wakeup check.
                 */
                raw_spin_unlock(&current->pi_lock);
                ret = 0;
        } else {
                current->pi_blocked_on = PI_WAKEUP_INPROGRESS;
                raw_spin_unlock(&current->pi_lock);

                spin_lock(&hb->lock);
                ret = handle_early_requeue_pi_wakeup();
                ....
                spin_unlock(&hb->lock);
        }

Now in the rtmutex magic we need in task_blocks_on_rt_mutex():

        raw_spin_lock(&task->pi_lock);

        /*
         * Add big fat comment why this is only relevant to futex
         * requeue_pi
         */

        if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
                raw_spin_unlock(&task->pi_lock);

                /*
                 * Returning 0 here is fine. The requeue code is just going to
                 * move the futex_q to the other bucket, but that'll be fixed
                 * up in handle_early_requeue_pi_wakeup().
                 */

                return 0;
        }
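
PI_WAKEUP_INPROGRESS is just a sentinel that must never alias a valid
waiter; a minimal sketch (the exact definition is up for grabs):

        /* never a valid rt_mutex_waiter pointer */
        #define PI_WAKEUP_INPROGRESS    ((struct rt_mutex_waiter *) 1)

Every dereference of ->pi_blocked_on then needs to check for the marker
before treating it as a real waiter.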

Thanks,

tglx

2010-07-12 20:43:46

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t



On Mon, 12 Jul 2010, Thomas Gleixner wrote:

> On Mon, 12 Jul 2010, Darren Hart wrote:
> > On 07/10/2010 12:41 PM, Mike Galbraith wrote:
> > > On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
> > > > > Out of curiosity, what's wrong with holding his pi_lock across the
> > > > > wakeup? He can _try_ to block, but can't until pi state is stable.
> > > > >
> > > > > I presume there's a big fat gotcha that's just not obvious to a futex
> > > > > locking newbie :)
> >
> > Nor to some of us who have been engrossed in futexes for the last couple
> > of years! I discussed the pi_lock across the wakeup issue with Thomas. While this
> > fixes the problem for this particular failure case, it doesn't protect
> > against:
> >
> > <tglx> assume the following:
> > <tglx> t1 is on the condvar
> > <tglx> t2 does the requeue dance and t1 is now blocked on the outer futex
> > <tglx> t3 takes hb->lock for a futex in the same bucket
> > <tglx> t1 wakes due to signal/timeout
> > <tglx> t1 blocks on hb->lock
> >
> > You likely did not hit the above scenario because you only had one
> > condvar, so the hash_buckets were not heavily shared and you were unlikely to
> > hit:
> >
> > <tglx> t3 takes hb->lock for a futex in the same bucket
> >
> >
> > I'm going to roll up a patchset with your (Mike) spin_trylock patch and run it
> > through some tests. I'd still prefer a way to detect early wakeup without
> > having to grab hb->lock, but I haven't found it yet.
> >
> > +       while (!spin_trylock(&hb->lock))
> > +               cpu_relax();
> >         ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
> >         spin_unlock(&hb->lock);
>
> And this is nasty as it will create unbounded priority inversion :(
>
> We discussed another solution on IRC in the meantime:
>
> in futex_wait_requeue_pi():
>
>         futex_wait_queue_me(hb, &q, to);
>
>         raw_spin_lock(&current->pi_lock);
>         if (current->pi_blocked_on) {
>                 /*
>                  * We know that we can only be blocked on the outer futex,
>                  * so we can skip the early wakeup check.
>                  */
>                 raw_spin_unlock(&current->pi_lock);
>                 ret = 0;
>         } else {
>                 current->pi_blocked_on = PI_WAKEUP_INPROGRESS;
>                 raw_spin_unlock(&current->pi_lock);
>
>                 spin_lock(&hb->lock);
>                 ret = handle_early_requeue_pi_wakeup();
>                 ....
>                 spin_unlock(&hb->lock);
>         }
>
> Now in the rtmutex magic we need in task_blocks_on_rt_mutex():
>
>         raw_spin_lock(&task->pi_lock);
>
>         /*
>          * Add big fat comment why this is only relevant to futex
>          * requeue_pi
>          */
>
>         if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
>                 raw_spin_unlock(&task->pi_lock);
>
>                 /*
>                  * Returning 0 here is fine. The requeue code is just going to
>                  * move the futex_q to the other bucket, but that'll be fixed
>                  * up in handle_early_requeue_pi_wakeup().
>                  */
>
>                 return 0;

We might also return a sensible error code here and just remove the
waiter from all queues; that would then need to be handled in
handle_early_requeue_pi_wakeup() after acquiring hb->lock.
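
I.e. something like this sketch in task_blocks_on_rt_mutex() (the error
value is illustrative only):

        if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
                raw_spin_unlock(&task->pi_lock);
                /* tell the requeue code to drop this waiter instead */
                return -EAGAIN;
        }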

Thanks,

tglx

2010-07-13 03:09:15

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t

On Mon, 2010-07-12 at 22:40 +0200, Thomas Gleixner wrote:
> On Mon, 12 Jul 2010, Darren Hart wrote:
> > On 07/10/2010 12:41 PM, Mike Galbraith wrote:
> > > On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
> > > > > Out of curiosity, what's wrong with holding his pi_lock across the
> > > > > wakeup? He can _try_ to block, but can't until pi state is stable.
> > > > >
> > > > > I presume there's a big fat gotcha that's just not obvious to a futex
> > > > > locking newbie :)
> >
> > Nor to some of us who have been engrossed in futexes for the last couple
> > of years! I discussed the pi_lock across the wakeup issue with Thomas. While this
> > fixes the problem for this particular failure case, it doesn't protect
> > against:
> >
> > <tglx> assume the following:
> > <tglx> t1 is on the condvar
> > <tglx> t2 does the requeue dance and t1 is now blocked on the outer futex
> > <tglx> t3 takes hb->lock for a futex in the same bucket
> > <tglx> t1 wakes due to signal/timeout
> > <tglx> t1 blocks on hb->lock
> >
> > You likely did not hit the above scenario because you only had one
> > condvar, so the hash_buckets were not heavily shared and you were unlikely to
> > hit:
> >
> > <tglx> t3 takes hb->lock for a futex in the same bucket
> >
> >
> > I'm going to roll up a patchset with your (Mike) spin_trylock patch and run it
> > through some tests. I'd still prefer a way to detect early wakeup without
> > having to grab hb->lock, but I haven't found it yet.
> >
> > +       while (!spin_trylock(&hb->lock))
> > +               cpu_relax();
> >         ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
> >         spin_unlock(&hb->lock);
>
> And this is nasty as it will create unbounded priority inversion :(

Oh ma gawd, _it's a train_ :>

> We discussed another solution on IRC in the meantime:
>
> in futex_wait_requeue_pi():
>
>         futex_wait_queue_me(hb, &q, to);
>
>         raw_spin_lock(&current->pi_lock);
>         if (current->pi_blocked_on) {
>                 /*
>                  * We know that we can only be blocked on the outer futex,
>                  * so we can skip the early wakeup check.
>                  */
>                 raw_spin_unlock(&current->pi_lock);
>                 ret = 0;
>         } else {
>                 current->pi_blocked_on = PI_WAKEUP_INPROGRESS;
>                 raw_spin_unlock(&current->pi_lock);
>
>                 spin_lock(&hb->lock);
>                 ret = handle_early_requeue_pi_wakeup();
>                 ....
>                 spin_unlock(&hb->lock);
>         }
>
> Now in the rtmutex magic we need in task_blocks_on_rt_mutex():
>
>         raw_spin_lock(&task->pi_lock);
>
>         /*
>          * Add big fat comment why this is only relevant to futex
>          * requeue_pi
>          */
>
>         if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
>                 raw_spin_unlock(&task->pi_lock);
>
>                 /*
>                  * Returning 0 here is fine. The requeue code is just going to
>                  * move the futex_q to the other bucket, but that'll be fixed
>                  * up in handle_early_requeue_pi_wakeup().
>                  */
>
>                 return 0;
>         }
>
> Thanks,
>
> tglx

2010-07-13 07:12:53

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t

On 07/12/2010 08:09 PM, Mike Galbraith wrote:
> On Mon, 2010-07-12 at 22:40 +0200, Thomas Gleixner wrote:
>> On Mon, 12 Jul 2010, Darren Hart wrote:
>>> On 07/10/2010 12:41 PM, Mike Galbraith wrote:
>>>> On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
>>>>>> Out of curiosity, what's wrong with holding his pi_lock across the
>>>>>> wakeup? He can _try_ to block, but can't until pi state is stable.
>>>>>>
>>>>>> I presume there's a big fat gotcha that's just not obvious to a futex
>>>>>> locking newbie :)
>>>
>>> Nor to some of us who have been engrossed in futexes for the last couple
>>> of years! I discussed the pi_lock across the wakeup issue with Thomas. While this
>>> fixes the problem for this particular failure case, it doesn't protect
>>> against:
>>>
>>> <tglx> assume the following:
>>> <tglx> t1 is on the condvar
>>> <tglx> t2 does the requeue dance and t1 is now blocked on the outer futex
>>> <tglx> t3 takes hb->lock for a futex in the same bucket
>>> <tglx> t1 wakes due to signal/timeout
>>> <tglx> t1 blocks on hb->lock
>>>
>>> You likely did not hit the above scenario because you only had one
>>> condvar, so the hash_buckets were not heavily shared and you were unlikely to
>>> hit:
>>>
>>> <tglx> t3 takes hb->lock for a futex in the same bucket
>>>
>>>
>>> I'm going to roll up a patchset with your (Mike) spin_trylock patch and run it
>>> through some tests. I'd still prefer a way to detect early wakeup without
>>> having to grab hb->lock, but I haven't found it yet.
>>>
>>> +       while (!spin_trylock(&hb->lock))
>>> +               cpu_relax();
>>>         ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
>>>         spin_unlock(&hb->lock);
>>
>> And this is nasty as it will create unbounded priority inversion :(
>
> Oh ma gawd, _it's a train_ :>

Seriously.

I have a fix. Cleaning it up as we speak; I still hope to send it out tonight.

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team