Message-ID: <49B07F87.2020808@us.ibm.com>
Date: Thu, 05 Mar 2009 17:42:31 -0800
From: Darren Hart
To: lkml <linux-kernel@vger.kernel.org>
CC: Thomas Gleixner, Steven Rostedt, Sripathi Kodi, John Stultz,
    Peter Zijlstra
Subject: Re: [TIP][RFC 6/7] futex: add requeue_pi calls
References: <49AC73A9.4040804@us.ibm.com> <49AC77D1.6090106@us.ibm.com>
 <49AE3386.1070604@us.ibm.com> <49B00302.5000607@us.ibm.com>
In-Reply-To: <49B00302.5000607@us.ibm.com>

Darren Hart wrote:
> Darren Hart wrote:
>> Darren Hart wrote:
>>> From: Darren Hart
>>>
>>> PI futexes must have an owner at all times, so the standard requeue
>>> commands aren't sufficient. The new commands properly manage pi futex
>>> ownership by ensuring a futex with waiters has an owner at all times.
>>> Once complete, these patches will allow glibc to properly handle pi
>>> mutexes with pthread_condvars.
>>>
>>> The approach taken here is to create two new futex op codes:
>>>
>>> FUTEX_WAIT_REQUEUE_PI:
>>> Threads will use this op code to wait on a futex (such as a non-pi
>>> waitqueue) and wake after they have been requeued to a pi futex.
>>> Prior to returning to userspace, they will take this pi futex (and
>>> the underlying rt_mutex).
>>>
>>> futex_wait_requeue_pi() is currently the result of a high speed
>>> collision between futex_wait() and futex_lock_pi() (with the first
>>> part of futex_lock_pi() being done by futex_requeue_pi_init() on
>>> behalf of the waiter).
>>>
>>> FUTEX_REQUEUE_PI:
>>> This call must be used to wake threads waiting with
>>> FUTEX_WAIT_REQUEUE_PI, regardless of how many threads the caller
>>> intends to wake or requeue. pthread_cond_broadcast() should call
>>> this with nr_wake=1 and nr_requeue=-1 (all); pthread_cond_signal()
>>> should call this with nr_wake=1 and nr_requeue=0. The reason is
>>> that both callers need the benefit of the futex_requeue_pi_init()
>>> routine, which prepares the top_waiter (the thread to be woken) to
>>> take possession of the pi futex by setting FUTEX_WAITERS and
>>> preparing the futex_q.pi_state. futex_requeue() also enqueues the
>>> top_waiter on the rt_mutex via rt_mutex_start_proxy_lock(). If
>>> pthread_cond_signal() used FUTEX_WAKE, we would have a similar race
>>> window where the caller can return and release the mutex before the
>>> waiters can fully wake, potentially leaving the rt_mutex with
>>> waiters but no owner.
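(Aside, for anyone following along: this is roughly the shape of the
calls glibc would make, reduced to raw syscalls. A sketch only - the
op-code values below are placeholders I picked for illustration, not
numbers assigned anywhere in this series, and all error handling is
omitted.)

#include <sys/syscall.h>
#include <unistd.h>

#define FUTEX_WAIT_REQUEUE_PI	11	/* placeholder value, illustration only */
#define FUTEX_REQUEUE_PI	12	/* placeholder value, illustration only */

/* Waiter: block on the non-pi condvar futex; on return we have been
 * requeued to, woken on, and now own the pi mutex futex. */
static long cond_wait_sketch(int *cond, int val, int *pi_mutex)
{
	return syscall(SYS_futex, cond, FUTEX_WAIT_REQUEUE_PI, val,
		       NULL /* timeout */, pi_mutex, 0);
}

/* pthread_cond_signal(): nr_wake=1, nr_requeue=0. */
static long cond_signal_sketch(int *cond, int *pi_mutex)
{
	return syscall(SYS_futex, cond, FUTEX_REQUEUE_PI, 1 /* nr_wake */,
		       0L /* nr_requeue */, pi_mutex, 0);
}

/* pthread_cond_broadcast(): nr_wake=1, nr_requeue=-1 (all). */
static long cond_broadcast_sketch(int *cond, int *pi_mutex)
{
	return syscall(SYS_futex, cond, FUTEX_REQUEUE_PI, 1 /* nr_wake */,
		       -1L /* nr_requeue */, pi_mutex, 0);
}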
>>> We hit a failed paging request running the testcase (7/7) in a loop
>>> (it only takes a few minutes at most to hit on my 8-way x86_64 test
>>> machine). It appears to be the result of splitting
>>> rt_mutex_slowlock() across two execution contexts by means of
>>> rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(). The
>>> former calls task_blocks_on_rt_mutex() on behalf of the waiting task
>>> prior to requeuing and waking it by the requeueing thread. The
>>> latter is executed upon wakeup by the waiting thread, which somehow
>>> manages to call the new __rt_mutex_slowlock() with
>>> waiter->task != NULL and still succeed with try_to_take_rt_mutex();
>>> this leads to corruption of the plists and an eventual failed paging
>>> request. See 7/7 for the rather crude testcase that causes this. Any
>>> tips on where this race might be occurring are welcome.
>
> I've updated my tracing and can show that rt_mutex_start_proxy_lock()
> is not setting RT_MUTEX_HAS_WAITERS like it should be:
>
> ------------[ cut here ]------------
> kernel BUG at kernel/rtmutex.c:646!
> invalid opcode: 0000 [#1] PREEMPT SMP
> last sysfs file:
> /sys/devices/pci0000:00/0000:00:03.0/0000:01:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/vendor
> Dumping ftrace buffer:
> ---------------------------------
> <...>-3793 1d..3 558351872us : lookup_pi_state: allocating a new pi state
> <...>-3793 1d..3 558351876us : lookup_pi_state: initial rt_mutex owner: ffff88023d9486c0
> <...>-3793 1...2 558351877us : futex_requeue: futex_lock_pi_atomic returned: 0
> <...>-3793 1...2 558351877us : futex_requeue: futex_requeue_pi_init returned: 0
> <...>-3793 1...3 558351879us : rt_mutex_start_proxy_lock: task_blocks_on_rt_mutex returned 0
> <...>-3793 1...3 558351880us : rt_mutex_start_proxy_lock: lock has waiterflag: 0
> <...>-3793 1...1 558351888us : rt_mutex_unlock: unlocking ffff88023b5f6950
> <...>-3793 1...1 558351888us : rt_mutex_unlock: lock waiter flag: 0
> <...>-3793 1...1 558351889us : rt_mutex_unlock: unlocked ffff88023b5f6950
> <...>-3783 0...1 558351893us : __rt_mutex_slowlock: waiter->task is ffff88023c872440
> <...>-3783 0...1 558351897us : try_to_take_rt_mutex: assigned rt_mutex (ffff88023b5f6950) owner to current ffff88023c872440
> <...>-3783 0...1 558351897us : __rt_mutex_slowlock: got the lock
> ---------------------------------
>
> I'll start digging into why that's happening, but I wanted to share
> the trace output.

As it turns out, I missed setting RT_MUTEX_HAS_WAITERS on the rt_mutex
in rt_mutex_start_proxy_lock() - it seems awfully silly in retrospect,
but it was a little non-obvious while writing it. I added
mark_rt_mutex_waiters() after the call to task_blocks_on_rt_mutex(), and
the test has now completed more than 400 iterations successfully (it
would fail after no more than 2 most of the time before).

Steven, there are several ways to set RT_MUTEX_HAS_WAITERS, but this
seemed like a reasonable approach - would you agree? Since I'm holding
the wait_lock, I don't technically need the atomic cmpxchg and could
probably just set it explicitly - do you have a preference?
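For reference, the two options look roughly like this (a sketch of the
rtmutex internals, not part of the patch below; the locked-store variant
and its name are mine, and whether a plain store is actually safe
against the lockless fastpath is exactly the question above):

/* 1) The existing helper: cmpxchg loop on lock->owner, whose low bit
 *    doubles as RT_MUTEX_HAS_WAITERS. Safe even without wait_lock held. */
static void mark_rt_mutex_waiters(struct rt_mutex *lock)
{
	unsigned long owner, *p = (unsigned long *) &lock->owner;

	do {
		owner = *p;
	} while (cmpxchg(p, owner, owner | RT_MUTEX_HAS_WAITERS) != owner);
}

/* 2) A hypothetical plain store, relying on lock->wait_lock being held: */
static void set_rt_mutex_waiters_locked(struct rt_mutex *lock)
{
	lock->owner = (struct task_struct *)
			((unsigned long)lock->owner | RT_MUTEX_HAS_WAITERS);
}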
RFC: rt_mutex: add proxy lock routines

From: Darren Hart

This patch is required for the first half of requeue_pi to function. It
basically splits rt_mutex_slowlock() right down the middle, just before
the first call to schedule().

This patch uses a new futex_q field, rt_waiter, for now. I think I
should be able to use task->pi_blocked_on in a future version of this
patch.
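To make the intended pairing concrete, here is a rough sketch (not code
from this patch - the real callers are on the futex side of the series,
and the function names and arguments here are placeholders) of how the
two halves of the old slowlock line up across the two contexts:

/* Requeueing task (FUTEX_REQUEUE_PI): first half of the old slowlock -
 * block @task on the pi rt_mutex on its behalf, without scheduling. */
static int requeue_side_sketch(struct rt_mutex *pi_mutex,
			       struct rt_mutex_waiter *rt_waiter,
			       struct task_struct *task)
{
	int ret = rt_mutex_start_proxy_lock(pi_mutex, rt_waiter, task, 0);
	/* ... now requeue/wake @task; it completes the acquisition itself. */
	return ret;
}

/* Woken waiter (FUTEX_WAIT_REQUEUE_PI): second half of the old slowlock -
 * the wait-wake-try-to-take loop, then waiter cleanup. */
static int waiter_side_sketch(struct rt_mutex *pi_mutex,
			      struct hrtimer_sleeper *to,
			      struct rt_mutex_waiter *rt_waiter)
{
	return rt_mutex_finish_proxy_lock(pi_mutex, to, rt_waiter, 0);
}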
V6: -add mark_rt_mutex_waiters() to rt_mutex_start_proxy_lock() to
     avoid the race condition evident in previous versions
V5: -remove EXPORT_SYMBOL_GPL from the new routines
    -minor cleanups
V4: -made detect_deadlock a parameter to rt_mutex_enqueue_task
    -refactored rt_mutex_slowlock to share code with new functions
    -renamed rt_mutex_enqueue_task and rt_mutex_handle_wakeup to
     rt_mutex_start_proxy_lock and rt_mutex_finish_proxy_lock,
     respectively

Signed-off-by: Darren Hart
Cc: Thomas Gleixner
Cc: Sripathi Kodi
Cc: Peter Zijlstra
Cc: John Stultz
Cc: Steven Rostedt
---

 kernel/rtmutex.c        |  193 +++++++++++++++++++++++++++++++++++++----------
 kernel/rtmutex_common.h |    8 ++
 2 files changed, 161 insertions(+), 40 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 69d9cb9..f438362 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -411,6 +411,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock)
  */
 static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 				   struct rt_mutex_waiter *waiter,
+				   struct task_struct *task,
 				   int detect_deadlock)
 {
 	struct task_struct *owner = rt_mutex_owner(lock);
@@ -418,21 +419,21 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	unsigned long flags;
 	int chain_walk = 0, res;
 
-	spin_lock_irqsave(&current->pi_lock, flags);
-	__rt_mutex_adjust_prio(current);
-	waiter->task = current;
+	spin_lock_irqsave(&task->pi_lock, flags);
+	__rt_mutex_adjust_prio(task);
+	waiter->task = task;
 	waiter->lock = lock;
-	plist_node_init(&waiter->list_entry, current->prio);
-	plist_node_init(&waiter->pi_list_entry, current->prio);
+	plist_node_init(&waiter->list_entry, task->prio);
+	plist_node_init(&waiter->pi_list_entry, task->prio);
 
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
 		top_waiter = rt_mutex_top_waiter(lock);
 	plist_add(&waiter->list_entry, &lock->wait_list);
 
-	current->pi_blocked_on = waiter;
+	task->pi_blocked_on = waiter;
 
-	spin_unlock_irqrestore(&current->pi_lock, flags);
+	spin_unlock_irqrestore(&task->pi_lock, flags);
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		spin_lock_irqsave(&owner->pi_lock, flags);
@@ -460,7 +461,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	spin_unlock(&lock->wait_lock);
 
 	res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter,
-					 current);
+					 task);
 
 	spin_lock(&lock->wait_lock);
 
@@ -605,37 +606,25 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 	rt_mutex_adjust_prio_chain(task, 0, NULL, NULL, task);
 }
 
-/*
- * Slow path lock function:
+/**
+ * __rt_mutex_slowlock - perform the wait-wake-try-to-take loop
+ * @lock:	the rt_mutex to take
+ * @state:	the state the task should block in (TASK_INTERRUPTIBLE
+ *		or TASK_UNINTERRUPTIBLE)
+ * @timeout:	the pre-initialized and started timer, or NULL for none
+ * @waiter:	the pre-initialized rt_mutex_waiter
+ * @detect_deadlock: passed to task_blocks_on_rt_mutex
+ *
+ * lock->wait_lock must be held by the caller.
  */
 static int __sched
-rt_mutex_slowlock(struct rt_mutex *lock, int state,
-		  struct hrtimer_sleeper *timeout,
-		  int detect_deadlock)
+__rt_mutex_slowlock(struct rt_mutex *lock, int state,
+		    struct hrtimer_sleeper *timeout,
+		    struct rt_mutex_waiter *waiter,
+		    int detect_deadlock)
 {
-	struct rt_mutex_waiter waiter;
 	int ret = 0;
 
-	debug_rt_mutex_init_waiter(&waiter);
-	waiter.task = NULL;
-
-	spin_lock(&lock->wait_lock);
-
-	/* Try to acquire the lock again: */
-	if (try_to_take_rt_mutex(lock)) {
-		spin_unlock(&lock->wait_lock);
-		return 0;
-	}
-
-	set_current_state(state);
-
-	/* Setup the timer, when timeout != NULL */
-	if (unlikely(timeout)) {
-		hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
-		if (!hrtimer_active(&timeout->timer))
-			timeout->task = NULL;
-	}
-
 	for (;;) {
 		/* Try to acquire the lock: */
 		if (try_to_take_rt_mutex(lock))
@@ -656,19 +645,19 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 		}
 
 		/*
-		 * waiter.task is NULL the first time we come here and
+		 * waiter->task is NULL the first time we come here and
 		 * when we have been woken up by the previous owner
 		 * but the lock got stolen by a higher prio task.
 		 */
-		if (!waiter.task) {
-			ret = task_blocks_on_rt_mutex(lock, &waiter,
+		if (!waiter->task) {
+			ret = task_blocks_on_rt_mutex(lock, waiter, current,
 						      detect_deadlock);
 			/*
 			 * If we got woken up by the owner then start loop
 			 * all over without going into schedule to try
 			 * to get the lock now:
 			 */
-			if (unlikely(!waiter.task)) {
+			if (unlikely(!waiter->task)) {
 				/*
 				 * Reset the return value. We might
 				 * have returned with -EDEADLK and the
@@ -684,15 +673,52 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 
 		spin_unlock(&lock->wait_lock);
 
-		debug_rt_mutex_print_deadlock(&waiter);
+		debug_rt_mutex_print_deadlock(waiter);
 
-		if (waiter.task)
+		if (waiter->task)
 			schedule_rt_mutex(lock);
 
 		spin_lock(&lock->wait_lock);
 		set_current_state(state);
 	}
 
+	return ret;
+}
+
+/*
+ * Slow path lock function:
+ */
+static int __sched
+rt_mutex_slowlock(struct rt_mutex *lock, int state,
+		  struct hrtimer_sleeper *timeout,
+		  int detect_deadlock)
+{
+	struct rt_mutex_waiter waiter;
+	int ret = 0;
+
+	debug_rt_mutex_init_waiter(&waiter);
+	waiter.task = NULL;
+
+	spin_lock(&lock->wait_lock);
+
+	/* Try to acquire the lock again: */
+	if (try_to_take_rt_mutex(lock)) {
+		spin_unlock(&lock->wait_lock);
+		return 0;
+	}
+
+	set_current_state(state);
+
+	/* Setup the timer, when timeout != NULL */
+	if (unlikely(timeout)) {
+		hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
+		if (!hrtimer_active(&timeout->timer))
+			timeout->task = NULL;
+	}
+
+	ret = __rt_mutex_slowlock(lock, state, timeout, &waiter,
+				  detect_deadlock);
+
 	set_current_state(TASK_RUNNING);
 
 	if (unlikely(waiter.task))
@@ -986,6 +1012,42 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 }
 
 /**
+ * rt_mutex_start_proxy_lock - prepare another task to take the lock
+ *
+ * @lock: the rt_mutex to take
+ * @waiter: the rt_mutex_waiter initialized by the waiter
+ * @task: the task to prepare
+ * @detect_deadlock: passed to task_blocks_on_rt_mutex
+ *
+ * The lock should have an owner, and it should not be task.
+ * Special API call for FUTEX_REQUEUE_PI support.
+ */
+int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+			      struct rt_mutex_waiter *waiter,
+			      struct task_struct *task, int detect_deadlock)
+{
+	int ret;
+
+	spin_lock(&lock->wait_lock);
+	ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock);
+	mark_rt_mutex_waiters(lock);
+	if (ret && !waiter->task) {
+		/*
+		 * Reset the return value. We might have
+		 * returned with -EDEADLK and the owner
+		 * released the lock while we were walking the
+		 * pi chain. Let the waiter sort it out.
+		 */
+		ret = 0;
+	}
+	spin_unlock(&lock->wait_lock);
+
+	debug_rt_mutex_print_deadlock(waiter);
+
+	return ret;
+}
+
+/**
  * rt_mutex_next_owner - return the next owner of the lock
  *
  * @lock: the rt lock to query
@@ -1004,3 +1066,54 @@ struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock)
 
 	return rt_mutex_top_waiter(lock)->task;
 }
+
+/**
+ * rt_mutex_finish_proxy_lock - complete the taking of the lock initialized
+ *				on our behalf by another thread
+ * @lock: the rt_mutex we were woken on
+ * @to: the timeout, NULL if none. The hrtimer should already have been started.
+ * @waiter: the pre-initialized rt_mutex_waiter
+ * @detect_deadlock: for use by __rt_mutex_slowlock
+ *
+ * Special API call for PI-futex requeue support.
+ */
+int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
+			       struct hrtimer_sleeper *to,
+			       struct rt_mutex_waiter *waiter,
+			       int detect_deadlock)
+{
+	int ret;
+
+	if (waiter->task)
+		schedule_rt_mutex(lock);
+
+	spin_lock(&lock->wait_lock);
+
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter,
+				  detect_deadlock);
+
+	set_current_state(TASK_RUNNING);
+
+	if (unlikely(waiter->task))
+		remove_waiter(lock, waiter);
+
+	/*
+	 * try_to_take_rt_mutex() sets the waiter bit unconditionally. We
+	 * might have to fix that up.
+	 */
+	fixup_rt_mutex_waiters(lock);
+
+	spin_unlock(&lock->wait_lock);
+
+	/*
+	 * Readjust priority, when we did not get the lock. We might have
+	 * been the pending owner and boosted. Since we did not take the
+	 * lock, the PI boost has to go.
+	 */
+	if (unlikely(ret))
+		rt_mutex_adjust_prio(current);
+
+	return ret;
+}
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index e124bf5..97a2f81 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -120,6 +120,14 @@ extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
 				       struct task_struct *proxy_owner);
 extern void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 				  struct task_struct *proxy_owner);
+extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+				     struct rt_mutex_waiter *waiter,
+				     struct task_struct *task,
+				     int detect_deadlock);
+extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
+				      struct hrtimer_sleeper *to,
+				      struct rt_mutex_waiter *waiter,
+				      int detect_deadlock);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # include "rtmutex-debug.h"

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team