Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755671AbZIVJK7 (ORCPT ); Tue, 22 Sep 2009 05:10:59 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1755563AbZIVJK5 (ORCPT ); Tue, 22 Sep 2009 05:10:57 -0400 Received: from mx3.mail.elte.hu ([157.181.1.138]:58812 "EHLO mx3.mail.elte.hu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755521AbZIVJK4 (ORCPT ); Tue, 22 Sep 2009 05:10:56 -0400 Date: Tue, 22 Sep 2009 11:10:45 +0200 From: Ingo Molnar To: Eric Dumazet Cc: Darren Hart , linux-kernel@vger.kernel.org, Thomas Gleixner , Peter Zijlstra , Steven Rostedt , Dinakar Guniguntala , John Stultz Subject: Re: [PATCH 5/5] futex: fix wakeup race by setting TASK_INTERRUPTIBLE before queue_me Message-ID: <20090922091045.GB7755@elte.hu> References: <20090922052452.8717.39673.stgit@Aeon> <20090922053038.8717.97838.stgit@Aeon> <4AB88D09.3080907@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4AB88D09.3080907@gmail.com> User-Agent: Mutt/1.5.18 (2008-05-17) X-ELTE-SpamScore: -1.5 X-ELTE-SpamLevel: X-ELTE-SpamCheck: no X-ELTE-SpamVersion: ELTE 2.0 X-ELTE-SpamCheck-Details: score=-1.5 required=5.9 tests=BAYES_00 autolearn=no SpamAssassin version=3.2.5 -1.5 BAYES_00 BODY: Bayesian spam probability is 0 to 1% [score: 0.0000] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3378 Lines: 88 * Eric Dumazet wrote: > Darren Hart a ??crit : > > PI futexes do not use the same plist_node_empty() test for wakeup. It was > > possible for the waiter (in futex_wait_requeue_pi()) to set TASK_INTERRUPTIBLE > > after the waker assigned the rtmutex to the waiter. The waiter would then note > > the plist was not empty and call schedule(). The task would not be found by any > > subsequeuent futex wakeups, resulting in a userspace hang. By moving the > > setting of TASK_INTERRUPTIBLE to before the call to queue_me(), the race with > > the waker is eliminated. Since we no longer call get_user() from within > > queue_me(), there is no need to delay the setting of TASK_INTERRUPTIBLE until > > after the call to queue_me(). > > > > The FUTEX_LOCK_PI operation is not affected as futex_lock_pi() relies entirely > > on the rtmutex code to handle schedule() and wakeup. The requeue PI code is > > affected because the waiter starts as a non-PI waiter and is woken on a PI > > futex. > > > > Remove the crusty old comment about holding spinlocks() across get_user() as we > > no longer do that. Correct the locking statement with a description of why the > > test is performed. > > I am very confused by this ChangeLog... > > > > > Signed-off-by: Darren Hart > > Cc: Thomas Gleixner > > Cc: Peter Zijlstra > > Cc: Steven Rostedt > > Cc: Ingo Molnar > > CC: Eric Dumazet > > CC: Dinakar Guniguntala > > CC: John Stultz > > --- > > > > kernel/futex.c | 15 +++------------ > > 1 files changed, 3 insertions(+), 12 deletions(-) > > > > diff --git a/kernel/futex.c b/kernel/futex.c > > index f92afbe..463af2e 100644 > > --- a/kernel/futex.c > > +++ b/kernel/futex.c > > @@ -1656,17 +1656,8 @@ out: > > static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q, > > struct hrtimer_sleeper *timeout) > > { > > - queue_me(q, hb); > > - > > - /* > > - * There might have been scheduling since the queue_me(), as we > > - * cannot hold a spinlock across the get_user() in case it > > - * faults, and we cannot just set TASK_INTERRUPTIBLE state when > > - * queueing ourselves into the futex hash. This code thus has to > > - * rely on the futex_wake() code removing us from hash when it > > - * wakes us up. > > - */ > > set_current_state(TASK_INTERRUPTIBLE); > > Hmm, you missed the smp_mb() properties here... > > Before : > queue_me() > set_mb(current->state, TASK_INTERRUPTIBLE); > if (timeout) {...} > if (likely(!plist_node_empty(&q->list))) { > ... > } > > After : > set_mb(current->state, TASK_INTERRUPTIBLE); > queue_me(); > if (timeout) {...} > // no barrier... why ar we still testing q->list > // since it has no synchro between queue_me() and test ? > if (likely(!plist_node_empty(&q->list))) { > ... > } queue_me() itself does a spin_unlock(), so at least for the bits protected by hb->lock it should be half-serializing. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/