2014-04-08 08:47:35

by Jan Stancek

Subject: [PATCH] futex: avoid race between requeue and wake

The pthread_cond_broadcast/4-1.c testcase from the openposix testsuite
(LTP) occasionally fails because some threads fail to wake up.

The testcase creates 5 threads, which are all waiting on the same
condition. The main thread then calls pthread_cond_broadcast() without
holding the mutex, which calls:

    futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..)

This immediately wakes up a single thread A, which unlocks the mutex and
tries to wake up another thread:

    futex(uaddr2, FUTEX_WAKE_PRIVATE, 1)

If thread A manages to call futex_wake() before any waiters are requeued
for uaddr2, no other thread is woken up.
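
For illustration, here is a minimal sketch of the failing pattern (a
simplified stand-in for the LTP testcase and the pcb.c reproducer
mentioned later in this thread, not the actual 4-1.c file; on a kernel
with the race, a lost wakeup shows up as a hang in pthread_join()):

#include <pthread.h>
#include <unistd.h>

#define NTHREADS 5

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int waiting;

static void *waiter(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	waiting++;
	/* blocks on the condvar futex (uaddr1); the broadcast requeues
	 * us to the mutex futex (uaddr2) */
	pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t th[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&th[i], NULL, waiter, NULL);

	for (;;) {	/* wait until all threads have entered pthread_cond_wait() */
		pthread_mutex_lock(&lock);
		if (waiting == NTHREADS) {
			pthread_mutex_unlock(&lock);
			break;
		}
		pthread_mutex_unlock(&lock);
		usleep(1000);
	}

	pthread_cond_broadcast(&cond);	/* deliberately without the mutex held */

	for (i = 0; i < NTHREADS; i++)	/* hangs here if a wakeup is lost */
		pthread_join(th[i], NULL);
	return 0;
}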

This patch re-introduces the check removed by:
commit 11d4616bd07f38d496bd489ed8fad1dc4d928823
futex: revert back to the explicit waiter counting code

Taking hb->lock in this situation will ensure that thread A has to wait
in futex_wake() until the main thread finishes the requeue operation.
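
For context, hb_waiters_pending() is the early-exit check at the top of
futex_wake(); roughly (paraphrased from kernel/futex.c, so treat it as
a sketch):

	hb = hash_futex(&key);

	/* Make sure we really have tasks to wakeup */
	if (!hb_waiters_pending(hb))
		goto out_put_key;

	spin_lock(&hb->lock);

With the re-introduced check, a requeue in progress (hb->lock held)
makes hb_waiters_pending() return 1, so futex_wake() falls through to
spin_lock() and serializes against the requeue instead of bailing out
early.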

Signed-off-by: Jan Stancek <[email protected]>
---
kernel/futex.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 67dacaf..5163899 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -284,7 +284,10 @@ static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
 static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
 {
 #ifdef CONFIG_SMP
-	return atomic_read(&hb->waiters);
+	if (spin_is_locked(&hb->lock))
+		return 1;
+	else
+		return atomic_read(&hb->waiters);
 #else
 	return 1;
 #endif
--
1.7.1


2014-04-08 16:20:44

by Linus Torvalds

Subject: Re: [PATCH] futex: avoid race between requeue and wake

Davidlohr, comments?

On Tue, Apr 8, 2014 at 1:47 AM, Jan Stancek <[email protected]> wrote:
> The pthread_cond_broadcast/4-1.c testcase from the openposix testsuite
> (LTP) occasionally fails because some threads fail to wake up.

Jan, I _assume_ this is on x86(-64), but can you please confirm?

Because if it's on anything else, the whole situation changes.

> Taking hb->lock in this situation will ensure that thread A has to wait
> in futex_wake() until the main thread finishes the requeue operation.

So the argument was that doing *both* spin_is_locked() and
atomic_read(&hb->waiters) _should_ be unnecessary, because
hb_waiters_inc() is done *before* getting the spinlock.

However, one exception to this is "requeue_futex()". Which is in fact
the test-case that Jan points to. There, when we move a futex from one
hash bucket to another, we do the increment inside the spinlock.

So I think the change is correct, although the commit message might
need a bit of improvement. I also hate that "if/else" thing, since
there's no point in an "else" if the if-statement did a "return". So
either make it just

	if (spin_is_locked(&hb->lock))
		return 1;
	return atomic_read(&hb->waiters);

or (perhaps preferably) make it

return spin_is_locked(&hb->lock) || atomic_read(&hb->waiters);

but that "if/else" just makes me go "why?".

But I'd also like to have Davidlohr look at this, because I have a few
questions:

- how did this never show up in the original loads? No requeueing
under those test-loads?

- would we be better off incrementing the waiter count at the top of
futex_requeue(), at the retry_private label?

That would make us follow the "has to be incremented before taking the
lock" rule, but at the expense of making the error case handling more
complex. Although maybe we could do it as part of
"double_lock/unlock_hb()" and just mark both hb1/hb2 busy?

Linus

2014-04-08 17:33:57

by Linus Torvalds

Subject: Re: [PATCH] futex: avoid race between requeue and wake

On Tue, Apr 8, 2014 at 9:20 AM, Linus Torvalds
<[email protected]> wrote:
>
> However, one exception to this is "requeue_futex()". Which is in fact
> the test-case that Jan points to. There, when we move a futex from one
> hash bucket to another, we do the increment inside the spinlock.
>
> So I think the change is correct, although the commit message might
> need a bit of improvement.

Hmm. And let's think about that powerpc race, where "spin_is_locked()"
can be false when somebody is waiting to get the lock..

We're good on the *normal* paths, exactly because we do that
hb_waiters_inc() before getting the lock (which also forces a memory
barrier), so the spin_is_locked() test is unnecessary and we have no
race.
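
For reference, the increment side looks roughly like this (paraphrased
from kernel/futex.c as of commit 11d4616bd07f, so treat it as a sketch;
the point is the full barrier directly after the atomic increment):

static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
{
#ifdef CONFIG_SMP
	atomic_inc(&hb->waiters);
	/*
	 * Full barrier (A): the waiters increment must be visible
	 * before the futex value is read.
	 */
	smp_mb__after_atomic_inc();
#endif
}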

But now Jan's patch is re-introducing spin_is_locked(), so let's think
about what that means for the requeue_futex() case. Especially since
the two accesses in

spin_is_locked(&hb->lock) || atomic_read(&hb->waiters)

are not ordered in any way wrt each other (they are both just
unconstrained reads). How does this all work? This is making me
nervous.

Moving the hb_waiters_inc() to before taking the spinlock (inside
double_lock_hb() or whatever) would make me more comfortable wrt this
whole situation, if only because now the rules would be the same for
that requeue case as for adding the futex in the first place.

But maybe somebody can just show that Jan's simpler patch is
sufficient for that case for some fundamental reason that I
overlooked. In which case I would just want a comment.

It's a day for misquoting Star Wars: "Help me, Davidlohr Bueso. You
are my only hope"

Although comments from others would be good too. PeterZ? He wasn't
originally cc'd, but was involved with the original ordering
discussion. Adding..

Linus

2014-04-08 18:01:04

by Jan Stancek

Subject: Re: [PATCH] futex: avoid race between requeue and wake

----- Original Message -----
> From: "Linus Torvalds" <[email protected]>
> To: "Jan Stancek" <[email protected]>
> Cc: "Linux Kernel Mailing List" <[email protected]>, "Srikar Dronamraju" <[email protected]>,
> "Davidlohr Bueso" <[email protected]>, "Ingo Molnar" <[email protected]>, "Larry Woodman" <[email protected]>
> Sent: Tuesday, 8 April, 2014 6:20:42 PM
> Subject: Re: [PATCH] futex: avoid race between requeue and wake
>
> Davidlohr, comments?
>
> On Tue, Apr 8, 2014 at 1:47 AM, Jan Stancek <[email protected]> wrote:
> > The pthread_cond_broadcast/4-1.c testcase from the openposix testsuite
> > (LTP) occasionally fails because some threads fail to wake up.
>
> Jan, I _assume_ this is on x86(-64), but can you please confirm?

I was able to reproduce this on x86-64 and s390x. I mostly used an
s390x system with 2 CPUs, because I could reproduce it there
a lot faster (usually in seconds).

My reproducer is a modified pthread_cond_broadcast/4-1.c testcase,
with some sleeps removed/shortened:
http://jan.stancek.eu/tmp/futex_race/pcb.c
gcc pcb.c -o pcb -lpthread
I'm running it in a loop until retcode != 0.

I also experimented with this patch when trying to make it
more visible (so I could capture some strace logs):

diff --git a/kernel/futex.c b/kernel/futex.c
index 5ff12f5..290a67f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -47,6 +47,7 @@
 #include <linux/slab.h>
 #include <linux/poll.h>
 #include <linux/fs.h>
+#include <linux/delay.h>
 #include <linux/file.h>
 #include <linux/jhash.h>
 #include <linux/init.h>
@@ -1554,6 +1555,7 @@ retry_private:
 	 */
 	if (++task_count <= nr_wake && !requeue_pi) {
 		wake_futex(this);
+		udelay(10000);
 		continue;
 	}

>
> Because if it's on anything else, the whole situation changes.
>
> > Taking hb->lock in this situation will ensure that thread A has to wait
> > in futex_wake() until the main thread finishes the requeue operation.
>
> So the argument was that doing *both* spin_is_locked() and
> atomic_read(&hb->waiters) _should_ be unnecessary, because
> hb_waiters_inc() is done *before* getting the spinlock.
>
> However, one exception to this is "requeue_futex()". Which is in fact
> the test-case that Jan points to. There, when we move a futex from one
> hash bucket to another, we do the increment inside the spinlock.
>
> So I think the change is correct, although the commit message might
> need a bit of improvement. I also hate that "if/else" thing, since
> there's no point in an "else" if the if-statement did a "return". So
> either make it just
>
> 	if (spin_is_locked(&hb->lock))
> 		return 1;
> 	return atomic_read(&hb->waiters);
>
> or (perhaps preferably) make it
>
> return spin_is_locked(&hb->lock) || atomic_read(&hb->waiters);
>
> but that "if/else" just makes me go "why?".

Understood, I can submit a v2 if Davidlohr is also in favor.
If not, I'm happy to test a different patch proposal with my setup.

Regards,
Jan

>
> But I'd also like to have Davidlohr look at this, because I have a few
> questions:
>
> - how did this never show up in the original loads? No requeueing
> under those test-loads?
>
> - would we be better off incrementing the waiter count at the top of
> futex_requeue(), at the retry_private label?
>
> That would make us follow the "has to be incremented before taking the
> lock" rule, but at the expense of making the error case handling more
> complex. Although maybe we could do it as part of
> "double_lock/unlock_hb()" and just mark both hb1/hb2 busy?
>
> Linus
>

2014-04-08 18:13:16

by Peter Zijlstra

Subject: Re: [PATCH] futex: avoid race between requeue and wake

On Tue, Apr 08, 2014 at 10:33:55AM -0700, Linus Torvalds wrote:
> Hmm. And let's think about that powerpc race, where "spin_is_locked()"
> can be false when somebody is waiting to get the lock..

Right; so in the original discussion I never really got to why that is a
problem. A pending waiter cannot modify the wait list, so either we see
the current holder, or we see a stable list.

So I had to hop on a plane while that discussion was happening, and
while I did briefly talk to Davidlohr in Napa, his explanation didn't
stick in my jet-lag addled brain.

Afterwards the discussion got lost in the pile of unread mail.

I'll go re-read that discussion and look up this thread to see if
things will start to make sense.

2014-04-08 18:53:53

by Linus Torvalds

Subject: Re: [PATCH] futex: avoid race between requeue and wake

On Tue, Apr 8, 2014 at 11:13 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Apr 08, 2014 at 10:33:55AM -0700, Linus Torvalds wrote:
>> Hmm. And let's think about that powerpc race, where "spin_is_locked()"
>> can be false when somebody is waiting to get the lock..
>
> Right; so in the original discussion I never really got to why that is a
> problem. A pending waiter cannot modify the wait list, so either we see
> the current holder, or we see a stable list.

It's not just the list vs lock, it's also the ordering wrt the actual
futex value accesses. I have to say, the "waiters" approach makes
things a lot simpler, because then the main ordering issues are just
between "waiters and futex value".

The spinlock+waitqueue thing is rather more complex, because the
spinlock is gotten before reading the futex value (on the waiting
side), but *after* writing the futex value (on the waking side). While
the waitqueue is then modified *after* reading the futex value (on
both the waiting and waking side).
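
As a rough sketch of the two orderings being contrasted (waiting side on
the left, waking side on the right; time flows downward):

   waiter (futex_wait)                   waker (futex_wake)
   -------------------                   ------------------
                                         *futex = newval;   (userspace)
   lock(hb->lock);                       sys_futex(WAKE);
   uval = *futex;                          lock(hb->lock);
   if (uval == val)                        wake from waitqueue;
     enqueue on waitqueue;                 unlock(hb->lock);
   unlock(hb->lock);

That is, the waiting side takes the lock before its futex read, the
waking side takes it only after its futex write, and both sides touch
the waitqueue after their respective futex accesses.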

Now, on x86, my argument for using the "spinlock + wait queue"
combination was that getting the spinlock actually has the exact same
semantics as the atomic waiters update (because it really *is* an
atomic add), and while the spin-unlock destroys it, we can largely
ignore that because by that time we'd have been guaranteed to see the
wait-queue change too.

But even on x86, I was a *bit* worried about the ordering of the
"spin_is_locked()" read and the waitqueue value read. I thought about
it, and the smp_rmb() we put in place should have been safe, but I
have to admit that I was a bit nervous. It gets really hard to think
about these things.

So while I still think it should all work, I have to admit that I
didn't really mind horribly much going back to the explicit "waiters"
model that had a lot simpler ordering: just between the "waiters"
accesses and the "futex value" accesses, and without that combination
of "before and after" thing.

Which is also why I think I'd prefer to make the requeueing case have
the *same* rules, ie "waiters" needs to be incremented *before*
getting the spinlock.

A simple patch that adds a "hb_waiters_inc(&hb2);" to the top of
double_lock_hb(), and a "hb_waiters_dec(&hb2);" to double_unlock_hb()
would seem to be the simplest solution. But that adds some unnecessary
atomics (in particular, to the futex_wake_op() case too, which doesn't
need it at all; it's just futex_requeue() that needs it, afaik).

But that alternative patch might be worth testing, just to verify that
"yes, doing the hb_waiters_inc before getting the lock really does fix
Jan's case *without* the use of spin_is_locked()". Jan? Can you try that?

Linus

2014-04-08 21:02:53

by Jan Stancek

Subject: Re: [PATCH] futex: avoid race between requeue and wake

----- Original Message -----
> From: "Linus Torvalds" <[email protected]>
> To: "Peter Zijlstra" <[email protected]>
> Cc: "Jan Stancek" <[email protected]>, "Linux Kernel Mailing List" <[email protected]>, "Srikar
> Dronamraju" <[email protected]>, "Davidlohr Bueso" <[email protected]>, "Ingo Molnar" <[email protected]>,
> "Larry Woodman" <[email protected]>
> Sent: Tuesday, 8 April, 2014 8:53:47 PM
> Subject: Re: [PATCH] futex: avoid race between requeue and wake
>
> On Tue, Apr 8, 2014 at 11:13 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Apr 08, 2014 at 10:33:55AM -0700, Linus Torvalds wrote:
> >> Hmm. And let's think about that powerpc race, where "spin_is_locked()"
> >> can be false when somebody is waiting to get the lock..
> >
> > Right; so in the original discussion I never really got to why that is a
> > problem. A pending waiter cannot modify the wait list, so either we see
> > the current holder, or we see a stable list.
>
> It's not just the list vs lock, it's also the ordering wrt the actual
> futex value accesses. I have to say, the "waiters" approach makes
> things a lot simpler, because then the main ordering issues are just
> between "waiters and futex value".
>
> The spinlock+waitqueue thing is rather more complex, because the
> spinlock is gotten before reading the futex value (on the waiting
> side), but *after* writing the futex value (on the waking side). While
> the waitqueue is then modified *after* reading the futex value (on
> both the waiting and waking side).
>
> Now, on x86, my argument for using the "spinlock + wait queue"
> combination was that getting the spinlock actually has the exact same
> semantics as the atomic waiters update (because it really *is* an
> atomic add), and while the spin-unlock destroys it, we can largely
> ignore that because by that time we'd have been guaranteed to see the
> wait-queue change too.
>
> But even on x86, I was a *bit* worried about the ordering of the
> "spin_is_locked()" read and the waitqueue value read. I thought about
> it, and the smp_rmb() we put in place should have been safe, but I
> have to admit that I was a bit nervous. It gets really hard to think
> about these things.
>
> So while I still think it should all work, I have to admit that I
> didn't really mind horribly much going back to the explicit "waiters"
> model that had a lot simpler ordering: just between the "waiters"
> accesses and the "futex value" accesses, and without that combination
> of "before and after" thing.
>
> Which is also why I think I'd prefer to make the requeueing case have
> the *same* rules, ie "waiters" needs to be incremented *before*
> getting the spinlock.
>
> A simple patch that adds a "hb_waiters_inc(&hb2);" to the top of
> double_lock_hb(), and a "hb_waiters_dec(&hb2);" to double_unlock_hb()
> would seem to be the simplest solution. But that adds some unnecessary
> atomics (in particular, to the futex_wake_op() case too, which doesn't
> need it at all; it's just futex_requeue() that needs it, afaik).
>
> But that alternative patch might be worth testing, just to verify that
> "yes, doing the hb_waiters_inc before getting the lock really does fix
> Jan's case *without* the use of spin_is_locked()". Jan? Can you try that?

I ran the reproducer with the following change on an s390x system,
where this can usually be reproduced within seconds:

diff --git a/kernel/futex.c b/kernel/futex.c
index 67dacaf..9150ffd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1095,6 +1095,7 @@ static int unlock_futex_pi(u32 __user *uaddr, u32 uval)
 static inline void
 double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
 {
+	hb_waiters_inc(hb2);
 	if (hb1 <= hb2) {
 		spin_lock(&hb1->lock);
 		if (hb1 < hb2)
@@ -1111,6 +1112,7 @@ double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
 	spin_unlock(&hb1->lock);
 	if (hb1 != hb2)
 		spin_unlock(&hb2->lock);
+	hb_waiters_dec(hb2);
 }

/*

The reproducer has been running without failures for over an hour now
and has made ~1.4 million iterations.

Regards,
Jan

>
> Linus
>

2014-04-08 22:30:10

by Linus Torvalds

Subject: Re: [PATCH] futex: avoid race between requeue and wake

On Tue, Apr 8, 2014 at 2:02 PM, Jan Stancek <[email protected]> wrote:
>
> I ran the reproducer with the following change on an s390x system,
> where this can usually be reproduced within seconds:
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 67dacaf..9150ffd 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1095,6 +1095,7 @@ static int unlock_futex_pi(u32 __user *uaddr, u32 uval)
>  static inline void
>  double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
>  {
> +	hb_waiters_inc(hb2);
>  	if (hb1 <= hb2) {
>  		spin_lock(&hb1->lock);
>  		if (hb1 < hb2)
> @@ -1111,6 +1112,7 @@ double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
>  	spin_unlock(&hb1->lock);
>  	if (hb1 != hb2)
>  		spin_unlock(&hb2->lock);
> +	hb_waiters_dec(hb2);
>  }
>
> /*
>
> The reproducer has been running without failures for over an hour now
> and has made ~1.4 million iterations.

Ok, that's encouraging. That is the smallest patch I could come up
with, but as mentioned, it's not optimal. We only need it for
futex_requeue(), but if we do it there we'd have to handle all the
different error cases (there's only one call to double_lock_hb(), but
due to the error cases there are four calls to double_unlock_hb()).

I'm not sure how much we care. The simple patch basically adds two
(unnecessary) atomics to the futex_wake_op() path. I don't know how
critical that path is - not as critical as the regular "futex_wake()",
I'd expect, but I guess pthread_cond_signal() is the main user.

So I'll have to leave this decision to the futex people. But the
attached slightly more complex patch *may* be the better one.

May I bother you to test this one too? I really think that
futex_requeue() is the only user that should need this, so doing it
there rather than in double_[un]lock_hb() should be slightly more
optimal, but who knows what I've missed. We clearly *all* missed this
race back when the ordering rules were documented..

Still hoping for comments from PeterZ and Davidlohr.

Linus


Attachments:
patch.diff (1.12 kB)

2014-04-09 02:20:06

by Davidlohr Bueso

Subject: Re: [PATCH] futex: avoid race between requeue and wake

Adding Thomas to the thread.

Sorry for the late reply, I was out running errands all day just to get
home to find this futex jewel in my inbox.

On Tue, 2014-04-08 at 15:30 -0700, Linus Torvalds wrote:
> On Tue, Apr 8, 2014 at 2:02 PM, Jan Stancek <[email protected]> wrote:
> >
> > I ran the reproducer with the following change on an s390x system,
> > where this can usually be reproduced within seconds:
> >
> > diff --git a/kernel/futex.c b/kernel/futex.c
> > index 67dacaf..9150ffd 100644
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -1095,6 +1095,7 @@ static int unlock_futex_pi(u32 __user *uaddr, u32 uval)
> >  static inline void
> >  double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
> >  {
> > +	hb_waiters_inc(hb2);
> >  	if (hb1 <= hb2) {
> >  		spin_lock(&hb1->lock);
> >  		if (hb1 < hb2)
> > @@ -1111,6 +1112,7 @@ double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
> >  	spin_unlock(&hb1->lock);
> >  	if (hb1 != hb2)
> >  		spin_unlock(&hb2->lock);
> > +	hb_waiters_dec(hb2);
> >  }
> >
> > /*
> >
> > The reproducer has been running without failures for over an hour now
> > and has made ~1.4 million iterations.
>
> Ok, that's encouraging. That is the smallest patch I could come up
> with, but as mentioned, it's not optimal. We only need it for
> futex_requeue(), but if we do it there we'd have to handle all the
> different error cases (there's only one call to double_lock_hb(), but
> due to the error cases there are four calls to double_unlock_hb()).

For consistency and mental sanity, I definitely prefer this alternative
to adding back the spin_is_locked check.

Linus, from what I see of your approach of always incrementing and
decrementing the hb->waiters count in futex_requeue() right around the
double_[un]lock_hb() calls, we're basically saying "let's not do this
pending-waiters optimization" for futex requeueing, right?
Which is fine, requeueing isn't really that used or performance critical
in many cases. But I mention it since the legitimate accounting is done
in requeue_futex(), which can obviously be bogus as we increment *after*
taking the hb->lock. Just want to make sure we're on the same page here.
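
For reference, the accounting being referred to looks roughly like this
(paraphrased from kernel/futex.c, so treat it as a sketch; both hash
bucket locks are already held when requeue_futex() runs, which is
exactly the exception discussed above):

static inline
void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
		   struct futex_hash_bucket *hb2, union futex_key *key2)
{
	/*
	 * If key1 and key2 hash to the same bucket, no need to
	 * requeue.
	 */
	if (likely(&hb1->chain != &hb2->chain)) {
		plist_del(&q->list, &hb1->chain);
		hb_waiters_dec(hb1);
		plist_add(&q->list, &hb2->chain);
		hb_waiters_inc(hb2);	/* increment done under hb2->lock */
		q->lock_ptr = &hb2->lock;
	}
	get_futex_key_refs(key2);
	q->key = *key2;
}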

>
> I'm not sure how much we care. The simple patch basically adds two
> (unnecessary) atomics to the futex_wake_op() path. I don't know how
> critical that path is - not as critical as the regular "futex_wake()",
> I'd expect, but I guess pthread_cond_signal() is the main user.

Agreed, since the issue occurs because we're requeueing *waiters*, let's
keep it inside the requeueing path only. In the case of futex_wake_op()
it doesn't matter, as we don't need to account for them there. It's more
code, but that's it. I'd rather add error housekeeping than add more
unnecessary logic to other futex paths.

>
> So I'll have to leave this decision to the futex people. But the
> attached slightly more complex patch *may* be the better one.
>
> May I bother you to test this one too? I really think that
> futex_requeue() is the only user that should need this, so doing it
> there rather than in double_[un]lock_hb() should be slightly more
> optimal, but who knows what I've missed. We clearly *all* missed this
> race back when the ordering rules were documented..

Yep, it's quite an obvious thing we overlooked here, and not even arch
specific... I'm surprised that the requeueing path isn't stressed more
often, and while the race window is quite small (I'm still running Jan's
program in a loop and cannot trigger it on my x86-64 80-core box), it
should have been seen earlier by some program/benchmark.

Thanks,
Davidlohr

2014-04-09 05:41:55

by Peter Zijlstra

Subject: Re: [PATCH] futex: avoid race between requeue and wake

On Tue, Apr 08, 2014 at 03:30:07PM -0700, Linus Torvalds wrote:
> So I'll have to leave this decision to the futex people. But the
> attached slightly more complex patch *may* be the better one.

Of course, tglx is the main futex 'people' and he's not on CC.. *sigh*.

2014-04-09 05:51:29

by Mike Galbraith

Subject: Re: [PATCH] futex: avoid race between requeue and wake

On Wed, 2014-04-09 at 07:41 +0200, Peter Zijlstra wrote:
> On Tue, Apr 08, 2014 at 03:30:07PM -0700, Linus Torvalds wrote:
> > So I'll have to leave this decision to the futex people. But the
> > attached slightly more complex patch *may* be the better one.
>
> Of course, tglx is the main futex 'people' and he's not on CC.. *sigh*.

(Darren is a futex people _and_ a futextest people)

2014-04-09 11:47:09

by Jan Stancek

Subject: Re: [PATCH] futex: avoid race between requeue and wake

----- Original Message -----
> From: "Linus Torvalds" <[email protected]>
> To: "Jan Stancek" <[email protected]>
> Cc: "Peter Zijlstra" <[email protected]>, "Linux Kernel Mailing List" <[email protected]>, "Srikar
> Dronamraju" <[email protected]>, "Davidlohr Bueso" <[email protected]>, "Ingo Molnar" <[email protected]>,
> "Larry Woodman" <[email protected]>
> Sent: Wednesday, 9 April, 2014 12:30:07 AM
> Subject: Re: [PATCH] futex: avoid race between requeue and wake
>
> On Tue, Apr 8, 2014 at 2:02 PM, Jan Stancek <[email protected]> wrote:
> >
> > I ran the reproducer with the following change on an s390x system,
> > where this can usually be reproduced within seconds:
> >
> > diff --git a/kernel/futex.c b/kernel/futex.c
> > index 67dacaf..9150ffd 100644
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -1095,6 +1095,7 @@ static int unlock_futex_pi(u32 __user *uaddr, u32 uval)
> >  static inline void
> >  double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
> >  {
> > +	hb_waiters_inc(hb2);
> >  	if (hb1 <= hb2) {
> >  		spin_lock(&hb1->lock);
> >  		if (hb1 < hb2)
> > @@ -1111,6 +1112,7 @@ double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
> >  	spin_unlock(&hb1->lock);
> >  	if (hb1 != hb2)
> >  		spin_unlock(&hb2->lock);
> > +	hb_waiters_dec(hb2);
> >  }
> >
> > /*
> >
> > The reproducer has been running without failures for over an hour now
> > and has made ~1.4 million iterations.

I let this version run overnight on a single s390x system;
there were no failures.

>
> Ok, that's encouraging. That is the smallest patch I could come up
> with, but as mentioned, it's not optimal. We only need it for
> futex_requeue(), but if we do it there we'd have to handle all the
> different error cases (there's only one call to double_lock_hb(), but
> due to the error cases there are four calls to double_unlock_hb()).
>
> I'm not sure how much we care. The simple patch basically adds two
> (unnecessary) atomics to the futex_wake_op() path. I don't know how
> critical that path is - not as critical as the regular "futex_wake()",
> I'd expect, but I guess pthread_cond_signal() is the main user.
>
> So I'll have to leave this decision to the futex people. But the
> attached slightly more complex patch *may* be the better one.
>
> May I bother you to test this one too? I really think that
> futex_requeue() is the only user that should need this, so doing it
> there rather than in double_[un]lock_hb() should be slightly more
> optimal, but who knows what I've missed. We clearly *all* missed this
> race back when the ordering rules were documented..

I'm running the reproducer with this patch applied on 3 systems:
- two s390x systems where this can be reproduced within seconds
- an x86_64 Intel(R) Xeon(R) CPU E5240 @ 3.00GHz, where I could
  reproduce it on average in ~3 minutes.

It's been running without failure for over 4 hours now.

Regards,
Jan

>
> Still hoping for comments from PeterZ and Davidlohr.
>
> Linus
>

2014-04-09 15:12:43

by Linus Torvalds

Subject: Re: [PATCH] futex: avoid race between requeue and wake

On Wed, Apr 9, 2014 at 4:46 AM, Jan Stancek <[email protected]> wrote:
>
>
> I'm running the reproducer with this patch applied on 3 systems:
> - two s390x systems where this can be reproduced within seconds
> - an x86_64 Intel(R) Xeon(R) CPU E5240 @ 3.00GHz, where I could
>   reproduce it on average in ~3 minutes.
>
> It's been running without failure for over 4 hours now.

Ok. I committed my second patch.

It might be possible to avoid the two extra atomics by simply not
incrementing the target hash queue waiters count (again) in
requeue_futex() the first time we hit that case, and then avoiding the
final decrement too. But that is actually fairly complicated because
we might be requeuing multiple entries (or fail to requeue any at
all). We do have all that "drop_count" logic, so it's certainly quite
possible, but it gets complex and we'd need to be crazy careful and
pass in the state to everybody involved. So it isn't something I'm
personally willing to do. But if somebody cares, there's a slight
optimization opportunity in this whole futex_requeue() situation wrt
the waiter count increment/decrement thing.

Linus

2014-04-09 19:03:27

by Davidlohr Bueso

Subject: Re: [PATCH] futex: avoid race between requeue and wake

On Wed, 2014-04-09 at 08:12 -0700, Linus Torvalds wrote:
> On Wed, Apr 9, 2014 at 4:46 AM, Jan Stancek <[email protected]> wrote:
> >
> >
> > I'm running the reproducer with this patch applied on 3 systems:
> > - two s390x systems where this can be reproduced within seconds
> > - an x86_64 Intel(R) Xeon(R) CPU E5240 @ 3.00GHz, where I could
> >   reproduce it on average in ~3 minutes.
> >
> > It's been running without failure for over 4 hours now.
>
> Ok. I committed my second patch.

As things stand now, we *really* need to update our documentation before
time passes and our brains go somewhere else, only to come back to this
code later and get (more) headaches. Either now, or sometime before 3.15
is out. Anyway, here's my suggestion.

8<-------------------

From: Davidlohr Bueso <[email protected]>
Date: Wed, 9 Apr 2014 11:55:07 -0700
Subject: [PATCH] futex: update documentation for ordering guarantees

Commits 11d4616bd07f (futex: revert back to the explicit waiter counting code)
and 69cd9eba3886 (futex: avoid race between requeue and wake) changed some of
the finer details of how we think about futexes. One was a late fix and the
other a consequence of overlooking the whole requeuing logic. The first change
caused our documentation to be incorrect, and the second made us aware that
we need to explicitly add more details to it.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/futex.c | 32 +++++++++++++++++++++++---------
1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 6801b37..5f58927 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -70,7 +70,10 @@
 #include "locking/rtmutex_common.h"
 
 /*
- * Basic futex operation and ordering guarantees:
+ * READ this before attempting to hack on futexes!
+ *
+ * Basic futex operation and ordering guarantees
+ * =============================================
  *
  * The waiter reads the futex value in user space and calls
  * futex_wait(). This function computes the hash bucket and acquires
@@ -119,7 +122,7 @@
  *   sys_futex(WAIT, futex, val);
  *     futex_wait(futex, val);
  *
- *   waiters++;
+ *   waiters++; (a)
  *   mb(); (A) <-- paired with -.
  *                              |
  *   lock(hash_bucket(futex));  |
@@ -135,14 +138,14 @@
  *     unlock(hash_bucket(futex));
  *     schedule();                         if (waiters)
  *                                           lock(hash_bucket(futex));
- *     wake_waiters(futex);
- *     unlock(hash_bucket(futex));
+ *   else                                    wake_waiters(futex);
+ *     waiters--; (b)                        unlock(hash_bucket(futex));
  *
- * Where (A) orders the waiters increment and the futex value read -- this
- * is guaranteed by the head counter in the hb spinlock; and where (B)
- * orders the write to futex and the waiters read -- this is done by the
- * barriers in get_futex_key_refs(), through either ihold or atomic_inc,
- * depending on the futex type.
+ * Where (A) orders the waiters increment and the futex value read through
+ * atomic operations (see hb_waiters_inc) and where (B) orders the write
+ * to futex and the waiters read -- this is done by the barriers in
+ * get_futex_key_refs(), through either ihold or atomic_inc, depending on the
+ * futex type.
  *
  * This yields the following case (where X:=waiters, Y:=futex):
  *
@@ -155,6 +158,17 @@
  * Which guarantees that x==0 && y==0 is impossible; which translates back into
  * the guarantee that we cannot both miss the futex variable change and the
  * enqueue.
+ *
+ * Note that a new waiter is accounted for in (a) even when it is possible that
+ * the wait call can return error, in which case we backtrack from it in (b).
+ * Refer to the comment in queue_lock().
+ *
+ * Similarly, in order to account for waiters being requeued on another
+ * address we always increment the waiters for the destination bucket before
+ * acquiring the lock. It then decrements them again after releasing it -
+ * the code that actually moves the futex(es) between hash buckets (requeue_futex)
+ * will do the additional required waiter count housekeeping. This is done for
+ * double_lock_hb() and double_unlock_hb(), respectively.
  */
 
 #ifndef CONFIG_HAVE_FUTEX_CMPXCHG
--
1.8.1.4