2004-11-14 09:01:16

by Jamie Lokier

Subject: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])

Andrew Morton wrote:
> Jamie, if you're around: help!

Revert the patch which moves queue_me(); it's buggy. It is a bug to
move queue_me() after get_user().

It fully explains the blocking threads in Apache and Evolution.

Even if it worked, the patch wouldn't have saved any time, as it's a
rare condition if the caller is using futex properly.

The patch below provides an explanation; I'd appreciate it being
applied.

---

> - According to the man page of futex:
> "If the futex was not equal to the expected value, the operation
> returns -EWOULDBLOCK."
> but now, there is no description of the rare case:
> "returns 0 if the futex was not equal to the expected value, but
> the process was woken by a FUTEX_WAKE call."
> this behavior in the rare case causes the hang which I found.

This case can occur, by design.

Bert, are you still updating the futex man pages? (Or is anyone else?)

If you are, then:

The patch below might provide some text for use in the manual, but
even if you can't easily explain why, the possibility of FUTEX_WAIT
returning 0 and counting as a wakeup when the memory word doesn't
equal val should be mentioned.

I'd appreciate being added to the authors list while you're there,
thanks :)

I think the man page would be a little clearer if the various E
conditions (ETIMEDOUT etc.) were listed in the errors section (even
though they aren't errors). Think about consistency with other man
pages which list EINTR and EAGAIN there. Also, it would be consistent
to say EAGAIN instead of EWOULDBLOCK (they're synonyms in Linux
anyway, but other man pages use EAGAIN as it's the modern name for it).
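On Linux the synonym claim is easy to check; a trivial illustration:

#include <errno.h>
#include <stdio.h>

int main(void)
{
	/* On Linux these have the same value: EWOULDBLOCK is the
	   old BSD name, EAGAIN the POSIX one. */
	printf("EAGAIN = %d, EWOULDBLOCK = %d\n", EAGAIN, EWOULDBLOCK);
	return 0;
}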

The phrase "(or other spurious error)" should be removed as it's
actually a kernel bug (but not serious) for that to occur, and
no different to EINTR from other syscalls in that respect.

In the section for FUTEX_WAIT behaviour, you might explain what
"atomically verifies .. and sleeps awaiting FUTEX_WAKE" really means,
perhaps removing the word atomic. It's not really
atomic-test-conditional-sleep, it's just carefully ordered. (Though
it's equivalent to atomic-sleep-test followed by conditional-wake).

The difference is precisely that it may return 0 and count as a wakeup
even when the memory word doesn't match prior to the
effectively-atomic region.
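
To make that concrete, here is a minimal sketch of the usage pattern in
question (hypothetical condition, "var == 0 means wait", and a hypothetical
sys_futex helper; this is not NPTL's code):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

int var;			/* the futex word, shared between threads */

static long sys_futex(int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void waiter(void)
{
	int val = var;		/* sample the word */
	if (val == 0)		/* cond(val): still need to wait? */
		sys_futex(&var, FUTEX_WAIT, val);
	/* FUTEX_WAIT can return 0 here even if var was already nonzero on
	   entry to the syscall: the wakeup is consumed and counted anyway. */
}

void waker(void)
{
	if (var == 0) {		/* cond(var) */
		var = 1;	/* update the word first... */
		sys_futex(&var, FUTEX_WAKE, 1);	/* ...then wake */
	}
}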

---

Hidetoshi Seto's example (at the FTP URL with the patch) calls
pthread_cond_signal without mentioning a mutex. That's the wrong way
to use pthread_cond_signal, as explained in the Glibc documentation.

Note that moving queue_me() after get_user() in futex_wait() does NOT
fix Hidetoshi's observed problem.

Just think about the same 4 threads in "[simulation]", but scheduled
in a slightly different sequence. Especially, look at splitting up
the sequence _between_ get_user() and queue_me(), and _between_ "wake++
and updated futex val" and "FUTEX_WAKE: no one is in waitqueue / A is
in waitqueue".

The basic logical reason why Hidetoshi's patch doesn't fix anything is
that if the get_user() test is done before queue_me() in the kernel,
that is *exactly the same* as if userspace does the word test itself
just before calling FUTEX_WAIT and FUTEX_WAIT doesn't do any test at all.

In Hidetoshi's pseudo-code, the bug is in pthread_cond_signal: it
should test the return value of FUTEX_WAKE and increment the wake
variable conditionally, not unconditionally as it does. Fix that, and
subsequent signals will wake B. The reason B is not woken initially
is that mutexes aren't used. These aren't futex bugs.
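
A sketch of that fix, written against the bug report's pseudo-code (the
struct and helper names here are hypothetical, not the real NPTL source):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <pthread.h>

struct cond {
	pthread_mutex_t counters;	/* protects the three counters */
	int futex;			/* the futex word */
	int total, wake, woken;		/* the [total/wake/woken] counters */
};

static long sys_futex(int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void cond_signal(struct cond *c)
{
	pthread_mutex_lock(&c->counters);
	if (c->total > c->wake) {
		c->futex++;		/* waiters must see the new value */
		/* wake++ only if FUTEX_WAKE actually woke somebody: */
		if (sys_futex(&c->futex, FUTEX_WAKE, 1) == 1)
			c->wake++;
	}
	pthread_mutex_unlock(&c->counters);
}

The waiter side needs the matching adjustment (counting a token when
FUTEX_WAIT returns -EWOULDBLOCK), which comes up again later in this thread.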

---

You're right about the double-down_read() problem - I hadn't realised
that could deadlock. That will require a per-task flag to make the
fault handler not take the semaphore when the fault occurs in these
places. But that's a separate bug, not addressed here.

---

That ->nqueued loop in FUTEX_CMP_REQUEUE is able to return -EAGAIN
even when the memory word does equal the argument - that's quite ugly.

That and the smp_mb() section look dubious. They're a workaround to
simulate doing something inside the spinlocks, but that is different
to the ordering properties that FUTEX_WAIT offers.

I mention this because it's nearly the same problem as prompted this
patch: that FUTEX_WAIT isn't as atomic as some people think it is, and
most importantly, making it more atomic (by using the spinlocks) does
not fix design problems in the caller.

That suggests to me that if the callers of FUTEX_CMP_REQUEUE depend
on that ->nqueued / smp_mb() loop, they may have a race which will
cause problems. If they don't depend on it, then it shouldn't be
there.

In fact that whole primitive does not look very conceptually
convincing. Some kind of requeue-and-test primitive makes sense, but
conceptually it ought to be testing *uaddr2 at the same time, and it
doesn't.

---

Signed-off-by: Jamie Lokier <[email protected]>

Explain why futex waiters must queue the current thread before testing
the memory word, not after. Consequently, futex_wait() can return 0
and count as a wakeup even if the memory word doesn't match the
value at the start of the syscall.

--- linux-2.6.9/kernel/futex.c.orig	2004-11-03 04:04:50.000000000 +0000
+++ linux-2.6.9/kernel/futex.c	2004-11-14 08:58:27.067607610 +0000
@@ -6,7 +6,7 @@
  * (C) Copyright 2003 Red Hat Inc, All Rights Reserved
  *
  * Removed page pinning, fix privately mapped COW pages and other cleanups
- * (C) Copyright 2003 Jamie Lokier
+ * (C) Copyright 2003, 2004 Jamie Lokier
  *
  * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  * enough at me, Linus for the original (flawed) idea, Matthew
@@ -489,9 +489,24 @@
 	queue_me(&q, -1, NULL);
 
 	/*
-	 * Access the page after the futex is queued.
+	 * Access the page AFTER the futex is queued.
+	 * Order is important:
+	 *
+	 *   Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
+	 *   Userspace waker:  if (cond(var)) { var = new; futex_wake(&var); }
+	 *
+	 * The basic logical guarantee of a futex is that it blocks ONLY
+	 * if cond(var) is known to be true at the time of blocking, for
+	 * any cond.  If we queued after testing *uaddr, that would open
+	 * a race condition where we could block indefinitely with
+	 * cond(var) false, which would violate the guarantee.
+	 *
+	 * A consequence is that futex_wait() can return zero and absorb
+	 * a wakeup when *uaddr != val on entry to the syscall.  This is
+	 * rare, but normal.
+	 *
 	 * We hold the mmap semaphore, so the mapping cannot have changed
-	 * since we looked it up.
+	 * since we looked it up in get_futex_key.
 	 */
 	if (get_user(curval, (int __user *)uaddr) != 0) {
 		ret = -EFAULT;


2004-11-14 09:09:56

by Andrew Morton

Subject: Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])

Emergency Services Jamie Lokier <[email protected]> wrote:
>
> Revert the patch which moves queue_me(); it's buggy. It is a bug to
> move queue_me() after get_user().

yup.

> It fully explains the blocking threads in Apache and Evolution.
>
> Even if it worked, the patch wouldn't have saved any time, as it's a
> rare condition if the caller is using futex properly.

The patch wasn't supposed to optimise anything. It fixed a bug which was
causing hangs. See
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc1/2.6.10-rc1-mm5/broken-out/futex_wait-fix.patch

Or are you saying that userspace is buggy??

2004-11-14 09:23:23

by Jamie Lokier

Subject: Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])

Andrew Morton wrote:
> The patch wasn't supposed to optimise anything. It fixed a bug which was
> causing hangs. See
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc1/2.6.10-rc1-mm5/broken-out/futex_wait-fix.patch
>
> Or are you saying that userspace is buggy??

I haven't looked at the NPTL code, but that URL's pseudo-code is buggy.
The call to FUTEX_WAKE should be doing wake++ conditionally on the
return value, not unconditionally.

Also, the patch doesn't actually fix the described problem.

It may hide it in tests, but the race or a similar one is present in a
different execution order.

The real NPTL code is more complicated than described at that URL,
because real pthread_cond_wait takes a mutex argument as well. The
bug report does not say how that is handled, and it is critically
important that the mutex and condvar are updated concurrently in the
right way.

So I don't know if NPTL is buggy, but the pseudo-code given in the bug
report is (because of unconditional wake++), and so is the failure
example (because it doesn't use a mutex).

-- Jamie

2004-11-14 09:53:49

by bert hubert

Subject: Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])

On Sun, Nov 14, 2004 at 09:23:08AM +0000, Jamie Lokier wrote:

> So I don't know if NPTL is buggy, but the pseudo-code given in the bug
> report is (because of unconditional wake++), and so is the failure
> example (because it doesn't use a mutex).

Please advise if 'Emergency Services''s update to the manpage is correct
(two levels up this message thread); if so, I can apply it and forward it
to aeb.

Thanks.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2004-11-15 00:56:52

by Hidetoshi Seto

Subject: Re: Futex queue_me/get_user ordering

Jamie Lokier wrote:
> Andrew Morton wrote:
>
>>The patch wasn't supposed to optimise anything. It fixed a bug which was
>>causing hangs. See
>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc1/2.6.10-rc1-mm5/broken-out/futex_wait-fix.patch
>>
>>Or are you saying that userspace is buggy??
>
>
> I haven't looked at the NPTL code, but that URL's pseudo-code is buggy.
> The call to FUTEX_WAKE should be doing wake++ conditionally on the
> return value, not unconditionally.
(snip)
>
> So I don't know if NPTL is buggy, but the pseudo-code given in the bug
> report is (because of unconditional wake++), and so is the failure
> example (because it doesn't use a mutex).
>
> -- Jamie

from glibc-2.3.3(RHEL4b2):

 31 int
 32 __pthread_cond_signal (cond)
 33      pthread_cond_t *cond;
 34 {
 35   /* Make sure we are alone. */
 36   lll_mutex_lock (cond->__data.__lock);
 37
 38   /* Are there any waiters to be woken? */
 39   if (cond->__data.__total_seq > cond->__data.__wakeup_seq)
 40     {
 41       /* Yes. Mark one of them as woken. */
 42       ++cond->__data.__wakeup_seq;
 43       ++cond->__data.__futex;
 44
 45       /* Wake one. */
 46       lll_futex_wake (&cond->__data.__futex, 1);
 47     }
 48
 49   /* We are done. */
 50   lll_mutex_unlock (cond->__data.__lock);
 51
 52   return 0;
 53 }

Ingo, is this buggy?

We should start again with a question:
Is this a kernel's bug or NPTL's bug?


Thanks,
H.Seto

2004-11-15 02:02:17

by Jamie Lokier

Subject: Re: Futex queue_me/get_user ordering

Hidetoshi Seto wrote:
> >So I don't know if NPTL is buggy, but the pseudo-code given in the bug
> >report is (because of unconditional wake++), and so is the failure
> >example (because it doesn't use a mutex).
>
> from glibc-2.3.3(RHEL4b2):
>
> 31 int
> 32 __pthread_cond_signal (cond)
> 33 pthread_cond_t *cond;
> 34 {
> 35 /* Make sure we are alone. */
> 36 lll_mutex_lock (cond->__data.__lock);
> 37
> 38 /* Are there any waiters to be woken? */
> 39 if (cond->__data.__total_seq > cond->__data.__wakeup_seq)
> 40 {
> 41 /* Yes. Mark one of them as woken. */
> 42 ++cond->__data.__wakeup_seq;
> 43 ++cond->__data.__futex;
> 44
> 45 /* Wake one. */
> 46 lll_futex_wake (&cond->__data.__futex, 1);
> 47 }
> 48
> 49 /* We are done. */
> 50 lll_mutex_unlock (cond->__data.__lock);
> 51
> 52 return 0;
> 53 }
>
> Ingo, is this buggy?
>
> We should start again with a question:
> Is this a kernel's bug or NPTL's bug?

Third possibility: your test is buggy. Do you actually use a mutex in
your test when you call pthread_cond_wait, and does the waker hold it
when it calls pthread_cond_signal?

If you don't use a mutex as you are supposed to with condvars, then it
might not be a kernel or NPTL bug. I'm not sure whether POSIX defines
the behaviour when you use condvars without a mutex.

If you do use a mutex (and you just didn't mention it), then the code
above is not enough to decide if there's an NPTL bug. We need to look
at pthread_cond_wait as well, to see how it does the "atomic" wait and
mutex release.

-- Jamie

2004-11-15 03:13:01

by Hidetoshi Seto

Subject: Re: Futex queue_me/get_user ordering

Jamie Lokier wrote:
> Third possibility: your test is buggy. Do you actually use a mutex in
> your test when you call pthread_cond_wait, and does the waker hold it
> when it calls pthread_cond_signal?
>
> If you don't use a mutex as you are supposed to with condvars, then it
> might not be a kernel or NPTL bug. I'm not sure if POSIX-specified
> behaviour is defined when you use condvars without a mutex.
>
> If you do use a mutex (and you just didn't mention it), then the code
> above is not enough to decide if there's an NPTL bug. We need to look
> at pthread_cond_wait as well, to see how it does the "atomic" wait and
> mutex release.
>
> -- Jamie

Now I'm asking our test team about that.

Again, from glibc-2.3.3(RHEL4b2):

[nptl/sysdeps/pthread/pthread_cond_wait.c]
 85 int
 86 __pthread_cond_wait (cond, mutex)
 87      pthread_cond_t *cond;
 88      pthread_mutex_t *mutex;
 89 {
 90   struct _pthread_cleanup_buffer buffer;
 91   struct _condvar_cleanup_buffer cbuffer;
 92   int err;
 93
 94   /* Make sure we are alone. */
 95   lll_mutex_lock (cond->__data.__lock);
 96
 97   /* Now we can release the mutex. */
 98   err = __pthread_mutex_unlock_usercnt (mutex, 0);
 99   if (__builtin_expect (err, 0))
100     {
101       lll_mutex_unlock (cond->__data.__lock);
102       return err;
103     }
104
105   /* We have one new user of the condvar. */
106   ++cond->__data.__total_seq;
107   ++cond->__data.__futex;
108   cond->__data.__nwaiters += 1 << COND_CLOCK_BITS;
109
110   /* Remember the mutex we are using here. If there is already a
111      different address store this is a bad user bug. Do not store
112      anything for pshared condvars. */
113   if (cond->__data.__mutex != (void *) ~0l)
114     cond->__data.__mutex = mutex;
115
116   /* Prepare structure passed to cancellation handler. */
117   cbuffer.cond = cond;
118   cbuffer.mutex = mutex;
119
120   /* Before we block we enable cancellation. Therefore we have to
121      install a cancellation handler. */
122   __pthread_cleanup_push (&buffer, __condvar_cleanup, &cbuffer);
123
124   /* The current values of the wakeup counter. The "woken" counter
125      must exceed this value. */
126   unsigned long long int val;
127   unsigned long long int seq;
128   val = seq = cond->__data.__wakeup_seq;
129   /* Remember the broadcast counter. */
130   cbuffer.bc_seq = cond->__data.__broadcast_seq;
131
132   do
133     {
134       unsigned int futex_val = cond->__data.__futex;
135
136       /* Prepare to wait. Release the condvar futex. */
137       lll_mutex_unlock (cond->__data.__lock);
138
139       /* Enable asynchronous cancellation. Required by the standard. */
140       cbuffer.oldtype = __pthread_enable_asynccancel ();
141
142       /* Wait until woken by signal or broadcast. */
143       lll_futex_wait (&cond->__data.__futex, futex_val);
144
145       /* Disable asynchronous cancellation. */
146       __pthread_disable_asynccancel (cbuffer.oldtype);
147
148       /* We are going to look at shared data again, so get the lock. */
149       lll_mutex_lock (cond->__data.__lock);
150
151       /* If a broadcast happened, we are done. */
152       if (cbuffer.bc_seq != cond->__data.__broadcast_seq)
153         goto bc_out;
154
155       /* Check whether we are eligible for wakeup. */
156       val = cond->__data.__wakeup_seq;
157     }
158   while (val == seq || cond->__data.__woken_seq == val);
159
160   /* Another thread woken up. */
161   ++cond->__data.__woken_seq;
162
163  bc_out:
164
165   cond->__data.__nwaiters -= 1 << COND_CLOCK_BITS;
166
167   /* If pthread_cond_destroy was called on this variable already,
168      notify the pthread_cond_destroy caller all waiters have left
169      and it can be successfully destroyed. */
170   if (cond->__data.__total_seq == -1ULL
171       && cond->__data.__nwaiters < (1 << COND_CLOCK_BITS))
172     lll_futex_wake (&cond->__data.__nwaiters, 1);
173
174   /* We are done with the condvar. */
175   lll_mutex_unlock (cond->__data.__lock);
176
177   /* The cancellation handling is back to normal, remove the handler. */
178   __pthread_cleanup_pop (&buffer, 0);
179
180   /* Get the mutex before returning. */
181   return __pthread_mutex_cond_lock (mutex);
182 }

I'm not sure but it seems that the pseudo-code could be:

(mutex must be locked before calling pthread_cond_wait.)
-A01 pthread_cond_wait {
+A01 pthread_cond_wait (futex,mutex) {
+A0* mutex_unlock(mutex);
A02 timeout = 0;
A03 lock(counters);
A04 total++;
A05 val = get_from(futex);
A06 unlock(counters);
A07
A08 sys_futex(futex, FUTEX_WAIT, val, timeout);
A09
A10 lock(counters);
A11 woken++;
A12 unlock(counters);
+A1* mutex_lock(mutex);
A13 }

(and it would be better to rename the var "futex" to "cond".)

Is it possible for NPTL to close the window between mutex_unlock()
and the actual queueing in futex_wait?


Thanks,
H.Seto

2004-11-15 04:28:22

by Chuck Ebbert

Subject: Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])

On Sun, 14 Nov 2004 at 09:00:23 +0000 Emergency Services Jamie Lokier wrote:

>+ * The basic logical guarantee of a futex is that it blocks ONLY
>+ * if cond(var) is known to be true at the time of blocking, for
>+ * any cond. If we queued after testing *uaddr, that would open
>+ * a race condition where we could block indefinitely with
>+ * cond(var) false, which would violate the guarantee.
>+ *
>+ * A consequence is that futex_wait() can return zero and absorb
>+ * a wakeup when *uaddr != val on entry to the syscall. This is
>+ * rare, but normal.


Why can't it absorb a wakeup and still return -EAGAIN when this happens?

IOW why not apply this patch to the original code?

================================================================================
 	return -EINTR;
 
  out_unqueue:
-	/* If we were woken (and unqueued), we succeeded, whatever. */
-	if (!unqueue_me(&q))
-		ret = 0;
+	unqueue_me(&q);		/* ignore result from unqueue */
  out_release_sem:
 	up_read(&current->mm->mmap_sem);
 	return ret;
================================================================================

...and what is "Emergency Services", BTW?

--Chuck Ebbert 14-Nov-04 21:28:56

2004-11-15 08:08:59

by Jamie Lokier

Subject: Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])

Chuck Ebbert wrote:
> On Sun, 14 Nov 2004 at 09:00:23 +0000 Emergency Services Jamie Lokier wrote:
>
> >+ * The basic logical guarantee of a futex is that it blocks ONLY
> >+ * if cond(var) is known to be true at the time of blocking, for
> >+ * any cond. If we queued after testing *uaddr, that would open
> >+ * a race condition where we could block indefinitely with
> >+ * cond(var) false, which would violate the guarantee.
> >+ *
> >+ * A consequence is that futex_wait() can return zero and absorb
> >+ * a wakeup when *uaddr != val on entry to the syscall. This is
> >+ * rare, but normal.
>
> Why can't it absorb a wakeup and still return -EAGAIN when this happens?
> IOW why not apply this patch to the original code?
>
> out_unqueue:
> - /* If we were woken (and unqueued), we succeeded, whatever. */
> - if (!unqueue_me(&q))
> - ret = 0;
> + unqueue_me(&q); /* ignore result from unqueue */
> out_release_sem:
> up_read(&current->mm->mmap_sem);
> return ret;

Because the number of wakeups reported to FUTEX_WAKE must _exactly_
match the number of wakeups reported to FUTEX_WAIT.

They are like tokens, and for some data structures the return value
mustn't be lost or ignored, because that would break structure
invariants - such as the matching counters in the pthread condvars
which precipitated this thread.

> ...and what is "Emergency Services", BTW?

My little joke, as I wouldn't have known about this if Andrew Morton
hadn't forwarded me the message asking about it (I've been away from l-k).

-- Jamie

2004-11-15 13:22:59

by Jamie Lokier

Subject: Re: Futex queue_me/get_user ordering

Hidetoshi Seto wrote:
> I'm not sure but it seems that the pseudo-code could be:
>
> (mutex must be locked before calling pthread_cond_wait.)
> -A01 pthread_cond_wait {
> +A01 pthread_cond_wait (futex,mutex) {
> +A0* mutex_unlock(mutex);
> A02 timeout = 0;
> A03 lock(counters);

No, it is:

> -A01 pthread_cond_wait {
> +A01 pthread_cond_wait (futex,mutex) {
> A02 timeout = 0;
> A03 lock(counters);
> +A0* mutex_unlock(mutex);

An important difference!

However, I must humbly apologise. Having studied your failure case
more, I see that the problems you observe can occur even if you do
take the mutex properly.

Consider 4 threads, doing these in parallel (the same as your example
but explicitly mentioning the mutex):

wait_A: lock mutex; pthread_cond_wait(cond, mutex); unlock mutex
wake_X: lock mutex; pthread_cond_signal(cond); unlock mutex
wait_B: lock mutex; pthread_cond_wait(cond, mutex); unlock mutex
wake_Y: lock mutex; pthread_cond_signal(cond); unlock mutex

Then with the code you have posted, it is possible to see the
sequence of events which you described. The observed problems are:

1. A lost wakeup.

wait_A is woken, but wait_B is not, even though the second
pthread_cond_signal is "observably" after wait_B.

The operation order is observable in sense that wait_B could
update the data structure which is protected by cond+mutex, and
wake_Y could read that update prior to deciding to signal.

_Logically_, what happens is that wait_A is woken by wake_X, but
it does not immediately re-acquire the mutex. In this time
window, wait_B and wake_Y both run, and then wait_A acquires the
mutex. During this window, wait_A is able to absorb the second
signal.

It's not clear to me if POSIX requires wait_B to be signalled or
not in this case.

2. Future lost wakeups.

Future calls to pthread_cond_signal(cond) fail to wake wait_B,
even much later, because cond's NPTL data structure is
inconsistent. It's invariant is broken.

This is a bug in NPTL and it's easy to fix. Never increment wake
unconditionally. Instead, increment it conditionally when (a)
FUTEX_WAKE returns 1, and also (b) when FUTEX_WAIT returns -EAGAIN.

Both these problems are possible, in exactly the way you described,
even if you do take the mutex properly.

Unfortunately, the kernel patch you tried does not fix these lost
wakeups (in addition to other problems it causes).

This is the sequence which fails (I've added numbers):

> 1. wait_A: calls pthread_cond_wait:
> total++, prepare to call FUTEX_WAIT with val=0.
> # status: [1/0/0] (0) queue={}(empty) #
>
> 2. wake_X: calls pthread_cond_signal:
> no one in waitqueue, just wake++ and update futex val.
> # status: [1/1/0] (1) queue={}(empty) #
>
> 3. wait_B: calls pthread_cond_wait:
> total++, prepare to call FUTEX_WAIT with val=1.
> # status: [2/1/0] (1) queue={}(empty) #
>
> 4. wait_A: calls FUTEX_WAIT with val=0:
> after queueing, compare val. 0!=1 ... this should be blocked...
> # status: [2/1/0] (1) queue={A} #
>
> 5. wait_B: calls FUTEX_WAIT with val=1:
> after queueing, compare val. 1==1 ... OK, let's schedule()...
> # status: [2/1/0] (1) queue={A,B} (B=sleeping) #
>
> 6. wake_Y: calls pthread_cond_signal:
> A is in waitqueue ... dequeue A, wake++ and update futex val.
> # status: [2/2/0] (2) queue={B} (B=sleeping) #
>
> 7. wait_A: end of FUTEX_WAIT with val=0:
> try to dequeue but already dequeued, return anyway.
> # status: [2/2/0] (2) queue={B} (B=sleeping) #
>
> 8. wait_A: end of pthread_cond_wait:
> woken++.
> # status: [2/2/1] (2) queue={B} (B=sleeping) #
>
> This is the bug:
> wait_A: wakeup
> wait_B: sleeping
> wake_X: wake A
> wake_Y: wake A again
>
> if subsequent wake_Z try to wake B:
>
> wake_Z: calls pthread_cond_signal:
> since total==wake, do nothing.
> # status: [2/2/1] (2) queue={B} (B=sleeping) #
>
> If wait_C comes, B becomes able to be woken, but C...

With your kernel patch, the above sequence is prevented.

Unfortunately, a very similar sequence shows the same problems. I
think the reason you do not see them in testing is that the timing
is changed.

This is the sequence, very similar to your example, which fails in the
same way with your kernel patch:

1. wait_A: calls pthread_cond_wait:
total++, prepare to call FUTEX_WAIT with val=0.
+ calls FUTEX_WAIT with val=0.
+ inside futex_wait(), kernel compares val. 0==0, not yet queued.
# status: [1/0/0] (0) queue={}(empty) #

2. wake_X: calls pthread_cond_signal:
no one in waitqueue, just wake++ and update futex val.
# status: [1/1/0] (1) queue={}(empty) #

3. wait_B: calls pthread_cond_wait:
total++, prepare to call FUTEX_WAIT with val=1.
# status: [2/1/0] (1) queue={}(empty) #

- 4. wait_A: calls FUTEX_WAIT with val=0:
- after queueing, compare val. 0!=1 ... this should be blocked...
+ 4. wait_A: inside futex_wait(), already compared val. and will block:
+ calls queue_me()... then schedule()...
# status: [2/1/0] (1) queue={A} #

5. wait_B: calls FUTEX_WAIT with val=1:
after queueing, compare val. 1==1 ... OK, let's schedule()...
# status: [2/1/0] (1) queue={A,B} (B=sleeping) #

6. wake_Y: calls pthread_cond_signal:
A is in waitqueue ... dequeue A, wake++ and update futex val.
# status: [2/2/0] (2) queue={B} (B=sleeping) #

7. wait_A: end of FUTEX_WAIT with val=0:
- try to dequeue but already dequeued, return anyway.
+ woken, return.
# status: [2/2/0] (2) queue={B} (B=sleeping) #

8. wait_A: end of pthread_cond_wait:
woken++.
# status: [2/2/1] (2) queue={B} (B=sleeping) #


I hope that explains why this is not fixed by changing the order of
operations in the kernel.

The problem of a wakeup being lost during many concurrent operations
is not easy to fix. However, the most important property is that at
least one waiter is running and has the mutex at the end of all the
concurrent operations. That property is satisfied. So first it is
important to know if this specific lost wakeup is really a bug, or if
it is acceptable POSIX behaviour.

The problem of multiple future wakeups being lost is easy to fix in
NPTL, by updating the state conditionally on the return values from
FUTEX_WAKE and FUTEX_WAIT instead of ignoring the return values.

-- Jamie

2004-11-15 14:14:53

by Jamie Lokier

Subject: Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])

bert hubert wrote:
> On Sun, Nov 14, 2004 at 09:23:08AM +0000, Jamie Lokier wrote:
>
> > So I don't know if NPTL is buggy, but the pseudo-code given in the bug
> > report is (because of unconditional wake++), and so is the failure
> > example (because it doesn't use a mutex).
>
> Please advise if 'Emergency Services''s update to the manpage is correct
> (two levels up this message thread), if so, I can apply it and forward to
> aeb.

'Emergency Services' was me, if that's what you're asking. I believe
the updates to be correct and I have studied the futex code quite a
lot.

Two more things for the man page. You wrote:

To reiterate, bare futexes are not intended as an easy to use
abstraction for end-users. Implementors are expected to be
assembly literate and to have read the sources of the futex
userspace library referenced below.

I agree they are not intended as an easy to use abstraction. However,
users do not have to be assembly literate, in the sense that it is
possible to write code using futex which is architecture-independent.

For mutexes, architecture-dependent locked bus cycles are used, but
some code which uses futex is written in C using counters.
pthread_cond_signal/wait which started this thread is an example. So
I suggest a change to read:

To reiterate, bare futexes are not intended as an easy to use
abstraction for end-users. Implementors are expected to
understand processor memory ordering, barriers and
synchronisation, and to have read the sources of the futex
userspace library referenced below.

Secondly, is it appropriate to add Ulrich Drepper's "Futexes Are
Tricky" paper to SEE ALSO?

"Futexes Are Tricky", Ulrich Drepper, June 2004,
http://people.redhat.com/drepper/futex.pdf

It's a very interesting paper, worth reading. But note that Ulrich's
description of the FUTEX_WAIT operation in that paper is *wrong*:

This means that the operation to wait on a futex is composed of
getting the lock for the futex, checking the current value, if
necessary adding the thread to the wait queue, and releasing the lock.

In fact, waiting does not get the lock for the futex. It relies on
the ordering of (1) adding to the wait queue, (2) checking the current
value, and (3) removing from the wait queue if the value doesn't
match. Among other things, this is necessary because checking the
current value cannot be done with a spinlock held.
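
In rough pseudo-C, that ordering is (a simplified sketch of 2.6
futex_wait(), with error paths and task-state handling trimmed; this is
not the literal source):

static int futex_wait_sketch(int __user *uaddr, int val)
{
	struct futex_q q;
	int curval, ret;

	down_read(&current->mm->mmap_sem);
	queue_me(&q, -1, NULL);			/* (1) add to the wait queue */

	if (get_user(curval, uaddr) != 0)	/* (2) then check the value */
		ret = -EFAULT;
	else if (curval != val)
		ret = -EWOULDBLOCK;
	else {
		up_read(&current->mm->mmap_sem);
		schedule_timeout(MAX_SCHEDULE_TIMEOUT);
		if (!unqueue_me(&q))		/* already unqueued: woken */
			return 0;
		return -EINTR;
	}

	/* (3) remove from the queue on mismatch.  If a wake raced in
	   between (1) and here, unqueue_me() fails and we report 0. */
	if (!unqueue_me(&q))
		ret = 0;
	up_read(&current->mm->mmap_sem);
	return ret;
}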

The effect is very similar, but not exactly the same.

-- Jamie

2004-11-16 08:28:41

by Hidetoshi Seto

Subject: Re: Futex queue_me/get_user ordering

OMG... Wait, wait... Don't do anything.

I have to deeply apologize to all for my mistake.
If my understanding is correct, this bug is "2.4 futex"(RHEL3) *SPECIFIC*!!
I had swallowed the story that 2.6 futex has the same problem...

So I realize that 2.6 futex never behaves:
>> "returns 0 if the futex was not equal to the expected value, but
>> the process was woken by a FUTEX_WAKE call."

Update of manpage is now unnecessary, I think.

#

First of all, I would appreciate if you could read my old post:
"Kernel bug in futex_wait, cause application hang with NPTL"
http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/2044.html

#

Then, let's go on to the main subject.

Jamie Lokier wrote:
> In fact, waiting does not get the lock for the futex. It relies on
> the ordering of (1) adding to the wait queue, (2) checking the current
> value, and (3) removing from the wait queue if the value doesn't
> match. Among other things, this is necessary because checking the
> current value cannot be done with a spinlock held.

If my understanding is correct, 2.6 futex does not get any spinlocks,
but a semaphore:

[kernel/futex.c](from 2.6, RHEL4b2)
286 static int futex_wake(unsigned long uaddr, int nr_wake)
287 {
  :
294 	down_read(&current->mm->mmap_sem);
  :
306 			wake_futex(this);
  :
314 	up_read(&current->mm->mmap_sem);
315 	return ret;
316 }
  :
477 static int futex_wait(unsigned long uaddr, int val, unsigned long time)
478 {
  :
483 	down_read(&current->mm->mmap_sem);
  :
489 	queue_me(&q, -1, NULL);
  :
500 	if (curval != val) {
501 		ret = -EWOULDBLOCK;
502 		goto out_unqueue;
503 	}
  :
509 	up_read(&current->mm->mmap_sem);
  :
528 	time = schedule_timeout(time);
  :
536 	/* If we were woken (and unqueued), we succeeded, whatever. */
537 	if (!unqueue_me(&q))
538 		return 0;
539 	if (time == 0)
540 		return -ETIMEDOUT;
541 	/* A spurious wakeup should never happen. */
542 	WARN_ON(!signal_pending(current));
543 	return -EINTR;
544
545  out_unqueue:
546 	/* If we were woken (and unqueued), we succeeded, whatever. */
547 	if (!unqueue_me(&q))
548 		ret = 0;
549  out_release_sem:
550 	up_read(&current->mm->mmap_sem);
551 	return ret;
552 }

This semaphore prevents a waiter which is temporarily queued to check the val
from being the target of a wakeup.

So my "[simulation]" is wrong if it is on 2.6, since wake_Y is never able to
touch the queue while wait_A is in the queue having the val checked.

(If it is not possible that there are threads which go around with the same
futex/condvar but each have a different mmap_sem,) 2.6 futex is quite good.

#

Next, let's see how about 2.4 futex:

[kernel/futex.c](from 2.4, RHEL3U2)
154 static inline int futex_wake(unsigned long uaddr, int offset, int num)
155 {
  :
160 	lock_futex_mm();
  :
176 			wake_up_all(&this->waiters);
  :
185 	unlock_futex_mm();
  :
188 	return ret;
189 }
  :
310 static inline int futex_wait(unsigned long uaddr,
311 			     int offset,
312 			     int val,
313 			     unsigned long time)
314 {
  :
323 	lock_futex_mm();
  :
330 	__queue_me(&q, page, uaddr, offset, -1, NULL);
  :
342 	if (curval != val) {
343 		unlock_futex_mm();
344 		ret = -EWOULDBLOCK;
345 		goto out;
346 	}
  :
357 	unlock_futex_mm();
358 	time = schedule_timeout(time);
  :
365 	if (time == 0) {
366 		ret = -ETIMEDOUT;
367 		goto out;
368 	}
369 	if (signal_pending(current))
370 		ret = -EINTR;
371  out:
372 	/* Were we woken up anyway? */
373 	if (!unqueue_me(&q))
374 		ret = 0;
375 	put_page(q.page);
376
377 	return ret;
  :
383 }

2.4 futex uses spinlocks.

 74 static inline void lock_futex_mm(void)
 75 {
 76 	spin_lock(&current->mm->page_table_lock);
 77 	spin_lock(&vcache_lock);
 78 	spin_lock(&futex_lock);
 79 }
 80
 81 static inline void unlock_futex_mm(void)
 82 {
 83 	spin_unlock(&futex_lock);
 84 	spin_unlock(&vcache_lock);
 85 	spin_unlock(&current->mm->page_table_lock);
 86 }

However, these spinlocks fail to protect temporarily queued waiters from
wakeups, because they are released *before* unqueue_me(&q) (lines 343 & 373).
So this failure allows wake_Y to touch the queue while wait_A is in it.

Of course as you know, this brings the bug which I have mentioned.
(I don't know how many distributions have the 2.4 futex in their kernels, but)
At least the 2.4 futex in RHEL3U2 is buggy.

#

I regret that I could not notice this fact earlier.
I'm sorry... I hope you'll accept my apology.


Thanks,
H.Seto

2004-11-16 15:02:48

by Jamie Lokier

Subject: Re: Futex queue_me/get_user ordering

Hidetoshi Seto wrote:
> I have to deeply apologize to all for my mistake.
> If my understanding is correct, this bug is "2.4 futex"(RHEL3) *SPECIFIC*!!
> I had swallowed the story that 2.6 futex has the same problem...

Wrong, 2.6 has the same behaviour!

> So I realize that 2.6 futex never behaves:
> >> "returns 0 if the futex was not equal to the expected value, but
> >> the process was woken by a FUTEX_WAKE call."
>
> Update of manpage is now unnecessary, I think.

It is necessary.

> First of all, I would appreciate if you could read my old post:
> "Kernel bug in futex_wait, cause application hang with NPTL"
> http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/2044.html

> If my understanding is correct, 2.6 futex does not get any spinlocks,
> but a semaphore:
>
> 286 static int futex_wake(unsigned long uaddr, int nr_wake)
> :
> 294 down_read(&current->mm->mmap_sem);
>
> 477 static int futex_wait(unsigned long uaddr, int val, unsigned long time)
> :
> 483 down_read(&current->mm->mmap_sem);

> This semaphore prevents a waiter which is temporarily queued to check the val
> from being the target of a wakeup.

No, because it's a read-write semaphore, and we do "down_read" on it
which is a shared lock. It does not prevent concurrent wake and wait
operations!

The only reason we use this semaphore is to block against vma-changing
operations (like mmap) while we look up the futex key and memory word.

> (If it is not possible that there are threads which go around with the same
> futex/condvar but each have a different mmap_sem,)

Actually it is possible, with process-shared condvars, but it's
irrelevant because down_read doesn't prevent concurrent wakes and
waits.

[About 2.4 futex in RHEL3U2 which takes spinlocks instead]:
> However, these spinlocks fail to protect temporarily queued waiters from
> wakeups, because they are released *before* unqueue_me(&q) (lines 343 & 373).
> So this failure allows wake_Y to touch the queue while wait_A is in it.

This order is necessary, because it's not safe to call get_user()
while holding any spinlocks. It is not a bug in RHEL.

> At least the 2.4 futex in RHEL3U2 is buggy.

I don't think it is, because I think the behaviour you'll see with
RHEL3U2 is no different than 2.6, just slower ;)

-- Jamie

2004-11-17 08:47:43

by Jakub Jelinek

Subject: Re: Futex queue_me/get_user ordering

On Mon, Nov 15, 2004 at 01:22:18PM +0000, Jamie Lokier wrote:
> 1. A lost wakeup.
>
> wait_A is woken, but wait_B is not, even though the second
> pthread_cond_signal is "observably" after wait_B.
>
> The operation order is observable in the sense that wait_B could
> update the data structure which is protected by cond+mutex, and
> wake_Y could read that update prior to deciding to signal.
>
> _Logically_, what happens is that wait_A is woken by wake_X, but
> it does not immediately re-acquire the mutex. In this time
> window, wait_B and wake_Y both run, and then wait_A acquires the
> mutex. During this window, wait_A is able to absorb the second
> signal.
>
> It's not clear to me if POSIX requires wait_B to be signalled or
> not in this case.
>
> 2. Future lost wakeups.
>
> Future calls to pthread_cond_signal(cond) fail to wake wait_B,
> even much later, because cond's NPTL data structure is
> inconsistent. Its invariant is broken.
>
> This is a bug in NPTL and it's easy to fix. Never increment wake
> unconditionally. Instead, increment it conditionally when (a)
> FUTEX_WAKE returns 1, and also (b) when FUTEX_WAIT returns -EAGAIN.

If you think it is fixable in userland, please write at least the pseudo
code that you believe should work. We have spent quite a lot of time
on that code and don't believe this is solvable in userland.
E.g. the futex IMHO must be incremented before FUTEX_WAKE, as otherwise
the woken tasks wouldn't see the effect.

I believe the only place this is solvable is the kernel, by ensuring
atomicity (i.e. making the "queue task iff curval == expected_val"
operation atomic wrt. futex_wake/futex_requeue in other tasks). In the
RHEL3 2.4.x backport this is easy, as a spinlock is held around the user
access (the page is first pinned, then the lock taken, then the value
compared (but that is guaranteed to be non-blocking) and, if equal,
queued; then unlocked and unpinned).
In 2.6.x this is harder if the kernel cannot allow some spinlock to be
taken while doing user access, but I guess the kernel needs to cope
with this, e.g. by queueing the task early but marking it as maybe queued
only. If futex_wake sees such a bit, it would wait until futex_wait
notifies it that it has decided whether that one should be queued or not.
Or something else, whatever, as long as the right semantics is ensured.

Just FYI, current pseudo code is (not mentioning cancellation stuff here,
code/data to deal with pthread_cond_destroy semantics, timedwait and
pshared condvars):

typedef struct { int lock, futex; uint64_t total_seq, wakeup_seq, woken_seq;
		 void *mutex; uint32_t broadcast_seq; } pthread_cond_t;

pthread_cond_signal (cond)
{
	mutex_lock (lock);
	if (total_seq > wakeup_seq) {
		++wakeup_seq, ++futex;
		futex (&futex, FUTEX_WAKE, 1);
	}
	mutex_unlock (lock);
}

pthread_cond_wait (cond, mtx)
{
	mutex_lock (lock);
	mutex_unlock (mtx->lock);
	++total_seq;
	++futex;
	mutex = mtx;
	bc_seq = broadcast_seq;
	seq = wakeup_seq;
	do {
		val = futex;
		mutex_unlock (lock);
		futex (&futex, FUTEX_WAIT, val);
		mutex_lock (lock);
		if (bc_seq != broadcast_seq)
			goto out;
	} while (wakeup_seq == seq || woken_seq == wakeup_seq);
	++woken_seq;
out:
	mutex_unlock (lock);
	mutex_lock (mtx->lock);
}

pthread_cond_broadcast (cond)
{
	mutex_lock (lock);
	if (total_seq > wakeup_seq) {
		woken_seq = wakeup_seq = total_seq;
		futex = 2 * total_seq;
		++broadcast_seq;
		val = futex;
		mutex_unlock (lock);
		if (futex (&futex, FUTEX_CMP_REQUEUE, 1, INT_MAX, &mutex->lock, val) < 0)
			futex (&futex, FUTEX_WAKE, INT_MAX);
		return;
	}
	mutex_unlock (lock);
}

Jakub

2004-11-18 01:31:49

by Hidetoshi Seto

Subject: Re: Futex queue_me/get_user ordering

Jamie Lokier wrote:
> Hidetoshi Seto wrote:
>
>>I have to deeply apologize to all for my mistake.
>>If my understanding is correct, this bug is "2.4 futex"(RHEL3) *SPECIFIC*!!
>>I had swallowed the story that 2.6 futex has the same problem...
>
> Wrong, 2.6 has the same behaviour!
>
>>So I realize that 2.6 futex never behaves:
>>
>>>> "returns 0 if the futex was not equal to the expected value, but
>>>> the process was woken by a FUTEX_WAKE call."
>>
>>Update of manpage is now unnecessary, I think.
>
> It is necessary.
>
>>First of all, I would appreciate if you could read my old post:
>>"Kernel bug in futex_wait, cause application hang with NPTL"
>>http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/2044.html
>
>>If my understanding is correct, 2.6 futex does not get any spinlocks,
>>but a semaphore:
>>
>> 286 static int futex_wake(unsigned long uaddr, int nr_wake)
>> :
>> 294 down_read(&current->mm->mmap_sem);
>>
>> 477 static int futex_wait(unsigned long uaddr, int val, unsigned long time)
>> :
>> 483 down_read(&current->mm->mmap_sem);
>
>>This semaphore prevents a waiter which is temporarily queued to check the val
>>from being the target of a wakeup.
>
> No, because it's a read-write semaphore, and we do "down_read" on it
> which is a shared lock. It does not prevent concurrent wake and wait
> operations!

Aha, yes. You are right.

> [About 2.4 futex in RHEL3U2 which takes spinlocks instead]:
>
>>However, these spinlocks fail to protect temporarily queued waiters from
>>wakeups, because they are released *before* unqueue_me(&q) (lines 343 & 373).
>>So this failure allows wake_Y to touch the queue while wait_A is in it.
>
> This order is necessary, because it's not safe to call get_user()
> while holding any spinlocks. It is not a bug in RHEL.

I think 2.4 is fixable. My original patch for 2.4 was:

/*----- patch begin -----*/

diff -Naur linux-2.4.21-EL3_org/kernel/futex.c linux-2.4.21-EL3/kernel/futex.c
--- linux-2.4.21-EL3_org/kernel/futex.c 2004-08-25 19:47:35.418632860 +0900
+++ linux-2.4.21-EL3/kernel/futex.c 2004-08-25 19:48:32.505546224 +0900
@@ -297,14 +297,20 @@
 
 	spin_lock(&vcache_lock);
 	spin_lock(&futex_lock);
+	ret = __unqueue_me(q);
+	spin_unlock(&futex_lock);
+	spin_unlock(&vcache_lock);
+	return ret;
+}
+
+static inline int __unqueue_me(struct futex_q *q)
+{
 	if (!list_empty(&q->list)) {
 		list_del(&q->list);
 		__detach_vcache(&q->vcache);
-		ret = 1;
+		return 1;
 	}
-	spin_unlock(&futex_lock);
-	spin_unlock(&vcache_lock);
-	return ret;
+	return 0;
 }
 
 static inline int futex_wait(unsigned long uaddr,
@@ -333,13 +339,18 @@
 	 * Page is pinned, but may no longer be in this address space.
 	 * It cannot schedule, so we access it with the spinlock held.
 	 */
-	if (!access_ok(VERIFY_READ, uaddr, 4))
-		goto out_fault;
+	if (!access_ok(VERIFY_READ, uaddr, 4)) {
+		__unqueue_me(&q);
+		unlock_futex_mm();
+		ret = -EFAULT;
+		goto out;
+	}
 	kaddr = kmap_atomic(page, KM_USER0);
 	curval = *(int*)(kaddr + offset);
 	kunmap_atomic(kaddr, KM_USER0);
 
 	if (curval != val) {
+		__unqueue_me(&q);
 		unlock_futex_mm();
 		ret = -EWOULDBLOCK;
 		goto out;
@@ -364,22 +375,18 @@
 	 */
 	if (time == 0) {
 		ret = -ETIMEDOUT;
-		goto out;
+		goto out_wait;
 	}
 	if (signal_pending(current))
 		ret = -EINTR;
-out:
+out_wait:
 	/* Were we woken up anyway? */
 	if (!unqueue_me(&q))
 		ret = 0;
+out:
 	put_page(q.page);
 
 	return ret;
-
-out_fault:
-	unlock_futex_mm();
-	ret = -EFAULT;
-	goto out;
 }
 
 long do_futex(unsigned long uaddr, int op, int val, unsigned long timeout,

/*----- patch end -----*/

This patch just reorders the old code on the fault path:

	if (fault) {
		unlock(futex);
		ret = -ERRVAR;
		unqueue();
		put_page();
		return ret;
	}

to the new one:

	if (fault) {
		unqueue_in_lock();
		unlock(futex);
		ret = -ERRVAR;
		put_page();
		return ret;
	}

It protects the temporarily queued thread from wakes, doesn't it?

If this works, it could be said that we can fix the 2.6 futex with a
spinlock... but it will be slow, slow...


Thanks,
H.Seto

2004-11-18 02:13:06

by Hidetoshi Seto

Subject: Re: Futex queue_me/get_user ordering

Jakub Jelinek wrote:
> On Mon, Nov 15, 2004 at 01:22:18PM +0000, Jamie Lokier wrote:
>
>> 1. A lost wakeup.
>>
>> wait_A is woken, but wait_B is not, even though the second
>> pthread_cond_signal is "observably" after wait_B.
>>
>> The operation order is observable in the sense that wait_B could
>> update the data structure which is protected by cond+mutex, and
>> wake_Y could read that update prior to deciding to signal.
>>
>> _Logically_, what happens is that wait_A is woken by wake_X, but
>> it does not immediately re-acquire the mutex. In this time
>> window, wait_B and wake_Y both run, and then wait_A acquires the
>> mutex. During this window, wait_A is able to absorb the second
>> signal.
>>
>> It's not clear to me if POSIX requires wait_B to be signalled or
>> not in this case.
>>
>> 2. Future lost wakeups.
>>
>> Future calls to pthread_cond_signal(cond) fail to wake wait_B,
>> even much later, because cond's NPTL data structure is
>> inconsistent. Its invariant is broken.
>>
>> This is a bug in NPTL and it's easy to fix. Never increment wake
>> unconditionally. Instead, increment it conditionally when (a)
>> FUTEX_WAKE returns 1, and also (b) when FUTEX_WAIT returns -EAGAIN.
>
>
> If you think it is fixable in userland, please write at least the pseudo
> code that you believe should work. We have spent quite a lot of time
> on that code and don't believe this is solvable in userland.
> E.g. the futex IMHO must be incremented before FUTEX_WAKE, as otherwise
> the woken tasks wouldn't see the effect.
>
> I believe the only place this is solvable in is the kernel, by ensuring
> atomicity (i.e. queuing task iff curval == expected_val operation atomic
> wrt. futex_wake/futex_requeue in other tasks).

I agree. I think this is a kernel problem.

Even if it is possible to avoid this problem by tricks in userland, I think
it is ugly that it could happen that threads whose val no longer matches
could be woken, e.g.:

>>>> >> "returns 0 if the futex was not equal to the expected value, but
>>>> >> the process was woken by a FUTEX_WAKE call."

Still now, update of manpage is unnecessary, I think.


Thanks,
H.Seto

2004-11-18 07:21:42

by Jamie Lokier

Subject: Re: Futex queue_me/get_user ordering

Jakub Jelinek wrote:
> On Mon, Nov 15, 2004 at 01:22:18PM +0000, Jamie Lokier wrote:
> > 1. A lost wakeup.
> >
> > wait_A is woken, but wait_B is not, even though the second
> > pthread_cond_signal is "observably" after wait_B.
> >
> > The operation order is observable in the sense that wait_B could
> > update the data structure which is protected by cond+mutex, and
> > wake_Y could read that update prior to deciding to signal.
> >
> > _Logically_, what happens is that wait_A is woken by wake_X, but
> > it does not immediately re-acquire the mutex. In this time
> > window, wait_B and wake_Y both run, and then wait_A acquires the
> > mutex. During this window, wait_A is able to absorb the second
> > signal.
> >
> > It's not clear to me if POSIX requires wait_B to be signalled or
> > not in this case.
> >
> > 2. Future lost wakeups.
> >
> > Future calls to pthread_cond_signal(cond) fail to wake wait_B,
> > even much later, because cond's NPTL data structure is
> > inconsistent. Its invariant is broken.
> >
> > This is a bug in NPTL and it's easy to fix. Never increment wake
> > unconditionally. Instead, increment it conditionally when (a)
> > FUTEX_WAKE returns 1, and also (b) when FUTEX_WAIT returns -EAGAIN.
>
> If you think it is fixable in userland, please write at least the pseudo
> code that you believe should work. We have spent quite a lot of time
> on that code and don't believe this is solvable in userland.
> E.g. the futex IMHO must be incremented before FUTEX_WAKE, as otherwise
> the woken tasks wouldn't see the effect.

Do you have an answer for whether the behaviour of (a) is a bug or
not? I don't know if it's a bug, or if that part of NPTL behaviour is
acceptable under POSIX. Even if it's acceptable, you might decide
it's not acceptable quality to do that.

That answer affects my answer.

> I believe the only place this is solvable in is the kernel, by ensuring
> atomicity (i.e. queuing task iff curval == expected_val operation atomic
> wrt. futex_wake/futex_requeue in other tasks).

I think it's solvable in userspace. I have a solution, but I'm tired
and will send it tomorrow. This is just to let you know I'm looking
at the problem.

-- Jamie

2004-11-18 19:51:12

by Jakub Jelinek

Subject: Re: Futex queue_me/get_user ordering

On Thu, Nov 18, 2004 at 07:20:58AM +0000, Jamie Lokier wrote:
> Do you have an answer for whether the behaviour of (a) is a bug or
> not? I don't know if it's a bug, or if that part of NPTL behaviour is
> acceptable under POSIX. Even if it's acceptable, you might decide
> it's not acceptable quality to do that.

Not sure what you mean by (a) there, so assuming you meant 1.
If pthread_cond_{signal,broadcast} is called with the condvar's associated
mutex held, then the standard is pretty clear when a thread is considered
blocked in pthread_cond_*wait on the condvar, as releasing the mutex and
getting blocked on the condvar in pthread_cond_*wait shall be observed as
atomic by other threads. If pthread_cond_{signal,broadcast} is called
without the mutex held, it is not that clear.
Anyway, pthread_cond_signal is supposed to wake at least one thread
blocked in pthread_cond_*wait (if there are any).

The scenario described in futex_wait-fix.patch IMHO can happen even
if all calls to pthread_cond_signal are done with mutex held around it, i.e.
A:  pthread_mutex_lock (&mtx);
A:  pthread_cond_wait (&cv, &mtx);
A:    - mtx release *)
A:    total++                                   [1/0/0] (0) {}
X:  pthread_mutex_lock (&mtx);
X:  pthread_cond_signal (&cv);
X:    - wake++                                  [1/1/0] (1) {}
X:    FUTEX_WAKE, 1 (returns, nothing is queued)
X:  pthread_mutex_unlock (&mtx);
B:  pthread_mutex_lock (&mtx);
B:  pthread_cond_wait (&cv, &mtx);
B:    - mtx release *)
B:    total++                                   [2/1/0] (1) {}
A:    FUTEX_WAIT, 0
A:    queue_me                                  [2/1/0] (1) {A}
A:    0 != 1
B:    FUTEX_WAIT, 1
B:    queue_me                                  [2/1/0] (1) {A,B}
B:    1 == 1
Y:  pthread_mutex_lock (&mtx);
Y:  pthread_cond_signal (&cv);
Y:    - wake++                                  [2/2/0] (2) {A,B}
Y:    FUTEX_WAKE, 1 (unqueues incorrectly A)    [2/2/0] (2) {B}
Y:  pthread_mutex_unlock (&mtx);
A:    try to dequeue but already dequeued;
A:    would normally return EWOULDBLOCK here,
A:    but as unqueue_me failed, returns 0
A:    woken++                                   [2/2/1] (2) {B}
B:    schedule_timeout (forever)
A:    - mtx reacquire
A:  pthread_cond_wait returns
A:  pthread_mutex_unlock (&mtx);

B:  -------------------
B:  the code would like to say pthread_mutex_unlock (&mtx)
B:  and pthread_exit here, but never reaches there.

Now, if at this point say A pthread_join's B, Y pthread_join's A and
X pthread_join's Y, the program should eventually finish, as B must have
been woken up according to the standard. Whether signal in X means
pthread_cond_wait in A returning first or pthread_cond_wait in B returning
first is I believe not defined unless special scheduling policy is used,
as both A and B are supposed to contend for mtx lock.
But I believe both A and B must be woken, assuming no other thread attempts
to acquire mtx afterwards.

*) therefore other threads that acquire mtx can now consider A blocked on
the condvar

Jakub

2004-11-27 04:17:50

by Jamie Lokier

Subject: Re: Futex queue_me/get_user ordering

I've looked at the lost-wakeups problem with NPTL condition variables
and 2.6 futex, with the help of Jakub's finely presented pseudo-code.
Unless I've made a mistake, it is fixable in userspace.

[ It might be more efficient to fix it in kernel space - on the other
hand, doing so might also make kernel futexes slower. In general, I
prefer if the kernel futex semantics can be as "loose" as possible
to minimise the locking they are absolutely required to do. Who
knows, we might come up with an algorithm that uses even less
cross-CPU traffic in the kernel, if the semantics permit it.
However, I appreciate that a more "atomic" kernel semantic is easier
to understand, and it is possible to implement that if it is really
worth doing. I would like to see benchmarks proving it doesn't slow
down normal futex stress tests though. It might not be slower at all. ]

Ok. Userspace solutions first.

Logically, waiters have four states: Awake, About to sleep, Sleeping
and Drowsy. These don't correspond to places in the code; they are
just logical states for the purpose of reasoning.

Waiters go to sleep through a sequence, from Awake to About to sleep,
then to Sleeping. This is prompted by the call to pthread_cond_wait.

Waking up is prompted by passing around WAKE tokens.

The combined operation "futex++" followed by FUTEX_WAKE is always done
as an ordered sequence, which we'll call offering a WAKE token.

That operation offers a WAKE token to all waiters, and if there exists
any single waiter in a state that will consume the token, that waiter
consumes the token and transitions immediately to Awake.

The waker offering a WAKE token knows if a waiter accepts the token
that it offers. A waiter knows if it accepts a token. Tokens are
conserved exactly (like energy and momentum). This is important.

In the Sleeping state, waiters are woken by consuming a WAKE token, as
soon as one becomes available.

In the About to sleep state, two transitions are possible. If time
passes with no WAKE tokens, they become Sleeping. If a WAKE token is
offered, they do not consume it, but they transition to a state called
Drowsy instead.

In the Drowsy state, time can pass and it will transition to Awake.
However, it can also accept a WAKE token in that state. This is
optional: if a token is offered, it might not accept it. This is
different from Sleeping, where if a token is offered it will
definitely accept it.

These are all the transitions of a waiter:

Awake -> About to sleep [Called pthread_condvar_wait]

About to sleep -> Sleeping [Time passes]
About to sleep -> Drowsy [Tickled by WAKE token but did not accept it]

Sleeping -> Awake [Accept one WAKE token - guaranteed to accept]

Drowsy -> Awake [Time passes]
Drowsy -> Awake [Accept one WAKE token - may refuse]



        +--------------+    time passes    +----------+
        |About to sleep| ----------------> | Sleeping |
        +--------------+                   +----------+
               |                                 |
   tickled by  |                                 |
   token but   |                                 |  WAKE token
   did not     |                                 |  (guaranteed to accept)
   accept it   |                                 |
               V           time passes           V
        +----------+  -------------------->  +---------+
        |  Drowsy  |                         |  Awake  |
        +----------+  -------------------->  +---------+
                           WAKE token
                           (may refuse)


The states actually correspond to the following real events. The
condvar private mutex ensures that reading the futex value occurs
before it is incremented:

About to sleep == starting from mutex release by the waiter, until
whichever comes first from FUTEX_WAKE and queue_me

Sleeping == if FUTEX_WAKE comes after queue_me, this state
begins at queue_me

Drowsy == if FUTEX_WAKE comes before queue_me, the FUTEX_WAKE
event is called "tickled by token" and this is the
moment when Drowsy begins

Awake == if FUTEX_WAKE comes before queue_me, Awake begins
at unqueue_me or a subsequent FUTEX_WAKE, whichever
comes first (these are the two transitions from
Drowsy).

if FUTEX_WAKE comes after queue_me, Awake begins
at the moment of FUTEX_WAKE (this is the transition
from Sleeping)
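
Written down mechanically, the transitions above look like this
(illustrative C only; the choice in Drowsy is genuinely nondeterministic,
so here it is passed in as a flag):

enum waiter_state { AWAKE, ABOUT_TO_SLEEP, SLEEPING, DROWSY };
enum waiter_event { TIME_PASSES, WAKE_TOKEN };

/* Returns the new state; *consumed reports whether an offered WAKE
   token was consumed (tokens are conserved, so this matters). */
static enum waiter_state
transition(enum waiter_state s, enum waiter_event e,
	   int may_refuse, int *consumed)
{
	*consumed = 0;
	switch (s) {
	case ABOUT_TO_SLEEP:
		if (e == TIME_PASSES)
			return SLEEPING;
		return DROWSY;		/* tickled, but did NOT consume it */
	case SLEEPING:
		if (e == WAKE_TOKEN) {
			*consumed = 1;	/* guaranteed to accept */
			return AWAKE;
		}
		return SLEEPING;
	case DROWSY:
		if (e == TIME_PASSES)
			return AWAKE;
		if (!may_refuse) {
			*consumed = 1;	/* may accept, stealing the token */
			return AWAKE;
		}
		return DROWSY;		/* refused the token this time */
	default:
		return s;		/* Awake stays awake */
	}
}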


On Mon, Nov 15, 2004 at 01:22:18PM +0000, Jamie Lokier wrote:
> 2. Future lost wakeups.
>
> Future calls to pthread_cond_signal(cond) fail to wake wait_B,
> even much later, because cond's NPTL data structure is
> inconsistent. It's invariant is broken.
>
> This is a bug in NPTL and it's easy to fix. Never increment wake
> unconditionally. Instead, increment it conditionally when (a)
> FUTEX_WAKE returns 1, and also (b) when FUTEX_WAIT returns -EAGAIN.

This is easy to solve.

The key invariant which breaks is that (total_seq - wakeup_seq) is
equal to the number of waiters which are effectively blocked. This
corresponds to the states "Sleeping" and "About to sleep".

pthread_cond_signal checks (total_seq - wakeup_seq), and if it's >
0, increments wakeup_seq. To maintain the invariant, at the same
time (i.e. inside the mutex), it offers a WAKE token (this is the
operational sequence futex++ followed by FUTEX_WAKE). This is
supposed to make one waiter in "About to sleep" or "Sleeping"
transition to another state.

When there is only one waiter, this works.

When there are two or more waiters, this fails because one of them can
be "Drowsy". That's not one of the states counted in (total_seq -
wakeup_seq), but it might accept the WAKE token, causing the attempt to
decrease the number in "About to sleep" and "Sleeping" to fail.

After the invariant is broken, no amount of calling
pthread_cond_signal will wake up all waiters.

Now, a waker cannot detect which state ("Sleeping" or "Drowsy")
accepted the token. A woken waiter cannot detect it either.

Therefore the solution to this invariant _must_ involve not
distinguishing those states.

The solution to maintaining the invariant is to include "Drowsy" in
the states counted by (total_seq - wakeup_seq). This means that
wakeup_seq must not be incremented by the waker if FUTEX_WAKE reports
the WAKE token is not accepted ("About to sleep" -> "Drowsy", it's
still in the counted set). wakeup_seq must also be incremented by the
waiter if FUTEX_WAIT reports that it did _not_ receive a token
("Drowsy" -> "Awake"), as this means the counted set has changed but
this has not yet been reflected in wakeup_seq.

This still fails to wake up some waiters transiently (see later), but
it solves this particular problem of the long term invariant breaking
- this is the more serious problem.

Here's the implementation. You'll notice that we do something
significant: we look at the return value of futex operations. That's
why they return a value! :)

pthread_cond_signal (cond)
{
    mutex_lock (lock);
    if (total_seq > wakeup_seq) {
<<<<<
        ++wakeup_seq, ++futex;
        futex (&futex, FUTEX_WAKE, 1);
=====
        ++futex;
        wakeup_seq += futex (&futex, FUTEX_WAKE, 1);
>>>>>
    }
    mutex_unlock (lock);
}

pthread_cond_wait (cond, mtx)
{
    mutex_lock (lock);
    mutex_unlock (mtx->lock);
    ++total_seq;
    ++futex;
    mutex = mtx;
    bc_seq = broadcast_seq;
    seq = wakeup_seq;
    do {
        val = futex;
        mutex_unlock (lock);
<<<<<
        futex (&futex, FUTEX_WAIT, val);
        mutex_lock (lock);
=====
        result = futex (&futex, FUTEX_WAIT, val);
        mutex_lock (lock);
        if (result < 0)
            wakeup_seq++;
>>>>>
        if (bc_seq != broadcast_seq)
            goto out;
    } while (wakeup_seq == seq || woken_seq == wakeup_seq);
    ++woken_seq;
out:
    mutex_unlock (lock);
    mutex_lock (mtx->lock);
}

(Thanks for the helpful pseudo-code, btw).

Jakub Jelinek wrote:
> E.g. the futex IMHO must be incremented before FUTEX_WAKE, as otherwise
> the woken tasks wouldn't see the effect.

futex must be incremented before FUTEX_WAKE, but wakeup_seq does not
have to be incremented before FUTEX_WAKE - the private mutex means
that it can be incremented after.

> 1. A lost wakeup.
>
> wait_A is woken, but wait_B is not, even though the second
> pthread_cond_signal is "observably" after wait_B.
>
> The operation order is observable in the sense that wait_B could
> update the data structure which is protected by cond+mutex, and
> wake_Y could read that update prior to deciding to signal.
>
> _Logically_, what happens is that wait_A is woken by wake_X, but
> it does not immediately re-acquire the mutex. In this time
> window, wait_B and wake_Y both run, and then wait_A acquires the
> mutex. During this window, wait_A is able to absorb the second
> signal.
>
> It's not clear to me if POSIX requires wait_B to be signalled or
> not in this case.

Ok, I have seen it written, and it makes sense, that two signals
should result in both waiters being woken in this case. I think
that's a reasonable expectation.

Using those logical states, this lost wakeup occurs because wait_A is
woken by wake_X, entering the "Drowsy" state, and then it accepts a
WAKE token from wake_Y, to become "Awake". Accepting a WAKE token in
the "Drowsy" state prevents wait_B from accepted it. In extreme
cases, there can be a large number of threads in the "Drowsy" state,
absorbing a lot of wakeups together.

There are several ways to fix this (the 6th is my favourite):

1. In the kernel, make the FUTEX_WAIT test-and-queue operation
effectively atomic w.r.t. FUTEX_WAKE by more exclusive locks,
as you have requested.

Effect: Prevents the "Drowsy" state from accepting WAKE tokens.

2. Subtler: In the kernel, lock FUTEX_WAIT test-and-queue operations
w.r.t. _other_ FUTEX_WAIT operations on the same futex, but
not exclusive w.r.t. FUTEX_WAKE operations.

Effect: Does not prevent "Drowsy" from accepting WAKE tokens,
but does prevent any "Sleeping" states from existing at the same
time, so "Drowsy" never steals WAKE tokens.

To be more precise, just the region from get_user to unqueue_me
needs to be locked w.r.t. other FUTEX_WAITs, but explaining
this requires a more complicated state machine.

This one is too subtle to be allowed, imho. Can you imagine
the man page trying to explain it?

3. Related to above, but purely userspace. Lock a second private
mutex around each call to FUTEX_WAIT. At first sight this
looks like it would be a performance killer, but it's not
totally obvious whether it would be:

<<<<<
    result = futex (&futex, FUTEX_WAIT, val);
=====
    mutex_lock (lock2);
    result = futex (&futex, FUTEX_WAIT, val);
    mutex_unlock (lock2);
>>>>>

4. A combination of low-impact kernel and userspace changes.

In the kernel, change the return value of FUTEX_WAIT to report
when the futex word didn't match but it received a wakeup anyway.

Effect: Allows the waiter to detect that it absorbed a WAKE
token in the "Drowsy" state, implying that it was maybe needed
by another waiter, so it should re-transmit that token by
calling FUTEX_WAKE.

The kernel code change is trivial and has no performance impact
on futexes in general, e.g. as used for mutexes, but here it
might lead to redundant extra system calls in some cases.

This strategy has a subtle behavioural quirk, possibly a flaw (I'm
not sure), which is described at the end of answer 5 below.

Kernel change looks like:

out_unqueue:
    /* If we were woken (and unqueued), we succeeded, whatever. */
    if (!unqueue_me(&q))
<<<<<
        ret = 0;
=====
        ret = 1;
>>>>>

Userspace change looks like:

    result = futex (&futex, FUTEX_WAIT, val);
    mutex_lock (lock);
    if (result < 0)
        wakeup_seq++;
<<<<<
=====
    else if (result > 0)
        wakeup_seq += futex (&futex, FUTEX_WAKE, 1);
>>>>>

5. Like 4, but in the kernel. We change the kernel to _always_
retransmit a wakeup if it's received by the unqueue_me() in the
word-didn't-match branch.

Effect: In the "Drowsy" state, a waiter may accept a WAKE token
but then it will offer it again so they are never lost from
"Sleeping" states.

NOTE: This is NOT equivalent to changing the kernel to do
test-and-queue atomically. With this change, a FUTEX_WAKE
operation can return to userspace _before_ the final
destination of the WAKE token decides to begin FUTEX_WAIT.

This will result in spurious extra wakeups, erring too far the
other way, because of the difference from atomicity described
in the preceding paragraph.

Therefore, I don't like this. It would fix the NPTL condition
variables, but introduces new problems:

- It violates conservation of WAKE tokens (like energy and
  momentum), which some other futex-using code may depend
  on - unless the return value from FUTEX_WAIT is changed
  to report 1 when it receives a token or 2 when it
  forwards it successfully.

- Some spurious wakeups at times when a wakeup is not
  required.

- No logical benefit over doing it in userspace, but it
  would take away flexibility if the kernel always did it.

6. Like 4, but this requires no kernel change, just userspace.
Another counter is used to detect when retransmission is needed:

pthread_cond_signal (cond)
{
    mutex_lock (lock);
    if (total_seq > wakeup_seq) {
<<<<<
        ++wakeup_seq, ++futex;
        futex (&futex, FUTEX_WAKE, 1);
=====
        ++futex;
        ++missing;
        result = futex (&futex, FUTEX_WAKE, missing);
        wakeup_seq += result;
        missing -= result;
>>>>>
    }
    mutex_unlock (lock);
}

pthread_cond_wait (cond, mtx)
{
    mutex_lock (lock);
    mutex_unlock (mtx->lock);
    ++total_seq;
    ++futex;
    mutex = mtx;
    bc_seq = broadcast_seq;
    seq = wakeup_seq;
    do {
        val = futex;
        mutex_unlock (lock);
<<<<<
        futex (&futex, FUTEX_WAIT, val);
        mutex_lock (lock);
=====
        result = futex (&futex, FUTEX_WAIT, val);
        mutex_lock (lock);
        if (result < 0) {
            ++wakeup_seq;
            --missing;
        }
        if (missing) {
            result = futex (&futex, FUTEX_WAKE, missing);
            wakeup_seq += result;
            missing -= result;
        }
>>>>>
        if (bc_seq != broadcast_seq)
            goto out;
    } while (wakeup_seq == seq || woken_seq == wakeup_seq);
    ++woken_seq;
out:
    mutex_unlock (lock);
    mutex_lock (mtx->lock);
}

NOTE: The observable difference in 5 between kernel-atomic wakeups
and kernel-forwarded wakeups has an analogous form in userspace
pthreads condition variables with any of the 4, 5 or 6
implementations - that is, with anything that works by forwarding
wakeups.

If an application calls pthread_cond_signal, then that returns, and
then the application calls pthread_cond_wait, forwarded wakeups could
result in that wait being woken by the signal which logically preceded
it.

This happens because the wake is "in flight" so to speak.

It would also result in a different wait, queued earlier than the
pthread_cond_signal call, not being woken because this one is woken in
its place. The total number woken is fine.

The same thing can occur with solutions 4, 5 and 6.

Those spuriously delayed wakeups may or may not be a problem. They
are observable so a program's behaviour could be written to depend on
them not occurring. However, that's a pretty subtle thing to depend
on - not the sort of thing programs using condvars would normally do.

This time I _really_ have no idea if that would be forbidden by POSIX
or not.

I suspect some implementations of condvar work a bit like queued
signals or queued messages: pthread_cond_signal returns while the
signal itself is still in flight, and within a time window the signal
may be delivered to a subsequently starting wait. Then again, maybe
they don't.

> If you think it is fixable in userland, please write at least the pseudo
> code that you believe should work. We have spent quite a lot of time
> on that code and don't believe this is solvable in userland.

I hope I have presented and explained the userland-only solutions.

Out of all of the above, solution 6 looks most promising to me.
I'm still thinking about the wakeup ordering issues mentioned at the
end, though.

> I believe the only place this is solvable in is the kernel, by ensuring
> atomicity (i.e. queuing task iff curval == expected_val operation atomic
> wrt. futex_wake/futex_requeue in other tasks). In the RHEL3 2.4.x backport
> this is easy, as spinlock is held around the user access (the page is first
> pinned, then lock taken, then value compared (but that is guaranteed to
> be non-blocking) and if equal queued, then unlocked and unpinned.
> In 2.6.x this is harder if the kernel cannot allow some spinlock to be
> taken while doing user access, but I guess the kernel needs to cope
> with this, e.g. by queueing the task early but mark it as maybe queued
> only. If futex_wake sees such a bit, it would wait until futex_wait
> notifies it it has decided whether that one should be queued or not.
> Or something else, whatever, as long as the right semantics is ensured.

> Just FYI, current pseudo code is (not mentioning cancellation stuff here,
> code/data to deal with pthread_cond_destroy semantics, timedwait and
> pshared condvars):
>
> typedef struct { int lock, futex; uint64_t total_seq, wakeup_seq, woken_seq;
> void *mutex; uint32_t broadcast_seq; } pthread_cond_t;

A few questions:

1. Why are total_seq and so on 64 bit quantities?

The comparison problem on overflow is solvable by changing
(total_seq > wakeup_seq) to (int32_t) (total_seq -
wakeup_seq) > 0, just like the kernel does with jiffies.

If you imagine the number of waiters to exceed 2^31, you have
bigger problems, because:

2. futex is 32 bits and can overflow. If a waiter blocks, and then
a waker is called 2^32 times in succession before the waiter
can schedule again, the waiter will remain blocked after the
waker returns.

This is unlikely, except where it's done deliberately
(e.g. SIGSTOP/CONT), and it's a bug and it only needs two
threads! It could perhaps be used for denial of service.

3. Why is futex incremented in pthread_cond_wait?
I don't see the reason for it.

4. In pthread_cond_broadcast, why is the mutex_unlock(lock)
dropped before calling FUTEX_CMP_REQUEUE? Wouldn't it be
better to drop the lock just after, in which case
FUTEX_REQUEUE would be fine?

pthread_cond_signal has no problem with holding the lock
across FUTEX_WAKE, and I do not see any reason why that would
be different for pthread_cond_broadcast.

> pthread_cond_broadcast (cond)
> {
>     mutex_lock (lock);
>     if (total_seq > wakeup_seq) {
>         woken_seq = wakeup_seq = total_seq;
>         futex = 2 * total_seq;
>         ++broadcast_seq;
>         val = futex;
>         mutex_unlock (lock);
>         if (futex (&futex, FUTEX_CMP_REQUEUE, 1, INT_MAX, &mutex->lock, val) < 0)
>             futex (&futex, FUTEX_WAKE, INT_MAX);
>         return;
>     }
>     mutex_unlock (lock);
> }

-- Jamie

2004-11-29 11:26:02

by Jakub Jelinek

Subject: Re: Futex queue_me/get_user ordering

On Fri, Nov 26, 2004 at 05:06:49PM +0000, Jamie Lokier wrote:

Let's start with the questions:

> A few questions:
>
> 1. Why are total_seq and so on 64 bit quantities?
>
> The comparison problem on overflow is solvable by changing
> (total_seq > wakeup_seq) to (int32_t) (total_seq -
> wakeup_seq) > 0, just like the kernel does with jiffies.
>
> If you imagine the number of waiters to exceed 2^31, you have
> bigger problems, because:
>
> 2. futex is 32 bits and can overflow. If a waiter blocks, and then
> a waker is called 2^32 times in succession before the waiter
> can schedule again, the waiter will remain blocked after the
> waker returns.
>
> This is unlikely, except where it's done deliberately
> (e.g. SIGSTOP/CONT), and it's a bug and it only needs two
> threads! It could perhaps be used for denial of service.

The only problem with the 32-bit overflow is if you get scheduled
away in between releasing the CV's internal lock, i.e.
lll_mutex_unlock (cond->__data.__lock);
and
if (get_user(curval, (int __user *)uaddr) != 0) {
in kernel and don't get scheduled again for enough time to reach
this place within 2^31 pthread_cond_{*wait,signal,broadcast} calls.
There are no things on the userland side that would block and
in kernel the only place you can block is down_read on mm's mmap_sem
(but if the writer lock is held that long, other pthread_cond_*
calls couldn't get in either) or the short term spinlocks on the hash
bucket. SIGSTOP/SIGCONT affect the whole process, so unless you are
talking about process shared condvars, these signals aren't going to help
you in exploiting it.

But, once you get past that point, current NPTL doesn't care if 2^31 or
more other cv calls happen, it uses the 64-bit vars to determine what to
do and they are big enough that overflows on them are just assumed not to
happen. And only past that point is the thread blocked in longer-term
waiting.

> 3. Why is futex incremented in pthread_cond_wait?
> I don't see the reason for it.

See
https://www.redhat.com/archives/phil-list/2004-May/msg00023.html
https://www.redhat.com/archives/phil-list/2004-May/msg00022.html

__data.__futex increases in pthread_cond_{signal,broadcast} are so that
pthread_cond_*wait detects pthread_cond_{signal,broadcast} that happened
in between releasing of internal cv lock in the *wait and being queued
on the futex's wait queue. __data.__futex increases in pthread_cond_*wait
are so that FUTEX_CMP_REQUEUE in pthread_cond_broadcast detects
pthread_cond_*wait that happened in between releasing the internal
lock in *broadcast and test in FUTEX_CMP_REQUEUE.

> 4. In pthread_cond_broadcast, why is the mutex_unlock(lock)
> dropped before calling FUTEX_CMP_REQUEUE? Wouldn't it be
> better to drop the lock just after, in which case
> FUTEX_REQUEUE would be fine?
>
> pthread_cond_signal has no problem with holding the lock
> across FUTEX_WAKE, and I do not see any reason why that would
> be different for pthread_cond_broadcast.

Holding the internal lock over requeue kills the performance of
broadcast: if you hold the internal lock over the requeue, all the
threads you wake up will block on the internal lock anyway.

Jakub

2004-11-29 21:51:33

by Jamie Lokier

Subject: Re: Futex queue_me/get_user ordering

Jakub Jelinek wrote:
> > 2. futex is 32 bits and can overflow. If a waiter blocks, and then
> > a waker is called 2^32 times in succession before the waiter
> > can schedule again, the waiter will remain blocked after the
> > waker returns.
> >
> > This is unlikely, except where it's done deliberately
> > (e.g. SIGSTOP/CONT), and it's a bug and it only needs two
> > threads! It could perhaps be used for denial of service.
>
> The only problem with the 32-bit overflow is if you get scheduled
> away in between releasing the CV's internal lock, i.e.
> lll_mutex_unlock (cond->__data.__lock);
> and
> if (get_user(curval, (int __user *)uaddr) != 0) {
> in kernel and don't get scheduled again for enough time to reach
> this place within 2^31 pthread_cond_{*wait,signal,broadcast} calls.

Yes.

> There are no things on the userland side that would block and
> in kernel the only place you can block is down_read on mm's mmap_sem
> (but if the writer lock is held that long, other pthread_cond_*
> calls couldn't get in either) or the short term spinlocks on the hash
> bucket. SIGSTOP/SIGCONT affect the whole process, so unless you are
> talking about process shared condvars, these signals aren't going to help
> you in exploiting it.

I agree, it is a difficult exploit, and the only consequence is a
thread hangs. I thought it worth mentioning only because Ulrich brings
up a very similar 2^32 issue in "Futexes are tricky".

> But, once you get past that point, current NPTL doesn't care if 2^31 or
> more other cv calls happen, it uses the 64-bit vars to determine what to
> do and they are big enough that overflows on them are just assumed not to
> happen. And only past that point the thread is blocked in longer-term
> waiting.

About those 64-bit vars: don't the invariants guarantee the following?

total_seq - wakeup_seq < number of waiters

number of waiters is surely bounded by 2^31 (pid space), so 32-bit
vars would be enough for sure, and using wraparound-safe comparisons
(like time_after() in the kernel) would be strictly correct.

I'm just offering an optimisation here: less memory, smaller code.
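
As a minimal sketch of that comparison (seq_after is a hypothetical
helper name, not anything in NPTL):

    #include <stdint.h>

    /* Wraparound-safe "a is later than b", in the style of the
       kernel's time_after(): correct as long as the two counters
       never differ by 2^31 or more. */
    static int seq_after (uint32_t a, uint32_t b)
    {
        return (int32_t) (a - b) > 0;
    }

(total_seq > wakeup_seq) would then become
seq_after (total_seq, wakeup_seq), and both counters could be plain
32-bit words that are allowed to wrap.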

> > 3. Why is futex incremented in pthread_cond_wait?
> > I don't see the reason for it.

I figured this out in a dream at the same time as you were writing
this message! Then I woke and thought "doh!". Yes, it's pretty clear
you must increment futex if the broadcast unlocks before requeuing.

> See
> https://www.redhat.com/archives/phil-list/2004-May/msg00023.html
> https://www.redhat.com/archives/phil-list/2004-May/msg00022.html

Examples of problems due to broadcast unlocking before requeueing and
the necessary fixes.

> > 4. In pthread_cond_broadcast, why is the mutex_unlock(lock)
> > dropped before calling FUTEX_CMP_REQUEUE? Wouldn't it be
> > better to drop the lock just after, in which case
> > FUTEX_REQUEUE would be fine?
> >
> > pthread_cond_signal has no problem with holding the lock
> > across FUTEX_WAKE, and I do not see any reason why that would
> > be different for pthread_cond_broadcast.
>
> Holding the internal lock over requeue kills the performance of
> broadcast: if you hold the internal lock over the requeue, all the
> threads you wake up will block on the internal lock anyway.

Let's take a closer look.

Do you mean broadcast of process-shared condvars?

When a process-local broadcast requeues, it doesn't wake up lots of
threads; it wakes exactly one thread. When a process-shared broadcast
requeues, it wakes every waiter (because it doesn't know the address
of the mutex).

First the process-local case.

There are potentially 2 redundant context switches when signalling, and
there would be potentially 2 when broadcasting process-local _if_ the
lock were released after the requeue:

- switch to the thread just woken (#1 redundant switch)
- it tries to get the mutex and fails
- switch back to the signal/broadcast thread (#2 redundant switch)
- signaller/broadcaster releases mutex
- switch to the thread just woken (this is not redundant)

I thought this was what you meant, at first, and I wondered why spend
so much effort fixing it for broadcast and not for signal. Surely
signal is as important.

Then I realised you might mean process-shared wakeups being slow
because broadcast cannot requeue in that case.

Still, the earlier thought revealed a neat solution to those 2
potential context switches that also fixes process-shared broadcast,
while retaining the lock over requeue.

This is worth a look because I think it may turn out to be faster for
the common process-local cases too - precisely because it prevents the
potential 2 context switches after pthread_cond_signal. (Some
messages indicate that has been observed sometimes).

I'll explain with code. There may be mistakes, but hopefully the
principle is conveyed.

Something to watch out for is that FUTEX_REQUEUE is used to requeue to
&lock _and_ &mutex->lock in this code.

pthread_cond_signal (cond)
{
    mutex_lock (lock);
    if (total_seq > wakeup_seq) {
-       ++wakeup_seq, ++futex;
-       futex (&futex, FUTEX_WAKE, 1);
+       ++futex;
+       if (futex (&futex, FUTEX_REQUEUE, 0, 1, &lock) > 0) {
+           ++wakeup_seq;
+           lock = WHATEVER_MAKES_UNLOCK_CALL_FUTEX_WAKE;
+       }
    }
    mutex_unlock (lock);
}

pthread_cond_broadcast (cond)
{
    mutex_lock (lock);
    if (total_seq > wakeup_seq) {
-       woken_seq = wakeup_seq = total_seq;
-       futex = 2 * total_seq;
-       ++broadcast_seq;
-       val = futex;
-       mutex_unlock (lock);
-       if (process_shared || futex (&futex, FUTEX_CMP_REQUEUE, 1, INT_MAX,
-                                    &mutex->lock, val) < 0)
-           futex (&futex, FUTEX_WAKE, INT_MAX);
-       return;
+       count = total_seq - wakeup_seq;
+       ++futex;
+       if (process_shared) {
+           count = futex (&futex, FUTEX_REQUEUE, 0, count, &lock);
+           wakeup_seq += count;
+           if (count > 0)
+               lock = WHATEVER_MAKES_UNLOCK_CALL_FUTEX_WAKE;
+       } else if (futex (&futex, FUTEX_REQUEUE, 0, 1, &lock) > 0) {
+           count = futex (&futex, FUTEX_REQUEUE, 0, count - 1, &mutex->lock);
+           wakeup_seq += count + 1;
+           lock = WHATEVER_MAKES_UNLOCK_CALL_FUTEX_WAKE;
+       }
    }
    mutex_unlock (lock);
}

pthread_cond_wait (cond, mtx)
{
    mutex_lock (lock);
    mutex_unlock (mtx->lock);
    ++total_seq;
-   ++futex;
    mutex = mtx;
    bc_seq = broadcast_seq;
    seq = wakeup_seq;
    do {
        val = futex;
        mutex_unlock (lock);
-       futex (&futex, FUTEX_WAIT, val);
-       mutex_lock (lock);
-       if (bc_seq != broadcast_seq)
-           goto out;
+       result = futex (&futex, FUTEX_WAIT, val);
+       mutex_lock (lock);
+       if (result < 0 && wakeup_seq < total_seq)
+           wakeup_seq++;
    } while (wakeup_seq == seq || woken_seq == wakeup_seq);
    ++woken_seq;
-   out:
    mutex_unlock (lock);
    mutex_lock (mtx->lock);
}
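
For concreteness, here is one hedged reading of the
WHATEVER_MAKES_UNLOCK_CALL_FUTEX_WAKE placeholder - a sketch only,
assuming the usual three-state futex mutex (0 = free, 1 = locked,
2 = locked with waiters):

    /* Force the internal lock word into the contended state, so the
       mutex_unlock (lock) that follows takes its slow path and does
       a FUTEX_WAKE - handing the internal lock straight to the
       thread that was just requeued onto it. */
    #define WHATEVER_MAKES_UNLOCK_CALL_FUTEX_WAKE 2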

(By the way, there's a further optimisation not shown for
process-shared broadcast: if wait is called with a mutex in the same
page as the condvar, the offset within that page is valid for
computing the mutex address in the process-shared broadcast, so it can
requeue to the mutex in that case.)

-- Jamie

2005-03-17 11:30:35

by Jakub Jelinek

Subject: Re: Futex queue_me/get_user ordering

On Thu, Nov 18, 2004 at 02:47:26PM -0500, Jakub Jelinek wrote:
> The scenario described in futex_wait-fix.patch IMHO can happen even
> if all calls to pthread_cond_signal are done with mutex held around it, i.e.
> A                       B                       X                       Y
> pthread_mutex_lock (&mtx);
> pthread_cond_wait (&cv, &mtx);
> - mtx release *)
> total++ [1/0/0] (0) {}
>                                                 pthread_mutex_lock (&mtx);
>                                                 pthread_cond_signal (&cv);
>                                                 - wake++ [1/1/0] (1) {}
>                                                 FUTEX_WAKE, 1 (returns, nothing is queued)
>                                                 pthread_mutex_unlock (&mtx);
>                         pthread_mutex_lock (&mtx);
>                         pthread_cond_wait (&cv, &mtx);
>                         - mtx release *)
>                         total++ [2/1/0] (1) {}
> FUTEX_WAIT, 0
> queue_me [2/1/0] (1) {A}
> 0 != 1
>                         FUTEX_WAIT, 1
>                         queue_me [2/1/0] (1) {A,B}
>                         1 == 1
>                                                                         pthread_mutex_lock (&mtx);
>                                                                         pthread_cond_signal (&cv);
>                                                                         - wake++ [2/2/0] (2) {A,B}
>                                                                         FUTEX_WAKE, 1 (unqueues incorrectly A)
>                                                                         [2/2/0] (2) {B}
>                                                                         pthread_mutex_unlock (&mtx);
> try to dequeue but already dequeued
> would normally return EWOULDBLOCK here
> but as unqueue_me failed, returns 0
> woken++ [2/2/1] (2) {B}
>                         schedule_timeout (forever)
> - mtx reacquire
> pthread_cond_wait returns
> pthread_mutex_unlock (&mtx);
>
> -------------------
> the code would like to say pthread_mutex_unlock (&mtx);
> and pthread_exit here, but never reaches there.
...

http://www.ussg.iu.edu/hypermail/linux/kernel/0411.2/0953.html

Your argument in November was that you don't want to slow down the
kernel and that userland must be able to cope with the
non-atomicity of futex syscall.

But with the recent changes to futex.c I think kernel can ensure
atomicity for free.

With get_futex_value_locked doing the user access in_atomic () and
repeating if that failed, I think it would be just a matter of
something as in the patch below (totally untested though).
It would simplify the requeue implementation (getting rid of the nqueued
field), as well as never enqueue a futex in futex_wait until the
*uaddr == val uaccess check has shown it should be enqueued.
And I don't think the kernel will be any slower because of that:
in the common case where get_futex_value_locked does not cause
an mm fault (userland typically accessed that memory a few cycles before
the syscall), the futex_wait change just amounts to doing the first half
of queue_me before the user access and the second half after it.

--- linux-2.6.11/kernel/futex.c.jj 2005-03-17 04:42:29.000000000 -0500
+++ linux-2.6.11/kernel/futex.c 2005-03-17 05:13:45.000000000 -0500
@@ -97,7 +97,6 @@ struct futex_q {
*/
struct futex_hash_bucket {
spinlock_t lock;
- unsigned int nqueued;
struct list_head chain;
};

@@ -265,7 +264,6 @@ static inline int get_futex_value_locked
inc_preempt_count();
ret = __copy_from_user_inatomic(dest, from, sizeof(int));
dec_preempt_count();
- preempt_check_resched();

return ret ? -EFAULT : 0;
}
@@ -339,7 +337,6 @@ static int futex_requeue(unsigned long u
struct list_head *head1;
struct futex_q *this, *next;
int ret, drop_count = 0;
- unsigned int nqueued;

retry:
down_read(&current->mm->mmap_sem);
@@ -354,23 +351,24 @@ static int futex_requeue(unsigned long u
bh1 = hash_futex(&key1);
bh2 = hash_futex(&key2);

- nqueued = bh1->nqueued;
+ if (bh1 < bh2)
+ spin_lock(&bh1->lock);
+ spin_lock(&bh2->lock);
+ if (bh1 > bh2)
+ spin_lock(&bh1->lock);
+
if (likely(valp != NULL)) {
int curval;

- /* In order to avoid doing get_user while
- holding bh1->lock and bh2->lock, nqueued
- (monotonically increasing field) must be first
- read, then *uaddr1 fetched from userland and
- after acquiring lock nqueued field compared with
- the stored value. The smp_mb () below
- makes sure that bh1->nqueued is read from memory
- before *uaddr1. */
- smp_mb();
-
ret = get_futex_value_locked(&curval, (int __user *)uaddr1);

if (unlikely(ret)) {
+ spin_unlock(&bh1->lock);
+ if (bh1 != bh2)
+ spin_unlock(&bh2->lock);
+
+ preempt_check_resched();
+
/* If we would have faulted, release mmap_sem, fault
* it in and start all over again.
*/
@@ -385,21 +383,10 @@ static int futex_requeue(unsigned long u
}
if (curval != *valp) {
ret = -EAGAIN;
- goto out;
+ goto out_unlock;
}
}

- if (bh1 < bh2)
- spin_lock(&bh1->lock);
- spin_lock(&bh2->lock);
- if (bh1 > bh2)
- spin_lock(&bh1->lock);
-
- if (unlikely(nqueued != bh1->nqueued && valp != NULL)) {
- ret = -EAGAIN;
- goto out_unlock;
- }
-
head1 = &bh1->chain;
list_for_each_entry_safe(this, next, head1, list) {
if (!match_futex (&this->key, &key1))
@@ -435,13 +422,9 @@ out:
return ret;
}

-/*
- * queue_me and unqueue_me must be called as a pair, each
- * exactly once. They are called with the hashed spinlock held.
- */
-
/* The key must be already stored in q->key. */
-static void queue_me(struct futex_q *q, int fd, struct file *filp)
+static inline struct futex_hash_bucket *
+queue_lock(struct futex_q *q, int fd, struct file *filp)
{
struct futex_hash_bucket *bh;

@@ -455,11 +438,35 @@ static void queue_me(struct futex_q *q,
q->lock_ptr = &bh->lock;

spin_lock(&bh->lock);
- bh->nqueued++;
+ return bh;
+}
+
+static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *bh)
+{
list_add_tail(&q->list, &bh->chain);
spin_unlock(&bh->lock);
}

+static inline void
+queue_unlock(struct futex_q *q, struct futex_hash_bucket *bh)
+{
+ spin_unlock(&bh->lock);
+ drop_key_refs(&q->key);
+}
+
+/*
+ * queue_me and unqueue_me must be called as a pair, each
+ * exactly once. They are called with the hashed spinlock held.
+ */
+
+/* The key must be already stored in q->key. */
+static void queue_me(struct futex_q *q, int fd, struct file *filp)
+{
+ struct futex_hash_bucket *bh;
+ bh = queue_lock(q, fd, filp);
+ __queue_me(q, bh);
+}
+
/* Return 1 if we were still queued (ie. 0 means we were woken) */
static int unqueue_me(struct futex_q *q)
{
@@ -503,6 +510,7 @@ static int futex_wait(unsigned long uadd
DECLARE_WAITQUEUE(wait, current);
int ret, curval;
struct futex_q q;
+ struct futex_hash_bucket *bh;

retry:
down_read(&current->mm->mmap_sem);
@@ -511,7 +519,7 @@ static int futex_wait(unsigned long uadd
if (unlikely(ret != 0))
goto out_release_sem;

- queue_me(&q, -1, NULL);
+ bh = queue_lock(&q, -1, NULL);

/*
* Access the page AFTER the futex is queued.
@@ -537,14 +545,15 @@ static int futex_wait(unsigned long uadd
ret = get_futex_value_locked(&curval, (int __user *)uaddr);

if (unlikely(ret)) {
+ queue_unlock(&q, bh);
+
+ preempt_check_resched();
+
/* If we would have faulted, release mmap_sem, fault it in and
* start all over again.
*/
up_read(&current->mm->mmap_sem);

- if (!unqueue_me(&q)) /* There's a chance we got woken already */
- return 0;
-
ret = get_user(curval, (int __user *)uaddr);

if (!ret)
@@ -553,9 +562,15 @@ static int futex_wait(unsigned long uadd
}
if (curval != val) {
ret = -EWOULDBLOCK;
- goto out_unqueue;
+ queue_unlock(&q, bh);
+ preempt_check_resched();
+ goto out_release_sem;
}

+ /* Only actually queue if *uaddr contained val. */
+ __queue_me(&q, bh);
+ preempt_check_resched();
+
/*
* Now the futex is queued and we have checked the data, we
* don't want to hold mmap_sem while we sleep.


Jakub

2005-03-17 15:21:30

by Jamie Lokier

Subject: Re: Futex queue_me/get_user ordering

Jakub Jelinek wrote:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0411.2/0953.html
>
> Your argument in November was that you don't want to slow down the
> kernel and that userland must be able to cope with the
> non-atomicity of futex syscall.

Those were two of them.

But my other main concern is conceptual.

Right now, a futex_wait call is roughly equivalent to
add_wait_queue, which is quite versatile.

It means anything you can do with one futex, you can extend to
multiple futexes (e.g. waiting on more than one lock), and you can do
asynchronously (e.g. futex_wait can be implemented in userspace as
futex_fd[1] + poll[2], and therefore things like poll-driven state machines
where one of the state machines wants to wait on a lock are possible).

[1] Ulrich was mistaken in his paper to say futex_fd needs to check a word
    to be useful; userspace is supposed to check the word after futex_fd
    and before polling or waiting on it. This is more useful because it
    extends to multiple futexes.
[2] actually it can't right now because of a flaw in futex_fd's poll
    function, but that could be fixed. The _principle_ is sound.
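
For illustration, the emulation sketched in [1] and [2] would look
roughly like this (ignoring the poll flaw of [2] for now;
futex_wait_via_fd is an invented name and error handling is minimal):

    #include <linux/futex.h>
    #include <poll.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int futex_wait_via_fd (volatile int *uaddr, int val)
    {
        /* 3rd argument is a signal number; 0 means no signal, the
           fd simply becomes readable when FUTEX_WAKE hits uaddr */
        int fd = syscall (SYS_futex, uaddr, FUTEX_FD, 0, NULL, NULL, 0);
        int ret = 0;

        if (fd < 0)
            return -1;
        if (*uaddr == val) {    /* check the word AFTER futex_fd */
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            ret = poll (&pfd, 1, -1);
        }
        close (fd);
        return ret < 0 ? -1 : 0;
    }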

If you change futex_wait to be "atomic", and then have userspace locks
which _depend_ on that atomicity, it becomes impossible to wait on
multiple of those locks, or make poll-driven state machines which can
wait on those locks.

There are applications and libraries which use futex, not just for
threading but things like database locks in files.

You can do userspace threading and simulate most blocking system calls
by making them non-blocking and using poll.

(I'm not saying anything against NPTL by this, by the way - NPTL is a
very good general purpose library - but there are occasions when an
application wants to do its own equivalent of simulated blocking
system calls for one reason or another. My favourite being research
into inter-thread JIT-optimisation in an environment like valgrind).

Right now, in principle, futex_wait is among the system calls which
can be simulated by making it non-blocking (= futex_fd) and using poll()[2].
Which means programs using futex themselves can be subject to interesting
thread optimisations by code which knows nothing about the program
(similar to valgrind..)

If you change futex_wait to be "atomic", then it would be _impossible_
to take some random 3rd party library which is using that
futex_wait, and convert its blocking system calls to use poll-driven
state machines instead.

I think taking that away would be a great conceptual loss.

It's not a _huge_ loss, but considering it's only Glibc which is
demanding this and futexes have another property, token-passing, which
Glibc could be using instead - why not use it?

That said, let's look at your patch.

> It would simplify requeue implementation (getting rid of the nqueued
> field),

The change to FUTEX_REQUEUE2 is an improvement :)
nqueued is an abomination, like the rest of FUTEX_REQUEUE2 :)

> @@ -265,7 +264,6 @@ static inline int get_futex_value_locked
> inc_preempt_count();
> ret = __copy_from_user_inatomic(dest, from, sizeof(int));
> dec_preempt_count();
> - preempt_check_resched();
>
> return ret ? -EFAULT : 0;
> }

inc_preempt_count() and dec_preempt_count() aren't needed, as
preemption is disabled by the queue spinlocks. So
get_futex_value_locked isn't needed any more: with the spinlocks held,
__get_user will do.

> [numerous instances of...]
> + preempt_check_resched();

Not required. The spin unlocks will do this.

> But with the recent changes to futex.c I think kernel can ensure
> atomicity for free.

I agree it would probably not slow the kernel, but I would _strongly_
prefer that Glibc were fixed to use the token-passing property, if
Glibc is the driving intention behind this patch - instead of this
becoming a semantic that application-level users of futex (like
database and IPC libraries) come to depend on and which can't be
decomposed into a multiple-waiting form.

(I admit that the kernel code does look nicer with
get_futex_value_locked gone, though).

By the way, do you know of Scott Snyder's recent work on fixing Glibc
in this way? He bumped into one of Glibc's currently broken corner
cases, fixed it (according to the algorithm I gave in November), and
reported that it works fine with the fix.

-- Jamie

2005-03-17 15:56:07

by Jakub Jelinek

Subject: Re: Futex queue_me/get_user ordering

On Thu, Mar 17, 2005 at 03:20:31PM +0000, Jamie Lokier wrote:
> If you change futex_wait to be "atomic", and then have userspace locks
> which _depend_ on that atomicity, it becomes impossible to wait on
> multiple of those locks, or make poll-driven state machines which can
> wait on those locks.

The futex man pages that have been around for years (certainly since mid 2002)
don't document FUTEX_WAIT as a token passing operation, but as an atomic
operation:

Say http://www.icewalkers.com/Linux/ManPages/futex-2.html

FUTEX_WAIT
    This operation atomically verifies that the futex
    address still contains the value given, and sleeps
    awaiting FUTEX_WAKE on this futex address. If the
    timeout argument is non-NULL, its contents describe
    the maximum duration of the wait, which is infinite
    otherwise. For futex(4), this call is executed if
    decrementing the count gave a negative value
    (indicating contention), and will sleep until another
    process releases the futex and executes the
    FUTEX_WAKE operation.

RETURN VALUE
    FUTEX_WAIT
        Returns 0 if the process was woken by a FUTEX_WAKE
        call. In case of timeout, ETIMEDOUT is returned. If
        the futex was not equal to the expected value, the
        operation returns EWOULDBLOCK. Signals (or other
        spurious wakeups) cause FUTEX_WAIT to return EINTR.

so there very well might be programs other than glibc that
depend on this behaviour. Given that in most cases the race
is not hit every day (after all, we have been living with it for
several years), they probably wouldn't know there is a problem
like that.

> You can do userspace threading and simulate most blocking system calls
> by making them non-blocking and using poll).

Sure, but then you need to write your own locking as well and
can just use the token passing property of futexes there.

> It's not a _huge_ loss, but considering it's only Glibc which is
> demanding this and futexes have another property, token-passing, which
> Glibc could be using instead - why not use it?

Because that requires requeue being done with the cv lock held, which
means an extra context switch.

> > @@ -265,7 +264,6 @@ static inline int get_futex_value_locked
> > inc_preempt_count();
> > ret = __copy_from_user_inatomic(dest, from, sizeof(int));
> > dec_preempt_count();
> > - preempt_check_resched();
> >
> > return ret ? -EFAULT : 0;
> > }
>
> inc_preempt_count() and dec_preempt_count() aren't needed, as
> preemption is disabled by the queue spinlocks. So
> get_futex_value_locked isn't needed any more: with the spinlocks held,
> __get_user will do.

They aren't needed if CONFIG_PREEMPT. But with !CONFIG_PREEMPT, they
are IMHO still needed, as spin_lock/spin_unlock call preempt_{disable,enable},
which is a nop if !CONFIG_PREEMPT.
__get_user can't be used though, it should be __get_user_inatomic
(or __copy_from_user_inatomic if the former doesn't exist).

> > [numerous instances of...]
> > + preempt_check_resched();
>
> Not required. The spin unlocks will do this.

True, preempt_check_resched() is a nop if !CONFIG_PREEMPT and for
CONFIG_PREEMPT spin_unlock will handle it. Will remove them from the
patch.

> > But with the recent changes to futex.c I think kernel can ensure
> > atomicity for free.
>
> I agree it would probably not slow the kernel, but I would _strongly_
> prefer that Glibc were fixed to use the token-passing property, if
> Glibc is the driving intention behind this patch - instead of this
> becoming a semantic that application-level users of futex (like
> database and IPC libraries) come to depend on and which can't be
> decomposed into a multiple-waiting form.
>
> (I admit that the kernel code does look nicer with
> get_futex_value_locked gone, though).
>
> By the way, do you know of Scott Snyder's recent work on fixing Glibc
> in this way? He bumped into one of Glibc's currently broken corner
> cases, fixed it (according to the algorithm I gave in November), and
> reported that it works fine with the fix.

I certainly haven't seen his patch.

Jakub

2005-03-18 16:59:00

by Jakub Jelinek

Subject: Re: Futex queue_me/get_user ordering

On Thu, Mar 17, 2005 at 03:20:31PM +0000, Jamie Lokier wrote:
> > [numerous instances of...]
> > + preempt_check_resched();
>
> Not required. The spin unlocks will do this.

Here is an updated patch with those removed (all of them are preceded
by spin_unlock), and the out_unqueue label and following unused code
removed too.

--- linux-2.6.11/kernel/futex.c.jj 2005-03-17 04:42:29.000000000 -0500
+++ linux-2.6.11/kernel/futex.c 2005-03-18 05:45:29.000000000 -0500
@@ -97,7 +97,6 @@ struct futex_q {
*/
struct futex_hash_bucket {
spinlock_t lock;
- unsigned int nqueued;
struct list_head chain;
};

@@ -265,7 +264,6 @@ static inline int get_futex_value_locked
inc_preempt_count();
ret = __copy_from_user_inatomic(dest, from, sizeof(int));
dec_preempt_count();
- preempt_check_resched();

return ret ? -EFAULT : 0;
}
@@ -339,7 +337,6 @@ static int futex_requeue(unsigned long u
struct list_head *head1;
struct futex_q *this, *next;
int ret, drop_count = 0;
- unsigned int nqueued;

retry:
down_read(&current->mm->mmap_sem);
@@ -354,23 +351,22 @@ static int futex_requeue(unsigned long u
bh1 = hash_futex(&key1);
bh2 = hash_futex(&key2);

- nqueued = bh1->nqueued;
+ if (bh1 < bh2)
+ spin_lock(&bh1->lock);
+ spin_lock(&bh2->lock);
+ if (bh1 > bh2)
+ spin_lock(&bh1->lock);
+
if (likely(valp != NULL)) {
int curval;

- /* In order to avoid doing get_user while
- holding bh1->lock and bh2->lock, nqueued
- (monotonically increasing field) must be first
- read, then *uaddr1 fetched from userland and
- after acquiring lock nqueued field compared with
- the stored value. The smp_mb () below
- makes sure that bh1->nqueued is read from memory
- before *uaddr1. */
- smp_mb();
-
ret = get_futex_value_locked(&curval, (int __user *)uaddr1);

if (unlikely(ret)) {
+ spin_unlock(&bh1->lock);
+ if (bh1 != bh2)
+ spin_unlock(&bh2->lock);
+
/* If we would have faulted, release mmap_sem, fault
* it in and start all over again.
*/
@@ -385,21 +381,10 @@ static int futex_requeue(unsigned long u
}
if (curval != *valp) {
ret = -EAGAIN;
- goto out;
+ goto out_unlock;
}
}

- if (bh1 < bh2)
- spin_lock(&bh1->lock);
- spin_lock(&bh2->lock);
- if (bh1 > bh2)
- spin_lock(&bh1->lock);
-
- if (unlikely(nqueued != bh1->nqueued && valp != NULL)) {
- ret = -EAGAIN;
- goto out_unlock;
- }
-
head1 = &bh1->chain;
list_for_each_entry_safe(this, next, head1, list) {
if (!match_futex (&this->key, &key1))
@@ -435,13 +420,9 @@ out:
return ret;
}

-/*
- * queue_me and unqueue_me must be called as a pair, each
- * exactly once. They are called with the hashed spinlock held.
- */
-
/* The key must be already stored in q->key. */
-static void queue_me(struct futex_q *q, int fd, struct file *filp)
+static inline struct futex_hash_bucket *
+queue_lock(struct futex_q *q, int fd, struct file *filp)
{
struct futex_hash_bucket *bh;

@@ -455,11 +436,35 @@ static void queue_me(struct futex_q *q,
q->lock_ptr = &bh->lock;

spin_lock(&bh->lock);
- bh->nqueued++;
+ return bh;
+}
+
+static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *bh)
+{
list_add_tail(&q->list, &bh->chain);
spin_unlock(&bh->lock);
}

+static inline void
+queue_unlock(struct futex_q *q, struct futex_hash_bucket *bh)
+{
+ spin_unlock(&bh->lock);
+ drop_key_refs(&q->key);
+}
+
+/*
+ * queue_me and unqueue_me must be called as a pair, each
+ * exactly once. They are called with the hashed spinlock held.
+ */
+
+/* The key must be already stored in q->key. */
+static void queue_me(struct futex_q *q, int fd, struct file *filp)
+{
+ struct futex_hash_bucket *bh;
+ bh = queue_lock(q, fd, filp);
+ __queue_me(q, bh);
+}
+
/* Return 1 if we were still queued (ie. 0 means we were woken) */
static int unqueue_me(struct futex_q *q)
{
@@ -503,6 +508,7 @@ static int futex_wait(unsigned long uadd
DECLARE_WAITQUEUE(wait, current);
int ret, curval;
struct futex_q q;
+ struct futex_hash_bucket *bh;

retry:
down_read(&current->mm->mmap_sem);
@@ -511,7 +517,7 @@ static int futex_wait(unsigned long uadd
if (unlikely(ret != 0))
goto out_release_sem;

- queue_me(&q, -1, NULL);
+ bh = queue_lock(&q, -1, NULL);

/*
* Access the page AFTER the futex is queued.
@@ -537,14 +543,13 @@ static int futex_wait(unsigned long uadd
ret = get_futex_value_locked(&curval, (int __user *)uaddr);

if (unlikely(ret)) {
+ queue_unlock(&q, bh);
+
/* If we would have faulted, release mmap_sem, fault it in and
* start all over again.
*/
up_read(&current->mm->mmap_sem);

- if (!unqueue_me(&q)) /* There's a chance we got woken already */
- return 0;
-
ret = get_user(curval, (int __user *)uaddr);

if (!ret)
@@ -553,9 +558,13 @@ static int futex_wait(unsigned long uadd
}
if (curval != val) {
ret = -EWOULDBLOCK;
- goto out_unqueue;
+ queue_unlock(&q, bh);
+ goto out_release_sem;
}

+ /* Only actually queue if *uaddr contained val. */
+ __queue_me(&q, bh);
+
/*
* Now the futex is queued and we have checked the data, we
* don't want to hold mmap_sem while we sleep.
@@ -596,10 +605,6 @@ static int futex_wait(unsigned long uadd
* have handled it for us already. */
return -EINTR;

- out_unqueue:
- /* If we were woken (and unqueued), we succeeded, whatever. */
- if (!unqueue_me(&q))
- ret = 0;
out_release_sem:
up_read(&current->mm->mmap_sem);
return ret;


Jakub

2005-03-18 17:04:01

by Ingo Molnar

Subject: Re: Futex queue_me/get_user ordering


* Jakub Jelinek <[email protected]> wrote:

> The futex man pages that have been around for years (certainly since
> mid 2002) don't document FUTEX_WAIT as a token passing
> operation, but as an atomic operation:
>
> Say http://www.icewalkers.com/Linux/ManPages/futex-2.html

besides this documented-behavior argument, i dont think futexes should
be degraded into waitqueues - in fact, to solve some of the known
performance problems the opposite will have to happen: e.g. i believe
that in the future we'll need to enable the kernel-side futex code to
actually modify the futex variable. I.e. atomicity of the read in
FUTEX_WAIT is an absolute must, and is only the first step.

[ the double-context-switch problem in cond_signal() that Jamie
  mentioned is precisely one such case: pthread semantics force the
  wakeup of the wakee to happen while we are still holding the
  internal lock, so we cannot just delay the wakeup to outside the
  glibc critical section. This double context-switch could be avoided
  if the 'release internal lock and wake up wakee' operation could be
  done atomically within the kernel. (A sane default 'userspace
  unlock' operation on a machine word could be defined .. e.g.
  decrement-to-zero.) ]
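
[ one possible shape for such a combined operation, as a sketch only;
  a later kernel operation in this family, FUTEX_WAKE_OP, took
  roughly this form, though its exact encoding may differ:

      /* One syscall (needs linux/futex.h and sys/syscall.h): wake
         one waiter on the futex word, atomically set the internal
         lock word to 0 (the unlock), and if its old value was > 0
         also wake one waiter blocked on the lock word. */
      syscall (SYS_futex, &futex, FUTEX_WAKE_OP, 1, 1, &lock,
               FUTEX_OP (FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 0));
]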

so i'm very much in favor of your patch - it fixes a real bug and is
also the right step forward. We'll need more locking code in the kernel
to remove fundamental limitations of userspace (such as no ability to
control preemption), not less.

i've tested your latest patch (from today) on x86 and it boots/works
fine with Fedora userspace, where futexes do get utilized, and ran a few
tests as well.

(Andrew - might make sense to include in the next -mm so that we get
some feel of stability, while the conceptual discussion continues?)

Ingo

--

this patch makes FUTEX_WAIT atomic again.

Signed-off-by: Jakub Jelinek <[email protected]>
Acked-by: Ingo Molnar <[email protected]>

--- linux/kernel/futex.c.orig
+++ linux/kernel/futex.c
@@ -97,7 +97,6 @@ struct futex_q {
*/
struct futex_hash_bucket {
spinlock_t lock;
- unsigned int nqueued;
struct list_head chain;
};

@@ -265,7 +264,6 @@ static inline int get_futex_value_locked
inc_preempt_count();
ret = __copy_from_user_inatomic(dest, from, sizeof(int));
dec_preempt_count();
- preempt_check_resched();

return ret ? -EFAULT : 0;
}
@@ -339,7 +337,6 @@ static int futex_requeue(unsigned long u
struct list_head *head1;
struct futex_q *this, *next;
int ret, drop_count = 0;
- unsigned int nqueued;

retry:
down_read(&current->mm->mmap_sem);
@@ -354,23 +351,22 @@ static int futex_requeue(unsigned long u
bh1 = hash_futex(&key1);
bh2 = hash_futex(&key2);

- nqueued = bh1->nqueued;
+ if (bh1 < bh2)
+ spin_lock(&bh1->lock);
+ spin_lock(&bh2->lock);
+ if (bh1 > bh2)
+ spin_lock(&bh1->lock);
+
if (likely(valp != NULL)) {
int curval;

- /* In order to avoid doing get_user while
- holding bh1->lock and bh2->lock, nqueued
- (monotonically increasing field) must be first
- read, then *uaddr1 fetched from userland and
- after acquiring lock nqueued field compared with
- the stored value. The smp_mb () below
- makes sure that bh1->nqueued is read from memory
- before *uaddr1. */
- smp_mb();
-
ret = get_futex_value_locked(&curval, (int __user *)uaddr1);

if (unlikely(ret)) {
+ spin_unlock(&bh1->lock);
+ if (bh1 != bh2)
+ spin_unlock(&bh2->lock);
+
/* If we would have faulted, release mmap_sem, fault
* it in and start all over again.
*/
@@ -385,21 +381,10 @@ static int futex_requeue(unsigned long u
}
if (curval != *valp) {
ret = -EAGAIN;
- goto out;
+ goto out_unlock;
}
}

- if (bh1 < bh2)
- spin_lock(&bh1->lock);
- spin_lock(&bh2->lock);
- if (bh1 > bh2)
- spin_lock(&bh1->lock);
-
- if (unlikely(nqueued != bh1->nqueued && valp != NULL)) {
- ret = -EAGAIN;
- goto out_unlock;
- }
-
head1 = &bh1->chain;
list_for_each_entry_safe(this, next, head1, list) {
if (!match_futex (&this->key, &key1))
@@ -435,13 +420,9 @@ out:
return ret;
}

-/*
- * queue_me and unqueue_me must be called as a pair, each
- * exactly once. They are called with the hashed spinlock held.
- */
-
/* The key must be already stored in q->key. */
-static void queue_me(struct futex_q *q, int fd, struct file *filp)
+static inline struct futex_hash_bucket *
+queue_lock(struct futex_q *q, int fd, struct file *filp)
{
struct futex_hash_bucket *bh;

@@ -455,11 +436,35 @@ static void queue_me(struct futex_q *q,
q->lock_ptr = &bh->lock;

spin_lock(&bh->lock);
- bh->nqueued++;
+ return bh;
+}
+
+static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *bh)
+{
list_add_tail(&q->list, &bh->chain);
spin_unlock(&bh->lock);
}

+static inline void
+queue_unlock(struct futex_q *q, struct futex_hash_bucket *bh)
+{
+ spin_unlock(&bh->lock);
+ drop_key_refs(&q->key);
+}
+
+/*
+ * queue_me and unqueue_me must be called as a pair, each
+ * exactly once. They are called with the hashed spinlock held.
+ */
+
+/* The key must be already stored in q->key. */
+static void queue_me(struct futex_q *q, int fd, struct file *filp)
+{
+ struct futex_hash_bucket *bh;
+ bh = queue_lock(q, fd, filp);
+ __queue_me(q, bh);
+}
+
/* Return 1 if we were still queued (ie. 0 means we were woken) */
static int unqueue_me(struct futex_q *q)
{
@@ -503,6 +508,7 @@ static int futex_wait(unsigned long uadd
DECLARE_WAITQUEUE(wait, current);
int ret, curval;
struct futex_q q;
+ struct futex_hash_bucket *bh;

retry:
down_read(&current->mm->mmap_sem);
@@ -511,7 +517,7 @@ static int futex_wait(unsigned long uadd
if (unlikely(ret != 0))
goto out_release_sem;

- queue_me(&q, -1, NULL);
+ bh = queue_lock(&q, -1, NULL);

/*
* Access the page AFTER the futex is queued.
@@ -537,14 +543,13 @@ static int futex_wait(unsigned long uadd
ret = get_futex_value_locked(&curval, (int __user *)uaddr);

if (unlikely(ret)) {
+ queue_unlock(&q, bh);
+
/* If we would have faulted, release mmap_sem, fault it in and
* start all over again.
*/
up_read(&current->mm->mmap_sem);

- if (!unqueue_me(&q)) /* There's a chance we got woken already */
- return 0;
-
ret = get_user(curval, (int __user *)uaddr);

if (!ret)
@@ -553,9 +558,13 @@ static int futex_wait(unsigned long uadd
}
if (curval != val) {
ret = -EWOULDBLOCK;
- goto out_unqueue;
+ queue_unlock(&q, bh);
+ goto out_release_sem;
}

+ /* Only actually queue if *uaddr contained val. */
+ __queue_me(&q, bh);
+
/*
* Now the futex is queued and we have checked the data, we
* don't want to hold mmap_sem while we sleep.
@@ -596,10 +605,6 @@ static int futex_wait(unsigned long uadd
* have handled it for us already. */
return -EINTR;

- out_unqueue:
- /* If we were woken (and unqueued), we succeeded, whatever. */
- if (!unqueue_me(&q))
- ret = 0;
out_release_sem:
up_read(&current->mm->mmap_sem);
return ret;

2005-03-21 02:56:13

by Jamie Lokier

Subject: Re: Futex queue_me/get_user ordering

Ingo Molnar wrote:
>
> * Jakub Jelinek <[email protected]> wrote:
>
> > The futex man pages that have been around for years (certainly since
> > mid 2002) don't document FUTEX_WAIT as a token passing
> > operation, but as an atomic operation:
> >
> > Say http://www.icewalkers.com/Linux/ManPages/futex-2.html
>
> besides this documented-behavior argument, i dont think futexes should
> be degraded into waitqueues

I give in...

Depending on atomicity makes it impossible for an application, which
is linked with NPTL and Glibc, to write an NPTL-compatible "wait on
two locks" function.

I'm not saying that's a very clean thing to want, but it's a
conceptual loss and I'm disappointed I seem to be the only one
noticing it.

On the other hand, I was mistaken to think it makes it impossible to
write an emulation of synchronous futex() in terms of asynchronous
futex().* In fact it makes it impossible to do so using the existing
FUTEX_FD, but it would be possible if there were a FUTEX_FD2 added
somewhere down the line.

* - The reason you would do this is if you were writing userspace-threading
for any reason, and you had to include an emulation of synchronous
futex() in terms of async futex because there are some libraries
which might run on top of the userspace-threading which use futex
in an application-dependent way.

> - in fact, to solve some of the known
> performance problems the opposite will have to happen: e.g. i believe
> that in the future we'll need to enable the kernel-side futex code to
> actually modify the futex variable. I.e. atomicity of the read in
> FUTEX_WAIT is an absolute must, and is only the first step.

Some of those performance problems can be solved already by better use
of FUTEX_REQUEUE instead of FUTEX_WAKE.

> [ the double-context-switch problem in cond_signal() that Jamie
> mentioned is precisely one such case: pthread semantics force us that
> the wakeup of the wakee _must_ happen while still holding the internal
> lock. So we cannot just delay the wakeup to outside the glibc critical
> section. This double context-switch could be avoided if the 'release
> internal lock and wake up wakee' operation could be done atomically
> within the kernel. (A sane default 'userspace unlock' operation on a
> machine word could be defined .. e.g. decrement-to-zero.) ]

Did you not see the solution I gave last November, using FUTEX_REQUEUE?

See:

http://lkml.org/lkml/2004/11/29/201

I spent a _lot_ of time figuring it out but everyone was too busy to
confirm that it worked. It would improve performance in a number of cases.

I hope that it does not get ignored yet again.

There _may_ be cases where more complex futex operations are needed,
but we should try the better algorithms that use the existing futex
operations before adding new ones.

-- Jamie