2005-03-22 04:01:40

by Lee Revell

Subject: kernel bug: futex_wait hang

Paul Davis and Chris Morgan have been chasing down a problem with
xmms_jack and it really looks like this bug, thought to have been fixed
in 2.6.10, is the culprit.

http://www.uwsg.iu.edu/hypermail/linux/kernel/0409.0/2044.html

(for more info google "futex_wait 2.6 hang")

It's simple to reproduce. Run JACK and launch xmms with the JACK output
plugin. Close XMMS. The xmms process hangs. Strace looks like this:

rlrevell@krustophenia:~$ strace -p 7935
Process 7935 attached - interrupt to quit
futex(0xb5341bf8, FUTEX_WAIT, 7939, NULL

Just like in the above bug report, if xmms is run with
LD_ASSUME_KERNEL=2.4.19, it works perfectly.

I have reproduced the bug with 2.6.12-rc1.
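
As an illustration of what that strace line means (a sketch, not code
from xmms or JACK): FUTEX_WAIT blocks for as long as the futex word
still holds the expected value and nobody issues a FUTEX_WAKE on that
address, so a minimal program stuck the same way looks like this:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <unistd.h>

static int futex_word = 7939;   /* stands in for the tid seen in strace */

int main(void)
{
        /* Blocks indefinitely: futex_word still equals the expected
         * value and nothing ever does a FUTEX_WAKE on this address.
         * strace of this process shows the same unfinished futex(). */
        long ret = syscall(SYS_futex, &futex_word, FUTEX_WAIT, 7939, NULL);
        printf("futex returned %ld\n", ret);
        return 0;
}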

Lee


2005-03-22 04:28:21

by Andrew Morton

Subject: Re: kernel bug: futex_wait hang

Lee Revell <[email protected]> wrote:
>
> Paul Davis and Chris Morgan have been chasing down a problem with
> xmms_jack and it really looks like this bug, thought to have been fixed
> in 2.6.10, is the culprit.
>
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0409.0/2044.html
>
> (for more info google "futex_wait 2.6 hang")
>
> It's simple to reproduce. Run JACK and launch xmms with the JACK output
> plugin. Close XMMS. The xmms process hangs. Strace looks like this:
>
> rlrevell@krustophenia:~$ strace -p 7935
> Process 7935 attached - interrupt to quit
> futex(0xb5341bf8, FUTEX_WAIT, 7939, NULL
>
> Just like in the above bug report, if xmms is run with
> LD_ASSUME_KERNEL=2.4.19, it works perfectly.
>
> I have reproduced the bug with 2.6.12-rc1.
>

iirc we ended up deciding that the futex problems around that time were due
to userspace problems (a version of libc). But then, there's no discussion
around Seto's patch and it didn't get applied. So I don't know what
happened to that work - it's all a bit mysterious.

Is this a 100% repeatable hang, or is it some occasional race?

2005-03-22 04:54:07

by Jamie Lokier

Subject: Re: kernel bug: futex_wait hang

Lee Revell wrote:
> > iirc we ended up deciding that the futex problems around that time were due
> > to userspace problems (a version of libc). But then, there's no discussion
> > around Seto's patch and it didn't get applied. So I don't know what
> > happened to that work - it's all a bit mysterious.
>
> It does seem like it could be a different problem. Maybe Paul can
> provide some more evidence that it's a kernel and not a glibc/NPTL bug.
> I'm really just posting this on Paul's behalf; I don't claim to
> understand the issue. ;-)

Try applying the patch below, which was recently posted by Jakub Jelinek.

If it fixes the problem, it's the same thing as Hidetoshi Seto
noticed, although this patch is much improved thanks to
"preempt_count technology" (tm). If not, it's a whole new problem.

-- Jamie

--- linux-2.6.11/kernel/futex.c.jj 2005-03-17 04:42:29.000000000 -0500
+++ linux-2.6.11/kernel/futex.c 2005-03-18 05:45:29.000000000 -0500
@@ -97,7 +97,6 @@ struct futex_q {
*/
struct futex_hash_bucket {
spinlock_t lock;
- unsigned int nqueued;
struct list_head chain;
};

@@ -265,7 +264,6 @@ static inline int get_futex_value_locked
inc_preempt_count();
ret = __copy_from_user_inatomic(dest, from, sizeof(int));
dec_preempt_count();
- preempt_check_resched();

return ret ? -EFAULT : 0;
}
@@ -339,7 +337,6 @@ static int futex_requeue(unsigned long u
struct list_head *head1;
struct futex_q *this, *next;
int ret, drop_count = 0;
- unsigned int nqueued;

retry:
down_read(&current->mm->mmap_sem);
@@ -354,23 +351,22 @@ static int futex_requeue(unsigned long u
bh1 = hash_futex(&key1);
bh2 = hash_futex(&key2);

- nqueued = bh1->nqueued;
+ if (bh1 < bh2)
+ spin_lock(&bh1->lock);
+ spin_lock(&bh2->lock);
+ if (bh1 > bh2)
+ spin_lock(&bh1->lock);
+
if (likely(valp != NULL)) {
int curval;

- /* In order to avoid doing get_user while
- holding bh1->lock and bh2->lock, nqueued
- (monotonically increasing field) must be first
- read, then *uaddr1 fetched from userland and
- after acquiring lock nqueued field compared with
- the stored value. The smp_mb () below
- makes sure that bh1->nqueued is read from memory
- before *uaddr1. */
- smp_mb();
-
ret = get_futex_value_locked(&curval, (int __user *)uaddr1);

if (unlikely(ret)) {
+ spin_unlock(&bh1->lock);
+ if (bh1 != bh2)
+ spin_unlock(&bh2->lock);
+
/* If we would have faulted, release mmap_sem, fault
* it in and start all over again.
*/
@@ -385,21 +381,10 @@ static int futex_requeue(unsigned long u
}
if (curval != *valp) {
ret = -EAGAIN;
- goto out;
+ goto out_unlock;
}
}

- if (bh1 < bh2)
- spin_lock(&bh1->lock);
- spin_lock(&bh2->lock);
- if (bh1 > bh2)
- spin_lock(&bh1->lock);
-
- if (unlikely(nqueued != bh1->nqueued && valp != NULL)) {
- ret = -EAGAIN;
- goto out_unlock;
- }
-
head1 = &bh1->chain;
list_for_each_entry_safe(this, next, head1, list) {
if (!match_futex (&this->key, &key1))
@@ -435,13 +420,9 @@ out:
return ret;
}

-/*
- * queue_me and unqueue_me must be called as a pair, each
- * exactly once. They are called with the hashed spinlock held.
- */
-
/* The key must be already stored in q->key. */
-static void queue_me(struct futex_q *q, int fd, struct file *filp)
+static inline struct futex_hash_bucket *
+queue_lock(struct futex_q *q, int fd, struct file *filp)
{
struct futex_hash_bucket *bh;

@@ -455,11 +436,35 @@ static void queue_me(struct futex_q *q,
q->lock_ptr = &bh->lock;

spin_lock(&bh->lock);
- bh->nqueued++;
+ return bh;
+}
+
+static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *bh)
+{
list_add_tail(&q->list, &bh->chain);
spin_unlock(&bh->lock);
}

+static inline void
+queue_unlock(struct futex_q *q, struct futex_hash_bucket *bh)
+{
+ spin_unlock(&bh->lock);
+ drop_key_refs(&q->key);
+}
+
+/*
+ * queue_me and unqueue_me must be called as a pair, each
+ * exactly once. They are called with the hashed spinlock held.
+ */
+
+/* The key must be already stored in q->key. */
+static void queue_me(struct futex_q *q, int fd, struct file *filp)
+{
+ struct futex_hash_bucket *bh;
+ bh = queue_lock(q, fd, filp);
+ __queue_me(q, bh);
+}
+
/* Return 1 if we were still queued (ie. 0 means we were woken) */
static int unqueue_me(struct futex_q *q)
{
@@ -503,6 +508,7 @@ static int futex_wait(unsigned long uadd
DECLARE_WAITQUEUE(wait, current);
int ret, curval;
struct futex_q q;
+ struct futex_hash_bucket *bh;

retry:
down_read(&current->mm->mmap_sem);
@@ -511,7 +517,7 @@ static int futex_wait(unsigned long uadd
if (unlikely(ret != 0))
goto out_release_sem;

- queue_me(&q, -1, NULL);
+ bh = queue_lock(&q, -1, NULL);

/*
* Access the page AFTER the futex is queued.
@@ -537,14 +543,13 @@ static int futex_wait(unsigned long uadd
ret = get_futex_value_locked(&curval, (int __user *)uaddr);

if (unlikely(ret)) {
+ queue_unlock(&q, bh);
+
/* If we would have faulted, release mmap_sem, fault it in and
* start all over again.
*/
up_read(&current->mm->mmap_sem);

- if (!unqueue_me(&q)) /* There's a chance we got woken already */
- return 0;
-
ret = get_user(curval, (int __user *)uaddr);

if (!ret)
@@ -553,9 +558,13 @@ static int futex_wait(unsigned long uadd
}
if (curval != val) {
ret = -EWOULDBLOCK;
- goto out_unqueue;
+ queue_unlock(&q, bh);
+ goto out_release_sem;
}

+ /* Only actually queue if *uaddr contained val. */
+ __queue_me(&q, bh);
+
/*
* Now the futex is queued and we have checked the data, we
* don't want to hold mmap_sem while we sleep.
@@ -596,10 +605,6 @@ static int futex_wait(unsigned long uadd
* have handled it for us already. */
return -EINTR;

- out_unqueue:
- /* If we were woken (and unqueued), we succeeded, whatever. */
- if (!unqueue_me(&q))
- ret = 0;
out_release_sem:
up_read(&current->mm->mmap_sem);
return ret;
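
A side note on the locking change above: with the nqueued logic gone,
futex_requeue takes both hash-bucket locks up front, always in address
order, so two CPUs requeueing between the same pair of buckets in
opposite directions cannot deadlock against each other. A userspace
sketch of the same idiom (pthread mutexes standing in for the kernel
spinlocks):

#include <pthread.h>

/* Acquire two locks in a globally consistent (address) order; if both
 * pointers name the same lock, take it only once.  This mirrors the
 * bh1/bh2 ordering in the patch above. */
static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
        if (a < b)
                pthread_mutex_lock(a);
        pthread_mutex_lock(b);
        if (a > b)
                pthread_mutex_lock(a);
}

static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
        pthread_mutex_unlock(b);
        if (a != b)
                pthread_mutex_unlock(a);
}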

2005-03-22 04:54:04

by Jamie Lokier

Subject: Re: kernel bug: futex_wait hang

Andrew Morton wrote:
> iirc we ended up deciding that the futex problems around that time were due
> to userspace problems (a version of libc). But then, there's no discussion
> around Seto's patch and it didn't get applied. So I don't know what
> happened to that work - it's all a bit mysterious.

It can be fixed _either_ in Glibc, or by changing the kernel.

That problem is caused by differing assumptions between Glibc and the
kernel about subtle futex semantics. Which goes to show they are
really clever, or something.
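
To spell out the assumption in question (a paraphrase of the earlier
threads; treat the details as approximate): glibc relies on FUTEX_WAIT
atomically checking the futex word and queueing the waiter, so that the
usual wake sequence can never be lost:

        waiter (FUTEX_WAIT)                  waker
        -------------------                  -----
        1. queue on the hash chain
        2. read *uaddr                       a. *uaddr = newval
        3. if it no longer equals val,       b. futex(uaddr, FUTEX_WAKE, 1)
           unqueue and return -EWOULDBLOCK
        4. sleep until woken

If the waiter reads *uaddr before it is visible to FUTEX_WAKE (or an
equivalent hole exists, as with the old nqueued check in futex_requeue),
a wake that lands between the two steps is lost and the waiter sleeps
forever on a futex whose value has already changed.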

I provided pseudo-code for the Glibc fix, but not an actual patch,
because NPTL is quite complicated and I wanted to know whether the
Glibc people were interested; apparently they were too busy at the
time. Benchmarks would have made sense for such a patch.

Scott Snyder started fixing part of Glibc, and that did fix his
instance of this problem so we know the approach works. But a full
patch for Glibc was not prepared.

The most recent messages under "Futex queue_me/get_user ordering",
with a patch from Jakub Jelinek will fix this problem by changing the
kernel. Yes, you should apply Jakub's most recent patch, message-ID
"<[email protected]>".

I have not tested the patch, but it looks convincing.

I argued for fixing Glibc on the grounds that the changed kernel
behaviour, or more exactly having Glibc depend on it, loses a certain
semantic property useful for unusual operations on multiple futexes at
the same time. But I appear to have lost the argument, and Jakub's
latest patch does clean up some cruft quite nicely, with no expected
performance hit.

-- Jamie

2005-03-22 05:02:25

by Lee Revell

Subject: Re: kernel bug: futex_wait hang

On Mon, 2005-03-21 at 20:20 -0800, Andrew Morton wrote:
> Lee Revell <[email protected]> wrote:
> >
> > Paul Davis and Chris Morgan have been chasing down a problem with
> > xmms_jack and it really looks like this bug, thought to have been fixed
> > in 2.6.10, is the culprit.
> >
> > http://www.uwsg.iu.edu/hypermail/linux/kernel/0409.0/2044.html
> >
> > (for more info google "futex_wait 2.6 hang")
> >
> > It's simple to reproduce. Run JACK and launch xmms with the JACK output
> > plugin. Close XMMS. The xmms process hangs. Strace looks like this:
> >
> > rlrevell@krustophenia:~$ strace -p 7935
> > Process 7935 attached - interrupt to quit
> > futex(0xb5341bf8, FUTEX_WAIT, 7939, NULL
> >
> > Just like in the above bug report, if xmms is run with
> > LD_ASSUME_KERNEL=2.4.19, it works perfectly.
> >
> > I have reproduced the bug with 2.6.12-rc1.
> >
>
> iirc we ended up deciding that the futex problems around that time were due
> to userspace problems (a version of libc). But then, there's no discussion
> around Seto's patch and it didn't get applied. So I don't know what
> happened to that work - it's all a bit mysterious.
>

It does seem like it could be a different problem. Maybe Paul can
provide some more evidence that it's a kernel and not a glibc/NPTL bug.
I'm really just posting this on Paul's behalf; I don't claim to
understand the issue. ;-)

> Is this a 100% repeatable hang, or is it some occasional race?
>

100% repeatable.

Lee

2005-03-22 05:20:23

by Lee Revell

Subject: Re: kernel bug: futex_wait hang

On Tue, 2005-03-22 at 04:48 +0000, Jamie Lokier wrote:
> I argued for fixing Glibc on the grounds that the changed kernel
> behaviour, or more exactly having Glibc depend on it, loses a certain
> semantic property useful for unusual operations on multiple futexes at
> the same time. But I appear to have lost the argument, and Jakub's
> latest patch does clean up some cruft quite nicely, with no expected
> performance hit.

A glibc fix will take forever to get to users compared to a kernel fix.

Lee

2005-03-22 05:16:12

by Andrew Morton

Subject: Re: kernel bug: futex_wait hang

Jamie Lokier <[email protected]> wrote:
>
> Andrew Morton wrote:
> > iirc we ended up deciding that the futex problems around that time were due
> > to userspace problems (a version of libc). But then, there's no discussion
> > around Seto's patch and it didn't get applied. So I don't know what
> > happened to that work - it's all a bit mysterious.
>
> It can be fixed _either_ in Glibc, or by changing the kernel.
>
> That problem is caused by differing assumptions between Glibc and the
> kernel about subtle futex semantics. Which goes to show they are
> really clever, or something.
>
> I provided pseudo-code for the Glibc fix, but not an actual patch,
> because NPTL is quite complicated and I wanted to know whether the
> Glibc people were interested; apparently they were too busy at the
> time. Benchmarks would have made sense for such a patch.
>
> Scott Snyder started fixing part of Glibc, and that did fix his
> instance of this problem so we know the approach works. But a full
> patch for Glibc was not prepared.
>
> The most recent messages under "Futex queue_me/get_user ordering",
> with a patch from Jakub Jelinek will fix this problem by changing the
> kernel. Yes, you should apply Jakub's most recent patch, message-ID
> "<[email protected]>".
>
> I have not tested the patch, but it looks convincing.

OK, thanks. Lee && Paul, that's at
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.12-rc1/2.6.12-rc1-mm1/broken-out/futex-queue_me-get_user-ordering-fix.patch


> I argued for fixing Glibc on the grounds that the changed kernel
> behaviour, or more exactly having Glibc depend on it, loses a certain
> semantic property useful for unusual operations on multiple futexes at
> the same time. But I appear to have lost the argument, and Jakub's
> latest patch does clean up some cruft quite nicely, with no expected
> performance hit.

Futexes were initially so simple.

2005-03-22 05:36:49

by Lee Revell

Subject: Re: kernel bug: futex_wait hang

On Mon, 2005-03-21 at 21:08 -0800, Andrew Morton wrote:
> Jamie Lokier <[email protected]> wrote:
> >
> > The most recent messages under "Futex queue_me/get_user ordering",
> > with a patch from Jakub Jelinek will fix this problem by changing the
> > kernel. Yes, you should apply Jakub's most recent patch, message-ID
> > "<[email protected]>".
> >
> > I have not tested the patch, but it looks convincing.
>
> OK, thanks. Lee && Paul, that's at
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.12-rc1/2.6.12-rc1-mm1/broken-out/futex-queue_me-get_user-ordering-fix.patch
>

Does not fix the problem.

Lee

2005-03-22 06:36:57

by Jakub Jelinek

Subject: Re: kernel bug: futex_wait hang

On Tue, Mar 22, 2005 at 12:30:53AM -0500, Lee Revell wrote:
> On Mon, 2005-03-21 at 21:08 -0800, Andrew Morton wrote:
> > Jamie Lokier <[email protected]> wrote:
> > >
> > > The most recent messages under "Futex queue_me/get_user ordering",
> > > with a patch from Jakub Jelinek will fix this problem by changing the
> > > kernel. Yes, you should apply Jakub's most recent patch, message-ID
> > > "<[email protected]>".
> > >
> > > I have not tested the patch, but it looks convincing.
> >
> > OK, thanks. Lee && Paul, that's at
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.12-rc1/2.6.12-rc1-mm1/broken-out/futex-queue_me-get_user-ordering-fix.patch
> >
>
> Does not fix the problem.

Have you analyzed the use of mutexes/condvars in the program?
The primary suspect is a deadlock, a race of some kind, or some other
bug in the program. All of these will show up as a hang in FUTEX_WAIT.
The argument that it works with LinuxThreads doesn't count: the timing
and internals of the two threading libraries are so different that a
program bug can easily show up with one of them and not the other.
Only once you have distilled a minimal self-contained testcase that
proves the program is correct, and it has been analyzed, is it time
to talk about NPTL or kernel bugs.
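
A minimal skeleton of such a testcase might look like this (a sketch
that mimics the cancel-a-thread-blocked-in-poll pattern described in
this thread, not actual xmms-jack code); if a self-contained program
like it hangs in pthread_join, NPTL or the kernel is implicated:

#include <pthread.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

static void *blocker(void *arg)
{
        int *fds = arg;
        struct pollfd pfd = { .fd = fds[0], .events = POLLIN };

        /* Block forever; poll() is a cancellation point, so
         * pthread_cancel should be able to terminate us here. */
        poll(&pfd, 1, -1);
        return NULL;
}

int main(void)
{
        pthread_t t;
        int fds[2];

        pipe(fds);                      /* fd nobody will ever write to */
        pthread_create(&t, NULL, blocker, fds);
        sleep(1);                       /* let the thread reach poll() */
        pthread_cancel(t);              /* NPTL: SIGCANCEL via tgkill */
        pthread_join(t, NULL);          /* the reported hang is here */
        puts("joined");
        return 0;
}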

Jakub

2005-03-22 15:39:28

by Jamie Lokier

Subject: Re: kernel bug: futex_wait hang

Lee Revell wrote:
> On Tue, 2005-03-22 at 04:48 +0000, Jamie Lokier wrote:
> > I argued for fixing Glibc on the grounds that the changed kernel
> > behaviour, or more exactly having Glibc depend on it, loses a certain
> > semantic property useful for unusual operations on multiple futexes at
> > the same time. But I appear to have lost the argument, and Jakub's
> > latest patch does clean up some cruft quite nicely, with no expected
> > performance hit.
>
> A glibc fix will take forever to get to users compared to a kernel fix.

Interesting perspective. On my systems Glibc is upgraded more often
than the kernel.

-- Jamie

2005-03-22 23:20:50

by Lee Revell

Subject: Re: kernel bug: futex_wait hang

On Tue, 2005-03-22 at 15:30 +0000, Jamie Lokier wrote:
> Lee Revell wrote:
> > On Tue, 2005-03-22 at 04:48 +0000, Jamie Lokier wrote:
> > > I argued for fixing Glibc on the grounds that the changed kernel
> > > behaviour, or more exactly having Glibc depend on it, loses a certain
> > > semantic property useful for unusual operations on multiple futexes at
> > > the same time. But I appear to have lost the argument, and Jakub's
> > > latest patch does clean up some cruft quite nicely, with no expected
> > > performance hit.
> >
> > A glibc fix will take forever to get to users compared to a kernel fix.
>
> Interesting perspective. On my systems Glibc is upgraded more often
> than the kernel.
>

Blame the Debian maintainers. This bug, reported August 2004, is still
unfixed even in unstable!!!

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=266507

Apparently they think marking a bug "fixed upstream" does something to
solve the problem.

Lee

2005-03-23 00:00:48

by Lee Revell

Subject: Re: kernel bug: futex_wait hang

On Tue, 2005-03-22 at 01:34 -0500, Jakub Jelinek wrote:
> On Tue, Mar 22, 2005 at 12:30:53AM -0500, Lee Revell wrote:
> > On Mon, 2005-03-21 at 21:08 -0800, Andrew Morton wrote:
> > > Jamie Lokier <[email protected]> wrote:
> > > >
> > > > The most recent messages under "Futex queue_me/get_user ordering",
> > > > with a patch from Jakub Jelinek will fix this problem by changing the
> > > > kernel. Yes, you should apply Jakub's most recent patch, message-ID
> > > > "<[email protected]>".
> > > >
> > > > I have not tested the patch, but it looks convincing.
> > >
> > > OK, thanks. Lee && Paul, that's at
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.12-rc1/2.6.12-rc1-mm1/broken-out/futex-queue_me-get_user-ordering-fix.patch
> > >
> >
> > Does not fix the problem.
>
> Have you analyzed the use of mutexes/condvars in the program?
> The primary suspect is a deadlock, a race of some kind, or some other
> bug in the program. All of these will show up as a hang in FUTEX_WAIT.
> The argument that it works with LinuxThreads doesn't count: the timing
> and internals of the two threading libraries are so different that a
> program bug can easily show up with one of them and not the other.
> Only once you have distilled a minimal self-contained testcase that
> proves the program is correct, and it has been analyzed, is it time
> to talk about NPTL or kernel bugs.

Paul is on vacation for a week so I suspect this will have to wait for
his return. But he's been right about similar issues in the past so I'm
inclined to believe him.

In the meantime if anyone cares to investigate, the problem is trivial
to reproduce. All you need is JACK, XMMS, xmms-jack and any 2.6 kernel.

Lee

2005-03-23 13:13:22

by Paul Davis

Subject: Re: kernel bug: futex_wait hang

> Paul is on vacation for a week so I suspect this will have to wait for
> his return. But he's been right about similar issues in the past so I'm
> inclined to believe him.
>
> In the meantime if anyone cares to investigate, the problem is trivial
> to reproduce. All you need is JACK, XMMS, xmms-jack and any 2.6 kernel.

fortunately (or perhaps not), by the powers of webmail, here i am to just
mention a couple more details for anyone who cares to think about this.

the hang occurs during an attempted thread cancel+join. we know from
strace that one thread calls tgkill() on the other. the other thread is
blocked in a poll call on a FIFO. after tgkill, the first thread enters a
futex wait, apparently waiting for the thread ID of the cancelled thread
to appear at some location (just a guess based on the info from strace).
the wait never returns, and so the first thread ends up hung in
pthread_join(). there are no user-defined mutexes or condvars involved.

what is rather odd is that when we have checked for the persistence (or
otherwise) of the cancelled thread, it appears as if several other threads
have died, not just the cancelled one. the primary tester (chris
morgan, the author of xmms-jack) was unclear whether this was to be
expected, and i cannot say definitively whether we know for certain
that the cancelled thread no longer exists.

as lee mentioned, i am on vacation right now, using some xp machine at a
relative's house, so i can't test anything till i return.

2005-03-23 13:45:59

by Jakub Jelinek

Subject: Re: kernel bug: futex_wait hang

On Wed, Mar 23, 2005 at 05:12:59AM -0800, [email protected] wrote:
> the hang occurs during an attempted thread cancel+join. we know from
> strace that one thread calls tgkill() on the other. the other thread is
> blocked in a poll call on a FIFO. after tgkill, the first thread enters a
> futex wait, apparently waiting for the thread ID of the cancelled thread
> to appear at some location (just a guess based on the info from strace).
> the wait never returns, and so the first thread ends up hung in
> pthread_join(). there are no user-defined mutexes or condvars involved.

If the thread that is to be cancelled is in the async cancel state (it
should be when waiting in a poll, provided cancellation is not disabled
in that thread), then pthread_cancel sends a SIGCANCEL signal to it via
tgkill. If tgkill succeeds (and thus pthread_cancel succeeds too) and
you call pthread_join on it, then in the likely case that the thread is
still alive, pthread_join will FUTEX_WAIT on pd->tid, waiting until the
thread dies. NPTL threads are created with CLONE_CHILD_CLEARTID
pointing at &self->tid, so this futex will be FUTEX_WAKEd by mm_release
in the kernel whenever the thread exits (or dies in some other way).
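
In other words, the joining thread is doing something like the
following (a simplified sketch of the idea, not the actual glibc
source), relying on the kernel's CLONE_CHILD_CLEARTID handling to
clear the tid word and wake it up:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Wait until the kernel clears *ptid (mm_release does the clear plus
 * a FUTEX_WAKE when the thread dies).  FUTEX_WAIT returns immediately
 * if *ptid no longer equals tid, so an already-dead thread is never
 * waited for. */
static void wait_for_thread_exit(volatile int *ptid)
{
        int tid;

        while ((tid = *ptid) != 0)
                syscall(SYS_futex, ptid, FUTEX_WAIT, tid, NULL);
}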

So, if pthread_join waits for the thread forever, the thread must still
be around (otherwise pthread_join would not block on it; well, there
could be memory corruption in the program, and then anything would be
possible). This would mean either that the poll has not been woken by
the SIGCANCEL signal, or e.g. that one of the registered cleanup
handlers (or C++ destructors) in the thread being cancelled got stuck
for whatever reason (deadlock, etc.).

Jakub