LinuxLists.cc - futex strangeness in 2.6.23-mm1/UML

2007-10-22 17:29:41

Subject: futex strangeness in 2.6.23-mm1/UML

I'm getting a process stuck in pthread_rwlock_wrlock(), even though it
looks like the lock is not held by anybody.

I think the last -mm was OK. Any ideas?

If not, I'll go searching for the offending patch.

Thanks,
Miklos

2007-10-22 18:54:20

by Rik van Riel

[permalink] [raw]

Subject: Re: futex strangeness in 2.6.23-mm1/UML

On Mon, 22 Oct 2007 19:29:07 +0200
Miklos Szeredi <[email protected]> wrote:

> I'm getting a process stuck in pthread_rwlock_wrlock(), even though it
> looks like the lock is not held by anybody.
>
> I think the last -mm was OK. Any ideas?
>
> If not, I'll go searching for the offending patch.

I wonder if that's the same bug that's breaking autofs for me.

Oct 22 14:39:01 kenny automount[2299]: cache_readlock: mapent cache
rwlock lock failed
Oct 22 14:39:01 kenny automount[2299]: unexpected pthreads error: 11 at
65 in cache.c

I'm bisecting 2.6.23-mm1 today to find the problem patch, with
some luck I'll have it this afternoon.

With the series applied up to whitespace-fixes-task-exit-handling.patch
things work.

It breaks before kswapd-should-only-wait-on-io-if-there-is-io.patch

That leaves only about 60-80 patches to look at :)

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2007-10-22 20:49:25

by Miklos Szeredi

[permalink] [raw]

Subject: Re: futex strangeness in 2.6.23-mm1/UML

> > I'm getting a process stuck in pthread_rwlock_wrlock(), even though it
> > looks like the lock is not held by anybody.
> >
> > I think the last -mm was OK. Any ideas?
> >
> > If not, I'll go searching for the offending patch.
>
> I wonder if that's the same bug that's breaking autofs for me.

Probably.

> Oct 22 14:39:01 kenny automount[2299]: cache_readlock: mapent cache
> rwlock lock failed
> Oct 22 14:39:01 kenny automount[2299]: unexpected pthreads error: 11 at
> 65 in cache.c
>
> I'm bisecting 2.6.23-mm1 today to find the problem patch, with
> some luck I'll have it this afternoon.
>
> With the series applied up to whitespace-fixes-task-exit-handling.patch
> things work.
>
> It breaks before kswapd-should-only-wait-on-io-if-there-is-io.patch
>
> That leaves only about 60-80 patches to look at :)

OK, the first patch that breaks something for me is:

pid-namespaces-move-alloc_pid-lower-in-copy_process.patch

Miklos

2007-10-22 21:29:45

by Rik van Riel

[permalink] [raw]

Subject: Re: futex strangeness in 2.6.23-mm1/UML

On Mon, 22 Oct 2007 22:48:51 +0200
Miklos Szeredi <[email protected]> wrote:

> > > I'm getting a process stuck in pthread_rwlock_wrlock(), even
> > > though it looks like the lock is not held by anybody.
> > >
> > > I think the last -mm was OK. Any ideas?
> > >
> > > If not, I'll go searching for the offending patch.
> >
> > I wonder if that's the same bug that's breaking autofs for me.
>
> Probably.
>
> > Oct 22 14:39:01 kenny automount[2299]: cache_readlock: mapent cache
> > rwlock lock failed
> > Oct 22 14:39:01 kenny automount[2299]: unexpected pthreads error:
> > 11 at 65 in cache.c
> >
> > I'm bisecting 2.6.23-mm1 today to find the problem patch, with
> > some luck I'll have it this afternoon.
> >
> > With the series applied up to
> > whitespace-fixes-task-exit-handling.patch things work.
> >
> > It breaks before kswapd-should-only-wait-on-io-if-there-is-io.patch
> >
> > That leaves only about 60-80 patches to look at :)
>
> OK, the first patch that breaks something for me is:
>
> pid-namespaces-move-alloc_pid-lower-in-copy_process.patch

Confirmed. That same patch is the point where the bisect
here starts breaking autofs.

Ian, it looks like the autofs code is innocent :)

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2007-10-23 00:08:03

by Rik van Riel

[permalink] [raw]

Subject: Re: futex strangeness in 2.6.23-mm1/UML

On Mon, 22 Oct 2007 17:29:26 -0400
Rik van Riel <[email protected]> wrote:

> On Mon, 22 Oct 2007 22:48:51 +0200
> Miklos Szeredi <[email protected]> wrote:
>
> > > > I'm getting a process stuck in pthread_rwlock_wrlock(), even
> > > > though it looks like the lock is not held by anybody.
> > > >
> > > > I think the last -mm was OK. Any ideas?
> > > >
> > > > If not, I'll go searching for the offending patch.
> > >
> > > I wonder if that's the same bug that's breaking autofs for me.
> >
> > Probably.
> >
> > > Oct 22 14:39:01 kenny automount[2299]: cache_readlock: mapent
> > > cache rwlock lock failed
> > > Oct 22 14:39:01 kenny automount[2299]: unexpected pthreads error:
> > > 11 at 65 in cache.c
> > >
> > > I'm bisecting 2.6.23-mm1 today to find the problem patch, with
> > > some luck I'll have it this afternoon.
> > >
> > > With the series applied up to
> > > whitespace-fixes-task-exit-handling.patch things work.
> > >
> > > It breaks before
> > > kswapd-should-only-wait-on-io-if-there-is-io.patch
> > >
> > > That leaves only about 60-80 patches to look at :)
> >
> > OK, the first patch that breaks something for me is:
> >
> > pid-namespaces-move-alloc_pid-lower-in-copy_process.patch
>
> Confirmed. That same patch is the point where the bisect
> here starts breaking autofs.

Oww man. Getting into heisenbug territory now :(

I bisected down to the point where I had these two patches between
good and bad:

# GOOD
pid-namespaces-move-alloc_pid-lower-in-copy_process.patch
pid-namespaces-make-proc-have-multiple-superblocks-one-for-each-namespace.patch
# BAD

Applying them one by one and rebuilding the kernel after each one
gave me a working kernel, though!

Bisecting them from the top (quilt pop) resulted in a broken kernel,
bisecting from the bottom (quilt push) results in a working one.

I have no idea what is going on any more...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2007-10-23 00:17:22

by Andrew Morton

[permalink] [raw]

Subject: Re: futex strangeness in 2.6.23-mm1/UML

On Mon, 22 Oct 2007 20:07:42 -0400
Rik van Riel <[email protected]> wrote:

> On Mon, 22 Oct 2007 17:29:26 -0400
> Rik van Riel <[email protected]> wrote:
>
> > On Mon, 22 Oct 2007 22:48:51 +0200
> > Miklos Szeredi <[email protected]> wrote:
> >
> > > > > I'm getting a process stuck in pthread_rwlock_wrlock(), even
> > > > > though it looks like the lock is not held by anybody.
> > > > >
> > > > > I think the last -mm was OK. Any ideas?
> > > > >
> > > > > If not, I'll go searching for the offending patch.
> > > >
> > > > I wonder if that's the same bug that's breaking autofs for me.
> > >
> > > Probably.
> > >
> > > > Oct 22 14:39:01 kenny automount[2299]: cache_readlock: mapent
> > > > cache rwlock lock failed
> > > > Oct 22 14:39:01 kenny automount[2299]: unexpected pthreads error:
> > > > 11 at 65 in cache.c
> > > >
> > > > I'm bisecting 2.6.23-mm1 today to find the problem patch, with
> > > > some luck I'll have it this afternoon.
> > > >
> > > > With the series applied up to
> > > > whitespace-fixes-task-exit-handling.patch things work.
> > > >
> > > > It breaks before
> > > > kswapd-should-only-wait-on-io-if-there-is-io.patch
> > > >
> > > > That leaves only about 60-80 patches to look at :)
> > >
> > > OK, the first patch that breaks something for me is:
> > >
> > > pid-namespaces-move-alloc_pid-lower-in-copy_process.patch
> >
> > Confirmed. That same patch is the point where the bisect
> > here starts breaking autofs.
>
> Oww man. Getting into heisenbug territory now :(
>
> I bisected down to the point where I had these two patches between
> good and bad:
>
> # GOOD
> pid-namespaces-move-alloc_pid-lower-in-copy_process.patch
> pid-namespaces-make-proc-have-multiple-superblocks-one-for-each-namespace.patch
> # BAD
>
> Applying them one by one and rebuilding the kernel after each one
> gave me a working kernel, though!
>
> Bisecting them from the top (quilt pop) resulted in a broken kernel,
> bisecting from the bottom (quilt push) results in a working one.
>
> I have no idea what is going on any more...
>

I guess we can debug it in the old-fashioned ways. The first of which is
to palm the problem off on Pavel ;)

I don't recall seeing a simple step-by-step way by which others can
reproduce this?

2007-10-23 01:12:00

by Rik van Riel

[permalink] [raw]

Subject: Re: futex strangeness in 2.6.23-mm1/UML

On Mon, 22 Oct 2007 17:16:24 -0700
Andrew Morton <[email protected]> wrote:

> > > > > Oct 22 14:39:01 kenny automount[2299]: cache_readlock: mapent
> > > > > cache rwlock lock failed
> > > > > Oct 22 14:39:01 kenny automount[2299]: unexpected pthreads
> > > > > error: 11 at 65 in cache.c

> I guess we can debug it in the old-fashioned ways. The first of
> which is to palm the problem off on Pavel ;)
>
> I don't recall seeing a simple step-by-step way by which others can
> reproduce this?

I have my systems set up to automount my home directory over
NFS when I log in. When trying to log in to the system with
2.6.23-mm1, I get the messages from above in my syslog, and
the NFS filesystem is not automounted.

I am thinking something in autofs or the pid-namespace* patches
does not match up and uses a wrong PID number or process pointer
when trying to lock things.

Which code is at fault I have no idea...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2007-10-23 01:38:30

by Rik van Riel

[permalink] [raw]

Subject: Re: futex strangeness in 2.6.23-mm1/UML

On Mon, 22 Oct 2007 21:11:43 -0400
Rik van Riel <[email protected]> wrote:

> I have my systems set up to automount my home directory over
> NFS when I log in. When trying to log in to the system with
> 2.6.23-mm1, I get the messages from above in my syslog, and
> the NFS filesystem is not automounted.
>
> I am thinking something in autofs or the pid-namespace* patches
> does not match up and uses a wrong PID number or process pointer
> when trying to lock things.
>
> Which code is at fault I have no idea...

I still do not know for sure, but I have found some code that
puzzles me.

At fork() time, task->pid is set to task->pid->numbers[0].nr

On the other hand, fork() returns to the parent process:

task->pid[PIDTYPE_PID]->numbers[task->pid->level].nr

I have not unravelled the code enough yet to be sure whether
this is always the same number, but having a wrong PID number
somewhere could certainly explain these autofs4 errors:

Oct 22 14:39:01 kenny automount[2299]: cache_readlock: mapent cache
rwlock lock failed
Oct 22 14:39:01 kenny automount[2299]: unexpected pthreads error: 11 at
65 in cache.c

After all, autofs4 puts various kinds of PID information of the
daemon into the autofs4 waitqueue:

fs/autofs4/waitq.c:296:
wq->uid = current->uid;
wq->gid = current->gid;
wq->pid = current->pid;
wq->tgid = current->tgid;

Could this be related?

Wrt. the UML failures that Miklos is seeing, I imagine UML needs
to do some similar tricks.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2007-10-23 02:54:28

by Ian Kent

[permalink] [raw]

Subject: Re: futex strangeness in 2.6.23-mm1/UML

On Mon, 2007-10-22 at 21:38 -0400, Rik van Riel wrote:
> On Mon, 22 Oct 2007 21:11:43 -0400
> Rik van Riel <[email protected]> wrote:
>
> > I have my systems set up to automount my home directory over
> > NFS when I log in. When trying to log in to the system with
> > 2.6.23-mm1, I get the messages from above in my syslog, and
> > the NFS filesystem is not automounted.
> >
> > I am thinking something in autofs or the pid-namespace* patches
> > does not match up and uses a wrong PID number or process pointer
> > when trying to lock things.
> >
> > Which code is at fault I have no idea...
>
> I still do not know for sure, but I have found some code that
> puzzles me.
>
> At fork() time, task->pid is set to task->pid->numbers[0].nr
>
> On the other hand, fork() returns to the parent process:
>
> task->pid[PIDTYPE_PID]->numbers[task->pid->level].nr
>
> I have not unravelled the code enough yet to be sure whether
> this is always the same number, but having a wrong PID number
> somewhere could certainly explain these autofs4 errors:
>
> Oct 22 14:39:01 kenny automount[2299]: cache_readlock: mapent cache
> rwlock lock failed
> Oct 22 14:39:01 kenny automount[2299]: unexpected pthreads error: 11 at
> 65 in cache.c
>
> After all, autofs4 puts various kinds of PID information of the
> daemon into the autofs4 waitqueue:
>
> fs/autofs4/waitq.c:296:
> wq->uid = current->uid;
> wq->gid = current->gid;
> wq->pid = current->pid;
> wq->tgid = current->tgid;
>
> Could this be related?

Probably not wrt the pthreads lock fail.
The value passed to the daemon are used for information purposes.

The pid is used only to log who is asking for the mount.
The uid and gid are used to find user name and group name and the four
bits of user information are used to set values in the environment for
use as macros within the map.

If the values in the wait request are wrong then we probably wouldn't
even know as few people will be even aware they are available.

Ian

2007-10-23 03:42:11

by Jeff Dike

[permalink] [raw]

Subject: Re: futex strangeness in 2.6.23-mm1/UML

On Mon, Oct 22, 2007 at 09:38:12PM -0400, Rik van Riel wrote:
> Wrt. the UML failures that Miklos is seeing, I imagine UML needs
> to do some similar tricks.

UML is just an architecture - it has no need of such tricks (although
it does need to keep track of host pids, but these are in architecture
data, invisible to the rest of the kernel).

Jeff

--
Work email - jdike at linux dot intel dot com

2007-10-23 07:37:17

by Miklos Szeredi

[permalink] [raw]

Subject: Re: futex strangeness in 2.6.23-mm1/UML

> I guess we can debug it in the old-fashioned ways. The first of which is
> to palm the problem off on Pavel ;)
>
> I don't recall seeing a simple step-by-step way by which others can
> reproduce this?

My method is this:

- enable CONFIG_FUSE_FS
- compile fuse from CVS:
cvs -d:pserver:[email protected]:/cvsroot/fuse co -P fuse
cd fuse
./makeconf.sh
./configure
make
make -C test

- in one shell:
mkdir /tmp/fuse
example/fusexmp_fh -d /tmp/fuse

- in another shell:
mkdir /tmp/test
test/test /tmp/fuse/tmp/test

The result is normally a completed run with "1 tests failed". On
2.6.23-mm1, it just hangs after the first test.

I've only tried under UML, but since Rick is seeing it on real HW, I
guess this method may work there as well.

Miklos

2007-10-23 08:55:33

by Pavel Emelyanov

[permalink] [raw]

Subject: Re: futex strangeness in 2.6.23-mm1/UML

Miklos Szeredi wrote:
>> I guess we can debug it in the old-fashioned ways. The first of which is
>> to palm the problem off on Pavel ;)

I look at the 2.6.23-mm1 and see that there's one hunk lost. This
is the one Oleg re-sent some days ago (the mail thread subject was
2.6.23-mm1 thread exit_group issue). Here it is (sent 13 oct 2007):

--- kernel/fork.c~ 2007-10-13 15:41:35.000000000 +0400
+++ kernel/fork.c 2007-10-13 15:41:41.000000000 +0400
@@ -1443,6 +1443,9 @@ long do_fork(unsigned long clone_flags,
task_pid_nr_ns(p, current->nsproxy->pid_ns) :
task_pid_vnr(p);

+ if (clone_flags & CLONE_PARENT_SETTID)
+ put_user(nr, parent_tidptr);
+
if (clone_flags & CLONE_VFORK) {
p->vfork_done = &vfork;
init_completion(&vfork);

Please, try with this patch.

>> I don't recall seeing a simple step-by-step way by which others can
>> reproduce this?
>
> My method is this:
>
> - enable CONFIG_FUSE_FS
> - compile fuse from CVS:
> cvs -d:pserver:[email protected]:/cvsroot/fuse co -P fuse
> cd fuse
> ./makeconf.sh
> ./configure
> make
> make -C test
>
> - in one shell:
> mkdir /tmp/fuse
> example/fusexmp_fh -d /tmp/fuse
>
> - in another shell:
> mkdir /tmp/test
> test/test /tmp/fuse/tmp/test
>
> The result is normally a completed run with "1 tests failed". On
> 2.6.23-mm1, it just hangs after the first test.
>
> I've only tried under UML, but since Rick is seeing it on real HW, I
> guess this method may work there as well.
>
> Miklos
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2007-10-23 10:24:05

by Miklos Szeredi

[permalink] [raw]

Subject: Re: futex strangeness in 2.6.23-mm1/UML

> >> I guess we can debug it in the old-fashioned ways. The first of which is
> >> to palm the problem off on Pavel ;)
>
> I look at the 2.6.23-mm1 and see that there's one hunk lost. This
> is the one Oleg re-sent some days ago (the mail thread subject was
> 2.6.23-mm1 thread exit_group issue). Here it is (sent 13 oct 2007):
>
> --- kernel/fork.c~ 2007-10-13 15:41:35.000000000 +0400
> +++ kernel/fork.c 2007-10-13 15:41:41.000000000 +0400
> @@ -1443,6 +1443,9 @@ long do_fork(unsigned long clone_flags,
> task_pid_nr_ns(p, current->nsproxy->pid_ns) :
> task_pid_vnr(p);
>
> + if (clone_flags & CLONE_PARENT_SETTID)
> + put_user(nr, parent_tidptr);
> +
> if (clone_flags & CLONE_VFORK) {
> p->vfork_done = &vfork;
> init_completion(&vfork);
>
> Please, try with this patch.

Thanks. This patch does indeed seem to solve the problem.

Miklos

2007-10-23 13:52:27

by Rik van Riel

[permalink] [raw]

Subject: Re: futex strangeness in 2.6.23-mm1/UML

On Tue, 23 Oct 2007 12:55:44 +0400
Pavel Emelyanov <[email protected]> wrote:

> I look at the 2.6.23-mm1 and see that there's one hunk lost. This
> is the one Oleg re-sent some days ago (the mail thread subject was
> 2.6.23-mm1 thread exit_group issue). Here it is (sent 13 oct 2007):
>
> --- kernel/fork.c~ 2007-10-13 15:41:35.000000000 +0400
> +++ kernel/fork.c 2007-10-13 15:41:41.000000000 +0400
> @@ -1443,6 +1443,9 @@ long do_fork(unsigned long clone_flags,
> task_pid_nr_ns(p, current->nsproxy->pid_ns) :
> task_pid_vnr(p);
>
> + if (clone_flags & CLONE_PARENT_SETTID)
> + put_user(nr, parent_tidptr);
> +
> if (clone_flags & CLONE_VFORK) {
> p->vfork_done = &vfork;
> init_completion(&vfork);
>
> Please, try with this patch.

The patch above fixes the issue for me.

Autofs works again as expected.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2007-10-23 16:02:33

by Andrew Morton

[permalink] [raw]

Subject: Re: futex strangeness in 2.6.23-mm1/UML

On Tue, 23 Oct 2007 09:52:08 -0400 Rik van Riel <[email protected]> wrote:

> On Tue, 23 Oct 2007 12:55:44 +0400
> Pavel Emelyanov <[email protected]> wrote:
>
> > I look at the 2.6.23-mm1 and see that there's one hunk lost. This
> > is the one Oleg re-sent some days ago (the mail thread subject was
> > 2.6.23-mm1 thread exit_group issue). Here it is (sent 13 oct 2007):
> >
> > --- kernel/fork.c~ 2007-10-13 15:41:35.000000000 +0400
> > +++ kernel/fork.c 2007-10-13 15:41:41.000000000 +0400
> > @@ -1443,6 +1443,9 @@ long do_fork(unsigned long clone_flags,
> > task_pid_nr_ns(p, current->nsproxy->pid_ns) :
> > task_pid_vnr(p);
> >
> > + if (clone_flags & CLONE_PARENT_SETTID)
> > + put_user(nr, parent_tidptr);
> > +
> > if (clone_flags & CLONE_VFORK) {
> > p->vfork_done = &vfork;
> > init_completion(&vfork);
> >
> > Please, try with this patch.
>
> The patch above fixes the issue for me.
>
> Autofs works again as expected.

argh. Sorry guys, I'd completely forgotten that that bug was present in
-mm1 and I didn't upload the fix into hot-fixes/. Bad me.