2001-04-26 03:40:54

by Bob McElrath

Subject: aa's rwsem-generic-6 bug? Process stuck in 'R' state.

Running 2.4.4pre4 with Andrea's rwsem-generic-6 patch, I have just
gotten a process stuck in the 'R' state. According to the ps man page
this is: "runnable (on run queue)". The 'ps aux' output is:
USER       PID %CPU %MEM   VSZ   RSS TTY      STAT START   TIME COMMAND
root      7921  0.8 26.9 91720 68608 ?        R<   00:33  11:20 /usr/X11R6/bin/X

X is niced at -10 and doesn't respond to kill or kill -9.

alpha 21164 (ev56) architecture. kernel compiled with:
gcc version 2.96 20000731 (Red Hat Linux 7.0)

Cheers,
-- Bob

Bob McElrath ([email protected])
Univ. of Wisconsin at Madison, Department of Physics



2001-04-26 04:11:41

by Andrea Arcangeli

Subject: it isn't aa's rwsem-generic-6 bug but something else [Re: aa's rwsem-generic-6 bug? Process stuck in 'R' state.]

On Wed, Apr 25, 2001 at 10:39:39PM -0500, Bob McElrath wrote:
> Running 2.4.4pre4 with Andrea's rwsem-generic-6 patch, I have just
> gotten a process stuck in the 'R' state. According to the ps man page
> this is: "runnable (on run queue)". The 'ps aux' output is:
> USER       PID %CPU %MEM   VSZ   RSS TTY      STAT START   TIME COMMAND
> root      7921  0.8 26.9 91720 68608 ?        R<   00:33  11:20 /usr/X11R6/bin/X
>
> X is niced at -10 and doesn't respond to kill or kill -9.
>
> alpha 21164 (ev56) architecture. kernel compiled with:
> gcc version 2.96 20000731 (Red Hat Linux 7.0)

The fact X is also part of the equation makes things even less obvious
(now we're not even sure it's a kernel bug).

rwsem-generic-6 is a very trivial implementation and I'm pretty sure it
is the _last_ thing that could go wrong in your equation. I mean if it
goes wrong then it's more likely to be a bug in the spinlocks or
whatever in the architectural part of the kernel than in the common code
(rwsem-generic-6 was all common code btw).

Furthermore, the X server shouldn't really be such a heavy user of the
rwsemaphores; for a start, it's not even threaded.

You can also press SYSRQ+P and get some EIP so we can see a bit more of
what's going on with the X server (assuming that CPU still receives
interrupts).

BTW, could you also try compiling with egcs 1.1.2, just in case? I
learnt the hard way that gcc 2.95.* isn't going to work well on the
alpha. I haven't tried the official 2.95.3 yet, but an older .3 from the
2_95-branch of gcc cvs definitely miscompiled all my 2.4 kernels. 2.96
with some hundreds of patches [literally] is certainly better than
2.95.* on the alpha, but egcs is definitely still worth a try.
Personally I'm using egcs 1.1.2 for the 2.[24] alpha kernels, 2.95.4
(2_95-branch of cvs) for the 2.[24] x86 kernels, and gcc 3.1 for
x86-64 ;)

Andrea

2001-04-26 05:38:34

by Bob McElrath

Subject: Re: it isn't aa's rwsem-generic-6 bug but something else [Re: aa's rwsem-generic-6 bug? Process stuck in 'R' state.]

Andrea Arcangeli [[email protected]] wrote:
> On Wed, Apr 25, 2001 at 10:39:39PM -0500, Bob McElrath wrote:
> > Running 2.4.4pre4 with Andrea's rwsem-generic-6 patch, I have just
> > gotten a process stuck in the 'R' state. According to the ps man page
> > this is: "runnable (on run queue)". The 'ps aux' output is:
> > USER       PID %CPU %MEM   VSZ   RSS TTY      STAT START   TIME COMMAND
> > root      7921  0.8 26.9 91720 68608 ?        R<   00:33  11:20 /usr/X11R6/bin/X
> >
> > X is niced at -10 and doesn't respond to kill or kill -9.
> >
> > alpha 21164 (ev56) architecture. kernel compiled with:
> > gcc version 2.96 20000731 (Red Hat Linux 7.0)
>
> The fact X is also part of the equation makes things even less obvious
> (now we're not even sure it's a kernel bug).

Tell me about it. But the fact remains that I never see these hangs
with a 2.2 kernel. I've also futzed with X quite a bit to try and track
this down, to no avail. I have tracked down some separate X bugs
though. In the next iteration I'll use the mga driver from XFree86 CVS
(which had some alpha-specific changes, I hear).

During this last hang I tried to get gdb to attach to the X process.
gdb hung after issuing 'attach 7921', and had to be killed. My naive
interpretation is that this indicates a kernel problem, and nothing to
do with X.

Egad I wish this were more reproducible. Having a hang once every 3
days sucks for debugging.

> rwsem-generic-6 is a very trivial implementation and I'm pretty sure it
> is the _last_ thing that could go wrong in your equation. I mean if it
> goes wrong then it's more likely to be a bug in the spinlocks or
> whatever in the architectural part of the kernel than in the common code
> (rwsem-generic-6 was all common code btw).
>
> Furthermore, the X server shouldn't really be such a heavy user of the
> rwsemaphores; for a start, it's not even threaded.

When I posted this bug originally, you came right out and said it was
probably the rwsemaphores. I really have no idea how the rwsemaphores
work, and don't know myself that they are even the problem. My
process-table-hang seems consistent with something having a lock on the
process table and not letting go of it. (Note in this last "hang", the
process table did not hang...that is, ps dumped the entire process list
without a burp)

> You can also press SYSRQ+P and get some EIP so we can see a bit more of
> what's going on with the X server (assuming that CPU still receives
> interrupts).

The CPU still receives interrupts, and other than this one X process,
acts normally. (even in the process-table-hang-case...as long as I
don't run ps, everything is fine) I had to reboot to get rid of the
hung X process though. (shutdown proceeded normally)

I'm running a debug X build at this point, and have identified at least
two separate bugs in the X server that were causing hangs. I've
reported these to the X people. I didn't get debug info out of the X
server after the process-table-hang because X continued to behave
normally during the process-table-hang.

> BTW, could you also try compiling with egcs 1.1.2, just in case? I
> learnt the hard way that gcc 2.95.* isn't going to work well on the
> alpha. I haven't tried the official 2.95.3 yet, but an older .3 from the
> 2_95-branch of gcc cvs definitely miscompiled all my 2.4 kernels. 2.96
> with some hundreds of patches [literally] is certainly better than
> 2.95.* on the alpha, but egcs is definitely still worth a try.
> Personally I'm using egcs 1.1.2 for the 2.[24] alpha kernels, 2.95.4
> (2_95-branch of cvs) for the 2.[24] x86 kernels, and gcc 3.1 for
> x86-64 ;)

I have been using egcs 1.1.2 (rh7 kgcc). Only this last hang was with a
2.96-compiled kernel (I forgot to change the makefile to use kgcc
instead of gcc...then figured what the hell). The rest were with egcs
1.1.2. I'll use egcs 1.1.2 in the future.

Cheers,
-- Bob

Bob McElrath ([email protected])
Univ. of Wisconsin at Madison, Department of Physics



2001-04-26 15:46:17

by Andrea Arcangeli

Subject: Re: it isn't aa's rwsem-generic-6 bug but something else [Re: aa's rwsem-generic-6 bug? Process stuck in 'R' state.]

On Thu, Apr 26, 2001 at 12:38:02AM -0500, Bob McElrath wrote:
> When I posted this bug originally, you came right out and said it was
> probably the rwsemaphores. I really have no idea how the rwsemaphores

You were talking about the ps table hang when I told you about the rwsem
races. I had the same trouble on my alpha and I reproduced the races
trivially by launching:

make MAKE='make -j2' -j2 &

while :; do ps xa ; sleep 1 ; done

After a few seconds ps deadlocked. Try that on the old asm semaphores.

It was 100% reproducible, and after I rewrote the rwsemaphores the
deadlock went away completely.
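
The reproducer above depends on someone noticing that ps has stopped
printing. A minimal sketch of the same ps loop with a watchdog around it
might look like the following. The function names run_with_deadline and
watch_ps and the 10-second deadline are assumptions for illustration,
not part of the original test. The command runs in the background and is
polled with kill -0, so the watchdog itself never waits on a process
that may be stuck inside the kernel.

```sh
#!/bin/sh
# Sketch only: function names and the deadline value are illustrative.

# run_with_deadline SECONDS CMD [ARGS...]
# Returns 0 if CMD exits within SECONDS, 1 if it is still alive afterwards.
run_with_deadline() {
    secs=$1
    shift
    "$@" >/dev/null 2>&1 &
    pid=$!
    waited=0
    # kill -0 sends no signal; it only tests whether the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$secs" ]; then
            return 1    # still there after the deadline: treat as hung
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# watch_ps ROUNDS
# Runs "ps xa" once per second, like the loop above, and stops at the
# first round in which ps does not finish before the deadline.
watch_ps() {
    rounds=$1
    i=0
    while [ "$i" -lt "$rounds" ]; do
        if ! run_with_deadline 10 ps xa; then
            echo "ps xa stalled on round $i"
            return 1
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "no stall in $rounds rounds"
}
```

With the parallel make running in the background as above, calling
"watch_ps 600" would poll for roughly ten minutes and report the first
round in which ps failed to return.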

Your X hanging in R state is completely unrelated to the rwsem ps table
hang problem as far as I can tell.

> I'm running a debug X build at this point, and have identified at least

If you can reproduce it without starting X, I will be interested in
fixing the hang. Maybe you have a graphics card with an fb driver that
doesn't need VESA and that works on the alpha; then you could run X
without privileges. (btw, in theory we could make the VESA thing work
as well using the x86 emulator in SRM, but nobody has attempted to
implement that yet)

> instead of gcc...then figured what the hell) The rest were with egcs
> 1.1.2. I'll use egcs 1.1.2 in the future.

good.

Andrea

2001-04-26 16:11:09

by Bob McElrath

Subject: Re: it isn't aa's rwsem-generic-6 bug but something else [Re: aa's rwsem-generic-6 bug? Process stuck in 'R' state.]

Andrea Arcangeli [[email protected]] wrote:
> On Thu, Apr 26, 2001 at 12:38:02AM -0500, Bob McElrath wrote:
> > When I posted this bug originally, you came right out and said it was
> > probably the rwsemaphores. I really have no idea how the rwsemaphores
>
> You were talking about the ps table hang when I told you about the rwsem
> races. I had the same trouble on my alpha and I reproduced the races
> trivially by launching:
>
> make MAKE='make -j2' -j2 &
>
> while :; do ps xa ; sleep 1 ; done
>
> After a few seconds ps deadlocked. Try that on the old asm semaphores.

This does not cause a hang on my machine with your new rwsemaphores.

> It was 100% reproducible, and after I rewrote the rwsemaphores the
> deadlock went away completely.
>
> Your X hanging in R state is completely unrelated to the rwsem ps table
> hang problem as far as I can tell.

Ok, so what are the other alternatives? In the R state, the scheduler
should give it some CPU at the first available jiffy, correct? After
several minutes it was still stuck in the R state, and had received 0
CPU time.
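
One way to put a number on "in R state but receiving no CPU time" is to
sample the process's accumulated CPU ticks twice and compare. The sketch
below assumes the /proc/PID/stat layout described in proc(5), where
field 3 is the state, field 14 is utime and field 15 is stime. The
helper names cpu_ticks, proc_state and is_starved are made up for this
example.

```sh
#!/bin/sh
# Sketch only: helper names are illustrative; field positions follow proc(5).

# cpu_ticks PID: print utime + stime (in clock ticks) for PID.
cpu_ticks() {
    stat=$(cat "/proc/$1/stat") || return 1
    # The command name (field 2) may contain spaces, so drop everything
    # up to and including the last ") " before splitting on whitespace.
    rest=${stat##*') '}
    set -- $rest
    # Now $1 is field 3 (state), ${12} is field 14 (utime),
    # and ${13} is field 15 (stime).
    echo $(( ${12} + ${13} ))
}

# proc_state PID: print the one-letter state (R, S, D, ...) for PID.
proc_state() {
    stat=$(cat "/proc/$1/stat") || return 1
    rest=${stat##*') '}
    echo "${rest%% *}"
}

# is_starved PID SECONDS
# Returns 0 if PID is in state R at the end of the interval and its CPU
# tick count did not change; 1 otherwise; 2 if PID could not be read.
is_starved() {
    pid=$1
    secs=$2
    before=$(cpu_ticks "$pid") || return 2
    sleep "$secs"
    after=$(cpu_ticks "$pid") || return 2
    state=$(proc_state "$pid") || return 2
    [ "$state" = "R" ] && [ "$after" -eq "$before" ]
}
```

For the X server above, "is_starved 7921 60" would succeed only if the
process still showed R after a minute without gaining a single tick.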

Could this be a scheduler bug?

Another thing I just noticed: watching the ps list, gcc is getting
called with -mcpu=ev56, which in turn is calling as with -mev6. Since
this is an ev56 processor, not the newer ev6, this could conceivably be
generating illegal instructions, though I haven't ever seen any kernel
illegal instruction faults.

*Sigh*
-- Bob

Bob McElrath ([email protected])
Univ. of Wisconsin at Madison, Department of Physics

