2007-09-24 21:08:28

by Helge Hafting

Subject: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

The two kernels mentioned hang occasionally.
Typically when I compile something and pass the time
by surfing the web.

A few minutes and then I notice that the mouse (and everything else in X)
stops. kbd LEDs do not react to numlock/capslock.
The only thing that still works is sysrq+B.
So far this has happened while running X, so no messages.

I have gone back to 2.6.22rc4, which seems to work.

This is a single opteron, although on a dual-slot board.


Helge Hafting


2007-09-24 21:29:23

by Thomas Gleixner

Subject: Re: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22


On Mon, 2007-09-24 at 23:08 +0200, Helge Hafting wrote:
> The two kernels mentioned hang occasionally.
> Typically when I compile something and pass the time
> by surfing the web.
>
> A few minutes and then I notice that the mouse (and everything else in X)
> stops. kbd LEDs do not react to numlock/capslock.
> The only thing that still works is sysrq+B
> So far this has happened while running X, so no messages.
>
> I have gone back to 2.6.22rc4, which seems to work.
>
> This is a single opteron, although on a dual-slot board.

Can you switch to serial console, so we can get some information out of
that box? Sysrq-B is working, so we can get info from other sysrq
functions as well.
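
The same sysrq functions can also be triggered from a shell by writing
the key to /proc/sysrq-trigger, which is a handy way to check that the
output really reaches the serial console before the box hangs. A minimal
Python sketch, assuming a kernel built with CONFIG_MAGIC_SYSRQ and root
privileges; the helper names and the short key table are illustrative
only, not a complete list:

```python
# Sketch: trigger a magic SysRq function via /proc/sysrq-trigger.
# Assumption: Linux with CONFIG_MAGIC_SYSRQ enabled, run as root.
# SYSRQ_KEYS, sysrq_command and trigger are hypothetical names.

SYSRQ_KEYS = {
    "b": "reboot immediately, without syncing or unmounting",
    "p": "dump registers and flags of the current CPU",
    "t": "dump the list of tasks and their state",
    "m": "dump current memory information",
    "s": "sync all mounted filesystems",
}


def sysrq_command(key):
    """Return the single byte to write for a key listed in SYSRQ_KEYS."""
    key = key.lower()
    if key not in SYSRQ_KEYS:
        raise ValueError("unlisted SysRq key: %r" % key)
    return key.encode("ascii")


def trigger(key, path="/proc/sysrq-trigger"):
    """Write the key to the trigger file; the kernel logs the result."""
    with open(path, "wb") as f:
        f.write(sysrq_command(key))


# Example (as root): trigger("p") should print a register dump to the
# kernel log, and thus to the serial console if one is configured.
```

Once the machine is hung, only the keyboard combination is left, so
this is mainly useful for a dry run of the console setup.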

tglx


2007-09-29 14:10:14

by Helge Hafting

Subject: Re: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

Thomas Gleixner wrote:
> On Mon, 2007-09-24 at 23:08 +0200, Helge Hafting wrote:
>
>> The two kernels mentioned hang occasionally.
>> Typically when I compile something and pass the time
>> by surfing the web.
>>
>> A few minutes and then I notice that the mouse (and everything else in X)
>> stops. kbd LEDs do not react to numlock/capslock.
>> The only thing that still works is sysrq+B
>> So far this has happened while running X, so no messages.
>>
>> I have gone back to 2.6.22rc4, which seems to work.
>>
>> This is a single opteron, although on a dual-slot board.
>>
>
> Can you switch to serial console, so we can get some information out of
> that box? Sysrq-B is working, so we can get info from other sysrq
> functions as well.
>
I didn't need the serial - it crashes during console work too.
I think a "make clean" was in progress at the time. There must be work
going on in order to crash.

This time 2.6.22rc4 died on me with a general protection fault.

I got two reports, the first one scrolled partially off screen but
the whole trace was there:

shrink_dcache_memory
shrink_slab
kswapd
autoremove_wake_function
thread_return
trace_hardirqs_on
kswapd
kswapd
kthread
child_rip
restore_args
kthread
child_rip

Then I got:
spinlock lockup on cpu #0, kswapd0/212
_raw_spin_lock
shrink_dcache_parent
shrink_dcache_parent
proc_flush_task
release_task
do_exit
die
error_exit
prune_dcache
[From here on, it continues exactly like the first report:]
shrink_dcache_memory
shrink_slab
kswapd
autoremove_wake_function
thread_return
trace_hardirqs_on
kswapd
kswapd
kthread
child_rip
restore_args
kthread
child_rip


sysrq P says:
cpu 0
pid 212 comm: kswapd0 not tainted 2.6.22-rc4 #18
RIP: __delay

I took a picture of the screen, in case the register dumps are interesting.
Wonder what this is - dcache trouble? swap trouble?
Helge Hafting

2007-09-30 15:58:18

by Thomas Gleixner

Subject: Re: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

On Sat, 29 Sep 2007, Helge Hafting wrote:
> Thomas Gleixner wrote:
> > > I have gone back to 2.6.22rc4, which seems to work.
> > >
> > > This is a single opteron, although on a dual-slot board.
> > >
> >
> > Can you switch to serial console, so we can get some information out of
> > that box? Sysrq-B is working, so we can get info from other sysrq
> > functions as well.
> >
> I didn't need the serial - it crashes during console work too.
> I think a "make clean" was in progress at the time. There must be work going
> on in order to crash.
>
> This time 2.6.22rc4 died on me with a general protection fault
>
> I got two reports, the first one scrolled partially off screen but
> the whole trace was there:

That's why I asked for a serial console. That way we can get all the
information from the reports including the register dumps ....

> Then I got:
> spinlock lockup on cpu #0, kswapd0/212

That's probably caused by the previous one.

tglx

2007-09-30 20:58:29

by Helge Hafting

Subject: Re: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

Thomas Gleixner wrote:
> On Sat, 29 Sep 2007, Helge Hafting wrote:
>
>> Thomas Gleixner wrote:
>>
>>>> I have gone back to 2.6.22rc4, which seems to work.
>>>>
>>>> This is a single opteron, although on a dual-slot board.
>>>>
>>>>
>>> Can you switch to serial console, so we can get some information out of
>>> that box? Sysrq-B is working, so we can get info from other sysrq
>>> functions as well.
>>>
>>>
>> I didn't need the serial - it crashes during console work too.
>> I think a "make clean" was in progress at the time. There must be work going
>> on in order to crash.
>>
>> This time 2.6.22rc4 died on me with a general protection fault
>>
>> I got two reports, the first one scrolled partially off screen but
>> the whole trace was there:
>>
>
> That's why I asked for a serial console. That way we can get all the
> information from the reports including the register dumps ...
>
Sure. But I can't get a cable right now. Were the registers necessary
in this case? Often, the trace turns out to be enough.

Helge Hafting

2007-09-30 21:44:25

by Andi Kleen

Subject: Re: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

Helge Hafting writes:
>
> shrink_dcache_memory

That usually means random memory corruption from somewhere -- dcache
tends to use a lot of memory and when it is corrupted anywhere these
functions tend to crash while walking the lists.

Unfortunately memory corruption is hard to track down because
the messenger is usually not the one to blame.

Perhaps enable slab debugging and see if it turns
something up. Could also be broken hardware. Does an older kernel
run stable? If yes, and if it can be reproduced, bisecting would
be good.

-Andi

2007-10-01 08:41:52

by Helge Hafting

Subject: Re: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

Andi Kleen wrote:
> Helge Hafting writes:
>
>> shrink_dcache_memory
>>
>
> That usually means random memory corruption from somewhere -- dcache
> tends to use a lot of memory and when it is corrupted anywhere these
> functions tend to crash while walking the lists.
>
> Unfortunately memory corruption is hard to track down because
> the messenger is usually not the one to blame.
>
> Perhaps enable slab debugging and see if it turns
> something up. Could also be broken hardware. Does an older kernel
> run stable? If yes, and if it can be reproduced, bisecting would
> be good.
>
2.6.18 had no problem compiling stuff without crashing.
Looks like I have some work to do then.

Helge Hafting

2007-10-05 12:12:55

by Helge Hafting

Subject: Re: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

Andi Kleen wrote:
> Helge Hafting writes:
>
>> shrink_dcache_memory
>>
>
> That usually means random memory corruption from somewhere -- dcache
> tends to use a lot of memory and when it is corrupted anywhere these
> functions tend to crash while walking the lists.
>
> Unfortunately memory corruption is hard to track down because
> the messenger is usually not the one to blame.
>
> Perhaps enable slab debugging and see if it turns
> something up. Could also be broken hardware. Does an older kernel
> run stable? If yes, and if it can be reproduced, bisecting would
> be good.
>
I attempted bisecting - and failed. The problem is that
2.6.23rc7 seems very unstable, while 2.6.22rc4 is much better
but not perfect. 2.6.22rc4 only crashed once - it can compile for
hours and swap lots and keep running. But it died at least once,
so a run without a crash does not prove that a kernel is good.
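
To put a rough number on why a sporadic crash is so awkward to bisect:
if a bad kernel crashes with probability p in one test run, it survives
n runs in a row with probability (1 - p)^n. The sketch below computes
how many clean runs are needed before marking a kernel "good". It
assumes independent runs with a fixed crash probability, and the
probabilities used are made-up examples, not measurements from this
machine:

```python
import math


def clean_runs_needed(p_crash_per_run, confidence):
    """Smallest n such that a bad kernel survives n runs in a row with
    probability at most 1 - confidence, i.e. (1 - p)^n <= 1 - confidence.

    Assumption: runs are independent with the same crash probability.
    """
    if not 0.0 < p_crash_per_run < 1.0:
        raise ValueError("p_crash_per_run must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_crash_per_run))


if __name__ == "__main__":
    # Hypothetical crash probabilities per test run.
    for p in (0.5, 0.1):
        print(p, clean_runs_needed(p, 0.95))
```

With a hypothetical 10% crash chance per run, that comes to 29 clean
runs per bisection step for 95% confidence, which quickly becomes
impractical.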

I'll try running recent kernels with more debugging instead.
I think I used SLUB instead of SLAB - either way I can switch
that over to see if it changes things.

Helge Hafting

2007-10-08 22:36:32

by Helge Hafting

Subject: Re: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

Thomas Gleixner wrote:
> On Sat, 29 Sep 2007, Helge Hafting wrote:
>
>> Thomas Gleixner wrote:
>>
>>>> I have gone back to 2.6.22rc4, which seems to work.
>>>>
>>>> This is a single opteron, although on a dual-slot board.
>>>>
>>>>
>>> Can you switch to serial console, so we can get some information out of
>>> that box? Sysrq-B is working, so we can get info from other sysrq
>>> functions as well.
>>>
>>>
>> I didn't need the serial - it crashes during console work too.
>> I think a "make clean" was in progress at the time. There must be work going
>> on in order to crash.
>>
>> This time 2.6.22rc4 died on me with a general protection fault
>>
>> I got two reports, the first one scrolled partially off screen but
>> the whole trace was there:
>>
>
> That's why I asked for a serial console. That way we can get all the
> information from the reports including the register dumps ....
>
I got another crash - with a full dump. I have also discovered
files with lots of single-bit errors, so this is probably just some kind
of hw problem. :-(
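
One way to check a damaged file for this pattern, assuming a known-good
copy of the same file is available for comparison (the function name is
made up for illustration, and this is not necessarily how the errors
were found here):

```python
def classify_diffs(good, suspect):
    """Compare two equal-length byte strings.

    Returns (differing_bytes, single_bit_bytes): how many byte positions
    differ at all, and how many of those differ in exactly one bit.
    """
    if len(good) != len(suspect):
        raise ValueError("inputs must have the same length")
    differing = 0
    single_bit = 0
    for a, b in zip(good, suspect):
        x = a ^ b                  # bits that differ in this byte
        if x:
            differing += 1
            if x & (x - 1) == 0:   # exactly one bit set
                single_bit += 1
    return differing, single_bit


def classify_files(good_path, suspect_path):
    """Same comparison for two files, read fully into memory."""
    with open(good_path, "rb") as f, open(suspect_path, "rb") as g:
        return classify_diffs(f.read(), g.read())
```

If nearly all differing bytes are single-bit flips, that fits failing
RAM or a bad data path better than a software bug, which would tend to
overwrite whole words or larger blocks.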

Replace memory or the motherboard with everything on it . . . :-(

Helge Hafting