2002-02-07 23:19:35

by Alex Riesen

Subject: 2.4.18-pre8-K2: Kernel panic: CPU context corrupt

Frozen while compiling galeon (1.1.2, 778 files in ~50Mb),
also had xmms playing something (alsa-0.5.12, Ensoniq AudioPCI ES1371),
and some ssh (slow traffic, NIC Digital Equipment Corporation DECchip 21142/43).
NFS traffic (kernel automounter). XFree86 4.2.0, usb devices (mouse, for example).
Low static electricity.

It looks really bad :(
Ok, continue...

alt-sysrq-b booted, and sync also seems to have worked:

Feb 7 23:45:31 steel kernel: CPU 0: Machine Check Exception: 0000000000000004
Feb 7 23:45:31 steel kernel: Bank 4: b200000000040151
Feb 7 23:45:31 steel kernel: Kernel panic: CPU context corrupt
Feb 7 23:46:07 steel kernel: <6>SysRq : Emergency Sync
Feb 7 23:46:07 steel kernel: Syncing device 03:02 ... OK

I've pressed sysrq-s many times, at the moments sound played a second,
two or three times.

No serial console output, sorry, thought the system went stable.

Booted 2.5.4-pre1 before, recovered home reiserfs (--rebuild-tree)
from the mess it left. Rebooted in 2.4.18-pre8-K2. Got the panic.

-alex

P.S. no nasty suspicions about processor, please. No funds reserved
for a new one :)

PIII-700, ASUS CUV4X (VIA KT133), <512Mb

ver_linux:

Linux steel 2.4.18-pre8-K2 #2 Thu Feb 7 00:02:26 CET 2002 i686 unknown

Gnu C 2.95.3
Gnu make 3.79.1
binutils 2.11.2
util-linux 2.11n
mount 2.11n
modutils 2.4.12
e2fsprogs 1.23
reiserfsprogs 3.x.0j
Linux C Library 2.2.4
Dynamic linker (ldd) 2.2.4
Procps 2.0.7
Console-tools 0.3.3
Sh-utils 2.0
Modules Loaded nfs lockd sunrpc ide-cd cdrom snd-seq-midi snd-seq-midi-event snd-seq snd-card-ens1371 snd-ens1371 snd-pcm snd-timer snd-rawmidi snd-seq-device snd-ac97-codec snd-mixer snd soundcore autofs4 tulip mousedev usbmouse usb-uhci usbcore input reiserfs ext3 jbd nls_iso8859-1 nls_cp437 vfat fat



2002-02-07 23:37:15

by Dave Jones

Subject: Re: 2.4.18-pre8-K2: Kernel panic: CPU context corrupt

On Fri, Feb 08, 2002 at 12:18:31AM +0100, Alex Riesen wrote:

> Feb 7 23:45:31 steel kernel: CPU 0: Machine Check Exception: 0000000000000004
> Feb 7 23:45:31 steel kernel: Bank 4: b200000000040151
> Feb 7 23:45:31 steel kernel: Kernel panic: CPU context corrupt

Machine checks are indicative of hardware fault.
Overclocking, inadequate cooling and bad memory are the usual causes.

> P.S. no nasty suspicions about processor, please. No funds reserved
> for a new one :)

The truth hurts 8(

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-02-08 22:13:28

by Alex Riesen

Subject: Re: 2.4.18-pre8-K2: Kernel panic: CPU context corrupt

On Fri, Feb 08, 2002 at 12:36:53AM +0100, Dave Jones wrote:
> On Fri, Feb 08, 2002 at 12:18:31AM +0100, Alex Riesen wrote:
>
> > Feb 7 23:45:31 steel kernel: CPU 0: Machine Check Exception: 0000000000000004
> > Feb 7 23:45:31 steel kernel: Bank 4: b200000000040151
> > Feb 7 23:45:31 steel kernel: Kernel panic: CPU context corrupt
>
> Machine checks are indicative of hardware fault.
> Overclocking, inadequate cooling and bad memory are the usual causes.

no overclocking, memtest passed (1 pass, 1 hour), native intel cooler.
Space radiation, maybe 8)


> > P.S. no nasty suspicions about processor, please. No funds reserved
> > for a new one :)
>
> The truth hurts 8(

oh dear...

>
> --
> | Dave Jones. http://www.codemonkey.org.uk
> | SuSE Labs

2002-02-08 22:45:00

by Dieter Nützel

Subject: Re: 2.4.18-pre8-K2: Kernel panic: CPU context corrupt

On Friday, February 08, 2002 at 22:18 +0100, Alex Riesen wrote:
> On Fri, Feb 08, 2002 at 12:36:53AM +0100, Dave Jones wrote:
> > On Fri, Feb 08, 2002 at 12:18:31AM +0100, Alex Riesen wrote:
> >
> > > Feb 7 23:45:31 steel kernel: CPU 0: Machine Check Exception: 0000000000000004
> > > Feb 7 23:45:31 steel kernel: Bank 4: b200000000040151
> > > Feb 7 23:45:31 steel kernel: Kernel panic: CPU context corrupt
> >
> > Machine checks are indicative of hardware fault.
> > Overclocking, inadequate cooling and bad memory are the usual causes.
>
> no overclocking, memtest passed (1 pass, 1 hour), native intel cooler.
> Space radiation, maybe 8)

We run it over night in our lab, to be sure...

Good luck!

-Dieter
--
Dieter Nützel
Graduate Student, Computer Science

University of Hamburg
Department of Computer Science
@home: [email protected]

2002-02-10 21:30:55

by Pavel Machek

Subject: Re: 2.4.18-pre8-K2: Kernel panic: CPU context corrupt

Hi!

> > Feb 7 23:45:31 steel kernel: CPU 0: Machine Check Exception: 0000000000000004
> > Feb 7 23:45:31 steel kernel: Bank 4: b200000000040151
> > Feb 7 23:45:31 steel kernel: Kernel panic: CPU context corrupt
>
> Machine checks are indicative of hardware fault.
> Overclocking, inadequate cooling and bad memory are the usual
> causes.

Maybe you should print something like

Machine Check Exception: .... (hardware problem!)

so that we get less reports like this?
Pavel
--
(about SSSCA) "I don't say this lightly. However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa

2002-02-10 21:36:18

by Dave Jones

Subject: Re: 2.4.18-pre8-K2: Kernel panic: CPU context corrupt

On Sat, 9 Feb 2002, Pavel Machek wrote:

> > > Feb 7 23:45:31 steel kernel: CPU 0: Machine Check Exception: 0000000000000004
> > > Feb 7 23:45:31 steel kernel: Bank 4: b200000000040151
> > > Feb 7 23:45:31 steel kernel: Kernel panic: CPU context corrupt
> > Machine checks are indicative of hardware fault.
> > Overclocking, inadequate cooling and bad memory are the usual
> > causes.
> Maybe you should print something like
> Machine Check Exception: .... (hardware problem!)
> so that we get less reports like this?

When I get around to finishing the diagnosis tool, I'll add
something like "Feed to decodemca for more info".

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-02-11 00:50:27

by Alan

Subject: Re: 2.4.18-pre8-K2: Kernel panic: CPU context corrupt

> > Maybe you should print something like
> > Machine Check Exception: .... (hardware problem!)
> > so that we get less reports like this?
>
> When I get around to finishing the diagnosis tool, I'll add
> something like "Feed to decodemca for more info".

For a lot of processors the MCE values are not documented. Strangely for
once Intel are the good guys and AMD seem to be sitting on the docs.

2002-02-11 12:09:42

by Alex Riesen

Subject: Re: 2.4.18-pre8-K2: Kernel panic: CPU context corrupt

I can well understand that it is a hardware problem.
But if someone seems not to be interested in reports like this,
why dump them out? Just save what we can and hang silently,
but no reports, they're boring 8-]


What does the "Bank 4: b200000000040151" mean?
If that is memory, can anyone help to find out which slot it is?
(memtest86 hasn't found anything, btw, I doubt that counts)
-alex

P.S. if someone is going to change the message about machine check,
could you please avoid lame descriptions? Like "(hardware problem!)"?
I'm sure the majority are experienced enough to understand what the
words "Machine Check" mean.


On Sat, Feb 09, 2002 at 11:23:58PM +0100, Pavel Machek wrote:
> Hi!
>
> > > Feb 7 23:45:31 steel kernel: CPU 0: Machine Check Exception: 0000000000000004
> > > Feb 7 23:45:31 steel kernel: Bank 4: b200000000040151
> > > Feb 7 23:45:31 steel kernel: Kernel panic: CPU context corrupt
> >
> > Machine checks are indicative of hardware fault.
> > Overclocking, inadequate cooling and bad memory are the usual
> > causes.
>
> Maybe you should print something like
>
> Machine Check Exception: .... (hardware problem!)
>
> so that we get less reports like this?
> Pavel
> --
> (about SSSCA) "I don't say this lightly. However, I really think that the U.S.
> no longer is classifiable as a democracy, but rather as a plutocracy." --hpa
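
On the question above of what "Bank 4: b200000000040151" means: the sketch below splits the two logged values into fields. It assumes they are the raw MCG_STATUS and per-bank MCi_STATUS registers, laid out as in Intel's published machine-check architecture. The table and function names are made up for illustration, this is not the decodemca tool mentioned in the thread, and the model-specific bits are left undecoded.

```python
# Assumed bit layouts (Intel machine-check architecture); names are illustrative.
MCG_STATUS_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}
MCI_STATUS_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                    59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Field encodings for the "cache hierarchy" class of MCA error codes.
CACHE_REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
                 5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
CACHE_TYPE = {0: "instruction", 1: "data", 2: "generic"}
CACHE_LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def decode_flags(value, table):
    """Return the names of the flag bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_bank_status(status):
    """Split a 64-bit bank status value into flags and its two error codes."""
    return {
        "flags": decode_flags(status, MCI_STATUS_FLAGS),
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


def decode_cache_error(code):
    """Decode codes of the form 000F 0001 RRRR TTLL; return None otherwise."""
    if (code & 0xEF00) != 0x0100:
        return None
    return {
        "request": CACHE_REQUEST.get((code >> 4) & 0xF, "unknown"),
        "type": CACHE_TYPE.get((code >> 2) & 0x3, "unknown"),
        "level": CACHE_LEVEL[code & 0x3],
    }


if __name__ == "__main__":
    bank = decode_bank_status(0xB200000000040151)
    print(decode_flags(0x0000000000000004, MCG_STATUS_FLAGS))
    print(bank["flags"], hex(bank["mca_error_code"]))
    print(decode_cache_error(bank["mca_error_code"]))
```

If those layout assumptions hold for this CPU, the bank value has PCC (processor context corrupt) set, which matches the panic text, and its low 16 bits fit the cache-hierarchy pattern (instruction fetch, level 1) rather than anything that names a memory slot.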

2002-02-11 12:53:22

by Pavel Machek

Subject: Re: 2.4.18-pre8-K2: Kernel panic: CPU context corrupt

Hi!

> What does the "Bank 4: b200000000040151" mean?
> If that is memory, can anyone help to find out which slot it is?
> (memtest86 hasn't found anything, btw, I doubt that counts)
> -alex
>
> P.S. if someone is going to change the message about machine check,
> could you please avoid lame descriptions? Like "(hardware problem!)"?
> I'm sure the majority are experienced enough to understand what the
> words "Machine Check" mean.

Ugh? If you understand that it's a hardware problem, why did you bother
contacting l-k? l-k is certainly not interested in debugging hardware
problems....

...and... It is not exactly easy to see that Machine check means
hardware problem...

Pavel

> > > > Feb 7 23:45:31 steel kernel: CPU 0: Machine Check Exception: 0000000000000004
> > > > Feb 7 23:45:31 steel kernel: Bank 4: b200000000040151
> > > > Feb 7 23:45:31 steel kernel: Kernel panic: CPU context corrupt
> > >
> > > Machine checks are indicative of hardware fault.
> > > Overclocking, inadequate cooling and bad memory are the usual
> > > causes.
> >
> > Maybe you should print something like
> >
> > Machine Check Exception: .... (hardware problem!)
> >
> > so that we get less reports like this?

--
Casualties in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.