2002-12-26 01:45:00

by Josh Brooks

Subject: CPU failures ... or something else ?


Hello,

I have a dual p3 866 running 2.4 kernel that is crashing once every few
days leaving this on the console:


Message from syslogd@localhost at Tue Dec 24 11:30:31 2002 ...
localhost kernel: CPU 1: Machine Check Exception: 0000000000000004

Message from syslogd@localhost at Tue Dec 24 11:30:32 2002 ...
localhost kernel: Bank 4: b200000000040151

Message from syslogd@localhost at Tue Dec 24 11:30:32 2002 ...
localhost kernel: Kernel panic: CPU context corrupt



Word on the street is that this indicates hardware failure of some kind
(cpu, bus, or memory). My main question is, is that very surely the
culprit, or is it also possible that all of the hardware is perfect and
that a bug in the kernel code or some outside influence (remote exploit)
is causing this crash ?

Basically, I am ordering all new hardware to swap out, and I just want to
know if there is some remote possibility that my hardware is actually just
fine and this is some kind of software error ?

ALSO, I have not been physically at the console when this has happened,
and have not tried this yet, but whatever that thing is where you press
ctrl-alt-printscreen and get to enter those post-crash commands - do you
think that would work in this situation, or does the above error hard lock
the system so you can't do those emergency measures ?

thanks!



2002-12-26 01:51:05

by Bubba

Subject: Re: CPU failures ... or something else ?

Try turning off Machine Check Exception support in the kernel, as it is just
buggy on some machines (not necessarily a bug in the kernel). Or, without
recompiling, use the kernel param "nomce".

On Wednesday 25 December 2002 19:53, Josh Brooks wrote:
> Hello,
>
> I have a dual p3 866 running 2.4 kernel that is crashing once every few
> days leaving this on the console:
>
>
> Message from syslogd@localhost at Tue Dec 24 11:30:31 2002 ...
> localhost kernel: CPU 1: Machine Check Exception: 0000000000000004
>
> Message from syslogd@localhost at Tue Dec 24 11:30:32 2002 ...
> localhost kernel: Bank 4: b200000000040151
>
> Message from syslogd@localhost at Tue Dec 24 11:30:32 2002 ...
> localhost kernel: Kernel panic: CPU context corrupt
>
>
>
> Word on the street is that this indicates hardware failure of some kind
> (cpu, bus, or memory). My main question is, is that very surely the
> culprit, or is it also possible that all of the hardware is perfect and
> that a bug in the kernel code or some outside influence (remote exploit)
> is causing this crash ?
>
> Basically, I am ordering all new hardware to swap out, and I just want to
> know if there is some remote possibility that my hardware is actually just
> fine and this is some kind of software error ?
>
> ALSO, I have not been physically at the console when this has happened,
> and have not tried this yet, but whatever that thing is where you press
> ctrl-alt-printscreen and get to enter those post-crash commands - do you
> think that would work in this situation, or does the above error hard lock
> the system so you can't do those emergency measures ?
>
> thanks!
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2002-12-26 02:57:01

by Josh Brooks

Subject: Re: CPU failures ... or something else ?


So you are saying, that yes, it _is_ possible that my equipment is not
faulty in any way ?

thanks!

On Wed, 25 Dec 2002, Bubba wrote:

> try turning off the Machine Check Exception in the kernel as it is just buggy
> on some machines, not necessarily a bug in the kernel, or without
> recompiling, use the kernel param "nomce"

2002-12-26 03:00:49

by Josh Brooks

Subject: Re: CPU failures ... or something else ?


Oh and by the way, this is a dell poweredge 2450, dual 866 p3 cpus, 2gigs
ram, and using a PERC 3/D. I have a 2.4.1 system running on _identical_
hardware with no problems, and this system that is MCE'ing is a 2.4.16.

So ... not sure if that raises any red flags as far as false/spurious MCEs
are concerned, but either way comments are appreciated.

I will try the nomce option just in case, but I suspect I have bad
hardware. Again, any comments / war stories appreciated.

thanks!

On Wed, 25 Dec 2002, Bubba wrote:

> try turning off the Machine Check Exception in the kernel as it is just buggy
> on some machines, not necessarily a bug in the kernel, or without
> recompiling, use the kernel param "nomce"

2002-12-26 03:02:26

by Brad Parker

Subject: Re: CPU failures ... or something else ?

I never said that. A bad CPU would be my last guess. My first two are a buggy
board (use nomce) or bad addresses in your RAM. Try running Memtest86
(http://www.memtest86.com) for a few minutes and see if you get any errors.

On Wednesday 25 December 2002 21:04, you wrote:
> So you are saying, that yes, it _is_ possible that my equipment is not
> faulty in any way ?
>
> thanks!

-------------------------------------------------------


2002-12-26 03:12:13

by Josh Brooks

Subject: Re: CPU failures ... or something else ?


Ok, last post - but here are some more details. First, the error produced
is:

Message from syslogd@localhost at Mon Dec 23 22:44:16 2002 ...
localhost kernel: CPU 1: Machine Check Exception: 0000000000000004

Message from syslogd@localhost at Mon Dec 23 22:44:17 2002 ...
localhost kernel: Bank 4: b200000000040151

Message from syslogd@localhost at Mon Dec 23 22:44:17 2002 ...
localhost kernel: Kernel panic: CPU context corrupt


So, using the parsemce.c program that exists, I run:


usage: parsemce [options]
options: -V <version number>
         -e <MCE status code>
         -b <bank number>
         -s <bank status code>
         -a <bank address>
         -f <filename, with MCE dump inside>
         -i <get MCE dump from stdin>


So:


./a.out -e 0000000000000004 -b 4 -s b200000000040151

(assuming MCE status code is 0000000000000004 and bank status code is
b200000000040151 )

and I get this as a result:

Status: (4) Machine Check in progress.
Restart IP invalid.

Any ideas what "Restart IP invalid" means ?

thanks.
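
The two values in the panic can also be read by hand. The following is a
minimal Python sketch, not parsemce's actual code. It assumes the flag layout
Intel documents for the machine-check global status register (RIPV, EIPV and
MCIP in bits 0-2) and for each bank's status register (VAL, OVER, UC, EN,
MISCV, ADDRV and PCC in bits 63-57). The function and table names are invented
for illustration.

```python
# Flag positions follow Intel's documented machine-check register layout.
# All names below are hypothetical, chosen only for this sketch.

MCG_STATUS_FLAGS = {
    0: "RIPV (restart IP valid)",
    1: "EIPV (error IP valid)",
    2: "MCIP (machine check in progress)",
}

BANK_STATUS_FLAGS = {
    63: "VAL (register contents valid)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def set_flags(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`."""
    return [name
            for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_mcg_status(value):
    """Decode the 'Machine Check Exception: ...' value from the panic."""
    return set_flags(value, MCG_STATUS_FLAGS)


def decode_bank_status(value):
    """Decode the flag bits of the 'Bank N: ...' value from the panic."""
    return set_flags(value, BANK_STATUS_FLAGS)


if __name__ == "__main__":
    # Only MCIP is set; RIPV is clear, hence "Restart IP invalid".
    print(decode_mcg_status(0x0000000000000004))
    # VAL, UC, EN and PCC are set in the bank status.
    print(decode_bank_status(0xb200000000040151))
```

With RIPV clear, the CPU is reporting that execution cannot be reliably
resumed at the saved instruction pointer, which is what "Restart IP invalid"
refers to. The PCC flag in the bank status appears to be what leads to the
"CPU context corrupt" panic.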



On Wed, 25 Dec 2002, Bubba wrote:

> try turning off the Machine Check Exception in the kernel as it is just buggy
> on some machines, not necessarily a bug in the kernel, or without
> recompiling, use the kernel param "nomce"

2002-12-26 03:14:37

by Josh Brooks

Subject: Re: CPU failures ... or something else ?


Ok, understood - and that is why I thought it was significant that I am
running 2.4.1 on _identical_ hardware without problems - presumably if
that is the case then running 2.4.16 on the same hardware should be fine
as well (in terms of a buggy board) - therefore I suspect bad hardware.

Is this good reasoning ? (see post about dell hardware specifics, etc.)

On Wed, 25 Dec 2002, Ro0tSiEgE wrote:

> I never said that. A bad CPU would be my last guess. My first two are buggy
> board (use nomce) or bad addresses in your ram. try running Memtest86
> (http://www.memtest86.com) for a few minutes and see if you get any errors.

2002-12-26 03:23:15

by Billy Rose

Subject: Re: CPU failures ... or something else ?

> Oh and by the way, this is a dell poweredge 2450, dual 866 p3 cpus,
> 2gigs ram, and using a PERC 3/D. I have a 2.4.1 system running on
> _identical_ hardware with no problems, and this system that is
> MCE'ing is a 2.4.16.

try reseating the cpus and vrms. if that doesn't work, remove cpu #2
and #2 vrm. run it and see if the error occurs. if no error, #2 cpu or
#2 vrm is bad. if the error still occurs, swap out cpu #1 and #1 vrm
with cpu #2 and #2 vrm, then run again. if the error still occurs,
you're SOL.

billy

=====
"there's some milk in the fridge that's about to go bad...
and there it goes..." -bobby
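
The elimination steps above can be written down as a small decision function.
This is only a sketch in Python with made-up names. Each argument records
whether the MCE came back after the corresponding step. The "cpu #1 or vrm #1"
branch is an assumption: it is implied by the procedure but not stated in the
message.

```python
def isolate_fault(fails_after_reseat,
                  fails_with_cpu2_removed,
                  fails_with_cpu2_in_slot1):
    """Walk the reseat / remove / swap steps and name the likely culprit.

    Each argument is True if the machine check still occurred after that
    step. Since the crash only shows up every few days, each step needs a
    correspondingly long test run before it can be called a pass.
    """
    if not fails_after_reseat:
        return "reseating fixed it (likely a poor contact)"
    if not fails_with_cpu2_removed:
        return "cpu #2 or vrm #2 is bad"
    if not fails_with_cpu2_in_slot1:
        # Assumption: implied by the procedure, not stated in the message.
        return "cpu #1 or vrm #1 is bad"
    return "neither cpu/vrm pair is the cause; look elsewhere"


if __name__ == "__main__":
    # Error persists after reseating, disappears once cpu #2 is pulled.
    print(isolate_fault(True, False, False))
```

The function only makes the order of elimination explicit; it cannot replace
the multi-day test runs each step requires.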

2002-12-26 03:28:25

by Joe

Subject: Re: CPU failures ... or something else ?

FWIW, I had 3 identical 2450s similar to
yours, all with RH linux 7.2 installed -
2 of them were rock solid, the other had
random crashes, nothing in the logs, no
pattern to the crashes that I could see.

Dell had me upgrade/reflash the BIOS
and the perc raid controller, and the box
has been rock solid ever since...

Just my $.02 on the matter -

Joe

Josh Brooks wrote:

>Oh and by the way, this is a dell poweredge 2450, dual 866 p3 cpus, 2gigs
>ram, and using a PERC 3/D. I have a 2.4.1 system running on _identical_
>hardware with no problems, and this system that is MCE'ing is a 2.4.16.
>
>So ... not sure if that raises any red flags as far as false/spurious MCEs
>are concerned, but either way comments are appreciated.
>
>I will try the nomce option just in case, but I suspect I have bad
>hardware. Again, any comments / war stories appreciated.
>
>thanks!


2002-12-26 03:30:25

by Josh Brooks

Subject: Re: CPU failures ... or something else ?


Well actually I ordered a complete replacement system - identical in every
way. So I am getting that on saturday, and presumably that will just be
the big hammer that makes every problem go away.

I am just posting to get a head start on the issue if, for some crazy
reason I replace all hardware and the problem continues. Sounds like that
is a slim to none chance, since I am dealing with good hardware (dell) and
it looks like this is a faulty component at work.

Basically I am just moving the disks from one machine to another on
saturday, and I suspect the problems just disappear when I do that.

Comments on the possibility that the problems continue after moving the
disks to different (but identical) hardware ?

thanks!

On Wed, 25 Dec 2002, Billy Rose wrote:

> > Oh and by the way, this is a dell poweredge 2450, dual 866 p3 cpus,
> > 2gigs ram, and using a PERC 3/D. I have a 2.4.1 system running on
> > _identical_ hardware with no problems, and this system that is
> > MCE'ing is a 2.4.16.
>
> try reseating the cpu's and vrm's. if that doesnt work, remove cpu #2
> and #2 vrm. run it and see if the error occurs. if no error, #2 cpu or
> #2 vrm is bad. if the error still occurs, swap out cpu #1 and #1 vrm
> with cpu #2 and #2 vrm, then run again. if the error still occurs,
> youre SOL.
>
> billy
>
> =====
> "there's some milk in the fridge that's about to go bad...
> and there it goes..." -bobby

2002-12-26 03:30:51

by Josh Brooks

Subject: Re: CPU failures ... or something else ?


thanks for that advice - much appreciated.



On Wed, 25 Dec 2002, J Sloan wrote:

> FWIW, I had 3 identical 2450s similar to
> yours, all with RH linux 7.2 installed -
> 2 of them were rock solid, the other had
> random crashes, nothing in the logs, no
> pattern to the crashes that I could see.
>
> Dell had me upgrade/reflash the BIOS
> and the perc raid controller, and the box
> has been rock solid ever since...
>
> Just my $.02 on the matter -
>
> Joe

2002-12-26 03:36:36

by Felipe W Damasio

Subject: Re: CPU failures ... or something else ?



Josh Brooks wrote:
> Hello,
>
> I have a dual p3 866 running 2.4 kernel that is crashing once every few
> days leaving this on the console:
>
>
> Message from syslogd@localhost at Tue Dec 24 11:30:31 2002 ...
> localhost kernel: CPU 1: Machine Check Exception: 0000000000000004
>
> Message from syslogd@localhost at Tue Dec 24 11:30:32 2002 ...
> localhost kernel: Bank 4: b200000000040151
>
> Message from syslogd@localhost at Tue Dec 24 11:30:32 2002 ...
> localhost kernel: Kernel panic: CPU context corrupt
>
> Word on the street is that this indicates hardware failure of some kind
> (cpu, bus, or memory). My main question is, is that very surely the
> culprit, or is it also possible that all of the hardware is perfect and
> that a bug in the kernel code or some outside influence (remote exploit)
> is causing this crash ?

Instruction fetch error from the level 1 cache...I've seen this before
(check the archives). This indicates either a memory or a processor problem.

Could you please run memtest86?

Thanks.

Felipe
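
The "instruction fetch error from the level 1 cache" reading can be reproduced
from the low 16 bits of the bank status. The Python sketch below assumes
Intel's documented compound error code format for memory hierarchy errors
(binary 0000 0001 RRRR TTLL). The table and function names are invented for
this example, and other error code families are deliberately not handled.

```python
# Field encodings follow Intel's documented "memory hierarchy" compound
# machine-check error code: 0000 0001 RRRR TTLL. Names are hypothetical.

REQUEST = {
    0: "ERR (generic error)",
    1: "RD (generic read)",
    2: "WR (generic write)",
    3: "DRD (data read)",
    4: "DWR (data write)",
    5: "IRD (instruction fetch)",
    6: "PREFETCH",
    7: "EVICT",
    8: "SNOOP",
}
TRANSACTION = {0: "I (instruction)", 1: "D (data)", 2: "G (generic)"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG (generic)"}


def decode_cache_error(bank_status):
    """Return (request, transaction type, cache level) for a memory
    hierarchy error, or None if the code belongs to another family."""
    code = bank_status & 0xFFFF          # MCA error code, bits 15..0
    if (code >> 8) != 0x01:              # upper byte must be 0000 0001
        return None
    request = REQUEST.get((code >> 4) & 0xF, "reserved")
    transaction = TRANSACTION.get((code >> 2) & 0x3, "reserved")
    level = LEVEL[code & 0x3]
    return request, transaction, level


if __name__ == "__main__":
    # 0x0151 -> RRRR=0101, TT=00, LL=01
    print(decode_cache_error(0xb200000000040151))
    # ('IRD (instruction fetch)', 'I (instruction)', 'L1')
```

The result matches the diagnosis given above: an instruction fetch that failed
in the level 1 instruction cache path.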

2002-12-26 03:42:36

by Billy Rose

Subject: Re: CPU failures ... or something else ?

> Well actually I ordered a complete replacement system - identical in
> every way. So I am getting that on saturday, and presumably that
> will just be the big hammer that makes every problem go away.
>
> I am just posting to get a head start on the issue if, for some crazy
> reason I replace all hardware and the problem continues. Sounds
> like that is a slim to none chance, since I am dealing with good
> hardware (dell) and it looks like this is a faulty component at work.
>
> Basically I am just moving the disks from one machine to another on
> saturday, and I suspect the problems just disappear when I do that.
>
>
> Comments on the possibility that the problems continue after moving
> the disks to different (but identical) hardware ?
>
>
> thanks!

does this machine have a DRAC card by any chance?


"there's some milk in the fridge that's about to go bad...
and there it goes..." -bobby

2002-12-26 03:46:49

by Josh Brooks

Subject: Re: CPU failures ... or something else ?


I am not sure what a DRAC card is. There are no pci cards plugged in - no
additional hardware - just the system, onboard intel etherexpress, and
onboard PERC.



On Wed, 25 Dec 2002, Billy Rose wrote:

> > Well actually I ordered a complete replacement system - identical in
> > every way. So I am getting that on saturday, and presumably that
> > will just be the big hammer that makes every problem go away.
> >
> > I am just posting to get a head start on the issue if, for some crazy
> > reason I replace all hardware and the problem continues. Sounds
> > like that is a slim to none chance, since I am dealing with good
> > hardware (dell) and it looks like this is a faulty component at work.
> >
> > Basically I am just moving the disks from one machine to another on
> > saturday, and I suspect the problems just disappear when I do that.
> >
> >
> > Comments on the possibility that the problems continue after moving
> > the disks to different (but identical) hardware ?
> >
> >
> > thanks!
>
> does this machine have a DRAC card by any chance?
>
>
> "there's some milk in the fridge that's about to go bad...
> and there it goes..." -bobby

2002-12-26 03:56:24

by Josh Brooks

[permalink] [raw]
Subject: Re: CPU failures ... or something else ?


Understood. Thank you for that diagnosis.

usually it says proc #1 in the error, but the first time it said proc #0 -
is that interesting ?



On Wed, 25 Dec 2002, Billy Rose wrote:

> i agree with felipe, sounds like either a stick of ram is bad, or proc
> #1 is fried (possibly its vrm though).
>
> a DRAC is the dell remote assistant card. it sits in a pci slot, has
> an intel i860 proc on it, and has a 10/100 for a net cable. if you
> have no cards, then it is obviously ruled out.
>
> billy
> =====
> "there's some milk in the fridge that's about to go bad...
> and there it goes..." -bobby
>

2002-12-26 03:55:21

by Billy Rose

[permalink] [raw]
Subject: Re: CPU failures ... or something else ?

i agree with felipe, sounds like either a stick of ram is bad, or proc
#1 is fried (possibly its vrm though).

a DRAC is the dell remote assistant card. it sits in a pci slot, has
an intel i860 proc on it, and has a 10/100 for a net cable. if you
have no cards, then it is obviously ruled out.

billy
=====
"there's some milk in the fridge that's about to go bad...
and there it goes..." -bobby

2002-12-26 04:00:19

by Felipe W Damasio

[permalink] [raw]
Subject: Re: CPU failures ... or something else ?

Josh Brooks wrote:
> Understood. Thank you for that diagnosis.
>
> usually it says proc #1 in the error, but the first time it said proc #0 -
> is that interesting ?

It is.

	This would be stronger evidence of bad RAM, since the instruction
fetch error occurred "randomly" on both processors.

Either that or both your processors are going bad :)

Please run memtest86 to be sure it's a bad RAM problem.

Kind Regards,

Felipe

2002-12-26 04:13:08

by Billy Rose

[permalink] [raw]
Subject: Re: CPU failures ... or something else ?

> Understood. Thank you for that diagnosis.
>
>
> usually it says proc #1 in the error, but the first time it said proc
> #0 - is that interesting ?

youre welcome :)

if youre hanging on to that box, remove the memory from banks 3 and 4
and it should be ok. if my memory serves me right, you cant have only 3
banks of memory (hence removing bank 3 also), the motherboard is
configured to handle 1, 2, or 4 populated banks. if you leave bank 3
in while removing bank 4, it will beep at you when you power it on and
do nothing. with a gig of ram, it should still be plenty useful.

billy
=====
"there's some milk in the fridge that's about to go bad...
and there it goes..." -bobby

2002-12-26 04:40:35

by Josh Brooks

[permalink] [raw]
Subject: Re: CPU failures ... or something else ?


Are you saying that you think bank 4 is bad because you saw this in my
error:

localhost kernel: Bank 4: b200000000040151
^^^^^^

(just asking to increase my own understanding)

thanks!



On Wed, 25 Dec 2002, Billy Rose wrote:

> > Understood. Thank you for that diagnosis.
> >
> >
> > usually it says proc #1 in the error, but the first time it said proc
> > #0 - is that interesting ?
>
> youre welcome :)
>
> if youre hanging on to that box, remove the memory from banks 3 and 4
> and it should be ok. if my memory serves me right, you cant have only 3
> banks of memory (hence removing bank 3 also), the motherboard is
> configured to handle 1, 2, or 4 populated banks. if you leave bank 3
> in while removing bank 4, it will beep at you when you power it on and
> do nothing. with a gig of ram, it should still be plenty useful.
>
> billy
> =====
> "there's some milk in the fridge that's about to go bad...
> and there it goes..." -bobby
>

2002-12-26 05:55:09

by Joseph D. Wagner

[permalink] [raw]
Subject: RE: CPU failures ... or something else ?

> Message from syslogd@localhost at Tue Dec 24 11:30:32 2002 ...
> localhost kernel: Kernel panic: CPU context corrupt

What that basically means is that given some values A, B, and C, in the
context of those values the kernel expects X, Y, and Z to be of some other
value, but X, Y, and Z aren't turning out to be expected.

> Word on the street is that this indicates
> hardware failure of some kind
> [trimmed]
> is that very surely the culprit, or is it
> also possible that all of the hardware is
> perfect and that a bug in the kernel code
> or some outside influence (remote exploit)
> is causing this crash ?

Hardware failures of any kind are relatively rare, rarest of all is a CPU
failure (unless you buy from TC Computers which sold me TWO defective
CPU's).

In your quest to place blame, I'd start with:
1) outdated kernel - you never did say your kernel version
2) other third party processes, especially modules.

> whatever that thing is where you press
> ctrl-alt-printscreen and get to enter
> those post-crash commands - do you think
> that would work in this situation, or does
> the above error hard lock the system so
> you can't do those emergency measures ?

You're thinking of when X (the graphical user interface) crashes. X is
sufficiently abstracted from the kernel so that if it crashes, the rest of
the computer doesn't come down with it. When the KERNEL crashes (a.k.a.
"kernel panic"), it crashes HARD. This is the equivalent of the Blue Screen
of the Death(tm) from which the only "recovery" is the reboot button.

Joseph Wagner

2002-12-26 06:00:24

by Joseph D. Wagner

[permalink] [raw]
Subject: RE: CPU failures ... or something else ?

> I thought it was significant that I am
> running 2.4.1 on _identical_ hardware
> without problems - presumably if that
> is the case then running 2.4.16 on the
> same hardware should be fine as well
> (in terms of a buggy board) - therefore
> I suspect bad hardware.
>
> Is this good reasoning?

ABSOLUTELY NOT! I've been monitoring this list long enough to know that bugs
HAVE BEEN INTRODUCED IN NEWER KERNEL VERSIONS that make things that worked in
older versions not work in new versions.

Run 2.4.20 and if you still have problems then try that one guy's advice who
said to disable Machine Check Exceptions.

Joseph Wagner

2002-12-26 06:04:55

by Joseph D. Wagner

[permalink] [raw]
Subject: RE: CPU failures ... or something else ?

> usually it says proc #1 in the error, but
> the first time it said proc #0 - is that
> interesting ?

proc 0 and proc 1 are CPU 0 and CPU 1, respectively. If you switched CPU's
and now the error is on the other proc, then it IS a CPU error.

Joseph Wagner

P.S. In hindsight, I probably should have read the entire thread before
responding. 8-) You live you learn.

Joseph Wagner

2002-12-26 06:27:12

by Josh Brooks

[permalink] [raw]
Subject: RE: CPU failures ... or something else ?


Well, that's the thing - I _didn't_ switch cpus. Sometimes the error is
cpu 0, sometimes it is cpu 1.



On Thu, 26 Dec 2002, Joseph D. Wagner wrote:

> > usually it says proc #1 in the error, but
> > the first time it said proc #0 - is that
> > interesting ?
>
> proc 0 and proc 1 are CPU 0 and CPU 1, respectively. If you switched CPU's
> and now the error is on the other proc, then it IS a CPU error.
>
> Joseph Wagner
>
> P.S. In hindsight, I probably should have read the entire thread before
> responding. 8-) You live you learn.
>
> Joseph Wagner
>

2002-12-26 06:41:21

by Joseph D. Wagner

[permalink] [raw]
Subject: RE: CPU failures ... or something else ?

> Well, that's the thing - I _didn't_ switch cpus.
> Sometimes the error is cpu 0, sometimes it is cpu 1.

Then it's not a CPU error, unless both CPU's are failing at the same time
(highly unlikely).

Joseph Wagner

2002-12-26 09:02:50

by Felipe W Damasio

[permalink] [raw]
Subject: Re: CPU failures ... or something else ?

Josh Brooks wrote:
> Well, that's the thing - I _didn't_ switch cpus. Sometimes the error is
> cpu 0, sometimes it is cpu 1.

Then it's probably a bad RAM problem.

Read my other post.

Kind Regards,

Felipe

2002-12-27 23:24:01

by Alan

[permalink] [raw]
Subject: RE: CPU failures ... or something else ?

On Thu, 2002-12-26 at 06:03, Joseph D. Wagner wrote:
> > Message from syslogd@localhost at Tue Dec 24 11:30:32 2002 ...
> > localhost kernel: Kernel panic: CPU context corrupt
>
> What that basically means is that given some values A, B, and C, in the
> context of those values the kernel expects X, Y, and Z to be of some other
> value, but X, Y, and Z aren't turning out to be expected.

Ermm no.

Let's correct a few comments made here:

1. On Pentium II/III we have no known cases where MCE is misreported.
(Some old pentiums don't have the external bits of the MCA/MCE stuff
wired up right so do trigger it). Radeon IGP also triggers it but for
what appear to be valid reasons (Linux confused the chipset badly)

2. The panic occurs because the CPU set flags saying that the CPU state
was corrupt. That is, it trapped out at a point where the previous state
could not be recovered, so the trap is fatal.

3. Don't be surprised if it moves CPU. It is possible (and not uncommon)
to set up dual systems so that both CPU's boot and race to become CPU#0.
This is actually recommended since the box will then almost always boot
with one failed CPU.

Running each CPU as a single CPU test may find a faulty CPU.
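
[Illustrative sketch, not part of Alan's message.] Point 2 above can be
checked against the two values in the original report. The Python below
is a minimal sketch that assumes the usual Intel machine-check register
layout (global status bits RIPV/EIPV/MCIP, per-bank status bits VAL
through PCC, and the "cache hierarchy" compound error code). All
function and table names are hypothetical, and the reading of the
low 16 bits should be confirmed against the CPU vendor's manual
before drawing conclusions.

```python
# Hypothetical helper names; bit positions assume the Intel
# machine-check architecture layout for MCG_STATUS and MCi_STATUS.
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}
MCI_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
             59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Assumed field encodings for a cache hierarchy error code,
# whose bit pattern is 0000 0001 RRRR TTLL.
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TYPE = {0: "I", 1: "D", 2: "G"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def set_flags(value, flags):
    """Return the names of the flag bits set in value, highest bit first."""
    return [name for bit, name in sorted(flags.items(), reverse=True)
            if (value >> bit) & 1]


def decode_bank_status(status):
    """Split a 64-bit bank status value into flags and error codes."""
    return {
        "flags": set_flags(status, MCI_FLAGS),
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


def decode_cache_error(code):
    """Decode a cache hierarchy error code, or return None if it is not one."""
    if (code >> 8) != 0x01:
        return None
    request = REQUEST.get((code >> 4) & 0xF, "?")
    kind = TYPE.get((code >> 2) & 0x3, "?")
    level = LEVEL[code & 0x3]
    return (request, kind, level)


# Values copied from the console output in the first message.
mcg = set_flags(0x0000000000000004, MCG_FLAGS)
bank = decode_bank_status(0xB200000000040151)
cache = decode_cache_error(bank["mca_error_code"])

print("global status flags:", mcg)
print("bank 4 flags:", bank["flags"])
print("bank 4 MCA error code:", hex(bank["mca_error_code"]), cache)
```

Under those assumptions the PCC (processor context corrupt) flag is set
in the bank status, which matches the "CPU context corrupt" panic, and
the low 16 bits read as an instruction fetch error in the level 1
cache. Note that "Bank 4" here is a machine-check reporting bank inside
the CPU, not a memory slot on the motherboard.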


2003-01-01 20:56:24

by Pavel Machek

[permalink] [raw]
Subject: Re: CPU failures ... or something else ?

Hi!

> So you are saying, that yes, it _is_ possible that my equipment is not
> faulty in any way ?

Well, if your machine produces spurious MCE's, you have a machine with
a *design* bug. That might be slightly better than a machine with an
overheating or similar problem ;-). Still, it's a buggy machine.
Pavel

--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?