2004-01-09 16:47:05

by lkml

Subject: 2.6: The hardware reports a non fatal, correctable incident occured on CPU 0.

Hi!

I had some very scary issues today while playing with 2.6. The system was
booted and ran several times today; the longest uptime was about an hour.

But then shortly after having booted 2.6 I got syslog messages:

The hardware reports a non fatal, correctable incident occured on CPU 0.

I shut down the machine. After this my Athlon XP 2200+ showed up as 1050 MHz in
BIOS, and indeed the bus frequency was set to 100 instead of 133 MHz (how can
an OS change the BIOS?!) - nevertheless the CPU should have shown up as
1500 MHz. I set it back to 133 MHz, after which the machine no longer even
reached the BIOS but rebooted automatically before getting there. I turned
off the machine for a few seconds - no change. I turned it off for a few
minutes and the BIOS showed up again - with 1050 MHz. So I had to set the
freq back to 133 MHz a second time. I then booted my 2.4.21 kernel, which
seems to run.

What the fuck is going on here?? As far as I figured out this has something to
do with MCE (CONFIG_X86_MCE=y, CONFIG_X86_MCE_NONFATAL=y) (?).

TIA
Timo


2004-01-09 17:38:31

by Jesper Juhl

Subject: Re: 2.6: The hardware reports a non fatal, correctable incident occured on CPU 0.


On Fri, 9 Jan 2004 [email protected] wrote:

> Hi!
>
> I had some very scary issues today while playing with 2.6. The system was
> booted and ran several times today; the longest uptime was about an hour.
>
> But then shortly after having booted 2.6 I got syslog messages:
>
> The hardware reports a non fatal, correctable incident occured on CPU 0.
>
> I shut down the machine. After this my Athlon XP 2200+ showed up as 1050 MHz in
> BIOS, and indeed the bus frequency was set to 100 instead of 133 MHz (how can
> an OS change the BIOS?!)

Most likely it has nothing to do with the OS. Some BIOSes modify the FSB
speed and other settings to provide a sort of "fail safe" boot mode if a
problem was detected.

The BIOS on my board will do that if the system fails to POST and I've
also seen it happen sometimes after a crash.

It's even documented in the motherboard manual that it will behave this
way when running in JumperFree mode (this is an ASUS A7M266 board btw).
The exact text from my motherboard manual is:

"Notes for JumperFree Mode
System Hangup

If your system crashes or hangs due to improper frequency settings, power
OFF your system and restart. The system will start up in safe mode
running at a DRAM-to-CPU frequency ratio of 3:3 and a bus speed of
100MHz. You will then be led to BIOS setup to adjust the configurations."


-- Jesper Juhl

2004-01-09 18:13:22

by Jesper Juhl

Subject: Re: 2.6: The hardware reports a non fatal, correctable incident occured on CPU 0.


On Fri, 9 Jan 2004 [email protected] wrote:

> On Friday 09 January 2004 18:35, Jesper Juhl wrote:
> > Most likely it has nothing to do with the OS. Some BIOSes modify the FSB
> > speed and other settings to provide a sort of "fail safe" boot mode if a
> > problem was detected.
>
> So, in your opinion I really have hardware problems (which I haven't noticed
> so far and which also did not recur for 3.5 h)?
>
All I'm saying is that I know for a fact that some BIOSes will do this
(set bus speed to 100) if they detect problems - I know mine does.

It's just one possibility. I don't actually /know/ what causes what you
experienced.
I guess it's possible that something the kernel did caused the BIOS to
think there was a problem even though there was not...
Or it could be something else entirely.
I don't know for sure. All I can do is suggest that maybe you should check
your motherboard manual for any hints on this behaviour and maybe try and
test your hardware just to be safe...

Other people probably have better advice for you.


-- Jesper Juhl

2004-01-09 23:12:25

by Eric Bambach

Subject: Re: 2.6: The hardware reports a non fatal, correctable incident occured on CPU 0.

On Friday 09 January 2004 10:48 am, [email protected] wrote:
> Hi!
>
> I had some very scary issues today while playing with 2.6. The system was
> booted and ran several times today; the longest uptime was about an hour.
>
> But then shortly after having booted 2.6 I got syslog messages:
>
> The hardware reports a non fatal, correctable incident occured on CPU 0.
>
> I shut down the machine. After this my Athlon XP 2200+ showed up as 1050 MHz
> in BIOS, and indeed the bus frequency was set to 100 instead of 133 MHz (how
> can an OS change the BIOS?!) - nevertheless the CPU should have shown up as
> 1500 MHz. I set it back to 133 MHz, after which the machine no longer even
> reached the BIOS but rebooted automatically before getting there. I turned
> off the machine for a few seconds - no change. I turned it off for a few
> minutes and the BIOS showed up again - with 1050 MHz. So I had to set the
> freq back to 133 MHz a second time. I then booted my 2.4.21 kernel, which
> seems to run.
Check your hardware: CPU/mobo/RAM. Overheating? Bad RAM? Cheap mobo?
MCE should not be triggered under any circumstances unless it is a kernel
bug (RARE; I believe the MCE code is simple) or you REALLY have a hardware
problem. As said before, the BIOS is resetting your FSB to 100 as a fail-safe
because something bad happened.
BTW, check your setup; an AMD 2200+ should run at 1.8 GHz, I believe. If you
are setting your FSB or multiplier too low, that might also be triggering a
problem. A quick Google search lists the AMD XP 2200+ as 1800 MHz.
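
For reference, the clock figures discussed here follow from
core clock = bus clock x multiplier. The sketch below is only a sanity check
of that arithmetic. It assumes a 13.5x multiplier for the Athlon XP 2200+
(not stated in this thread; inferred from 1800 MHz at a nominal 133 MHz bus),
and the function name is made up for illustration.

```python
def core_clock_mhz(fsb_mhz: float, multiplier: float) -> float:
    """Core clock in MHz, computed as bus clock (MHz) times CPU multiplier."""
    return fsb_mhz * multiplier


# Assumption: 13.5x multiplier; "133 MHz" is really 400/3 = 133.33 MHz.
MULTIPLIER = 13.5

nominal = core_clock_mhz(400 / 3, MULTIPLIER)   # normal bus speed
failsafe = core_clock_mhz(100, MULTIPLIER)      # BIOS fail-safe bus speed

print("nominal :", round(nominal), "MHz")
print("failsafe:", round(failsafe), "MHz")
```

Under that assumption a 100 MHz bus would give 1350 MHz rather than the
1050 MHz reported above, so the multiplier may have been reset as well
(1050 = 10.5 x 100). The thread does not settle this.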

> What the fuck is going on here?? As far as I figured out this has something
> to do with MCE (CONFIG_X86_MCE=y, CONFIG_X86_MCE_NONFATAL=y) (?).
Leave it enabled, it's a good thing to tell you when you have bad hardware.
It's not a kernel problem, but a feature.
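
One way to confirm which of these options a kernel was built with is to read
its .config file. The following is a minimal sketch, not part of any kernel
tooling: the function name and the sample text are made up for illustration,
and only the two CONFIG_X86_MCE* names come from this thread.

```python
def enabled_options(config_text, names):
    """Map each option name to its value in a kernel .config, or None if unset."""
    values = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blanks and comments such as "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            values[key] = value
    return {name: values.get(name) for name in names}


# Hypothetical .config excerpt; CONFIG_EXAMPLE_OPTION is a placeholder name.
sample_config = """\
CONFIG_X86_MCE=y
CONFIG_X86_MCE_NONFATAL=y
# CONFIG_EXAMPLE_OPTION is not set
"""

result = enabled_options(
    sample_config,
    ["CONFIG_X86_MCE", "CONFIG_X86_MCE_NONFATAL", "CONFIG_EXAMPLE_OPTION"],
)
print(result)
```

To use it on a real tree, pass the contents of that tree's .config file
instead of the sample string.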
-------------------------
Eric Bambach
Eric at cisu dot net
-------------------------

2004-01-09 23:30:37

by Prakash K. Cheemplavam

Subject: Re: 2.6: The hardware reports a non fatal, correctable incident occured on CPU 0.

Eric wrote:
> On Friday 09 January 2004 10:48 am, [email protected] wrote:
>
>>Hi!
>>
>>I had some very scary issues today while playing with 2.6. The system was
>>booted and ran several times today; the longest uptime was about an hour.
>>
>>But then shortly after having booted 2.6 I got syslog messages:
>>
>>The hardware reports a non fatal, correctable incident occured on CPU 0.
>>
>>I shut down the machine. After this my Athlon XP 2200+ showed up as 1050 MHz
>>in BIOS, and indeed the bus frequency was set to 100 instead of 133 MHz (how
>>can an OS change the BIOS?!) - nevertheless the CPU should have shown up as
>>1500 MHz. I set it back to 133 MHz, after which the machine no longer even
>>reached the BIOS but rebooted automatically before getting there. I turned
>>off the machine for a few seconds - no change. I turned it off for a few
>>minutes and the BIOS showed up again - with 1050 MHz. So I had to set the
>>freq back to 133 MHz a second time. I then booted my 2.4.21 kernel, which
>>seems to run.
>
> Check your hardware: CPU/mobo/RAM. Overheating? Bad RAM? Cheap mobo?
> MCE should not be triggered under any circumstances unless it is a kernel
> bug (RARE; I believe the MCE code is simple) or you REALLY have a hardware
> problem. As said before, the BIOS is resetting your FSB to 100 as a fail-safe
> because something bad happened.
> BTW, check your setup; an AMD 2200+ should run at 1.8 GHz, I believe. If you
> are setting your FSB or multiplier too low, that might also be triggering a
> problem. A quick Google search lists the AMD XP 2200+ as 1800 MHz.

Yes, I would say the same. With my Athlon XP 1700+ (1.466 GHz, FSB
133 MHz) clocked at 2.2 GHz (FSB 200) I get MCE errors, but not at 2.1 GHz,
even though I can't find any stability issues at 2.2 GHz. Nevertheless I
run the system at 2.1 GHz.

Prakash

2004-01-10 17:14:55

by lkml

Subject: Re: 2.6: The hardware reports a non fatal, correctable incident occured on CPU 0.

On Saturday 10 January 2004 00:12, Eric wrote:
> Check your hardware: CPU/mobo/RAM. Overheating? Bad RAM? Cheap mobo?
> MCE should not be triggered under any circumstances unless it is a kernel
> bug (RARE; I believe the MCE code is simple) or you REALLY have a hardware
> problem. As said before, the BIOS is resetting your FSB to 100 as a
> fail-safe because something bad happened.

Well, my system was running very stably before, and in the meantime it is
again running very stably on both 2.4.21 and Windows XP...

> BTW, check your setup; an AMD 2200+ should run at 1.8 GHz, I believe. If you

Yes.

> > What the fuck is going on here?? As far as I figured out this has
> > something to do with MCE (CONFIG_X86_MCE=y, CONFIG_X86_MCE_NONFATAL=y)
> > (?).
>
> Leave it enabled, it's a good thing to tell you when you have bad hardware.
> It's not a kernel problem, but a feature.

Well, it is a good thing to tell me, but it's not a good thing to make my
system auto-reset itself before reaching the BIOS afterwards...

timo

2004-01-14 04:43:21

by Dave Jones

Subject: Re: 2.6: The hardware reports a non fatal, correctable incident occured on CPU 0.

On Sat, Jan 10, 2004 at 06:16:22PM +0100, [email protected] wrote:

> > Check your hardware: CPU/mobo/RAM. Overheating? Bad RAM? Cheap mobo?
> > MCE should not be triggered under any circumstances unless it is a kernel
> > bug (RARE; I believe the MCE code is simple) or you REALLY have a hardware
> > problem. As said before, the BIOS is resetting your FSB to 100 as a
> > fail-safe because something bad happened.
>
> Well, my system was running very stably before, and in the meantime it is
> again running very stably on both 2.4.21 and Windows XP...

Neither of which checks for the presence of these errors.

> > > What the fuck is going on here?? As far as I figured out this has
> > > something to do with MCE (CONFIG_X86_MCE=y, CONFIG_X86_MCE_NONFATAL=y)
> > > (?).
> >
> > Leave it enabled, it's a good thing to tell you when you have bad hardware.
> > It's not a kernel problem, but a feature.
>
> Well, it is a good thing to tell me, but it's not a good thing to make my
> system auto-reset itself before reaching the BIOS afterwards...

The non-fatal MCE code doesn't do anything like that. Any odd side-effects that you
observed were very likely due to whatever caused the MCE in the first place.
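
To see how often the non-fatal report actually shows up, the syslog can be
scanned for the message quoted at the top of this thread. This is a rough
sketch under stated assumptions: the function name is made up, and the
timestamps and hostname in the sample lines are invented. Only the message
text itself (including its "occured" spelling) is taken from the thread.

```python
import re
from collections import Counter

# Message text as quoted in this thread; the CPU number is captured.
MCE_PATTERN = re.compile(
    r"The hardware reports a non fatal, correctable incident "
    r"occured on CPU (\d+)\."
)


def count_nonfatal_mce(lines):
    """Return a Counter mapping CPU number to the number of reports seen."""
    counts = Counter()
    for line in lines:
        match = MCE_PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical syslog excerpt (timestamps and hostname are invented).
sample = [
    "Jan  9 16:40:01 host kernel: The hardware reports a non fatal, "
    "correctable incident occured on CPU 0.",
    "Jan  9 16:40:05 host kernel: some unrelated message",
    "Jan  9 16:41:01 host kernel: The hardware reports a non fatal, "
    "correctable incident occured on CPU 0.",
]

print(dict(count_nonfatal_mce(sample)))
```

On a real system the lines would come from wherever the kernel log is
written, which varies by distribution.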


Dave