2003-03-30 18:36:13

by Lawrence Walton

Subject: MCE error

I just got an MCE while running 2.5.65: "MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 2: 940040000000017a". I did a Google search, found Dave Jones's parsemce, and decoded it to

Status: (ba) Error IP valid
Restart IP invalid.

And was wondering what that actually meant. :)
Really, what I need to know is: how non-fatal is "non fatal"?


--
*--* Mail: [email protected]
*--* Voice: 425.739.4247
*--* Fax: 425.827.9577
*--* HTTP://the-penguin.otak.com/~lawrence/
--------------------------------------
- - - - - - O t a k i n c . - - - - -



2003-03-31 15:28:30

by Dave Jones

Subject: Re: MCE error

On Sun, Mar 30, 2003 at 10:47:56AM -0800, Lawrence Walton wrote:
> I just got an MCE while running 2.5.65: "MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
> Bank 2: 940040000000017a". I did a Google search, found Dave Jones's parsemce, and decoded it to
>
> Status: (ba) Error IP valid
> Restart IP invalid.
>
> And was wondering what that actually meant. :)

That's an incomplete dump. What it really means:

(davej@deviant:davej)$ ./a.out -b 2 -e 0xba -s 940040000000017a -a 0
Status: (186) Error IP valid
Restart IP invalid.
parsebank(2): 940040000000017a @ 0
External tag parity error
Correctable ECC error
Address in addr register valid
Error enabled in control register
Memory heirarchy error
Request: Generic error
Transaction type : Generic
Memory/IO : I/O

It looks like the L2 cache ECC checking spotted something going wrong
and fixed it up. This can happen when cooling or power is inadequate,
when the CPU is overclocked, or (in rare circumstances) when the CPU
itself is flaky.
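
For anyone who wants to check the flag bits by hand, here is a minimal
sketch in Python. It is not parsemce's code and makes no attempt to
reproduce its full output. It assumes the usual x86 machine-check layout:
RIPV/EIPV/MCIP in bits 0-2 of the global status value, VAL/OVER/UC/EN/
MISCV/ADDRV/PCC in bits 63-57 of the bank status, and the MCA error code
in the low 16 bits. Treating bit 46 as "correctable ECC" is a further
assumption that only holds if the CPU is an AMD K7/K8-style part. The
function and table names are made up for illustration.

```python
# Sketch only: decode the common flag bits of x86 machine-check status values.
# Bit layouts are assumed from the architectural MCA definitions; names are
# hypothetical and this is not how parsemce itself is written.

MCG_STATUS_BITS = {
    2: "MCIP",   # machine check in progress
    1: "EIPV",   # error IP valid
    0: "RIPV",   # restart IP valid
}

MCI_STATUS_BITS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # error reporting enabled in the control register
    59: "MISCV",  # misc register valid
    58: "ADDRV",  # address register valid
    57: "PCC",    # processor context corrupt
}


def set_flags(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_bank_status(status):
    """Split a bank status value into flags, error code, and severity hints."""
    return {
        "flags": set_flags(status, MCI_STATUS_BITS),
        "mca_error_code": status & 0xFFFF,
        # "Non fatal" here means hardware corrected it and context is intact.
        "corrected": not ((status >> 61) & 1),
        "context_corrupt": bool((status >> 57) & 1),
        # Assumption: AMD K7/K8-style layout, where bit 46 flags a
        # correctable ECC error. Ignore this field on other CPUs.
        "amd_correctable_ecc": bool((status >> 46) & 1),
    }


if __name__ == "__main__":
    print(set_flags(0xBA, MCG_STATUS_BITS))
    info = decode_bank_status(0x940040000000017A)
    print(info["flags"], hex(info["mca_error_code"]), info["corrected"])
```

Under those assumptions, the bank value from this report has VAL, EN and
ADDRV set with UC and PCC clear, which is what "non fatal" amounts to here:
the hardware fixed the error itself and the processor state was not
corrupted.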

Dave