2002-06-26 05:20:07

by Shawn Starr

Subject: MCE Error - 2.5.24 - Whats this?

I got this message this evening from the syslog:


MCE: The hardware reports a non fatal, correctable incident occured on
CPU 0.

Bank 0: 9409c00000000136


Is this something I should be worried about?

Included is the standard dmesg.

Shawn.




--
Shawn Starr, sh0n.net, <[email protected]>
Maintainer: -shawn kernel patches: http://xfs.sh0n.net/2.4/


Attachments:
dmesg (11.92 kB)

2002-06-26 07:51:38

by Alex Riesen

Subject: Re: MCE Error - 2.5.24 - Whats this?

On Wed, Jun 26, 2002 at 01:20:57AM -0400, Shawn Starr wrote:
> I got this message this evening from the syslog:
>
>
> MCE: The hardware reports a non fatal, correctable incident occured on
> CPU 0.
>
> Bank 0: 9409c00000000136
>
>
> Is this something I should be worried about?
>
> Included is the standard dmesg.

Dave Jones had a small parser for these codes:
http://www.codemonkey.org.uk/cruft/parsemce.c

It seems the parser lacks some of the information it needs to decode
the message completely:

~ ./parsemce
Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(0): 9409c00000000136 @ 0
External tag parity error
Uncorrectable ECC error
CPU state corrupt. Restart not possible
MISC register information valid
Error not corrected.
Error overflow
Memory heirarchy error
Request: Generic error
Transaction type : Data
Memory/IO : I/O
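
As a cross-check on the output above, the flag bits at the top of the bank value can be tested by hand. The following is a minimal sketch, not parsemce itself: it assumes the value printed by the kernel is the raw 64-bit bank status register and that the flags sit at the architectural positions (bit 63 down to bit 57). The function name and table are made up for illustration.

```python
# Flag bits in the upper part of a 64-bit machine-check bank status word.
# ASSUMPTION: the kernel printed the raw status register and the flags use
# the architectural layout (VAL at bit 63 down to PCC at bit 57).
STATUS_FLAGS = {
    63: "VAL (status valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def decode_status_flags(status):
    """Return the names of the flag bits set in `status`, highest bit first."""
    return [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]


if __name__ == "__main__":
    # Bank 0 value from the original report.
    for name in decode_status_flags(0x9409C00000000136):
        print(name)
```

Under those assumptions only VAL, EN and ADDRV are set for 9409c00000000136, while UC, PCC, MISCV and OVER are clear. That would agree with the kernel's "non fatal, correctable" wording rather than with the parser output above.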

> Linux version 2.5.24 (root@unknown) (gcc version 3.1) #1 Sat Jun 22 14:58:48 EDT 2002
...

-alex

2002-06-26 14:57:31

by Shawn Starr

Subject: Re: MCE Error - 2.5.24 - Whats this?

I don't understand that decoded result ;)

Is it a phony result or is there a real problem with the CPU itself?
It's brand new!


--
Shawn Starr, sh0n.net, <[email protected]>
Maintainer: -shawn kernel patches: http://xfs.sh0n.net/2.4/
Developer Support Engineer
Datawire Communication Networks Inc.
10 Carlson Court, Suite 300
Toronto, ON, M9W 6L2
T: 416.213.2001 ext 179 F: 416.213.2008

2002-06-26 15:16:00

by Matti Aarnio

Subject: Re: MCE Error - 2.5.24 - Whats this?

On Wed, Jun 26, 2002 at 10:57:37AM -0400, Shawn Starr wrote:
> I don't understand that decoded result ;)
>
> Is it a phony result or is there a real problem with the CPU itself?
> It's brand new!

Bad ECC data. Possibly you don't have ECC-capable memory in the system
at all, but your BIOS has been set up to expect it. Possibly the
new processor is marginal, or possibly the new board is marginal...

2002-06-26 15:20:35

by Shawn Starr

Subject: Re: MCE Error - 2.5.24 - Whats this?

Hmm, I don't recall turning on ECC in the BIOS, but I'll check later today.
In any case, it doesn't appear to be serious then.

The RAM is 512MB DDR Registered, but it's non-ECC.

Shawn.


--
Shawn Starr, sh0n.net, <[email protected]>
Maintainer: -shawn kernel patches: http://xfs.sh0n.net/2.4/
Developer Support Engineer
Datawire Communication Networks Inc.
10 Carlson Court, Suite 300
Toronto, ON, M9W 6L2
T: 416.213.2001 ext 179 F: 416.213.2008

2002-06-26 15:21:31

by Richard B. Johnson

Subject: Re: MCE Error - 2.5.24 - Whats this?

On 26 Jun 2002, Shawn Starr wrote:

> I don't understand that decoded result ;)
>
> Is it a phony result or is there a real problem with the CPU itself?
> It's brand new!
>

It looks to me like an ECC error in external tag RAM (part of the
external cache).

The CPU is fine, but since it already read bad data from the cache,
it can't be allowed to restart.


Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

Windows-2000/Professional isn't.

2002-06-26 17:37:59

by Brian Strand

Subject: Re: MCE Error - 2.5.24 - Whats this?

Shawn Starr wrote:

>I got this message this evening from the syslog:
>
>
>MCE: The hardware reports a non fatal, correctable incident occured on
>CPU 0.
>
>Bank 0: 9409c00000000136
>
>
>Is this something I should be worried about?
>
>Included is the standard dmesg.
>
>Shawn.
>
As a possibly relevant aside, according to a recent message on lkml,
that board (Asus A7M266-D) was discontinued. See the message by
[email protected] dated Mon, 17 Jun 2002 20:58:50 -0400, subject "Dual
Athlon issue temporarily resolved", as well as the initial post on Sat,
15 Jun 2002 18:21:35 -0400 with subject "Dual Athlon 2000 XP MP nightmare".

Regards,
Brian


2002-06-27 12:05:04

by Felipe W Damasio

Subject: Re: MCE Error - 2.5.24 - Whats this?

On 26 Jun 2002 01:20:57 -0400
Shawn Starr <[email protected]> wrote:

SS> MCE: The hardware reports a non fatal, correctable incident occured on
SS> CPU 0.
SS> Bank 0: 9409c00000000136

This looks like an L2 data cache read error.
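
That reading can be reproduced from the low 16 bits of the bank value. The sketch below is only an illustration: it assumes those bits hold an architectural MCA error code in the "memory hierarchy" form 0000 0001 RRRR TTLL, and the function and table names are invented here.

```python
# Field encodings for the "memory hierarchy" form of the 16-bit MCA error
# code: 0000 0001 RRRR TTLL.
# ASSUMPTION: the low 16 bits of the printed bank value are such a code.
REQUEST = {
    0: "generic error", 1: "generic read", 2: "generic write",
    3: "data read", 4: "data write", 5: "instruction fetch",
    6: "prefetch", 7: "eviction", 8: "snoop",
}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def decode_memory_hierarchy(status):
    """Decode RRRR/TT/LL from the low 16 bits, or return None if the
    code is not in the memory hierarchy form."""
    code = status & 0xFFFF
    if code >> 8 != 0x01:
        return None
    return {
        "request": REQUEST.get((code >> 4) & 0xF, "reserved"),
        "transaction": TRANSACTION.get((code >> 2) & 0x3, "reserved"),
        "level": LEVEL[code & 0x3],
    }


if __name__ == "__main__":
    # Bank 0 value from the original report; low 16 bits are 0x0136.
    print(decode_memory_hierarchy(0x9409C00000000136))
```

For 0x0136 the fields come out as RRRR = 0011 (data read), TT = 01 (data) and LL = 10 (L2).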

Felipe