2006-11-08 16:20:29

by martin f krafft

Subject: How to interpret MCE messages?

Thanks to mcelog, I am now regularly seeing messages like this on an
amd64 machine:

kernel: Machine check events logged
bit46 = corrected ecc error
Data cache ECC error (syndrome 5b)
memory/cache error 'data read mem transaction, data transaction, level 2'
ADDR 38ed9200
CPU 0 0 data cache TSC fe4f9128ade
MCE 0
STATUS 942dc00000000136 MCGSTATUS 0

The RAM modules are *not* ECC modules, nor does the Asus K8V Deluxe
motherboard support ECC to my knowledge. I've turned ECC support on
and off in the BIOS without any effect.

I've already run memtest86+ for hours without finding any problems,
and I've removed each of the two memory modules for a while, but
I still saw these errors appearing.

Before I go out and buy a new motherboard (as I assume that it's
a L1/L2 cache problem), I'd like to know how I am to interpret these
MCE dumps and how I could use them to actually pinpoint the source
of the problem.

Cheers,

--
martin; (greetings from the heart of the sun.)
\____ echo mailto: !#^."<*>"|tr "<*> mailto:" net@madduck

spamtraps: [email protected]

"america may be unique in being a country which has leapt
from barbarism to decadence without touching civilization."
-- john o'hara



2006-11-08 16:24:41

by Alan

Subject: Re: How to interpret MCE messages?

On Wed, 2006-11-08 at 17:20 +0100, martin f krafft wrote:
> Thanks to mcelog, I am now regularly seeing messages like this on an
> amd64 machine:
>
> kernel: Machine check events logged
> bit46 = corrected ecc error
> Data cache ECC error (syndrome 5b)

Cache.. not memory

> memory/cache error 'data read mem transaction, data transaction, level 2'

L2 Cache

> Before I go out and buy a new motherboard (as I assume that it's
> a L1/L2 cache problem),

L1/L2 cache are on the CPU these days. Double check with the processor
docs and vendor but I think mcelog is actually trying to tell you that
the CPU wants to be warranty returned. It might also of course be a heat
problem.
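
To make the reading above reproducible, here is a minimal sketch that decodes the STATUS value from the dump. The architectural flag bits (VAL, OVER, UC, EN, MISCV, ADDRV, PCC) and the memory-hierarchy error code layout (0000 0001 RRRR TTLL) follow the x86 machine check architecture. The position of the corrected-ECC bit (46) is taken from the mcelog output itself, and the syndrome field at bits 54:47 is an assumption that was only checked against this one dump. The function name `decode_status` and the label strings are made up for illustration and are not part of mcelog.

```python
# Sketch: decode an MCi_STATUS value as printed by mcelog on a K8 CPU.
# Bit 46 and the syndrome field (bits 54:47) are assumptions verified
# only against the dump quoted in this thread.

FLAGS = {
    63: "VAL (entry valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected)",
    60: "EN (reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
    46: "corrected ECC error",
}

# Fields of a memory-hierarchy error code: 0000 0001 RRRR TTLL
REQUEST = {
    0: "generic", 1: "generic read", 2: "generic write",
    3: "data read", 4: "data write", 5: "instruction fetch",
    6: "prefetch", 7: "evict", 8: "snoop",
}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic"}


def decode_status(status):
    """Return a dict describing the fields of a 64-bit STATUS value."""
    code = status & 0xFFFF
    result = {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "syndrome": (status >> 47) & 0xFF,
        "error_code": code,
    }
    if (code >> 8) == 0x01:  # memory-hierarchy (cache) error
        result["request"] = REQUEST.get((code >> 4) & 0xF, "unknown")
        result["transaction"] = TRANSACTION.get((code >> 2) & 0x3, "unknown")
        result["level"] = LEVEL[code & 0x3]
    return result


if __name__ == "__main__":
    info = decode_status(0x942DC00000000136)
    for key, value in info.items():
        print(key, "=", value)
```

For the value in the dump, the UC and PCC flags are clear and the corrected-ECC bit is set, so the hardware reports that it fixed the error itself. The low 16 bits (0x0136) decode to a data read, data transaction, level 2, which is the L2 cache reading given above.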


2006-11-08 23:12:53

by martin f krafft

Subject: Re: How to interpret MCE messages?

thus spoke Alan Cox <[email protected]> [2006.11.08.1729 +0100]:
> > memory/cache error 'data read mem transaction, data
> > transaction, level 2'
>
> L2 Cache

Gosh, I must be blind. Somehow there was too much information in
that dump. Thanks Alan!

> > Before I go out and buy a new motherboard (as I assume that it's
> > a L1/L2 cache problem),
>
> L1/L2 cache are on the CPU these days. Double check with the processor
> docs and vendor but I think mcelog is actually trying to tell you that
> the CPU wants to be warranty returned. It might also of course be a heat
> problem.

I am afraid the CPU might be out of warranty, but I'll try; I doubt
it's a heat problem since there are plenty of fans and the temperature
inside the case is quite moderate.

I'll check the CPU. Again, thanks.

--
martin; (greetings from the heart of the sun.)
\____ echo mailto: !#^."<*>"|tr "<*> mailto:" net@madduck

spamtraps: [email protected]

if you find a spelling mistake in the above, you get to keep it.



2006-11-09 04:11:27

by Valdis Klētnieks

Subject: Re: How to interpret MCE messages?

On Wed, 08 Nov 2006 17:20:22 +0100, martin f krafft said:

> The RAM modules are *not* ECC modules, nor does the Asus K8V Deluxe
> motherboard support ECC to my knowledge. I've turned ECC support on
> and off in the BIOS without any effect.

How odd. Is it considered normal to have a BIOS option to turn
ECC support on/off on a motherboard that doesn't support ECC?



2006-11-11 04:31:48

by Mark Rosenstand

Subject: Re: How to interpret MCE messages?

On Wed, 2006-11-08 at 23:11 -0500, [email protected] wrote:
> On Wed, 08 Nov 2006 17:20:22 +0100, martin f krafft said:
>
> > The RAM modules are *not* ECC modules, nor does the Asus K8V Deluxe
> > motherboard support ECC to my knowledge. I've turned ECC support on
> > and off in the BIOS without any effect.
>
> How odd. Is it considered normal to have a BIOS option to turn
> ECC support on/off on a motherboard that doesn't support ECC?

I think it does support ECC; at least that was my main argument for
getting a K8V-X (a less feature-bloated version, recommended by djb) two
years ago, as very few socket 754 boards supported it at that time (which
is extra weird, since it comes pretty much for free with K8 CPUs).

2006-11-15 09:27:32

by martin f krafft

Subject: Re: How to interpret MCE messages?

thus spoke Alan Cox <[email protected]> [2006.11.08.1729 +0100]:
> > Before I go out and buy a new motherboard (as I assume that it's
> > a L1/L2 cache problem),
>
> L1/L2 cache are on the CPU these days. Double check with the
> processor docs and vendor but I think mcelog is actually trying to
> tell you that the CPU wants to be warranty returned. It might also
> of course be a heat problem.

I've cleaned the fan and cooler and put a huge fan next to the open
case, blowing any heat out of it. I saw the errors again, even
without any load.

Thus I guess the CPU is asking for retirement. I am just
double-checking with you guys whether I can be sure that it's only
the CPU, or whether it could also be the fault of the motherboard...

Thanks,

--
martin; (greetings from the heart of the sun.)
\____ echo mailto: !#^."<*>"|tr "<*> mailto:" net@madduck

spamtraps: [email protected]

time wounds all heels.
-- groucho marx



2006-11-17 23:27:53

by dean gaudet

Subject: Re: How to interpret MCE messages?

On Wed, 15 Nov 2006, martin f krafft wrote:

> Thus I guess the CPU is asking for retirement. I am just
> double-checking with you guys whether I can be sure that it's only
> the CPU, or whether it could also be the fault of the motherboard...

could be VRMs and/or the PSU delivering unclean power... but you'd probably
see other errors in that case too.

-dean