2004-03-02 17:48:03

by Davi Leal

Subject: Linux 2.6.2, AMD kernel: MCE: The hardware reports a non fatal, correctable incident

What about this message? Note that the system works. I have not had to
reboot. What does the message below mean?

Do not hesitate to ask for more information.

Regards,
Davi Leal



Message from syslogd@AMD at Tue Mar 2 11:27:00 2004 ...
AMD kernel: MCE: The hardware reports a non fatal, correctable incident
occurred on CPU 0.

Message from syslogd@AMD at Tue Mar 2 11:27:00 2004 ...
AMD kernel: Bank 1: 9400400000000152
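
For reference, the Bank 1 value above can be unpacked by hand. The following is only a minimal sketch, assuming the generic IA-32 machine-check status layout (flag bits 63-57, MCA error code in bits 15-0). The function name `decode_mci_status` and the lookup tables are illustrative and not part of any kernel tool, and Athlon-specific bits are deliberately left undecoded.

```python
# Sketch: decode a machine-check bank status value (MCi_STATUS).
# Assumption: the generic architectural layout applies; model-specific
# bits (for example bits 56-32) are not interpreted here.

FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MCi_MISC holds extra information
    58: "ADDRV",  # MCi_ADDR holds the failing address
    57: "PCC",    # processor context may be corrupt
}

# Fields of a memory-hierarchy error code: 0000 0001 RRRR TTLL
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TYPE = ["instruction", "data", "generic", "reserved"]   # TT
LEVEL = ["L0", "L1", "L2", "generic"]                   # LL


def decode_mci_status(status):
    """Return a dict describing the generic fields of a status value."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {
        "flags": flags,
        "corrected": "UC" not in flags,
        "mca_code": code,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }
    if (code >> 8) == 0x01:  # memory-hierarchy error pattern
        result["kind"] = "memory hierarchy"
        result["request"] = REQUEST.get((code >> 4) & 0xF, "unknown")
        result["type"] = TYPE[(code >> 2) & 0x3]
        result["level"] = LEVEL[code & 0x3]
    return result


if __name__ == "__main__":
    for key, value in decode_mci_status(0x9400400000000152).items():
        print(key, "=", value)
```

Under these assumptions, 9400400000000152 has VAL, EN and ADDRV set and UC clear, which matches the "correctable" wording of the kernel message, and the error code 0152 reads as an instruction fetch (IRD) at the L2 level. Whether that points to real hardware trouble or to an erratum cannot be told from the value alone.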



$ cat /proc/version
Linux version 2.6.2 (root@AMD) (gcc version 3.3.3 20040125 (prerelease)
(Debian)) #1 Wed Feb 4 19:26:25 CET 2004

davi@AMD:/Compartida$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 10
model name : AMD Athlon(tm) XP 2400+
stepping : 0
cpu MHz : 2010.002
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips : 3956.73
$


2004-03-02 21:56:58

by Dave Jones

Subject: Re: Linux 2.6.2, AMD kernel: MCE: The hardware reports a non fatal, correctable incident

On Tue, Mar 02, 2004 at 07:00:16PM +0100, Davi Leal wrote:
> What about this message? Note that the system works. I have not had to
> reboot. What does the message below mean?
>

The original plan behind that option was to find hardware faults early,
but it seems to trigger a lot of false positives for various reasons.
Part of this problem is that MCEs can also be generated on some hardware
by doing something silly like reading from a reserved part of your
motherboard chipset.

There are also CPU errata that can cause them to falsely trigger in
some unusual cases, but unfortunately I've not had time to go through
the various errata datasheets to blacklist affected CPUs.

I'm toying with the idea of marking it CONFIG_BROKEN for 2.6,
and fixing it up later.

Dave

2004-03-03 07:58:28

by Philippe Elie

Subject: Re: Linux 2.6.2, AMD kernel: MCE: The hardware reports a non fatal, correctable incident

On Tue, 02 Mar 2004 at 21:55 +0000, Dave Jones wrote:

> On Tue, Mar 02, 2004 at 07:00:16PM +0100, Davi Leal wrote:
> > What about this message? Note that the system works. I have not had to
> > reboot. What does the message below mean?
> >
>
> The original plan behind that option was to find hardware faults early,
> but it seems to trigger a lot of false positives for various reasons.
> Part of this problem is that MCEs can also be generated on some hardware
> by doing something silly like reading from a reserved part of your
> motherboard chipset.
>
> There are also CPU errata that can cause them to falsely trigger in
> some unusual cases, but unfortunately I've not had time to go through
> the various errata datasheets to blacklist affected CPUs.
>
> I'm toying with the idea of marking it CONFIG_BROKEN for 2.6,
> and fixing it up later.

I'm unsure that's a good idea: it's broken only on broken HW. People who
want a stable box try to buy sane HW and don't enable CONFIG_BROKEN,
so they would never see whether their HW is starting to go out of spec.

Perhaps reword the option's help text and the error message to say it's
known to report false positives...


regards,
Phil

2004-03-04 04:02:58

by Joseph Fannin

Subject: Re: Linux 2.6.2, AMD kernel: MCE: The hardware reports a non fatal, correctable incident

On Tue, Mar 02, 2004 at 09:55:54PM +0000, Dave Jones wrote:
> On Tue, Mar 02, 2004 at 07:00:16PM +0100, Davi Leal wrote:
>> What about this message? Note that the system works. I have not had to
>> reboot. What does the message below mean?
>>
>
> The original plan behind that option was to find hardware faults early,
> but it seems to trigger a lot of false positives for various reasons.
> Part of this problem is that MCEs can also be generated on some hardware
> by doing something silly like reading from a reserved part of your
> motherboard chipset.

The MCE stuff truly did find a hardware fault early for me; my
Athlon system was MCE'ing and I ignored it, and later I got sig11
errors and fs corruption, which I finally traced to a failing stick
of memory.

> There are also CPU errata that can cause them to falsely trigger in
> some unusual cases, but unfortunately I've not had time to go through
> the various errata datasheets to blacklist affected CPUs.
>
> I'm toying with the idea of marking it CONFIG_BROKEN for 2.6,
> and fixing it up later.

I wouldn't be so quick to write off MCEs as bugs or errata,
especially if the exceptions have only just begun showing up.
Running CPUBurn, memtest86 and the like is still probably a good
idea, especially if you value the data on your file system.

--
Joseph Fannin

"Anyone who quotes me in their sig is an idiot." -- Rusty Russell.


2004-03-04 11:40:04

by Andi Kleen

Subject: Re: Linux 2.6.2, AMD kernel: MCE: The hardware reports a non fatal, correctable incident

Dave Jones writes:

> I'm toying with the idea of marking it CONFIG_BROKEN for 2.6,
> and fixing it up later.

I would actually suggest switching i386 over to the rewritten MCE handler
from x86-64 too. IMHO it is much better. It is race free, does not panic
the machine unless needed, is CPU independent, follows the Intel and AMD
recommendations, is configurable at run time through sysfs, logs to a
separate device and does lots of other things much better
[of course I'm a bit biased on that]. The disadvantage is that it isn't
as well tested.

I haven't tried it on i386, but I wrote it to be easily portable
to 32-bit too. It does periodic MCE checks too, but at a much lower
frequency, and they can be turned off. I'm considering turning
them off for x86-64 too, because they seem to log only single-bit
ECC errors all the time. But with the new separate log device it's much
less of a problem.

The only thing you would lose is the support for P5 MCEs, but these
could be re-added relatively easily if that should be a problem.
Extended register logging for P4 is also not implemented right now,
but that hardly seems like a needed feature.

-Andi

2004-03-04 15:02:49

by David Weinehall

Subject: Re: Linux 2.6.2, AMD kernel: MCE: The hardware reports a non fatal, correctable incident

On Thu, Mar 04, 2004 at 12:39:59PM +0100, Andi Kleen wrote:
> Dave Jones writes:
>
> > I'm toying with the idea of marking it CONFIG_BROKEN for 2.6,
> > and fixing it up later.
>
> I would actually suggest switching i386 over to the rewritten MCE handler
> from x86-64 too. IMHO it is much better. It is race free, does not panic
> the machine unless needed, is CPU independent, follows the Intel and AMD
> recommendations, is configurable at run time through sysfs, logs to a
> separate device and does lots of other things much better
> [of course I'm a bit biased on that]. The disadvantage is that it isn't
> as well tested.

Well, the only way to solve that problem is to test it, right? And what
better way to test it than to switch i386 over to it too :-)

> I haven't tried it on i386, but I wrote it to be easily portable
> to 32-bit too. It does periodic MCE checks too, but at a much lower
> frequency, and they can be turned off. I'm considering turning
> them off for x86-64 too, because they seem to log only single-bit
> ECC errors all the time. But with the new separate log device it's much
> less of a problem.
>
> The only thing you would lose is the support for P5 MCEs, but these
> could be re-added relatively easily if that should be a problem.

Well, losing functionality would be bad.

> Extended register logging for P4 is also not implemented right now,
> but that hardly seems like a needed feature.

No opinion here.


Regards: David Weinehall
--
/) David Weinehall <[email protected]> /) Northern lights wander (\
// Maintainer of the v2.0 kernel // Dance across the winter sky //
\) http://www.acc.umu.se/~tao/ (/ Full colour fire (/