2005-04-30 20:20:32

by Andy Lutomirski

Subject: [x86_64] how worried should I be about MCEs?

Every now and then, after rebooting, the kernel notices some MCEs.
Should I be worried about this?

(mcelog attached)

Thanks,
Andy


MCE 0
CPU 0 0 data cache from boot or resume
ADDR 480b0c84df48
Data cache ECC error (syndrome c8)
bit46 = corrected ecc error
bit57 = processor context corrupt
bit61 = error uncorrected
bit62 = error overflow (multiple errors)
STATUS f66440000000438d MCGSTATUS 0
MCE 1
CPU 0 1 instruction cache from boot or resume
ADDR 75e2bb87ec57f8e0
Instruction cache ECC error
bit32 = err cpu0
bit33 = err cpu1
bit35 = res3
bit43 = res11
bit45 = uncorrected ecc error
bit46 = corrected ecc error
bit55 = res23
bit56 = res24
bit57 = processor context corrupt
bit59 = misc error valid
bit61 = error uncorrected
bit62 = error overflow (multiple errors)
STATUS ffe4681bd0e45d81 MCGSTATUS 0
MCE 2
CPU 0 3 load/store unit from boot or resume
MISC 8005003b8005003b
bit57 = processor context corrupt
bit59 = misc error valid
bit61 = error uncorrected
bit62 = error overflow (multiple errors)
STATUS fa0000000000d0c5 MCGSTATUS 0
MCE 3
CPU 0 4 northbridge from boot or resume
ADDR 102000020
Northbridge ECC error
ECC syndrome = 0
bit32 = err cpu0
bit33 = err cpu1
bit40 = error found by scrub
bit45 = uncorrected ecc error
bit57 = processor context corrupt
bit61 = error uncorrected
STATUS b600215300001e0f MCGSTATUS 0


2005-04-30 20:44:24

by bert hubert

Subject: Re: [x86_64] how worried should I be about MCEs?

On Sat, Apr 30, 2005 at 01:16:49PM -0700, Andy Lutomirski wrote:
> Every now and then, after rebooting, the kernel notices some MCEs.
> Should I be worried about this?

If these reports are true, they would be worrying. But I find them a bit
hard to believe - the bit combinations don't appear to make sense.

I have an AMD64 machine which logs 'MCE reported' every once in a while but
otherwise functions perfectly and I haven't yet coaxed it into telling me
the content of the errors.

Might there be a bug here? How did you create this log?

> STATUS f66440000000438d MCGSTATUS 0
> MCE 1
> CPU 0 1 instruction cache from boot or resume
> ADDR 75e2bb87ec57f8e0
> Instruction cache ECC error
> bit32 = err cpu0
> bit33 = err cpu1
> bit35 = res3
> bit43 = res11
> bit45 = uncorrected ecc error
> bit46 = corrected ecc error
> bit55 = res23
> bit56 = res24
> bit57 = processor context corrupt
> bit59 = misc error valid
> bit61 = error uncorrected
> bit62 = error overflow (multiple errors)

This would be one hell of an error - both corrected and uncorrected.
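
The STATUS value quoted above can be unpacked by hand to confirm what mcelog printed. This is a minimal Python sketch, assuming the bit numbers in the dump are plain bit positions in the 64-bit STATUS value. The bit names are copied from the quoted output; `set_bits` and `describe` are hypothetical helper names, not part of mcelog.

```python
# Bit names copied from the mcelog output quoted in this thread.
# Only the bits that mcelog labelled for every bank are listed here.
STATUS_BITS = {
    45: "uncorrected ecc error",
    46: "corrected ecc error",
    57: "processor context corrupt",
    59: "misc error valid",
    61: "error uncorrected",
    62: "error overflow (multiple errors)",
}


def set_bits(status):
    """Return the positions of all bits set in a 64-bit STATUS value."""
    return [bit for bit in range(64) if (status >> bit) & 1]


def describe(status):
    """Return (bit, name) pairs for the set bits that have a known name."""
    return [(bit, STATUS_BITS[bit]) for bit in set_bits(status) if bit in STATUS_BITS]


if __name__ == "__main__":
    # STATUS of "MCE 1" from the dump.
    for bit, name in describe(0xFFE4681BD0E45D81):
        print(f"bit{bit} = {name}")
```

For the STATUS of MCE 1 this reports bit45 and bit46 together, which is the contradictory pair noted above.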

Regards,

bert

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2005-04-30 21:02:13

by Andy Lutomirski

Subject: Re: [x86_64] how worried should I be about MCEs?

bert hubert wrote:
> On Sat, Apr 30, 2005 at 01:16:49PM -0700, Andy Lutomirski wrote:
>
>>Every now and then, after rebooting, the kernel notices some MCEs.
>>Should I be worried about this?
>
>
> If these reports are true, they would be worrying. But I find them a bit
> hard to believe - the bit combinations don't appear to make sense.

True.

>
> I have an AMD64 machine which logs 'MCE reported' every once in a while but
> otherwise functions perfectly and I haven't yet coaxed it into telling me
> the content of the errors.
>
> Might there be a bug here? How did you create this log?

This is from mcelog 0.3, dumped with a daily cron job to
/var/log/mcelog. I think it came from 2.6.11-gentoo-r6 (which should be
essentially 2.6.11.7).

The machine is Athlon 64 3200+ (754), on an MSI K8T Neo-FIS2R, running a
moderately old BIOS but one that has erratum #93 (or whatever it was) fixed.

Anything I should attach to provide more info?

I just upgraded to mcelog-0.4, but at this rate I don't expect a new
dump for a while.

Thanks,
Andy

2005-04-30 21:47:29

by bert hubert

Subject: possibly bogus AMD64 MCE reporting.

On Sat, Apr 30, 2005 at 02:02:04PM -0700, Andy Lutomirski wrote:

> Anything I should attach to provide more info?
>
> I just upgraded to mcelog-0.4, but at this rate I don't expect a new
> dump for awhile.

I'll investigate the MCE reports from my Opteron machine in the coming days
and report back if they are bogus as well.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2005-05-02 16:53:43

by Andi Kleen

Subject: Re: [x86_64] how worried should I be about MCEs?

Andy Lutomirski writes:

> Every now and then, after rebooting, the kernel notices some
> MCEs. Should I be worried about this?


>
> (mcelog attached)
>
> Thanks,
> Andy
>
>
> MCE 0
> CPU 0 0 data cache from boot or resume
> ADDR 480b0c84df48
> Data cache ECC error (syndrome c8)

These are harmless. I have one machine that generates them too.
I think they happen because the BIOS either does something
incorrectly while POSTing the CPU, or these are
expected and it forgets to clear them. Only a few BIOSes
seem to do it, so it is probably a BIOS bug.

You see them because the MCE code logs boot MCEs now.
That is because the only way to log MCEs that
cause the system to reboot is to log them after the reboot.

Some of the bit combinations are clearly nonsensical, like
corrected ECC error with error uncorrected, and the address
is bogus.
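
The contradiction can be checked mechanically across all four STATUS values in the dump. This is a rough Python sketch, assuming bit45, bit46 and bit61 carry the meanings mcelog printed for them; `contradictions` is a hypothetical helper and not anything in the kernel MCE code.

```python
# Bit positions and meanings as printed by mcelog in the dump.
UNCORRECTED_ECC = 1 << 45    # "uncorrected ecc error"
CORRECTED_ECC = 1 << 46      # "corrected ecc error"
ERROR_UNCORRECTED = 1 << 61  # "error uncorrected"


def contradictions(status):
    """List the mutually exclusive claims made by one STATUS value."""
    found = []
    if (status & CORRECTED_ECC) and (status & UNCORRECTED_ECC):
        found.append("corrected and uncorrected ecc error")
    if (status & CORRECTED_ECC) and (status & ERROR_UNCORRECTED):
        found.append("corrected ecc error with error uncorrected")
    return found


# STATUS values of MCE 0 to MCE 3, copied from the dump.
STATUSES = {
    0: 0xF66440000000438D,
    1: 0xFFE4681BD0E45D81,
    2: 0xFA0000000000D0C5,
    3: 0xB600215300001E0F,
}

if __name__ == "__main__":
    for number, status in STATUSES.items():
        problems = contradictions(status)
        print(f"MCE {number}: {'; '.join(problems) if problems else 'no contradiction'}")
```

Under these assumptions MCE 0 and MCE 1 are flagged, while MCE 2 and MCE 3 are not, so a check on bit combinations alone would not weed out every record.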

I have been pondering adding a filter to remove
these bogus MCEs, but I have not come up with
a good heuristic yet. Perhaps ignore all MCEs at resume
with addresses that are beyond the physical memory.
But that would not have caught the last one.
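
The address heuristic could look roughly like the Python sketch below. It is only an illustration: the thread does not state how much memory the machine has, so `ASSUMED_PHYS_MEM` is a made-up value, `looks_bogus` is a hypothetical name, and treating physical memory as one contiguous range starting at 0 is a simplification.

```python
# HYPOTHETICAL memory size, chosen only for illustration; the real
# amount of RAM in the machine is not given in the thread.
ASSUMED_PHYS_MEM = 8 * 2**30  # 8 GiB


def looks_bogus(addr, from_boot, phys_mem_bytes=ASSUMED_PHYS_MEM):
    """Heuristic: a boot/resume MCE whose ADDR lies beyond physical memory.

    addr is None when the record carried no ADDR line.
    """
    if not from_boot or addr is None:
        return False
    return addr >= phys_mem_bytes


# ADDR values of MCE 0 to MCE 3, copied from the dump (MCE 2 has no ADDR).
# All four records are marked "from boot or resume".
ADDRS = {
    0: 0x480B0C84DF48,
    1: 0x75E2BB87EC57F8E0,
    2: None,
    3: 0x102000020,
}

if __name__ == "__main__":
    for number, addr in ADDRS.items():
        verdict = "filtered" if looks_bogus(addr, from_boot=True) else "kept"
        print(f"MCE {number}: {verdict}")
```

With the assumed 8 GiB limit, MCE 0 and MCE 1 are filtered, but MCE 3 (ADDR 102000020, just above 4 GiB) is kept, and MCE 2 has no address to test at all.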

-Andi

[intentional full quote for Mark]

> bit46 = corrected ecc error
> bit57 = processor context corrupt
> bit61 = error uncorrected
> bit62 = error overflow (multiple errors)
> STATUS f66440000000438d MCGSTATUS 0
> MCE 1
> CPU 0 1 instruction cache from boot or resume
> ADDR 75e2bb87ec57f8e0
> Instruction cache ECC error
> bit32 = err cpu0
> bit33 = err cpu1
> bit35 = res3
> bit43 = res11
> bit45 = uncorrected ecc error
> bit46 = corrected ecc error
> bit55 = res23
> bit56 = res24
> bit57 = processor context corrupt
> bit59 = misc error valid
> bit61 = error uncorrected
> bit62 = error overflow (multiple errors)
> STATUS ffe4681bd0e45d81 MCGSTATUS 0
> MCE 2
> CPU 0 3 load/store unit from boot or resume
> MISC 8005003b8005003b
> bit57 = processor context corrupt
> bit59 = misc error valid
> bit61 = error uncorrected
> bit62 = error overflow (multiple errors)
> STATUS fa0000000000d0c5 MCGSTATUS 0
> MCE 3
> CPU 0 4 northbridge from boot or resume
> ADDR 102000020
> Northbridge ECC error
> ECC syndrome = 0
> bit32 = err cpu0
> bit33 = err cpu1
> bit40 = error found by scrub
> bit45 = uncorrected ecc error
> bit57 = processor context corrupt
> bit61 = error uncorrected
> STATUS b600215300001e0f MCGSTATUS 0