LinuxLists.cc - I do not know if this is the correct place to ask about this but...

2011-02-08 09:32:12

by db

Subject: I do not know if this is the correct place to ask about this but...

I do not know if this is the correct place to ask about this but...
I have only seen the following output twice, and both times I was
running a 2.6.37 kernel.

[152399.816058] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: no, CPU context corrupt: no, CECC Error
[152399.816075] [Hardware Error]: Northbridge Error, node 0: , core: 1L3 ECC data cache error.
[152399.816086] [Hardware Error]: Transaction: RD, Type: GEN, Cache Level: L3/GEN
[152399.816092] Disabling lock debugging due to kernel taint
[152399.816099] [Hardware Error]: Machine check events logged

I assume it is just a coincidence. Also, I am not exactly sure what
the message "means". (Yes, I can read the text, but I haven't found
good documentation that describes its impact.) Note: I submitted a
bug[0] regarding 'the output' the first time this occurred.

[0] - https://bugzilla.kernel.org/show_bug.cgi?id=27332

2011-02-08 09:51:41

by Clemens Ladisch

Subject: Re: I do not know if this is the correct place to ask about this but...

dave b wrote:
> I do not know if this is the correct place to ask about this but...
>
> [Hardware Error]:

This is a hardware error that was detected by the kernel.

> I have only seen the following output twice
>
> ... Corrected error
> ... L3 ECC data cache error.

There was a wrong bit in your CPU's level 3 cache, but with the help of
the redundant error correction bits, this was caught and corrected.
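
To make "caught and corrected" concrete, here is a toy sketch of
single-bit correction using a Hamming(7,4) code. This is an assumption
for illustration only: the real L3 ECC uses a wider code over larger
words, and the function names `hamming_encode` and `hamming_syndrome`
are hypothetical. The point is that the check bits let the hardware
compute the position of a flipped bit and flip it back.

```shell
#!/bin/sh
# Toy Hamming(7,4) sketch; NOT the code the CPU actually uses.

# Encode four data bits d1..d4 into a 7-bit codeword laid out as
# p1 p2 d1 p4 d2 d3 d4 (positions 1..7).
hamming_encode() {
    p1=$(( $1 ^ $2 ^ $4 ))
    p2=$(( $1 ^ $3 ^ $4 ))
    p4=$(( $2 ^ $3 ^ $4 ))
    echo "$p1 $p2 $1 $p4 $2 $3 $4"
}

# Given seven codeword bits, print the syndrome: 0 means no error,
# otherwise it is the 1-based position of the single flipped bit.
hamming_syndrome() {
    s1=$(( $1 ^ $3 ^ $5 ^ $7 ))
    s2=$(( $2 ^ $3 ^ $6 ^ $7 ))
    s4=$(( $4 ^ $5 ^ $6 ^ $7 ))
    echo $(( s1 + 2 * s2 + 4 * s4 ))
}
```

For example, `hamming_encode 1 0 1 1` yields `0 1 1 0 0 1 1`; if bit 5
is flipped in storage, `hamming_syndrome 0 1 1 0 1 1 1` points at
position 5, so that bit can be inverted to restore the data.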

(If possible, enable background scrubbing of the caches in the BIOS ECC
settings to catch these errors earlier.)

If this is an overclocked CPU or one with an unlocked core, you deserve
what you got. Otherwise, if this happens repeatedly, it indicates
a hardware defect, and your warranty should cover this.
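
Whether this "happens repeatedly" can be checked by counting the
corrected-error lines in the kernel log. A minimal sketch, assuming the
message wording shown in this thread (other kernel versions may phrase
it differently); `count_corrected_errors` is a hypothetical helper name.

```shell
#!/bin/sh
# Count MC4 corrected-error reports in a kernel log read from stdin.
# Assumption: each event logs one "MC4_STATUS: Corrected error" line,
# as in the 2.6.37 output quoted in this thread.
count_corrected_errors() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean in that case.
    grep -c 'MC4_STATUS: Corrected error' || true
}
```

Usage would be something like `dmesg | count_corrected_errors`. Note
that dmesg only covers the current boot, so a persistent log file is a
better input for tracking this over weeks.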

Regards,
Clemens

2011-02-08 10:00:50

by Borislav Petkov

Subject: Re: I do not know if this is the correct place to ask about this but...

On Tue, Feb 08, 2011 at 08:31:50PM +1100, dave b wrote:
> I do not know if this is the correct place to ask about this but...
> I have only seen the following output twice, and both times I was
> running a 2.6.37 kernel.
>
> [152399.816058] [Hardware Error]: MC4_STATUS: Corrected error, other
> errors lost: no, CPU context corrupt: no, CECC Error
> [152399.816075] [Hardware Error]: Northbridge Error, node 0: , core:
> 1L3 ECC data cache error.
> [152399.816086] [Hardware Error]: Transaction: RD, Type: GEN, Cache
> Level: L3/GEN
> [152399.816092] Disabling lock debugging due to kernel taint
> [152399.816099] [Hardware Error]: Machine check events logged
>
> I assume it is just a coincidence. Also, I am not exactly sure what
> the message "means". (Yes, I can read the text, but I haven't found
> good documentation that describes its impact.) Note: I submitted a
> bug[0] regarding 'the output' the first time this occurred.

This is an L3 cache correctable error, on an AMD F10h machine I'd guess.

You could install x86info from
http://codemonkey.org.uk/projects/x86info/ and run the following as root:

for i in $(seq 0 3); do echo -e "\nCPU$i:"; lsmsr -c $i -a; done > lsmsr.log

[ $(seq 0 3) assumes you have 4 cores; adjust it according to your
machine. Also, you need msr.ko module support, i.e. CONFIG_X86_MSR in
your kernel .config. ]

and send me the lsmsr.log file to check whether there is some more info
about the L3 error.

If you don't have msr.ko support (i.e. CONFIG_X86_MSR is not set to y
or m in your config), that tool won't help. In that case, I'd suggest
you upgrade your kernel to 2.6.38-rc4, which is stable enough, enable
CONFIG_X86_MSR and catch the error again. Then rerun the small bash
one-liner above.
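
The checks described above can be sketched as a few shell helpers. This
is a rough sketch, not part of x86info: the function names
`msr_config_state`, `last_cpu_index` and `dump_msrs` are hypothetical,
and it assumes CPU indices are contiguous from 0 and that `nproc`
reports the number of cores you want to dump.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of x86info.

# Read a kernel .config on stdin and print how CONFIG_X86_MSR is set:
# "builtin" (=y), "module" (=m), or "missing" (unset or absent).
msr_config_state() {
    case "$(grep -E '^CONFIG_X86_MSR=' | head -n 1)" in
        CONFIG_X86_MSR=y) echo builtin ;;
        CONFIG_X86_MSR=m) echo module ;;
        *) echo missing ;;
    esac
}

# Cores are numbered from 0, so N cores means indices 0..N-1.
last_cpu_index() {
    echo $(( $1 - 1 ))
}

# Dump the MSRs of every CPU, like the one-liner above, but sizing the
# loop with nproc instead of a hard-coded "seq 0 3".
# Must run as root with the msr driver available.
dump_msrs() {
    for i in $(seq 0 "$(last_cpu_index "$(nproc)")"); do
        printf '\nCPU%s:\n' "$i"
        lsmsr -c "$i" -a
    done
}
```

As root, something like `msr_config_state < /boot/config-$(uname -r)`
shows whether the option is there (the config path is an assumption and
varies by distribution). If it prints "module", run `modprobe msr`
first, then `dump_msrs > lsmsr.log`.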

That should be all for now; feel free to ask questions should anything
be unclear.

Thanks.

--
Regards/Gruss,
Boris.

2011-02-08 10:03:54

by db

Subject: Re: I do not know if this is the correct place to ask about this but...

On 8 February 2011 20:52, Clemens Ladisch <[email protected]> wrote:
> dave b wrote:
>> I do not know if this is the correct place to ask about this but...
>>
>> [Hardware Error]:
>
> This is a hardware error that was detected by the kernel.
>
>> I have only seen the following output twice
>>
>> ... Corrected error
>> ... L3 ECC data cache error.
>
> There was a wrong bit in your CPU's level 3 cache, but with the help of
> the redundant error correction bits, this was caught and corrected.

Yep I got that.

> (If possible, enable background scrubbing of the caches in the BIOS ECC
> settings to catch these errors earlier.)

Ok I will look into that.

> If this is an overclocked CPU or one with an unlocked core, you deserve
> what you got. Otherwise, if this happens repeatedly, it indicates
> a hardware defect, and your warranty should cover this.

Ok fair enough.

Notes:
The CPU is not overclocked.
I forgot to list the hardware specifications:
the CPU is an AMD Phenom(tm) II X6 1055T Processor
running on an ASUS M4A88TD-M motherboard.