Hello,
I've just got the following on an AMD A10 5800K:
------
[ 8395.999581] [Hardware Error]: CPU:0 MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151
[ 8395.999586] [Hardware Error]: MC1_ADDR: 0x0000ffffa00e1203
[ 8395.999588] [Hardware Error]: Instruction Cache Error: Parity error during data load from IC.
[ 8395.999590] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
------
Kernel is 3.6.5, MB is an Asus F2A85-M with BIOS 5103 (the latest).
Can someone enlighten me about what might be wrong with my (new) system
(memtest didn't show any errors)?
Which IC is meant? As far as I know, this processor doesn't support ECC,
so I wonder where that parity error comes from.
Regards,
Alexander
Am 02.11.2012 11:50, schrieb Alexander Holler:
> Hello,
>
> I've just got the following on an AMD A10 5800K:
>
> ------
> [ 8395.999581] [Hardware Error]: CPU:0
> MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151
> [ 8395.999586] [Hardware Error]: MC1_ADDR: 0x0000ffffa00e1203
> [ 8395.999588] [Hardware Error]: Instruction Cache Error: Parity error
> during data load from IC.
> [ 8395.999590] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
> ------
>
> Kernel is 3.6.5, MB is an Asus F2A85-M with BIOS 5103 (the latest).
>
> Can someone enlight me about what might be wrong with my (new) system
> (memtest didn't show an errors)?
>
> What IC is meant? As far as I know, this processor doesn't support ECC,
> so I wonder where that parity error does come from.
I assume IC means Instruction Cache. ;)
As the kernel didn't reboot or halt, this seems to have been a
correctable error.
Which leads me to another question. I have mcelog running, but it
doesn't seem to receive the error. With my previous Intel hardware and
an older kernel, mcelog received MCE errors (trip temperature). But
since the kernel now decodes those messages itself, that doesn't seem to
happen anymore. mcelog is silent, but I've now seen the above message on
all my consoles.
So now I have two questions:
- First, whether the error is something I should ask AMD about.
- Second, whether the kernel could mention that it is a recoverable
error. And if so, and if such errors are nothing to panic about (e.g.
because it isn't unusual to receive them), whether the kernel could
output that message with another priority.
Regards,
Alexander
On Fri, Nov 02, 2012 at 02:53:45PM +0100, Alexander Holler wrote:
> Am 02.11.2012 11:50, schrieb Alexander Holler:
> >Hello,
> >
> >I've just got the following on an AMD A10 5800K:
> >
> >------
> >[ 8395.999581] [Hardware Error]: CPU:0
> >MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151
> >[ 8395.999586] [Hardware Error]: MC1_ADDR: 0x0000ffffa00e1203
> >[ 8395.999588] [Hardware Error]: Instruction Cache Error: Parity error
> >during data load from IC.
> >[ 8395.999590] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
> >------
> >
> >Kernel is 3.6.5, MB is an Asus F2A85-M with BIOS 5103 (the latest).
> >
> >Can someone enlight me about what might be wrong with my (new) system
> >(memtest didn't show an errors)?
> >
> >What IC is meant? As far as I know, this processor doesn't support ECC,
> >so I wonder where that parity error does come from.
>
> I assume IC means Instruction Cache. ;)
It says so earlier in the sentence: "Instruction Cache Error" :)
> As the kernel didn't reboot or halt, this seems to have been a
> correctable error.
Yes, it is (the "CE" bit in MC1_STATUS). Btw, I have reworked this code
to spit human-readable information first. It also says what the error
severity is now.
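
To make the flag string in the log easier to read, here is a minimal
Python sketch that rebuilds it from the raw MC1_STATUS value. The bit
positions and the field order are assumptions based on the AMD MCA
status layout as the kernel decoder of that era appears to print it
(the last two fields only exist on family 15h parts such as this APU);
check them against the BKDG before relying on them.

```python
# Assumed bit positions in MCi_STATUS (not taken from this thread;
# verify against the AMD BKDG for the CPU family in question).
BIT_OVER = 1 << 62      # an earlier error was overwritten
BIT_UC = 1 << 61        # uncorrected error; clear means corrected (CE)
BIT_MISCV = 1 << 59     # MCi_MISC holds valid data
BIT_ADDRV = 1 << 58     # MCi_ADDR holds a valid address
BIT_PCC = 1 << 57       # processor context corrupt
BIT_DEFERRED = 1 << 44  # family 15h: deferred error
BIT_POISON = 1 << 43    # family 15h: poisoned data consumed


def decode_status_flags(status: int) -> str:
    """Return the flag string in the order the log above shows it."""
    fields = [
        "Over" if status & BIT_OVER else "-",
        "UE" if status & BIT_UC else "CE",
        "MiscV" if status & BIT_MISCV else "-",
        "PCC" if status & BIT_PCC else "-",
        "AddrV" if status & BIT_ADDRV else "-",
        "Deferred" if status & BIT_DEFERRED else "-",
        "Poison" if status & BIT_POISON else "-",
    ]
    # Bits 46:45 are read together: 2 means corrected ECC, else uncorrected.
    ecc = (status >> 45) & 0x3
    if ecc:
        fields.append("CECC" if ecc == 2 else "UECC")
    return "|".join(fields)


print(decode_status_flags(0x8c00002000010151))  # -|CE|MiscV|-|AddrV|-|-
```

With bit 61 clear, the second field comes out as "CE", which is the
"corrected" indication referred to above.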
> Which leads me to another question. I have mcelog running, but it
> doesn't seem to receive the error. With my previous Intel-HW and an
> older kernel, mcelog received MCE errors (trip temperatur). But
> since the kernel now decodes those message themself, that doesn't
> seem to happen anymore. mcelog is silent, but now I've seen the
> above message on all my consoles.
Yes, AMD doesn't use mcelog.
> So now I have two question:
>
> - First, if the error is something I should ask AMD about,
Not really. It is a single bit flip which got corrected; simply watch
out for more of those.
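
One way to "watch out" is to tally the decoded MCE lines in saved dmesg
output. The following is only a sketch: the regular expression is an
assumption derived from the log format shown in this thread, and the
function name is hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151".
# The pattern is an assumption based on the log lines quoted in this thread.
STATUS_RE = re.compile(r"MC(\d+)_STATUS\[([^\]]*)\]:\s*(0x[0-9a-fA-F]+)")


def count_mce_by_bank(log_text: str) -> Counter:
    """Count decoded MCEs per (bank, severity), severity being CE or UE."""
    counts = Counter()
    for match in STATUS_RE.finditer(log_text):
        bank = int(match.group(1))
        flags = match.group(2).split("|")
        severity = "UE" if "UE" in flags else "CE"
        counts[(bank, severity)] += 1
    return counts


sample = """\
[ 8395.999581] [Hardware Error]: CPU:0
MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151
[ 8395.999586] [Hardware Error]: MC1_ADDR: 0x0000ffffa00e1203
"""

for (bank, severity), n in sorted(count_mce_by_bank(sample).items()):
    print(f"bank {bank}: {n} x {severity}")
```

A growing count of corrected errors in the same bank would be the
thing to look out for; a single event is not.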
> - Second, if the kernel could mention that it is an recoverable
> error. And if so and if such errors aren't something to get panic
> (e.g. it isn't unusual to receive such), if the kernel could output
> that message with another priority.
As I said above, it got corrected. If it were critical, it would've
either panicked or you wouldn't have seen it at all (probably only
after reboot).
HTH.
--
Regards/Gruss,
Boris.
Am 03.11.2012 05:49, schrieb Borislav Petkov:
> On Fri, Nov 02, 2012 at 02:53:45PM +0100, Alexander Holler wrote:
>> Am 02.11.2012 11:50, schrieb Alexander Holler:
>>> Hello,
>>>
>>> I've just got the following on an AMD A10 5800K:
>>>
>>> ------
>>> [ 8395.999581] [Hardware Error]: CPU:0
>>> MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151
>>> [ 8395.999586] [Hardware Error]: MC1_ADDR: 0x0000ffffa00e1203
>>> [ 8395.999588] [Hardware Error]: Instruction Cache Error: Parity error
>>> during data load from IC.
>>> [ 8395.999590] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
>>> ------
>>>
>>> Kernel is 3.6.5, MB is an Asus F2A85-M with BIOS 5103 (the latest).
>>>
...
>> So now I have two question:
>>
>> - First, if the error is something I should ask AMD about,
>
> Not really, it is a single bit flip which got corrected, simply watch
> out if you get more of those.
>
>> - Second, if the kernel could mention that it is an recoverable
>> error. And if so and if such errors aren't something to get panic
>> (e.g. it isn't unusual to receive such), if the kernel could output
>> that message with another priority.
>
> As I said above, it got corrected. If it were critical, it would've
> either panicked or you wouldnt've seen it at all (probably only after
> reboot).
Hmm, exactly that just happened twice in a row. Unfortunately the screen
was already blanked (screen saver mode), so I couldn't see any message,
if there was one. Just a dead box (not overclocked, I don't do that; I
had even enabled the power saving mode in the BIOS, which seems to mean
max. 3800 MHz). I think I should start getting nervous. :(
What I meant by another priority is using something other than
pr_emerg(), because pr_emerg() causes the message to be displayed on
every console, at least on my F17 with default settings.
Of course, I'm happy it was displayed using pr_emerg(), so I didn't
miss it. Now I know that even if ECC isn't available to users who
don't want or need power-hungry and loud servers, at least some parity
is used to verify operations on the internal memory (cache).
On the other hand, if that message isn't really critical, something
other than pr_emerg() should be used.
Thanks for the answer.
Regards,
Alexander
On Sat, Nov 03, 2012 at 11:45:25AM +0100, Alexander Holler wrote:
> Hmm, exactly that just happened twice in a row. Unfortunately the
> screen was already disabled (screen saving mode), so I couldn't see
> any message, if there was any. Just a dead box (not overclocked, I
> don't do such, I even had enabled the power saving mode in the BIOS,
> which seems to mean max. 3800 MHz). I think I should start getting
> nervous. :(
How do you know this happened twice if you couldn't see any message?
Also, can you enable netconsole or serial console, if possible, and try
to catch full dmesg from the boot and up until it happens.
Also, catch the dmesg of the box on the next time it reboots after the
freeze (btw, try doing a warm reboot because this is the only way we can
preserve valid error information) - if all works ok, it should decode
the MCE before the freeze (if the MCE caused it and it actually is the
reason for the freeze).
> What I meant with another priority is using something else than
> pr_emerg(), because pr_emerge() causes the message to become
> displayed on every console, at least on my F17 with default
> settings.
This is needed because we want those errors to actually be seen.
Thanks.
--
Regards/Gruss,
Boris.
Am 04.11.2012 16:21, schrieb Borislav Petkov:
> On Sat, Nov 03, 2012 at 11:45:25AM +0100, Alexander Holler wrote:
>> Hmm, exactly that just happened twice in a row. Unfortunately the
>> screen was already disabled (screen saving mode), so I couldn't see
>> any message, if there was any. Just a dead box (not overclocked, I
>> don't do such, I even had enabled the power saving mode in the BIOS,
>> which seems to mean max. 3800 MHz). I think I should start getting
>> nervous. :(
>
> How do you know this happened twice if you couldn't see any message?
I was logged in remotely, and there aren't that many faults which lead
to a complete standstill of the hardware (no reset).
But as you said, I can't know. The only thing I know is that a box with
a new motherboard, memory and APU came to a complete standstill, and
did so shortly after I received an emergency message which told me that
a bit inside the CPU had flipped unexpectedly. In addition, the box was
doing the same thing as when it received the MCE: a backup from a
SATA-attached SSD to a USB3 hard disk, which includes compression and
encryption and keeps all cores busy most of the time for several hours.
> Also, can you enable netconsole or serial console, if possible, and try
> to catch full dmesg from the boot and up until it happens.
As I was logged in remotely over the network, I know it wasn't the same
MCE as before (just a disconnect and dead hardware). But I don't know
what else it was. And as I was recently hit by a broken RAM module,
which was a pain to find, I'm not very happy that I have to go through
similar pain again with new hardware.
The probability of getting a working hardware and software combination
has just become too low in recent years. All the (IT) companies would
do better to spend the money they now give their lawyers on their QA
and engineering departments instead.
Sorry for the rant. Although I'm used to living with hardware and
software errors (as a software developer), I'm currently just a bit
annoyed. ;)
I will set up something to monitor the box through the serial console
and will let it back itself up all the time, trying to catch some
useful information.
Regards,
Alexander
On Sun, Nov 04, 2012 at 06:19:32PM +0100, Alexander Holler wrote:
> I was remotely logged in and there aren't that many faults which
> lead to complete stand still of hw (no reset).
Right, can you retry triggering the freeze without the fglrx driver?
Simply remove it completely so that even the possibility to load it is
not there.
> But as you said I can't know, the only thing I know is that a box
> with new mb, memory and apu come to a complete stand still, and
> such shortly after I've received an emergency message which told me
> that a bit inside the cpu switched unexpected. Adding to that, the
> box did the same as what it did while it received the MCE, a backup
> from a sata-atached ssd to an usb3-hd which includes compression and
> encryption which keeps all cores at work most of the time for several
> hours.
So do you get that MCE each time you execute that same workload?
Thanks.
--
Regards/Gruss,
Boris.
Am 06.11.2012 10:10, schrieb Borislav Petkov:
> On Sun, Nov 04, 2012 at 06:19:32PM +0100, Alexander Holler wrote:
>> I was remotely logged in and there aren't that many faults which
>> lead to complete stand still of hw (no reset).
>
> Right, can you retry triggering the freeze without the fglrx driver?
> Simply remove it completely so that even the possibility to load it is
> not there.
Will do. But I don't think it is fglrx. I've been using it for several
years (just with an external graphics card before) and never had a
problem with it. Besides that, nothing was happening on the display
during the hangs: I was logged out and just had a remote ssh session
open.
>> But as you said I can't know, the only thing I know is that a box
>> with new mb, memory and apu come to a complete stand still, and
>> such shortly after I've received an emergency message which told me
>> that a bit inside the cpu switched unexpected. Adding to that, the
>> box did the same as what it did while it received the MCE, a backup
>> from a sata-atached ssd to an usb3-hd which includes compression and
>> encryption which keeps all cores at work most of the time for several
>> hours.
>
> So do you get that MCE each time you execute that same workload?
No, up to now the MCE has only been visible once. But stressing the box
yesterday (with loads of 3 for several hours and such) revealed some
other serious failures which all look like what happens when the cache
(or memory) is broken (I don't know how many bits of the cache can be
corrected before something else happens, or what happens then). E.g.
the checksum of a backup was wrong, or bzip2 failed with an error which
it suggests is caused by a hardware failure like bad RAM (I've never
seen that error from bzip2 before).
I've just done a memory test using memtest86+-4.20 for about 7h (3
complete passes of all 16GB), no errors, so the new memory itself seems
to be ok.
I will now to tests with leaving fglrx off.
Regards,
Alexander
Am 06.11.2012 12:18, schrieb Alexander Holler:
> I will now to tests with leaving fglrx off.
s/to/do/ ;)
That went fast. Disabled fglrx, started the tests, full halt without
anything visible on the serial console (I needed to press the reset
button):
------------------------
[ 77.360180] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: data=ordered,nodelalloc
[ 461.503107] EXT4-fs (sdd3): mounted filesystem with ordered data mode. Opts: (null)
[ 473.869770] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: data=ordered,nodelalloc
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Linux version 3.6.6-00008-gd13f937 ([email protected]) (gcc version 4.7.2 20120921 (Red Hat 4.7.2-2) (GCC) ) #274 SMP Mon Nov 5 17:26:01 CET 2012
[ 0.000000] Command line: ro root=/dev/sdb2 rootfstype=ext4 enforcing=0 cgroup_disable=memory vga=0x346 video=vesafb:mtrr:3,ywrap radeon.modeset=0 earlycon=uart8250,io,0x3f8,115200n8 console=ttyS0,115200n8 console=tty0
------------------------
Regards,
Alexander
Am 06.11.2012 12:44, schrieb Alexander Holler:
> Am 06.11.2012 12:18, schrieb Alexander Holler:
>> I will now to tests with leaving fglrx off.
>
> s/to/do/ ;)
>
> That was gone fast. Disabled fglrx, started tests, full halt without any
> visible on the serial (I needed to press the reset button):
One after another, now I've got this:
[ 5698.640830] [Hardware Error]: CPU:0 MC2_STATUS[Over|CE|MiscV|-|AddrV|-|-|CECC]: 0xdc2540c000040136
[ 5698.649866] [Hardware Error]: MC2_ADDR: 0x0000000002299678
[ 5698.655443] [Hardware Error]: Combined Unit Error: Fill ECC error on data fills.
[ 5698.662849] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD
I think it's now really an RMA and I can stop doing further tests.
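
For reference, the "cache level / tx / mem-tx" line can be reproduced
from the low 16 bits of the status word. This is a hedged sketch: the
field layout (0000 0001 RRRR TTLL) and the name tables are assumptions
based on how the kernel's AMD MCE decoder appears to label cache
errors, not something stated in this thread.

```python
# Name tables are assumptions modelled on the kernel's AMD MCE decoder.
LL = ["RESV", "L1", "L2", "L3/GEN"]      # cache level, bits 1:0
TT = ["INSN", "DATA", "GEN", "RESV"]     # transaction type, bits 3:2
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]  # bits 7:4


def decode_cache_error(status: int):
    """Return (level, tx, mem_tx) for a memory/cache error code, else None."""
    ec = status & 0xFFFF
    # Memory/cache errors are assumed to have 0x01 in the upper byte.
    if (ec & 0xFF00) != 0x0100:
        return None
    level = LL[ec & 0x3]
    tx = TT[(ec >> 2) & 0x3]
    r = (ec >> 4) & 0xF
    mem_tx = RRRR[r] if r < len(RRRR) else "-"
    return level, tx, mem_tx


print(decode_cache_error(0x8c00002000010151))  # ('L1', 'INSN', 'IRD')
print(decode_cache_error(0xdc2540c000040136))  # ('L2', 'DATA', 'DRD')
```

Both results agree with the decoded lines the kernel printed for the
two events, one in the L1 instruction cache and one in L2.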
Regards,
Alexander
On Tue, Nov 06, 2012 at 02:14:46PM +0100, Alexander Holler wrote:
> Am 06.11.2012 12:44, schrieb Alexander Holler:
> >Am 06.11.2012 12:18, schrieb Alexander Holler:
> >>I will now to tests with leaving fglrx off.
> >
> >s/to/do/ ;)
> >
> >That was gone fast. Disabled fglrx, started tests, full halt without any
> >visible on the serial (I needed to press the reset button):
>
> One after another, now I've got this:
>
> [ 5698.640830] [Hardware Error]: CPU:0
> MC2_STATUS[Over|CE|MiscV|-|AddrV|-|-|CECC]: 0xdc2540c000040136
> [ 5698.649866] [Hardware Error]: MC2_ADDR: 0x0000000002299678
> [ 5698.655443] [Hardware Error]: Combined Unit Error: Fill ECC error
> on data fills.
> [ 5698.662849] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD
>
> I think it's now really an RMA and I can stop doing further tests.
Are you sure the temperature conditions of the box are optimal? IOW,
there's nothing overheating in there?
Thanks.
--
Regards/Gruss,
Boris.
Am 06.11.2012 15:47, schrieb Borislav Petkov:
> On Tue, Nov 06, 2012 at 02:14:46PM +0100, Alexander Holler wrote:
>> Am 06.11.2012 12:44, schrieb Alexander Holler:
>>> Am 06.11.2012 12:18, schrieb Alexander Holler:
>>>> I will now to tests with leaving fglrx off.
>>>
>>> s/to/do/ ;)
>>>
>>> That was gone fast. Disabled fglrx, started tests, full halt without any
>>> visible on the serial (I needed to press the reset button):
>>
>> One after another, now I've got this:
>>
>> [ 5698.640830] [Hardware Error]: CPU:0
>> MC2_STATUS[Over|CE|MiscV|-|AddrV|-|-|CECC]: 0xdc2540c000040136
>> [ 5698.649866] [Hardware Error]: MC2_ADDR: 0x0000000002299678
>> [ 5698.655443] [Hardware Error]: Combined Unit Error: Fill ECC error
>> on data fills.
>> [ 5698.662849] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD
>>
>> I think it's now really an RMA and I can stop doing further tests.
>
> Are you sure the temperature conditions of the box are optimal? IOW,
> there's nothing overheating in there?
Yes. At least if the boxed fan is sufficient, which I have to assume.
The ambient temperature is around 18 °C or even lower.
Regards,
Alexander