Message-ID: <5096A3A4.8070602@ahsoftware.de>
Date: Sun, 04 Nov 2012 18:19:32 +0100
From: Alexander Holler <holler@ahsoftware.de>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:16.0) Gecko/20121016 Thunderbird/16.0.1
MIME-Version: 1.0
To: Borislav Petkov <bp@alien8.de>, linux-kernel@vger.kernel.org
Subject: Re: AMD A10: MCE Instruction Cache Error
References: <5093A592.9070605@ahsoftware.de> <5093D069.20901@ahsoftware.de> <20121103044929.GB21829@liondog.tnic> <5094F5C5.1000000@ahsoftware.de> <20121104152133.GA16116@x1.osrc.amd.com>
In-Reply-To: <20121104152133.GA16116@x1.osrc.amd.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2358
Lines: 50

On 04.11.2012 16:21, Borislav Petkov wrote:
> On Sat, Nov 03, 2012 at 11:45:25AM +0100, Alexander Holler wrote:
>> Hmm, exactly that just happened twice in a row. Unfortunately the
>> screen was already disabled (screen saving mode), so I couldn't see
>> any message, if there was any. Just a dead box (not overclocked, I
>> don't do such, I even had enabled the power saving mode in the BIOS,
>> which seems to mean max. 3800 MHz). I think I should start getting
>> nervous. :(
>
> How do you know this happened twice if you couldn't see any message?

I was logged in remotely, and there aren't that many faults which lead to
a complete standstill of the hw (no reset).

But as you said, I can't know. The only thing I know is that a box with
a new mb, memory and apu came to a complete standstill, shortly after I
had received an emergency message telling me that a bit inside the cpu
had switched unexpectedly. On top of that, the box was doing the same
thing it was doing when it received the MCE: a backup from a
sata-attached ssd to a usb3 hd, including compression and encryption,
which keeps all cores busy most of the time for several hours.

> Also, can you enable netconsole or serial console, if possible, and try
> to catch full dmesg from the boot and up until it happens.

As I was logged in remotely over the network, I know it wasn't the same
MCE as before (just a disconnect and dead hw). But I don't know what else
it was. And as I recently got hit by a broken RAM module, which was a
pain to find, I'm not very happy that I have to go through similar pain
again with new HW.

The probability of getting a working HW and SW combination has just
become too low in recent years. All the (IT) companies had better spend
the money they now give their lawyers on their QA and engineering
departments instead.

Sorry for the rant. Although I'm used to living with hw and sw errors
(as a sw-dev), I'm currently just a bit annoyed. ;)

I will set up something to monitor the box through the serial console
and will let it run backups continuously, trying to catch some useful
information.

Regards,

Alexander