2006-05-15 09:42:45

by Stephan von Krawczynski

Subject: mcelog ?

Hello,

Can some kind soul please briefly explain what this message tells me:

HARDWARE ERROR
CPU 1: Machine Check Exception: 4 Bank 4: b60a200170080813
TSC 89cfb4725b17 ADDR 1025cb3f0
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check



Of course I ran mcelog, but I don't quite understand how the additional info
helps me find the problem.
Is this a problem with RAM? And if so, which module?

The box is a dual Opteron with two banks of memory (4 sockets each), each socket
holding a 1 GB memory module.

Thanks for any hints.
--
Regards,
Stephan


2006-05-15 15:20:10

by Tim Hockin

Subject: Re: mcelog ?

On Mon, May 15, 2006 at 11:42:43AM +0200, Stephan von Krawczynski wrote:
> HARDWARE ERROR
> CPU 1: Machine Check Exception: 4 Bank 4: b60a200170080813
> TSC 89cfb4725b17 ADDR 1025cb3f0
> This is not a software problem!
> Run through mcelog --ascii to decode and contact your hardware vendor
> Kernel panic - not syncing: Machine check
>
> Of course I ran mcelog, but I don't quite understand how the additional info
> helps me find the problem.
> Is this a problem with RAM? And if so, which module?

It sounds like a memory error, but there are some other bank4 errors that
can crop up. What does mcedecode say?

2006-05-15 15:45:10

by Andi Kleen

Subject: Re: mcelog ?

Stephan von Krawczynski writes:

> This is not a software problem!
> Run through mcelog --ascii to decode and contact your hardware vendor

Since when is linux-kernel your hardware vendor?
Would it help if the message was written all upper case?

-Andi

2006-05-16 09:37:05

by Stephan von Krawczynski

Subject: Re: mcelog ?

On Mon, 15 May 2006 08:20:08 -0700
Tim Hockin wrote:

> On Mon, May 15, 2006 at 11:42:43AM +0200, Stephan von Krawczynski wrote:
> > HARDWARE ERROR
> > CPU 1: Machine Check Exception: 4 Bank 4: b60a200170080813
> > TSC 89cfb4725b17 ADDR 1025cb3f0
> > This is not a software problem!
> > Run through mcelog --ascii to decode and contact your hardware vendor
> > Kernel panic - not syncing: Machine check
> >
> > Of course I ran mcelog, but I don't quite understand how the additional info
> > helps me find the problem.
> > Is this a problem with RAM? And if so, which module?
>
> It sounds like a memory error, but there are some other bank4 errors that
> can crop up. What does mcedecode say?

Well, here it is:

HARDWARE ERROR
CPU 1 4 northbridge TSC 89cfb4725b17
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 7014
bit32 = err cpu0
bit45 = uncorrected ecc error
bit57 = processor context corrupt
bit61 = error uncorrected
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS b60a200170080813 MCGSTATUS 4
This is not a software problem!


Is this some sort of memory error?

Thank you for your help
--
Regards,
Stephan
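
For readers who want to check the decoded lines against the raw STATUS value by hand, here is a minimal Python sketch. It is not mcelog's own code. Bits 57 to 63 are the architectural x86 machine-check flags; the names for bits 32 and 45 are copied from the decoded output above; the positions used for the chipkill syndrome and the extended error code are an assumed K8 northbridge layout, chosen because they reproduce the "syndrome = 7014" line. All function names are made up for illustration.

```python
# Sketch only: split the MCi_STATUS word from the panic message into fields.
STATUS = 0xB60A200170080813  # value printed after "Bank 4:"

# Bits 57-63 are architectural x86 machine-check flags. Bits 32 and 45 are
# named exactly as in the decoded output quoted in this thread.
FLAGS = {
    63: "status valid",
    62: "error overflow",
    61: "error uncorrected",
    60: "error reporting enabled",
    59: "MISC register valid",
    58: "ADDR register valid",
    57: "processor context corrupt",
    45: "uncorrected ecc error",
    32: "err cpu0",
}


def set_flags(status):
    """Return the names of all known flag bits that are set, highest bit first."""
    return [name for bit, name in sorted(FLAGS.items(), reverse=True)
            if (status >> bit) & 1]


def chipkill_syndrome(status):
    """Assumed K8 layout: syndrome[7:0] in bits 54:47, syndrome[15:8] in bits 31:24."""
    low = (status >> 47) & 0xFF
    high = (status >> 24) & 0xFF
    return (high << 8) | low


def extended_error_code(status):
    """Assumed K8 layout: extended error code in bits 19:16."""
    return (status >> 16) & 0xF


if __name__ == "__main__":
    for name in set_flags(STATUS):
        print(name)
    print("syndrome = %04x" % chipkill_syndrome(STATUS))
    print("extended error code = %d" % extended_error_code(STATUS))
```

Under these assumptions the "ADDR register valid" bit is set, so the ADDR 1025cb3f0 value in the panic message should be meaningful. Which physical module that address lands on depends on how the board and BIOS map memory, and cannot be derived from the status word alone.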

2006-05-22 11:00:12

by Bernd Pfrommer

Subject: Re: mcelog ?

Stephan von Krawczynski <skraw <at> ithnet.com> writes:

>
> Hello,
>
> Can some kind soul please briefly explain what this message tells me:
>
> HARDWARE ERROR
> CPU 1: Machine Check Exception: 4 Bank 4: b60a200170080813
> TSC 89cfb4725b17 ADDR 1025cb3f0
> This is not a software problem!
> Run through mcelog --ascii to decode and contact your hardware vendor
> Kernel panic - not syncing: Machine check
>
> Of course I ran mcelog, but I don't quite understand how the additional info
> helps me find the problem.
> Is this a problem with RAM? And if so, which module?
>
> The box is a dual Opteron with two banks of memory (4 sockets each), each socket
> holding a 1 GB memory module.
>
> Thanks for any hints.


I got a very similar error on a Supermicro H8QC8+ (4-way dual-core Opteron)
during heavy disk writes. It has only happened once so far. The error message
also mentioned
4 Bank 4: b608a00100000813 (strange that the last 4 digits agree).

Bernd
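
A possible reason the last four digits agree: the low 16 bits of the status word hold the machine-check error code, and 0813 fits the x86 "bus error" pattern 0000 1PPT RRRR IILL. The Python sketch below decodes that field for both values reported in this thread. The function name and the wording of the field descriptions are made up for illustration and only approximate what mcelog prints.

```python
# Sketch only: decode the low 16 bits of MCi_STATUS when they match the
# x86 bus-error pattern 0000 1PPT RRRR IILL.
PARTICIPATION = ["local node origin", "local node response",
                 "local node observed", "generic"]
REQUEST = {0: "generic error", 1: "generic read", 2: "generic write",
           3: "data read", 4: "data write", 5: "instruction fetch",
           6: "prefetch", 7: "evict", 8: "snoop"}
MEM_OR_IO = ["memory access", "reserved", "I/O access", "generic"]
LEVEL = ["level 0", "level 1", "level 2", "level generic"]


def decode_bus_error(status):
    """Return the bus-error fields as a dict, or None if the code is another type."""
    code = status & 0xFFFF
    if (code >> 11) != 0b00001:      # bits 15:11 must read 00001
        return None
    return {
        "participation": PARTICIPATION[(code >> 9) & 0x3],   # PP
        "timed_out": bool((code >> 8) & 0x1),                # T
        "request": REQUEST.get((code >> 4) & 0xF, "unknown"),  # RRRR
        "target": MEM_OR_IO[(code >> 2) & 0x3],              # II
        "level": LEVEL[code & 0x3],                          # LL
    }


if __name__ == "__main__":
    for status in (0xB60A200170080813, 0xB608A00100000813):
        print("%016x -> %s" % (status, decode_bus_error(status)))
```

Both values decode to the same fields: a read of memory that originated on the local node and did not time out. So the shared 0813 says both machines hit the same kind of transaction, while the upper bits of each status word carry the details that differ.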