LinuxLists.cc - MCE Errors, Bad CPU, Memory or Motherboard?

2006-03-10 11:44:48

Subject: MCE Errors, Bad CPU, Memory or Motherboard?

This is the first time I have seen an MCE error. Googling the EIP value at
the time of the panic does not return any useful results.

Does anyone know whether it is the CPU or the memory that is bad in this
machine? The log shows a problem with bank 4; however, if the CPU is
bad, then it is possible to get all sorts of unpredictable results, right?

Dec 9 23:21:25 box CPU 0: Machine Check Exception
Dec 9 23:21:25 box Bank 4: f62ba001c0080813 at 00000000a6e6c2c0
Dec 9 23:21:25 box Kernel panic: CPU context corrupt
Dec 9 23:21:25 box kernel: CPU 0
Dec 9 23:21:25 box kernel: CPU 0
Dec 9 23:21:25 box kernel: Bank 4
Dec 9 23:21:25 box kernel: Bank 4
Dec 9 23:21:25 box kernel: Kernel panic
Dec 9 23:21:25 box kernel: Kernel panic
Dec 9 23:21:26 box kernel BUG at panic.c:66!
Dec 9 23:21:26 box invalid operand: 0000
Dec 9 23:21:26 box CPU: 0
Dec 9 23:21:26 box EIP: 0010
Dec 9 23:21:26 box EFLAGS: 00010282
Dec 9 23:21:26 box eax: f895c1d0 ebx
Dec 9 23:21:26 box esi: 00000415 edi
Dec 9 23:21:26 box ds: 0018 es
Dec 9 23:21:26 box Process java (pid: 6852, stackpage=e0ec5000)
Dec 9 23:21:26 box Stack: 04000000 e0ec5fa4 c010fc28 c02942b8 00000005
c0080813 00000417 00000416
Dec 9 23:21:26 box 00000005 00000000 00000004 e0ec4000 00000000
c010fd00 e0ec5fb4 c010fd11
Dec 9 23:21:26 box e0ec5fc4 00000000 bfffc090 c0108ed4 e0ec5fc4
00000000 00000023 44841510
Dec 9 23:21:26 box Call Trace: [<c010fc28>] [<c010fd00>] [<c010fd11>]
[<c0108ed4>]
Dec 9 23:21:26 box
Dec 9 23:21:26 box Code: 0f 0b 42 00 28 6a 29 c0 b9 00 e0 ff ff 21 e1 8b
51 30 c1 e2

Thanks,

Justin.

2006-03-11 09:51:16

by Marco Roeland

Subject: Re: MCE Errors, Bad CPU, Memory or Motherboard?

On Friday March 10th 2006 Justin Piszcz wrote:

> This is the first time I have seen an MCE error. Googling the EIP value at
> the time of the panic does not return any useful results.
>
> Does anyone know whether it is the CPU or the memory that is bad in this
> machine? The log shows a problem with bank 4; however, if the CPU is
> bad, then it is possible to get all sorts of unpredictable results, right?

If it's the memory, you can try swapping RAM sticks or, if you have more
than one, taking some of them out.

Recently I saw this kind of MCE event on a 3-year-old Athlon 2000
machine, which was on 24/7. It finally turned out that I was running the
"athcool" program, which enables power-saving mode in the idle state.
Although this saves energy and reduces fan noise, switching back to
work mode apparently wasn't fast enough anymore after these years. After
disabling the power saving, the MCE errors were gone.

Telling MCEs caused by memory apart from those caused by ACPI power
saving (if you use that at all) is really simple: the memory ones
will occur on a very busy system, whereas the power-saving ones will
occur on a completely idle system.

But you might have real hardware problems, of course. In that case, use
the usual routines: memtest(86), check the cooling, check for power
surges, etc.
--
Marco Roeland

2006-03-11 22:08:05

by Justin Piszcz

Subject: Re: MCE Errors, Bad CPU, Memory or Motherboard?

Thanks for this response.

On Sat, 11 Mar 2006, Marco Roeland wrote:

> On Friday March 10th 2006 Justin Piszcz wrote:
>
>> This is the first time I have seen an MCE error. Googling the EIP value at
>> the time of the panic does not return any useful results.
>>
>> Does anyone know whether it is the CPU or the memory that is bad in this
>> machine? The log shows a problem with bank 4; however, if the CPU is
>> bad, then it is possible to get all sorts of unpredictable results, right?
>
> If it's the memory, you can try swapping RAM sticks or, if you have more
> than one, taking some of them out.
>
> Recently I saw this kind of MCE event on a 3-year-old Athlon 2000
> machine, which was on 24/7. It finally turned out that I was running the
> "athcool" program, which enables power-saving mode in the idle state.
> Although this saves energy and reduces fan noise, switching back to
> work mode apparently wasn't fast enough anymore after these years. After
> disabling the power saving, the MCE errors were gone.
>
> Telling MCEs caused by memory apart from those caused by ACPI power
> saving (if you use that at all) is really simple: the memory ones
> will occur on a very busy system, whereas the power-saving ones will
> occur on a completely idle system.
>
> But you might have real hardware problems, of course. In that case, use
> the usual routines: memtest(86), check the cooling, check for power
> surges, etc.
> --
> Marco Roeland
>

2006-03-11 22:40:41

by Randy Dunlap

Subject: Re: MCE Errors, Bad CPU, Memory or Motherboard?

On Fri, 10 Mar 2006 06:44:46 -0500 (EST) Justin Piszcz wrote:

> This is the first time I have seen an MCE error. Googling the EIP value at
> the time of the panic does not return any useful results.
>
> Does anyone know whether it is the CPU or the memory that is bad in this
> machine? The log shows a problem with bank 4; however, if the CPU is
> bad, then it is possible to get all sorts of unpredictable results, right?
>
> Dec 9 23:21:25 box CPU 0: Machine Check Exception
> Dec 9 23:21:25 box Bank 4: f62ba001c0080813 at 00000000a6e6c2c0
> Dec 9 23:21:25 box Kernel panic: CPU context corrupt
> Dec 9 23:21:25 box kernel: CPU 0
> Dec 9 23:21:25 box kernel: CPU 0
> Dec 9 23:21:25 box kernel: Bank 4
> Dec 9 23:21:25 box kernel: Bank 4
> Dec 9 23:21:25 box kernel: Kernel panic
> Dec 9 23:21:25 box kernel: Kernel panic
> Dec 9 23:21:26 box kernel BUG at panic.c:66!
> Dec 9 23:21:26 box invalid operand: 0000
> Dec 9 23:21:26 box CPU: 0
> Dec 9 23:21:26 box EIP: 0010
> Dec 9 23:21:26 box EFLAGS: 00010282
> Dec 9 23:21:26 box eax: f895c1d0 ebx
> Dec 9 23:21:26 box esi: 00000415 edi
> Dec 9 23:21:26 box ds: 0018 es
> Dec 9 23:21:26 box Process java (pid: 6852, stackpage=e0ec5000)
> Dec 9 23:21:26 box Stack: 04000000 e0ec5fa4 c010fc28 c02942b8 00000005
> c0080813 00000417 00000416
> Dec 9 23:21:26 box 00000005 00000000 00000004 e0ec4000 00000000
> c010fd00 e0ec5fb4 c010fd11
> Dec 9 23:21:26 box e0ec5fc4 00000000 bfffc090 c0108ed4 e0ec5fc4
> 00000000 00000023 44841510
> Dec 9 23:21:26 box Call Trace: [<c010fc28>] [<c010fd00>] [<c010fd11>]
> [<c0108ed4>]
> Dec 9 23:21:26 box
> Dec 9 23:21:26 box Code: 0f 0b 42 00 28 6a 29 c0 b9 00 e0 ff ff 21 e1 8b
> 51 30 c1 e2

I suppose you tried the obvious tools (parsemce and maybe mcelog)...

http://www.codemonkey.org.uk/projects/parsemce/
ftp://ftp.x86-64.org/pub/linux-x86_64/tools/mcelog/
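
If neither tool is at hand, the flag bits of the bank status word in the
log can be read off by hand. The following is only a rough sketch, not
what parsemce or mcelog do internally: it assumes the value printed after
"Bank 4:" is a raw status register in the architectural x86 layout
(flag bits 63..57, MCA error code in bits 15..0), and the function name
decode_status is made up for illustration.

```python
# Sketch: hand-decode a machine-check bank status word.
# Assumption: the value follows the architectural x86 MCi_STATUS layout.
# What a given bank number or model-specific code means depends on the CPU.

FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # the bank's MISC register holds extra information
    58: "ADDRV",  # the bank's ADDR register holds a valid address
    57: "PCC",    # processor context corrupt
}


def decode_status(status):
    """Split a 64-bit status word into flag names and error-code fields."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31..16
        "mca_error_code": status & 0xFFFF,               # bits 15..0
    }


# Status value taken from the "Bank 4:" line of the log above.
decoded = decode_status(0xF62BA001C0080813)
print(decoded["flags"])
print(hex(decoded["model_specific_code"]), hex(decoded["mca_error_code"]))
```

Under that assumption, PCC being set matches the "CPU context corrupt"
panic text, and ADDRV being set matches the address printed after "at".
That alone does not say whether the CPU, the memory, or the board is at
fault; which unit bank 4 belongs to is specific to the CPU model.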

---
~Randy
Please use an email client that implements proper (compliant) threading.
(You know who you are.)