2007-06-16 21:02:55

by Mr. James W. Laferriere

Subject: Machine Check Exception: 0...04

Hello All , Does anyone know how to identify a cause for these(*) ?
Or of any tools to help in the identification of the cause ?
So far the Machine checks only happen when I am running bonnie++ against
my software raid6 array .

I have done everything I know to do to attempt to ascertain what is
causing the machine checks .
ie:
1) memtest86+ for days , no errors .
2) cpuburnP6 , The tests run were 'cpuburnP6 E' & 'cpuburnP6 H' for ~
60 minutes each . All CPUs & HT were at 96+<->100% for 60+ Minutes ,
no excessive heating or lockups . In single user mode of course .
I know cpuburn is old but it can exercise the comms between l1 & cpu
& l1 & l2 -> cpu if done right .

(*)
CPU 5: Machine Check Exception: 0000000000000004
CPU 4: Machine Check Exception: 0000000000000004
Kernel panic - not syncing: Unable to continue
<system reboots>

root@(none):~ # uname -a
Linux (none) 2.6.21.5 #1 SMP Fri Jun 15 04:37:23 UTC 2007 i686 pentium4 i386 GNU/Linux

Complete serial console log of a boot to single user mode is here .

http://www.baby-dragons.com/test-2.6.21.5-mptscsi-4.00.10.00-2007006161326.log

Tia , JimL
--
+-----------------------------------------------------------------+
| James W. Laferriere | System Techniques | Give me VMS |
| Network Engineer | 663 Beaumont Blvd | Give me Linux |
| [email protected] | Pacifica, CA. 94044 | only on AXP |
+-----------------------------------------------------------------+


2007-06-17 22:38:54

by Mr. James W. Laferriere

Subject: Re: Machine Check Exception: 0...04

Hello All , As a continuation .

On Sat, 16 Jun 2007, Mr. James W. Laferriere wrote:
> Hello All , Does anyone know how to identify a cause for these(*) ?
> Or of any tools to help in the identification of the cause ?
> So far the Machine checks only happen when I am running bonnie++ against
> my software raid6 array .
>
> I have done everything I know to do to attempt to ascertain what is
> causing the machine checks .
> ie:
> 1) memtest86+ for days , no errors .
> 2) cpuburnP6 , The tests run were 'cpuburnP6 E' & 'cpuburnP6 H' for ~
> 60 minutes each . All CPUs & HT were at 96+<->100% for 60+ Minutes ,
> no excessive heating or lockups . In single user mode of course .
> I know cpuburn is old but it can exercise the comms between l1 & cpu
> & l1 & l2 -> cpu if done right .
>
> (*)
> CPU 5: Machine Check Exception: 0000000000000004
> CPU 4: Machine Check Exception: 0000000000000004
> Kernel panic - not syncing: Unable to continue
> <system reboots>
>
> root@(none):~ # uname -a
> Linux (none) 2.6.21.5 #1 SMP Fri Jun 15 04:37:23 UTC 2007 i686 pentium4 i386 GNU/Linux
>
> Complete serial console log of a boot to single user mode is here .
>
> http://www.baby-dragons.com/test-2.6.21.5-mptscsi-4.00.10.00-2007006161326.log
>

Hopefully more useful information . Again only if I am doing any HEAVY
disk activity ie: bonnie++ .
I am not able to find a tool to disassemble the EIP: portion of the
CPU 4: output .

eth2: after: tx_done_idx=125 free_idx=3 cmdsts=800005ea
CPU 5: Machine Check Exception: 0000000000000004
CPU 4: Machine Check Exception: 0000000000000004
CPU 4: EIP: c0100c72 EFLAGS: 00000246
eax: 00000000 ebx: 00000000 ecx: 00000000 edx: 00000000
esi: 00000000 edi: f7c58000 ebp: f7c59f68 esp: f7c59f5c
Kernel panic - not syncing: Unable to continue
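
On the EIP question: the usual way to turn a kernel address such as
c0100c72 into a function name is to look it up in the System.map that
matches the running kernel. A minimal sketch in Python, assuming
System.map-style input lines ("address type name"). The symbol names and
addresses in the sample map are made up for illustration and are not taken
from this kernel build.

```python
import bisect

def load_symbols(lines):
    """Parse System.map-style lines into a sorted list of (address, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols

def resolve(symbols, address):
    """Return (name, offset) of the last symbol at or below address."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None  # address lies below the first known symbol
    base, name = symbols[index]
    return name, address - base

# Hypothetical sample; replace with the contents of the real System.map.
SAMPLE_MAP = """\
c0100000 T example_func_a
c0100c60 T example_func_b
c0100d00 T example_func_c
"""

symbols = load_symbols(SAMPLE_MAP.splitlines())
name, offset = resolve(symbols, 0xc0100c72)
print(f"{name}+{offset:#x}")
```

A machine check is raised by the hardware, so the reported EIP may simply
be whatever the CPU happened to be executing at that moment rather than
the cause of the error.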

Tia , JimL
--
+-----------------------------------------------------------------+
| James W. Laferriere | System Techniques | Give me VMS |
| Network Engineer | 663 Beaumont Blvd | Give me Linux |
| [email protected] | Pacifica, CA. 94044 | only on AXP |
+-----------------------------------------------------------------+

2007-06-17 22:43:00

by Jesper Juhl

Subject: Re: Machine Check Exception: 0...04

On 18/06/07, Mr. James W. Laferriere <[email protected]> wrote:
> Hello All , As a continuation .
>
> On Sat, 16 Jun 2007, Mr. James W. Laferriere wrote:
> > Hello All , Does anyone know how to identify a cause for these(*) ?
> > Or of any tools to help in the identification of the cause ?
> > So far the Machine checks only happen when I am running bonnie++ against
> > my software raid6 array .
> >
> > I have done everything I know to do to attempt to ascertain what is
> > causing the machine checks .
> > ie:
> > 1) memtest86+ for days , no errors .
> > 2) cpuburnP6 , The tests run were 'cpuburnP6 E' & 'cpuburnP6 H' for ~
> > 60 minutes each . All CPUs & HT were at 96+<->100% for 60+ Minutes ,
> > no excessive heating or lockups . In single user mode of course .
> > I know cpuburn is old but it can exercise the comms between l1 & cpu
> > & l1 & l2 -> cpu if done right .
> >
> > (*)
> > CPU 5: Machine Check Exception: 0000000000000004
> > CPU 4: Machine Check Exception: 0000000000000004
> > Kernel panic - not syncing: Unable to continue
> > <system reboots>
> >

An MCE is an error reported by the hardware. It is most likely not a
software problem, so there is not much kernel people can do about it.

Google for "parsemce.c" to find a program that'll decode most MCEs for you.

You may also want to contact your hardware vendor to get an exact
explanation for the error.
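
To illustrate the kind of decoding such a tool does, here is a minimal
sketch in Python. It assumes the 64-bit value printed after "Machine Check
Exception:" is the IA32_MCG_STATUS register, and it uses the three
architectural flag bits from Intel's machine-check documentation. The
table and function names are illustrative only, not part of parsemce.

```python
# Architectural flag bits of IA32_MCG_STATUS (bit number, name, meaning).
MCG_STATUS_FLAGS = [
    (0, "RIPV", "restart IP valid: execution can resume at the saved IP"),
    (1, "EIPV", "error IP valid: the saved IP is associated with the error"),
    (2, "MCIP", "machine check in progress"),
]

def decode_mcg_status(value):
    """Return the names of the flags that are set in value."""
    return [name for bit, name, _ in MCG_STATUS_FLAGS if (value >> bit) & 1]

# Value as printed in the panic message above (hexadecimal).
status = int("0000000000000004", 16)
print(decode_mcg_status(status))
```

Under that assumption, 0x4 has only MCIP set. RIPV and EIPV are both
clear, which would mean the saved EIP is not tied to the error and the CPU
cannot safely resume, which fits the panic.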

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2007-06-22 06:13:39

by Joachim Deguara

Subject: Re: Machine Check Exception: 0...04

On Saturday 16 June 2007 22:40:53 Mr. James W. Laferriere wrote:
> Hello All , Does anyone know how to identify a cause for these(*) ?
> Or of any tools to help in the identification of the cause ?
> So far the Machine checks only happen when I am running bonnie++ against
> my software raid6 array .

You should run mcelog to decode what the machine checks mean. These do point
to a hardware problem. Just briefly judging from your description of when
the MCEs happen I wonder if the power supply is getting maxed out driving all
of those disks. Either way, use mcelog to find out what the MCEs are first.
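
For a sense of what a decoder like mcelog reports, the per-bank status
registers (IA32_MCi_STATUS) carry a set of flag bits plus an MCA error
code. Below is a minimal sketch in Python of splitting out those
architectural fields, assuming the Intel layout. The sample value is made
up for illustration and does not come from this machine; the function name
is hypothetical.

```python
# Architectural flag bits in the upper part of IA32_MCi_STATUS.
MCI_STATUS_FLAGS = [
    (63, "VAL"),    # register contains valid error information
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting was enabled for this error
    (59, "MISCV"),  # the MISC register holds additional information
    (58, "ADDRV"),  # the ADDR register holds the error address
    (57, "PCC"),    # processor context may be corrupt
]

def decode_mci_status(value):
    """Split a bank status value into flag names and the MCA error code."""
    flags = [name for bit, name in MCI_STATUS_FLAGS if (value >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": value & 0xFFFF,           # bits 15:0
        "model_specific_code": (value >> 16) & 0xFFFF,  # bits 31:16
    }

# Made-up example value: VAL, UC and EN set, MCA error code 0x0136.
sample = (1 << 63) | (1 << 61) | (1 << 60) | 0x0136
print(decode_mci_status(sample))
```

The MCA error code still has to be interpreted against the vendor's
tables, which is the part a real decoder or the hardware vendor supplies.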

-Joachim