Hi,
I was running 2.4.6-stable in SMP mode on a dual P3-1GHz machine (VIA 694D
Chipset / MSI-6321 M/B + ) and the following message popped up after which
the system hardlocked (no SysRQ input). What does this message mean?
CPU 1: Machine Check Exception: 0000000000000004
Bank 1: b200000000000115
Kernel panic: CPU context corrupt
Message from syslogd@delta at Sat Jul 7 13:18:36 2001 ...
delta kernel: CPU 1: Machine Check Exception: 0000000000000004
Message from syslogd@delta at Sat Jul 7 13:18:36 2001 ...
delta kernel: Bank 1: b200000000000115
Message from syslogd@delta at Sat Jul 7 13:18:36 2001 ...
delta kernel: Kernel panic: CPU context corrupt
--
Vibol Hou
Followup to: <[email protected]>
By author: "Vibol Hou" <[email protected]>
In newsgroup: linux.dev.kernel
>
> Hi,
>
> I was running 2.4.6-stable in SMP mode on a dual P3-1GHz machine (VIA 694D
> Chipset / MSI-6321 M/B + ) and the following message popped up after which
> the system hardlocked (no SysRQ input). What does this message mean?
>
> CPU 1: Machine Check Exception: 0000000000000004
> Bank 1: b200000000000115
> Kernel panic: CPU context corrupt
>
> Message from syslogd@delta at Sat Jul 7 13:18:36 2001 ...
> delta kernel: CPU 1: Machine Check Exception: 0000000000000004
>
> Message from syslogd@delta at Sat Jul 7 13:18:36 2001 ...
> delta kernel: Bank 1: b200000000000115
>
> Message from syslogd@delta at Sat Jul 7 13:18:36 2001 ...
> delta kernel: Kernel panic: CPU context corrupt
>
It means your hardware is bad.
-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
> I was running 2.4.6-stable in SMP mode on a dual P3-1GHz machine (VIA 694D
> Chipset / MSI-6321 M/B + ) and the following message popped up after which
> the system hardlocked (no SysRQ input). What does this message mean?
>
> CPU 1: Machine Check Exception: 0000000000000004
> Bank 1: b200000000000115
> Kernel panic: CPU context corrupt
It means your processor flagged a fault. The b2....115 number decodes to info
about the fault cause if you grab the PIII manual.
Stupid things like overheating or wrong voltages can also trigger it.
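
As a rough illustration of what "decodes to info" means, here is a minimal sketch that splits the two numbers from the report into their flag bits. It assumes the first number is the global MCG_STATUS value and the "Bank" number is that bank's MCi_STATUS value, with the bit positions as documented for the P6-family machine-check architecture. The names mce_flags and mce_decode_flags are made up for this example and are not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Global status (assumed to be the "Machine Check Exception:" value). */
#define MCG_RIPV   (1ULL << 0)   /* restart IP valid */
#define MCG_EIPV   (1ULL << 1)   /* error IP valid */
#define MCG_MCIP   (1ULL << 2)   /* machine check in progress */

/* Per-bank status flags (assumed to be the "Bank N:" value). */
#define MCI_VAL    (1ULL << 63)  /* register contents valid */
#define MCI_OVER   (1ULL << 62)  /* an earlier error was overwritten */
#define MCI_UC     (1ULL << 61)  /* uncorrected error */
#define MCI_EN     (1ULL << 60)  /* reporting was enabled */
#define MCI_MISCV  (1ULL << 59)  /* MISC register valid */
#define MCI_ADDRV  (1ULL << 58)  /* ADDR register valid */
#define MCI_PCC    (1ULL << 57)  /* processor context corrupt */

/* Hypothetical container for the decoded flags. */
struct mce_flags {
    bool ripv, eipv, mcip;
    bool val, over, uc, en, miscv, addrv, pcc;
};

/* Hypothetical helper: test each documented flag bit in turn. */
struct mce_flags mce_decode_flags(uint64_t mcg_status, uint64_t bank_status)
{
    struct mce_flags f;

    f.ripv  = (mcg_status & MCG_RIPV) != 0;
    f.eipv  = (mcg_status & MCG_EIPV) != 0;
    f.mcip  = (mcg_status & MCG_MCIP) != 0;

    f.val   = (bank_status & MCI_VAL) != 0;
    f.over  = (bank_status & MCI_OVER) != 0;
    f.uc    = (bank_status & MCI_UC) != 0;
    f.en    = (bank_status & MCI_EN) != 0;
    f.miscv = (bank_status & MCI_MISCV) != 0;
    f.addrv = (bank_status & MCI_ADDRV) != 0;
    f.pcc   = (bank_status & MCI_PCC) != 0;
    return f;
}
```

Fed 0x4 and 0xb200000000000115, this reports MCIP set with RIPV clear (no valid restart point), and VAL, UC, EN and PCC set in the bank word. PCC is the bit that corresponds to the "CPU context corrupt" panic.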
On Sat, Jul 07, 2001 at 10:41:23PM +0100, Alan Cox wrote:
> It means your processor flagged a fault. The b2....115 number
> decodes to info about the fault cause if you grab the PIII manual.
> Stupid things like overheating or wrong voltages can also trigger it.
Is there any reason why, with proper MCE checking for both K7 and PIII
we can't automatically off-line processors when they start doing bad
things?
Sure, it's a pretty lousy thing to do, but if it buys you a few
minutes and allows userland to initiate some kind of remedy
(pager("HELP"); system("shutdown"); sort of thing)...
Also, I'm pretty sure I was seeing overheating problems or something
on a K7 at one point, but never saw MCE; I take it this code only
exists fully in -ac kernels? I looked in Linus' tree and couldn't see
anything.
--cw
> Is there any reason why, with proper MCE checking for both K7 and PIII
> we can't automatically off-line processors when they start doing bad
> things?
Architectural limitations. It's entirely possible that the cache of the dying
processor contains exclusive copies of arbitrary data.
> Also, I'm pretty sure I was seeing overheating problems or something
> on a K7 at one point, but never saw MCE; I take it this code only
> exists fully in -ac kernels? I looked in Linus' tree and couldn't see
> anything.
Only -ac has K7 MCE enabled right now - also MCE is not guaranteed to catch
problems.
On Sun, 8 Jul 2001, Alan Cox wrote:
> Only -ac has K7 MCE enabled right now - also MCE is not guaranteed to catch
> problems.
Actually you merged that with Linus a few revisions back iirc.
regards,
Dave.
--
| Dave Jones. http://www.suse.de/~davej
| SuSE Labs
On Sun, Jul 08, 2001 at 05:33:59PM +0200, Dave Jones wrote:
> Actually you merged that with Linus a few revisions back iirc.
I don't see it for K7/AMD:
cw:tty5@tapu(kernel)$ pwd
/home/cw/wk/linux/linux-2.4.7-pre2+O_DIRECT/arch/i386/kernel
cw:tty5@tapu(kernel)$ grep machine_check\(struct\ pt bluesmoke.c
static void intel_machine_check(struct pt_regs * regs, long error_code)
static void pentium_machine_check(struct pt_regs * regs, long error_code)
static void winchip_machine_check(struct pt_regs * regs, long error_code)
static void unexpected_machine_check(struct pt_regs * regs, long error_code)
void do_machine_check(struct pt_regs * regs, long error_code)
--cw
On Mon, 9 Jul 2001, Chris Wedgwood wrote:
> Actually you merged that with Linus a few revisions back iirc.
> I don't see it for K7/AMD:
> cw:tty5@tapu(kernel)$ grep machine_check\(struct\ pt bluesmoke.c
> static void intel_machine_check(struct pt_regs * regs, long error_code)
There is no K7 specific implementation. It's the same as the Intel MSRs.
From the comment in the file:
case X86_VENDOR_AMD:
/*
* AMD K7 machine check is Intel like
*/
if(c->x86 == 6)
intel_mcheck_init(c);
break;
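
To make the "Intel like" point concrete, here is a small standalone sketch of that kind of check. This is not the bluesmoke.c code: the enum, the function name, and the choice to treat only family 6 as P6-style are assumptions made for illustration. The only hard facts used are the CPUID leaf 1 EDX feature bits, MCE (bit 7) and MCA (bit 14).

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_EDX_MCE (1u << 7)   /* machine check exception supported */
#define CPUID_EDX_MCA (1u << 14)  /* machine check architecture (banked MSRs) */

/* Hypothetical vendor tag for this sketch only. */
enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_OTHER };

/*
 * Hypothetical helper: decide whether a CPU can share the Intel-style
 * (P6) machine-check setup. Assumption: only family 6 parts from Intel
 * or AMD qualify, and both feature bits must be advertised.
 */
bool uses_intel_style_mca(enum cpu_vendor vendor, unsigned family,
                          uint32_t cpuid1_edx)
{
    if (vendor != VENDOR_INTEL && vendor != VENDOR_AMD)
        return false;
    if (family != 6)
        return false;
    return (cpuid1_edx & CPUID_EDX_MCE) && (cpuid1_edx & CPUID_EDX_MCA);
}
```

The point is only that a K7 needs no handler of its own: once vendor and family match, the same MSR-based path applies.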
regards,
Dave.
On Sun, Jul 08, 2001 at 07:09:11PM +0200, Dave Jones wrote:
> There is no K7 specific implementation. It's the same as the Intel
> MSRs.
Ah thanks, missed that.
--cw
Hrm,
First off, thanks for the direction Alan, Peter, and Chris.
So I've flipped through the Intel docs and read up on the MCA for P2/3
processors. I've decoded the MCi_STATUS register value that was given to
me in the Bank 1: b200000000000115 line. The 0115 MCA error code
indicates a DCACHEL1_RD error, so it seems the L1 data cache is bad, though
this does not seem to be heat-related since lm_sensors indicates similar
temperature readings for both CPUs (within 0.3 degrees Celsius of each other,
~30 °C).
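
For anyone who wants to repeat the decode, here is a minimal sketch of how the low 16 bits map to that mnemonic. It assumes the cache-hierarchy encoding from the Intel manual, 0000 0001 RRRR TTLL, and handles nothing else (TLB and bus errors use other patterns). The function name is made up for this example.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format the MCA error code in the low 16 bits of
 * a bank status word as {TT}CACHE{LL}_{RRRR}_ERR.
 * Returns 0 on success, -1 if the code is not a cache-hierarchy error,
 * uses a reserved field value, or does not fit in buf.
 */
int mca_cache_error_name(uint64_t status, char *buf, size_t len)
{
    /* RRRR: request type */
    static const char *const rrrr[] = {
        "ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"
    };
    /* TT: transaction type (instruction, data, generic) */
    static const char *const tt[] = { "I", "D", "G" };
    /* LL: cache level */
    static const char *const ll[] = { "L0", "L1", "L2", "LG" };

    unsigned code = (unsigned)(status & 0xffff);
    unsigned r = (code >> 4) & 0xf;
    unsigned t = (code >> 2) & 0x3;
    unsigned l = code & 0x3;
    int n;

    if ((code & 0xff00) != 0x0100)   /* not 0000 0001 xxxx xxxx */
        return -1;
    if (r >= sizeof rrrr / sizeof rrrr[0] || t >= sizeof tt / sizeof tt[0])
        return -1;                   /* reserved encodings */

    n = snprintf(buf, len, "%sCACHE%s_%s_ERR", tt[t], ll[l], rrrr[r]);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For 0x0115 the fields are RRRR = 0001 (RD), TT = 01 (D) and LL = 01 (L1), which is where the DCACHEL1_RD reading comes from.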
That probably explains why the system hardlocked quickly each time there was
heavy I/O and processing with SMP mode enabled with the full 1GB memory in
it. Only after removing one of the memory sticks did the system begin
spitting out OOPs and MCEs.
I also wonder, however, if this could be due to the 2nd processor not
getting enough voltage. I don't know the S-SPEC of the processor, but I
think it's the same as the 1st. However, the voltage reading for CPU 2 is
.05v lower at 1.65v. Any processor gurus here?
Thanks,
Vibol
-----Original Message-----
From: Chris Wedgwood [mailto:[email protected]]
Sent: Sunday, July 08, 2001 12:28 AM
To: Alan Cox
Cc: Vibol Hou; Linux-Kernel
Subject: Re: Machine check exception? (2.4.6+SMP+VIA)
On Sat, Jul 07, 2001 at 10:41:23PM +0100, Alan Cox wrote:
> It means your processor flagged a fault. The b2....115 number
> decodes to info about the fault cause if you grab the PIII manual.
> Stupid things like overheating or wrong voltages can also trigger it.
Is there any reason why, with proper MCE checking for both K7 and PIII
we can't automatically off-line processors when they start doing bad
things?
Sure, it's a pretty lousy thing to do, but if it buys you a few
minutes and allows userland to initiate some kind of remedy
(pager("HELP"); system("shutdown"); sort of thing)...
Also, I'm pretty sure I was seeing overheating problems or something
on a K7 at one point, but never saw MCE; I take it this code only
exists fully in -ac kernels? I looked in Linus' tree and couldn't see
anything.
--cw
Followup to: <[email protected]>
By author: "Vibol Hou" <[email protected]>
In newsgroup: linux.dev.kernel
>
> I also wonder, however, if this could be due to the 2nd processor not
> getting enough voltage. I don't know the S-SPEC of the processor, but I
> think it's the same as the 1st. However, the voltage reading for CPU 2 is
> .05v lower at 1.65v. Any processor gurus here?
>
That sounds a bit suspicious indeed, and could certainly cause that
kind of error.
-hpa
Followup to: <[email protected]>
By author: Dave Jones <[email protected]>
In newsgroup: linux.dev.kernel
>
> On Mon, 9 Jul 2001, Chris Wedgwood wrote:
>
> > Actually you merged that with Linus a few revisions back iirc.
> > I don't see it for K7/AMD:
>
> > cw:tty5@tapu(kernel)$ grep machine_check\(struct\ pt bluesmoke.c
> > static void intel_machine_check(struct pt_regs * regs, long error_code)
>
> There is no K7 specific implementation. It's the same as the Intel MSRs.
>
> From the comment in the file:
>
> case X86_VENDOR_AMD:
> /*
> * AMD K7 machine check is Intel like
> */
> if(c->x86 == 6)
> intel_mcheck_init(c);
> break;
>
>
Note that I released a patch to make bluesmoke a lot more generic
quite a while ago. Linus was in the "I don't want to even hear about
anything but critical bugfixes" mode at that point, so it didn't get
integrated.
If anyone is interested, it is at:
http://www.kernel.org/pub/linux/kernel/people/hpa/bluesmoke-2.4.0-test11-pre5-3.diff.gz
Let me know if you want me to bring it forward.
-hpa