2001-02-06 20:40:43

by Carlos Carvalho

Subject: Re: CPU error codes

Alan Cox wrote on 31 January 2001 15:23:
>> > In the intel databook. Generally an MCE indicates hardware/power/cooling
>> > issues
>>
>> Doesn't an MCE also cover some hardware memory problems - parity/ECC
>> issues etc?
>
>Parity/ECC on main memory is reported by the chipset and needs separate
>drivers or apps to handle this.

Really? I thought it could be because of RAM. Here's the story:

The kernel is 2.2.18pre24.

I'm seeing the following VERY frequently (sometimes once a day,
sometimes once a week, sometimes twice a day) on a heavily used machine:

CPU 1: Machine Check Exception: 0000000000000004
Bank 4: b200000000040151<0>Kernel panic: CPU context corrupt

CPU 0: Machine Check Exception: 0000000000000004
Bank 4: b200000000040151<0>Kernel panic: CPU context corrupt

CPU 0: Machine Check Exception: 0000000000000004
Bank 4: b200000000040151<0>Kernel panic: CPU context corrupt
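
For reference, the first value on each line looks like the global
machine-check status register (IA32_MCG_STATUS). That is an assumption
based on how 2.2 kernels print the exception, not something stated in
this thread. Under that assumption, a minimal Python sketch of the
decode (decode_mcg_status is a hypothetical helper name) would be:

```python
# Low bits of IA32_MCG_STATUS as documented in the Intel manuals.
# Assumption: the "Machine Check Exception:" value is this register.
MCG_STATUS_BITS = [
    (0, "RIPV: restart IP valid"),
    (1, "EIPV: error IP valid"),
    (2, "MCIP: machine check in progress"),
]


def decode_mcg_status(value):
    """Return the names of the flags that are set in `value`."""
    return [name for bit, name in MCG_STATUS_BITS if (value >> bit) & 1]


print(decode_mcg_status(0x0000000000000004))
```

With only MCIP set and RIPV clear, the CPU is not reporting a valid
restart address for the interrupted code.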

This is on an ASUS P2B-DS with two PIII 700MHz and 100MHz FSB, 1GB of
RAM. The MCE happens on both processors (the above is just part of
the log).

I've already replaced the motherboard and processors, and it continued.
Then I replaced the memory, and it still continues. I also replaced the
power supply just in case, to no avail...

It happens with PC100 and PC133 memory. I increased the memory latency
(the SPD says it's CL2; I set it to 3T and 10T DRAM) but the problem
persists.

Since I replaced the main board and processors, I think the most likely
cause is RAM. It seems the x86 can access RAM directly, so if there's
an NMI there, what will happen?

This is happening on a CRITICAL machine, so any help will be much
appreciated.


2001-02-07 08:53:41

by Alan

Subject: Re: CPU error codes

> Really? I thought it could be because of RAM. Here's the story:

RAM talks to the chipset so I don't think it could (unless it confused the
chipset).

> CPU 1: Machine Check Exception: 0000000000000004
> Bank 4: b200000000040151<0>Kernel panic: CPU context corrupt

Ok that decodes as:
Status valid
Uncorrected Error
Error Enabled
Processor Context Corrupt

Memory Hierarchy Error
Instruction Fetch
L1 cache
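
The decode above can be reproduced with a few bit tests. This is a
minimal Python sketch, assuming the bank value is an IA32_MCi_STATUS
register laid out as in the Intel manuals and that the low 16 bits
follow the "0000 0001 RRRR TTLL" memory hierarchy pattern. The function
name decode_bank_status and the label strings are illustrative, not
anything the kernel provides.

```python
# Flag bits in the upper part of IA32_MCi_STATUS.
STATUS_FLAGS = [
    (63, "Status valid"),
    (62, "Error overflow"),
    (61, "Uncorrected error"),
    (60, "Error enabled"),
    (59, "MISC register valid"),
    (58, "ADDR register valid"),
    (57, "Processor context corrupt"),
]

# RRRR (request type) and LL (cache level) fields of a memory
# hierarchy error code. The TT field (bits 3:2) is not decoded here.
REQUEST = {
    0: "Generic error", 1: "Generic read", 2: "Generic write",
    3: "Data read", 4: "Data write", 5: "Instruction fetch",
    6: "Prefetch", 7: "Eviction", 8: "Snoop",
}
LEVEL = {0: "L0 cache", 1: "L1 cache", 2: "L2 cache", 3: "Generic cache level"}


def decode_bank_status(status):
    """Split a 64-bit bank status into (flag names, error code detail)."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    code = status & 0xFFFF  # architectural MCA error code
    detail = []
    if (code >> 8) == 0x01:  # upper byte 0000 0001: memory hierarchy error
        detail = [
            "Memory hierarchy error",
            REQUEST.get((code >> 4) & 0xF, "Reserved request type"),
            LEVEL[code & 0x3],
        ]
    return flags, detail


flags, detail = decode_bank_status(0xB200000000040151)
print(flags)
print(detail)
```

Bits 31:16 (0x0004 here) are the model-specific error code, which this
sketch leaves undecoded.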

More than that I can't really say. Power and heat problems can certainly
trigger MCEs. I don't know if I/O devices can influence them.