2009-01-06 13:00:23

by Zdenek Kabelac

Subject: MCE error log

Hi

I've noticed some weird content in mcelog:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 57976afd
STATUS 88380100 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 53e61034
STATUS 88370100 MCGSTATUS 0

I'm running a T61 with 2GB of RAM - in this directory
/sys/devices/system/machinecheck/machinecheck1
I could only see bank0ctl ... bank5ctl - so where is bank 128 ?
(As there are no timestamps, I can hardly guess how often this happens.)

Is it a kernel bug or a chipset bug?
Should I start to worry about the stability of my machine?

Zdenek


2009-01-06 18:42:22

by Valdis Klētnieks

Subject: Re: MCE error log

On Tue, 06 Jan 2009 14:00:03 +0100, Zdenek Kabelac said:

> CPU 1 BANK 128 TSC 57976afd

> I could only see bank0ctl ... bank5ctl - so where is bank 128 ?

I've had bank 128 reported before. Turned out it was for thermal events caused
by dust bunnies clogging a cooling vent. I never did find an official
statement that's what 128 is for, but I did find a bunch of hints....
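
A minimal sketch of the bank 128 mapping described above. It assumes the constants match the Linux kernel's asm/mce.h, where MCE_EXTENDED_BANK is 128 and MCE_THERMAL_BANK is MCE_EXTENDED_BANK + 0; describe_bank is a hypothetical helper, not part of mcelog.

```python
# Assumption: these values mirror the Linux kernel's asm/mce.h definitions.
MCE_EXTENDED_BANK = 128                  # first bank number reserved for software-defined events
MCE_THERMAL_BANK = MCE_EXTENDED_BANK + 0  # thermal events are logged under this bank


def describe_bank(bank: int) -> str:
    """Classify the BANK number printed in an mcelog record (hypothetical helper)."""
    if bank == MCE_THERMAL_BANK:
        return "thermal event (software-defined bank)"
    if bank > MCE_EXTENDED_BANK:
        return "other software-defined bank"
    return f"hardware bank {bank}"


print(describe_bank(128))
print(describe_bank(5))
```

Under that assumption, bank 128 is not a hardware register bank at all, which would explain why sysfs only lists bank0ctl ... bank5ctl.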

What does lm_sensors say the CPU temp is sitting at?



2009-01-06 18:49:50

by Andi Kleen

Subject: Re: MCE error log

"Zdenek Kabelac" <[email protected]> writes:

> /sys/devices/system/machinecheck/machinecheck1
> I could only see bank0ctl ... bank5ctl - so where is bank 128 ?

Update your mcelog. Newer versions decode it.

-Andi

--
[email protected]

2009-01-07 17:44:40

by Zdenek Kabelac

Subject: Re: MCE error log

2009/1/6 Andi Kleen <[email protected]>:
> "Zdenek Kabelac" <[email protected]> writes:
>
>> /sys/devices/system/machinecheck/machinecheck1
>> I could only see bank0ctl ... bank5ctl - so where is bank 128 ?
>
> Update your mcelog. Newer versions decode it.

Ok

I've replaced the binary with your latest code.

Here is the new trace:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 THERMAL EVENT TSC 54ab7bf6
Processor core below trip temperature. Throttling disabled
STATUS 88380100 MCGSTATUS 0

So what does this message mean now?
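
One way to read the record above, as a sketch only: it assumes that for thermal events the STATUS field carries the raw IA32_THERM_STATUS MSR value, and that the bit positions follow the Intel SDM layout (bit 0 = currently above trip temperature, bit 1 = sticky log of that condition, bits 22:16 = degrees below TjMax, bit 31 = readout valid). decode_therm_status is a hypothetical helper, not mcelog's own code.

```python
def decode_therm_status(status: int) -> dict:
    """Split a thermal-event STATUS word into IA32_THERM_STATUS fields.

    Assumption: STATUS holds the raw MSR value and the bit layout
    matches the Intel SDM description of IA32_THERM_STATUS.
    """
    return {
        "above_trip": bool(status & (1 << 0)),      # bit 0: thermal status
        "trip_logged": bool(status & (1 << 1)),     # bit 1: sticky log of bit 0
        "readout_valid": bool(status & (1 << 31)),  # bit 31: digital readout is valid
        "below_tjmax": (status >> 16) & 0x7F,       # bits 22:16: degrees C below TjMax
    }


fields = decode_therm_status(0x88380100)
print(fields)
```

For 0x88380100 this gives bit 0 clear and a valid readout of 56 degrees below TjMax. That would match the decoded text: the record marks the core dropping back below the trip point and throttling being switched off, not a new overheating condition.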

Zdenek

2009-01-25 23:52:12

by Vegard Nossum

Subject: Re: MCE error log

On Tue, Jan 6, 2009 at 7:42 PM, <[email protected]> wrote:
> On Tue, 06 Jan 2009 14:00:03 +0100, Zdenek Kabelac said:
>
>> CPU 1 BANK 128 TSC 57976afd
>
>> I could only see bank0ctl ... bank5ctl - so where is bank 128 ?
>
> I've had bank 128 reported before. Turned out it was for thermal events caused
> by dust bunnies clogging a cooling vent. I never did find an official
> statement that's what 128 is for, but I did find a bunch of hints....
>
> What does lm_sensors say the CPU temp is sitting at?

I get this also:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 THERMAL EVENT TSC dc963a087
Processor core below trip temperature. Throttling disabled
STATUS 882c0100 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 THERMAL EVENT TSC dc970c0d0
Processor core below trip temperature. Throttling disabled
STATUS 882d0200 MCGSTATUS 0

and in kernel log:

Machine check events logged
CPU0: Temperature/speed normal
CPU1: Temperature/speed normal

This has been happening since I installed an x86_64 kernel instead of a 32-bit one.
Maybe this explains those weird (never fatal) APIC errors I always
used to get before (error 40, invalid vector received AFAIR)? In any
case, the APIC errors no longer appear, and the frequency of the
MCEs is about that of the APIC errors. What I can say is that they
seem to appear sooner when there are a lot of interrupts, e.g. disk
or network activity. What is the correlation?

Temperature seems completely normal whenever it happens:

# sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0: +58.0°C (high = +100.0°C, crit = +100.0°C)

coretemp-isa-0001
Adapter: ISA adapter
Core 1: +59.0°C (high = +100.0°C, crit = +100.0°C)
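
The STATUS words in the two records above can be roughly cross-checked against this sensors output. This is a sketch under two assumptions: that STATUS holds the raw IA32_THERM_STATUS value, whose bits 22:16 give the distance below TjMax in degrees C, and that TjMax is the 100°C limit coretemp reports here. core_temp_c is a hypothetical helper.

```python
TJMAX_C = 100  # assumption: taken from the "high = +100.0°C" limit shown by sensors


def core_temp_c(status: int, tjmax_c: int = TJMAX_C) -> int:
    """Estimate core temperature from a thermal-event STATUS word."""
    below_tjmax = (status >> 16) & 0x7F  # bits 22:16: digital readout
    return tjmax_c - below_tjmax


for cpu, status in ((1, 0x882C0100), (0, 0x882D0200)):
    print(f"CPU {cpu}: about {core_temp_c(status)} C when the event was logged")
```

With these assumptions the two records come out at about 56°C and 55°C, in the same range as the sensors readings, which would be consistent with the events being logged at normal temperature.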

Anyway, system works fine, so it's not much to worry about. But I am curious...


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036