2008-10-24 12:51:54

by Felix von Leitner

Subject: MCEs

I am getting frequent MCEs (machine check exceptions) on my Linux
desktop while encoding TV recordings to H.264 using mencoder. It is a
dual-core box running 2.6.27, but I have had the problem for a while now.

This is the kind of MCE that freezes the box and causes a panic. The
trace does not end up in syslog. I found a program called mcelog which
I am supposed to call regularly from cron, but how can that help me when
the first MCE I get insta-panics the box?

Now the most common causes for MCEs are apparently heat issues and bad
memory. I can rule out both. Could this be an artifact of some bad
ACPI tables?

How do you debug this kind of problem?

Thanks,

Felix


2008-10-24 16:24:38

by Tony Vroon

Subject: Re: MCEs

On Fri, 2008-10-24 at 14:45 +0200, Felix von Leitner wrote:
> Now the most common causes for MCEs are apparently heat issues and bad
> memory. I can rule out both.

Are you sure? I have had MCEs and instability for a while now, and using

    mcelog --k8 --dmi /dev/mcelog

I finally got a clear "this component is at fault" message, pinpointing
DIMM 4 on CPU 2. I shuffled the DIMMs around and then used the machine
again. The message shifted with the DIMM, to DIMM 1 on CPU 1. Memtest86+
doesn't appear to stress the hardware enough to provoke single or
multi-bit errors though. (So a few successful passes in memtest86+ do
not rule out a RAM problem.)

Temperatures can also get high at locations in the machine that have no
sensors (specifically voltage regulators). To check for heat problems
you could run your tower case lying on its side on the floor, so hot air
rises up past the PCI/PCIe cards instead of getting trapped underneath
them.

Note that LKML isn't the friendliest place to get MCE debugging, as it
will be considered a hardware fault and thus off-topic.
Consider an MCE like a 'check engine' light in your car. It doesn't tell
you what's wrong, just that it's bad and should be investigated.

Regards,
Tony V.



2008-10-24 18:04:36

by Andi Kleen

Subject: Re: MCEs

Felix von Leitner writes:

> This is the kind of MCE that freezes the box and causes a panic. The
> trace does not end up in syslog. I found a program called mcelog which
> I am supposed to call regularly from cron, but how can that help me when
> the first MCE I get insta-panics the box?

When you do a warm boot (not power cycle, but reset button or
panic=30) then the panic mce will be logged after reboot.

> Now the most common causes for MCEs are apparently heat issues and bad
> memory. I can rule out both. Could this be an artifact of some bad
> ACPI tables?
>
> How do you debug this kind of problem?

It's some sort of hardware problem, debugging it typically
either involves fixing the cooling or exchanging components.

-Andi

2008-10-24 21:45:34

by Felix von Leitner

Subject: Re: MCEs

> > This is the kind of MCE that freezes the box and causes a panic. The
> > trace does not end up in syslog. I found a program called mcelog which
> > I am supposed to call regularly from cron, but how can that help me when
> > the first MCE I get insta-panics the box?
> When you do a warm boot (not power cycle, but reset button or
> panic=30) then the panic mce will be logged after reboot.

Maybe I'm doing something wrong here.
I run mcelog from the shell after the boot.
Nothing happens, I get the shell prompt right back.

How do I know it's really an MCE? So far my symptoms are: the machine
freezes, then a panic dump scrolls by, and the only text I can see on my
screen is the stack dump, which contains something to the tune of
machine_check() or so. Nothing shows up in syslog, probably because the
kernel decides the box is too hosed to log anything.

Felix

2008-10-24 23:07:59

by Felix von Leitner

Subject: Re: MCEs

> It's some sort of hardware problem, debugging it typically
> either involves fixing the cooling or exchanging components.

Using an older kernel I get actual MCE messages that I can write down
and decode using mcelog.

It looks like the MCE handler is buggy in current kernels and will cause
a panic instead of the MCE messages (or maybe in addition to and the
panic obscures them by scrolling them out of my 80x25 screen).

Is there a howto or wiki somewhere on how to interpret the messages
that mcelog decodes?

Felix

2008-10-25 06:52:43

by Andi Kleen

Subject: Re: MCEs

On Sat, Oct 25, 2008 at 01:07:48AM +0200, Felix von Leitner wrote:
> > It's some sort of hardware problem, debugging it typically
> > either involves fixing the cooling or exchanging components.
>
> Using an older kernel I get actual MCE messages that I can write down
> and decode using mcelog.

You mean when you switch back to the old kernel it's recoverable
and then switch to the new kernel it is not?

If yes then bisect it please.

>
> It looks like the MCE handler is buggy in current kernels and will cause
> a panic instead of the MCE messages (or maybe in addition to and the

The CPU tells the kernel if a machine check should cause a panic or not.
Usually when you actually get an MC _E_xception it's panic time.
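
As an illustration of what "the CPU tells the kernel" can mean in practice, here is a minimal Python sketch that splits a raw bank status value into its architectural flag bits. The bit positions follow the IA32_MCi_STATUS layout described in the Intel SDM; the function name and the sample value are hypothetical, and the sketch does not reproduce the kernel's actual panic decision.

```python
# Flag bits of a machine check bank status register
# (bit positions as in the Intel SDM description of IA32_MCi_STATUS).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # the MISC register holds extra information
    58: "ADDRV",  # the ADDR register holds an address
    57: "PCC",    # processor context may be corrupt
}


def decode_status(status):
    """Return the set flags and the two 16-bit error code fields."""
    flags = [name for bit, name in STATUS_FLAGS.items()
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_code": status & 0xFFFF,             # bits 15:0
        "model_code": (status >> 16) & 0xFFFF,   # bits 31:16
    }


# Hypothetical sample value, not taken from this thread.
sample = 0xB200000000000800
print(decode_status(sample))
```

An error with both UC and PCC set is uncorrected and may have corrupted processor state, which is the kind of event that typically cannot be recovered from.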

> panic obscures them by scrolling them out of my 80x25 screen).

Panic should be one line, unless you're bitten by the 2.6.27
smp_call_function in panic bug (see
http://bugzilla.kernel.org/show_bug.cgi?id=11569)

Please post concrete logs from a serial or netconsole.
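
For the netconsole route, something has to receive the UDP packets on a second machine. The following is a minimal Python sketch of such a receiver, assuming the panicking box has netconsole configured to send to this host on port 6666 (netconsole's default target port). The function names and the log file name are hypothetical; a tool such as netcat would do the same job.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket on which netconsole packets can arrive."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture(sock, logfile, max_packets=None):
    """Append each received datagram to logfile, flushing every time.

    Flushing after each packet matters because the sender is about to
    die and the last lines are the interesting ones. Runs forever
    unless max_packets is given.
    """
    count = 0
    while max_packets is None or count < max_packets:
        data, _addr = sock.recvfrom(65535)
        logfile.write(data.decode("utf-8", errors="replace"))
        logfile.flush()
        count += 1
    return count


# Usage on the receiving machine (blocks until interrupted):
#     with open("netconsole.log", "a") as log:
#         capture(open_listener(), log)
```

Unlike the 80x25 text console, the resulting file keeps everything that was sent before the machine stopped, so nothing scrolls away.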

-Andi


2008-10-25 10:05:28

by Felix von Leitner

Subject: Re: MCEs

Thus spake Andi Kleen:
> > Using an older kernel I get actual MCE messages that I can write down
> > and decode using mcelog.
> You mean when you switch back to the old kernel it's recoverable
> and then switch to the new kernel it is not?

No, it's not recoverable, but I get the error messages and can write
them down. Previously I just got the stack dump.

> > panic obscures them by scrolling them out of my 80x25 screen).
> Panic should be one line, unless you're bitten by the 2.6.27
> smp_call_function in panic bug (see
> http://bugzilla.kernel.org/show_bug.cgi?id=11569)

That's exactly what happened.

Felix