LinuxLists.cc - 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

2009-06-21 17:08:26

Subject: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

Tested kernel version: 2.6.30-git16 and 2.6.30-git17
Last known good: 2.6.30

The system hangs a few minutes after resume from suspend to disk. I
bisected it and here is the result:

4efc0670baf4b14bc95502e54a83ccf639146125 is first bad commit
commit 4efc0670baf4b14bc95502e54a83ccf639146125
Author: Andi Kleen <[email protected]>
Date: Tue Apr 28 19:07:31 2009 +0200

x86, mce: use 64bit machine check code on 32bit

The 64bit machine check code is in many ways much better than
the 32bit machine check code: it is more specification compliant,
is cleaner, only has a single code base versus one per CPU,
has better infrastructure for recovery, has a cleaner way to communicate
with user space etc. etc.

Use the 64bit code for 32bit too.

This is the second attempt to do this. There was one a couple of years
ago to unify this code for 32bit and 64bit. Back then this ran into some
trouble with K7s and was reverted.

I believe this time the K7 problems (and some others) are addressed.
I went over the old handlers and was very careful to retain
all quirks.

But of course this needs a lot of testing on old systems. On newer
64bit capable systems I don't expect much problems because they have been
already tested with the 64bit kernel.

I made this a CONFIG for now that still allows to select the old
machine check code. This is mostly to make testing easier,
if someone runs into a problem we can ask them to try
with the CONFIG switched.

The new code is default y for more coverage.

Once there is confidence the 64bit code works well on older hardware
too the CONFIG_X86_OLD_MCE and the associated code can be easily
removed.

This causes a behaviour change for 32bit installations. They now
have to install the mcelog package to be able to log
corrected machine checks.

The 64bit machine check code only handles CPUs which support the
standard Intel machine check architecture described in the IA32 SDM.
The 32bit code has special support for some older CPUs which
have non standard machine check architectures, in particular
WinChip C3 and Intel P5. I made those a separate CONFIG option
and kept them for now. The WinChip variant could be probably
removed without too much pain, it doesn't really do anything
interesting. P5 is also disabled by default (like it
was before) because many motherboards have it miswired, but
according to Alan Cox a few embedded setups use that one.

Forward ported/heavily changed version of old patch, original patch
included review/fixes from Thomas Gleixner, Bert Wesarg.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
Signed-off-by: Hidetoshi Seto <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>

:040000 040000 3ed45ebe46fdbb0df7f4190400fa4640be9f4c6c e1fbb6da0ce70b944894d47c7e6fef0d30b5ff71 M arch
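
For readers who want to repeat this kind of bisection, here is a minimal sketch of how it might be started. The helper name `start_bisect` is hypothetical, and `<bad-commit>` is a placeholder for whatever commit the failing snapshot was built from; the thread does not say which revisions were actually passed to git.

```shell
#!/bin/sh
# Sketch only: start a bisection between a known-bad and a known-good revision.
# start_bisect is a hypothetical helper name, not something from the report.
#   $1: revision that shows the hang
#   $2: revision that is known to work
start_bisect() {
    bad_rev=$1
    good_rev=$2
    # "git bisect start <bad> <good>" marks both endpoints and checks out
    # a commit roughly halfway between them.
    git bisect start "$bad_rev" "$good_rev"
}

# Usage inside a kernel git tree (v2.6.30 is the release tag):
#   start_bisect <bad-commit> v2.6.30
# Then build, boot, run the suspend/resume test, and mark each step with
#   git bisect good     or     git bisect bad
# until git reports the first bad commit. Finish with: git bisect reset
```

Each step needs a full build and a suspend/resume cycle, so with a hang that only appears after a few minutes it is worth waiting well past that window before marking a commit good.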

Unfortunately, because the system hangs, there is nothing in the logs.

/proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Pentium(R) Dual CPU E2180 @ 2.00GHz
stepping : 13
cpu MHz : 1200.000
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx
lm constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl est
tm2 ssse3 cx16 xtpr pdcm lahf_lm
bogomips : 3999.98
clflush size : 64
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Pentium(R) Dual CPU E2180 @ 2.00GHz
stepping : 13
cpu MHz : 1200.000
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx
lm constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl est
tm2 ssse3 cx16 xtpr pdcm lahf_lm
bogomips : 3999.72
clflush size : 64
power management:

dmesg, config from 2.6.30-git17:
http://unixy.pl/maciek/download/kernel/2.6.30-git17/pc/

--
Maciej Rutecki
http://www.maciek.unixy.pl

2009-06-21 18:43:17

by Andi Kleen


Subject: Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

Maciej Rutecki wrote:
> Tested kernel version: 2.6.30-git16 and 2.6.30-git17
> Last known good: 2.6.30
>
> System hangs few minutes after resume from suspend to disk.

Thanks for the report.

I assume it runs stable for hours without resume from disk?
And you made sure you don't use stale data from
a different kernel for resume from disk?

It is strange that resume from disk affects machine check.
How is your resume setup?
Do you have any init scripts that change machine check state
before the resume from disk runs?

Is there any chance you could configure netconsole or similar
to get output during the hang?
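
A rough sketch of what a netconsole setup could look like follows. The helper `netconsole_param` is hypothetical, as are all IP addresses, the interface name and the MAC address; the parameter layout and the default ports 6665/6666 follow the kernel's netconsole documentation.

```shell
#!/bin/sh
# Sketch only: build the value for the netconsole= module parameter.
# Layout: [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
#   $1: IP of the machine under test     $2: its network interface
#   $3: IP of the machine receiving logs $4: MAC of that machine (or gateway)
#   $5, $6: optional source and target UDP ports
netconsole_param() {
    src_ip=$1
    dev=$2
    tgt_ip=$3
    tgt_mac=$4
    src_port=${5:-6665}
    tgt_port=${6:-6666}
    printf '%s@%s/%s,%s@%s/%s\n' \
        "$src_port" "$src_ip" "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}

# On the machine under test, as root (all addresses are made up):
#   modprobe netconsole \
#     "netconsole=$(netconsole_param 192.168.1.10 eth0 192.168.1.20 00:11:22:33:44:55)"
# On the receiving machine, listen for UDP on the target port, for example:
#   nc -u -l -p 6666        (flags differ between netcat variants)
```

Whether the network interface is back up early enough after resume to carry the messages depends on the driver, so this may or may not capture the hang.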

> I have
> tried bisection and here is result:

I assume you have CONFIG_X86_NEW_MCE enabled, correct?
Does it still happen with CONFIG_X86_OLD_MCE instead?
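
One way to answer the configuration question is to look at the .config the kernel was built from. The sketch below assumes a plain-text config file; the helper name `mce_config_state` is hypothetical.

```shell
#!/bin/sh
# Sketch only: report which machine check option a kernel config selects.
#   $1: path to the config file (for example .config in the build tree)
# Prints "new", "old" or "none".
mce_config_state() {
    config_file=$1
    if grep -q '^CONFIG_X86_NEW_MCE=y' "$config_file"; then
        echo new
    elif grep -q '^CONFIG_X86_OLD_MCE=y' "$config_file"; then
        echo old
    else
        echo none
    fi
}

# Usage:
#   mce_config_state .config
```

Unset options show up as comment lines ("# CONFIG_... is not set"), which is why the patterns are anchored at the start of the line.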

Also, "a few minutes" suggests something might be going wrong
with the poll handler. Does the problem still happen
if you use CONFIG_X86_NEW_MCE again, but before
resume do

echo 0 > /sys/devices/system/machinecheck/machinecheck0/check_interval

On the other hand you should get a crash very fast with

echo 1 > /sys/devices/system/machinecheck/machinecheck0/check_interval

If we confirm it's the poll handler I can send you a debugging patch
to narrow it down further if I can't reproduce it.
But that would need console output during the crash.
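
The two check_interval experiments above could be wrapped in a small helper like the sketch below. It assumes the attribute lives under /sys/devices/system/machinecheck/machinecheck0/, the usual sysfs location; the helper name `set_mce_check_interval` and the overridable sysfs root (handy for a dry run against a scratch directory) are hypothetical.

```shell
#!/bin/sh
# Sketch only: write a polling interval to the machine check sysfs attribute.
#   $1: interval in seconds (0 and 1 are the two values suggested above)
#   $2: sysfs root, defaults to /sys; override it to try this on a scratch dir
set_mce_check_interval() {
    interval=$1
    sysfs_root=${2:-/sys}
    attr="$sysfs_root/devices/system/machinecheck/machinecheck0/check_interval"
    if [ ! -w "$attr" ]; then
        echo "cannot write $attr (not root, or attribute missing?)" >&2
        return 1
    fi
    echo "$interval" > "$attr"
    # Read the value back so the change can be confirmed.
    cat "$attr"
}

# Usage, as root, before suspending:
#   set_mce_check_interval 0     # take the poll handler out of the picture
#   set_mce_check_interval 1     # poll every second to provoke the hang quickly
```

If the hang disappears with 0 and shows up quickly with 1, that points at the poll handler, which is the case the debugging patch mentioned above is meant for.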

Your dmesg also doesn't have anything related to resume from disk?

Thanks,

-Andi

2009-06-21 20:13:30